DMX — Delta Multiplexed Model Format. Near-lossless neural network weight compression.

DMX — Delta Multiplexed Model Format

A new compression format for neural network weights.

Original:  9.1 GB  (SVD-XT, FP32 — 80% includes FP32→FP16 conversion)
DMX:       1.8 GB
Original:  7.2 GB  (Wan 2.2 14B shard, FP32 — 79.5% includes FP32→FP16)
DMX:       1.5 GB  (142/142 tensors verified)
Original:  16 GB   (Llama 3 8B, FP16 — 55% pure FP16 compression)
DMX:       ~7.2 GB (+0.16% perplexity on wikitext-2)

Try it now

pip install dmx-compress
dmx compress your_model.safetensors compressed.dmx

Download pre-compressed models now

Model | Original | DMX | Savings | Verified
Wan 1.3B | 2.7 GB | 1.1 GB | 60% | 825/825 tensors
Wan 2.2 shard | 7.2 GB | 1.5 GB | 79.5% | 142/142 tensors
SVD-XT | 9.1 GB | 1.8 GB | 80% | Roundtrip verified

Why this matters for frontier training

From first principles: In high-precision training, a full checkpoint is effectively a near-complete copy of the model state — weights (BF16/FP32) plus optimizer states (often another 2-4x the size). Each save is massive. Teams are forced to make checkpoints sparse (every few thousand steps) to keep storage and I/O under control. This is not a bug in the tools — it is a direct consequence of the numbers involved.

The operational reality: Frontier training runs are now routinely $50M-$200M+ each. While accelerators dominate the budget, checkpoint storage, bandwidth, and recovery time are real, recurring costs. Infra teams track these "quiet" expenses closely.

Where DMX changes the equation: One high-precision baseline anchor plus many small integer deltas that are applied exactly, so error does not accumulate along the chain (see Test 8 below). 200 weights-only checkpoints of a Llama 70B-class model are projected to drop from ~28 TB raw to ~3.6 TB while remaining safe for resumption, branching, and analysis.

Aspect | Current Status Quo | With DMX Delta Chains | Benefit
Checkpoint frequency | Sparse (forced by cost) | Dense and safe | Better science and debugging
Storage for 200 ckpts (70B, weights only) | ~28 TB | ~3.6 TB (projected) | ~8x reduction (projected)
Resumption fidelity | Full copy required | Exact integer chain | Zero accumulation error (measured)
Fine-tune distribution | Full copy per variant | Small delta per variant | 80% savings (measured on TinyLlama 1.1B)

This doesn't reduce the dominant cost (compute), but it meaningfully lowers a real operational friction point that every large lab deals with. It gives researchers and engineers far more usable history than was previously practical.


What is DMX?

DMX is a near-lossless post-training compression format for neural network weights, optimized for storage and distribution. It reduces model file sizes by 55-80% while largely preserving model quality (+0.03-0.22% perplexity across the LLM benchmarks below).

  • No retraining required — compress any pretrained safetensors model
  • Reversible: decompress back to the original safetensors format (values are near-lossless, not bit-exact)
  • Broad compatibility — tested on LLMs, diffusion models, video models, encoder-decoder

Storage and transfer comparison

DMX is focused on reducing model size for storage and network transfer — not runtime inference. Here's how a 140 GB model (Llama 3 70B, FP16) compares across compression approaches:

Method | Compressed Size | Savings | Quality Loss | Purpose
safetensors | 140 GB | 0% | None | Original format
gzip | ~134 GB | ~4% | None | Generic compression (barely helps on floats)
zstd-19 | ~129 GB | ~8% | None | Better generic compression (still limited)
DFloat11 | ~98 GB | ~30% | None (lossless) | Lossless NN weight compression
ZipNN | ~94 GB | ~33% | None (lossless) | Lossless NN weight compression
DMX M=7 | ~63 GB | ~55% | +0.03% PPL | Near-original quality, high compression
DMX M=6 | ~56 GB | ~60% | +0.16% PPL | Aggressive storage compression

For reference, quantized inference formats like GGUF Q8 (~50%) and Q4 (~75%) achieve similar or greater compression but are designed for a different purpose — running models directly at reduced precision with fused kernels. DMX and GGUF serve different needs and are not interchangeable.

If lossless is enough, use DFloat11 or ZipNN. If you need to run inference at lower precision, use GGUF. If you need high compression with near-original quality for storage and distribution, that's where DMX lives.

Without DMX | With DMX
Llama 3 70B: 140 GB download | ~56-63 GB download (M=6 to M=7, per the table above)
4-5 models on 1 TB | 10+ models on 1 TB

Training & DevOps use cases

Beyond individual model compression, DMX's aligned quantization enables delta encoding between related model files — useful for training infrastructure and model distribution at scale.

Use Case | How DMX Helps
Checkpoint storage | Delta-compress consecutive checkpoints (87.3% measured savings on GPT-2, validated on TinyLlama 1.1B). Both near-lossless (int16) and practically lossless (int32, error below FP32 noise floor) modes available.
Model distribution | Distribute fine-tune variants as small deltas from a shared base model
Crash recovery | Smaller checkpoints = faster reload from storage after GPU failure
Model versioning | Aligned integer space enables meaningful diffs between model versions

Why not just use existing versioning tools?

Every existing ML versioning tool treats model files as opaque blobs:

Tool Version tracking Understands weight structure Delta between versions
Git LFS / DVC ✗ (full copy each version)
HuggingFace Hub ✗ (full copy each version)
W&B / MLflow ✗ (full copy each version)
xdelta (binary diff) 8.5% savings (noise)
DMX Planned 80-87% savings

The difference: subtracting two model files in raw float produces noise (IEEE 754 bit layout destroys numerical proximity). DMX's aligned quantization creates a coordinate system where subtraction produces clean, sparse integers — enabling meaningful diffs, efficient deltas, and 80-87% compression between related models.
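The contrast between raw-bit differences and differences on a shared integer grid can be illustrated with a small sketch. This is not DMX's implementation: the `quantize` helper, the fixed `scale`, and the sample values are assumptions chosen for illustration, and the values sit near power-of-two boundaries, where the raw-bit effect is strongest.

```python
import struct

def float_bits(x: float) -> int:
    """Return the IEEE 754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def quantize(values, scale):
    """Map floats onto one shared integer grid (hypothetical stand-in for aligned quantization)."""
    return [round(v / scale) for v in values]

# Two "related models": the second differs by a tiny update on two weights.
base = [0.25, -0.125, 0.0625, 0.5]
tuned = [0.25, -0.125 + 1e-4, 0.0625, 0.5 - 1e-4]

scale = 2.0 ** -15  # assumed grid step, shared by both models

# Raw-bit difference: crossing an exponent boundary flips many bits at once.
xor_bits = [float_bits(a) ^ float_bits(b) for a, b in zip(base, tuned)]

# Integer-grid difference: small, mostly zero, and easy to entropy-code.
int_delta = [qb - qa for qa, qb in zip(quantize(base, scale), quantize(tuned, scale))]

print([hex(x) for x in xor_bits])
print(int_delta)  # -> [0, 3, 0, -3]
```

Unchanged weights give zero in both representations, but the two updated weights produce XOR patterns more than 20 bits wide, while the integer deltas are just 3 and -3.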

These capabilities are under active development. See Research Directions for details and experimental results.

Key Properties

  • Up to 80% compression on FP32 models (SVD-XT: 9.1 GB -> 1.8 GB, verified roundtrip)
  • 55-74% compression on FP16 models (Llama 3 8B, Mistral 7B, Wan 1.3B); the benchmarks below show 55% for Llama 3 8B and 60% for Wan 1.3B
  • 55-60% near-lossless compression on FP32 models (GPT-2, Phi-2 — +0.12-0.22% PPL)
  • GPU-accelerated decompression: 13.8x faster than CPU with --gpu flag
  • Tested on: LLMs (GPT-2, Llama 3, TinyLlama), diffusion (Wan, SVD-XT), encoder-decoder (T5)
  • No training required: pure post-training compression, works on any pretrained model

How It Works

BFP Mode (for FP16/BF16 models — recommended)

Standard FP16:  16 bits per weight (5-bit exponent wasted on unused dynamic range)
DMX BFP:        ~7 bits per weight (shared exponent per group + truncated mantissa + entropy coding)

Trained weights cluster in a narrow magnitude range — 74% use only 3 of 31 possible exponents. DMX shares one exponent per group of 32 values, eliminating wasted dynamic range, then entropy-codes the mantissa stream.
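The shared-exponent idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the DMX encoder: the function names, the rounding rule, and the clamping are hypothetical, and the entropy-coding stage is omitted.

```python
import math

def bfp_encode(group, mantissa_bits=7):
    """Quantize a group of floats to one shared exponent plus signed integer mantissas."""
    max_abs = max(abs(v) for v in group)
    if max_abs == 0.0:
        return 0, [0] * len(group)
    # frexp gives max_abs = m * 2**shared_exp with 0.5 <= m < 1
    _, shared_exp = math.frexp(max_abs)
    scale = 2.0 ** (shared_exp - mantissa_bits)
    limit = (1 << mantissa_bits) - 1  # largest mantissa magnitude
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in group]
    return shared_exp, mantissas

def bfp_decode(shared_exp, mantissas, mantissa_bits=7):
    """Rebuild approximate floats from the shared exponent and integer mantissas."""
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]

group = [0.5, -0.25, 0.1, 0.003]  # a real group would hold 32 values
shared_exp, mantissas = bfp_encode(group)
restored = bfp_decode(shared_exp, mantissas)

print(shared_exp, mantissas)  # -> 0 [64, -32, 13, 0]
print(restored)
```

Values much smaller than the group maximum lose precision (0.003 rounds to 0 here), which is why this kind of scheme relies on weights within a group having similar magnitudes.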

int16 Mode (for FP32 models — near-lossless)

Standard FP32:  32 bits per weight
DMX int16:      ~13 bits per weight (aligned cross-layer quantization + entropy coding)

Integer quantization as a preprocessing step (not a lossy final format) transforms float weights into a representation where entropy coding is effective. Aligned cross-layer quantization enforces a global coordinate system across layers, enabling structured compression.
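One plausible reading of "a global coordinate system across layers" is a single quantization scale shared by every tensor, rather than one scale per layer. The sketch below illustrates that reading only; the function names, layer names, and the max-based scale are assumptions, and the actual alignment procedure is defined in the format specification.

```python
def aligned_quantize(layers, bits=16):
    """Quantize every layer with ONE shared scale so integers are comparable across layers."""
    qmax = (1 << (bits - 1)) - 1
    global_max = max(abs(v) for weights in layers.values() for v in weights)
    scale = global_max / qmax if global_max > 0 else 1.0
    quantized = {
        name: [round(v / scale) for v in weights]
        for name, weights in layers.items()
    }
    return scale, quantized

def dequantize(scale, quantized):
    """Map the shared-grid integers back to floats."""
    return {name: [q * scale for q in ints] for name, ints in quantized.items()}

# Hypothetical layer names and toy values.
layers = {
    "attn.q_proj": [0.02, -0.013, 0.0007],
    "mlp.up_proj": [0.11, -0.052, 0.004],
}
scale, quantized = aligned_quantize(layers)
restored = dequantize(scale, quantized)

print(scale)
print(quantized)
```

Because every layer uses the same step size, the reconstruction error of any weight is bounded by half a step, and integers from different layers (or different checkpoints quantized with the same scale) can be subtracted directly.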

Installation

pip install dmx-compress

Or from source:

git clone https://github.com/willjriley/dmx.git && cd dmx && pip install -e .

Requirements: Python 3.10+, PyTorch 2.0+. GPU (CUDA) is optional — automatically used when available for faster compression and decompression.

Quick Start

# Compress any safetensors model (auto-detects FP16 vs FP32)
dmx compress model.safetensors model.dmx --mode auto

# Practically lossless compression (FP32 models — error below FP32 noise floor)
dmx compress model.safetensors model.dmx --mode int32

# Decompress back to safetensors (auto-uses GPU if available)
dmx decompress model.dmx model.safetensors

# Verify roundtrip quality (with JSON report)
dmx verify model.safetensors model.dmx --report verify.json

# View compression info
dmx info model.dmx

Delta compression (checkpoint / model versioning)

# Delta-compress a checkpoint against a base (near-lossless, ~87% savings)
dmx delta-compress base.safetensors checkpoint.safetensors delta.dmxd

# Practically lossless delta (error below FP32 noise floor, ~87% savings)
dmx delta-compress base.safetensors checkpoint.safetensors delta.dmxd --precision int32

# Reconstruct checkpoint from base + delta
dmx delta-reconstruct base.safetensors delta.dmxd restored.safetensors

# View delta file info (sparsity, compression, per-component breakdown)
dmx delta-info delta.dmxd

Example: Compress and verify a model from HuggingFace

# Download a model
pip install huggingface_hub
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./wan_model

# Compress it
dmx compress ./wan_model/model.safetensors wan_compressed.dmx

# Verify the roundtrip against the original (writes a JSON report)
dmx verify ./wan_model/model.safetensors wan_compressed.dmx --report report.json

Decompression Speed

Model | Mode | CPU | GPU (--gpu) | Speedup
Wan 1.3B | BFP | 185s | 13.4s | 13.8x
SVD-XT | BFP | 281s | 22.3s | 12.5x
SVD-XT | int16 | 10.5s | -- | CPU-bound

Benchmarked on an RTX 4090 Laptop GPU with Python 3.13. The BFP CPU bottleneck is numpy bit manipulation; the GPU path uses PyTorch CUDA ops. A native C/CUDA decoder is expected to be faster still (an estimated 10-50x).

Benchmarks

BFP Mode (FP16 models)

Model | Type | Original | DMX | Savings | Quality
Llama 3 8B | LLM | 16 GB | ~7.2 GB | 55% | +0.16% PPL (wikitext-2)
Wan 2.2 shard | Video | 7.2 GB | 1.5 GB | 79.5% | 142/142 tensors pass
Wan 1.3B | Diffusion | 2.7 GB | 1.1 GB | 60% | 825/825 tensors pass
SVD-XT | Video | 9.1 GB | 1.8 GB | 80% | Verified roundtrip

Note: The SVD-XT (80%) and Wan 2.2 (79.5%) figures start from FP32 sources, so they include the FP32->FP16 conversion as well as BFP compression.

BFP Quality-per-Bit (Llama 3 8B, wikitext-2, 289K tokens)

Config | Bits/Weight | Perplexity | vs FP16
FP16 baseline | 16.0 | 5.4958 | --
BFP(M=8) | 9.25 | 5.4964 | +0.01%
BFP(M=7) | 8.25 | 5.4973 | +0.03%
BFP(M=6) | 7.25 | 5.5045 | +0.16%
GGUF Q8_0 (ref) | 8.50 | ~5.55-5.58 | ~1.0-1.5% (different purpose: inference format)
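The Bits/Weight column is consistent with a simple accounting: M mantissa bits plus one sign bit per weight, plus one exponent shared by each group of 32 values. The 8-bit exponent width and the separate sign bit are assumptions inferred from the numbers in the table, not taken from the spec, and the figures are before any entropy coding.

```python
def bfp_bits_per_weight(mantissa_bits, group_size=32, exponent_bits=8, sign_bits=1):
    """Raw storage cost per weight for block floating point, before entropy coding.

    exponent_bits=8 and sign_bits=1 are assumed values that reproduce the table.
    """
    return mantissa_bits + sign_bits + exponent_bits / group_size

for m in (8, 7, 6):
    bits = bfp_bits_per_weight(m)
    savings_vs_fp16 = 1.0 - bits / 16.0
    print(f"M={m}: {bits:.2f} bits/weight, {savings_vs_fp16:.1%} smaller than FP16")
```

Under these assumptions, M=6 costs 7.25 bits per weight, about 54.7% smaller than FP16, which is in line with the 55% reported for Llama 3 8B.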

int16 Mode (FP32 models)

Model | Type | Original | DMX | Savings | PPL Change
SVD-XT | Video | 8.9 GB | 4.0 GB | 55.5% | Lossless
GPT-2 | LLM | 475 MB | 201 MB | 57.7% | +0.22%
Phi-2 | LLM | 10.6 GB | 4.2 GB | 60.1% | +0.12%

Why DMX beats generic compression

Method | Bits/value | Notes
gzip on safetensors | ~15.5 | Raw floats look like noise
zstd level 19 | 14.06 | Dictionary matching, no prediction
DMX int16 + entropy | 11.45 | Aligned quantization enables structured entropy coding
DMX BFP + zstd | ~4.2 | Shared exponent eliminates wasted dynamic range

Pre-Compressed Models (Try It Now)

Download DMX-compressed models and decompress them yourself:

Model | Original | DMX | Savings | Verified | Link
Wan 1.3B (Diffusion) | 2.7 GB | 1.1 GB | 60% | 825/825 tensors | Download
Wan 2.2 14B Shard 6 | 7.2 GB | 1.5 GB | 79.5% | 142/142 tensors | Download
SVD-XT (Video) | 9.1 GB | 1.8 GB | 80% | Roundtrip verified | Download

Each includes a JSON verification report with SHA-256 hashes and per-tensor cosine similarity scores.

Format Specification

See spec/dmx_spec_v1.md for the complete format specification.

Paper

DMX: Delta Multiplexed Compression for Neural Network Model Weights (PDF)

Background

DMX is based on the principle that floating-point weights should be transformed into multiple statistically distinct, independently modeled entropy domains prior to compression. Trained neural network weights exhibit extreme exponent clustering — 74% of FP16 values use only 3 of 31 possible exponents, wasting 2.4 bits per value. DMX decomposes the floating-point representation into separate exponent and mantissa streams, each with distinct statistical properties that benefit from independent entropy coding. For FP32 models, aligned cross-layer quantization enforces a global coordinate system across layers, enabling additional integer-domain compression. The format auto-profiles each model to select the optimal compression strategy per component.
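Exponent clustering is easy to observe on synthetic data. The sketch below extracts the 5-bit exponent field from FP16 values and measures how concentrated the distribution is. The Gaussian weights (standard deviation 0.02) are an assumption standing in for a trained tensor, so the numbers it prints are illustrative and not a reproduction of the 74% figure.

```python
import math
import random
import struct
from collections import Counter

def fp16_exponent(x: float) -> int:
    """Return the 5-bit biased exponent field of x encoded as IEEE 754 half precision."""
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    return (bits >> 10) & 0x1F

def entropy_bits(counts: Counter) -> float:
    """Shannon entropy of a symbol histogram, in bits per symbol."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]  # synthetic stand-in for real weights

hist = Counter(fp16_exponent(w) for w in weights)
top3_share = sum(count for _, count in hist.most_common(3)) / len(weights)

print(f"distinct exponents used: {len(hist)}")
print(f"share covered by top 3 exponents: {top3_share:.1%}")
print(f"exponent entropy: {entropy_bits(hist):.2f} bits (field width is 5 bits)")
```

The gap between the 5-bit field width and the measured entropy is the redundancy that a separately coded exponent stream can remove.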

Validated Results: Checkpoint Delta Compression

All results are measured on real data using an NVIDIA A100-SXM4-80GB. Scripts are in experiments/checkpoint_delta/.

Compression across architectures

Model | Architecture | Params | Consecutive Delta Zeros | Entropy (bits) | Measured Savings
GPT-2 | Decoder-only | 163M | 33-67% | 1.76-3.02 | 87.3% (measured, 498→63 MB)
T5-small | Encoder-decoder | 110M | 89-94% | 0.49-0.85 | Not yet measured in bytes
TinyLlama | Decoder-only | 1.1B | -- | -- | 80% (measured, fine-tune base→chat)

Delta compression works across model architectures. T5 encoder-decoder shows higher sparsity than decoder-only models. Real-byte compression for T5 is pending.

Precision tiers

Both tiers achieve comparable compression — the aligned quantization produces similar entropy regardless of bit width:

Tier | Consecutive Entropy | Compression | Error | Use Case
int16 aligned | 0.6-1.3 bits | 87% | +0.06% RelL2 | Maximum compression
int32 aligned | 1.0-1.2 bits | ~87% | 1.87e-7 RelL2 | Practically lossless (error below FP32 noise floor)
Raw bit XOR (no alignment) | 14-16 bits | 8.5% | Bit-exact | Baseline; alignment is essential

Full checkpoint including optimizer states

Training checkpoints include model weights plus Adam optimizer states (momentum and variance, each the same size as the weights), so a full checkpoint is typically 3x the weight size. Validated on GPT-2 124M, 1000 training steps:

Component | % of Checkpoint | Delta Sparsity | Entropy | Compression
Weights | 33% | 55-66% zeros | 1.8-2.6 bits | ~84%
Momentum (exp_avg) | 33% | 28-30% zeros | 7.5-9.0 bits | ~53%
Variance (exp_avg_sq) | 33% | 91-92% zeros | 0.6 bits | ~96%
Full checkpoint | 100% | -- | -- | ~79%

Safety for training resumption

Training from a DMX-reconstructed checkpoint produces 0.042% loss difference compared to the original — negligible for any practical purpose:

Step |   Original |  DMX Recon |       Diff
    1 |   0.783582 |   0.784023 | 0.00044072
   51 |   1.098088 |   1.098552 | 0.00046420
   91 |   0.537082 |   0.537364 | 0.00028241

Final avg loss (last 20 steps): 0.042% difference

Zero error accumulation in delta chains (Test 8)

Chained reconstruction (base → delta1 → delta2 → ... → deltaN) produces identical results to direct reconstruction (base + deltaN) — verified to 10 decimal places across both int16 and int32 modes. This is not an approximation: delta application is exact integer arithmetic, so error is mathematically constant regardless of chain length. Re-anchoring is needed only for delta size control, never for error control.
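The reason chains cannot drift is that integer addition is exact. The sketch below demonstrates this on synthetic checkpoints. It assumes every checkpoint is quantized once onto the same fixed grid (the `scale` value and the random-walk "training" are hypothetical), which is a simplification of the real pipeline.

```python
import random

rng = random.Random(0)
scale = 2.0 ** -15  # assumed shared grid step

# Simulate 6 checkpoints: weights drift slightly between saves.
weights = [rng.gauss(0.0, 0.02) for _ in range(1000)]
checkpoints = []
for _ in range(6):
    checkpoints.append([round(w / scale) for w in weights])  # quantize onto the grid
    weights = [w + rng.gauss(0.0, 1e-5) for w in weights]     # small training update

base = checkpoints[0]

# Consecutive deltas: checkpoint[i] - checkpoint[i-1], all plain integers.
consecutive = [
    [cur - prev for prev, cur in zip(a, b)]
    for a, b in zip(checkpoints, checkpoints[1:])
]

# Chained reconstruction: base -> delta1 -> delta2 -> ... -> deltaN.
chained = list(base)
for delta in consecutive:
    chained = [c + d for c, d in zip(chained, delta)]

# Direct reconstruction: base + (last checkpoint - base).
direct_delta = [last - first for first, last in zip(base, checkpoints[-1])]
direct = [b + d for b, d in zip(base, direct_delta)]

print(chained == direct == checkpoints[-1])  # -> True
```

The only error in either path is the one-time quantization of each checkpoint, so the chain length affects delta size and reconstruction time, not accuracy.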

Fine-tune variant compression

TinyLlama 1.1B base → chat fine-tune: 80% savings (876 MB delta vs 4.4 GB full copy). Store the base model once, distribute each fine-tune variant as a small delta.

Projected savings at frontier scale

These are projections extrapolated from observed sparsity and scaling behavior across GPT-2 (163M), T5 (110M), and TinyLlama (1.1B). The underlying property (small per-step weight updates due to SGD dynamics) is expected to hold at larger scale, but real-byte validation at 70B+ scale is still in progress.

Scenario | Raw Storage | Projected DMX | Projected Savings
200 checkpoints of Llama 70B (weights only) | 28 TB | ~3.6 TB | ~87%
200 checkpoints of Llama 70B (full w/ optimizer) | 84 TB | ~18 TB | ~79%
20 fine-tune variants of Llama 70B | 2.8 TB | ~700 GB | ~75%
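The table's figures can be reproduced with straightforward arithmetic from the measured savings rates. In the sketch below, the 140 GB model size comes from the storage comparison earlier in this document. The fine-tune row assumes the base model is stored once in full and each variant is stored as a delta at the 80% savings measured on TinyLlama, which is an inference that happens to match the ~700 GB figure rather than a documented formula.

```python
MODEL_GB = 140        # Llama 70B-class weights in FP16
N_CHECKPOINTS = 200
N_VARIANTS = 20

def after_savings(raw_gb: float, savings: float) -> float:
    """Storage remaining after applying a fractional savings rate."""
    return raw_gb * (1.0 - savings)

# Weights-only checkpoints at the 87% rate measured on GPT-2.
weights_raw = N_CHECKPOINTS * MODEL_GB
weights_dmx = after_savings(weights_raw, 0.87)

# Full checkpoints (weights + two Adam states = 3x) at the ~79% full-checkpoint rate.
full_raw = 3 * weights_raw
full_dmx = after_savings(full_raw, 0.79)

# Fine-tune variants: one full base plus 20 deltas at 80% savings (assumed layout).
variants_raw = N_VARIANTS * MODEL_GB
variants_dmx = MODEL_GB + N_VARIANTS * after_savings(MODEL_GB, 0.80)

print(f"weights only: {weights_raw / 1000:.1f} TB -> {weights_dmx / 1000:.2f} TB")
print(f"full ckpt:    {full_raw / 1000:.1f} TB -> {full_dmx / 1000:.2f} TB")
print(f"variants:     {variants_raw / 1000:.1f} TB -> {variants_dmx:.0f} GB "
      f"({1 - variants_dmx / variants_raw:.0%} savings)")
```

These remain projections: they inherit every caveat of the small-model measurements they are scaled from.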

Caveats: Validation on 8B+ models with frontier training schedules is in progress. Momentum compression (53%) was measured on wikitext-2; diverse training data may yield 40-45%, reducing full-checkpoint savings to ~70-73%. The 87% weight compression is measured on GPT-2; larger models may differ.


Research Directions

DMX's underlying compression technique applies to structured floating-point data beyond individual model files. These are active research areas, not yet proven at scale. We welcome collaboration.

1. Training checkpoint compression (highest priority). Frontier training produces hundreds of near-identical high-precision checkpoints. Aligned cross-layer quantization enables efficient delta encoding between them. Early results are in the Validated Results section above. Key finding: alignment is critical — without it, deltas show almost no sparsity and compress poorly.

2. Model family distribution. Storing fine-tuned variants (chat, code, reasoning, etc.) as small deltas from a shared base model. Early result: TinyLlama base → chat = 80% savings (876 MB vs 4.4 GB).

3. Scientific and sensor data. Early tests on NOAA weather data show similar exponent clustering, suggesting potential applications in climate, seismic, and satellite data.

License & Patent

Code: MIT License — free to use, modify, and distribute.

Methods: Patent Pending (U.S. Provisional Applications filed April 2026). The patented methods cover aligned cross-layer quantization for neural network weight compression and stream-separated block floating point encoding with independent entropy coding. Personal, academic, and open-source use is unrestricted. Commercial use of the patented methods may require a license from the inventor — contact bill.riley@gmail.com.

Citation

@software{riley2026dmx,
  author = {Riley, William J},
  title = {DMX: Delta Multiplexed Model Format},
  year = {2026},
  url = {https://github.com/willjriley/dmx}
}
