Structure-aware neural network weight compression. 87% checkpoint delta encoding, not generic file compression.
Project description
DMX — Delta Multiplexed Model Format
Structure-aware compression for neural networks.
DMX transforms weight tensors into aligned integer representations, enabling efficient storage and distribution of model variants. The headline capability is 55-87% compression of safetensors / checkpoint files / checkpoint deltas with practically lossless reconstruction.
Safe for production training. Resuming from DMX-reconstructed checkpoints produces 0.15% loss difference after 50 chain resumes over 10,000 training steps — verified on GPU with reproducible scripts. Delta chains use exact integer arithmetic with zero error accumulation regardless of chain length.
Original: 9.1 GB (SVD-XT, FP32 — 80% includes FP32→FP16 conversion)
DMX: 1.8 GB
Original: 7.2 GB (Wan 2.2 14B shard, FP32 — 79.5% includes FP32→FP16)
DMX: 1.5 GB (142/142 tensors verified)
Original: 16 GB (Llama 3 8B, FP16 — 55% pure FP16 compression)
DMX: ~7.2 GB (+0.16% perplexity on wikitext-2)
Try it now
pip install dmx-compress
dmx compress your_model.safetensors compressed.dmx
Download pre-compressed models
| Model | Original | DMX | Savings | Verified |
|---|---|---|---|---|
| Wan 1.3B | 2.7 GB | 1.1 GB | 60% | 825/825 tensors |
| Wan 2.2 shard | 7.2 GB | 1.5 GB | 79.5% | 142/142 tensors |
| SVD-XT | 9.1 GB | 1.8 GB | 80% | Roundtrip verified |
What is DMX?
DMX is a structure-aware compression system for neural networks. It reduces model file sizes by 55-80% while preserving quality (+0.03-0.16% perplexity), with reversible decompression back to the original format.
DMX also supports delta-based model storage with deterministic reconstruction and ROI-driven adaptive rebasing — enabling efficient versioning across model families.
- No retraining required — compress any pretrained safetensors model
- Reversible — decompress back to the original format
- Broad compatibility — tested on LLMs, diffusion models, video models, encoder-decoder
DMX reduces the cost of storing, moving, and resuming large models — without breaking training.
Compression & Delta Storage
| Capability | Evidence |
|---|---|
| Single-file compression (55-80%) | 6+ models, Llama 3 8B through SVD-XT |
| Checkpoint delta chains (87%) | GPT-2, T5, TinyLlama, Qwen 3B |
| Full checkpoint w/ optimizer (79%) | GPT-2 1000-step, weights + momentum + variance |
| Zero chain accumulation error | Exact integer arithmetic, 10K steps / 50 resumes |
| Fine-tune variant distribution (80%) | Qwen 2.5 3B delta on HuggingFace |
How It Works
BFP Mode (for FP16/BF16 models — recommended)
Standard FP16: 16 bits per weight (5-bit exponent wasted on unused dynamic range)
DMX BFP: ~7 bits per weight (shared exponent per group + truncated mantissa + entropy coding)
Trained weights cluster in a narrow magnitude range — 74% use only 3 of 31 possible exponents. DMX shares one exponent per group of 32 values, eliminating wasted dynamic range, then entropy-codes the mantissa stream.
int16 Mode (for FP32 models — near-lossless)
Standard FP32: 32 bits per weight
DMX int16: ~13 bits per weight (aligned cross-layer quantization + entropy coding)
Integer quantization as a preprocessing step (not a lossy final format) transforms float weights into a representation where entropy coding is effective. Aligned cross-layer quantization enforces a global coordinate system across layers, enabling structured compression.
Adaptive per-tensor compression
DMX automatically picks the best compression for each tensor in your model — you don't choose a compressor, DMX does, per tensor, every time. Each tensor gets the strongest compression the candidate set can deliver, capturing the maximum benefit available without any manual tuning.
Actual savings depend on model architecture, source precision (FP16 / BF16 / FP32), and the quantization mode you select. Across the model families we have measured, savings typically fall in the 50–80% range vs the original safetensors file, with no manual tuning required.
Why DMX beats generic compression
| Method | Bits/value | Notes |
|---|---|---|
| gzip on safetensors | ~15.5 | Raw floats look like noise |
| zstd level 19 | 14.06 | Dictionary matching, no prediction |
| DMX int16 + entropy | 11.45 | Aligned quantization enables structured entropy coding |
| DMX BFP + zstd | ~4.2 | Shared exponent eliminates wasted dynamic range |
Installation
pip install dmx-compress
Or from source:
git clone https://github.com/willjriley/dmx.git && cd dmx && pip install -e .
Requirements: Python 3.10+, PyTorch 2.0+. GPU (CUDA) is optional — automatically used when available for faster compression and decompression.
Quick Start
# Compress any safetensors model (auto-detects FP16 vs FP32)
dmx compress model.safetensors model.dmx --mode auto
# Practically lossless compression (FP32 models — error below FP32 noise floor)
dmx compress model.safetensors model.dmx --mode int32
# Compress with explicit parallel encoding (defaults to min(8, cpu_count) on CPU,
# 1 on GPU). zstd releases the GIL so threads give real parallelism.
dmx compress model.safetensors model.dmx --parallel-workers 8
# Decompress back to safetensors (auto-uses GPU if available)
dmx decompress model.dmx model.safetensors
# Verify roundtrip quality (with JSON report)
dmx verify model.safetensors model.dmx --report verify.json
# View compression info
dmx info model.dmx
Delta compression (checkpoint / model versioning)
# Delta-compress a checkpoint against a base (near-lossless, ~87% savings)
dmx delta-compress base.safetensors checkpoint.safetensors delta.dmxd
# Practically lossless delta (error below FP32 noise floor, ~87% savings)
dmx delta-compress base.safetensors checkpoint.safetensors delta.dmxd --precision int32
# Reconstruct checkpoint from base + delta
dmx delta-reconstruct base.safetensors delta.dmxd restored.safetensors
# View delta file info (sparsity, compression, per-component breakdown)
dmx delta-info delta.dmxd
Chain compression (training-run checkpoints, every-N-step cadences)
DMX chain compression takes a sequence of related checkpoints (training run, fine-tune steps, branch variants) and stores them as one or more anchors plus deltas, with an automatic anchor-promotion policy that keeps the chain mathematically guaranteed to be no larger than storing each checkpoint with dmx compress independently.
# Chain-compress a sequence of checkpoints into one output directory
dmx chain-compress step_1000.safetensors step_2000.safetensors step_3000.safetensors \
--output-dir ./compressed_chain
# Reconstruct every checkpoint in the chain back to safetensors
dmx chain-reconstruct ./compressed_chain --output-dir ./restored
# Reconstruct only specific entries by index
dmx chain-reconstruct ./compressed_chain --output-dir ./restored --indices 0 2
The auto-anchor policy promotes a checkpoint to a fresh anchor whenever its delta would be larger than re-encoding the checkpoint from scratch, so the chain is self-calibrating across source dtypes and cadences. No manual tuning required.
Example: Compress and verify a model from HuggingFace
# Download a model
pip install huggingface_hub
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./wan_model
# Compress it
dmx compress ./wan_model/model.safetensors wan_compressed.dmx
# Decompress and verify
dmx verify ./wan_model/model.safetensors wan_compressed.dmx --report report.json
Benchmarks
Storage and transfer comparison
How a 140 GB model (Llama 3 70B, FP16) compares across compression approaches:
| Method | Compressed Size | Savings | Quality Loss | Purpose |
|---|---|---|---|---|
| safetensors | 140 GB | 0% | None | Original format |
| gzip | ~134 GB | ~4% | None | Generic compression (barely helps on floats) |
| zstd-19 | ~129 GB | ~8% | None | Better generic compression (still limited) |
| DFloat11 | ~98 GB | ~30% | None (lossless) | Lossless NN weight compression |
| ZipNN | ~94 GB | ~33% | None (lossless) | Lossless NN weight compression |
| DMX M=7 | ~63 GB | ~55% | +0.03% PPL | Near-original quality, high compression |
| DMX M=6 | ~56 GB | ~60% | +0.16% PPL | Aggressive storage compression |
For reference, quantized inference formats like GGUF Q8 (~50%) and Q4 (~75%) achieve similar or greater compression but are designed for a different purpose — running models directly at reduced precision with fused kernels. DMX and GGUF serve different needs and are not interchangeable.
If lossless is enough, use DFloat11 or ZipNN. If you need to run inference at lower precision, use GGUF. If you need high compression with near-original quality for storage and distribution, that's where DMX lives.
| Without DMX | With DMX |
|---|---|
| Llama 3 70B: 140 GB download | ~36 GB download |
| 4-5 models on 1 TB | 10+ models on 1 TB |
BFP Mode (FP16 models)
| Model | Type | Original | DMX | Savings | Quality |
|---|---|---|---|---|---|
| Llama 3 8B | LLM | 16 GB | ~7.2 GB | 55% | +0.16% PPL (wikitext-2) |
| Wan 2.2 shard | Video | 7.2 GB | 1.5 GB | 79.5% | 142/142 tensors pass |
| Wan 1.3B | Diffusion | 2.7 GB | 1.1 GB | 60% | 825/825 tensors pass |
| SVD-XT | Video | 9.1 GB | 1.8 GB | 80% | Verified roundtrip |
Note: SVD-XT 80% includes FP32->FP16 conversion. Wan 2.2 79.5% is on FP32 source with BFP.
BFP Quality-per-Bit (Llama 3 8B, wikitext-2, 289K tokens)
| Config | Bits/Weight | Perplexity | vs FP16 |
|---|---|---|---|
| FP16 baseline | 16.0 | 5.4958 | -- |
| BFP(M=8) | 9.25 | 5.4964 | +0.01% |
| BFP(M=7) | 8.25 | 5.4973 | +0.03% |
| BFP(M=6) | 7.25 | 5.5045 | +0.16% |
| GGUF Q8_0 (ref) | 8.50 | ~5.55-5.58 | ~1.0-1.5% (different purpose — inference format) |
int16 Mode (FP32 models)
| Model | Type | Original | DMX | Savings | PPL Change |
|---|---|---|---|---|---|
| SVD-XT | Video | 8.9 GB | 4.0 GB | 55.5% | Lossless |
| GPT-2 | LLM | 475 MB | 201 MB | 57.7% | +0.22% |
| Phi-2 | LLM | 10.6 GB | 4.2 GB | 60.1% | +0.12% |
Decompression Speed
| Model | Mode | CPU | GPU (--gpu) | Speedup |
|---|---|---|---|---|
| Wan 1.3B | BFP | 185s | 13.4s | 13.8x |
| SVD-XT | BFP | 281s | 22.3s | 12.5x |
| SVD-XT | int16 | 10.5s | -- | CPU-bound |
Benchmarked on RTX 4090 Laptop, Python 3.13. GPU path uses PyTorch CUDA ops.
Native CUDA kernels are available in kernel/dmx_kernels_v2.cu — 12 kernels covering the full compression and decompression pipeline (quantize, delta compute, BFP compress/decompress, dequantize, delta apply). Compiled and tested on A100. int32 roundtrip error: 9.3e-10.
Why DMX Matters for Training
Frontier training runs are $50M-$200M+ each. Checkpoint storage, bandwidth, and crash recovery are recurring operational costs that compound across every experiment and team. DMX addresses these directly.
What DMX enables
- Safe resumption from compressed checkpoints — 0.15% loss difference after 50 chain resumes over 10K training steps
- 87% checkpoint storage reduction — 200 checkpoints of Llama 70B: ~28 TB raw → ~3 TB (projected)
- Full checkpoint compression including optimizer — weights + momentum + variance: 79% savings (measured)
- Dense checkpoint history — save 5-10x more often without the storage penalty
- Fine-tune distribution — store base model once, each variant as a small delta (80% savings)
- Weight-shift analytics — per-layer diffs show exactly what changed between any two checkpoints
No other tool does all of this
| Tool | Delta between versions | Chain safety demonstrated |
|---|---|---|
| DMX | 87% savings (structure-aware) | 0.15% loss diff after 50 resumes / 10K steps |
| ZipNN | XOR-based delta (~44% savings) | Not published |
| DFloat11 | ✗ (per-file only, ~30%) | N/A |
| Git LFS / DVC | ✗ (full copy each version) | N/A |
| HuggingFace Hub | ✗ (full copy each version) | N/A |
| W&B / MLflow | ✗ (full copy each version) | N/A |
| xdelta (binary diff) | ~8.5% savings | Not published |
Byte-level delta tools (xdelta, ZipNN XOR) operate on raw float bits, where IEEE 754 layout destroys numerical proximity. DMX produces dramatically sparser deltas (87% vs 44%) by encoding in a structure-aware representation where similar values map to similar integers.
The operational impact
| Aspect | Current Status Quo | With DMX | Benefit |
|---|---|---|---|
| Checkpoint frequency | Sparse (forced by cost) | Dense and safe | Better science and debugging |
| Storage for 200 ckpts (70B) | ~28 TB | ~3 TB (projected) | ~9x reduction |
| Crash recovery | Reload full checkpoint | Reload small delta | Minutes instead of hours |
| Fine-tune distribution | Full copy per variant | Small delta per variant | 80% savings (measured) |
| Experimentation | Branching is expensive | Branch via small delta | 5-10x more experimental forks |
DMX transforms checkpoint management into an operational advantage. It enables safe, multi-step training resumptions, preserves per-layer diffs, and drastically reduces the storage cost of model snapshots. Engineers and researchers gain usable model history that was previously impractical, minimizing wasted GPU time, improving training continuity, and lowering cloud storage costs.
Efficient model distribution with deltas
DMX enables a new distribution model: send the base model once, then distribute only small aligned deltas for every variant.
Llama 70B base: 140 GB (stored/downloaded once)
→ chat fine-tune: ~28 GB (delta only)
→ code fine-tune: ~28 GB (delta only)
→ medical fine-tune: ~28 GB (delta only)
Traditional: 4 × 140 GB = 560 GB
With DMX: 140 + 3 × 28 = 224 GB (60% savings)
This applies to model hubs (HuggingFace, CivitAI), enterprise model management, and any workflow where multiple variants share a common base. Reconstruction from deltas is verified safe across 10K-step training chains (0.15% loss difference after 50 resumes).
Where this matters today (estimated scale):
| Platform | Hosted Models | Est. Fine-Tunes | Redundant Storage | Potential Savings with Deltas |
|---|---|---|---|---|
| HuggingFace | 800K+ | ~500K (est. 60%) | Petabytes of duplicated base weights | ~60-80% bandwidth reduction |
| CivitAI | 100K+ | Tens of thousands of SD variants | Each a full 2-4 GB copy of SD base | ~80% per variant |
| Enterprise (per company) | 10-100 variants | Per-customer or per-use-case fine-tunes | Full copy per deployment | ~80% storage per variant |
Estimates based on public model counts and observed fine-tune ratios. Actual savings depend on how much each fine-tune diverges from its base.
Validated: Qwen 2.5 3B model family
Measured on real HuggingFace models — reconstructable delta available:
Qwen/Qwen2.5-3B (base) 13.6 GB — stored once
→ Qwen2.5-3B-Instruct 2.88 GB delta (78.8% savings)
→ Qwen2.5-Coder-3B 5.60 GB delta (58.8% savings — fork, heavier retrain)
| Variant | int16 Zeros | int16 Savings | int32 Savings | RelL2 from Base |
|---|---|---|---|---|
| Instruct (SFT+RLHF) | 29.2% | 90.7% | 67.7% | 0.014 |
| Coder (domain retrain) | 0.2% | 58.8% | 14.9% | 0.828 |
The Coder variant has diverged significantly from the base (RelL2 = 0.83). When a variant drifts this far, DMX supports auto-forking — promoting it to a new base and restarting the delta chain. Coder → Coder-Instruct would delta efficiently from the Coder anchor.
Reconstruction quality (verified roundtrip):
| Method | Precision Loss | Industry Acceptance |
|---|---|---|
| FP32 → FP16 conversion | Measurable (~1e-3) | Standard practice everywhere |
| GGUF Q8 quantization | ~1% PPL increase | Widely deployed in production |
| DMX int16 delta | +0.06% RelL2 | Less loss than FP32→FP16 |
| DMX int32 delta | 1.87e-7 RelL2 | Below FP32 arithmetic noise |
Try the distribution workflow yourself:
pip install dmx-compress
# Download base + delta from HuggingFace (base: 13.6 GB, delta: 2.9 GB)
huggingface-cli download Senat1/dmx-qwen2.5-3b-instruct-delta --local-dir ./qwen-delta
# Reconstruct the full Instruct model from base + delta
dmx delta-reconstruct ./qwen-delta/qwen2.5-3b-base.safetensors ./qwen-delta/instruct.dmxd qwen2.5-3b-instruct.safetensors
If you already have the base model locally, you only need the 2.9 GB delta — not the full 13.6 GB Instruct model.
DMX enables multi-million-dollar savings in storage and bandwidth for hubs and enterprises that maintain many fine-tuned model variants, because only small deltas need to be stored and transmitted instead of full checkpoints.
Validated Results: Checkpoint Delta Compression
All results are measured on real data using an NVIDIA A100-SXM4-80GB. Full result data is in benchmarks/.
Compression across architectures
| Model | Architecture | Params | Consecutive Delta Zeros | Entropy (bits) | Measured Savings |
|---|---|---|---|---|---|
| GPT-2 | Decoder-only | 163M | 33-67% | 1.76-3.02 | 87.3% (measured, 498→63 MB) |
| T5-small | Encoder-decoder | 110M | 89-94% | 0.49-0.85 | Not yet measured in bytes |
| TinyLlama | Decoder-only | 1.1B | 16-63% | 1.69-3.73 | 80% (measured, fine-tune base→chat) |
Delta compression works across model architectures and scales. T5 encoder-decoder shows highest sparsity. TinyLlama 1.1B confirms the pattern holds at scale — sparsity increases as training progresses (16% → 63% zeros). int32 aligned entropy matches int16 at all scales tested (1.71 vs 1.69 bits at 1.1B).
Precision tiers
Both tiers achieve comparable compression — the aligned quantization produces similar entropy regardless of bit width:
| Tier | Consecutive Entropy | Compression | Error | Use Case |
|---|---|---|---|---|
| int16 aligned | 0.6-1.3 bits | 87% | +0.06% RelL2 | Maximum compression |
| int32 aligned | 1.0-1.2 bits | ~87% | 1.87e-7 RelL2 | Practically lossless (error below FP32 noise floor) |
| Raw bit XOR (no alignment) | 14-16 bits | 8.5% | Bit-exact | Baseline — alignment is essential |
Full checkpoint including optimizer states
Training checkpoints include model weights + Adam optimizer states (momentum + variance), typically 3x the weight size. Validated on GPT-2 124M, 1000 training steps:
| Component | % of Checkpoint | Delta Sparsity | Entropy | Compression |
|---|---|---|---|---|
| Weights | 33% | 55-66% zeros | 1.8-2.6 bits | ~84% |
| Momentum (exp_avg) | 33% | 28-30% zeros | 7.5-9.0 bits | ~53% |
| Variance (exp_avg_sq) | 33% | 91-92% zeros | 0.6 bits | ~96% |
| Full checkpoint | 100% | — | — | ~79% |
Safety for training resumption
Training from DMX-reconstructed checkpoints is safe for production use:
| Test | Steps | DMX Resumes | Final Loss Diff | Result |
|---|---|---|---|---|
| Single resume (100 steps) | 100 | 1 | 0.042% | Negligible |
| Long-run chain (10K steps) | 10,000 | 50 | 0.15% | Production-safe |
The 10K-step test reconstructed from a DMX delta chain every 200 steps — 50 total resumes over 10,000 training steps. Final loss tracks the clean baseline within 0.15%, with no divergence trend over time.
Zero error accumulation in delta chains (Test 8)
Chained reconstruction (base → delta1 → delta2 → ... → deltaN) produces identical results to direct reconstruction (base + deltaN) — verified to 10 decimal places across both int16 and int32 modes. This is not an approximation: delta application is exact integer arithmetic, so error is mathematically constant regardless of chain length. Re-anchoring is needed only for delta size control, never for error control.
Fine-tune variant compression
TinyLlama 1.1B base → chat fine-tune: 80% savings (876 MB delta vs 4.4 GB full copy). Store the base model once, distribute each fine-tune variant as a small delta.
Projected Savings at Scale
These projections are extrapolated from observed sparsity and scaling behavior on GPT-2 (163M), T5 (110M), and TinyLlama (1.1B). The core property — very small per-step weight updates under SGD — appears scale-invariant, but we are actively validating on 8B+ models with frontier-scale schedules.
| Scenario | Raw Storage | Projected DMX | Projected Savings |
|---|---|---|---|
| 200 checkpoints of Llama 70B (weights only) | 28 TB | ~3.6 TB | ~87% |
| 200 checkpoints of Llama 70B (full + optimizer) | 84 TB | ~18 TB | ~79% |
| 20 fine-tune variants of Llama 70B | 2.8 TB | ~700 GB | ~75% |
Key Caveats
- Validation on 8B+ models with real frontier training schedules is in progress.
- Optimizer state compression (currently ~53%) may drop to 40–45% on highly diverse data, reducing full-checkpoint savings to ~70–73%.
- All projections assume continued zero error accumulation (exact integer arithmetic), as demonstrated in long-chain tests.
These numbers suggest DMX could reduce checkpoint storage and I/O pressure by nearly an order of magnitude while keeping training resumption safe.
Research Directions
- Multi-framework integration — DeepSpeed, FSDP, and Megatron-LM callbacks for production training pipelines
- Checkpoint-efficient continual learning — delta chains for long-running training with minimal storage overhead
We welcome collaboration — reach out via GitHub Issues or Discussions.
Format Specification
See spec/dmx_spec_v1.md for the complete format specification.
Paper
DMX: Delta Multiplexed Compression for Neural Network Model Weights (PDF) — click to download
Background
DMX is based on the principle that floating-point weights should be transformed into multiple statistically distinct, independently modeled entropy domains prior to compression. Trained neural network weights exhibit extreme exponent clustering — 74% of FP16 values use only 3 of 31 possible exponents, wasting 2.4 bits per value. DMX decomposes the floating-point representation into separate exponent and mantissa streams, each with distinct statistical properties that benefit from independent entropy coding. For FP32 models, aligned cross-layer quantization enforces a global coordinate system across layers, enabling additional integer-domain compression. The format auto-profiles each model to select the optimal compression strategy per component.
License & Patent
Code: MIT License — free to use, modify, and distribute.
Methods: Patent Pending (U.S. Provisional Applications filed April 2026). The patented methods cover aligned cross-layer quantization for neural network weight compression and stream-separated block floating point encoding with independent entropy coding. Personal, academic, and open-source use is unrestricted. Commercial use of the patented methods may require a license from the inventor — contact bill.riley@gmail.com.
Citation
@software{riley2026dmx,
author = {Riley, William J},
title = {DMX: Delta Multiplexed Model Format},
year = {2026},
url = {https://github.com/willjriley/dmx}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file dmx_compress-0.4.0.tar.gz.
File metadata
- Download URL: dmx_compress-0.4.0.tar.gz
- Upload date:
- Size: 55.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a1bf9644915cec6d907b7e2446fb23a7d01e58074ea55c2e0199a2e8118139c5
|
|
| MD5 |
288fc65d9362027a5cfe91f3bedcb4fa
|
|
| BLAKE2b-256 |
c4a117f6d7d6f3d22bf030bdceb0626c54c7d4485000ef6bd3de9446cd1a1f43
|
File details
Details for the file dmx_compress-0.4.0-py3-none-any.whl.
File metadata
- Download URL: dmx_compress-0.4.0-py3-none-any.whl
- Upload date:
- Size: 45.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6e6ec28cd24713fc34c09cbf140ccc3f608f466526d8ae09978c725af54b9c15
|
|
| MD5 |
f7911084eb055a992a3ac29d8a266ab0
|
|
| BLAKE2b-256 |
03e7aac28f7a1cac370f69e6653dab800f1794e2395d053a1f31e528fb1a5a2a
|