Lossless compression codec tuned for neural-network model weights
z4ai
A lossless compression codec for AI model checkpoints.
Keep model checkpoints small for storage and transfer - bit-for-bit reversible, with per-tensor random access. Most useful on collections of related checkpoints (training runs, fine-tune families, model registries), and in environments the Hugging Face Hub's Xet backend doesn't cover - self-hosted registries, internal MLOps, plain object storage.
Documentation - quickstart, full usage, how it works, and the API reference.
pip install z4ai
import z4ai
blob = z4ai.compress(weights_bytes) # smaller, self-describing
data = z4ai.decompress(blob) # byte-identical original
assert data == weights_bytes
Its strongest case is a sequence of related checkpoints - consecutive ones are ~95-99% identical, so each is stored as a tiny delta from the one before:
# store checkpoint N as the bit-exact delta from checkpoint N-1
delta = z4ai.compress_delta(step_2000, reference=step_1000)
restored = z4ai.decompress_delta(delta, reference=step_1000) # exact == step_2000
# only the changed bytes cost anything - often 10-100x smaller than a full compress
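To make the delta idea concrete, here is a minimal sketch of one way a bit-exact delta can work: XOR the new checkpoint against the reference so unchanged bytes become zero, then compress the mostly-zero result. It is an illustration only, not necessarily how compress_delta is implemented. `xor_delta` and `apply_xor_delta` are hypothetical helpers, stdlib `zlib` stands in for the real codec, and the "checkpoints" are random bytes with about 5% of positions changed.

```python
import random
import zlib


def xor_delta(new: bytes, reference: bytes) -> bytes:
    """Hypothetical helper: compress `new` as an XOR delta against `reference`."""
    if len(new) != len(reference):
        raise ValueError("this sketch only handles equal-length buffers")
    # Unchanged bytes XOR to zero, so the diff is mostly zeros.
    diff = bytes(a ^ b for a, b in zip(new, reference))
    return zlib.compress(diff, 9)


def apply_xor_delta(delta: bytes, reference: bytes) -> bytes:
    """Hypothetical helper: rebuild the new buffer from the delta and the reference."""
    diff = zlib.decompress(delta)
    return bytes(a ^ b for a, b in zip(diff, reference))


# Stand-in data (assumption): 256 KiB of random bytes, ~5% of positions flipped.
random.seed(0)
step_1000 = random.randbytes(1 << 18)
changed = bytearray(step_1000)
for i in random.sample(range(len(changed)), len(changed) // 20):
    changed[i] ^= 0xFF
step_2000 = bytes(changed)

delta = xor_delta(step_2000, step_1000)
full = zlib.compress(step_2000, 9)
restored = apply_xor_delta(delta, step_1000)

assert restored == step_2000  # bit-exact round trip
print(f"full: {len(full)} bytes, delta: {len(delta)} bytes")
```

The reference must be available at decode time, which is why both calls above take it as an argument.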
Or from the command line, on files:
z4ai compress weights.bin -o weights.z4ai
z4ai decompress weights.z4ai -o weights.bin
z4ai info weights.z4ai # ratio + per-plane breakdown
Requires Python >= 3.9; pure Python (NumPy + zstandard), with optional native
acceleration.
TL;DR
z4ai is a codec built for the byte structure of float tensors. Compared with ZipNN (the closest weights-specific codec):
- Ties on dense weights, with a slight edge on real models (distilgpt2 +1.6-8.2%, pythia-70m +11-29%) from an order-1 context exponent coder.
- Wins big on repeated structure - tied embeddings, duplicated layers, multi-shard concatenations - which z4ai dedups across the whole tensor.
- 2.3-3.0x on reduced-precision fp32 files (fp16/bf16-origin values), automatically.
- 2.4-10.8x on quantized weights shipped in a wide container - INT4/INT8/FP8 (GPTQ / AWQ / compressed-tensors) dequantised into bf16/fp16/fp32 - via an automatic lossless palette transform. This is the common deployed format, and z4ai beats ZipNN on every case here (e.g. INT8-in-fp32 4.72x vs 1.94x; INT4-in-fp32 10.8x vs 4.6x).
- Big wins on sparse / pruned weights, and 10-180x on checkpoint sequences via the lossless compress_delta mode - which ZipNN has no equivalent for.
- Slower to compress than ZipNN's compiled-C core; decompress is competitive.
Honest ceiling. On a dense checkpoint a trained float's mantissa is near-random and its exponent carries only ~2.6 bits, capping any lossless codec at ~1.5x (bf16) / ~1.2x (fp32). ZipNN already hits that wall, so z4ai can't meaningfully out-ratio it there - it wins by a hair via order-1 rANS on the exponent. The large wins come from redundancy the entropy bound assumes away: reduced precision, sparsity, structure, and cross-checkpoint deltas.
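The arithmetic behind that ceiling can be checked in a few lines. This sketch assumes, as the paragraph above does, that the sign and mantissa bits are incompressible and that the 8-bit exponent field carries about 2.6 bits of information; `lossless_ceiling` is a hypothetical helper written for this illustration.

```python
# Assumption (taken from the text above): the exponent field of a trained
# weight carries ~2.6 bits; sign and mantissa bits are treated as pure noise.
EXPONENT_ENTROPY_BITS = 2.6


def lossless_ceiling(total_bits: int, exponent_bits: int) -> float:
    """Best possible ratio if only the exponent field is compressible."""
    incompressible_bits = total_bits - exponent_bits  # sign + mantissa
    return total_bits / (incompressible_bits + EXPONENT_ENTROPY_BITS)


bf16_ceiling = lossless_ceiling(total_bits=16, exponent_bits=8)
fp32_ceiling = lossless_ceiling(total_bits=32, exponent_bits=8)

print(f"bf16: {bf16_ceiling:.2f}x")  # 16 / 10.6 -> 1.51x
print(f"fp32: {fp32_ceiling:.2f}x")  # 32 / 26.6 -> 1.20x
```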
All numbers below are measured on this repo and reproducible with one command.
Benchmarks vs ZipNN
Machine: 16 cores | Python 3.14 | zstandard 0.25.0 | zipnn (latest) | 32 MB per dtype | best-of-3 timing. Every codec is verified byte-exact (lossless).
Compression ratio (higher is better)
| Scenario | dtype | z4ai | ZipNN | zstd‑3 | z4ai vs ZipNN |
|---|---|---|---|---|---|
| Dense / i.i.d. weights | bf16 | 1.413 | 1.417 | 1.227 | -0.3% (tie) |
| Dense / i.i.d. weights | fp32 | 1.171 | 1.172 | 1.061 | -0.1% (tie) |
| Structured (repeated/duplicated) | bf16 | 58.1 | 1.51 | 16.97 | +3750% |
| Structured (repeated/duplicated) | fp32 | 47.3 | 1.20 | 14.24 | +3831% |
| Sparse (50% zeros) | bf16 | 2.47 | 2.20 | 1.88 | +12.5% |
| Sparse (50% zeros) | fp32 | 2.21 | 1.86 | 1.79 | +18.9% |
| Quantized INT8 (dequantised to…) | bf16 | 2.39 | 2.07 | 1.79 | +15.6% |
| Quantized INT8 (dequantised to…) | fp32 | 4.72 | 1.94 | 3.07 | +143% |
| Quantized INT4 (dequantised to…) | bf16 | 5.41 | 3.87 | 3.91 | +39.9% |
| Quantized INT4 (dequantised to…) | fp32 | 10.77 | 4.59 | 5.13 | +135% |
Quantized rows: per-tensor INT4/INT8 weights dequantised back into a wide
float container (the format most quantized models ship in), same 32 MB/dtype
config as above. z4ai auto-selects the lossless palette transform; several ZipNN
entries here are not byte-exact on the tested build. Reproduce with
python benchmarks/bench_palette.py (which reports a stronger
zstd-19 baseline, so z4ai's margin there is conservative).
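The idea behind a palette transform can be sketched briefly: when a wide float tensor holds only a handful of distinct values (16 for INT4), store each distinct value once and replace every element with a small index. The code below is an illustration under that assumption, not z4ai's actual transform or container format; `palette_encode` and `palette_decode` are hypothetical helpers, and the input is synthetic INT4 codes dequantised into fp32 with a single made-up scale.

```python
import random
import struct


def palette_encode(raw: bytes, width: int = 4):
    """Hypothetical helper: map each `width`-byte value to a one-byte index."""
    if len(raw) % width:
        raise ValueError("buffer length must be a multiple of the value width")
    values = [raw[i:i + width] for i in range(0, len(raw), width)]
    palette = sorted(set(values))
    if len(palette) > 256:
        raise ValueError("too many distinct values for a one-byte index")
    lookup = {value: index for index, value in enumerate(palette)}
    indices = bytes(lookup[value] for value in values)
    return palette, indices


def palette_decode(palette, indices: bytes) -> bytes:
    """Hypothetical helper: expand the indices back into the original bytes."""
    return b"".join(palette[i] for i in indices)


# Stand-in data (assumption): 4096 INT4 codes dequantised to fp32 with one scale.
random.seed(0)
scale = 0.0123
codes = [random.randrange(-8, 8) for _ in range(4096)]
raw = struct.pack(f"<{len(codes)}f", *(code * scale for code in codes))

palette, indices = palette_encode(raw)
assert palette_decode(palette, indices) == raw  # lossless round trip
print(f"{len(palette)} distinct values, {len(raw)} -> {len(indices)} index bytes")
```

One index byte per fp32 value is already 4x smaller before any entropy coding; packing two 4-bit indices per byte and entropy-coding the result would shrink it further.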
Real & production workloads
| Workload | z4ai | ZipNN | Note |
|---|---|---|---|
| Real checkpoint - bert-tiny, 17.7 MB fp32 (downloaded) | 1.188 | 1.202 | -1.2% - a single dense checkpoint ~ i.i.d., so a small loss. The win is on redundancy, not dense noise. |
| Production .safetensors - 201 MB BF16 with a tied embedding | 1.525 | 1.510 | +1.0% vs per-tensor ZipNN - z4ai dedups the tied embed_tokens/lm_head that ZipNN's 256 KiB chunking can't. |
| Realistic full checkpoint - 107 MB BF16 (tied embeddings, shared blocks, optimizer state, 50% pruned layer) | 2.93 | 1.67 | +75.7% - z4ai's whole-buffer LZ dedups the structure real checkpoints carry; ZipNN's chunked Huffman cannot see across chunks. |
| Checkpoint delta - bert-tiny BF16, 5% of weights changed | 51.1 | ~1.7 | 30x smaller than from-scratch. compress_delta stores only what changed (1% → 184x; 20% → 18x). ZipNN has no delta mode. |
Reproduce
python benchmarks/benchmark.py --mb 32 --dtypes bf16 fp32 --scenario iid
python benchmarks/benchmark.py --mb 32 --dtypes bf16 fp32 --scenario structured
python benchmarks/benchmark.py --mb 32 --dtypes bf16 fp32 --scenario sparse
python benchmarks/bench_real_checkpoint.py # downloads a real .bin checkpoint
python benchmarks/bench_safetensors.py --layers 8 --d 1024
python benchmarks/checkpoint_bench.py --mb 96 # realistic structured checkpoint
Throughput
| Codec (i.i.d. bf16) | compress (MB/s) | decompress (MB/s) |
|---|---|---|
| z4ai | 1420 | 16700 |
| ZipNN | 8125 | 20020 |
z4ai compresses ~6x slower and decompresses ~1.2x slower than ZipNN's compiled-C
core - the deliberate trade for a write-once, read-many artifact. A fused
multithreaded native codec (z4ai.chunked) and effort="fast"/"max" tiers
trade decode latency against file size.
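For readers who want to measure their own data, a best-of-3 harness in the spirit of the numbers above might look like the sketch below. It is not the repo's benchmark script: `bench` is a hypothetical helper, stdlib `zlib` stands in for the codec under test, and the input is synthetic repetitive bytes.

```python
import time
import zlib


def bench(compress, decompress, data: bytes, repeats: int = 3):
    """Return (ratio, compress MB/s, decompress MB/s), best of `repeats` runs."""
    best_compress = best_decompress = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        blob = compress(data)
        t1 = time.perf_counter()
        out = decompress(blob)
        t2 = time.perf_counter()
        if out != data:
            raise AssertionError("codec is not lossless on this input")
        best_compress = min(best_compress, t1 - t0)
        best_decompress = min(best_decompress, t2 - t1)
    megabytes = len(data) / 1e6
    return len(data) / len(blob), megabytes / best_compress, megabytes / best_decompress


# Stand-in data (assumption): ~1 MB of highly repetitive bytes.
data = bytes(range(256)) * 4096
ratio, compress_mbps, decompress_mbps = bench(zlib.compress, zlib.decompress, data)
print(f"ratio {ratio:.1f}x, compress {compress_mbps:.0f} MB/s, decompress {decompress_mbps:.0f} MB/s")
```

Passing z4ai.compress and z4ai.decompress in place of the zlib functions would time z4ai the same way.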
How it works
A float is [ sign | exponent | mantissa ]: in trained weights the exponent bits
repeat heavily while the mantissa looks like noise, and the two are interleaved
byte-by-byte - so a general-purpose zip can't separate them (plain zstd on raw
fp32 barely reaches ~1.06x). z4ai pulls the bytes apart, matches redundancy across
the whole tensor, then entropy-codes each part near its floor:
flowchart TD
IN["Float tensor bytes<br/>(bf16 / fp16 / fp32 / fp64)"]
IN -->|split by dtype| SPLIT["Plane / bit-field split"]
SPLIT --> EXP["Exponent / sign plane<br/>(low entropy)"]
SPLIT --> MAN["Mantissa plane<br/>(noise-like)"]
EXP --> ENT["Entropy coding<br/>(rANS / zstd)"]
MAN --> STORE["Store verbatim, or zstd"]
ENT --> LDM["Whole-tensor long-distance matching<br/>dedups tied / repeated weights"]
STORE --> LDM
LDM --> BEST["Best-of selection: keep the smallest<br/>never worse than plain zstd"]
BEST --> OUT["Self-describing container<br/>(.z4ai / .zstn)"]
Decoding is the exact inverse, driven entirely by the self-describing header - no side-channel metadata, and the output is never larger than the input. Pruned weights take a zero-aware path; the safetensors/ZSTN container adds a per-tensor index for random-access reads and stores tied weights once.
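The plane split at the top of the diagram can be illustrated with the standard library alone. The sketch below is not z4ai's implementation: it uses synthetic Gaussian fp32 values as stand-in weights, splits them into four byte planes with slicing, and lets `zlib` stand in for the entropy coder, to show that the plane holding the sign and top exponent bits compresses far better than the mantissa planes.

```python
import random
import struct
import zlib

# Stand-in weights (assumption): 50,000 Gaussian values packed as little-endian fp32.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
raw = struct.pack(f"<{len(weights)}f", *weights)

# Plane k holds byte k of every value. In little-endian fp32, plane 3 carries
# the sign bit and the top 7 exponent bits; planes 0 and 1 are pure mantissa.
planes = [raw[k::4] for k in range(4)]
plane_sizes = [len(zlib.compress(plane, 9)) for plane in planes]

interleaved_size = len(zlib.compress(raw, 9))
split_size = sum(plane_sizes)

# The split is trivially reversible: re-interleave the planes.
restored = bytearray(len(raw))
for k, plane in enumerate(planes):
    restored[k::4] = plane
assert bytes(restored) == raw

print(f"interleaved: {interleaved_size} bytes, split: {split_size} bytes")
print("per-plane sizes:", plane_sizes)
```

Almost all of the saving comes from plane 3, which matches the "low entropy" exponent branch in the diagram; the mantissa planes stay close to their original size.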
References
z4ai's building blocks are well-studied; the contribution is applying them to the byte structure of model weights and matching across the whole tensor - and across checkpoints. On dense weights it cannot meaningfully out-ratio ZipNN (the entropy ceiling binds every lossless codec equally); the wins come from structure, reduced precision, sparsity, and cross-checkpoint deltas.
- Float field decorrelation - Lindstrom & Isenburg, fpzip, IEEE TVCG 2006 (project); applied to NN weights by ZipNN (the codec benchmarked against here).
- Long-distance LZ matching - Ziv & Lempel, LZ77 (1977), via Zstandard (RFC 8878).
- Entropy coding (rANS) - Duda, Asymmetric Numeral Systems (2013).
- Cross-checkpoint / cross-model delta - ZipLLM (NSDI 2026) - the basis for compress_delta/model_delta.
The full survey (DFloat11, NeuZip, DietGPU, ECF8, ALP, Pcodec, the BF16 entropy ceiling) is in the docs.
Download files
Source Distribution
File details
Details for the file z4ai-0.2.0.tar.gz.
File metadata
- Download URL: z4ai-0.2.0.tar.gz
- Upload date:
- Size: 164.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 340d467ab58a518f3f3696050fa0ecd4271aec3a6d5673e21cf64bd7fa84ceab |
| MD5 | 5316beb2a1f2068dd738af5b1eb13bff |
| BLAKE2b-256 | 585812f84d85dd8428e0ba0d4fe8d3088fb261aa1bc4122d87b78f905fb9928a |