Lossless compression codec tuned for neural-network model weights
z4ai
A lossless storage-and-distribution layer for AI model checkpoints.
Keep model checkpoints small for storage and transfer - bit-for-bit reversible, with per-tensor random access. Most useful on collections of related checkpoints (training runs, fine-tune families, model registries), and in environments the Hugging Face Hub's Xet backend doesn't cover - self-hosted registries, internal MLOps, plain object storage.
Documentation - quickstart, full usage, how it works, and the API reference.
import z4ai
blob = z4ai.compress(weights_bytes, dtype="bf16") # smaller, self-describing
data = z4ai.decompress(blob) # byte-identical original
assert data == weights_bytes
Its strongest case is a sequence of related checkpoints - consecutive ones are ~95-99% identical, so each is stored as a tiny delta from the one before:
# store checkpoint N as the bit-exact delta from checkpoint N-1
delta = z4ai.compress_delta(step_2000, reference=step_1000, dtype="bf16")
restored = z4ai.decompress_delta(delta, reference=step_1000) # exact == step_2000
# only the changed bytes cost anything - often 10-100x smaller than a full compress
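To see why a delta against a near-identical reference is so cheap, here is a minimal sketch of one way a bit-exact delta can work: XOR the two buffers so unchanged bytes become zero, then hand the result to a general-purpose compressor. This is an illustration only, not z4ai's actual `compress_delta` implementation; the `xor_delta` and `apply_delta` helpers are hypothetical names, and stdlib `zlib` stands in for the real entropy stage.

```python
import random
import zlib


def xor_delta(new: bytes, reference: bytes) -> bytes:
    """Compress `new` as its XOR difference from `reference` (hypothetical helper)."""
    if len(new) != len(reference):
        raise ValueError("this sketch only handles equal-length buffers")
    # XOR through big integers: every unchanged byte becomes 0x00.
    diff = int.from_bytes(new, "little") ^ int.from_bytes(reference, "little")
    return zlib.compress(diff.to_bytes(len(new), "little"), 9)


def apply_delta(delta: bytes, reference: bytes) -> bytes:
    """Invert xor_delta: XOR the stored difference back onto the reference."""
    diff = int.from_bytes(zlib.decompress(delta), "little")
    restored = diff ^ int.from_bytes(reference, "little")
    return restored.to_bytes(len(reference), "little")


# Synthetic stand-ins for two consecutive checkpoints: ~2% of bytes differ.
rng = random.Random(0)
step_1000 = rng.randbytes(1 << 16)
changed = bytearray(step_1000)
for i in rng.sample(range(len(changed)), k=len(changed) // 50):
    changed[i] ^= 0xFF  # flipping all bits guarantees the byte really changes
step_2000 = bytes(changed)

delta = xor_delta(step_2000, step_1000)
full = zlib.compress(step_2000, 9)
print(len(delta), "bytes as delta vs", len(full), "bytes compressed from scratch")
```

The round trip is exact because XOR is its own inverse; the size win comes entirely from the long runs of zero bytes in the difference.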
Install
pip install z4ai
Requires Python >= 3.9. Pure Python (NumPy + zstandard); the native entropy core is optional and z4ai degrades gracefully without it. Full installation guide (optional extras, native acceleration) in the docs.
TL;DR
z4ai is a codec built for the byte structure of float tensors. Compared with ZipNN (the closest weights-specific codec):
- Ties on dense weights, with a slight edge on real models (distilgpt2 +1.6-8.2%, pythia-70m +11-29%) from an order-1 context exponent coder.
- Wins big on repeated structure - tied embeddings, duplicated layers, multi-shard concatenations - which z4ai dedups across the whole tensor.
- 2.3-3.0x on reduced-precision fp32 files (fp16/bf16-origin values), automatically.
- 2.4-10.8x on quantized weights shipped in a wide container - INT4/INT8/FP8 (GPTQ / AWQ / compressed-tensors) dequantised into bf16/fp16/fp32 - via an automatic lossless palette transform. This is the common deployed format, and z4ai beats ZipNN on every case here (e.g. INT8-in-fp32 4.72x vs 1.94x; INT4-in-fp32 10.8x vs 4.6x).
- Big wins on sparse / pruned weights, and 10-180x on checkpoint sequences via the lossless compress_delta mode - which ZipNN has no equivalent for.
- Slower to compress than ZipNN's compiled-C core; decompress is competitive.
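The palette idea behind the quantized-weights bullet can be sketched in a few lines: when a wide float tensor only ever holds a handful of distinct values, store those values once and replace each element with a small index. This is a simplified illustration under stated assumptions, not z4ai's actual transform; `palette_encode`, `palette_decode`, and the `scale` value are hypothetical.

```python
import struct


def palette_encode(raw: bytes, itemsize: int):
    """Map each distinct `itemsize`-byte value to a one-byte index (sketch only)."""
    items = [raw[i:i + itemsize] for i in range(0, len(raw), itemsize)]
    palette = sorted(set(items))
    if len(palette) > 256:
        raise ValueError("too many distinct values for a one-byte index")
    lookup = {value: idx for idx, value in enumerate(palette)}
    indices = bytes(lookup[item] for item in items)
    return palette, indices


def palette_decode(palette, indices: bytes) -> bytes:
    """Rebuild the original bytes by looking each index up in the palette."""
    return b"".join(palette[i] for i in indices)


# 16 INT4 levels dequantised into fp32 with a made-up per-tensor scale.
scale = 0.0125
levels = [struct.pack("<f", (q - 8) * scale) for q in range(16)]
raw = b"".join(levels[(i * 7) % 16] for i in range(4096))

palette, indices = palette_encode(raw, itemsize=4)
print(len(palette), "palette entries;", len(raw), "->", len(indices), "index bytes")
```

In this sketch one byte per index already shrinks the fp32 payload 4x before any entropy coding, and the mapping is exactly reversible because the palette keeps the original bit patterns.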
Honest ceiling. On a dense checkpoint a trained float's mantissa is near-random and its exponent carries only ~2.6 bits, capping any lossless codec at ~1.5x (bf16) / ~1.2x (fp32). ZipNN already hits that wall, so z4ai can't meaningfully out-ratio it there - it wins by a hair via order-1 rANS on the exponent. The large wins come from redundancy the entropy bound assumes away: reduced precision, sparsity, structure, and cross-checkpoint deltas.
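The ceiling figures follow from simple bit accounting, sketched below. The assumption (ours, for illustration) is that the sign and mantissa bits are treated as incompressible and only the 8-bit exponent field shrinks to the ~2.6 bits quoted above. The second helper measures exponent entropy empirically on synthetic Gaussian values, which are only a stand-in for real trained weights; `lossless_ceiling` and `exponent_entropy_fp32` are hypothetical names.

```python
import math
import random
import struct
from collections import Counter


def lossless_ceiling(total_bits: int, exponent_bits: int, exponent_entropy: float) -> float:
    """Best-case ratio if only the exponent field is compressible."""
    incompressible = total_bits - exponent_bits  # sign + mantissa, assumed random
    return total_bits / (incompressible + exponent_entropy)


def exponent_entropy_fp32(values) -> float:
    """Shannon entropy (bits) of the 8-bit exponent field of fp32 values."""
    counts = Counter()
    for v in values:
        bits = struct.unpack("<I", struct.pack("<f", v))[0]
        counts[(bits >> 23) & 0xFF] += 1
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())


bf16_ceiling = lossless_ceiling(16, 8, 2.6)  # 16 / (8 + 2.6)
fp32_ceiling = lossless_ceiling(32, 8, 2.6)  # 32 / (24 + 2.6)
print(f"bf16 ceiling ~{bf16_ceiling:.2f}x, fp32 ceiling ~{fp32_ceiling:.2f}x")

# Synthetic check: Gaussian "weights" with a made-up standard deviation.
rng = random.Random(0)
h = exponent_entropy_fp32(rng.gauss(0.0, 0.02) for _ in range(20000))
print(f"measured exponent entropy: {h:.2f} bits")
```

The arithmetic reproduces the ~1.5x (bf16) and ~1.2x (fp32) caps; the measured entropy on Gaussian samples should land between 2 and 3 bits, in line with the figure quoted above.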
All numbers below are measured on this repo and reproducible with one command.
Benchmarks vs ZipNN
Machine: 16 cores | Python 3.14 | zstandard 0.25.0 | zipnn (latest) |
32 MB per dtype | best-of-3 timing. Every codec is verified byte-exact (lossless).
Compression ratio (higher is better)
| Scenario | dtype | z4ai | ZipNN | zstd‑3 | z4ai vs ZipNN |
|---|---|---|---|---|---|
| Dense / i.i.d. weights | bf16 | 1.413 | 1.417 | 1.227 | -0.3% (tie) |
| Dense / i.i.d. weights | fp32 | 1.171 | 1.172 | 1.061 | -0.1% (tie) |
| Structured (repeated/duplicated) | bf16 | 58.1 | 1.51 | 16.97 | +3750% |
| Structured (repeated/duplicated) | fp32 | 47.3 | 1.20 | 14.24 | +3831% |
| Sparse (50% zeros) | bf16 | 2.47 | 2.20 | 1.88 | +12.5% |
| Sparse (50% zeros) | fp32 | 2.21 | 1.86 | 1.79 | +18.9% |
| Quantized INT8 (dequantised to…) | bf16 | 2.39 | 2.07 | 1.79 | +15.6% |
| Quantized INT8 (dequantised to…) | fp32 | 4.72 | 1.94 | 3.07 | +143% |
| Quantized INT4 (dequantised to…) | bf16 | 5.41 | 3.87 | 3.91 | +39.9% |
| Quantized INT4 (dequantised to…) | fp32 | 10.77 | 4.59 | 5.13 | +135% |
Quantized rows: per-tensor INT4/INT8 weights dequantised back into a wide
float container (the format most quantized models ship in), same 32 MB/dtype
config as above. z4ai auto-selects the lossless palette transform; several ZipNN
entries here are not byte-exact on the tested build. Reproduce with
python benchmarks/bench_palette.py (which reports a stronger
zstd-19 baseline, so z4ai's margin there is conservative).
Real & production workloads
| Workload | z4ai | ZipNN | Note |
|---|---|---|---|
| Real checkpoint - bert-tiny, 17.7 MB fp32 (downloaded) | 1.188 | 1.202 | -1.2% - a single dense checkpoint ~ i.i.d., so a small loss. The win is on redundancy, not dense noise. |
| Production .safetensors - 201 MB BF16 with a tied embedding | 1.525 | 1.510 | +1.0% vs per-tensor ZipNN - z4ai dedups the tied embed_tokens/lm_head that ZipNN's 256 KiB chunking can't. |
| Realistic full checkpoint - 107 MB BF16 (tied embeddings, shared blocks, optimizer state, 50% pruned layer) | 2.93 | 1.67 | +75.7% - z4ai's whole-buffer LZ dedups the structure real checkpoints carry; ZipNN's chunked Huffman cannot see across chunks. |
| Checkpoint delta - bert-tiny BF16, 5% of weights changed | 51.1 | ~1.7 | 30x smaller than from-scratch. compress_delta stores only what changed (1% → 184x; 20% → 18x). ZipNN has no delta mode. |
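The tied-embedding and shared-block rows come down to match distance: a duplicate can only be removed if the compressor can still "see" the first copy. The sketch below shows the effect with two stdlib codecs standing in for the two behaviours (neither is z4ai or ZipNN): `zlib` has a 32 KiB window, while `lzma` uses a multi-MiB dictionary by default.

```python
import lzma
import random
import zlib

rng = random.Random(0)
block = rng.randbytes(256 * 1024)  # stand-in for an incompressible embedding matrix
data = block + block               # a "tied" copy sits 256 KiB later

short_window = zlib.compress(data, 9)  # cannot reach back 256 KiB
long_window = lzma.compress(data)      # dictionary easily covers the whole buffer

ratio_short = len(data) / len(short_window)
ratio_long = len(data) / len(long_window)
print(f"short window: {ratio_short:.2f}x, long window: {ratio_long:.2f}x")
```

The short-window codec gets essentially nothing from the duplicate, while the long-range one stores the second copy almost for free. Any scheme that compresses fixed-size chunks independently hits the same limit for repeats that span chunks.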
Reproduce
python benchmarks/benchmark.py --mb 32 --dtypes bf16 fp32 --scenario iid
python benchmarks/benchmark.py --mb 32 --dtypes bf16 fp32 --scenario structured
python benchmarks/benchmark.py --mb 32 --dtypes bf16 fp32 --scenario sparse
python benchmarks/bench_real_checkpoint.py # downloads a real .bin checkpoint
python benchmarks/bench_safetensors.py --layers 8 --d 1024
python benchmarks/checkpoint_bench.py --mb 96 # realistic structured checkpoint
Throughput
| Codec (i.i.d. bf16) | compress (MB/s) | decompress (MB/s) |
|---|---|---|
| z4ai | 1420 | 16700 |
| ZipNN | 8125 | 20020 |
z4ai compresses ~6x slower and decompresses ~1.2x slower than ZipNN's compiled-C
core - the deliberate trade for a write-once, read-many artifact. A fused
multithreaded native codec (z4ai.chunked) and effort="fast"/"max" tiers
trade decode latency against file size.
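For readers who want to sanity-check throughput on their own hardware, a best-of-N timing loop along these lines is enough. This is a generic sketch, not the repo's benchmark script; `best_of_mb_s` is a hypothetical helper and stdlib `zlib` stands in for whichever codec is under test. Both directions are reported relative to the uncompressed size.

```python
import random
import time
import zlib


def best_of_mb_s(fn, arg, uncompressed_size: int, repeats: int = 3) -> float:
    """Run fn(arg) `repeats` times and return MB/s for the fastest run."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return uncompressed_size / max(best, 1e-9) / 1e6


payload = random.Random(0).randbytes(1 << 20)  # 1 MiB of synthetic data
blob = zlib.compress(payload)

compress_mb_s = best_of_mb_s(zlib.compress, payload, len(payload))
decompress_mb_s = best_of_mb_s(zlib.decompress, blob, len(payload))
print(f"compress {compress_mb_s:.0f} MB/s, decompress {decompress_mb_s:.0f} MB/s")
```

Taking the minimum over a few runs filters out scheduler noise; swap in the codec calls you actually care about in place of `zlib`.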
Documentation
Full docs - quickstart, usage, how it works, CLI, and the API reference - live at z4ai.github.io/z4ai.
| Page | What's there |
|---|---|
| Quickstart | Compress a buffer, an ndarray, or a .safetensors file in a few lines. |
| Usage | Effort tiers, sparse/quantized weights, checkpoint & model deltas, per-tensor random access, the high-throughput native path. |
| How it works | Field decorrelation, whole-tensor matching, the best-of selector, and where the codec pays off. |
| CLI | z4ai compress / decompress / info, pipe-friendly. |
| Background & references | Prior art (ZipNN, DFloat11, NeuZip, ZipLLM, fpzip, rANS/FSE ...) and the honest entropy-ceiling framing. |
| API reference | Every public function, generated from the source. |
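As a taste of what "field decorrelation" can mean for float tensors, the sketch below splits a bf16 buffer into byte planes so that the exponent-heavy high bytes and the near-random low bytes are compressed separately. It is a generic byte-grouping illustration, not a description of z4ai's internals; `split_planes` and `merge_planes` are hypothetical helpers, the weights are synthetic Gaussians, and `zlib` is a stand-in entropy stage.

```python
import random
import struct
import zlib


def split_planes(raw: bytes, itemsize: int):
    """Group byte 0 of every element together, then byte 1, and so on."""
    return [raw[i::itemsize] for i in range(itemsize)]


def merge_planes(planes) -> bytes:
    """Interleave the planes back into the original element order."""
    itemsize = len(planes)
    out = bytearray(sum(len(p) for p in planes))
    for i, plane in enumerate(planes):
        out[i::itemsize] = plane
    return bytes(out)


# Synthetic bf16 weights: keep the top two bytes of little-endian fp32 values.
rng = random.Random(0)
raw = b"".join(struct.pack("<f", rng.gauss(0.0, 0.02))[2:] for _ in range(20000))

low_plane, high_plane = split_planes(raw, itemsize=2)
low_size = len(zlib.compress(low_plane, 9))    # 1 exponent bit + 7 mantissa bits
high_size = len(zlib.compress(high_plane, 9))  # sign + 7 exponent bits
print("low-byte plane:", low_size, "bytes; high-byte plane:", high_size, "bytes")
```

Nearly all of the savings come from the high-byte plane, which matches the entropy-ceiling picture: the exponent is the only field with much structure to exploit.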
Download files
Source Distribution
File details
Details for the file z4ai-0.1.0.tar.gz.
File metadata
- Download URL: z4ai-0.1.0.tar.gz
- Upload date:
- Size: 159.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 003e49d236a54b56a7c71f23758eac9a2808f68743e2da29ff3f5657830020d2 |
| MD5 | 2e9a923591e52334cad57e5866c4fde8 |
| BLAKE2b-256 | ce757133e7621d3d1c7d41640ecc56e65e37a239d981a22007238e8f0ab576de |