Lossless 5-bit transformer compression — 8 architectures (1.7B–70B, dense + MoE) at sub-1% PPL degradation. Customer-distributable via `uc pack v0.3`.
UltraCompress
First mathematically lossless 5-bit transformer compression library, validated end-to-end across 11+ architectures including state-space models.
Current state (2026-05-08)
- 11 architectures validated end-to-end cumulative through this morning — 10 transformer + Mamba-2.8B SSM, bit-identical W_base reconstruction at 5 bpw.
- 4 more dense archs added today on GPU 1: SmolLM2-1.7B, TinyLlama-1.1B-Chat, Qwen3-0.6B, OLMo-2-0425-1B (queued retry after streaming-fix patch).
- Hermes-3-Llama-3.1-405B compression in flight on GPU 0: 53/126 layers complete, ETA tonight.
- 2 public HuggingFace artifacts uc-verify-PASS (qwen3-1.7b, mistral-7b-v0.3); 8 more in-flight upload-pending after local PASS.
- Multi-arch PPL ratios at 5 bpw (representative): Mistral-7B 1.0100, Llama-3.1-8B 1.0125, Mamba-2.8B 1.0119.
- 10/10 local production packs PASS `uc verify` (bit-identical W_base reconstruction).
Live verification status — every public artifact, file hash, and verifier exit code in one place: docs/PUBLIC_VERIFICATION_DASHBOARD_2026_05_08.md.
Quick start
```
pip install -U ultracompress   # 0.5.2
hf download SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5 --local-dir ./mistral
uc verify ./mistral            # CLI entry point
# OR — when `uc` isn't on PATH (Jupyter, CI, Docker, post-install hooks):
python -m ultracompress verify ./mistral   # equivalent fallback
```
uc verify reconstructs W_base = absmax × grid[codes] from the persisted k-means grid + per-block scales + bit-packed integer codes and confirms it is bit-identical to the dequantized weight the trainer used during distillation.
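The reconstruction rule above can be sketched in a few lines of NumPy. Note this is an illustrative stand-in, not the actual pack v0.3 schema: the names `grid`, `absmax`, and `codes`, and the flat per-block layout, are assumptions for the sketch.

```python
import numpy as np

def reconstruct_w_base(grid, absmax, codes, block_size=64):
    """Sketch of W_base = absmax * grid[codes].

    grid    : (32,) float16 k-means codebook (5 bpw -> 2**5 entries)
    absmax  : (n_blocks,) per-block scales
    codes   : (n,) integer indices into the grid
    """
    deq = grid[codes].astype(np.float32)              # codebook lookup
    scales = np.repeat(absmax, block_size)[: deq.size]
    return (deq * scales).astype(np.float16)          # per-block rescale
```

Because the lookup and rescale are pure integer indexing plus a deterministic multiply, re-running the reconstruction always yields the same bits, which is what `uc verify` checks.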
uc serve ./mistral exposes an OpenAI-compatible API at http://localhost:8080. See docs/CUSTOMER_ONBOARDING_FLOW_v3_2026_05_08.md for the full deploy walkthrough.
What's new in v0.5.2 (publishing today)
- `python -m ultracompress` fallback support via new `ultracompress/__main__.py` — unblocks Jupyter, CI, Docker images, and post-install hooks where the `uc` console script is missing from PATH.
- SSM (Mamba / state-space-model) Linear naming added to `TARGET_SUBS` in `pack.py` — `in_proj`, `x_proj`, `dt_proj`, `out_proj` are now packed alongside the standard transformer Linear set.
- Single-file safetensors support in `stream_compress` — unblocks <2B-param models that ship without an index shard (TinyLlama, SmolLM2, Qwen3-0.6B, OLMo-2).
- `olmo` / `olmo2` model_type dispatch added to `streaming_teacher` and `streaming_compression_runner`.
- Full notes: docs/RELEASE_NOTES_v0.5.2.md.
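The `__main__.py` mechanism is the standard one: `python -m ultracompress` executes the package's `__main__` module, which hands off to the same CLI entry point the `uc` console script wraps. The real dispatch logic is not shown here; this minimal sketch just illustrates the pattern, with a hypothetical `main()`:

```python
# Hypothetical shape of ultracompress/__main__.py (illustrative only).
# `python -m ultracompress verify PATH` behaves like `uc verify PATH`.
import sys

def main(argv=None):
    """Parse `{verify|pack|serve} PATH` the way a console-script main would."""
    argv = sys.argv[1:] if argv is None else argv
    if not argv:
        raise SystemExit("usage: python -m ultracompress {verify|pack|serve} PATH")
    command, *rest = argv
    return command, rest   # real code would dispatch to the subcommand here

if __name__ == "__main__":
    main()
```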
Architecture matrix
End-to-end validated at 5 bpw with bit-identical W_base reconstruction. PASS = `uc verify` PASS on the public HF artifact; PASS local = local PASS with HF upload in flight; retry = re-running on v0.5.2 after the streaming fix.
| Architecture | Params | Layers | bpw | PPL ratio | uc verify | HF repo |
|---|---|---|---|---|---|---|
| Qwen3-1.7B | 1.7B | 28 | 5 | 1.0078 | PASS | SipsaLabs/qwen3-1.7b-uc-v3-bpw5 |
| Mistral-7B-v0.3 | 7.2B | 32 | 5 | 1.0100 | PASS | SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5 |
| Llama-3.1-8B | 8.0B | 32 | 5 | 1.0125 | PASS local | (upload pending) |
| Qwen3-8B | 8.0B | 36 | 5 | 1.0044 | PASS local | (upload pending) |
| Qwen3-14B | 14.0B | 40 | 5 | 1.0040 | PASS local | (upload pending) |
| Mixtral-8x7B-v0.1 | 47B (MoE 8 exp) | 32 | 5 | (PPL 5.88) | PASS local | (upload pending) |
| Phi-3.5-MoE-instruct | 42B (MoE 16 exp) | 32 | 5 | (PPL 6.95) | PASS local | (upload pending) |
| Llama-3.1-70B | 70B | 80 | 5 | (PPL 6.02) | PASS local | (upload pending) |
| Qwen2.5-72B | 72B | 80 | 5 | 1.0162 | PASS local | (upload pending) |
| Mamba-2.8B (SSM) | 2.8B | 64 | 5 | 1.0119 | PASS local | (upload pending) |
| Hermes-3-Llama-3.1-405B | 405B | 126 | 5 | (in flight) | 53/126 layers | (compressing GPU 0) |
| SmolLM2-1.7B | 1.7B | 24 | 5 | (in flight) | pending | (added today, GPU 1) |
| TinyLlama-1.1B-Chat | 1.1B | 22 | 5 | (in flight) | pending | (added today, GPU 1) |
| Qwen3-0.6B | 0.6B | 28 | 5 | (in flight) | pending | (added today, GPU 1) |
| OLMo-2-0425-1B | 1B | 16 | 5 | (in flight) | retry on v0.5.2 | (added today, queued) |
PPL ratios are listed against each model's own bf16 baseline on a 100-sample held-out slice. MoE rows show absolute PPL only because the bf16 baseline OOMs on a single GPU; a multi-GPU baseline pipeline lands in v0.6.
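For reference, the ratio column is the compressed model's perplexity divided by the bf16 baseline's, where perplexity is exp of the mean token-level negative log-likelihood. A toy sketch (not the actual eval harness; the NLL values are made up):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean token-level negative log-likelihood)."""
    return math.exp(sum(nlls) / len(nlls))

# Illustration: a compressed model whose mean NLL rises from 1.80 to
# 1.81 nats lands at roughly a 1.01 PPL ratio.
baseline = perplexity([1.80] * 100)
compressed = perplexity([1.81] * 100)
ratio = compressed / baseline
```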
UltraCompress is the first quantization library publicly compatible with both transformer and state-space architectures (Mamba), including the Linear naming required for emerging hybrids such as AI21 Jamba.
How v0.3 lossless works
uc pack v0.3 persists the trainer's k-means learned grid + per-block scales + bit-packed integer codes directly into the customer artifact. Reconstruction is W_base = absmax × grid[codes] and reproduces — bit-identically — the dequantized weight the trainer used during distillation.
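The "bit-packed integer codes" part can be illustrated with a round-trippable 5-bit packer. This is one plausible MSB-first layout, not the actual pack v0.3 wire format, which is not specified here:

```python
import numpy as np

def pack_5bit(codes):
    """Pack 5-bit codes (values 0..31) into a contiguous byte buffer."""
    # unpackbits gives 8 MSB-first bits per byte; keep only the low 5 bits.
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, 3:]
    return np.packbits(bits.reshape(-1))   # 5n bits -> ceil(5n/8) bytes

def unpack_5bit(packed, n):
    """Recover n 5-bit codes from the packed buffer."""
    bits = np.unpackbits(packed)[: n * 5].reshape(n, 5)
    pad = np.zeros((n, 3), dtype=np.uint8)          # restore the 3 high zero bits
    return np.packbits(np.concatenate([pad, bits], axis=1), axis=1).ravel()
```

Because the codes are integers, packing and unpacking are exact, which is what makes byte-for-byte reconstruction possible downstream.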
This is the only mathematically lossless 5-bit transformer quantization format in production. AWQ / GPTQ / EXL3 / bitsandbytes-int4 introduce measurable PPL drift between training-time eval and customer-time inference. UltraCompress v0.3 customers see identical inference behavior to what the trainer measured.
| Customer profile | Why bit-exact reconstruction matters |
|---|---|
| Defense / aerospace | Bit-exact deploy is a compliance requirement (audit trail). |
| Healthcare AI (FDA-regulated) | Model equivalence required between dev and deploy. |
| Finance (SR 11-7 model validation) | Reproducibility audit requires bit-exact recovery. |
| Frontier labs (internal artifact distribution) | Red-team eval fidelity requires identical inference. |
| Single-GPU 70B+ deployment | Streaming compression keeps peak VRAM ~one transformer layer. |
Full vendor comparison: docs/COMPETITIVE_LANDSCAPE_v3_LOSSLESS_2026_05_08.md.
Streaming compression — single-GPU large-model headline
Per-layer streaming validated end-to-end across 8B → 72B with peak VRAM bounded by ~one transformer layer regardless of total depth.
| Model | Layers | PPL ratio | Peak VRAM |
|---|---|---|---|
| Qwen3-8B | 36 | 1.0278 | 2.26 GB |
| Qwen3-14B | 40 | 1.0111 | 3.37 GB |
| Qwen3-32B | 64 | 1.0367 | 4.85 GB |
| Qwen2.5-72B | 80 | 1.0162 | 8.98 GB |
Recipe: GSQ scalar 5 bpw + per-block (B=64) absmax + V18-C rank-32 low-rank correction + 200-step KL distillation per layer. Process: lazy-load layer fp16 weights via safetensors → cache teacher hidden output → quantize → fit V18-C against cache → save → free → next layer. Compression time ~1 min/layer.
Reproduce on a 5090 (~9 min for 8B):
```
python scripts/overlay/streaming_compression_runner.py \
  --model qwen3-8b --bpw 5 --block_size 64 --rank 32 \
  --train_steps 200 --n_calib 100 --n_eval 50
```
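The per-layer process described in the recipe can be sketched as a loop whose peak memory is bounded by one layer: each layer is loaded, compressed against the cached teacher output, saved, and freed before the next. All callables below are stand-ins; the real runner uses safetensors lazy loading, GSQ 5 bpw quantization, and a rank-32 V18-C fit with KL distillation.

```python
import numpy as np

def stream_compress(layer_names, load_layer, quantize, fit_correction, save, calib):
    """Per-layer streaming loop: only one layer's weights are resident at a time."""
    activations = calib                       # calibration hidden states
    for name in layer_names:
        w = load_layer(name)                  # lazy-load this layer's fp16 weights
        teacher_out = activations @ w.T       # cache the teacher's hidden output
        q = quantize(w)                       # quantize (stand-in for GSQ 5 bpw)
        q = fit_correction(q, w, activations, teacher_out)  # low-rank fix-up (stand-in)
        save(name, q)                         # persist the compressed layer
        activations = activations @ q.T       # student output feeds the next layer
        del w                                 # free before loading the next layer
    return activations
```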
Earlier research tracks
Three independent compression mechanisms compose multiplicatively:
- Track A — streaming compression (above): single-GPU 72B at PPL 1.0162.
- Track B — Fractal Residual Recursion (Claims 1–16): shared-block architectural compression at 311–734× on Qwen3-1.7B (HQ5 h256 reaches 70.0% T10). See docs/HQ5_RESULTS.md, REPRODUCE.md.
- Track C — row-overlay sub-3-bpw quantization (Claims 17–20): beats bitsandbytes-nf4 at 30% fewer bits on a 6-model cohort (n=500 LAMBADA). Zero catastrophic failures across 48 measurements vs. HQQ's 6/6 at 2-bit g64. See RESULTS.md, docs/claim20_summary.txt.
Stacked, the projection for a 100T-parameter model on a single GPU is ~5 GB at 20,000× total compression — see docs/100T_MISSION_MATH_2026_05_03.md. Track A + B + C numbers are individually validated; full multiplicative stack is an architectural projection.
Repository layout
```
ultracompress/
├── ultracompress/     Core library (pack v0.3, FractalModel, pipeline, __main__)
├── scaling/           Cross-model teacher loaders (Qwen3 / Llama / Mistral / Mamba / OLMo)
├── scripts/overlay/   Track A (row-overlay + streaming compression)
├── scripts/frr/       Track B (FRR architectural compression)
├── tools/             Model download, quantization utilities
├── tests/             Regression tests
├── results/           Measurement JSONs (indexed by claim)
├── logs/              Run logs
└── docs/              Patents, dashboards, customer flow, competitive landscape
```
Index: RESULTS.md, PATENT_CLAIMS.md, REPRODUCE.md.
Patent disclosure
USPTO provisionals 64/049,511 and 64/049,517 filed 2026-04-25 covering the row-overlay quantization, FRR architectural compression, streaming-compression mechanism, and v0.3 lossless pack format.
License
- Apache-2.0 for the CLI, verifier, and customer-facing pack format — see LICENSE.
- Sipsa Labs Research Evaluation License v1.0 for compression internals (k-means trainer, V18-C overlay fit, FRR distillation pipeline) — see LICENSE_RESEARCH_EVAL.md.
Citation
```
@misc{ultracompress2026,
  title  = {UltraCompress: Mathematically Lossless 5-bit Transformer
            Compression Across 11+ Architectures},
  author = {Sipsa Labs},
  year   = {2026},
  url    = {https://github.com/sipsalabs/ultracompress}
}
```