Skip to main content

Lossless 5-bit transformer compression — 8 architectures (1.7B–70B, dense + MoE) at sub-1% PPL degradation. Customer-distributable via `uc pack v0.3`.

Project description

UltraCompress

First mathematically lossless 5-bit transformer compression library, validated end-to-end across 11+ architectures including state-space models.

License Python 3.10+ PyPI Patent


Current state (2026-05-08)

  • 11 architectures validated end-to-end cumulative through this morning — 10 transformer + Mamba-2.8B SSM, bit-identical W_base reconstruction at 5 bpw.
  • 4 more dense archs added today on GPU 1: SmolLM2-1.7B, TinyLlama-1.1B-Chat, Qwen3-0.6B, OLMo-2-0425-1B (queued retry after streaming-fix patch).
  • Hermes-3-Llama-3.1-405B compression in flight on GPU 0: 53/126 layers complete, ETA tonight.
  • 2 public HuggingFace artifacts uc-verify-PASS (qwen3-1.7b, mistral-7b-v0.3); 8 more in-flight upload-pending after local PASS.
  • Multi-arch PPL ratios at 5 bpw (representative): Mistral-7B 1.0100, Llama-3.1-8B 1.0125, Mamba-2.8B 1.0119.
  • 10/10 local production packs PASS uc verify (bit-identical W_base reconstruction).

Live verification status — every public artifact, file hash, and verifier exit code in one place: docs/PUBLIC_VERIFICATION_DASHBOARD_2026_05_08.md.


Quick start

pip install -U ultracompress                                          # 0.5.2
hf download SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5 --local-dir ./mistral
uc verify ./mistral                          # CLI entry point
# OR — when `uc` isn't on PATH (Jupyter, CI, Docker, post-install hooks):
python -m ultracompress verify ./mistral     # equivalent fallback

uc verify reconstructs W_base = absmax × grid[codes] from the persisted k-means grid + per-block scales + bit-packed integer codes and confirms it is bit-identical to the dequantized weight the trainer used during distillation.

uc serve ./mistral exposes an OpenAI-compatible API at http://localhost:8080. See docs/CUSTOMER_ONBOARDING_FLOW_v3_2026_05_08.md for the full deploy walkthrough.


What's new in v0.5.2 (publishing today)

  • python -m ultracompress fallback support via new ultracompress/__main__.py — unblocks Jupyter, CI, Docker images, post-install hooks where the uc console script is missing from PATH.
  • SSM (Mamba / state-space-model) Linear naming added to TARGET_SUBS in pack.pyin_proj, x_proj, dt_proj, out_proj now packed alongside the standard transformer Linear set.
  • Single-file safetensors support in stream_compress — unblocks <2B-param models that ship without an index shard (TinyLlama, SmolLM2, Qwen3-0.6B, OLMo-2).
  • olmo / olmo2 model_type dispatch added to streaming_teacher and streaming_compression_runner.
  • Full notes: docs/RELEASE_NOTES_v0.5.2.md.

Architecture matrix

End-to-end validated at 5 bpw with bit-identical W_base reconstruction. Checkmark = uc verify PASS, pending = local PASS, HF upload in flight, retry = re-running on v0.5.2 after streaming-fix.

Architecture Params Layers bpw PPL ratio uc verify HF repo
Qwen3-1.7B 1.7B 28 5 1.0078 PASS SipsaLabs/qwen3-1.7b-uc-v3-bpw5
Mistral-7B-v0.3 7.2B 32 5 1.0100 PASS SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5
Llama-3.1-8B 8.0B 32 5 1.0125 PASS local (upload pending)
Qwen3-8B 8.0B 36 5 1.0044 PASS local (upload pending)
Qwen3-14B 14.0B 40 5 1.0040 PASS local (upload pending)
Mixtral-8x7B-v0.1 47B (MoE 8 exp) 32 5 (PPL 5.88) PASS local (upload pending)
Phi-3.5-MoE-instruct 42B (MoE 16 exp) 32 5 (PPL 6.95) PASS local (upload pending)
Llama-3.1-70B 70B 80 5 (PPL 6.02) PASS local (upload pending)
Qwen2.5-72B 72B 80 5 1.0162 PASS local (upload pending)
Mamba-2.8B (SSM) 2.8B 64 5 1.0119 PASS local (upload pending)
Hermes-3-Llama-3.1-405B 405B 126 5 (in flight) 53/126 layers (compressing GPU 0)
SmolLM2-1.7B 1.7B 24 5 (in flight) pending (added today, GPU 1)
TinyLlama-1.1B-Chat 1.1B 22 5 (in flight) pending (added today, GPU 1)
Qwen3-0.6B 0.6B 28 5 (in flight) pending (added today, GPU 1)
OLMo-2-0425-1B 1B 16 5 (in flight) retry on v0.5.2 (added today, queued)

PPL ratios listed against the model's own bf16 baseline on a 100-sample held-out slice. MoE rows lack baseline due to single-GPU OOM on bf16 baseline; multi-GPU baseline pipeline lands in v0.6.

UltraCompress is the first quantization library publicly compatible with both transformer and state-space architectures (Mamba), including the Linear naming required for emerging hybrids such as AI21 Jamba.


How v0.3 lossless works

uc pack v0.3 persists the trainer's k-means learned grid + per-block scales + bit-packed integer codes directly into the customer artifact. Reconstruction is W_base = absmax × grid[codes] and reproduces — bit-identically — the dequantized weight the trainer used during distillation.

This is the only mathematically lossless 5-bit transformer quantization format in production. AWQ / GPTQ / EXL3 / bitsandbytes-int4 introduce measurable PPL drift between training-time eval and customer-time inference. UltraCompress v0.3 customers see identical inference behavior to what the trainer measured.

Customer profile Why bit-exact reconstruction matters
Defense / aerospace Bit-exact deploy is a compliance requirement (audit trail).
Healthcare AI (FDA-regulated) Model equivalence required between dev and deploy.
Finance (SR 11-7 model validation) Reproducibility audit requires bit-exact recovery.
Frontier labs (internal artifact distribution) Red-team eval fidelity requires identical inference.
Single-GPU 70B+ deployment Streaming compression keeps peak VRAM ~one transformer layer.

Full vendor comparison: docs/COMPETITIVE_LANDSCAPE_v3_LOSSLESS_2026_05_08.md.


Streaming compression — single-GPU large-model headline

Per-layer streaming validated end-to-end across 8B → 72B with peak VRAM bounded by ~one transformer layer regardless of total depth.

Model Layers PPL ratio Peak VRAM
Qwen3-8B 36 1.0278 2.26 GB
Qwen3-14B 40 1.0111 3.37 GB
Qwen3-32B 64 1.0367 4.85 GB
Qwen2.5-72B 80 1.0162 8.98 GB

Recipe: GSQ scalar 5 bpw + per-block (B=64) absmax + V18-C rank-32 low-rank correction + 200-step KL distillation per layer. Process: lazy-load layer fp16 weights via safetensors → cache teacher hidden output → quantize → fit V18-C against cache → save → free → next layer. Compression time ~1 min/layer.

Reproduce on a 5090 (~9 min for 8B):

python scripts/overlay/streaming_compression_runner.py \
    --model qwen3-8b --bpw 5 --block_size 64 --rank 32 \
    --train_steps 200 --n_calib 100 --n_eval 50

Earlier research tracks

Three independent compression mechanisms compose multiplicatively:

  • Track A — streaming compression (above): single-GPU 72B at PPL 1.0162.
  • Track B — Fractal Residual Recursion (Claims 1–16): shared-block architectural compression at 311–734× on Qwen3-1.7B (HQ5 h256 reaches 70.0% T10). See docs/HQ5_RESULTS.md, REPRODUCE.md.
  • Track C — row-overlay sub-3-bpw quantization (Claims 17–20): beats bitsandbytes-nf4 at 30% fewer bits on a 6-model cohort (n=500 LAMBADA). Zero catastrophic failures across 48 measurements vs. HQQ's 6/6 at 2-bit g64. See RESULTS.md, docs/claim20_summary.txt.

Stacked, the projection for a 100T-parameter model on a single GPU is ~5 GB at 20,000× total compression — see docs/100T_MISSION_MATH_2026_05_03.md. Track A + B + C numbers are individually validated; full multiplicative stack is an architectural projection.


Repository layout

ultracompress/
├── ultracompress/              Core library (pack v0.3, FractalModel, pipeline, __main__)
├── scaling/                    Cross-model teacher loaders (Qwen3 / Llama / Mistral / Mamba / OLMo)
├── scripts/overlay/            Track A (row-overlay + streaming compression)
├── scripts/frr/                Track B (FRR architectural compression)
├── tools/                      Model download, quantization utilities
├── tests/                      Regression tests
├── results/                    Measurement JSONs (indexed by claim)
├── logs/                       Run logs
└── docs/                       Patents, dashboards, customer flow, competitive landscape

Index: RESULTS.md, PATENT_CLAIMS.md, REPRODUCE.md.


Patent disclosure

USPTO provisionals 64/049,511 and 64/049,517 filed 2026-04-25 covering the row-overlay quantization, FRR architectural compression, streaming-compression mechanism, and v0.3 lossless pack format.

License

  • Apache-2.0 for the CLI, verifier, and customer-facing pack format — see LICENSE.
  • Sipsa Labs Research Evaluation License v1.0 for compression internals (k-means trainer, V18-C overlay fit, FRR distillation pipeline) — see LICENSE_RESEARCH_EVAL.md.

Citation

@misc{ultracompress2026,
  title  = {UltraCompress: Mathematically Lossless 5-bit Transformer
            Compression Across 11+ Architectures},
  author = {Sipsa Labs},
  year   = {2026},
  url    = {https://github.com/sipsalabs/ultracompress}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ultracompress-0.5.3.tar.gz (311.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ultracompress-0.5.3-py3-none-any.whl (359.0 kB view details)

Uploaded Python 3

File details

Details for the file ultracompress-0.5.3.tar.gz.

File metadata

  • Download URL: ultracompress-0.5.3.tar.gz
  • Upload date:
  • Size: 311.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for ultracompress-0.5.3.tar.gz
Algorithm Hash digest
SHA256 4136747b15992dc2b9225cede8329cc14188e4bd1b1825206c429f4b2fac9bd6
MD5 297928ae10d4b1fbe7205dd7223eb4ec
BLAKE2b-256 73e58279ccb22858959c1eb4d3fd7877421fdf9dd56ab3c692bfd97386c7db61

See more details on using hashes here.

File details

Details for the file ultracompress-0.5.3-py3-none-any.whl.

File metadata

  • Download URL: ultracompress-0.5.3-py3-none-any.whl
  • Upload date:
  • Size: 359.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for ultracompress-0.5.3-py3-none-any.whl
Algorithm Hash digest
SHA256 d948d81aa9938d66e0ac7e77924116a95373406ff1923a72fea8f49dd3073ab7
MD5 a2489d191b266831a678e94033fbe36d
BLAKE2b-256 381b7f2cfbd78ff769f692fee55267f19e5aecbc6d89ff9982acdacbb92ab61b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page