
UltraCompress

First mathematically lossless 5-bit transformer compression library, validated end-to-end across 11+ architectures including state-space models.



Current state (2026-05-08)

  • 11 architectures validated end-to-end as of this morning — 10 transformer + the Mamba-2.8B SSM — with bit-identical W_base reconstruction at 5 bpw.
  • 4 more dense architectures added today on GPU 1: SmolLM2-1.7B, TinyLlama-1.1B-Chat, Qwen3-0.6B, OLMo-2-0425-1B (queued for retry after the streaming-fix patch).
  • Hermes-3-Llama-3.1-405B compression in flight on GPU 0: 53/126 layers complete, ETA tonight.
  • 2 public Hugging Face artifacts pass uc verify (qwen3-1.7b, mistral-7b-v0.3); 8 more pass locally with uploads in flight.
  • Multi-arch PPL ratios at 5 bpw (representative): Mistral-7B 1.0100, Llama-3.1-8B 1.0125, Mamba-2.8B 1.0119.
  • 10/10 local production packs pass uc verify (bit-identical W_base reconstruction).

Live verification status — every public artifact, file hash, and verifier exit code in one place: docs/PUBLIC_VERIFICATION_DASHBOARD_2026_05_08.md.


Quick start

pip install -U ultracompress                                          # 0.5.2
hf download SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5 --local-dir ./mistral
uc verify ./mistral                          # CLI entry point
# OR — when `uc` isn't on PATH (Jupyter, CI, Docker, post-install hooks):
python -m ultracompress verify ./mistral     # equivalent fallback

uc verify reconstructs W_base = absmax × grid[codes] from the persisted k-means grid + per-block scales + bit-packed integer codes and confirms it is bit-identical to the dequantized weight the trainer used during distillation.
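As a toy illustration of that reconstruction (not the library's internal code; every shape and value below is invented), dequantization is a table lookup followed by a per-block rescale:

```python
import numpy as np

# Illustrative W_base = absmax * grid[codes]; shapes and values are made up,
# not UltraCompress internals.
rng = np.random.default_rng(0)
grid = np.sort(rng.uniform(-1.0, 1.0, 32))    # 32-entry learned grid => 5-bit codes
codes = rng.integers(0, 32, size=(4, 64))     # 4 blocks of 64 weights each
absmax = rng.uniform(0.5, 2.0, size=(4, 1))   # one scale per block

w_base = absmax * grid[codes]                 # lookup, then per-block rescale

# Reconstruction is a pure function of the persisted artifacts, so repeating
# it yields a bit-identical tensor — this is the property `uc verify` checks.
assert np.array_equal(w_base, absmax * grid[codes])
```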

uc serve ./mistral exposes an OpenAI-compatible API at http://localhost:8080. See docs/CUSTOMER_ONBOARDING_FLOW_v3_2026_05_08.md for the full deploy walkthrough.


What's new in v0.5.2 (publishing today)

  • python -m ultracompress fallback support via the new ultracompress/__main__.py — unblocks Jupyter, CI, Docker images, and post-install hooks where the uc console script is missing from PATH.
  • SSM (Mamba / state-space-model) Linear naming added to TARGET_SUBS in pack.py — in_proj, x_proj, dt_proj, and out_proj are now packed alongside the standard transformer Linear set.
  • Single-file safetensors support in stream_compress — unblocks <2B-param models that ship without an index shard (TinyLlama, SmolLM2, Qwen3-0.6B, OLMo-2).
  • olmo / olmo2 model_type dispatch added to streaming_teacher and streaming_compression_runner.
  • Full notes: docs/RELEASE_NOTES_v0.5.2.md.
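The SSM naming change can be pictured with a minimal suffix-match sketch; the actual contents and matching logic of TARGET_SUBS in pack.py may differ:

```python
# Hypothetical sketch of suffix-based Linear selection; the real TARGET_SUBS
# set and its matching rule in pack.py may differ.
TARGET_SUBS = (
    "q_proj", "k_proj", "v_proj", "o_proj",      # attention projections
    "gate_proj", "up_proj", "down_proj",         # MLP projections
    "in_proj", "x_proj", "dt_proj", "out_proj",  # Mamba / SSM (new in v0.5.2)
)

def is_target(name: str) -> bool:
    """Pack this module iff its name ends with a known Linear suffix."""
    return any(name.endswith(sub) for sub in TARGET_SUBS)

assert is_target("backbone.layers.0.mixer.in_proj")   # Mamba-style naming
assert not is_target("model.embed_tokens")            # embeddings untouched
```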

Architecture matrix

End-to-end validated at 5 bpw with bit-identical W_base reconstruction. PASS = uc verify PASS; pending = local PASS with HF upload in flight; retry = re-running on v0.5.2 after the streaming fix.

| Architecture | Params | Layers | bpw | PPL ratio | uc verify | HF repo |
|---|---|---|---|---|---|---|
| Qwen3-1.7B | 1.7B | 28 | 5 | 1.0078 | PASS | SipsaLabs/qwen3-1.7b-uc-v3-bpw5 |
| Mistral-7B-v0.3 | 7.2B | 32 | 5 | 1.0100 | PASS | SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5 |
| Llama-3.1-8B | 8.0B | 32 | 5 | 1.0125 | PASS | local (upload pending) |
| Qwen3-8B | 8.0B | 36 | 5 | 1.0044 | PASS | local (upload pending) |
| Qwen3-14B | 14.0B | 40 | 5 | 1.0040 | PASS | local (upload pending) |
| Mixtral-8x7B-v0.1 | 47B (MoE, 8 exp) | 32 | 5 | (PPL 5.88) | PASS | local (upload pending) |
| Phi-3.5-MoE-instruct | 42B (MoE, 16 exp) | 32 | 5 | (PPL 6.95) | PASS | local (upload pending) |
| Llama-3.1-70B | 70B | 80 | 5 | (PPL 6.02) | PASS | local (upload pending) |
| Qwen2.5-72B | 72B | 80 | 5 | 1.0162 | PASS | local (upload pending) |
| Mamba-2.8B (SSM) | 2.8B | 64 | 5 | 1.0119 | PASS | local (upload pending) |
| Hermes-3-Llama-3.1-405B | 405B | 126 | 5 | (in flight) | 53/126 layers (compressing, GPU 0) | — |
| SmolLM2-1.7B | 1.7B | 24 | 5 | (in flight) | pending (added today, GPU 1) | — |
| TinyLlama-1.1B-Chat | 1.1B | 22 | 5 | (in flight) | pending (added today, GPU 1) | — |
| Qwen3-0.6B | 0.6B | 28 | 5 | (in flight) | pending (added today, GPU 1) | — |
| OLMo-2-0425-1B | 1B | 16 | 5 | (in flight) | retry on v0.5.2 (added today, queued) | — |

PPL ratios are measured against the model's own bf16 baseline on a 100-sample held-out slice. Rows showing a raw PPL instead of a ratio lack that baseline because bf16 inference OOMs on a single GPU; a multi-GPU baseline pipeline lands in v0.6.
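Concretely, the ratio is just compressed-model perplexity over baseline perplexity on the same slice. With made-up mean negative log-likelihoods:

```python
import math

# Hypothetical mean NLLs on the same held-out slice; both numbers are
# invented for illustration only.
baseline_nll = 1.7719     # bf16 baseline
compressed_nll = 1.7818   # 5 bpw compressed model

# Perplexity is exp(mean NLL), so the ratio collapses to exp of the NLL gap.
ratio = math.exp(compressed_nll) / math.exp(baseline_nll)
assert ratio < 1.01       # sub-1% PPL degradation target
```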

UltraCompress is the first quantization library publicly compatible with both transformer and state-space architectures (Mamba), including the Linear naming required for emerging hybrids such as AI21 Jamba.


How v0.3 lossless works

uc pack v0.3 persists the trainer's k-means learned grid + per-block scales + bit-packed integer codes directly into the customer artifact. Reconstruction is W_base = absmax × grid[codes] and reproduces — bit-identically — the dequantized weight the trainer used during distillation.

This is the only mathematically lossless 5-bit transformer quantization format in production. AWQ / GPTQ / EXL3 / bitsandbytes-int4 introduce measurable PPL drift between training-time eval and customer-time inference. UltraCompress v0.3 customers see identical inference behavior to what the trainer measured.

| Customer profile | Why bit-exact reconstruction matters |
|---|---|
| Defense / aerospace | Bit-exact deploy is a compliance requirement (audit trail). |
| Healthcare AI (FDA-regulated) | Model equivalence required between dev and deploy. |
| Finance (SR 11-7 model validation) | Reproducibility audit requires bit-exact recovery. |
| Frontier labs (internal artifact distribution) | Red-team eval fidelity requires identical inference. |
| Single-GPU 70B+ deployment | Streaming compression keeps peak VRAM at ~one transformer layer. |

Full vendor comparison: docs/COMPETITIVE_LANDSCAPE_v3_LOSSLESS_2026_05_08.md.


Streaming compression — single-GPU large-model headline

Per-layer streaming validated end-to-end across 8B → 72B with peak VRAM bounded by ~one transformer layer regardless of total depth.

| Model | Layers | PPL ratio | Peak VRAM |
|---|---|---|---|
| Qwen3-8B | 36 | 1.0278 | 2.26 GB |
| Qwen3-14B | 40 | 1.0111 | 3.37 GB |
| Qwen3-32B | 64 | 1.0367 | 4.85 GB |
| Qwen2.5-72B | 80 | 1.0162 | 8.98 GB |

Recipe: GSQ scalar 5 bpw + per-block (B=64) absmax + V18-C rank-32 low-rank correction + 200-step KL distillation per layer. Process: lazy-load layer fp16 weights via safetensors → cache teacher hidden output → quantize → fit V18-C against cache → save → free → next layer. Compression time ~1 min/layer.
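The per-layer process above can be sketched as a loop; every name here is a hypothetical stand-in, not the library's API:

```python
# Hypothetical per-layer streaming loop: each layer is loaded, compressed
# against the cached teacher output, saved, and freed before the next one,
# which is why peak VRAM stays at ~one layer regardless of model depth.
def stream_compress(layer_names, load_layer, teacher_cache, compress_layer, save, free):
    for name in layer_names:
        weights = load_layer(name)                     # lazy-load fp16 via safetensors
        teacher_out = teacher_cache[name]              # cached teacher hidden output
        packed = compress_layer(weights, teacher_out)  # quantize + low-rank fit
        save(name, packed)                             # persist the pack artifacts
        free(weights)                                  # release VRAM before next layer

# Tiny dry run with stub callables to show the control flow.
saved = []
stream_compress(
    ["layers.0", "layers.1"],
    load_layer=lambda n: f"W[{n}]",
    teacher_cache={"layers.0": "h0", "layers.1": "h1"},
    compress_layer=lambda w, h: (w, h),
    save=lambda n, p: saved.append(n),
    free=lambda w: None,
)
assert saved == ["layers.0", "layers.1"]
```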

Reproduce on a 5090 (~9 min for 8B):

python scripts/overlay/streaming_compression_runner.py \
    --model qwen3-8b --bpw 5 --block_size 64 --rank 32 \
    --train_steps 200 --n_calib 100 --n_eval 50

Earlier research tracks

Three independent compression mechanisms compose multiplicatively:

  • Track A — streaming compression (above): single-GPU 72B at PPL 1.0162.
  • Track B — Fractal Residual Recursion (Claims 1–16): shared-block architectural compression at 311–734× on Qwen3-1.7B (HQ5 h256 reaches 70.0% T10). See docs/HQ5_RESULTS.md, REPRODUCE.md.
  • Track C — row-overlay sub-3-bpw quantization (Claims 17–20): beats bitsandbytes-nf4 at 30% fewer bits on a 6-model cohort (n=500 LAMBADA). Zero catastrophic failures across 48 measurements vs. HQQ's 6/6 at 2-bit g64. See RESULTS.md, docs/claim20_summary.txt.

Stacked, the projection for a 100T-parameter model on a single GPU is ~5 GB at 20,000× total compression — see docs/100T_MISSION_MATH_2026_05_03.md. Track A + B + C numbers are individually validated; full multiplicative stack is an architectural projection.


Repository layout

ultracompress/
├── ultracompress/              Core library (pack v0.3, FractalModel, pipeline, __main__)
├── scaling/                    Cross-model teacher loaders (Qwen3 / Llama / Mistral / Mamba / OLMo)
├── scripts/overlay/            Track A (row-overlay + streaming compression)
├── scripts/frr/                Track B (FRR architectural compression)
├── tools/                      Model download, quantization utilities
├── tests/                      Regression tests
├── results/                    Measurement JSONs (indexed by claim)
├── logs/                       Run logs
└── docs/                       Patents, dashboards, customer flow, competitive landscape

Index: RESULTS.md, PATENT_CLAIMS.md, REPRODUCE.md.


Patent disclosure

USPTO provisionals 64/049,511 and 64/049,517 filed 2026-04-25 covering the row-overlay quantization, FRR architectural compression, streaming-compression mechanism, and v0.3 lossless pack format.

License

  • Apache-2.0 for the CLI, verifier, and customer-facing pack format — see LICENSE.
  • Sipsa Labs Research Evaluation License v1.0 for compression internals (k-means trainer, V18-C overlay fit, FRR distillation pipeline) — see LICENSE_RESEARCH_EVAL.md.

Citation

@misc{ultracompress2026,
  title  = {UltraCompress: Mathematically Lossless 5-bit Transformer
            Compression Across 11+ Architectures},
  author = {Sipsa Labs},
  year   = {2026},
  url    = {https://github.com/sipsalabs/ultracompress}
}
