
Near-lossless 5-bit transformer compression (~1% perplexity vs the bf16 reference; lossy), with reproducible, cryptographically verifiable reconstruction — 22 PPL-verified architectures (17 dense + 4 MoE + 1 SSM) plus 1 ViT cosine-verified — across 4 architecture classes (0.6B–405B). Public CLI does pack structure + download-integrity checks; the codec is patent-pending and not distributed. BUSL-1.1 + Additional Use Grant.


UltraCompress

Near-lossless 5-bit transformer compression (~1% perplexity vs the bf16 reference — it is lossy). Published model artifacts ship with reproducible, cryptographically verifiable reconstruction: SHA-256-pinned packs an auditor can deterministically decode back to the validated artifact.

PyPI · License · Python 3.10+ · Patent

v0.6.25: the public package is intentionally minimal. It is a small, dependency-free CLI that lets you (a) generate text against a Sipsa-hosted compressed model in 30 seconds (uc try), (b) browse the full catalog with tiers and PPL ratios (uc catalog), and (c) verify pack structure and download integrity (uc verify) on any pack you download from HuggingFace. It contains no compression or reconstruction code: that methodology is patent-pending and is not distributed. Cryptographically verifiable reconstruction of a pack (a deterministic decode to its SHA-256-pinned validated artifact) is performed by Sipsa Labs under engagement.

Hermes-3-Llama-3.1-405B compressed at 5 bpw, near-lossless: 1.0066× PPL ratio vs streaming bf16 teacher (5.0692 / 5.0358, n=50, seq_len=1024, FineWeb-edu held-out tail, seed=42). A 405B-class transformer compressed end-to-end on a single 32 GB consumer GPU.

UltraCompress takes a transformer at fp16/bf16 and produces a 5-bit pack that is a near-lossless, lossy approximation of the original — on the order of 1% perplexity drift — but with a property other quantizers don't offer: reproducible, cryptographically verifiable reconstruction. The decode is deterministic and every pack is SHA-256-pinned, so an auditor can re-derive the exact validated artifact from the pack, byte-for-byte, and confirm they hold the same weights Sipsa Labs evaluated. That is a reproducible, auditable reconstruction of the validated artifact — not bit-identity with the original bf16 (the 5-bit pack is lossy). Sipsa Labs runs that verification with you under engagement. The codec is patent-pending.

It exists because the bf16-equivalent quality bar matters in places where "good enough on MMLU" isn't enough — defense, FDA-regulated healthcare, SR 11-7 model validation, internal red-team eval at frontier labs. And as a side-effect of the streaming compression path, it lets us put a 405B-parameter model through a single 32 GB consumer GPU without renting an H100 cluster.

We're a small lab shipping this in public while the patents are pending. Most days the lab notebook gets longer than the marketing site does.


Regulated AI deployment? Phase 0 POC is $5K / 5 business days / customer-picked model — full details in Who this is for below. Direct: founder@sipsalabs.com. Verticals: healthcare · defense · legal · quant.


Quick start (30 seconds, no GPU, no signup)

pip install ultracompress
uc try sipsa-qwen3-0.6b

That prints a recorded reference response from our 5-bit-compressed Qwen3-0.6B pack plus the compression numbers, and points you at the next step. With a free key from sipsalabs.com/get-access (60-second signup), the same command goes live against api.sipsalabs.com and streams real output from whichever compressed model you pick.

uc catalog

Lists the 22 PPL-verified architectures (17 dense + 4 MoE + 1 SSM with comparator-note caveat) plus 1 ViT cosine-verified (DINOv2-Large) — across 4 architecture classes — with their published PPL ratios and tier (free / request / POC).

The public CLI (what pip install gives you)

uc try [model]         generate text against a Sipsa-hosted compressed model
uc catalog             list the full compressed-model catalog + tiers
uc verify <pack_dir>   pack structure + download-integrity self-check
uc info                what this package is + links/contact
uc version             print version

uc try calls api.sipsalabs.com/v1/chat/completions when you pass --key sk-sps-... or set $SIPSA_API_KEY; without a key, it prints a recorded reference response so you see what compressed output looks like without signup.

uc verify confirms a downloaded pack is well-formed (manifest present and parseable, declared layer count matches the files on disk, no zero-byte layers) and prints a stable SHA-256 pack fingerprint so you can confirm you hold a byte-identical download, or compare against a fingerprint we publish out of band. It does not reconstruct weights and contains no codec knowledge by design.

hf download SipsaLabs/qwen3-1.7b-base-uc-v3-bpw5 --local-dir ./pack
uc verify ./pack
bpw:             5
layer files:     28
SHA-256 (spot-check; use --full for all):
  manifest.json:f3a1…
  layer_000.uc:7c2b…
  layer_014.uc:9d4f…
  layer_027.uc:1ab8…
pack fingerprint (sha256 of sorted file digests):
  4e9c… (64 hex)

→ STRUCTURE OK — download integrity verified; pack is well-formed
  and the fingerprint above is the per-file SHA-256 reference.
  End-to-end cryptographically verifiable reconstruction (deterministic
  decode to the validated artifact + PPL re-eval) is delivered via `uc audit`
  under engagement (founder@sipsalabs.com); see
  docs/reference/audit-receipt-schema.md for the audit-receipt schema.

Full cryptographically verifiable reconstruction (a deterministic decode to the validated artifact, plus PPL re-evaluation against the bf16 baseline) is an auditor-grade deliverable Sipsa Labs runs with you under engagement — it is deliberately not shipped in the public package.
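The structure and integrity checks that uc verify is described as performing can be approximated in a few lines of standard-library Python. This is a minimal sketch, not the CLI's actual implementation: the manifest key `n_layers`, the `layer_*.uc` file pattern, and the exact bytes fed into the fingerprint hash (newline-joined sorted hex digests) are assumptions made for illustration, so a fingerprint computed this way is not guaranteed to match the one uc verify prints.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_pack(pack_dir: str, layer_count_key: str = "n_layers") -> dict:
    """Run structure checks and compute a fingerprint for a pack directory.

    ASSUMPTIONS (not confirmed by the project): the manifest stores its
    layer count under `layer_count_key`, layer files match `layer_*.uc`,
    and the fingerprint hashes the newline-joined, sorted hex digests.
    """
    root = Path(pack_dir)
    manifest_path = root / "manifest.json"
    if not manifest_path.is_file():
        raise FileNotFoundError("manifest.json is missing")
    # json.loads raises if the manifest is not parseable.
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

    layers = sorted(root.glob("layer_*.uc"))
    declared = manifest.get(layer_count_key)
    if declared != len(layers):
        raise ValueError(f"manifest declares {declared} layers, found {len(layers)}")

    empty = [p.name for p in layers if p.stat().st_size == 0]
    if empty:
        raise ValueError(f"zero-byte layer files: {empty}")

    # Per-file digests, then one digest over the sorted digests.
    digests = {p.name: sha256_file(p) for p in [manifest_path, *layers]}
    joined = "\n".join(sorted(digests.values())).encode("ascii")
    return {
        "layer_files": len(layers),
        "digests": digests,
        "fingerprint": hashlib.sha256(joined).hexdigest(),
    }
```

Calling `check_pack("./pack")` on a downloaded directory returns the per-file digests and a 64-hex fingerprint, or raises on the first structural problem. Like the public CLI, it only reads bytes and metadata; it does not decode or reconstruct any weights.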


What's verified (with JSON receipts)

22 architectures are independently PPL-verified end-to-end (0.6B → 405B; 17 dense + 4 MoE + 1 SSM) against each model's own bf16 baseline on the FineWeb-edu held-out tail at seq_len=1024, seed=42. One Vision Transformer is cosine-verified in addition (DINOv2-Large, 304M, ViT-L/14), bringing the catalog to 23 architectures across 4 classes.

21 of the PPL-verified rows are transformers (17 dense + 4 MoE). The 22nd is Mamba-2.8B (state-space model) at 1.00593× canonical PPL, with an explicit comparator-note caveat in the registry: our canonical transformer pipeline (RoPE / attention masks / KV-cache semantics) is architecture-incompatible with SSMs, so the Mamba record uses an SSM-compatible comparator that matches what's in the HF pack. The 23rd (DINOv2-Large) uses CLS-token cosine similarity instead of PPL, because an encoder-only ViT has no autoregressive likelihood.

TinyLlama-1.1B (1.003×) and Llama-3.1-70B (1.009×) graduated to PPL-verified in v0.6.23. DeepSeek-32B and other queued packs remain SHA-256-verified pending canonical re-eval before formal registry entry. Every published number traces to a published result JSON. A small set of packs is publicly downloadable; the full catalog is available to customers under engagement.

| Model | Params | Class | PPL ratio | HF artifact | Status |
|---|---|---|---|---|---|
| Hermes-3-Llama-3.1-405B | 405B | 405B-class near-lossless on a single 32 GB consumer GPU | 1.0066 | SipsaLabs/hermes-3-llama-3.1-405b-uc-v3-bpw5 | live |
| Mistral-7B-v0.3 | 7.2B | sub-0.6% drift | 1.00548 | SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5 | live |
| Qwen3-1.7B-Base | 1.7B | sub-0.5% drift | 1.00401 | SipsaLabs/qwen3-1.7b-base-uc-v3-bpw5 | live |
| Qwen3-14B | 14.0B | sub-0.5% drift | 1.00403 | SipsaLabs/qwen3-14b-uc-v3-bpw5 | live |
| Qwen3-8B | 8.0B | sub-0.5% drift | 1.00440 | SipsaLabs/qwen3-8b-uc-v3-bpw5 | live |
| Mixtral-8x7B-v0.1 (MoE) | 47B (13B active) | sub-0.5% drift | 1.00368 | SipsaLabs/mixtral-8x7b-v0.1-uc-v3-bpw5 | live |
| Phi-3-mini-4k-instruct | 3.8B | sub-0.3% drift (seq_len=128, not apples-to-apples) | 1.00262 | SipsaLabs/phi-3-mini-4k-instruct-uc-v3-bpw5 | live |

Hermes-3-405B is the headline. The 1.0066× ratio is 5.0692 / 5.0358, with both halves measured under the same per-layer streaming reconstruction comparator (n=50, seq_len=1024, FineWeb-edu held-out tail, seed=42). The bf16 teacher run took 7.7 hours on cuda:1; the 5-bpw pack run took 14.3 hours. The Mistral-7B 1.00548× row is the tightest dense 7B-class near-lossless 5-bit ratio we currently publish.
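For readers who want to check the arithmetic, the headline ratio and the implied drift follow directly from the two perplexities quoted above. The sketch below also includes the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood; the function names are illustrative, and the project's own evaluation harness is not published, so this shows only the generic calculation.

```python
import math


def perplexity(token_nlls):
    """Standard perplexity: exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ppl_ratio(compressed_ppl, baseline_ppl):
    """Ratio of compressed-model PPL to baseline PPL (1.0 means no drift)."""
    return compressed_ppl / baseline_ppl


# Figures quoted for Hermes-3-Llama-3.1-405B: 5-bpw pack vs bf16 teacher.
ratio = ppl_ratio(5.0692, 5.0358)
drift_pct = (ratio - 1.0) * 100.0
print(f"{ratio:.4f}x ratio, {drift_pct:.2f}% drift")
# prints "1.0066x ratio, 0.66% drift"
```

A ratio is only meaningful when both perplexities come from the same tokens, sequence length, and comparator, which is why the evaluation settings are listed alongside every published number.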

  • SSM result: Mamba-2.8B compressed with SHA-256-verified reconstruction (download integrity + deterministic decode to the validated artifact) at a 1.00593× canonical ratio; to our knowledge, the first public near-lossless 5-bit canonical-PPL result on a state-space model. Counted as the 22nd PPL-verified architecture with an explicit comparator-note caveat in the registry: our canonical transformer pipeline (RoPE / attention masks / KV-cache semantics that don't apply to SSMs) is architecture-incompatible, so the Mamba record uses an SSM-compatible comparator that matches what's in the HF pack. The comparator is documented in the registry.
  • HuggingFace: a small public verification set under huggingface.co/SipsaLabs; full catalog under engagement.
  • PyPI: pypi.org/project/ultracompress.
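The DINOv2-Large entry is verified with CLS-token cosine similarity rather than PPL. As a reminder of what that metric measures, here is a minimal pure-Python sketch of cosine similarity between two embedding vectors. The vectors are toy values invented for illustration, and how the project extracts and aggregates CLS embeddings is not described in this document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical CLS embeddings: reference model vs. compressed model.
reference_cls = [0.12, -0.40, 0.88, 0.05]
compressed_cls = [0.11, -0.41, 0.87, 0.06]
print(f"cosine similarity: {cosine_similarity(reference_cls, compressed_cls):.5f}")
```

Values close to 1.0 mean the compressed encoder's output points in nearly the same direction as the reference output for that input.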

What doesn't work yet

Things people sometimes assume work because the rest of it does. They don't, and we'd rather you know:

  • Long-context evaluation past seq_len=1024. Every PPL number above is at seq_len=1024 on the FineWeb-edu held-out tail. We have not yet run controlled evals at 4K/8K/32K context.
  • State-space models past the current SSM pack. Mamba-2.8B ships SHA-256-verified with a canonical PPL ratio of 1.00593× (comparator-note caveat documented in the registry; our canonical transformer pipeline is architecture-incompatible with SSMs). We tried two tighter paths on top; both made it worse.
  • Qwen3-32B PPL ratio. The baseline PPL number we have is stale or suspect, so we won't republish it. An apples-to-apples re-eval is queued.
  • Below 1.0040× on Qwen3-1.7B-Base. This is our tightest dense floor; we tried 5 different paths to break it. Three were within noise; two were catastrophic regressions. 1.0040× stands as the empirical floor at the current configuration.

Why this isn't AWQ / GPTQ / EXL3

Every other 4–5 bit compression library targets a quality threshold ("sub-1% PPL on WikiText"). UltraCompress hits a competitive PPL ratio too (~1% drift; it is lossy) but adds a reproducibility contract: the published artifact comes with reproducible, cryptographically verifiable reconstruction — a deterministic decode to its SHA-256-pinned validated artifact. Codec internals are patent-pending and deliberately not described here.

This matters when "the model picks a slightly-wrong variable name" is a regulatory finding rather than a cosmetic complaint. In defense and aerospace, reproducible, verifiable deployment is a compliance requirement. FDA-regulated healthcare AI requires model equivalence between dev and deploy. SR 11-7 (Federal Reserve model validation) requires reproducible audit recovery.

For pure-throughput inference on a fixed prompt distribution that matches your AWQ calibration set, with no downstream fine-tuning, AWQ at 4 bpw on vLLM is genuinely fine and we'll say so on a sales call.

As of mid-2026 we are not aware of another published library targeting a reproducible, cryptographically verifiable reconstruction contract — a deterministic, SHA-256-auditable decode on top of a competitive PPL ratio — for 5-bit transformer compression on the public HuggingFace Hub. If you find one, tell us — we'd rather benchmark against it than claim a gap that isn't there.


Honest negative results

Most projects hide their failures. We catalogue them at the same level of detail as the wins.

  • An initialization shortcut we tried — made PPL 0.07 pp WORSE on Mamba and was discarded. Method specifics withheld (patent-pending).
  • A multi-pass variant we hypothesized would help — produced a catastrophic 13.7× regression vs. the single-pass baseline. CLOSED.
  • Importing an AWQ-style pre-scaling step — produced a catastrophic +13% regression and was ruled out. CLOSED.
  • Pushing the training schedule past the current configuration — gained nothing (within noise). The floor stands.
  • "Base models compress tighter than instruct" hypothesis — refuted 2/3 of architectures. Dropped.

Detailed methodology for any specific failure is available to design partners under NDA.


Who this is for

  • If you serve LLMs in production and your VRAM bill is the constraint, this might help. It scales to a 405B-class model on a single 32 GB consumer GPU (the how is patent-pending). Email founder@sipsalabs.com with your stack and a target latency/quality bar; we'll tell you honestly whether UC fits.
  • If you're in a regulated domain (defense, FDA-regulated healthcare, SR 11-7 model validation, frontier lab red-team), the reproducible, cryptographically verifiable reconstruction contract is the reason to talk to us. Phase 0 POC ($5K, 5 business days, customer-picked model) gets you a pack plus a Sipsa-run reconstruction + PPL audit you can review. Email founder@sipsalabs.com.

If your workload is "MMLU has to stay above X" and you're not pushing the model into long-tail or downstream-fine-tuning territory, AWQ at 4 bpw is probably a better answer than this. We'll say so.


We're a small company looking for design partners

Sipsa Labs is a small lab shipping in public. Our compression methods are patent-pending; details are in PATENT_NOTICE.md. The CLI source is BUSL-1.1 with an Additional Use Grant — free for companies under $1M ARR, research, and individuals, auto-converting to Apache 2.0 four years post-release. If you're building a derivative product whose core value depends on the underlying invention, email founder@sipsalabs.com.

  • Paid Phase 0 POC: founder@sipsalabs.com, $5K / 5 business days / customer-picked model. Deliverable: a pack plus a Sipsa-run reconstruction + PPL audit on your eval set.
  • GitHub Sponsors: github.com/sponsors/sipsalabs.
  • Press / commentary: press@sipsalabs.com.

License

  • Released under BUSL-1.1 with an Additional Use Grant (free for companies under $1M ARR, research, and individuals; auto-converts to Apache 2.0 four years post-release). See LICENSE.
  • The license grant does not extend to the patent-pending compression methodology that produces the artifacts. See PATENT_NOTICE.md.
  • Pre-compressed model artifacts on HuggingFace carry the upstream teacher model's license plus this project's patent terms.

Citation

@software{sipsa_ultracompress_2026,
  author = {{Sipsa Labs, Inc.}},
  title  = {UltraCompress: Near-Lossless 5-bit Transformer Compression},
  year   = {2026},
  url    = {https://github.com/sipsalabs/ultracompress}
}
