Lossless 5-bit transformer compression — 14 architectures independently PPL-verified end-to-end (0.6B–405B; dense, MoE, SSM). Public CLI does pack structure + download-integrity checks; the codec is patent-pending and not distributed. BUSL-1.1 + Additional Use Grant.

UltraCompress

Lossless 5-bit transformer compression. Published model artifacts are bit-identical to their bf16 reference.

Requires Python 3.10+.

v0.6.15: the public package is intentionally minimal — a small, dependency-free CLI that checks pack structure and download integrity and prints project info. It contains no compression or reconstruction code: that methodology is patent-pending and is not distributed. Bit-identical reconstruction verification of a pack is performed by Sipsa Labs under engagement. The compressed model artifacts themselves remain public on HuggingFace.

Hermes-3-Llama-3.1-405B compressed at 5 bpw lossless: 1.0066x PPL ratio vs streaming bf16 teacher (5.0692 / 5.0358, n=50, seq_len=1024, FineWeb-edu held-out tail, seed=42). A 405B-class transformer compressed end-to-end on a single 32 GB consumer GPU.

UltraCompress takes a transformer at fp16/bf16 and produces a 5-bit pack that reconstructs bit-identically to the reference bf16 checkpoint — not "1% PPL drift on WikiText," but a deterministic reconstruction. That is the honest definition of lossless we care about: an auditor can re-derive every weight from the pack, and Sipsa Labs verifies that bit-identity under engagement. The codec is patent-pending.

It exists because the bf16-equivalent quality bar matters in places where "good enough on MMLU" isn't enough — defense, FDA-regulated healthcare, SR 11-7 model validation, internal red-team eval at frontier labs. And as a side-effect of the streaming compression path, it lets us put a 405B-parameter model through a single 32 GB consumer GPU without renting an H100 cluster.
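For a sense of scale, the storage arithmetic behind the single-GPU claim can be sketched as follows. This is a back-of-envelope estimate, not a measurement: it assumes a nominal 405e9 parameters, 16 bits per bf16 weight, exactly 5 bits per packed weight, and no format overhead (the pack's real on-disk size is not stated here). The helper name `weight_gb` is illustrative only.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in decimal gigabytes, ignoring any format overhead."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 405e9    # nominal parameter count (assumption, not an exact figure)
GPU_VRAM_GB = 32    # the consumer GPU size quoted in the text

bf16_gb = weight_gb(N_PARAMS, 16)   # reference checkpoint at 16 bits per weight
pack_gb = weight_gb(N_PARAMS, 5)    # idealized pack at 5 bits per weight

print(f"bf16 weights:  {bf16_gb:.1f} GB")
print(f"5 bpw weights: {pack_gb:.1f} GB")
print(f"5 bpw / VRAM:  {pack_gb / GPU_VRAM_GB:.1f}x")
```

Under these assumptions neither form of the weights fits in 32 GB at once, which is why a streaming path is needed in the first place.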

We're a small lab shipping this in public while the patents are pending. Most days the lab notebook gets longer than the marketing site does.


The public CLI (what pip install gives you)

pip install ultracompress
uc info

uc verify <pack_dir>   # pack structure + download-integrity self-check
uc info                # what this package is + links/contact
uc version             # print version

uc verify confirms a downloaded pack is well-formed (manifest present and parseable, declared layer count matches the files on disk, no zero-byte layers) and prints a stable SHA-256 pack fingerprint so you can confirm you hold a byte-identical download, or compare against a fingerprint we publish out of band. It does not reconstruct weights and contains no codec knowledge by design.

hf download SipsaLabs/qwen3-1.7b-base-uc-v3-bpw5 --local-dir ./pack
uc verify ./pack
uc_pack_version: 3
bpw:             5
layer files:     28
SHA-256 (spot-check; use --full for all):
  manifest.json:f3a1…
  layer_000.uc:7c2b…
  layer_014.uc:9d4f…
  layer_027.uc:1ab8…
pack fingerprint (sha256 of sorted file digests):
  4e9c… (64 hex)

→ STRUCTURE OK — pack is well-formed; fingerprint above is the
  download-integrity reference. This is NOT a reconstruction proof;
  bit-identical reconstruction verification is provided by Sipsa
  Labs under engagement (founder@sipsalabs.com).

Full bit-identical reconstruction verification (and PPL re-evaluation against the bf16 baseline) is an auditor-grade deliverable Sipsa Labs runs with you under engagement — it is deliberately not shipped in the public package.
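The structural checks that uc verify is described as performing can be approximated in a few lines of standard-library Python. This is a minimal sketch, not the CLI's implementation: the manifest key `n_layers` is a hypothetical name, the `layer_*.uc` pattern is inferred from the sample output above, and the way digests are joined before the final hash is an assumption, so the fingerprint it returns should not be expected to match the one uc verify prints.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large layer files are never fully in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_pack(pack_dir: str, layer_count_key: str = "n_layers") -> str:
    """Run structure checks on a pack directory and return a fingerprint.

    `layer_count_key` is a hypothetical manifest field name (assumption).
    Raises ValueError on any structural problem.
    """
    root = Path(pack_dir)
    manifest_path = root / "manifest.json"
    if not manifest_path.is_file():
        raise ValueError("manifest.json is missing")

    # json.JSONDecodeError subclasses ValueError, so malformed JSON fails here.
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    if not isinstance(manifest, dict):
        raise ValueError("manifest.json is not a JSON object")

    layers = sorted(root.glob("layer_*.uc"))
    declared = manifest.get(layer_count_key)
    if declared != len(layers):
        raise ValueError(f"manifest declares {declared} layers, found {len(layers)}")

    empty = [p.name for p in layers if p.stat().st_size == 0]
    if empty:
        raise ValueError(f"zero-byte layer files: {empty}")

    # Fingerprint: SHA-256 over the sorted per-file digests (joining is an assumption).
    digests = sorted(sha256_file(p) for p in [manifest_path, *layers])
    return hashlib.sha256("\n".join(digests).encode("ascii")).hexdigest()
```

Calling `check_pack("./pack")` twice on an unchanged download returns the same 64-character hex string. As with the CLI, a passing result only says the files are well-formed and unchanged. It says nothing about reconstruction.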


What's verified (with JSON receipts)

14 architectures independently PPL-verified end-to-end (0.6B → 405B, dense + MoE + state-space) against each model's own bf16 baseline on the FineWeb-edu held-out tail at seq_len=1024, seed=42. Every published number traces to a published result JSON. A small set of packs is publicly downloadable; the full catalog is available to customers under engagement.

| Model | Params | Class | PPL ratio | HF artifact | Status |
|---|---|---|---|---|---|
| Hermes-3-Llama-3.1-405B | 405B | 405B-class lossless on a single 32 GB consumer GPU | 1.0066 | SipsaLabs/hermes-3-llama-3.1-405b-uc-v3-bpw5 | live |
| Mistral-7B-v0.3 | 7.2B | sub-0.6% drift | 1.00548 | SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5 | live |
| Qwen3-1.7B-Base | 1.7B | sub-0.5% drift | 1.00401 | SipsaLabs/qwen3-1.7b-base-uc-v3-bpw5 | live |
| Qwen3-14B | 14.0B | sub-0.5% drift | 1.00403 | SipsaLabs/qwen3-14b-uc-v3-bpw5 | live |
| Qwen3-8B | 8.0B | sub-0.5% drift | 1.00440 | SipsaLabs/qwen3-8b-uc-v3-bpw5 | live |
| Mixtral-8x7B-v0.1 (MoE) | 47B (13B active) | sub-0.5% drift | 1.00368 | SipsaLabs/mixtral-8x7b-v0.1-uc-v3-bpw5 | live |
| Phi-3-mini-4k-instruct | 3.8B | sub-0.3% drift (seq_len=128, not apples-to-apples) | 1.00262 | SipsaLabs/phi-3-mini-4k-instruct-uc-v3-bpw5 | live |

Hermes-3-405B is the headline. The 1.0066x ratio is 5.0692 / 5.0358 — both halves measured under the same per-layer streaming reconstruction comparator (n=50, seq_len=1024, FineWeb-edu held-out tail, seed=42). The bf16 teacher took 7.7 hours on cuda:1; the 5-bpw pack took 14.3 hours. The Mistral-7B 1.00548× row is the tightest dense 7B-class lossless 5-bit ratio we currently publish.
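The relationship between the PPL ratios and the "drift" labels in the table is plain arithmetic, sketched below. The function names are illustrative, and the input numbers are the ones published above. Nothing here reproduces the evaluation itself.

```python
def ppl_ratio(pack_ppl: float, baseline_ppl: float) -> float:
    """Perplexity of the pack divided by perplexity of the bf16 baseline."""
    return pack_ppl / baseline_ppl


def drift_percent(ratio: float) -> float:
    """Express a PPL ratio as percent drift relative to the baseline."""
    return (ratio - 1.0) * 100.0


# Headline row: both perplexities are quoted in the text.
hermes = ppl_ratio(5.0692, 5.0358)
print(f"Hermes-3-Llama-3.1-405B: ratio {hermes:.4f}, drift {drift_percent(hermes):.2f}%")

# Remaining rows: only the ratios are published, so drift is derived from them.
published = {
    "Mistral-7B-v0.3": 1.00548,
    "Qwen3-1.7B-Base": 1.00401,
    "Qwen3-14B": 1.00403,
    "Qwen3-8B": 1.00440,
    "Mixtral-8x7B-v0.1": 1.00368,
    "Phi-3-mini-4k-instruct": 1.00262,
}
for name, ratio in published.items():
    print(f"{name}: drift {drift_percent(ratio):.3f}%")
```

Read this way, "sub-0.5% drift" means a ratio below 1.005, and the Mistral-7B row at 1.00548 falls in the "sub-0.6%" bucket.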


What doesn't work yet

Things people sometimes assume work because the rest of it does. They don't, and we'd rather you know:

  • Long-context evaluation past seq_len=1024. Every PPL number above is at seq_len=1024 on the FineWeb-edu held-out tail. We have not yet run controlled evals at 4K/8K/32K context.
  • State-space models past the current SSM result. Mamba-2.8B at 1.0119 is the SSM number, full stop. We tried two tighter paths on top — both made it worse.
  • TinyLlama-1.1B-Chat PPL eval. The pack itself is well-formed and the HF artifact is uploaded, but the PPL eval forward pass throws a CUDA device-side assert that we haven't traced yet. Shown as deferred, not a fabricated number.
  • Qwen3-32B and Llama-3.1-70B PPL ratios. Both have stale or suspect baseline PPL numbers we won't republish. Apples-to-apples re-evals are queued.
  • Below 1.0040× on Qwen3-1.7B-Base. This is our tightest dense floor; we tried 5 different paths to break it. Three were within noise; two were catastrophic regressions. 1.0040× stands as the empirical floor at the current configuration.

Why this isn't AWQ / GPTQ / EXL3

Every other 4–5 bit compression library targets a quality threshold ("sub-1% PPL on WikiText"). UltraCompress targets a reconstruction contract: the published artifact reconstructs bit-identically to the reference bf16 checkpoint. Codec internals are patent-pending and deliberately not described here.
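The difference between a quality threshold and a reconstruction contract can be illustrated generically. The sketch below is not Sipsa Labs' verification procedure and involves no codec: it packs a few floats as float32 bytes (a stand-in for a raw weight buffer, since the standard library has no bf16 type) and contrasts a tolerance check with a byte-for-byte check. All names are illustrative.

```python
import math
import struct


def to_bytes(values: list[float]) -> bytes:
    """Pack floats as little-endian float32; stands in for a raw weight buffer."""
    return struct.pack(f"<{len(values)}f", *values)


def within_tolerance(xs: list[float], ys: list[float], rel_tol: float = 1e-2) -> bool:
    """Threshold-style check: every value is within rel_tol of its reference."""
    return len(xs) == len(ys) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(xs, ys)
    )


def bit_identical(a: bytes, b: bytes) -> bool:
    """Contract-style check: the two buffers are the same bytes, no tolerance."""
    return a == b


reference = [0.5, -1.25, 3.0]
exact_copy = [0.5, -1.25, 3.0]
near_copy = [0.5, -1.25, 3.001]   # tiny perturbation in the last weight

print(within_tolerance(reference, near_copy))                    # passes a 1% threshold
print(bit_identical(to_bytes(reference), to_bytes(near_copy)))   # fails bit-identity
print(bit_identical(to_bytes(reference), to_bytes(exact_copy)))  # passes bit-identity
```

A threshold check accepts the perturbed copy, while the byte comparison rejects it. The second behavior is what a reconstruction contract asks for.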

This matters when "the model picks a slightly-wrong variable name" is a regulatory finding rather than a cosmetic complaint. In defense and aerospace, bit-exactness at deployment is a compliance requirement. FDA-regulated healthcare AI requires model equivalence between development and deployment. SR 11-7 (Federal Reserve model validation) requires reproducible audit recovery.

For pure-throughput inference on a fixed prompt distribution that matches your AWQ calibration set, with no downstream fine-tuning, AWQ at 4 bpw on vLLM is genuinely fine and we'll say so on a sales call.

As of mid-2026 we are not aware of another published library targeting a bit-identical reconstruction contract (as opposed to a PPL-threshold) for 5-bit transformer compression on the public HuggingFace Hub. If you find one, tell us — we'd rather benchmark against it than claim a gap that isn't there.


Honest negative results

Most projects hide their failures. We catalogue them at the same level of detail as the wins.

  • An initialization shortcut we tried — made PPL 0.07 pp WORSE on Mamba and was discarded. Method specifics withheld (patent-pending).
  • A multi-pass variant we hypothesized would help — produced a catastrophic 13.7× regression vs. the single-pass baseline. CLOSED.
  • Importing an AWQ-style pre-scaling step — produced a catastrophic +13% regression and was ruled out. CLOSED.
  • Pushing the training schedule past the current configuration — gained nothing (within noise). The floor stands.
  • "Base models compress tighter than instruct" hypothesis — refuted 2/3 of architectures. Dropped.

Detailed methodology for any specific failure is available to design partners under NDA.


Who this is for

  • If you serve LLMs in production and your VRAM bill is the constraint, this might help. It scales to a 405B-class model on a single 32 GB consumer GPU (the how is patent-pending). Email founder@sipsalabs.com with your stack and a target latency/quality bar; we'll tell you honestly whether UC fits.
  • If you're in a regulated domain (defense, FDA-regulated healthcare, SR 11-7 model validation, frontier lab red-team), the bit-identical reconstruction contract is the reason to talk to us. Phase 0 POC ($5K, 5 business days, customer-picked model) gets you a pack plus a Sipsa-run bit-identity + PPL audit you can review. Email founder@sipsalabs.com.

If your workload is "MMLU has to stay above X" and you're not pushing the model into long-tail or downstream-fine-tuning territory, AWQ at 4 bpw is probably a better answer than this. We'll say so.


We're a small company looking for design partners

Sipsa Labs is a small lab shipping in public. Our compression methods are patent-pending; details are in PATENT_NOTICE.md. The CLI source is BUSL-1.1 with an Additional Use Grant — free for companies under $1M ARR, research, and individuals, auto-converting to Apache 2.0 four years post-release. If you're building a derivative product whose core value depends on the underlying invention, email founder@sipsalabs.com.

  • Paid Phase 0 POC: founder@sipsalabs.com, $5K / 5 business days / customer-picked model. Deliverable: a pack plus a Sipsa-run bit-identity + PPL audit on your eval set.
  • GitHub Sponsors: github.com/sponsors/sipsalabs.
  • Press / commentary: press@sipsalabs.com.

License

  • Released under BUSL-1.1 with an Additional Use Grant (free for companies under $1M ARR, research, and individuals; auto-converts to Apache 2.0 four years post-release). See LICENSE.
  • The license grant does not extend to the patent-pending compression methodology that produces the artifacts. See PATENT_NOTICE.md.
  • Pre-compressed model artifacts on HuggingFace carry the upstream teacher model's license plus this project's patent terms.

Citation

@software{sipsa_ultracompress_2026,
  author = {{Sipsa Labs, Inc.}},
  title  = {UltraCompress: Lossless 5-bit Transformer Compression},
  year   = {2026},
  url    = {https://github.com/sipsalabs/ultracompress}
}

File details

Details for the file ultracompress-0.6.15.tar.gz.

File metadata

  • Download URL: ultracompress-0.6.15.tar.gz
  • Upload date:
  • Size: 21.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for ultracompress-0.6.15.tar.gz
Algorithm Hash digest
SHA256 60bd5d83cd7aab710b4007f1f24fb6be77cb80be001e8f3ed95ef2446e513497
MD5 5969e28dd484f1d9ae9737d604dfc52f
BLAKE2b-256 d31e3e55b0c994ec4af3e99a9e6c3be895ee34c3c5c4b5d38e2c646979abba25

File details

Details for the file ultracompress-0.6.15-py3-none-any.whl.

File metadata

  • Download URL: ultracompress-0.6.15-py3-none-any.whl
  • Upload date:
  • Size: 17.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for ultracompress-0.6.15-py3-none-any.whl
Algorithm Hash digest
SHA256 862afef4ad687e44b06d5ad4201dc395b9c08f721f4dce8c8fe8edb398f5d2c1
MD5 daab7b28f12e6ed2d812735dfdc3809f
BLAKE2b-256 0d47aa8e42a06e7e1ff686b80c4941a7c01a11a4c545d7b3ddfc9b1a11190382
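To check a downloaded release file against the SHA256 digests listed above, a small standard-library script is enough. This is a sketch: the dictionary is copied from the hash tables on this page, and the function names are illustrative.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the "File hashes" tables above.
PUBLISHED_SHA256 = {
    "ultracompress-0.6.15.tar.gz":
        "60bd5d83cd7aab710b4007f1f24fb6be77cb80be001e8f3ed95ef2446e513497",
    "ultracompress-0.6.15-py3-none-any.whl":
        "862afef4ad687e44b06d5ad4201dc395b9c08f721f4dce8c8fe8edb398f5d2c1",
}


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path: str) -> bool:
    """True if the file's digest equals the published digest for its file name."""
    p = Path(path)
    expected = PUBLISHED_SHA256.get(p.name)
    if expected is None:
        raise KeyError(f"no published digest for {p.name!r}")
    return sha256_of(p) == expected
```

For example, `matches_published("ultracompress-0.6.15.tar.gz")` returns `True` only when the local file is byte-identical to the published source distribution.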
