Skip to main content

Lossless 5-bit transformer compression — 8 architectures (1.7B–70B, dense + MoE) at sub-1% PPL degradation. Customer-distributable via `uc pack v0.3`.

Reason this release was yanked:

Broken import in 0.5.0 (track_a_adaptive missing). Use 0.5.1 or later.

Project description

UltraCompress

Extreme LLM compression — three independent mechanisms that compose multiplicatively: (A) Per-layer streaming compression — Qwen2.5-72B → 8.98 GB peak VRAM on a single RTX 5090, PPL ratio 1.0162×. (B) Sub-3-bpw row-overlay weight compression (Claims 17–20) — beats bitsandbytes nf4 at 30% fewer bits on a 6-model cohort. (C) Fractal Residual Recursion (FRR) (Claims 1–16) — shared-block architectural compression at 311–734×.

License Python 3.10+ Hardware Patent


⭐ Latest — Streaming compression: full Qwen scaling curve, 72B on a single GPU (2026-05-04)

Per-layer streaming compression validated end-to-end across 8B → 72B with peak VRAM bounded by ~one transformer layer regardless of total model depth. Production-grade quality (PPL ratio ≤ 1.05) at every scale; Qwen2.5-72B compressed to 8.98 GB peak VRAM on a single RTX 5090 with 1.6% PPL drift.

Model Layers Baseline PPL Compressed PPL PPL ratio Peak VRAM Status
Qwen3-8B 36 16.79 17.26 1.0278× 2.26 GB PROD
Qwen3-14B 40 15.44 15.61 1.0111× 3.37 GB PROD (best)
Qwen3-32B 64 13.77 14.27 1.0367× 4.85 GB PROD
Qwen2.5-72B 80 8.92 9.07 1.0162× 8.98 GB PROD (headline)

Recipe: GSQ scalar 5 bpw + per-block (B=64) absmax + V18-C rank-32 low-rank correction overlay + 200-step KL distillation per layer. Process: load layer fp16 weights via safetensors lazy load → cache teacher hidden output → quantize → fit V18-C against cache → save → free → next layer. Compression time scales linearly: ~1 min/layer overhead.

Bigger models compress at least as well as smaller ones — empirically consistent with arxiv 2505.02214 within the Qwen family. The 100T-on-1-GPU target is now a math problem (multiplicative composition with Track B substrate sharing + Track C inference streaming), not a prayer.

Reproduce on Qwen3-8B (~9 min on a 5090):

python scripts/overlay/streaming_compression_runner.py \
    --model qwen3-8b --bpw 5 --block_size 64 --rank 32 \
    --train_steps 200 --n_calib 100 --n_eval 50

Result JSONs under scripts/overlay/artifacts/streaming_compression_{8b,14b,32b,72b}_smoke.json. Patent supplement covering streaming-compression mechanism filed 2026-05.


Latest — Claim 20: Row-overlay vs external quantizers (n=500, 6-model cohort)

Head-to-head LAMBADA benchmark against two independent external quantization families (bitsandbytes + HQQ). 48 measurements = 6 models × 8 methods × n=500 samples.

method bpw cohort T1-retention median ppl-ratio
bnb_int8 8.000 99.75% 1.005
bnb_nf4 4.000 98.31% 1.054
hqq_4bit_g64 4.500 97.72% 1.078
our_mixed_2p79 2.798 95.63% 1.131
our_fp8_2p79 2.795 95.57% 1.128
hqq_3bit_g64 3.500 72.46% 1.608
hqq_2bit_g16 4.000 34.82% 17.14
hqq_2bit_g64 2.500 3.46% 5284.48

Production tier ladder (Qwen3-8B, validated 2026-05-02):

Operating point T1 retention PPL ratio Compression Verdict
6 bpw (GSQ k-means) 96.72% 1.0024 2.67x Zero-degradation tier
5 bpw (GSQ + low-rank correction) 94.39% 1.003 3.2x Production-grade (default)
5 bpw + additional compression on correction overhead 94.40% 1.0029 3.2x weights + 1.30x correction Composable stack proof
4 bpw 90.14% 1.014 4.0x Light degradation
3 bpw 80.97% 1.084 5.3x Aggressive

The 6 bpw tier is effectively lossless (PPL ratio 1.0024). Three independent compression mechanisms compose multiplicatively without quality loss -- the correction overhead itself compresses an additional 1.30x at storage level with no T1 impact.

Scaling validation (Qwen3-14B, 4 bpw + correction compression + teacher distillation): 88.41% T1 at PPL ratio 0.9752. Full production stack holds at scale.

Headline results at 7–8B scale:

model ours @ 2.80 bpw vs bnb_nf4 @ 4.00 bpw vs hqq_4bit @ 4.50 bpw bits saved
Qwen3-8B 97.57% T1-ret −0.67 pp −1.17 pp 30–38%
Mistral-7B 98.03% T1-ret −1.32 pp −0.73 pp 30–38%

Qualitative differentiator. HQQ produces catastrophic failures (ppl-ratio > 10×) on 6/6 models at 2-bit g64 and 4/6 at 2-bit g16. Our row-overlay produces zero catastrophic failures across all 48 measurements.

Full results: RESULTS.md § Claim 20 · PATENT_CLAIMS.md § Claim 20 · raw data: results/h2h_n500_full.json · analysis: docs/claim20_summary.txt.

Reproduce: python scripts/overlay/benchmark_head_to_head.py --methods our_fp8_2p79,our_mixed_2p79,bnb_nf4,bnb_int8,hqq_4bit_g64,hqq_3bit_g64,hqq_2bit_g64,hqq_2bit_g16 --n 500.


Track B — FRR architectural compression (held-out, 1000 samples, seed 42)

Independent re-evaluation on a held-out region of FineWeb-Edu that was least-touched during training. Protocol: 1000 samples, 128-token context, seed 42, bootstrap 95% CIs. Reproduce in ~15 minutes on a single 32GB GPU: python scripts/frr/hires_eval.py --tags hq5_h256 hq5_h128 --n 1000.

Variant Trainable Compression all-T1 all-T10 last-T10 Quality PPL ratio
HQ5 h256 1,509,916 311× 55.40% 69.64% 64.24% 75.94% 1.216
HQ5 h128 640,284 734× 53.78% 68.00% 62.36% 73.86% 1.254

Interpretation. The h256 student has 0.088% of the teacher's trainable parameters and reproduces its top-10 next-token set 69.64% of the time on unseen text. The h128 student has 0.037% of the teacher's parameters and still reproduces 68.00%. For reference, the typical distillation baseline (DistilBERT / TinyBERT family) achieves 2–7× compression at similar quality; FRR-HQ5 is ~50× beyond that frontier.

Full results: results/hires_results_hq5.json. Pitch for business use: docs/PITCH.md.

Rigor & reproducibility

The numbers above are in-distribution held-out (training samples from the full 500M-token range, eval samples from the tail 50M with a different seed). To defend against a stricter reviewer we also ship:

  • ⭐ Fully-disjoint eval on WikiText-103 test split — DONE. python scripts/overlay/wikitext_eval.py --tags hq5_h256 hq5_h128 --n 1000. WikiText-103 test was never touched during training and is a standard public benchmark. Result: on WT103 HQ5-h256 scores T1 = 55.53% (vs 55.40% in-domain) and T10 = 66.82% (vs 69.64% in-domain). Top-1 agreement is within 0.13 percentage points of the in-domain number — strong evidence the student learned the teacher's distribution rather than just the FineWeb-Edu surface statistics. Raw data: results/wikitext_results.json.
  • Matched-parameter standard-KD baselinepython scripts/frr/run_baseline_distill.py --h 256 --n_layers 2 --steps 80000 --tag baseline_h256_L2. Trains a vanilla transformer student at the same ~1.5M trainable params using classical Hinton-2015 distillation. Head-to-head delta proves the nested-fractal + entropy-weighted loss is load-bearing.
  • Pinned dependencies — see requirements.txt for exact versions (torch 2.11.0+cu128, transformers 4.57.2, datasets 4.8.4, numpy 2.2.6).
  • Full reproduce guide — see REPRODUCE.md for step-by-step.

15-minute interactive demo

python demo.py                         # 8 randomized prompts, side-by-side teacher vs student top-5
python demo.py --prompt "your text"    # single-prompt mode
python demo.py --tag hq5_h128          # 734× model instead of 311×

Training Results (per-run training-eval ceilings, 80K steps each)

Variant Trainable Compression Best T1 Best all-T10 Peak T1 Peak all-T10 Quality
HQ5 h256 1.51 M 311× 55.1% 70.0% 57.0% 70.0% 70.0%
HQ5 h128 0.64 M 734× 54.0% 68.4% 54.4% 68.4% 68.4%
HQ4 h256 1.51 M 311× 54.3% 69.2% 55.7% 69.6% 68.9%
HQ4 h128 0.64 M 734× 53.4% 68.0% 55.7% 68.6% 66.9%
HQ3 h256 1.51 M 311× 54.1% 68.2% 54.7% 68.2% 68.1%
HQ3 h128 0.64 M 734× 54.2% 68.0% 54.2% 68.0% 67.7%

HQ5 h256 is the current flagship. First checkpoint to cross 70% quality on Qwen3-1.7B distillation. Details: docs/HQ5_RESULTS.md, docs/HQ4_RESULTS.md, docs/HQ3_RESULTS.md. Currently training: HQ6 (dual GPU, ENT_POW=2.0) and HQ7 long-horizon (160K steps).

ASVD head fine-tuning (trained separately — stackable with FRR body)

Rank (r) Head compression T1 T10 PPL ratio
r=1024 2.0× 91.66% 92.57% 1.345
r=512 3.9× 87.73% 88.93% 2.570
r=256 7.9× 83.22% 82.83% 3.885

r=1024 exceeds the user's 70% T1 / 90% T10 goal on head-only evaluation — see docs/STATUS.md.

⭐ End-to-end stack: FRR body + ASVD head combined (1000 samples, seed 42)

Full end-to-end compression — the actual deployment artifact. FRR body (HQ5 h256) with its output projection replaced by a rank-reduced ASVD head, then fine-tuned.

Config Params Compression all-T1 all-T10 last-T10 PPL ratio Quality
Teacher (Qwen3-1.7B body) 1092.1 M 1.0× 100% 100% 100% 1.000 100%
HQ5 h256 + full head 312.67 M 3.5× 55.40% 69.64% 64.24% 1.216 75.94%
HQ5 h256 + ASVD r=1024 159.19 M 6.9× 54.91% 69.51% 64.03% 1.410 70.22%
HQ5 h256 + ASVD r=512 80.35 M 13.6× 54.46% 68.98% 64.05% 2.400 55.33%
HQ5 h256 + ASVD r=256 40.93 M 26.7× 53.88% 68.32% 63.40% 3.172 49.92%

Interpretation. The FRR+ASVD end-to-end stack at 26.7× total compression still reproduces 68.32% of the teacher's top-10 next-token set — within 1.3 percentage points of the uncompressed-head baseline. This is the number to compare against GPTQ/AWQ/pruning in public benchmarks. Raw data: results/combined_stack_results_hq5.json.

Pareto frontier — pick your operating point

Compression vs Fidelity

Customer picks where on the curve to land. Existing compression methods (GPTQ, AWQ, SparseGPT, DistilBERT) all cluster at 1.5–7.5× compression with >95% fidelity. FRR+ASVD extends the frontier by an order of magnitude into the 3–27× regime, with a graceful quality–compression trade-off rather than a cliff:

  • Quality-first deployment (3–7× compression): hq5_h256+full_head or hq5_h256+asvd_r1024_ft — 70% quality with 7× fewer parameters. Appropriate for latency-critical production inference.
  • Balanced deployment (8–14× compression): hq5_h128 or hq5_h256+asvd_r512_ft — 68–69% T10 with under 80M parameters. Appropriate for edge GPU boxes, 8GB Apple Silicon.
  • Aggressive deployment (27× compression): hq5_h256+asvd_r256_ft — the 40.9M-parameter model; targets phones, Raspberry Pi class hardware. Quality drops to 50% — appropriate for offline / retrieval-augmented / constrained-vocabulary use cases only.

Raw Pareto data: docs/pareto_frontier.json. Reproduce the chart: python scripts/frr/make_pareto_chart.py.


Cross-model generality (scaling the method)

The method is architecture-agnostic. This release includes:

  • scaling/teacher_loader.py — auto-detecting Qwen3-family loader. Point it at any cached Qwen3 state dict; it infers hidden size, layer count, head counts, and intermediate dim from the tensors.
  • scripts/frr/run_frr_generic.py — generic trainer with --teacher_cache flag. Drop-in replacement for the hardcoded 1.7B trainer.
  • scripts/frr/scale_eval.py — model-agnostic eval with bootstrap CIs.
  • tests/test_sanity.py — 6-test regression guard (teacher auto-detect on both 0.6B and 1.7B caches, forward determinism, flagship checkpoint reproducibility, random-init floor, ckpt roundtrip).

Verified on both Qwen3-0.6B (hidden=1024) and Qwen3-1.7B (hidden=2048) state dicts. See docs/SCALING_PLAN.md for the cross-scale experimental matrix and docs/KNOWN_ISSUES.md for honest disclosures.


How It Works

Most compression asks: "How do I make these weights smaller?" FRR asks: "Do I even need different weights per layer?"

Adjacent transformer layers show near-zero weight cosine similarity (~0.001) but CKA > 0.9 (functional similarity). FRR learns the shared functional form once and uses lightweight per-scale modulation to induce layer-specific behavior.

Traditional Transformer           FRR Compressed Model
========================          ==========================

Input                             Input
  │                                 │
  ▼                                 ▼
[Layer 0 weights: 54 MB]          [Shared Block: 0.64–1.51 M params]
  │                                 │ + γ₀, β₀ (per-scale)
  ▼                                 ▼
[Layer 1 weights: 54 MB]          [Same Shared Block]
  │                                 │ + γ₁, β₁
  ▼                                 ▼
  ...  (28 layers)                  ...  (4 scales × 7 iterations)
  │                                 │
  ▼                                 ▼
Output                            Output

Total body: 1,410 MB              Total body: 2.56–6.04 MB

Shared-weight (looped) transformers are Turing-complete (Giannou et al., 2023).


Training Objective — the HQ4/HQ5 Ceiling Break

HQ3 plateaued at T1 ≈ 54% because its confidence-weighted CE + margin loss concentrated gradient on tokens the student had already saturated. HQ4 inverts that signal; HQ5 sharpens it further:

hard_weight  = (1 + H(teacher_logits)) ^ entropy_power
total_loss   = hard_weight · fkl
             + 0.3 · rkl
             + latent_w(step) · latent_mse        # 1.0 → 0.1 across steps 20K→50K
             + 0.5 · ce_ramp(step) · ce           # 0.5 → 1.0 across 16K→48K
             + 0.3 · ce_ramp · hard_weight · margin_loss

Two mechanisms working together:

  1. Inverted weighting forces gradient into high-entropy positions — exactly where T10 gains live.
  2. Latent decay releases the mean-seeking attractor so the ce+margin signal can shape the output distribution rather than just the intermediate latents.

Experiment Timeline

Stage Compression T1 all-T10 Status Notes
Baseline 52× 47% 62–65% Done Pure-KL distillation
TinyFRR 311–2200× 43–46% 60–64% Done Compression sweep h=16…1024
HQ2 311–734× ~50% 67% Done Adds hidden-state latent alignment
HQ3 311–734× 54.2% 68.2% Done 5-loss w/ confidence-weighted CE+margin
HQ4 311–734× 54.3% 69.2% Done Inverted entropy weighting + latent decay
HQ5 311–734× 55.1% 70.0% Done, public Stronger entropy_power (1.5) + per-width latent floor
HQ6 311–734× TBD TBD Training ENT_POW=2.0 (h256) + h384 capacity test

Full training logs: logs/ (hq{3,4,5,6}_h{128,256,384}.log).


Quick Start

git clone https://github.com/mounnar/ultracompress.git
cd ultracompress
pip install -r requirements.txt

# 1. Cache the teacher (one-time, ~7 GB for Qwen3-1.7B)
python tools/download_models.py

# 2. Pre-tokenize training data (one-time, ~2 GB for 500M tokens)
python prepare_500M_tokens.py

# 3. Train TinyFRR body with the HQ4 ceiling-break objective
python scripts/frr/run_hq4_ceiling_break.py --h 256 --steps 80000 --tag my_run

# 4. (Optional) Dual-GPU detached launch
python scripts/frr/launch_hq4_detached.py     # spawns h=128 on GPU 0, h=256 on GPU 1

# 5. Fine-tune an ASVD-factored lm_head
python finetune_asvd_head.py --r 1024 --steps 20000 --tag asvd_r1024_ft

Resume support

All run_hq*.py scripts save {ckpt_dir}/latest.pt every 2000 steps. Relaunching the same command auto-resumes.

Detached training on Windows

scripts/frr/launch_hq4_detached.py / scripts/frr/launch_hq5_detached.py use subprocess.Popen with DETACHED_PROCESS | CREATE_BREAKAWAY_FROM_JOB | CREATE_NEW_PROCESS_GROUP so training survives terminal closure, VS Code restart, and parent-shell kills.


Repository Layout

ultracompress/
├── README.md                     This file
├── RESULTS.md                    Per-claim measurement record (Claims 1-20)
├── PATENT_CLAIMS.md              Full patent claims file (20 claims)
├── REPRODUCE.md                  Step-by-step reproduction guide
├── CONTRIBUTING.md               Contribution guide
├── LICENSE                       Apache 2.0
├── requirements.txt              Pinned deps (torch 2.11+cu128, transformers 4.57)
├── pyproject.toml                Package metadata
├── demo.py                       Interactive teacher-vs-student demo
├── serve.py                      Minimal inference server
├── ultracompress.py              CLI entry point
│
├── ultracompress/                Core library (FractalModel, pipeline, coding)
├── scaling/                      Cross-model teacher loaders (Qwen3 family)
├── lib/                          Shared utilities
├── tools/                        Model download, quantization utilities
├── tests/                        Regression tests
│
├── scripts/overlay/              ★ Track A — row-overlay (Claims 17-20)
│   ├── benchmark_head_to_head.py   Unified bnb + HQQ + ours harness
│   ├── _analyze_claim20.py         Claim-20 merge + summary generator
│   ├── lambada_overlay*.py         Overlay drivers (sparse / fp8 / mixed)
│   ├── fit_v17_hifi.py             v17 weight-row fit driver
│   ├── pack_all_v17.py, pack_v17.py, verify_all_v17.py
│   └── ...
├── scripts/frr/                  Track B — FRR architectural compression
│   ├── run_hq4_ceiling_break.py    Flagship HQ4 trainer
│   ├── launch_hq{4,5,6,7}_*.py     Windows detached dual-GPU launchers
│   ├── hires_eval.py               Held-out eval driver
│   └── ...
│
├── results/                      All measurement JSONs (indexed by claim)
├── logs/                         Run logs (indexed by claim)
├── archive/                      Obsolete compress_v8..v18 iteration scripts
└── docs/                         Paper, patent drafts, pitch, claim figures

Key Findings

  1. Functional similarity enables weight sharing. Adjacent layers have CKA > 0.9 despite zero weight cosine similarity.
  2. FRR is Pareto-optimal across 311–2200× compression. Quality degrades gracefully (−0.8 to −2.6 pp last-T10 at 734× vs. baseline).
  3. Hard-token focus beats easy-token focus. HQ3's confidence-weighted loss plateaued at T1=54.2%; HQ4's inverted weighting broke through to 55.7% peak / 69.6% all-T10.
  4. Latent alignment is an on-ramp, not a destination. Keeping latent_w = 1.0 throughout training caps quality; decaying it after step 20K lets the output-space signal dominate and breaks the ceiling.
  5. ASVD head + FRR body compose cleanly. 92.57% T10 head + 68% T10 body predicts a joint end-to-end quality ceiling that has not yet been measured on a unified stack — this is the next milestone.
  6. Reproducibility. All 80K-step runs reproduce to within ±1.5 pp on identical seeds (validated across HQ3 → HQ4 → HQ5).

Competitive Position

Method Year Arch. compression Approach
GPTQ / AWQ 2023 4–8× Post-training quantization
SparseGPT 2023 2–4× Unstructured pruning
Relaxed Recursive (Google) 2025 ~2× Shared block + LoRA
Ouroboros V2 2026 ~2× Controller hypernetwork
UltraCompress FRR (HQ4) 2026 311–734× Fractal recursive block + entropy-aware distillation

Stacked with Q2 + entropy coding, the total compression reaches ~7,500× on quantized weights.


Projection: 100T-parameter model on a single GPU

Stack 100T-param size Compression ratio
FRR 311× + Q2 + entropy ≈ 12 GB ≈ 8,300×
FRR 734× + Q2 + entropy ≈ 5 GB ≈ 20,000×

These are architectural projections; the 734× FRR body has been trained end-to-end; Q2 + entropy coding have been validated at pipeline scope on Qwen3-0.6B (959× total, 35% T1 / 53% T10).


Citation

@misc{ultracompress2026,
  title  = {Fractal Residual Recursion: Extreme Transformer Compression
            via Shared Recursive Blocks},
  author = {Mounir},
  year   = {2026},
  url    = {https://github.com/mounnar/ultracompress}
}

Status & Contact

License

Apache 2.0 — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ultracompress-0.5.0.tar.gz (316.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ultracompress-0.5.0-py3-none-any.whl (358.1 kB view details)

Uploaded Python 3

File details

Details for the file ultracompress-0.5.0.tar.gz.

File metadata

  • Download URL: ultracompress-0.5.0.tar.gz
  • Upload date:
  • Size: 316.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for ultracompress-0.5.0.tar.gz
Algorithm Hash digest
SHA256 a2c2aae62d1697c16757984470e663a081284833246c363be1f8ec1d0da06c85
MD5 92b202fff8dff49e6f72f7d0fe5e88b2
BLAKE2b-256 2ed1939c1d2335db00918edb8c5f4681551267a51b3e9cb4c3944bf9af1bd72b

See more details on using hashes here.

File details

Details for the file ultracompress-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: ultracompress-0.5.0-py3-none-any.whl
  • Upload date:
  • Size: 358.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for ultracompress-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 eb0d559cc9d0ee9346ae75e0ec66f3b9a12bec7cd2c5e6b0a3ec78d756f0ed74
MD5 1ac2f12d7b4fef776c2410d983636a9d
BLAKE2b-256 2dde673d1ae6ea1e3d531516f6df37bd567dc8e18d2dc40c0da2f327a5f2c2cb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page