Lossless 5-bit transformer compression — 8 architectures (1.7B–70B, dense + MoE) at sub-1% PPL degradation. Customer-distributable via `uc pack v0.3`.
Reason this release was yanked:
Broken import in 0.5.0 (track_a_adaptive missing). Use 0.5.1 or later.
Project description
UltraCompress
Extreme LLM compression — three independent mechanisms that compose multiplicatively: (A) Per-layer streaming compression — Qwen2.5-72B → 8.98 GB peak VRAM on a single RTX 5090, PPL ratio 1.0162×. (B) Sub-3-bpw row-overlay weight compression (Claims 17–20) — beats bitsandbytes nf4 at 30% fewer bits on a 6-model cohort. (C) Fractal Residual Recursion (FRR) (Claims 1–16) — shared-block architectural compression at 311–734×.
⭐ Latest — Streaming compression: full Qwen scaling curve, 72B on a single GPU (2026-05-04)
Per-layer streaming compression validated end-to-end across 8B → 72B with peak VRAM bounded by ~one transformer layer regardless of total model depth. Production-grade quality (PPL ratio ≤ 1.05) at every scale; Qwen2.5-72B compressed to 8.98 GB peak VRAM on a single RTX 5090 with 1.6% PPL drift.
| Model | Layers | Baseline PPL | Compressed PPL | PPL ratio | Peak VRAM | Status |
|---|---|---|---|---|---|---|
| Qwen3-8B | 36 | 16.79 | 17.26 | 1.0278× | 2.26 GB | PROD |
| Qwen3-14B | 40 | 15.44 | 15.61 | 1.0111× | 3.37 GB | PROD (best) |
| Qwen3-32B | 64 | 13.77 | 14.27 | 1.0367× | 4.85 GB | PROD |
| Qwen2.5-72B | 80 | 8.92 | 9.07 | 1.0162× | 8.98 GB | PROD (headline) |
Recipe: GSQ scalar 5 bpw + per-block (B=64) absmax + V18-C rank-32 low-rank correction overlay + 200-step KL distillation per layer. Process: load layer fp16 weights via safetensors lazy load → cache teacher hidden output → quantize → fit V18-C against cache → save → free → next layer. Compression time scales linearly: ~1 min/layer overhead.
Bigger models compress at least as well as smaller ones — empirically consistent with arxiv 2505.02214 within the Qwen family. The 100T-on-1-GPU target is now a math problem (multiplicative composition with Track B substrate sharing + Track C inference streaming), not a prayer.
Reproduce on Qwen3-8B (~9 min on a 5090):
python scripts/overlay/streaming_compression_runner.py \
--model qwen3-8b --bpw 5 --block_size 64 --rank 32 \
--train_steps 200 --n_calib 100 --n_eval 50
Result JSONs under scripts/overlay/artifacts/streaming_compression_{8b,14b,32b,72b}_smoke.json. Patent supplement covering streaming-compression mechanism filed 2026-05.
Latest — Claim 20: Row-overlay vs external quantizers (n=500, 6-model cohort)
Head-to-head LAMBADA benchmark against two independent external quantization families (bitsandbytes + HQQ). 48 measurements = 6 models × 8 methods × n=500 samples.
| method | bpw | cohort T1-retention | median ppl-ratio |
|---|---|---|---|
| bnb_int8 | 8.000 | 99.75% | 1.005 |
| bnb_nf4 | 4.000 | 98.31% | 1.054 |
| hqq_4bit_g64 | 4.500 | 97.72% | 1.078 |
| our_mixed_2p79 | 2.798 | 95.63% | 1.131 |
| our_fp8_2p79 | 2.795 | 95.57% | 1.128 |
| hqq_3bit_g64 | 3.500 | 72.46% | 1.608 |
| hqq_2bit_g16 | 4.000 | 34.82% | 17.14 |
| hqq_2bit_g64 | 2.500 | 3.46% | 5284.48 |
Production tier ladder (Qwen3-8B, validated 2026-05-02):
| Operating point | T1 retention | PPL ratio | Compression | Verdict |
|---|---|---|---|---|
| 6 bpw (GSQ k-means) | 96.72% | 1.0024 | 2.67x | Zero-degradation tier |
| 5 bpw (GSQ + low-rank correction) | 94.39% | 1.003 | 3.2x | Production-grade (default) |
| 5 bpw + additional compression on correction overhead | 94.40% | 1.0029 | 3.2x weights + 1.30x correction | Composable stack proof |
| 4 bpw | 90.14% | 1.014 | 4.0x | Light degradation |
| 3 bpw | 80.97% | 1.084 | 5.3x | Aggressive |
The 6 bpw tier is effectively lossless (PPL ratio 1.0024). Three independent compression mechanisms compose multiplicatively without quality loss -- the correction overhead itself compresses an additional 1.30x at storage level with no T1 impact.
Scaling validation (Qwen3-14B, 4 bpw + correction compression + teacher distillation): 88.41% T1 at PPL ratio 0.9752. Full production stack holds at scale.
Headline results at 7–8B scale:
| model | ours @ 2.80 bpw | vs bnb_nf4 @ 4.00 bpw | vs hqq_4bit @ 4.50 bpw | bits saved |
|---|---|---|---|---|
| Qwen3-8B | 97.57% T1-ret | −0.67 pp | −1.17 pp | 30–38% |
| Mistral-7B | 98.03% T1-ret | −1.32 pp | −0.73 pp | 30–38% |
Qualitative differentiator. HQQ produces catastrophic failures (ppl-ratio > 10×) on 6/6 models at 2-bit g64 and 4/6 at 2-bit g16. Our row-overlay produces zero catastrophic failures across all 48 measurements.
Full results: RESULTS.md § Claim 20 · PATENT_CLAIMS.md § Claim 20 · raw data: results/h2h_n500_full.json · analysis: docs/claim20_summary.txt.
Reproduce: python scripts/overlay/benchmark_head_to_head.py --methods our_fp8_2p79,our_mixed_2p79,bnb_nf4,bnb_int8,hqq_4bit_g64,hqq_3bit_g64,hqq_2bit_g64,hqq_2bit_g16 --n 500.
Track B — FRR architectural compression (held-out, 1000 samples, seed 42)
Independent re-evaluation on a held-out region of FineWeb-Edu that was least-touched during training. Protocol: 1000 samples, 128-token context, seed 42, bootstrap 95% CIs. Reproduce in ~15 minutes on a single 32GB GPU: python scripts/frr/hires_eval.py --tags hq5_h256 hq5_h128 --n 1000.
| Variant | Trainable | Compression | all-T1 | all-T10 | last-T10 | Quality | PPL ratio |
|---|---|---|---|---|---|---|---|
| HQ5 h256 | 1,509,916 | 311× | 55.40% | 69.64% | 64.24% | 75.94% | 1.216 |
| HQ5 h128 | 640,284 | 734× | 53.78% | 68.00% | 62.36% | 73.86% | 1.254 |
Interpretation. The h256 student has 0.088% of the teacher's trainable parameters and reproduces its top-10 next-token set 69.64% of the time on unseen text. The h128 student has 0.037% of the teacher's parameters and still reproduces 68.00%. For reference, the typical distillation baseline (DistilBERT / TinyBERT family) achieves 2–7× compression at similar quality; FRR-HQ5 is ~50× beyond that frontier.
Full results: results/hires_results_hq5.json. Pitch for business use: docs/PITCH.md.
Rigor & reproducibility
The numbers above are in-distribution held-out (training samples from the full 500M-token range, eval samples from the tail 50M with a different seed). To defend against a stricter reviewer we also ship:
- ⭐ Fully-disjoint eval on WikiText-103 test split — DONE.
python scripts/overlay/wikitext_eval.py --tags hq5_h256 hq5_h128 --n 1000. WikiText-103 test was never touched during training and is a standard public benchmark. Result: on WT103 HQ5-h256 scores T1 = 55.53% (vs 55.40% in-domain) and T10 = 66.82% (vs 69.64% in-domain). Top-1 agreement is within 0.13 percentage points of the in-domain number — strong evidence the student learned the teacher's distribution rather than just the FineWeb-Edu surface statistics. Raw data: results/wikitext_results.json. - Matched-parameter standard-KD baseline —
python scripts/frr/run_baseline_distill.py --h 256 --n_layers 2 --steps 80000 --tag baseline_h256_L2. Trains a vanilla transformer student at the same ~1.5M trainable params using classical Hinton-2015 distillation. Head-to-head delta proves the nested-fractal + entropy-weighted loss is load-bearing. - Pinned dependencies — see
requirements.txtfor exact versions (torch 2.11.0+cu128, transformers 4.57.2, datasets 4.8.4, numpy 2.2.6). - Full reproduce guide — see REPRODUCE.md for step-by-step.
15-minute interactive demo
python demo.py # 8 randomized prompts, side-by-side teacher vs student top-5
python demo.py --prompt "your text" # single-prompt mode
python demo.py --tag hq5_h128 # 734× model instead of 311×
Training Results (per-run training-eval ceilings, 80K steps each)
| Variant | Trainable | Compression | Best T1 | Best all-T10 | Peak T1 | Peak all-T10 | Quality |
|---|---|---|---|---|---|---|---|
| HQ5 h256 | 1.51 M | 311× | 55.1% | 70.0% | 57.0% | 70.0% | 70.0% |
| HQ5 h128 | 0.64 M | 734× | 54.0% | 68.4% | 54.4% | 68.4% | 68.4% |
| HQ4 h256 | 1.51 M | 311× | 54.3% | 69.2% | 55.7% | 69.6% | 68.9% |
| HQ4 h128 | 0.64 M | 734× | 53.4% | 68.0% | 55.7% | 68.6% | 66.9% |
| HQ3 h256 | 1.51 M | 311× | 54.1% | 68.2% | 54.7% | 68.2% | 68.1% |
| HQ3 h128 | 0.64 M | 734× | 54.2% | 68.0% | 54.2% | 68.0% | 67.7% |
HQ5 h256 is the current flagship. First checkpoint to cross 70% quality on Qwen3-1.7B distillation. Details: docs/HQ5_RESULTS.md, docs/HQ4_RESULTS.md, docs/HQ3_RESULTS.md. Currently training: HQ6 (dual GPU, ENT_POW=2.0) and HQ7 long-horizon (160K steps).
ASVD head fine-tuning (trained separately — stackable with FRR body)
| Rank (r) | Head compression | T1 | T10 | PPL ratio |
|---|---|---|---|---|
| r=1024 | 2.0× | 91.66% | 92.57% | 1.345 |
| r=512 | 3.9× | 87.73% | 88.93% | 2.570 |
| r=256 | 7.9× | 83.22% | 82.83% | 3.885 |
r=1024 exceeds the user's 70% T1 / 90% T10 goal on head-only evaluation — see docs/STATUS.md.
⭐ End-to-end stack: FRR body + ASVD head combined (1000 samples, seed 42)
Full end-to-end compression — the actual deployment artifact. FRR body (HQ5 h256) with its output projection replaced by a rank-reduced ASVD head, then fine-tuned.
| Config | Params | Compression | all-T1 | all-T10 | last-T10 | PPL ratio | Quality |
|---|---|---|---|---|---|---|---|
| Teacher (Qwen3-1.7B body) | 1092.1 M | 1.0× | 100% | 100% | 100% | 1.000 | 100% |
| HQ5 h256 + full head | 312.67 M | 3.5× | 55.40% | 69.64% | 64.24% | 1.216 | 75.94% |
| HQ5 h256 + ASVD r=1024 | 159.19 M | 6.9× | 54.91% | 69.51% | 64.03% | 1.410 | 70.22% |
| HQ5 h256 + ASVD r=512 | 80.35 M | 13.6× | 54.46% | 68.98% | 64.05% | 2.400 | 55.33% |
| HQ5 h256 + ASVD r=256 | 40.93 M | 26.7× | 53.88% | 68.32% | 63.40% | 3.172 | 49.92% |
Interpretation. The FRR+ASVD end-to-end stack at 26.7× total compression still reproduces 68.32% of the teacher's top-10 next-token set — within 1.3 percentage points of the uncompressed-head baseline. This is the number to compare against GPTQ/AWQ/pruning in public benchmarks. Raw data: results/combined_stack_results_hq5.json.
Pareto frontier — pick your operating point
Customer picks where on the curve to land. Existing compression methods (GPTQ, AWQ, SparseGPT, DistilBERT) all cluster at 1.5–7.5× compression with >95% fidelity. FRR+ASVD extends the frontier by an order of magnitude into the 3–27× regime, with a graceful quality–compression trade-off rather than a cliff:
- Quality-first deployment (3–7× compression):
hq5_h256+full_headorhq5_h256+asvd_r1024_ft— 70% quality with 7× fewer parameters. Appropriate for latency-critical production inference. - Balanced deployment (8–14× compression):
hq5_h128orhq5_h256+asvd_r512_ft— 68–69% T10 with under 80M parameters. Appropriate for edge GPU boxes, 8GB Apple Silicon. - Aggressive deployment (27× compression):
hq5_h256+asvd_r256_ft— the 40.9M-parameter model; targets phones, Raspberry Pi class hardware. Quality drops to 50% — appropriate for offline / retrieval-augmented / constrained-vocabulary use cases only.
Raw Pareto data: docs/pareto_frontier.json. Reproduce the chart: python scripts/frr/make_pareto_chart.py.
Cross-model generality (scaling the method)
The method is architecture-agnostic. This release includes:
scaling/teacher_loader.py— auto-detecting Qwen3-family loader. Point it at any cached Qwen3 state dict; it infers hidden size, layer count, head counts, and intermediate dim from the tensors.scripts/frr/run_frr_generic.py— generic trainer with--teacher_cacheflag. Drop-in replacement for the hardcoded 1.7B trainer.scripts/frr/scale_eval.py— model-agnostic eval with bootstrap CIs.tests/test_sanity.py— 6-test regression guard (teacher auto-detect on both 0.6B and 1.7B caches, forward determinism, flagship checkpoint reproducibility, random-init floor, ckpt roundtrip).
Verified on both Qwen3-0.6B (hidden=1024) and Qwen3-1.7B (hidden=2048) state dicts. See docs/SCALING_PLAN.md for the cross-scale experimental matrix and docs/KNOWN_ISSUES.md for honest disclosures.
How It Works
Most compression asks: "How do I make these weights smaller?" FRR asks: "Do I even need different weights per layer?"
Adjacent transformer layers show near-zero weight cosine similarity (~0.001) but CKA > 0.9 (functional similarity). FRR learns the shared functional form once and uses lightweight per-scale modulation to induce layer-specific behavior.
Traditional Transformer FRR Compressed Model
======================== ==========================
Input Input
│ │
▼ ▼
[Layer 0 weights: 54 MB] [Shared Block: 0.64–1.51 M params]
│ │ + γ₀, β₀ (per-scale)
▼ ▼
[Layer 1 weights: 54 MB] [Same Shared Block]
│ │ + γ₁, β₁
▼ ▼
... (28 layers) ... (4 scales × 7 iterations)
│ │
▼ ▼
Output Output
Total body: 1,410 MB Total body: 2.56–6.04 MB
Shared-weight (looped) transformers are Turing-complete (Giannou et al., 2023).
Training Objective — the HQ4/HQ5 Ceiling Break
HQ3 plateaued at T1 ≈ 54% because its confidence-weighted CE + margin loss concentrated gradient on tokens the student had already saturated. HQ4 inverts that signal; HQ5 sharpens it further:
hard_weight = (1 + H(teacher_logits)) ^ entropy_power
total_loss = hard_weight · fkl
+ 0.3 · rkl
+ latent_w(step) · latent_mse # 1.0 → 0.1 across steps 20K→50K
+ 0.5 · ce_ramp(step) · ce # 0.5 → 1.0 across 16K→48K
+ 0.3 · ce_ramp · hard_weight · margin_loss
Two mechanisms working together:
- Inverted weighting forces gradient into high-entropy positions — exactly where T10 gains live.
- Latent decay releases the mean-seeking attractor so the ce+margin signal can shape the output distribution rather than just the intermediate latents.
Experiment Timeline
| Stage | Compression | T1 | all-T10 | Status | Notes |
|---|---|---|---|---|---|
| Baseline | 52× | 47% | 62–65% | Done | Pure-KL distillation |
| TinyFRR | 311–2200× | 43–46% | 60–64% | Done | Compression sweep h=16…1024 |
| HQ2 | 311–734× | ~50% | 67% | Done | Adds hidden-state latent alignment |
| HQ3 | 311–734× | 54.2% | 68.2% | Done | 5-loss w/ confidence-weighted CE+margin |
| HQ4 | 311–734× | 54.3% | 69.2% | Done | Inverted entropy weighting + latent decay |
| HQ5 | 311–734× | 55.1% | 70.0% | Done, public | Stronger entropy_power (1.5) + per-width latent floor |
| HQ6 | 311–734× | TBD | TBD | Training | ENT_POW=2.0 (h256) + h384 capacity test |
Full training logs: logs/ (hq{3,4,5,6}_h{128,256,384}.log).
Quick Start
git clone https://github.com/mounnar/ultracompress.git
cd ultracompress
pip install -r requirements.txt
# 1. Cache the teacher (one-time, ~7 GB for Qwen3-1.7B)
python tools/download_models.py
# 2. Pre-tokenize training data (one-time, ~2 GB for 500M tokens)
python prepare_500M_tokens.py
# 3. Train TinyFRR body with the HQ4 ceiling-break objective
python scripts/frr/run_hq4_ceiling_break.py --h 256 --steps 80000 --tag my_run
# 4. (Optional) Dual-GPU detached launch
python scripts/frr/launch_hq4_detached.py # spawns h=128 on GPU 0, h=256 on GPU 1
# 5. Fine-tune an ASVD-factored lm_head
python finetune_asvd_head.py --r 1024 --steps 20000 --tag asvd_r1024_ft
Resume support
All run_hq*.py scripts save {ckpt_dir}/latest.pt every 2000 steps. Relaunching the same command auto-resumes.
Detached training on Windows
scripts/frr/launch_hq4_detached.py / scripts/frr/launch_hq5_detached.py use subprocess.Popen with DETACHED_PROCESS | CREATE_BREAKAWAY_FROM_JOB | CREATE_NEW_PROCESS_GROUP so training survives terminal closure, VS Code restart, and parent-shell kills.
Repository Layout
ultracompress/
├── README.md This file
├── RESULTS.md Per-claim measurement record (Claims 1-20)
├── PATENT_CLAIMS.md Full patent claims file (20 claims)
├── REPRODUCE.md Step-by-step reproduction guide
├── CONTRIBUTING.md Contribution guide
├── LICENSE Apache 2.0
├── requirements.txt Pinned deps (torch 2.11+cu128, transformers 4.57)
├── pyproject.toml Package metadata
├── demo.py Interactive teacher-vs-student demo
├── serve.py Minimal inference server
├── ultracompress.py CLI entry point
│
├── ultracompress/ Core library (FractalModel, pipeline, coding)
├── scaling/ Cross-model teacher loaders (Qwen3 family)
├── lib/ Shared utilities
├── tools/ Model download, quantization utilities
├── tests/ Regression tests
│
├── scripts/overlay/ ★ Track A — row-overlay (Claims 17-20)
│ ├── benchmark_head_to_head.py Unified bnb + HQQ + ours harness
│ ├── _analyze_claim20.py Claim-20 merge + summary generator
│ ├── lambada_overlay*.py Overlay drivers (sparse / fp8 / mixed)
│ ├── fit_v17_hifi.py v17 weight-row fit driver
│ ├── pack_all_v17.py, pack_v17.py, verify_all_v17.py
│ └── ...
├── scripts/frr/ Track B — FRR architectural compression
│ ├── run_hq4_ceiling_break.py Flagship HQ4 trainer
│ ├── launch_hq{4,5,6,7}_*.py Windows detached dual-GPU launchers
│ ├── hires_eval.py Held-out eval driver
│ └── ...
│
├── results/ All measurement JSONs (indexed by claim)
├── logs/ Run logs (indexed by claim)
├── archive/ Obsolete compress_v8..v18 iteration scripts
└── docs/ Paper, patent drafts, pitch, claim figures
Key Findings
- Functional similarity enables weight sharing. Adjacent layers have CKA > 0.9 despite zero weight cosine similarity.
- FRR is Pareto-optimal across 311–2200× compression. Quality degrades gracefully (−0.8 to −2.6 pp last-T10 at 734× vs. baseline).
- Hard-token focus beats easy-token focus. HQ3's confidence-weighted loss plateaued at T1=54.2%; HQ4's inverted weighting broke through to 55.7% peak / 69.6% all-T10.
- Latent alignment is an on-ramp, not a destination. Keeping latent_w = 1.0 throughout training caps quality; decaying it after step 20K lets the output-space signal dominate and breaks the ceiling.
- ASVD head + FRR body compose cleanly. 92.57% T10 head + 68% T10 body predicts a joint end-to-end quality ceiling that has not yet been measured on a unified stack — this is the next milestone.
- Reproducibility. All 80K-step runs reproduce to within ±1.5 pp on identical seeds (validated across HQ3 → HQ4 → HQ5).
Competitive Position
| Method | Year | Arch. compression | Approach |
|---|---|---|---|
| GPTQ / AWQ | 2023 | 4–8× | Post-training quantization |
| SparseGPT | 2023 | 2–4× | Unstructured pruning |
| Relaxed Recursive (Google) | 2025 | ~2× | Shared block + LoRA |
| Ouroboros V2 | 2026 | ~2× | Controller hypernetwork |
| UltraCompress FRR (HQ4) | 2026 | 311–734× | Fractal recursive block + entropy-aware distillation |
Stacked with Q2 + entropy coding, the total compression reaches ~7,500× on quantized weights.
Projection: 100T-parameter model on a single GPU
| Stack | 100T-param size | Compression ratio |
|---|---|---|
| FRR 311× + Q2 + entropy | ≈ 12 GB | ≈ 8,300× |
| FRR 734× + Q2 + entropy | ≈ 5 GB | ≈ 20,000× |
These are architectural projections; the 734× FRR body has been trained end-to-end; Q2 + entropy coding have been validated at pipeline scope on Qwen3-0.6B (959× total, 35% T1 / 53% T10).
Citation
@misc{ultracompress2026,
title = {Fractal Residual Recursion: Extreme Transformer Compression
via Shared Recursive Blocks},
author = {Mounir},
year = {2026},
url = {https://github.com/mounnar/ultracompress}
}
Status & Contact
- Active development — see
HQ5and docs/STATUS.md for the latest training run. - Full result write-ups in docs/HQ3_RESULTS.md, docs/HQ4_RESULTS.md.
- Paper draft: docs/PAPER_DRAFT.md. Patent draft: docs/PATENT_DRAFT.md.
License
Apache 2.0 — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ultracompress-0.5.0.tar.gz.
File metadata
- Download URL: ultracompress-0.5.0.tar.gz
- Upload date:
- Size: 316.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a2c2aae62d1697c16757984470e663a081284833246c363be1f8ec1d0da06c85
|
|
| MD5 |
92b202fff8dff49e6f72f7d0fe5e88b2
|
|
| BLAKE2b-256 |
2ed1939c1d2335db00918edb8c5f4681551267a51b3e9cb4c3944bf9af1bd72b
|
File details
Details for the file ultracompress-0.5.0-py3-none-any.whl.
File metadata
- Download URL: ultracompress-0.5.0-py3-none-any.whl
- Upload date:
- Size: 358.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
eb0d559cc9d0ee9346ae75e0ec66f3b9a12bec7cd2c5e6b0a3ec78d756f0ed74
|
|
| MD5 |
1ac2f12d7b4fef776c2410d983636a9d
|
|
| BLAKE2b-256 |
2dde673d1ae6ea1e3d531516f6df37bd567dc8e18d2dc40c0da2f327a5f2c2cb
|