TurboQuant KV cache compression for local LLM inference
Project description
tqai
TurboQuant KV cache compression for local LLM inference + a composable pipeline for experimenting with the latest compression papers locally.
Compresses the KV cache to ~3 bits per channel with byte-identical output to baseline on 8B+ models (Δppl = 0.00 across Qwen, Gemma, Llama). Supports both PyTorch (CPU/CUDA) and MLX (Apple Silicon). v0.4 adds a plugin-based pipeline so any paper — QuantSparse, DiTFastAttn, BSA, Sheaf, Fisher — can be added as a single file without touching core code, plus drop-in support for diffusers video pipelines (WAN 2.2, LTX-2).
Based on TurboQuant (Google Research, ICLR 2026).
Honest framing on the 80% number. The K4/V2 quantization produces a compressed-storage form that is ~5× smaller than the uncompressed bf16 KV tensors. This reduction is real for wire format / serialized KV / future fused dequant-attention kernels, but does not reduce peak runtime memory on Apple Silicon today because the v0.3.1 incremental cache strategy keeps a persistent dequantized buffer at full input precision (the design choice that enables O(1) per-token decode). Measured peak memory on Qwen 2.5-7B Q8 at 4K–128K context: tqai is within ±0.2 GB of baseline at every length. Full benchmark and explanation in
reports/kv_memory_finding.md. The actual unlock for runtime memory savings is a fused dequant-attention Metal kernel (KVQuant approach), which is on the v0.5 roadmap.
Installation
# PyPI
pip install tqai
# With PyTorch backend
pip install tqai[torch]
# With MLX backend (Apple Silicon)
pip install tqai[mlx]
# Global CLI install (no venv management)
pipx install tqai
pipx inject tqai mlx mlx-lm # add MLX backend
pipx inject tqai torch # or PyTorch backend
Quick Start
HuggingFace Transformers (PyTorch)
import tqai
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
# One line to enable KV cache compression
cache = tqai.patch(model, bits_k=4, bits_v=2)
inputs = tokenizer("Explain quantum entanglement:", return_tensors="pt")
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
MLX (Apple Silicon)
import tqai
import mlx_lm
model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-8bit")
# One line to enable KV cache compression
tqai.patch(model, bits_k=4, bits_v=2, backend="mlx")
response = mlx_lm.generate(model, tokenizer, prompt="Explain quantum entanglement:", max_tokens=200)
print(response)
# Restore original behaviour when done
tqai.unpatch(model)
Benchmark Results
All results measured on Apple Silicon (MLX). Full data in benchmarks/results/.
Perplexity — zero change across every model and config
| Model | Baseline PPL | + tqai K4/V2 | + tqai K3/V2 | Δppl |
|---|---|---|---|---|
| Qwen2.5-0.5B bf16 | 4.34 | 4.34 | 4.34 | 0.00 |
| Qwen2.5-3B bf16 | 2.49 | 2.49 | 2.49 | 0.00 |
| Llama-3.1-8B Q4 | 2.95 | 2.95 | 2.95 | 0.00 |
| Qwen2.5-7B Q8 | 2.40 | 2.40 | 2.40 | 0.00 |
| Qwen2.5-14B Q4 | 2.22 | 2.22 | 2.22 | 0.00 |
| Gemma 4 E4B Q4 | 126.94 | 126.94 | — | 0.00 |
Δppl = 0.00 across all models and compression configs tested (6 models, 12 quantization variants).
Throughput (MLX, v0.3.1 — fused Metal kernels + incremental cache)
| Model | Baseline | kv-only | Retention | vs v0.2 |
|---|---|---|---|---|
| Qwen2.5-0.5B bf16 | 326 tok/s | 118 tok/s | 36% | was 10% |
| Qwen2.5-3B bf16 | 71 tok/s | 66 tok/s | 93% | was 26% |
| Llama-3.1-8B Q4 | 102 tok/s | 85 tok/s | 84% | was 22% |
| Qwen2.5-7B Q8 | 63 tok/s | 59 tok/s | 93% | was 38% |
| Qwen2.5-14B Q4 | 56 tok/s | 50 tok/s | 89% | was 25% |
| Gemma 4 E4B Q4 | 113 tok/s | 47 tok/s | 41% | new |
v0.3.1 eliminated the O(n²) per-token reconstruction overhead via an incremental dequantized buffer (KIVI-inspired). Fused Metal kernels (mx.fast.metal_kernel) handle quantize/dequantize in single GPU dispatches. Larger models (3B+) now retain 84–93% of baseline throughput.
v0.4 — Pipeline strategy quality (synthetic NMSE, mean across 7 model profiles)
| Config | NMSE | vs Baseline | Use case |
|---|---|---|---|
| baseline | 0.009290 | — | reference |
| palm+tiered | 0.009301 | 0.0% | Adaptive bit allocation |
| palm+delta | 0.003759 | −59.5% | Inter-step temporal redundancy |
| snr+delta2 | 0.003746 | −59.7% | Best for diffusion (cosine schedule) |
| sheaf+delta2 | 0.003763 | −59.5% | Spatial-temporal token graphs |
| palm+window | 0.018907 | +103.5% | Trades quality for speed |
| skip_layers | 0.009297 | 0.0% | Layer protection (arXiv:2504.10317) |
Delta strategies cut reconstruction distortion in half on synthetic streaming data — validates QuantSparse's arXiv:2509.23681 second-order residual insight.
v0.4 — Video generation (WAN 2.2 5B + LTX-2, MLX, 33 frames @ 480p, 15 steps)
| Model | Config | Time | Peak RSS | CFG share | Note |
|---|---|---|---|---|---|
| WAN 2.2 5B | baseline | 172.8s | 34.6 GB | 0% | reference |
| WAN 2.2 5B | tqai_full | 170.5s | 32.2 GB | 50% | CFG hooks fire correctly |
| LTX-2 | baseline | 88.3s | 73.9 GB | 0% | needs MPS float64 fix to run |
| LTX-2 | tqai_full | 90.9s | 74.4 GB | 100% | batched CFG fully shared |
The actual local win: without optimize_vae_memory(), the WAN 2.2 VAE decoder spikes >100 GB on an 81-frame video and OOMs on a 128 GB Mac. With it, peak stays at 32 GB and 20-second videos fit comfortably. See reports/where_tqai_shines.md for the full picture.
v0.4 — Long context (chunked attention, MLX)
| Model | Context | Baseline prefill | Chunked prefill | Slowdown | Output match |
|---|---|---|---|---|---|
| Qwen 3B bf16 | 8K | 1.1s | 3.3s | 3.0× | ✓ |
| Qwen 3B bf16 | 16K | 2.6s | 12.0s | 4.6× | ✓ |
| Qwen 7B Q8 | 16K | 5.8s | 27.5s | 4.7× | ✓ |
| Qwen 7B Q8 | 32K | 17.4s | 68.3s | 3.9× | ✓ |
Honest negative result: Output is bit-identical to baseline (math is correct), but mx.fast.scaled_dot_product_attention is a fused Metal kernel that already implements blocked attention internally. A Python-loop chunked path forces a kernel launch + sync per chunk and loses decisively. Don't enable chunk_attention=True on MLX. The implementation is shipped for CUDA users without FlashAttention, where the win is real.
Compression Configs
KV Cache
The "compressed-storage saved" column is the size reduction of the
serialized KV cache form (compressed indices vs bf16 tensors). It is
not a peak runtime memory reduction — see the honest framing
note at the top of this README and reports/kv_memory_finding.md.
| Config | Avg Bits | Compressed-Storage Saved | Recommended For |
|---|---|---|---|
bits_k=4, bits_v=2 |
3.0 | 80% | Production — best quality/storage balance, byte-identical output |
bits_k=3, bits_v=2 |
2.5 | 84% | Smallest serialized cache, still byte-identical on tested models |
bits_k=4, bits_v=3 |
3.5 | 78% | Quality-sensitive applications |
Named Configs (CLI)
| Config | KV | Hidden | FFN | Use Case |
|---|---|---|---|---|
kv-only |
K4/V2 | — | — | KV cache compression (storage form, quality-preserving) |
kv+hidden8 |
K4/V2 | 8-bit | — | KV + hidden state compression |
kv+hidden6 |
K4/V2 | 6-bit | — | More aggressive hidden compression |
kv+ffn8 |
K4/V2 | — | 8-bit | KV + FFN activation compression |
all8 |
K4/V2 | 8-bit | 8-bit | Full compression at 8-bit |
all6 |
K4/V2 | 6-bit | 6-bit | Full compression at 6-bit |
aggressive |
K3/V2 | 6-bit | 6-bit | Maximum compression |
How It Works
tqai implements two complementary quantizers, both using Lloyd-Max codebooks and the same norm-preservation trick:
PolarQuantizer (default) — TurboQuant Stage 1
- Random orthogonal rotation — Rotates KV vectors by a fixed Haar-distributed d×d matrix to spread information uniformly
- Lloyd-Max scalar quantization — Quantizes each coordinate using precomputed optimal codebooks for the known post-rotation distribution N(0, 1/d)
- Norm preservation — Stores the L2 norm separately in FP16
RotorQuantizer — RotorQuant (Pope 2026)
Replaces the d×d dense rotation with block-diagonal 3×3 quaternion rotations (Clifford rotors from Cl(3,0)). Each group of 3 dimensions is independently rotated by a random unit quaternion, costing O(d) operations instead of O(d²). Quality is identical to PolarQuantizer; the fused Metal kernel runs 2–5× faster on Apple Silicon at the rotation step.
| d | PolarQuant (Metal) | RotorQuant (Metal) | Speedup |
|---|---|---|---|
| 64 | 0.64 ms | 0.34 ms | 1.9× |
| 128 | 0.67 ms | 0.22 ms | 3.0× |
| 256 | 1.54 ms | 0.33 ms | 4.7× |
Both quantizers are fully data-oblivious — no training or calibration required.
Cache Strategies (v0.3.1)
tqai supports three cache reconstruction strategies to balance speed and quality:
| Strategy | Per-token cost | Quality | Use case |
|---|---|---|---|
incremental (default) |
O(1) | Same as full | Production — 2–3x faster than v0.2 |
residual |
O(1) | Better (recent tokens exact) | Quality-sensitive, long context |
full |
O(n) | Baseline | Debugging, compatibility |
# Use residual strategy — last 128 tokens kept uncompressed (KIVI-style)
tqai.patch(model, bits_k=4, bits_v=2, cache_strategy="residual", residual_window=128)
Codebook Solvers (v0.3.1)
Beyond the default Lloyd-Max solver, tqai offers evolutionary and fuzzy codebook optimizers for build-time codebook generation:
- CMA-ES (arXiv:1710.05311) — Evolutionary refinement of Lloyd-Max codebooks, ~0.5% MSE improvement
- Fuzzy C-means (arXiv:1908.05033) — Soft assignment with temperature annealing
- Attention-aware objective (arXiv:2402.14866) — Optimizes codebooks to preserve softmax attention scores rather than raw MSE
QJL Stage 2 (opt-in)
tqai optionally implements QJL (Johnson-Lindenstrauss residual sketch), which corrects the systematic inner-product bias left by Stage 1:
cache = tqai.patch(model, bits_k=4, bits_v=2, use_qjl=True, qjl_sketch_size=64)
QJL trades bias reduction for added variance. For softmax-based attention, variance typically dominates — this is why QJL is off by default. Enable it for very low bit-widths, non-softmax attention, or research use.
v0.4 — Composable Pipeline Architecture
Contributors welcome — bring your paper. The whole point of the v0.4 restructuring is to make integrating new compression research a one-file change. The core (
quantizer.py,cache/hf.py,cache/mlx.py,kernels/) is frozen and stable — every recent paper we tracked (TurboQuant, QuantSparse, DiTFastAttn, BSA, Sheaf, Copresheaf, Fisher, SparseDiT) ships as a plugin inscorers/,strategies/,monitors/, oradapters/without touching the proven core. If you're publishing or reproducing a KV/attention compression paper, the friction to ship it here is a single file plus one line of registration. We actively encourage PRs that add new papers — see the Paper Playground below for the existing references and the "Adding a new paper" example for the exact 5-minute workflow.
v0.4 introduces a plugin-based pipeline so adding a new compression paper
costs one file instead of refactoring quantizer.py/hf.py/mlx.py.
The default path (pipeline=None) is byte-identical to v0.3.1 — zero
overhead, all 293 pre-v0.4 tests pass unchanged.
tqai.patch(model, pipeline={"scorer": "...", "strategy": "...", "monitor": "..."})
│
├─ Scorer — per-token importance (palm, snr, fisher, fisher_static, sheaf, bsa)
├─ Strategy — how to compress (tiered, delta, delta2, window, cfg_sharing)
├─ Monitor — runtime adjustment (stability, lyapunov)
├─ Adapter — model family (llm, dit, wan)
└─ Optimization — GA policy search over pipeline configs
Adding a new paper takes one file
# src/tqai/scorers/my_paper.py
from tqai.pipeline.base import ScoredEntry
class MyPaperScorer:
name = "my_paper"
def score(self, x, layer_idx, step=None, context=None):
# ... compute importance score ...
return [ScoredEntry(data=x, score=0.7, tier=2, metadata={})]
def reset(self): ...
# src/tqai/scorers/__init__.py — one line
from tqai.scorers.my_paper import MyPaperScorer
register_scorer("my_paper", MyPaperScorer)
tqai plugins # Verify it's discoverable
tqai run "prompt" -m Qwen/Qwen2.5-7B --scorer my_paper --strategy delta
Paper Playground
Each paper from the recent KV/attention compression literature ships as a single plugin you can mix and match. The implementations are intentionally self-contained reference points — use them as is, fork them, or use the framework to drop in your own.
| Paper | Plugin | Module | Implements |
|---|---|---|---|
| TurboQuant (arXiv:2504.19874, Zandieh et al., ICLR 2026) | palm scorer |
scorers/palm.py |
EMA novelty/surprise scoring |
| RotorQuant (Pope 2026) | RotorQuantizer |
quantizer_rotor.py |
Block-diagonal Clifford rotor rotation; fused Metal kernel 2–5× faster than PolarQuantizer at equal quality |
| Min-SNR Weighting (arXiv:2303.09556, Hang et al., ICCV 2023) | snr scorer |
scorers/snr.py |
Cosine + linear diffusion schedule scoring |
| APTQ (arXiv:2402.14866, Guan et al., DAC 2024) + Fisher Information (arXiv:1906.08589, Frantar et al.) | fisher (runtime proxy, broken) + fisher_static (offline calibration, works) |
scorers/fisher.py, scorers/fisher_static.py, optimization/fisher_calibration.py |
Two implementations: the runtime fisher uses mean(x²) proxy and over-allocates bits (don't use); the offline calibrate_fisher() does real forward+backward passes on a calibration set, computes per-layer K/V Fisher diagonals, saves to JSON; the fisher_static scorer loads the JSON and serves precomputed scores at runtime (constant-time lookup, no per-call overhead) |
| Sheaf Attention (arXiv:2601.21207, AAAI 2026) | sheaf scorer |
scorers/sheaf.py |
Discrete Laplacian harmonicity classifier |
| BSA — Bidirectional Sparse Attention (arXiv:2509.01085) | bsa scorer |
scorers/bsa.py |
Block centroid saliency (KV side only) |
| TurboQuant tiered allocation | tiered strategy |
strategies/tiered.py |
Dual-quantizer routing by score |
| DiTFastAttn step sharing (arXiv:2406.08552, Yuan et al., NeurIPS 2024) | delta strategy |
strategies/delta.py |
First-order inter-step Δ |
| QuantSparse (arXiv:2509.23681) | delta2 strategy |
strategies/delta2.py |
Second-order Δ² with order-2 → 1 → full fallback |
| DiTFastAttn WA-RS | window strategy |
strategies/window.py |
Similarity-based attention output cache |
| DiTFastAttn CFG sharing + Classifier-Free Guidance (arXiv:2207.12598, Ho & Salimans) | cfg_sharing strategy |
strategies/cfg_sharing.py |
Split-pass (WAN) + batched (LTX) modes |
| Copresheaf Neural Networks (arXiv:2505.21251) | registry extension | codebook/registry.py |
head_type discriminator (per-head codebooks) |
| Attention Analysis in VDiTs (arXiv:2504.10317) | skip_layers config |
pipeline/runner.py |
Per-layer compression bypass for non-sparse layers |
| Finite-time Lyapunov exponents (general dynamical systems) | lyapunov monitor |
monitors/lyapunov.py |
Local divergence rate estimator |
| Attention entropy tracking | stability monitor |
monitors/stability.py |
EMA entropy-shift detection |
Mixing plugins
import tqai
# LLM with novelty scoring + temporal delta
cache = tqai.patch(
model,
bits_k=4, bits_v=2,
pipeline={
"scorer": "palm",
"strategy": "delta",
"scorer_kwargs": {"alpha": 0.5, "ema_decay": 0.95},
},
)
# Diffusion with SNR-driven second-order delta
cache = tqai.patch(
model,
pipeline={
"scorer": "snr",
"strategy": "delta2",
"monitor": "stability",
"scorer_kwargs": {"schedule": "cosine"},
"skip_layers": [0, 1, 28, 29], # protect non-sparse layers
},
)
GA policy search (offline)
from tqai.optimization import GASearch
def evaluate(genome):
# Return -loss for the candidate pipeline configuration
return -run_evaluation(genome.decode(scorers, strategies, monitors))
best = GASearch(population_size=20, generations=10, objective=evaluate, seed=42).run()
print(best.decode(scorers, strategies, monitors))
Honest results
| Plugin / config | Verdict on Apple Silicon |
|---|---|
bits_k=4, bits_v=2 (v0.3.1 default) |
Ship it — byte-identical to baseline on 6 models |
RotorQuantizer (MLX + Metal) |
Ship it — 2–5× faster rotation than PolarQuantizer at identical quality; fused Metal kernel, no Python round-trip |
RotorQuantizer (PyTorch / no Metal) |
Don't use for perf — numpy round-trip negates the O(d) advantage; identical quality but slower |
optimize_vae_memory() for WAN/LTX |
Ship it — eliminates 100 GB+ VAE spike |
patch_mps_compatibility() for LTX-2 |
Ship it — pure unblocker (float64 → float32 RoPE) |
palm + delta / snr + delta2 |
Ship it — −60% NMSE on synthetic streaming data |
Pipeline framework (pipeline=None default) |
Ship it — zero overhead, +ergonomics |
cfg_sharing strategy |
Functional, no speedup — Python overhead == compute saved |
chunk_attention=True (MLX) |
Don't use — fused Metal kernel wins; 3-5× slower |
fisher scorer (runtime, squared activation proxy) |
Don't use — over-allocates bits (NMSE 12× worse than baseline) |
fisher_static scorer + tqai calibrate (v0.4) |
Ship it — proper offline gradient-based Fisher calibration, constant-time runtime lookup |
Full benchmark write-up: reports/where_tqai_shines.md.
Paper coverage matrix: reports/paper_coverage_report.md.
Standalone HTML: reports/benchmark_report.html.
Diffusers Video Pipelines (WAN 2.2, LTX-2)
import tqai
from diffusers import WanPipeline
from tqai.dit import optimize_vae_memory, patch_mps_compatibility
pipe = WanPipeline.from_pretrained("Wan-AI/Wan2.2-TI2V-5B-Diffusers", ...)
pipe.to("mps")
# MUST DO: prevents VAE decoder OOM at long videos
optimize_vae_memory(pipe)
# MUST DO for LTX-2: fixes float64 RoPE crash on MPS
patch_mps_compatibility(pipe)
video = pipe("A cat surfing", num_frames=81)
| Utility | What it fixes |
|---|---|
optimize_vae_memory(pipe) |
WAN VAE 100 GB+ spike → 8 GB peak (88% saved) |
patch_mps_compatibility(pipe) |
LTX-2 RoPE float64 crash on Apple MPS |
patch_cfg_sharing(pipe) |
Caches conditional attention output for unconditional pass (50%–100% share rate). Note: functional but no MPS speedup yet — see honest results above. |
Offline Fisher Information Calibration (v0.4)
The runtime fisher scorer uses mean(x²) as a proxy for the Fisher
Information diagonal. This proxy over-allocates bits and routes
everything to the high-bit tier (NMSE 12× worse than baseline in our
synthetic benchmark). The fix is offline gradient-based calibration:
run a small calibration set through forward+backward passes once,
collect real per-layer K/V Fisher diagonals, save to JSON, then load
the JSON at runtime via the fisher_static scorer.
# Step 1: calibrate (PyTorch only, ~seconds for small models)
tqai calibrate --model Qwen/Qwen2.5-3B-Instruct --output qwen-3b-fisher.json
# Step 2: use the calibration at inference time
import tqai
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
cache = tqai.patch(
model,
bits_k=4, bits_v=2,
pipeline={
"scorer": "fisher_static",
"scorer_kwargs": {"calibration_path": "qwen-3b-fisher.json"},
"strategy": "tiered",
},
)
The static scorer is a constant-time dictionary lookup at runtime — no
gradient computation, no proxy, no per-call overhead. The calibration
JSON encodes per-layer importance derived from real ∂L/∂θ measurements.
For programmatic use:
from tqai.optimization import calibrate_fisher
cal = calibrate_fisher(
model=model,
tokenizer=tokenizer,
prompts=["...", "...", "..."], # 8-32 prompts is enough
output_path="my-fisher.json",
)
print(f"Calibrated {cal.num_layers} layers from {cal.num_samples} samples")
CLI
# Show environment and library info
tqai info
# v0.4 — list available pipeline plugins
tqai plugins
# Quantization accuracy benchmark
tqai benchmark
tqai benchmark --bits-k 3 --bits-v 2 --head-dim 128
# Generate text with compression
tqai run "Explain gravity" --model mlx-community/Qwen2.5-7B-Instruct-8bit
tqai run "Explain gravity" --model Qwen/Qwen2.5-3B-Instruct --backend torch
# v0.4 — generate with a pipeline plugin
tqai run "Explain gravity" --model Qwen/Qwen2.5-3B-Instruct --scorer palm --strategy delta
# Run with QJL Stage 2
tqai run "Explain gravity" --model Qwen/Qwen2.5-3B-Instruct --use-qjl
# Compare baseline vs compressed side by side
tqai compare "Explain gravity" --model mlx-community/Qwen2.5-7B-Instruct-8bit
# Pre-convert a model for faster startup
tqai convert --model mlx-community/Qwen2.5-7B-Instruct-8bit --output ./qwen7b-tqai/
# v0.4 — offline gradient-based Fisher Information calibration
# Runs forward+backward passes on a small calibration set, saves per-layer
# Fisher diagonals to JSON. Use with the `fisher_static` scorer at runtime.
tqai calibrate --model Qwen/Qwen2.5-3B-Instruct --output qwen-3b-fisher.json
tqai calibrate --model Qwen/Qwen2.5-3B-Instruct --output qwen-3b-fisher.json --num-samples 32
# Baseline (no compression)
tqai run "Explain gravity" --model mlx-community/Qwen2.5-7B-Instruct-8bit --no-tqai
Advanced Options
cache = tqai.patch(
model,
bits_k=4, # Bits per key coordinate (2–8)
bits_v=2, # Bits per value coordinate (2–8)
sink_tokens=4, # Keep first N tokens uncompressed (attention sinks)
backend="torch", # Force backend: "torch" or "mlx"
device="cuda", # PyTorch device (ignored for MLX)
use_qjl=False, # Enable QJL Stage 2 residual correction (research)
qjl_sketch_size=64, # JL sketch dimension (tradeoff: quality vs memory)
cache_strategy="auto", # "auto" (incremental), "residual", or "full"
residual_window=128, # Recent tokens kept uncompressed (residual strategy)
)
Running Tests
# Install dev dependencies
pip install tqai[dev]
# Unit + accuracy tests (~447 tests with v0.4, <90s)
pytest tests/ --ignore=tests/test_e2e_models.py --ignore=tests/test_e2e_large_models.py
# End-to-end with real models (requires model downloads)
pytest tests/test_e2e_models.py -v -s
# Large model E2E (7B–14B, requires ~20GB disk)
pytest tests/test_e2e_large_models.py -v -s
Project Structure
src/tqai/
├── __init__.py # patch(), unpatch(), TurboQuantConfig
├── config.py # Configuration dataclass
├── _patch.py # Backend router (HF / mlx-lm / DiT)
├── quantizer.py # PolarQuantizer (core algorithm + QJL Stage 2)
├── quantizer_rotor.py # RotorQuantizer (Clifford rotor block-diagonal rotation, Pope 2026)
├── attention.py # Chunked SDPA via online softmax (MLX, v0.4)
├── kernels/ # Fused Metal GPU kernels
├── hooks.py # Forward-pass activation compression (PyTorch + MLX)
├── module_utils.py # Transformer + DiT layer inspection
├── backend/ # PyTorch + MLX abstraction layer
├── codebook/ # Codebook solvers + per-head-type registry (copresheaf)
├── cache/ # HuggingFace DynamicCache + mlx-lm KVCache integrations
│
│ # ─── v0.4 composable pipeline ──────────────────────────
├── pipeline/ # Protocol + registry + runner + builder
│ ├── base.py # Scorer / Strategy / Monitor / ModelAdapter protocols
│ ├── registry.py # Name-based plugin registration
│ ├── runner.py # CompressionPipeline (scorer → strategy → quantizer → monitor)
│ └── __init__.py # build_pipeline()
├── scorers/ # palm, snr, fisher, fisher_static, sheaf, bsa
├── strategies/ # tiered, delta, delta2, window, cfg_sharing
├── monitors/ # stability, lyapunov
├── adapters/ # llm, dit, wan
├── optimization/ # GA policy search (genome + ga_policy) + Fisher calibration
└── dit/ # CFG sharing patch + VAE memory optimization + MPS fixes
benchmarks/
├── benchmark_forward.py # Full KV + activation throughput (v0.3.1 + v0.4 configs)
├── benchmark_pipeline.py # v0.4 — synthetic NMSE for every scorer × strategy
├── benchmark_long_context.py # v0.4 — chunked attention at 4K/8K/16K/32K context
├── benchmark_video.py # v0.4 — WAN 2.2 + LTX-2 video generation
├── eval_perplexity.py # Perplexity evaluation helper
└── results/ # Benchmark JSON results + actual MP4 outputs
reports/
├── where_tqai_shines.md # Honest assessment: where v0.4 wins / loses on Apple Silicon
├── paper_coverage_report.md # Per-paper implementation status
└── benchmark_report.html # Standalone HTML with charts
paper/
├── tqai_paper.md # Research paper (markdown source)
└── tqai_paper.tex # Research paper (LaTeX)
Paper
Core algorithm
This library implements the TurboQuant algorithm from Google Research:
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni ICLR 2026 | arXiv:2504.19874 | Google Research Blog
KV cache compression family
- PolarQuant (AISTATS 2026) — Random rotation + polar coordinate quantization (the core of tqai)
- QJL (AAAI 2025) — Quantized Johnson-Lindenstrauss residual correction (available in tqai as
use_qjl=True) - KIVI (ICML 2024) — Residual buffer strategy for KV cache compression (
cache_strategy="residual") - KVQuant (NeurIPS 2024) — Fused dequant-attention kernel design
- APTQ (DAC 2024) — Attention-aware post-training quantization (attention-aware codebook objective + Fisher proxy)
Diffusion / video compression (v0.4 paper playground)
- DiTFastAttn (NeurIPS 2024) — Window attention with residual sharing, step-wise sharing, CFG sharing (
window,delta,cfg_sharing) - QuantSparse (2025) — Second-order residual quantization for Wan2.1 KV cache (
delta2) - BSA — Bidirectional Sparse Attention (2025) — Query + KV joint sparsification (
bsascorer; KV side only — Q-side needs custom kernel) - Min-SNR Weighting (Hang et al., ICCV 2023) — Cosine SNR schedule for diffusion training (
snrscorer) - Classifier-Free Guidance (Ho & Salimans, 2022) — The CFG protocol that
cfg_sharingaccelerates - Attention Analysis in Video DiTs (2025) — Identifies non-sparse layers that must not be compressed (
skip_layers) - SparseDiT (NeurIPS 2025) — Tri-segment layer allocation (validates per-layer policy via
tiered+skip_layers)
Topological / category-theoretic foundations
- Sheaf Attention & Local Consistency (AAAI 2026) — Cellular sheaf framework, harmonic substructure (
sheafscorer) - Copresheaf Neural Networks (2025) — Per-cell stalks, justifies per-head-type codebooks (
codebook/registry.py)
Chunked / memory-efficient attention (v0.4 cherry-pick)
- Flash Attention (Dao et al., NeurIPS 2022) — Tile-wise attention with online softmax
- Milakov & Gimelshein, "Online normalizer calculation for softmax" (NVIDIA tech report, 2018) — The online softmax recurrence used in
attention.py
Codebook solvers (build-time)
- DSQ (2019) — Differentiable soft quantization (fuzzy codebook solver)
- IDE-LBG (2017) — Evolutionary codebook optimization (CMA-ES solver)
Theoretical context (informs design, not directly implemented)
- Fisher Information for Neural Network Compression (arXiv:1906.08589) — Theoretical basis for the
fisherscorer (we ship a squared-activation proxy; true gradient-based Fisher is offline-only) - Information geometry / Fisher-Rao geodesics — Justifies the cosine schedule in
snrscorer as the geodesically-optimal SNR allocation
Contributing
See CONTRIBUTING.md. All commits require a DCO sign-off (git commit -s).
Paper integrations are explicitly welcome. If you're working on a KV
cache, attention, or diffusion compression paper, the v0.4 architecture
is designed so you can land a reference implementation as a single file
without touching quantizer.py, cache/, or any other core module.
The fast path:
- Pick the right slot —
scorers/(per-token importance),strategies/(how to compress),monitors/(runtime adaptation), oradapters/(a new model family). - Implement the matching protocol from
pipeline/base.py(3-4 methods). - Register your plugin with one line in the directory's
__init__.py. - Add tests in
tests/and a row to the Paper Playground table above. - Open a PR — the core stays untouched, so review is fast.
The point of the freeze on the core is reproducibility: the
PolarQuantizer math, the Lloyd-Max codebooks, and the cache strategies
have been benchmarked across 6+ models with Δppl=0.00. New ideas
should compose with that baseline, not replace it.
License
MIT
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tqai-0.6.0.tar.gz.
File metadata
- Download URL: tqai-0.6.0.tar.gz
- Upload date:
- Size: 2.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
eccc6aa6a54182043c44e51b3171714261281a9f03e55b9cabd1e61d0b1a3952
|
|
| MD5 |
fcbf47c4aaf2db2bc4d7eb8beb52dfb9
|
|
| BLAKE2b-256 |
93c4d5e079c940d1f5fa756970af27adf5a27eeb3b085733f78484fd89865cc7
|
Provenance
The following attestation bundles were made for tqai-0.6.0.tar.gz:
Publisher:
release.yml on AlphaWaveSystems/tqai
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
tqai-0.6.0.tar.gz -
Subject digest:
eccc6aa6a54182043c44e51b3171714261281a9f03e55b9cabd1e61d0b1a3952 - Sigstore transparency entry: 1358942113
- Sigstore integration time:
-
Permalink:
AlphaWaveSystems/tqai@be9f697abbe0fd967e6f5eeacb9b23a330b944ad -
Branch / Tag:
refs/tags/v0.6.0 - Owner: https://github.com/AlphaWaveSystems
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@be9f697abbe0fd967e6f5eeacb9b23a330b944ad -
Trigger Event:
push
-
Statement type:
File details
Details for the file tqai-0.6.0-py3-none-any.whl.
File metadata
- Download URL: tqai-0.6.0-py3-none-any.whl
- Upload date:
- Size: 135.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e6b6e01271e0d0e9f3d36b83949924bcd7da8eb9e87a5aabc8cd640879032055
|
|
| MD5 |
c82e1c93717d58b51180ca94bfefb1e6
|
|
| BLAKE2b-256 |
f01e239f7ed3b0e511e9f9b020b44d19aa26716ce9a40ad88cefb3b936c5aeda
|
Provenance
The following attestation bundles were made for tqai-0.6.0-py3-none-any.whl:
Publisher:
release.yml on AlphaWaveSystems/tqai
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
tqai-0.6.0-py3-none-any.whl -
Subject digest:
e6b6e01271e0d0e9f3d36b83949924bcd7da8eb9e87a5aabc8cd640879032055 - Sigstore transparency entry: 1358942130
- Sigstore integration time:
-
Permalink:
AlphaWaveSystems/tqai@be9f697abbe0fd967e6f5eeacb9b23a330b944ad -
Branch / Tag:
refs/tags/v0.6.0 - Owner: https://github.com/AlphaWaveSystems
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@be9f697abbe0fd967e6f5eeacb9b23a330b944ad -
Trigger Event:
push
-
Statement type: