Productionized TurboQuant KV cache compression for vLLM and SGLang (imported as `tqkv`)
Reason this release was yanked:
wrong licene info
Project description
Turbo Attention
A modular attention backend for vLLM and SGLang. Custom CUDA kernels, full CUDAGraph capture, asymmetric K/V quantization, hybrid-model support. 3.8× more KV context on the same GPU, under 1% accuracy loss. Built on FlashAttention.
PyPI: turbo-attn · Import: tqkv · License: Apache-2.0
Install
pip install turbo-attn # codec + CUDA/Triton kernels
pip install "turbo-attn[vllm]" # + vLLM attention backend
pip install "turbo-attn[all]" # + SGLang, FlashInfer, flash-attn, eval harness
Quickstart
import torch
from tqkv import TurboKVCodec
codec = TurboKVCodec(head_dim=128, bit_width=4)
keys = torch.randn(8, 128)
packed, norms = codec.compress_k(keys) # 4096 B → 512 B + 16 B
recon = codec.decompress_k(packed, norms)
Serve a model end-to-end:
# vLLM
vllm serve Qwen/Qwen3.5-0.8B \
--kv-cache-dtype tqkv --attention-backend custom \
--max-model-len 250000 --trust-remote-code
# SGLang (after `import tqkv.integrations.sglang as t; t.register()`)
python -m sglang.launch_server --model-path Qwen/Qwen3.5-0.8B \
--kv-cache-dtype tqkv --attention-backend tqkv --trust-remote-code
Full walkthroughs:
- vLLM quickstart
- SGLang quickstart
- examples/ — runnable Python snippets
- ARCHITECTURE.md — codebase tour for contributors
- docs/public_api.md — supported API surface
Repo layout
The supported public surface lives at the top of each tree:
tqkv/— the package (codec, kernels, runtime, vLLM/SGLang plugins, calibration pipeline).docs/,docker/,scripts/,experiments/— public docs, deploy recipes, helper scripts, research notes.- Anything under an
internal/subdirectory (docs/internal/,docker/internal/,scripts/internal/,experiments/internal/) is engineering-only and unsupported — it may move or break between releases. The wheel never ships these.
Why this exists
KV cache memory grows linearly with context length and dominates GPU memory beyond ~32K tokens. Google's TurboQuant (Zandieh et al., ICLR 2026) solved this in principle: a near-optimal KV compression scheme using Walsh–Hadamard rotation and Lloyd–Max codebooks, with provable distortion bounds. Multiple open-source references have appeared since the paper — but none ship production CUDA kernels, full CUDAGraph capture, or validation on the new generation of hybrid state-space / attention architectures.
turbo_attn is the implementation that actually ships in production. Custom CUDA throughout. Full FULL_AND_PIECEWISE CUDAGraph capture on prefill and decode. Asymmetric K/V bit widths across all nine {2,4,8}² combinations. Per-group block pools so attention-plus-Mamba models don't waste the memory they save. Drop-in backends for both vLLM and SGLang.
Headline results
Qwen3.5-27B-AWQ on 2× RTX 4090 (TP=2, TQ4 KV, MTP=3):[^27b]
| Metric | BF16 KV | turbo_attn (TQ4) |
|---|---|---|
| KV capacity | 370K tokens | 1,360K tokens (3.7×) |
| Throughput (1 user) | 131 tok/s | 117 tok/s (89%) |
| Throughput (8 users) | — | 355 tok/s |
| Throughput (128 users) | — | 1,393 tok/s |
| TTFT | 58 ms | 78 ms |
| Needle @ 99K | — | FOUND |
[^27b]: Measured on an internal AWQ-INT4 build of Qwen3.5-27B; the artifact itself is not on the public HF Hub. The methodology, flags, and harness are public — numbers reproduce on any AWQ-quantized 27B-class model that fits the same TP=2 / 2× RTX 4090 layout.
Qwen3.5-0.8B at 250K context (single RTX 4090):
| Config | Prefill tok/s | Decode tok/s | Needle @ 250K | Compression |
|---|---|---|---|---|
| BF16 KV | 17,908 | 336 | OOM at 64K | 1.0× |
| TQ4 | 17,347 | 354 | FOUND | 3.8× |
| TQ3 | 17,499 | 296 | FOUND | 4.9× |
| TQ2 | 17,614 | 314 | MISS (found @ 100K) | 7.1× |
TQ4 decode is 5% faster than BF16 at long context — compressed KV reads less memory, and the dequant overhead is smaller than the bandwidth savings.
Perplexity (Wikitext-2, Qwen3.5-0.8B): Measured via vLLM prompt_logprobs on the production prefill kernel. Additional models and context lengths in progress.
| Config | PPL | Δ vs BF16 | (Δ%) |
|---|---|---|---|
| BF16 | 24.61 | 0.000 | (0.00%) |
| TQ8 (K8V8) | 24.59 | −0.02 | (−0.08%) |
| TQ4 (K4V4) | 24.87 | +0.27 | (+1.08%) |
| TQ3 | 25.67 | +1.07 | (+4.33%) |
| TQ2 (K2V2) | 27.87 | +3.27 | (+13.28%) |
| K4V8 | 24.62 | +0.01 | (+0.06%) |
| K8V4 | 24.81 | +0.20 | (+0.81%) |
| K2V4 | 25.50 | +0.90 | (+3.65%) |
| K6V3 | 25.52 | +0.91 | (+3.71%) |
| K8V2 | 27.42 | +2.81 | (+11.43%) |
| K4V2 | 27.40 | +2.79 | (+11.35%) |
Measured via vLLM prompt_logprobs on the production turbo_attn prefill kernel (not the experimental HF reference path). 8 chunks × 512 wikitext-2 tokens, subprocess-isolated. SEM ≈ ±2 PPL across chunks.
Reproduce the headline numbers
Every number in this README is reproducible from benchmarks/run_all.py:
pip install turbo-attn
python benchmarks/run_all.py --profile smoke # ~2 minutes, sanity check
python benchmarks/run_all.py --profile quick # ~8 minutes, headline numbers
Output lands in benchmarks/results/<timestamp>/ with summary.md (human-readable table), summary.json and per-suite CSVs (perplexity.csv, throughput.csv, memory.csv), per-subprocess logs under logs/, and a system.json stamp recording GPU, driver, CUDA, package versions, and git SHA.
Reference numbers we publish against are committed under benchmarks/baselines/qwen3.5-0.8b/ and touchstone/baselines/. To stand up a local server for ad-hoc prompts, see examples/02_vllm_server.py.
Hardware tested
| GPU class | SM | Status |
|---|---|---|
| RTX 4090 / L40 / Ada | 89 | tested in CI and on the headline benches |
| A100 / Ampere | 80, 86 | builds; not yet validated end-to-end |
| H100 / Hopper | 90 | builds; FP8 attention not yet validated |
| B200 / Blackwell | 100 | not yet ported |
What turbo_attn is — and isn't
turbo_attn is an attention backend in the vLLM/SGLang sense: it owns the KV cache layout and the kernels that read it. Everything outside attention is untouched and orthogonal.
Layers: who owns what
| Layer | Owner | turbo_attn's role |
|---|---|---|
| Tokenizer, sampler, guided decoding, LoRA | vLLM | none |
| Request scheduler, batching, chunked prefill, prefix caching | vLLM | we consume what it sends |
| Paged block allocator | vLLM | we register our per-token byte budget |
| KV cache on-wire format | turbo_attn | rotate + quantize + bitpack into paged blocks |
| Attention compute (decode) | turbo_attn | fused CUDA kernel, end-to-end |
| Attention compute (prefill) | turbo_attn | one of three paths; see below |
| CUDAGraph capture orchestration | vLLM | we declare capturable modes; vLLM captures |
| Hybrid-model block dispatch | vLLM + our per-group BlockPool |
we ensure compressed pages don't inflate |
| Weight loading (BF16 / AWQ / GPTQ / FP8) | vLLM + quant libs | orthogonal; composes cleanly |
| TP/PP collectives | vLLM | orthogonal; we run per-rank |
Attention compute by run mode
| Run mode | KV storage | KV load + dequant | Q·K | softmax | P·V | output |
|---|---|---|---|---|---|---|
| Decode (fused CUDA) | ours (compressed) | ours | ours | ours (online) | ours | ours |
| Decode split-K (batch=1) | ours | ours | ours | ours (2-phase) | ours | ours |
| MTP verification (BLOCK_M 2–8) | ours | ours | ours | ours | ours | ours |
| Prefill — FA4 inline-dequant | ours | ours (subclass override) | FA4 | FA4 | FA4 | FA4 |
| Prefill — CUDA C++ (v9) | ours | ours | ours (mma.sync) |
ours | ours (mma.sync) |
ours |
| Prefill — decompress + stock FA | ours → BF16 scratch | ours (Triton) | stock FA | stock FA | stock FA | stock FA |
Stock FlashAttention is never called on compressed bytes. When it runs, it runs unmodified on a decompressed scratch buffer.
What's novel
1. Our own warp-fused compress kernel. Walsh–Hadamard rotation via warp butterfly (__shfl_xor_sync, five shuffle stages), quantization, and bit-packing — all in a single CUDA launch. Measured 5× faster than a Triton baseline (~7 μs vs ~35 μs per layer at batch 1 on RTX 4090), contributing ~10% end-to-end decode speedup.
2. Unified fused decode kernel, templated on BLOCK_M. Single source handles ordinary decode (BLOCK_M=1), speculative decoding verification (BLOCK_M=2..8), and asymmetric K/V bit widths (all nine {2,4,8}² combinations). NCU-profiled: 40 registers, zero spills, at the per-SM performance ceiling. CUDAGraph-safe. MTP verification is a native fused path, not a fallback.
3. FA4 with inline dequantization (novel). A CuTeDSL subclass of FlashAttention-4 that overrides the load_K and load_V stages to dequantize compressed bytes directly into register tiles during the MMA pipeline. No decompress buffer. Shared-memory pressure actually decreases because the staging region for packed bytes is ≤1/8 the size of the BF16 tile it replaces. Generalized over all nine {2,4,8}² K/V configurations. Intended as an upstream contribution to FlashAttention.
4. Production CUDA C++ prefill kernel. XOR shared-memory swizzle (col ^ ((row & 7) << 3) — 1.75× improvement), staged cp.async.cg loads, ldmatrix.sync.aligned.m8n8.x4 fragment loads, mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 tensor cores, register-held query tiles, warp-shuffle softmax reductions, pack_gqa query layout. Templated on K/V bit widths and head dimensions. 128–200 registers, zero spills, 2 CTAs/SM.
5. Full FULL_AND_PIECEWISE CUDAGraph capture. Decode captured as a full graph, prefill captured piecewise per layer. Both the FA4 subclass and the CUDA C++ prefill kernel are allocation-free and capture-safe. This is the bleeding-edge vLLM CUDAGraph mode; most backends don't use it yet.
6. Per-group BlockPool for hybrid models. Attention-plus-Mamba/GatedDeltaNet/ShortConv models suffer from block-pool inflation in vLLM's default allocator, because attention pages get padded to the Mamba state page size. With TQ4 compression this waste becomes catastrophic (effective capacity drops to 2–3%). The fork carries a per-layer-group block pool that lets each layer type use its natural page size. Model-agnostic; benefits any hybrid architecture.
7. Asymmetric K/V bit widths. All nine combinations of {2,4,8}² in both decode and prefill, configured independently via TQKV_K_BITS and TQKV_V_BITS.
The optimal asymmetric K/V configuration is model-specific. On Qwen3.5-0.8B (hybrid attention + GatedDeltaNet), value precision matters more — K4V8 is essentially lossless (+0.06% PPL) while K8V4 is +0.81%. On other architectures with smaller head dimensions and more attention layers (e.g. Qwen3-0.6B, head_dim=64, 28 attention layers), key precision dominates — K8V4 is essentially lossless there while symmetric TQ4 diverges. The direction depends on which side of the model has wider activation distributions, not a universal rule. We recommend measuring per architecture with the included kv_norm_profile.py diagnostic and the tier2_wikitext_quick.py sweep harness.
8. Sliding-window attention support. Per-layer SlidingWindowSpec propagates window size into the decode kernel mask. Works on Gemma, Mistral variants, and any model that declares sliding-window layers.
9. Native MTP / speculative decoding. BLOCK_M>1 verification uses the same unified decode kernel, amortizing the KV read across all verified tokens. No decompress-and-fallback path. MTP=3 reaches ~96% of BF16 speed on Qwen3.5-27B at TQ4 with 3.7× the KV capacity.
10. Orthogonal to weight quantization. Composes cleanly with AWQ, GPTQ, FP8, and bitsandbytes. Validated on Qwen3.5-27B-AWQ-4bit at 170K context, TP=2.
Supported models
| Architecture | Type | Status | Notes |
|---|---|---|---|
| Qwen3.5 (0.8B, 27B) | Hybrid (full-attn + GatedDeltaNet) | ✅ Primary | 27B via AWQ-4bit, TP=2, 1.36M context validated |
| Qwen3.6-MoE | Hybrid MoE (full-attn + GatedDeltaNet + experts) | ✅ Validated | TP=2, 2.94× KV capacity (888k vs 302k BF16); throughput parity warm |
| LFM2-8B-A1B | Hybrid MoE (full-attn + ShortConv + 32 experts) | ✅ Validated | Needle 11/12 (matches BF16), PPL within 4% |
| Gemma, Mistral (sliding window) | Dense w/ SWA | ✅ Backend supports | Spot-tested |
| Llama-3, Mixtral, DeepSeek-V3, Command-R+ | Dense / MoE | 🗺️ Roadmap | Expected to work; not yet explicitly validated |
| Pure Mamba / SSM | State-space | N/A | Mamba has no KV to compress |
turbo_attn is architecture-neutral: any model vLLM can serve with a paged KV cache can use it. The tested list reflects what's been benchmarked, not what works.
How TurboQuant works
- Rotate each KV vector with a fast Walsh–Hadamard transform. Raw KV values have uneven distributions; after rotation every coordinate follows the same Gaussian.
- Normalize — store the vector's magnitude as a single BF16 value.
- Quantize each coordinate to a shared codebook (4 values for TQ2, 16 for TQ4, up to 256 for TQ8). Because the rotated distribution is uniform across coordinates, a single shared codebook achieves near-optimal distortion.
At TQ4, head_dim=128: 64 packed bytes + 4 norm bytes = 68 B vs 256 B at BF16 = 3.8× compression. The same arithmetic extends to all supported widths: 2-bit → 7.1×, 3-bit → 4.9×, 8-bit → 1.9× (lossless).
Attention scores computed on rotated KV are bit-identical to attention on unrotated KV, provided the query is rotated by the same matrix before the dot product. We pre-rotate the query once per request and compute everything in the rotated space, never decompressing into the original coordinate system.
Architecture
Decode path
One fused CUDA kernel (turbo_attn/kernels/_cuda_decode_unified.cu) handles all decode math: unpacks nibbles from compressed KV pages, looks up centroids from a shared-memory codebook, multiplies by per-token norms, computes Q·K and P·V with online softmax, and writes outputs — in a single pass. No decompress buffer, no external kernel calls. Templated on HEAD_DIM, GQA_RATIO, BLOCK_M, TQ_K_BIT_WIDTH, and TQ_V_BIT_WIDTH; the compiler generates one variant per tuple. Split-K variant available for batch=1 workloads where grid utilization is the bottleneck.
Prefill: three paths, FULL_AND_PIECEWISE capture
All three prefill paths run at approximately BF16-FlashAttention speed on tested workloads.
- Path A — FA4 inline-dequant (default). CuTeDSL subclass of FlashAttention-4 that overrides K/V loads to dequantize directly into register tiles. No decompress buffer. CUDAGraph-captured via piecewise mode.
- Path B — CUDA C++ prod kernel (
TQKV_PREFILL_ENGINE=cuda). Hand-written kernel with XOR swizzle,cp.async,ldmatrix, andmma.sync. Also piecewise-captured. - Path C — Decompress + stock FA (
=decompress). Triton decompress into a small scratch buffer, then stock FlashAttention. Fallback for maximal compatibility and for very long chunks where fused dequant overhead compounds. Not CUDAGraph-captured.
An adaptive dispatcher (=adaptive) picks Path A for short chunks and Path C for chunks above a hand-tuned crossover. Adaptive is an explicit opt-in — it is never auto-selected, because silent switches between engines make perf numbers incoherent and can hide fa4 regressions in benchmarks. Use fa4 (the default) for stable measurement.
CUDAGraph mode
The backend requests CUDAGraphMode.FULL_AND_PIECEWISE when the prefill engine is fa4 (the default) or adaptive (explicit opt-in). Decode runs as a full captured graph; prefill runs as per-layer piecewise-compiled graphs. The FA4 subclass and the CUDA C++ prefill kernel are both allocation-free and fully compatible with this mode. Any other prefill engine (decompress, triton, cuda) falls back to FULL_DECODE_ONLY.
Compress kernel
Our warp-fused CUDA compress kernel runs on every KV write. One warp per (token, head) pair performs the Walsh–Hadamard rotation via five __shfl_xor_sync butterfly stages, reduces the norm via __shfl_down_sync, scalar-quantizes against the codebook, bit-packs indices into bytes, and scatters directly to the block offset — all in a single kernel. ~7 μs per layer at batch 1, vs ~35 μs for a Triton baseline.
Install and use
Full end-to-end walkthroughs for each engine:
- vLLM quickstart — install the vLLM fork, serve a model, query it, validate compression.
- SGLang quickstart — same flow via SGLang's attention-backend plugin.
vLLM
For vLLM serving, install the vllm fork (turbo-attn branch). The fork is a thin overlay on vllm/vllm-openai:v0.19.0 that wires the "tqkv" kv_cache_dtype through vLLM's config system and adds per-group block-pool bookkeeping for hybrid models. The actual backend (~2000 lines) lives in this package, not the fork; the full per-file breakdown is in docker/PATCHES.md. The "Why a vLLM fork (for now)" section below explains why this fork is required today.
Two flags are required for vLLM serving with TQKV:
--kv-cache-dtype tqkv— selects the compressed KV layout--attention-backend custom— routes attention through the registered TQKV backend
# 4-bit KV on Qwen3.5-0.8B, single GPU, up to 250K context
vllm serve Qwen/Qwen3.5-0.8B \
--kv-cache-dtype tqkv \
--attention-backend custom
# 27B on 2 GPUs, TP=2, TQ4, with MTP-3 speculative decoding
TQKV_BITS=4 vllm serve /path/to/Qwen3.5-27B-AWQ-4bit \
--kv-cache-dtype tqkv \
--attention-backend custom \
--tensor-parallel-size 2 \
--language-model-only \
--max-model-len 170000 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.92 \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
# Asymmetric K/V: 2-bit keys, 4-bit values (optimal direction is model-specific — measure with tier2_wikitext_quick.py)
TQKV_K_BITS=2 TQKV_V_BITS=4 vllm serve Qwen/Qwen3.5-0.8B \
--kv-cache-dtype tqkv \
--attention-backend custom
Recommended flags for production performance
TQKV's headline numbers (capacity, TTFT, throughput) assume a few non-default scheduler and compilation settings. The plugin does not set these for you — pass them explicitly so what you configure is what you get:
| Flag | Recommended value | Why |
|---|---|---|
--max-num-batched-tokens |
16384 |
TQKV's prefill bypass is chunk-size-agnostic; lifting MNBT lets long prompts land in fewer kernel launches → markedly better TTFT at ≥16k prompts |
--max-num-seqs |
<N> (auto-size) |
Without this, the scheduler caps concurrent requests at the bf16 cap and hides TQKV's 3–4× capacity gain. Set so that N × max_model_len ≈ your KV budget |
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}' |
as shown | TQKV's FA4 prefill kernel is CUDAGraph-safe; capturing it removes Python launch overhead on prefill |
vllm serve Qwen/Qwen3.5-0.8B \
--kv-cache-dtype tqkv \
--attention-backend custom \
--max-num-batched-tokens 16384 \
--max-num-seqs 256 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}'
Sizing --max-num-seqs: a fast estimate is N ≈ (kv_budget_bytes) / (per_token_bytes × max_model_len) where per_token_bytes for K4V4 is roughly num_layers × num_kv_heads × head_size × 2 ÷ 4. Bench at increasing values and watch for OOM at request peak.
MLA models (DeepSeek V2/V3)
MLA layers need a separate wrapper backend that compresses the shared-KV slot. Set TQKV_MLA_ENABLE=1 to enable; without it, MLA models will fail at load with "no MLA backend supports tqkv":
TQKV_MLA_ENABLE=1 vllm serve deepseek-ai/DeepSeek-V2-Lite-Chat \
--kv-cache-dtype tqkv \
--attention-backend custom
SGLang
SGLang integration is plugin-based — no fork required. tqkv.integrations.sglang.register() installs a pool-factory and wires the tqkv attention backend into SGLang's registry. Call it once before SGLang reads its attention-backend registry:
# In your launch script, before SGLang imports its backend registry:
import tqkv.integrations.sglang as tqkv_sglang
tqkv_sglang.register()
# Then launch SGLang as usual with:
# --kv-cache-dtype tqkv --attention-backend tqkv
TQKV_BITS=4 python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-0.8B \
--kv-cache-dtype tqkv \
--attention-backend tqkv
Standalone codec (no serving engine)
from tqkv import TurboKVCodec
codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")
k_packed, k_norms = codec.compress_k(key_vectors)
k_recon = codec.decompress_k(k_packed, k_norms)
# Pre-rotation trick (fused attention without decompress buffer)
q_rotated = codec.rotate_query(query, scale=1/math.sqrt(head_dim))
output = codec.unrotate_output(raw_output)
HuggingFace Transformers (reference path)
from tqkv.hf_cache import TQKVCache
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B").cuda()
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")
cache = TQKVCache(bit_width=4)
cache.init_from_model(model, tok)
out = model.generate(
**tok("Explain entropy in one sentence.", return_tensors="pt").to("cuda"),
past_key_values=cache, max_new_tokens=64,
)
Note: the HF path is a reference implementation used by the perplexity harness. For production serving, use vLLM or SGLang.
Benchmarking
Accuracy and throughput claims that require running on raw weights cannot be tested through our demo API — you have to install the backend and run benchmarks locally. We ship a reproducible harness and publish our own numbers, and we strongly encourage independent verification.
Accuracy / quality (run by us, scripts in benchmarks/):
- Perplexity on Wikitext-2 and PG-19 (long-context PPL)
- RULER — NVIDIA's long-context benchmark, 13 tasks (multi-key NIAH, variable tracking, aggregation, QA) at 4K–128K+. The long-context standard.
- LongBench-v2 — realistic long-context QA and reasoning
- MMLU 5-shot, GSM8K, TruthfulQA, HellaSwag via
lm-eval-harness— short-context retention - Needle-in-a-haystack at 100K and 250K — quick sanity check (not a primary metric)
Throughput / latency:
- vLLM
benchmark_serving.pywith ShareGPT workload at batch {1, 8, 32, 128} - TTFT / TPOT distributions under Poisson arrivals at QPS {1, 4, 16}
- Max context at fixed GPU memory — GB per 1K tokens
Reproduction scripts and exact commit hashes live in benchmarks/. All of them print CSV-formatted results to benchmarks/results/<timestamp>/ so you can diff them against ours.
# Everything the README claims, reproducible end-to-end
python benchmarks/run_all.py --profile full
# Individual suites
python benchmarks/perplexity/run.py --model Qwen/Qwen3.5-0.8B --bits 4
python benchmarks/ruler/run.py --model Qwen/Qwen3.5-0.8B --bits 4 --ctx 32768,65536,131072
python benchmarks/vllm/run_serving.py --model Qwen/Qwen3.5-0.8B --bits 4 --workload sharegpt
Configuration
All runtime configuration is done through TQKV_-prefixed environment
variables. The full surface is below, organised by category. Anything not
listed here is internal and may change without notice.
Bit width and calibration
| Variable | Default | Description |
|---|---|---|
TQKV_BITS |
4 |
Symmetric K/V bit width (2–8). Falls through to TQKV_K_BITS/TQKV_V_BITS when those are unset. |
TQKV_K_BITS / TQKV_V_BITS |
inherits TQKV_BITS |
Asymmetric K/V override (e.g. K2/V4). |
TQKV_LAYER_BITS |
"" |
Per-layer override string (e.g. 0:8,8;5:2,4). Layer index → (k_bits, v_bits). |
TQKV_CALIBRATION_FILE |
"" |
Path to a calibration JSON bundle produced by python -m tqkv.auto_calibrate. |
TQKV_ALLOCATION_FILE |
"" |
Path to a per-layer bit-allocation file produced by python -m tqkv.calibration.solve_bits. |
TQKV_AUTO_CALIBRATE_MODEL |
"" |
Model path for plugin-side auto-calibration. Triggers calibration on first init when set. |
TQKV_CALIBRATION_CACHE |
"" |
Directory used by auto-calibration to cache intermediate artefacts. |
TQKV_PROFILE |
none |
Calibration profile from the bundle: lossless, balanced, aggressive. |
Engine selection
| Variable | Default | Description |
|---|---|---|
TQKV_ENGINE |
"" (auto) |
Decode engine: native_tq (CUDA SIMT), flash_attn (decompress + FA), or bypass. Empty = auto-select. |
TQKV_PREFILL_ENGINE |
fa4 |
Prefill path: fa4, triton, decompress, or adaptive. See "Prefill: three paths" above. |
TQKV_PREFILL_BYPASS |
1 |
First-chunk prefill bypass — skip codec on prompt-prefill, then re-rotate to TQ basis for decode. |
TQKV_BYPASS_INLINE |
0 |
Inline the bypass logic in the runtime hot path (perf experiment). |
TQKV_FUSE_QROT |
"" (auto) |
Fused Q-rotation prologue: on / off / auto. Decode-only. |
TQKV_O_PROJ_FOLD |
on |
Fold rotate_output into the o_proj weights. Default-on, universal across attention layer classes. |
TQKV_MTP_SPLITK |
1 |
Use split-K decode kernel for MTP layers. |
TQKV_DECODE_SPLITS |
"" (autotune) |
Force decode-kernel split count. Empty = autotuned. |
TQKV_MTP_AUTOTUNE_SPLITS |
1 |
Run the autotune loop for MTP split-K. |
TQKV_MTP_DECODE_SPLITS |
"" |
MTP decode split count override. |
TQKV_KV_TILE_TOKENS |
"" |
KV tile size in tokens (autotuner override). |
TQKV_REFERENCE_SEQ_LEN |
"" |
Reference sequence length used by autotune heuristics. |
Backend behaviour
| Variable | Default | Description |
|---|---|---|
TQKV_NO_JIT |
0 |
Fail if a kernel variant is not pre-compiled (strict-mode for production images). |
TQKV_K_NC |
1 |
Apply norm-correction to K reads in the dequant path. |
TQKV_DISABLE_PRESCALE |
0 |
Disable per-channel pre-scaling on compress upload (debug). |
TQKV_DISABLE_HYBRID_REVIEW |
0 |
Disable the hybrid-model (Mamba/GDN) cache-budget review pass. |
TQKV_STRICT_NO_SDPA |
0 |
When 1, raise RuntimeError instead of taking the D>256 SDPA Python-loop fallback (~1000× slower than FA4/cuda-v9). Recommended for Gemma 4 31B and other head_dim>256 deployments. |
TQKV_SIMULATE_NO_FA4 |
0 |
Test fallback path by pretending FA4 isn't available. |
TQKV_LOG_HOIST |
"" |
When set, log per-layer-builder hoist diagnostics (debug). |
FA4 prefill scheduler (advanced)
| Variable | Default | Description |
|---|---|---|
TQKV_FA4_SPLIT_D |
auto | Head-dim splitting in the FA4 split-D kernel. 1/0 to override the auto-detect. |
TQKV_FA4_LPT |
auto |
Longest-processing-time scheduling: 0, 1, auto. Auto = on for causal-or-local. |
TQKV_FA4_HEAD_SWIZZLE |
1 |
Head-axis tile swizzle for L2-locality. Most impactful for Gemma H_kv=16. |
MLA (DeepSeek V2/V3/V4)
| Variable | Default | Description |
|---|---|---|
TQKV_MLA_ENABLE |
0 |
Master switch for the MLA backend. When 0, MLA models route to vLLM's stock backend. |
TQKV_MLA_ROPE_HEAD_DIM |
64 |
RoPE head dimension for MLA latent + RoPE split. |
Cold tier (variant 2 — H2O-style attention-weighted pool)
Off by default. See docs/cold_tier_design.md
for current shipping status. Master flag must be set for any other
cold-tier env to take effect.
| Variable | Default | Description |
|---|---|---|
TQKV_COLD_ENABLE |
0 |
Master switch. When 0, every other cold-tier env is ignored and the runtime is bit-identical to main. |
TQKV_COLD_HOT_PATH |
0 |
Hot-path opt-in. 1 installs the observer hook; 0 (default) installs state only. |
TQKV_COLD_TIERED_ATTEND |
0 |
γ_merge: full tiered attention with LSE merge. Currently a shipping blocker — leave 0. |
TQKV_COLD_PROFILE |
"" |
One of balanced, aggressive, max_compress. Presets M, P, K_PCT, W_track, and per-layer bits. |
TQKV_COLD_M |
8192 |
Age threshold: tokens older than this enter the cold tier. |
TQKV_COLD_P |
16 (=block_size) |
Pool slice width. Page-aligned — each eviction frees one vLLM page. |
TQKV_COLD_K_PCT |
1.0 |
Percent of each pool slice retained individually as heavy hitters (ranked by pre-softmax Q·K). |
TQKV_COLD_W_TRACK |
64 |
Aging-window size (decode steps a token stays in the ring before eviction). Must be ≥ TQKV_COLD_P. |
TQKV_COLD_BITS_K / TQKV_COLD_BITS_V |
hot-tier bits | Scalar override for cold-tier K/V bit width. |
TQKV_COLD_LAYER_BITS |
"" |
Per-layer cold-tier override, e.g. 5:8,8;10:4,4. |
TQKV_COLD_NO_POOL_LAYERS |
"0 1" |
Space-separated layers that skip pooling entirely (L0/L1 are attention-sink layers). |
Profiles (balanced / aggressive / max_compress): set bundles
of the above. Expected compression at 1M ctx on Qwen3-4B: ~47×
(balanced/aggressive) and ~100× (max_compress, unvalidated).
Status: flag-off-default. γ_nomerge path (archive + drain, no LSE
merge) decodes coherently and saves ~5% peak_MiB at batch=4. γ_merge
path is gated behind TQKV_COLD_TIERED_ATTEND=1 and is currently a
shipping blocker pending the multi-seq capacity bug fix.
Why a vLLM fork (for now)
SGLang doesn't need a fork — its attention-backend plugin surface accepts custom pools via a call-time patch today, so the SGLang integration ships as a pure plugin.
vLLM is structurally different. CacheDType in vllm/config/cache.py is a Pydantic Literal that validates at class-definition time, which blocks runtime plugin registration of any new KV-cache dtype. Until that's relaxed upstream there's no plugin-only path, and we ship a fork.
The fork is a thin overlay on top of vllm/vllm-openai:v0.19.0. The full layout — every file we copy, what each patch does, and the v0.19.0 compat guards — lives in docker/PATCHES.md. At the time of writing, that's roughly 19 files copied across config/, v1/attention/, and v1/core/, plus one append-patch on utils/torch_utils.py. The shape of the patches falls into three groups:
- Plugin-registration plumbing (about half the files): adding
"tqkv"toCacheDType, the backend enum, the platform selector, the cache-spec dispatch, and the bytes-layout calculator. This is what would collapse into a pure plugin ifCacheDTypebecomes extensible. - Hybrid-model KV bookkeeping (
v1/core/kv_cache_*.py): keeps per-group page sizes for attention vs. Mamba/GDN layers so compressed attention pages don't get unified up to a Mamba-state page size. Required for LFM2 and Qwen3.5 MoE+GDN; transparent on dense models. - Capture-mode and OOM-fallback wiring:
CUDAGraphMode.FULL_AND_PIECEWISEfor chunked-prefill capture, and the determine-available-memory fallback for compressed KV when the profiler returns 0.
We don't currently have an upstream PR open for the CacheDType relaxation — the open question is whether vLLM is willing to accept a plugin-extensible KV-cache-dtype surface; until that conversation happens, the fork is the way. If you want to track or contribute, the canonical state is arbi-dev/vllm and docker/PATCHES.md.
Roadmap
- vLLM upstream: explore
CacheDTyperelaxation so tqkv can register as a pure plugin (no PR open yet) - FlashAttention upstream: explore upstreaming the FA4 inline-dequant variant (no PR open yet)
- Validated model matrix: Llama-3, Mixtral, DeepSeek-V3, Command-R+, Gemma
- TP > 2 validation
- RULER and LongBench-v2 at 32K/64K/128K across all bit widths
- SGLang upstream: explore a first-class pool-factory hook so
register()can become a no-op (no PR open yet) - Hopper and Blackwell prefill optimization (TMA, larger smem budgets)
Citation
If turbo_attn helps your work, please cite both Google's TurboQuant paper and this implementation:
@misc{turbo_attn2026,
title = {turbo\_attn: Production attention backend for TurboQuant KV cache compression},
author = {Evseev, Dmitri},
year = {2026},
url = {https://github.com/arbi-dev/turbo_attn}
}
@inproceedings{zandieh2026turboquant,
title = {TurboQuant: Near-optimal KV Cache Quantization for LLM Inference},
author = {Zandieh, Amir and others},
booktitle = {ICLR},
year = {2026}
}
License
Apache 2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file turbo_attn-0.1.0.tar.gz.
File metadata
- Download URL: turbo_attn-0.1.0.tar.gz
- Upload date:
- Size: 532.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5f89c623a2a0a237d03cc3ec90839e81e8598dbbeaf46ce7de03df9950071b69
|
|
| MD5 |
e2dc90c41adce2a15184427321093e9e
|
|
| BLAKE2b-256 |
35290e3d8af8215b53f43af6d6d0cc730c535dee9f8d5db91971a53f86f8e335
|
File details
Details for the file turbo_attn-0.1.0-py3-none-any.whl.
File metadata
- Download URL: turbo_attn-0.1.0-py3-none-any.whl
- Upload date:
- Size: 590.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5746a66d8c7508747a8b5b3d7d597f9749a0402c6570ca64d423f2ffcc2c09f5
|
|
| MD5 |
ed97b39549f046b490092099da481195
|
|
| BLAKE2b-256 |
0462a1c3c00d91bbdea987922d8fa7814ca5112050358aff29fb4350b7bdf1fd
|