Skip to main content

Productionized TurboQuant KV cache compression for vLLM and SGLang (imported as `tqkv`)

Reason this release was yanked:

wrong licene info

Project description

Turbo Attention

CI PyPI License

A modular attention backend for vLLM and SGLang. Custom CUDA kernels, full CUDAGraph capture, asymmetric K/V quantization, hybrid-model support. 3.8× more KV context on the same GPU, under 1% accuracy loss. Built on FlashAttention.

PyPI: turbo-attn · Import: tqkv · License: Apache-2.0

Install

pip install turbo-attn                  # codec + CUDA/Triton kernels
pip install "turbo-attn[vllm]"          # + vLLM attention backend
pip install "turbo-attn[all]"           # + SGLang, FlashInfer, flash-attn, eval harness

Quickstart

import torch
from tqkv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4)
keys = torch.randn(8, 128)

packed, norms = codec.compress_k(keys)        # 4096 B → 512 B + 16 B
recon = codec.decompress_k(packed, norms)

Serve a model end-to-end:

# vLLM
vllm serve Qwen/Qwen3.5-0.8B \
    --kv-cache-dtype tqkv --attention-backend custom \
    --max-model-len 250000 --trust-remote-code

# SGLang (after `import tqkv.integrations.sglang as t; t.register()`)
python -m sglang.launch_server --model-path Qwen/Qwen3.5-0.8B \
    --kv-cache-dtype tqkv --attention-backend tqkv --trust-remote-code

Full walkthroughs:

Repo layout

The supported public surface lives at the top of each tree:

  • tqkv/ — the package (codec, kernels, runtime, vLLM/SGLang plugins, calibration pipeline).
  • docs/, docker/, scripts/, experiments/ — public docs, deploy recipes, helper scripts, research notes.
  • Anything under an internal/ subdirectory (docs/internal/, docker/internal/, scripts/internal/, experiments/internal/) is engineering-only and unsupported — it may move or break between releases. The wheel never ships these.

Why this exists

KV cache memory grows linearly with context length and dominates GPU memory beyond ~32K tokens. Google's TurboQuant (Zandieh et al., ICLR 2026) solved this in principle: a near-optimal KV compression scheme using Walsh–Hadamard rotation and Lloyd–Max codebooks, with provable distortion bounds. Multiple open-source references have appeared since the paper — but none ship production CUDA kernels, full CUDAGraph capture, or validation on the new generation of hybrid state-space / attention architectures.

turbo_attn is the implementation that actually ships in production. Custom CUDA throughout. Full FULL_AND_PIECEWISE CUDAGraph capture on prefill and decode. Asymmetric K/V bit widths across all nine {2,4,8}² combinations. Per-group block pools so attention-plus-Mamba models don't waste the memory they save. Drop-in backends for both vLLM and SGLang.

Headline results

Qwen3.5-27B-AWQ on 2× RTX 4090 (TP=2, TQ4 KV, MTP=3):[^27b]

Metric BF16 KV turbo_attn (TQ4)
KV capacity 370K tokens 1,360K tokens (3.7×)
Throughput (1 user) 131 tok/s 117 tok/s (89%)
Throughput (8 users) 355 tok/s
Throughput (128 users) 1,393 tok/s
TTFT 58 ms 78 ms
Needle @ 99K FOUND

[^27b]: Measured on an internal AWQ-INT4 build of Qwen3.5-27B; the artifact itself is not on the public HF Hub. The methodology, flags, and harness are public — numbers reproduce on any AWQ-quantized 27B-class model that fits the same TP=2 / 2× RTX 4090 layout.

Qwen3.5-0.8B at 250K context (single RTX 4090):

Config Prefill tok/s Decode tok/s Needle @ 250K Compression
BF16 KV 17,908 336 OOM at 64K 1.0×
TQ4 17,347 354 FOUND 3.8×
TQ3 17,499 296 FOUND 4.9×
TQ2 17,614 314 MISS (found @ 100K) 7.1×

TQ4 decode is 5% faster than BF16 at long context — compressed KV reads less memory, and the dequant overhead is smaller than the bandwidth savings.

Perplexity (Wikitext-2, Qwen3.5-0.8B): Measured via vLLM prompt_logprobs on the production prefill kernel. Additional models and context lengths in progress.

Config PPL Δ vs BF16 (Δ%)
BF16 24.61 0.000 (0.00%)
TQ8 (K8V8) 24.59 −0.02 (−0.08%)
TQ4 (K4V4) 24.87 +0.27 (+1.08%)
TQ3 25.67 +1.07 (+4.33%)
TQ2 (K2V2) 27.87 +3.27 (+13.28%)
K4V8 24.62 +0.01 (+0.06%)
K8V4 24.81 +0.20 (+0.81%)
K2V4 25.50 +0.90 (+3.65%)
K6V3 25.52 +0.91 (+3.71%)
K8V2 27.42 +2.81 (+11.43%)
K4V2 27.40 +2.79 (+11.35%)

Measured via vLLM prompt_logprobs on the production turbo_attn prefill kernel (not the experimental HF reference path). 8 chunks × 512 wikitext-2 tokens, subprocess-isolated. SEM ≈ ±2 PPL across chunks.

Reproduce the headline numbers

Every number in this README is reproducible from benchmarks/run_all.py:

pip install turbo-attn
python benchmarks/run_all.py --profile smoke   # ~2 minutes, sanity check
python benchmarks/run_all.py --profile quick   # ~8 minutes, headline numbers

Output lands in benchmarks/results/<timestamp>/ with summary.md (human-readable table), summary.json and per-suite CSVs (perplexity.csv, throughput.csv, memory.csv), per-subprocess logs under logs/, and a system.json stamp recording GPU, driver, CUDA, package versions, and git SHA.

Reference numbers we publish against are committed under benchmarks/baselines/qwen3.5-0.8b/ and touchstone/baselines/. To stand up a local server for ad-hoc prompts, see examples/02_vllm_server.py.

Hardware tested

GPU class SM Status
RTX 4090 / L40 / Ada 89 tested in CI and on the headline benches
A100 / Ampere 80, 86 builds; not yet validated end-to-end
H100 / Hopper 90 builds; FP8 attention not yet validated
B200 / Blackwell 100 not yet ported

What turbo_attn is — and isn't

turbo_attn is an attention backend in the vLLM/SGLang sense: it owns the KV cache layout and the kernels that read it. Everything outside attention is untouched and orthogonal.

Layers: who owns what

Layer Owner turbo_attn's role
Tokenizer, sampler, guided decoding, LoRA vLLM none
Request scheduler, batching, chunked prefill, prefix caching vLLM we consume what it sends
Paged block allocator vLLM we register our per-token byte budget
KV cache on-wire format turbo_attn rotate + quantize + bitpack into paged blocks
Attention compute (decode) turbo_attn fused CUDA kernel, end-to-end
Attention compute (prefill) turbo_attn one of three paths; see below
CUDAGraph capture orchestration vLLM we declare capturable modes; vLLM captures
Hybrid-model block dispatch vLLM + our per-group BlockPool we ensure compressed pages don't inflate
Weight loading (BF16 / AWQ / GPTQ / FP8) vLLM + quant libs orthogonal; composes cleanly
TP/PP collectives vLLM orthogonal; we run per-rank

Attention compute by run mode

Run mode KV storage KV load + dequant Q·K softmax P·V output
Decode (fused CUDA) ours (compressed) ours ours ours (online) ours ours
Decode split-K (batch=1) ours ours ours ours (2-phase) ours ours
MTP verification (BLOCK_M 2–8) ours ours ours ours ours ours
Prefill — FA4 inline-dequant ours ours (subclass override) FA4 FA4 FA4 FA4
Prefill — CUDA C++ (v9) ours ours ours (mma.sync) ours ours (mma.sync) ours
Prefill — decompress + stock FA ours → BF16 scratch ours (Triton) stock FA stock FA stock FA stock FA

Stock FlashAttention is never called on compressed bytes. When it runs, it runs unmodified on a decompressed scratch buffer.

What's novel

1. Our own warp-fused compress kernel. Walsh–Hadamard rotation via warp butterfly (__shfl_xor_sync, five shuffle stages), quantization, and bit-packing — all in a single CUDA launch. Measured 5× faster than a Triton baseline (~7 μs vs ~35 μs per layer at batch 1 on RTX 4090), contributing ~10% end-to-end decode speedup.

2. Unified fused decode kernel, templated on BLOCK_M. Single source handles ordinary decode (BLOCK_M=1), speculative decoding verification (BLOCK_M=2..8), and asymmetric K/V bit widths (all nine {2,4,8}² combinations). NCU-profiled: 40 registers, zero spills, at the per-SM performance ceiling. CUDAGraph-safe. MTP verification is a native fused path, not a fallback.

3. FA4 with inline dequantization (novel). A CuTeDSL subclass of FlashAttention-4 that overrides the load_K and load_V stages to dequantize compressed bytes directly into register tiles during the MMA pipeline. No decompress buffer. Shared-memory pressure actually decreases because the staging region for packed bytes is ≤1/8 the size of the BF16 tile it replaces. Generalized over all nine {2,4,8}² K/V configurations. Intended as an upstream contribution to FlashAttention.

4. Production CUDA C++ prefill kernel. XOR shared-memory swizzle (col ^ ((row & 7) << 3) — 1.75× improvement), staged cp.async.cg loads, ldmatrix.sync.aligned.m8n8.x4 fragment loads, mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 tensor cores, register-held query tiles, warp-shuffle softmax reductions, pack_gqa query layout. Templated on K/V bit widths and head dimensions. 128–200 registers, zero spills, 2 CTAs/SM.

5. Full FULL_AND_PIECEWISE CUDAGraph capture. Decode captured as a full graph, prefill captured piecewise per layer. Both the FA4 subclass and the CUDA C++ prefill kernel are allocation-free and capture-safe. This is the bleeding-edge vLLM CUDAGraph mode; most backends don't use it yet.

6. Per-group BlockPool for hybrid models. Attention-plus-Mamba/GatedDeltaNet/ShortConv models suffer from block-pool inflation in vLLM's default allocator, because attention pages get padded to the Mamba state page size. With TQ4 compression this waste becomes catastrophic (effective capacity drops to 2–3%). The fork carries a per-layer-group block pool that lets each layer type use its natural page size. Model-agnostic; benefits any hybrid architecture.

7. Asymmetric K/V bit widths. All nine combinations of {2,4,8}² in both decode and prefill, configured independently via TQKV_K_BITS and TQKV_V_BITS.

The optimal asymmetric K/V configuration is model-specific. On Qwen3.5-0.8B (hybrid attention + GatedDeltaNet), value precision matters more — K4V8 is essentially lossless (+0.06% PPL) while K8V4 is +0.81%. On other architectures with smaller head dimensions and more attention layers (e.g. Qwen3-0.6B, head_dim=64, 28 attention layers), key precision dominates — K8V4 is essentially lossless there while symmetric TQ4 diverges. The direction depends on which side of the model has wider activation distributions, not a universal rule. We recommend measuring per architecture with the included kv_norm_profile.py diagnostic and the tier2_wikitext_quick.py sweep harness.

8. Sliding-window attention support. Per-layer SlidingWindowSpec propagates window size into the decode kernel mask. Works on Gemma, Mistral variants, and any model that declares sliding-window layers.

9. Native MTP / speculative decoding. BLOCK_M>1 verification uses the same unified decode kernel, amortizing the KV read across all verified tokens. No decompress-and-fallback path. MTP=3 reaches ~96% of BF16 speed on Qwen3.5-27B at TQ4 with 3.7× the KV capacity.

10. Orthogonal to weight quantization. Composes cleanly with AWQ, GPTQ, FP8, and bitsandbytes. Validated on Qwen3.5-27B-AWQ-4bit at 170K context, TP=2.

Supported models

Architecture Type Status Notes
Qwen3.5 (0.8B, 27B) Hybrid (full-attn + GatedDeltaNet) ✅ Primary 27B via AWQ-4bit, TP=2, 1.36M context validated
Qwen3.6-MoE Hybrid MoE (full-attn + GatedDeltaNet + experts) ✅ Validated TP=2, 2.94× KV capacity (888k vs 302k BF16); throughput parity warm
LFM2-8B-A1B Hybrid MoE (full-attn + ShortConv + 32 experts) ✅ Validated Needle 11/12 (matches BF16), PPL within 4%
Gemma, Mistral (sliding window) Dense w/ SWA ✅ Backend supports Spot-tested
Llama-3, Mixtral, DeepSeek-V3, Command-R+ Dense / MoE 🗺️ Roadmap Expected to work; not yet explicitly validated
Pure Mamba / SSM State-space N/A Mamba has no KV to compress

turbo_attn is architecture-neutral: any model vLLM can serve with a paged KV cache can use it. The tested list reflects what's been benchmarked, not what works.

How TurboQuant works

  1. Rotate each KV vector with a fast Walsh–Hadamard transform. Raw KV values have uneven distributions; after rotation every coordinate follows the same Gaussian.
  2. Normalize — store the vector's magnitude as a single BF16 value.
  3. Quantize each coordinate to a shared codebook (4 values for TQ2, 16 for TQ4, up to 256 for TQ8). Because the rotated distribution is uniform across coordinates, a single shared codebook achieves near-optimal distortion.

At TQ4, head_dim=128: 64 packed bytes + 4 norm bytes = 68 B vs 256 B at BF16 = 3.8× compression. The same arithmetic extends to all supported widths: 2-bit → 7.1×, 3-bit → 4.9×, 8-bit → 1.9× (lossless).

Attention scores computed on rotated KV are bit-identical to attention on unrotated KV, provided the query is rotated by the same matrix before the dot product. We pre-rotate the query once per request and compute everything in the rotated space, never decompressing into the original coordinate system.

Architecture

Decode path

One fused CUDA kernel (turbo_attn/kernels/_cuda_decode_unified.cu) handles all decode math: unpacks nibbles from compressed KV pages, looks up centroids from a shared-memory codebook, multiplies by per-token norms, computes Q·K and P·V with online softmax, and writes outputs — in a single pass. No decompress buffer, no external kernel calls. Templated on HEAD_DIM, GQA_RATIO, BLOCK_M, TQ_K_BIT_WIDTH, and TQ_V_BIT_WIDTH; the compiler generates one variant per tuple. Split-K variant available for batch=1 workloads where grid utilization is the bottleneck.

Prefill: three paths, FULL_AND_PIECEWISE capture

All three prefill paths run at approximately BF16-FlashAttention speed on tested workloads.

  • Path A — FA4 inline-dequant (default). CuTeDSL subclass of FlashAttention-4 that overrides K/V loads to dequantize directly into register tiles. No decompress buffer. CUDAGraph-captured via piecewise mode.
  • Path B — CUDA C++ prod kernel (TQKV_PREFILL_ENGINE=cuda). Hand-written kernel with XOR swizzle, cp.async, ldmatrix, and mma.sync. Also piecewise-captured.
  • Path C — Decompress + stock FA (=decompress). Triton decompress into a small scratch buffer, then stock FlashAttention. Fallback for maximal compatibility and for very long chunks where fused dequant overhead compounds. Not CUDAGraph-captured.

An adaptive dispatcher (=adaptive) picks Path A for short chunks and Path C for chunks above a hand-tuned crossover. Adaptive is an explicit opt-in — it is never auto-selected, because silent switches between engines make perf numbers incoherent and can hide fa4 regressions in benchmarks. Use fa4 (the default) for stable measurement.

CUDAGraph mode

The backend requests CUDAGraphMode.FULL_AND_PIECEWISE when the prefill engine is fa4 (the default) or adaptive (explicit opt-in). Decode runs as a full captured graph; prefill runs as per-layer piecewise-compiled graphs. The FA4 subclass and the CUDA C++ prefill kernel are both allocation-free and fully compatible with this mode. Any other prefill engine (decompress, triton, cuda) falls back to FULL_DECODE_ONLY.

Compress kernel

Our warp-fused CUDA compress kernel runs on every KV write. One warp per (token, head) pair performs the Walsh–Hadamard rotation via five __shfl_xor_sync butterfly stages, reduces the norm via __shfl_down_sync, scalar-quantizes against the codebook, bit-packs indices into bytes, and scatters directly to the block offset — all in a single kernel. ~7 μs per layer at batch 1, vs ~35 μs for a Triton baseline.

Install and use

Full end-to-end walkthroughs for each engine:

  • vLLM quickstart — install the vLLM fork, serve a model, query it, validate compression.
  • SGLang quickstart — same flow via SGLang's attention-backend plugin.

vLLM

For vLLM serving, install the vllm fork (turbo-attn branch). The fork is a thin overlay on vllm/vllm-openai:v0.19.0 that wires the "tqkv" kv_cache_dtype through vLLM's config system and adds per-group block-pool bookkeeping for hybrid models. The actual backend (~2000 lines) lives in this package, not the fork; the full per-file breakdown is in docker/PATCHES.md. The "Why a vLLM fork (for now)" section below explains why this fork is required today.

Two flags are required for vLLM serving with TQKV:

  • --kv-cache-dtype tqkv — selects the compressed KV layout
  • --attention-backend custom — routes attention through the registered TQKV backend
# 4-bit KV on Qwen3.5-0.8B, single GPU, up to 250K context
vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom

# 27B on 2 GPUs, TP=2, TQ4, with MTP-3 speculative decoding
TQKV_BITS=4 vllm serve /path/to/Qwen3.5-27B-AWQ-4bit \
  --kv-cache-dtype tqkv \
  --attention-backend custom \
  --tensor-parallel-size 2 \
  --language-model-only \
  --max-model-len 170000 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.92 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

# Asymmetric K/V: 2-bit keys, 4-bit values (optimal direction is model-specific — measure with tier2_wikitext_quick.py)
TQKV_K_BITS=2 TQKV_V_BITS=4 vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom

Recommended flags for production performance

TQKV's headline numbers (capacity, TTFT, throughput) assume a few non-default scheduler and compilation settings. The plugin does not set these for you — pass them explicitly so what you configure is what you get:

Flag Recommended value Why
--max-num-batched-tokens 16384 TQKV's prefill bypass is chunk-size-agnostic; lifting MNBT lets long prompts land in fewer kernel launches → markedly better TTFT at ≥16k prompts
--max-num-seqs <N> (auto-size) Without this, the scheduler caps concurrent requests at the bf16 cap and hides TQKV's 3–4× capacity gain. Set so that N × max_model_len ≈ your KV budget
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}' as shown TQKV's FA4 prefill kernel is CUDAGraph-safe; capturing it removes Python launch overhead on prefill
vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 256 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}'

Sizing --max-num-seqs: a fast estimate is N ≈ (kv_budget_bytes) / (per_token_bytes × max_model_len) where per_token_bytes for K4V4 is roughly num_layers × num_kv_heads × head_size × 2 ÷ 4. Bench at increasing values and watch for OOM at request peak.

MLA models (DeepSeek V2/V3)

MLA layers need a separate wrapper backend that compresses the shared-KV slot. Set TQKV_MLA_ENABLE=1 to enable; without it, MLA models will fail at load with "no MLA backend supports tqkv":

TQKV_MLA_ENABLE=1 vllm serve deepseek-ai/DeepSeek-V2-Lite-Chat \
  --kv-cache-dtype tqkv \
  --attention-backend custom

SGLang

SGLang integration is plugin-based — no fork required. tqkv.integrations.sglang.register() installs a pool-factory and wires the tqkv attention backend into SGLang's registry. Call it once before SGLang reads its attention-backend registry:

# In your launch script, before SGLang imports its backend registry:
import tqkv.integrations.sglang as tqkv_sglang
tqkv_sglang.register()

# Then launch SGLang as usual with:
#   --kv-cache-dtype tqkv --attention-backend tqkv
TQKV_BITS=4 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend tqkv

Standalone codec (no serving engine)

from tqkv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")
k_packed, k_norms = codec.compress_k(key_vectors)
k_recon = codec.decompress_k(k_packed, k_norms)

# Pre-rotation trick (fused attention without decompress buffer)
q_rotated = codec.rotate_query(query, scale=1/math.sqrt(head_dim))
output = codec.unrotate_output(raw_output)

HuggingFace Transformers (reference path)

from tqkv.hf_cache import TQKVCache
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B").cuda()
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")
cache = TQKVCache(bit_width=4)
cache.init_from_model(model, tok)

out = model.generate(
    **tok("Explain entropy in one sentence.", return_tensors="pt").to("cuda"),
    past_key_values=cache, max_new_tokens=64,
)

Note: the HF path is a reference implementation used by the perplexity harness. For production serving, use vLLM or SGLang.

Benchmarking

Accuracy and throughput claims that require running on raw weights cannot be tested through our demo API — you have to install the backend and run benchmarks locally. We ship a reproducible harness and publish our own numbers, and we strongly encourage independent verification.

Accuracy / quality (run by us, scripts in benchmarks/):

  • Perplexity on Wikitext-2 and PG-19 (long-context PPL)
  • RULER — NVIDIA's long-context benchmark, 13 tasks (multi-key NIAH, variable tracking, aggregation, QA) at 4K–128K+. The long-context standard.
  • LongBench-v2 — realistic long-context QA and reasoning
  • MMLU 5-shot, GSM8K, TruthfulQA, HellaSwag via lm-eval-harness — short-context retention
  • Needle-in-a-haystack at 100K and 250K — quick sanity check (not a primary metric)

Throughput / latency:

  • vLLM benchmark_serving.py with ShareGPT workload at batch {1, 8, 32, 128}
  • TTFT / TPOT distributions under Poisson arrivals at QPS {1, 4, 16}
  • Max context at fixed GPU memory — GB per 1K tokens

Reproduction scripts and exact commit hashes live in benchmarks/. All of them print CSV-formatted results to benchmarks/results/<timestamp>/ so you can diff them against ours.

# Everything the README claims, reproducible end-to-end
python benchmarks/run_all.py --profile full

# Individual suites
python benchmarks/perplexity/run.py --model Qwen/Qwen3.5-0.8B --bits 4
python benchmarks/ruler/run.py      --model Qwen/Qwen3.5-0.8B --bits 4 --ctx 32768,65536,131072
python benchmarks/vllm/run_serving.py --model Qwen/Qwen3.5-0.8B --bits 4 --workload sharegpt

Configuration

All runtime configuration is done through TQKV_-prefixed environment variables. The full surface is below, organised by category. Anything not listed here is internal and may change without notice.

Bit width and calibration

Variable Default Description
TQKV_BITS 4 Symmetric K/V bit width (2–8). Falls through to TQKV_K_BITS/TQKV_V_BITS when those are unset.
TQKV_K_BITS / TQKV_V_BITS inherits TQKV_BITS Asymmetric K/V override (e.g. K2/V4).
TQKV_LAYER_BITS "" Per-layer override string (e.g. 0:8,8;5:2,4). Layer index → (k_bits, v_bits).
TQKV_CALIBRATION_FILE "" Path to a calibration JSON bundle produced by python -m tqkv.auto_calibrate.
TQKV_ALLOCATION_FILE "" Path to a per-layer bit-allocation file produced by python -m tqkv.calibration.solve_bits.
TQKV_AUTO_CALIBRATE_MODEL "" Model path for plugin-side auto-calibration. Triggers calibration on first init when set.
TQKV_CALIBRATION_CACHE "" Directory used by auto-calibration to cache intermediate artefacts.
TQKV_PROFILE none Calibration profile from the bundle: lossless, balanced, aggressive.

Engine selection

Variable Default Description
TQKV_ENGINE "" (auto) Decode engine: native_tq (CUDA SIMT), flash_attn (decompress + FA), or bypass. Empty = auto-select.
TQKV_PREFILL_ENGINE fa4 Prefill path: fa4, triton, decompress, or adaptive. See "Prefill: three paths" above.
TQKV_PREFILL_BYPASS 1 First-chunk prefill bypass — skip codec on prompt-prefill, then re-rotate to TQ basis for decode.
TQKV_BYPASS_INLINE 0 Inline the bypass logic in the runtime hot path (perf experiment).
TQKV_FUSE_QROT "" (auto) Fused Q-rotation prologue: on / off / auto. Decode-only.
TQKV_O_PROJ_FOLD on Fold rotate_output into the o_proj weights. Default-on, universal across attention layer classes.
TQKV_MTP_SPLITK 1 Use split-K decode kernel for MTP layers.
TQKV_DECODE_SPLITS "" (autotune) Force decode-kernel split count. Empty = autotuned.
TQKV_MTP_AUTOTUNE_SPLITS 1 Run the autotune loop for MTP split-K.
TQKV_MTP_DECODE_SPLITS "" MTP decode split count override.
TQKV_KV_TILE_TOKENS "" KV tile size in tokens (autotuner override).
TQKV_REFERENCE_SEQ_LEN "" Reference sequence length used by autotune heuristics.

Backend behaviour

Variable Default Description
TQKV_NO_JIT 0 Fail if a kernel variant is not pre-compiled (strict-mode for production images).
TQKV_K_NC 1 Apply norm-correction to K reads in the dequant path.
TQKV_DISABLE_PRESCALE 0 Disable per-channel pre-scaling on compress upload (debug).
TQKV_DISABLE_HYBRID_REVIEW 0 Disable the hybrid-model (Mamba/GDN) cache-budget review pass.
TQKV_STRICT_NO_SDPA 0 When 1, raise RuntimeError instead of taking the D>256 SDPA Python-loop fallback (~1000× slower than FA4/cuda-v9). Recommended for Gemma 4 31B and other head_dim>256 deployments.
TQKV_SIMULATE_NO_FA4 0 Test fallback path by pretending FA4 isn't available.
TQKV_LOG_HOIST "" When set, log per-layer-builder hoist diagnostics (debug).

FA4 prefill scheduler (advanced)

Variable Default Description
TQKV_FA4_SPLIT_D auto Head-dim splitting in the FA4 split-D kernel. 1/0 to override the auto-detect.
TQKV_FA4_LPT auto Longest-processing-time scheduling: 0, 1, auto. Auto = on for causal-or-local.
TQKV_FA4_HEAD_SWIZZLE 1 Head-axis tile swizzle for L2-locality. Most impactful for Gemma H_kv=16.

MLA (DeepSeek V2/V3/V4)

Variable Default Description
TQKV_MLA_ENABLE 0 Master switch for the MLA backend. When 0, MLA models route to vLLM's stock backend.
TQKV_MLA_ROPE_HEAD_DIM 64 RoPE head dimension for MLA latent + RoPE split.

Cold tier (variant 2 — H2O-style attention-weighted pool)

Off by default. See docs/cold_tier_design.md for current shipping status. Master flag must be set for any other cold-tier env to take effect.

Variable Default Description
TQKV_COLD_ENABLE 0 Master switch. When 0, every other cold-tier env is ignored and the runtime is bit-identical to main.
TQKV_COLD_HOT_PATH 0 Hot-path opt-in. 1 installs the observer hook; 0 (default) installs state only.
TQKV_COLD_TIERED_ATTEND 0 γ_merge: full tiered attention with LSE merge. Currently a shipping blocker — leave 0.
TQKV_COLD_PROFILE "" One of balanced, aggressive, max_compress. Presets M, P, K_PCT, W_track, and per-layer bits.
TQKV_COLD_M 8192 Age threshold: tokens older than this enter the cold tier.
TQKV_COLD_P 16 (=block_size) Pool slice width. Page-aligned — each eviction frees one vLLM page.
TQKV_COLD_K_PCT 1.0 Percent of each pool slice retained individually as heavy hitters (ranked by pre-softmax Q·K).
TQKV_COLD_W_TRACK 64 Aging-window size (decode steps a token stays in the ring before eviction). Must be ≥ TQKV_COLD_P.
TQKV_COLD_BITS_K / TQKV_COLD_BITS_V hot-tier bits Scalar override for cold-tier K/V bit width.
TQKV_COLD_LAYER_BITS "" Per-layer cold-tier override, e.g. 5:8,8;10:4,4.
TQKV_COLD_NO_POOL_LAYERS "0 1" Space-separated layers that skip pooling entirely (L0/L1 are attention-sink layers).

Profiles (balanced / aggressive / max_compress): set bundles of the above. Expected compression at 1M ctx on Qwen3-4B: ~47× (balanced/aggressive) and ~100× (max_compress, unvalidated).

Status: flag-off-default. γ_nomerge path (archive + drain, no LSE merge) decodes coherently and saves ~5% peak_MiB at batch=4. γ_merge path is gated behind TQKV_COLD_TIERED_ATTEND=1 and is currently a shipping blocker pending the multi-seq capacity bug fix.

Why a vLLM fork (for now)

SGLang doesn't need a fork — its attention-backend plugin surface accepts custom pools via a call-time patch today, so the SGLang integration ships as a pure plugin.

vLLM is structurally different. CacheDType in vllm/config/cache.py is a Pydantic Literal that validates at class-definition time, which blocks runtime plugin registration of any new KV-cache dtype. Until that's relaxed upstream there's no plugin-only path, and we ship a fork.

The fork is a thin overlay on top of vllm/vllm-openai:v0.19.0. The full layout — every file we copy, what each patch does, and the v0.19.0 compat guards — lives in docker/PATCHES.md. At the time of writing, that's roughly 19 files copied across config/, v1/attention/, and v1/core/, plus one append-patch on utils/torch_utils.py. The shape of the patches falls into three groups:

  • Plugin-registration plumbing (about half the files): adding "tqkv" to CacheDType, the backend enum, the platform selector, the cache-spec dispatch, and the bytes-layout calculator. This is what would collapse into a pure plugin if CacheDType becomes extensible.
  • Hybrid-model KV bookkeeping (v1/core/kv_cache_*.py): keeps per-group page sizes for attention vs. Mamba/GDN layers so compressed attention pages don't get unified up to a Mamba-state page size. Required for LFM2 and Qwen3.5 MoE+GDN; transparent on dense models.
  • Capture-mode and OOM-fallback wiring: CUDAGraphMode.FULL_AND_PIECEWISE for chunked-prefill capture, and the determine-available-memory fallback for compressed KV when the profiler returns 0.

We don't currently have an upstream PR open for the CacheDType relaxation — the open question is whether vLLM is willing to accept a plugin-extensible KV-cache-dtype surface; until that conversation happens, the fork is the way. If you want to track or contribute, the canonical state is arbi-dev/vllm and docker/PATCHES.md.

Roadmap

  • vLLM upstream: explore CacheDType relaxation so tqkv can register as a pure plugin (no PR open yet)
  • FlashAttention upstream: explore upstreaming the FA4 inline-dequant variant (no PR open yet)
  • Validated model matrix: Llama-3, Mixtral, DeepSeek-V3, Command-R+, Gemma
  • TP > 2 validation
  • RULER and LongBench-v2 at 32K/64K/128K across all bit widths
  • SGLang upstream: explore a first-class pool-factory hook so register() can become a no-op (no PR open yet)
  • Hopper and Blackwell prefill optimization (TMA, larger smem budgets)

Citation

If turbo_attn helps your work, please cite both Google's TurboQuant paper and this implementation:

@misc{turbo_attn2026,
  title = {turbo\_attn: Production attention backend for TurboQuant KV cache compression},
  author = {Evseev, Dmitri},
  year = {2026},
  url = {https://github.com/arbi-dev/turbo_attn}
}

@inproceedings{zandieh2026turboquant,
  title = {TurboQuant: Near-optimal KV Cache Quantization for LLM Inference},
  author = {Zandieh, Amir and others},
  booktitle = {ICLR},
  year = {2026}
}

License

Apache 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turbo_attn-0.1.0.tar.gz (532.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

turbo_attn-0.1.0-py3-none-any.whl (590.7 kB view details)

Uploaded Python 3

File details

Details for the file turbo_attn-0.1.0.tar.gz.

File metadata

  • Download URL: turbo_attn-0.1.0.tar.gz
  • Upload date:
  • Size: 532.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for turbo_attn-0.1.0.tar.gz
Algorithm Hash digest
SHA256 5f89c623a2a0a237d03cc3ec90839e81e8598dbbeaf46ce7de03df9950071b69
MD5 e2dc90c41adce2a15184427321093e9e
BLAKE2b-256 35290e3d8af8215b53f43af6d6d0cc730c535dee9f8d5db91971a53f86f8e335

See more details on using hashes here.

File details

Details for the file turbo_attn-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: turbo_attn-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 590.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for turbo_attn-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5746a66d8c7508747a8b5b3d7d597f9749a0402c6570ca64d423f2ffcc2c09f5
MD5 ed97b39549f046b490092099da481195
BLAKE2b-256 0462a1c3c00d91bbdea987922d8fa7814ca5112050358aff29fb4350b7bdf1fd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page