Productionized TurboQuant KV cache compression for vLLM and SGLang (imported as `tqkv`)

These details have not been verified by PyPI

Project links

Project description

Turbo Attention

A modular attention backend for vLLM and SGLang. Custom CUDA kernels, full CUDAGraph capture, asymmetric K/V quantization, hybrid-model support. 3.8× more KV context on the same GPU, under 1% accuracy loss. Built on FlashAttention.

PyPI: turbo-attn · Import: tqkv · License: Apache-2.0

Install

pip install turbo-attn                  # codec + CUDA/Triton kernels
pip install "turbo-attn[vllm]"          # + vLLM attention backend
pip install "turbo-attn[all]"           # + SGLang, FlashInfer, flash-attn, eval harness

Quickstart

import torch
from tqkv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4)
keys = torch.randn(8, 128)

packed, norms = codec.compress_k(keys)        # 4096 B → 512 B + 16 B
recon = codec.decompress_k(packed, norms)

Serve a model end-to-end:

# vLLM
vllm serve Qwen/Qwen3.5-0.8B \
    --kv-cache-dtype tqkv --attention-backend custom \
    --max-model-len 250000 --trust-remote-code

# SGLang (after `import tqkv.integrations.sglang as t; t.register()`)
python -m sglang.launch_server --model-path Qwen/Qwen3.5-0.8B \
    --kv-cache-dtype tqkv --attention-backend tqkv --trust-remote-code

Full walkthroughs:

vLLM quickstart
SGLang quickstart
examples/ — runnable Python snippets
ARCHITECTURE.md — codebase tour for contributors
docs/public_api.md — supported API surface

Repo layout

The supported public surface lives at the top of each tree:

tqkv/ — the package (codec, kernels, runtime, vLLM/SGLang plugins, calibration pipeline).
docs/, docker/, scripts/, experiments/ — public docs, deploy recipes, helper scripts, research notes.
Anything under an internal/ subdirectory (docs/internal/, docker/internal/, scripts/internal/, experiments/internal/) is engineering-only and unsupported — it may move or break between releases. The wheel never ships these.

Why this exists

KV cache memory grows linearly with context length and dominates GPU memory beyond ~32K tokens. Google's TurboQuant (Zandieh et al., ICLR 2026) solved this in principle: a near-optimal KV compression scheme using Walsh–Hadamard rotation and Lloyd–Max codebooks, with provable distortion bounds. Multiple open-source references have appeared since the paper — but none ship production CUDA kernels, full CUDAGraph capture, or validation on the new generation of hybrid state-space / attention architectures.

turbo_attn is the implementation that actually ships in production. Custom CUDA throughout. Full FULL_AND_PIECEWISE CUDAGraph capture on prefill and decode. Asymmetric K/V bit widths across all nine {2,4,8}² combinations. Per-group block pools so attention-plus-Mamba models don't waste the memory they save. Drop-in backends for both vLLM and SGLang.

Headline results

Qwen3.5-27B-AWQ on 2× RTX 4090 (TP=2, TQ4 KV, MTP=3):[^27b]

Metric	BF16 KV	turbo_attn (TQ4)
KV capacity	370K tokens	1,360K tokens (3.7×)
Throughput (1 user)	131 tok/s	117 tok/s (89%)
Throughput (8 users)	—	355 tok/s
Throughput (128 users)	—	1,393 tok/s
TTFT	58 ms	78 ms
Needle @ 99K	—	FOUND

[^27b]: Measured on an internal AWQ-INT4 build of Qwen3.5-27B; the artifact itself is not on the public HF Hub. The methodology, flags, and harness are public — numbers reproduce on any AWQ-quantized 27B-class model that fits the same TP=2 / 2× RTX 4090 layout.

Qwen3.5-0.8B at 250K context (single RTX 4090):

Config	Prefill tok/s	Decode tok/s	Needle @ 250K	Compression
BF16 KV	17,908	336	OOM at 64K	1.0×
TQ4	17,347	354	FOUND	3.8×
TQ3	17,499	296	FOUND	4.9×
TQ2	17,614	314	MISS (found @ 100K)	7.1×

TQ4 decode is 5% faster than BF16 at long context — compressed KV reads less memory, and the dequant overhead is smaller than the bandwidth savings.

Perplexity (Wikitext-2, Qwen3.5-0.8B): Measured via vLLM prompt_logprobs on the production prefill kernel. Additional models and context lengths in progress.

Config	PPL	Δ vs BF16	(Δ%)
BF16	24.61	0.000	(0.00%)
TQ8 (K8V8)	24.59	−0.02	(−0.08%)
TQ4 (K4V4)	24.87	+0.27	(+1.08%)
TQ3	25.67	+1.07	(+4.33%)
TQ2 (K2V2)	27.87	+3.27	(+13.28%)
K4V8	24.62	+0.01	(+0.06%)
K8V4	24.81	+0.20	(+0.81%)
K2V4	25.50	+0.90	(+3.65%)
K6V3	25.52	+0.91	(+3.71%)
K8V2	27.42	+2.81	(+11.43%)
K4V2	27.40	+2.79	(+11.35%)

_{Measured via vLLM prompt_logprobs on the production turbo_attn prefill kernel (not the experimental HF reference path). 8 chunks × 512 wikitext-2 tokens, subprocess-isolated. SEM ≈ ±2 PPL across chunks.}

Reproduce the headline numbers

Every number in this README is reproducible from benchmarks/run_all.py:

pip install turbo-attn
python benchmarks/run_all.py --profile smoke   # ~2 minutes, sanity check
python benchmarks/run_all.py --profile quick   # ~8 minutes, headline numbers

Output lands in benchmarks/results/<timestamp>/ with summary.md (human-readable table), summary.json and per-suite CSVs (perplexity.csv, throughput.csv, memory.csv), per-subprocess logs under logs/, and a system.json stamp recording GPU, driver, CUDA, package versions, and git SHA.

Reference numbers we publish against are committed under benchmarks/baselines/qwen3.5-0.8b/ and touchstone/baselines/. To stand up a local server for ad-hoc prompts, see examples/02_vllm_server.py.

Hardware tested

GPU class	SM	Status
RTX 4090 / L40 / Ada	89	tested in CI and on the headline benches
A100 / Ampere	80, 86	builds; not yet validated end-to-end
H100 / Hopper	90	builds; FP8 attention not yet validated
B200 / Blackwell	100	not yet ported

What turbo_attn is — and isn't

turbo_attn is an attention backend in the vLLM/SGLang sense: it owns the KV cache layout and the kernels that read it. Everything outside attention is untouched and orthogonal.

Layers: who owns what

Layer	Owner	turbo_attn's role
Tokenizer, sampler, guided decoding, LoRA	vLLM	none
Request scheduler, batching, chunked prefill, prefix caching	vLLM	we consume what it sends
Paged block allocator	vLLM	we register our per-token byte budget
KV cache on-wire format	turbo_attn	rotate + quantize + bitpack into paged blocks
Attention compute (decode)	turbo_attn	fused CUDA kernel, end-to-end
Attention compute (prefill)	turbo_attn	one of three paths; see below
CUDAGraph capture orchestration	vLLM	we declare capturable modes; vLLM captures
Hybrid-model block dispatch	vLLM + our per-group `BlockPool`	we ensure compressed pages don't inflate
Weight loading (BF16 / AWQ / GPTQ / FP8)	vLLM + quant libs	orthogonal; composes cleanly
TP/PP collectives	vLLM	orthogonal; we run per-rank

Attention compute by run mode

Run mode	KV storage	KV load + dequant	Q·K	softmax	P·V	output
Decode (fused CUDA)	ours (compressed)	ours	ours	ours (online)	ours	ours
Decode split-K (batch=1)	ours	ours	ours	ours (2-phase)	ours	ours
MTP verification (BLOCK_M 2–8)	ours	ours	ours	ours	ours	ours
Prefill — FA4 inline-dequant	ours	ours (subclass override)	FA4	FA4	FA4	FA4
Prefill — CUDA C++ (v9)	ours	ours	ours (`mma.sync`)	ours	ours (`mma.sync`)	ours
Prefill — decompress + stock FA	ours → BF16 scratch	ours (Triton)	stock FA	stock FA	stock FA	stock FA

Stock FlashAttention is never called on compressed bytes. When it runs, it runs unmodified on a decompressed scratch buffer.

What's novel

1. Our own warp-fused compress kernel. Walsh–Hadamard rotation via warp butterfly (__shfl_xor_sync, five shuffle stages), quantization, and bit-packing — all in a single CUDA launch. Measured 5× faster than a Triton baseline (~7 μs vs ~35 μs per layer at batch 1 on RTX 4090), contributing ~10% end-to-end decode speedup.

2. Unified fused decode kernel, templated on BLOCK_M. Single source handles ordinary decode (BLOCK_M=1), speculative decoding verification (BLOCK_M=2..8), and asymmetric K/V bit widths (all nine {2,4,8}² combinations). NCU-profiled: 40 registers, zero spills, at the per-SM performance ceiling. CUDAGraph-safe. MTP verification is a native fused path, not a fallback.

3. FA4 with inline dequantization (novel). A CuTeDSL subclass of FlashAttention-4 that overrides the load_K and load_V stages to dequantize compressed bytes directly into register tiles during the MMA pipeline. No decompress buffer. Shared-memory pressure actually decreases because the staging region for packed bytes is ≤1/8 the size of the BF16 tile it replaces. Generalized over all nine {2,4,8}² K/V configurations. Intended as an upstream contribution to FlashAttention.

4. Production CUDA C++ prefill kernel. XOR shared-memory swizzle (col ^ ((row & 7) << 3) — 1.75× improvement), staged cp.async.cg loads, ldmatrix.sync.aligned.m8n8.x4 fragment loads, mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 tensor cores, register-held query tiles, warp-shuffle softmax reductions, pack_gqa query layout. Templated on K/V bit widths and head dimensions. 128–200 registers, zero spills, 2 CTAs/SM.

5. Full FULL_AND_PIECEWISE CUDAGraph capture. Decode captured as a full graph, prefill captured piecewise per layer. Both the FA4 subclass and the CUDA C++ prefill kernel are allocation-free and capture-safe. This is the bleeding-edge vLLM CUDAGraph mode; most backends don't use it yet.

6. Per-group BlockPool for hybrid models. Attention-plus-Mamba/GatedDeltaNet/ShortConv models suffer from block-pool inflation in vLLM's default allocator, because attention pages get padded to the Mamba state page size. With TQ4 compression this waste becomes catastrophic (effective capacity drops to 2–3%). The fork carries a per-layer-group block pool that lets each layer type use its natural page size. Model-agnostic; benefits any hybrid architecture.

7. Asymmetric K/V bit widths. All nine combinations of {2,4,8}² in both decode and prefill, configured independently via TQKV_K_BITS and TQKV_V_BITS.

The optimal asymmetric K/V configuration is model-specific. On Qwen3.5-0.8B (hybrid attention + GatedDeltaNet), value precision matters more — K4V8 is essentially lossless (+0.06% PPL) while K8V4 is +0.81%. On other architectures with smaller head dimensions and more attention layers (e.g. Qwen3-0.6B, head_dim=64, 28 attention layers), key precision dominates — K8V4 is essentially lossless there while symmetric TQ4 diverges. The direction depends on which side of the model has wider activation distributions, not a universal rule. We recommend measuring per architecture with the included kv_norm_profile.py diagnostic and the tier2_wikitext_quick.py sweep harness.

8. Sliding-window attention support. Per-layer SlidingWindowSpec propagates window size into the decode kernel mask. Works on Gemma, Mistral variants, and any model that declares sliding-window layers.

9. Native MTP / speculative decoding. BLOCK_M>1 verification uses the same unified decode kernel, amortizing the KV read across all verified tokens. No decompress-and-fallback path. MTP=3 reaches ~96% of BF16 speed on Qwen3.5-27B at TQ4 with 3.7× the KV capacity.

10. Orthogonal to weight quantization. Composes cleanly with AWQ, GPTQ, FP8, and bitsandbytes. Validated on Qwen3.5-27B-AWQ-4bit at 170K context, TP=2.

Supported models

Architecture	Type	Status	Notes
Qwen3.5 (0.8B, 27B)	Hybrid (full-attn + GatedDeltaNet)	✅ Primary	27B via AWQ-4bit, TP=2, 1.36M context validated
Qwen3.6-MoE	Hybrid MoE (full-attn + GatedDeltaNet + experts)	✅ Validated	TP=2, 2.94× KV capacity (888k vs 302k BF16); throughput parity warm
LFM2-8B-A1B	Hybrid MoE (full-attn + ShortConv + 32 experts)	✅ Validated	Needle 11/12 (matches BF16), PPL within 4%
Gemma, Mistral (sliding window)	Dense w/ SWA	✅ Backend supports	Spot-tested
Llama-3, Mixtral, DeepSeek-V3, Command-R+	Dense / MoE	🗺️ Roadmap	Expected to work; not yet explicitly validated
Pure Mamba / SSM	State-space	N/A	Mamba has no KV to compress

turbo_attn is architecture-neutral: any model vLLM can serve with a paged KV cache can use it. The tested list reflects what's been benchmarked, not what works.

How TurboQuant works

Rotate each KV vector with a fast Walsh–Hadamard transform. Raw KV values have uneven distributions; after rotation every coordinate follows the same Gaussian.
Normalize — store the vector's magnitude as a single BF16 value.
Quantize each coordinate to a shared codebook (4 values for TQ2, 16 for TQ4, up to 256 for TQ8). Because the rotated distribution is uniform across coordinates, a single shared codebook achieves near-optimal distortion.

At TQ4, head_dim=128: 64 packed bytes + 4 norm bytes = 68 B vs 256 B at BF16 = 3.8× compression. The same arithmetic extends to all supported widths: 2-bit → 7.1×, 3-bit → 4.9×, 8-bit → 1.9× (lossless).

Attention scores computed on rotated KV are bit-identical to attention on unrotated KV, provided the query is rotated by the same matrix before the dot product. We pre-rotate the query once per request and compute everything in the rotated space, never decompressing into the original coordinate system.

Architecture

Decode path

One fused CUDA kernel (turbo_attn/kernels/_cuda_decode_unified.cu) handles all decode math: unpacks nibbles from compressed KV pages, looks up centroids from a shared-memory codebook, multiplies by per-token norms, computes Q·K and P·V with online softmax, and writes outputs — in a single pass. No decompress buffer, no external kernel calls. Templated on HEAD_DIM, GQA_RATIO, BLOCK_M, TQ_K_BIT_WIDTH, and TQ_V_BIT_WIDTH; the compiler generates one variant per tuple. Split-K variant available for batch=1 workloads where grid utilization is the bottleneck.

Prefill: three paths, `FULL_AND_PIECEWISE` capture

All three prefill paths run at approximately BF16-FlashAttention speed on tested workloads.

Path A — FA4 inline-dequant (default). CuTeDSL subclass of FlashAttention-4 that overrides K/V loads to dequantize directly into register tiles. No decompress buffer. CUDAGraph-captured via piecewise mode.
Path B — CUDA C++ prod kernel (TQKV_PREFILL_ENGINE=cuda). Hand-written kernel with XOR swizzle, cp.async, ldmatrix, and mma.sync. Also piecewise-captured.
Path C — Decompress + stock FA (=decompress). Triton decompress into a small scratch buffer, then stock FlashAttention. Fallback for maximal compatibility and for very long chunks where fused dequant overhead compounds. Not CUDAGraph-captured.

An adaptive dispatcher (=adaptive) picks Path A for short chunks and Path C for chunks above a hand-tuned crossover. Adaptive is an explicit opt-in — it is never auto-selected, because silent switches between engines make perf numbers incoherent and can hide fa4 regressions in benchmarks. Use fa4 (the default) for stable measurement.

CUDAGraph mode

The backend requests CUDAGraphMode.FULL_AND_PIECEWISE when the prefill engine is fa4 (the default) or adaptive (explicit opt-in). Decode runs as a full captured graph; prefill runs as per-layer piecewise-compiled graphs. The FA4 subclass and the CUDA C++ prefill kernel are both allocation-free and fully compatible with this mode. Any other prefill engine (decompress, triton, cuda) falls back to FULL_DECODE_ONLY.

Compress kernel

Our warp-fused CUDA compress kernel runs on every KV write. One warp per (token, head) pair performs the Walsh–Hadamard rotation via five __shfl_xor_sync butterfly stages, reduces the norm via __shfl_down_sync, scalar-quantizes against the codebook, bit-packs indices into bytes, and scatters directly to the block offset — all in a single kernel. ~7 μs per layer at batch 1, vs ~35 μs for a Triton baseline.

Install and use

Full end-to-end walkthroughs for each engine:

vLLM quickstart — install the vLLM fork, serve a model, query it, validate compression.
SGLang quickstart — same flow via SGLang's attention-backend plugin.

vLLM

For vLLM serving, install the vllm fork (turbo-attn branch). The fork is a thin overlay on vllm/vllm-openai:v0.19.0 that wires the "tqkv" kv_cache_dtype through vLLM's config system and adds per-group block-pool bookkeeping for hybrid models. The actual backend (~2000 lines) lives in this package, not the fork; the full per-file breakdown is in docker/PATCHES.md. The "Why a vLLM fork (for now)" section below explains why this fork is required today.

Two flags are required for vLLM serving with TQKV:

--kv-cache-dtype tqkv — selects the compressed KV layout
--attention-backend custom — routes attention through the registered TQKV backend

# 4-bit KV on Qwen3.5-0.8B, single GPU, up to 250K context
vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom

# 27B on 2 GPUs, TP=2, TQ4, with MTP-3 speculative decoding
TQKV_BITS=4 vllm serve /path/to/Qwen3.5-27B-AWQ-4bit \
  --kv-cache-dtype tqkv \
  --attention-backend custom \
  --tensor-parallel-size 2 \
  --language-model-only \
  --max-model-len 170000 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.92 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

# Asymmetric K/V: 2-bit keys, 4-bit values (optimal direction is model-specific — measure with tier2_wikitext_quick.py)
TQKV_K_BITS=2 TQKV_V_BITS=4 vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom

Recommended flags for production performance

TQKV's headline numbers (capacity, TTFT, throughput) assume a few non-default scheduler and compilation settings. The plugin does not set these for you — pass them explicitly so what you configure is what you get:

Flag	Recommended value	Why
`--max-num-batched-tokens`	`16384`	TQKV's prefill bypass is chunk-size-agnostic; lifting MNBT lets long prompts land in fewer kernel launches → markedly better TTFT at ≥16k prompts
`--max-num-seqs`	`<N>` (auto-size)	Without this, the scheduler caps concurrent requests at the bf16 cap and hides TQKV's 3–4× capacity gain. Set so that `N × max_model_len` ≈ your KV budget
`--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}'`	as shown	TQKV's FA4 prefill kernel is CUDAGraph-safe; capturing it removes Python launch overhead on prefill

vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 256 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}'

Sizing --max-num-seqs: a fast estimate is N ≈ (kv_budget_bytes) / (per_token_bytes × max_model_len) where per_token_bytes for K4V4 is roughly num_layers × num_kv_heads × head_size × 2 ÷ 4. Bench at increasing values and watch for OOM at request peak.

MLA models (DeepSeek V2/V3)

MLA layers need a separate wrapper backend that compresses the shared-KV slot. Set TQKV_MLA_ENABLE=1 to enable; without it, MLA models will fail at load with "no MLA backend supports tqkv":

TQKV_MLA_ENABLE=1 vllm serve deepseek-ai/DeepSeek-V2-Lite-Chat \
  --kv-cache-dtype tqkv \
  --attention-backend custom

SGLang

SGLang integration is plugin-based — no fork required. tqkv.integrations.sglang.register() installs a pool-factory and wires the tqkv attention backend into SGLang's registry. Call it once before SGLang reads its attention-backend registry:

# In your launch script, before SGLang imports its backend registry:
import tqkv.integrations.sglang as tqkv_sglang
tqkv_sglang.register()

# Then launch SGLang as usual with:
#   --kv-cache-dtype tqkv --attention-backend tqkv

TQKV_BITS=4 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend tqkv

Standalone codec (no serving engine)

from tqkv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")
k_packed, k_norms = codec.compress_k(key_vectors)
k_recon = codec.decompress_k(k_packed, k_norms)

# Pre-rotation trick (fused attention without decompress buffer)
q_rotated = codec.rotate_query(query, scale=1/math.sqrt(head_dim))
output = codec.unrotate_output(raw_output)

HuggingFace Transformers (reference path)

from tqkv.hf_cache import TQKVCache
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B").cuda()
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")
cache = TQKVCache(bit_width=4)
cache.init_from_model(model, tok)

out = model.generate(
    **tok("Explain entropy in one sentence.", return_tensors="pt").to("cuda"),
    past_key_values=cache, max_new_tokens=64,
)

Note: the HF path is a reference implementation used by the perplexity harness. For production serving, use vLLM or SGLang.

Benchmarking

Accuracy and throughput claims that require running on raw weights cannot be tested through our demo API — you have to install the backend and run benchmarks locally. We ship a reproducible harness and publish our own numbers, and we strongly encourage independent verification.

Accuracy / quality (run by us, scripts in benchmarks/):

Perplexity on Wikitext-2 and PG-19 (long-context PPL)
RULER — NVIDIA's long-context benchmark, 13 tasks (multi-key NIAH, variable tracking, aggregation, QA) at 4K–128K+. The long-context standard.
LongBench-v2 — realistic long-context QA and reasoning
MMLU 5-shot, GSM8K, TruthfulQA, HellaSwag via lm-eval-harness — short-context retention
Needle-in-a-haystack at 100K and 250K — quick sanity check (not a primary metric)

Throughput / latency:

vLLM benchmark_serving.py with ShareGPT workload at batch {1, 8, 32, 128}
TTFT / TPOT distributions under Poisson arrivals at QPS {1, 4, 16}
Max context at fixed GPU memory — GB per 1K tokens

Reproduction scripts and exact commit hashes live in benchmarks/. All of them print CSV-formatted results to benchmarks/results/<timestamp>/ so you can diff them against ours.

# Everything the README claims, reproducible end-to-end
python benchmarks/run_all.py --profile full

# Individual suites
python benchmarks/perplexity/run.py --model Qwen/Qwen3.5-0.8B --bits 4
python benchmarks/ruler/run.py      --model Qwen/Qwen3.5-0.8B --bits 4 --ctx 32768,65536,131072
python benchmarks/vllm/run_serving.py --model Qwen/Qwen3.5-0.8B --bits 4 --workload sharegpt

Configuration

All runtime configuration is done through TQKV_-prefixed environment variables. The full surface is below, organised by category. Anything not listed here is internal and may change without notice.

Bit width and calibration

Variable	Default	Description
`TQKV_BITS`	`4`	Symmetric K/V bit width (2–8). Falls through to `TQKV_K_BITS`/`TQKV_V_BITS` when those are unset.
`TQKV_K_BITS` / `TQKV_V_BITS`	inherits `TQKV_BITS`	Asymmetric K/V override (e.g. `K2/V4`).
`TQKV_LAYER_BITS`	`""`	Per-layer override string (e.g. `0:8,8;5:2,4`). Layer index → `(k_bits, v_bits)`.
`TQKV_CALIBRATION_FILE`	`""`	Path to a calibration JSON bundle produced by `python -m tqkv.auto_calibrate`.
`TQKV_ALLOCATION_FILE`	`""`	Path to a per-layer bit-allocation file produced by `python -m tqkv.calibration.solve_bits`.
`TQKV_AUTO_CALIBRATE_MODEL`	`""`	Model path for plugin-side auto-calibration. Triggers calibration on first init when set.
`TQKV_CALIBRATION_CACHE`	`""`	Directory used by auto-calibration to cache intermediate artefacts.
`TQKV_PROFILE`	`none`	Calibration profile from the bundle: `lossless`, `balanced`, `aggressive`.

Engine selection

Variable	Default	Description
`TQKV_ENGINE`	`""` (auto)	Decode engine: `native_tq` (CUDA SIMT), `flash_attn` (decompress + FA), or `bypass`. Empty = auto-select.
`TQKV_PREFILL_ENGINE`	`fa4`	Prefill path: `fa4`, `triton`, `decompress`, or `adaptive`. See "Prefill: three paths" above.
`TQKV_PREFILL_BYPASS`	`1`	First-chunk prefill bypass — skip codec on prompt-prefill, then re-rotate to TQ basis for decode.
`TQKV_BYPASS_INLINE`	`0`	Inline the bypass logic in the runtime hot path (perf experiment).
`TQKV_FUSE_QROT`	`""` (auto)	Fused Q-rotation prologue: `on` / `off` / `auto`. Decode-only.
`TQKV_O_PROJ_FOLD`	`on`	Fold `rotate_output` into the `o_proj` weights. Default-on, universal across attention layer classes.
`TQKV_MTP_SPLITK`	`1`	Use split-K decode kernel for MTP layers.
`TQKV_DECODE_SPLITS`	`""` (autotune)	Force decode-kernel split count. Empty = autotuned.
`TQKV_MTP_AUTOTUNE_SPLITS`	`1`	Run the autotune loop for MTP split-K.
`TQKV_MTP_DECODE_SPLITS`	`""`	MTP decode split count override.
`TQKV_KV_TILE_TOKENS`	`""`	KV tile size in tokens (autotuner override).
`TQKV_REFERENCE_SEQ_LEN`	`""`	Reference sequence length used by autotune heuristics.

Backend behaviour

Variable	Default	Description
`TQKV_NO_JIT`	`0`	Fail if a kernel variant is not pre-compiled (strict-mode for production images).
`TQKV_K_NC`	`1`	Apply norm-correction to K reads in the dequant path.
`TQKV_DISABLE_PRESCALE`	`0`	Disable per-channel pre-scaling on compress upload (debug).
`TQKV_DISABLE_HYBRID_REVIEW`	`0`	Disable the hybrid-model (Mamba/GDN) cache-budget review pass.
`TQKV_STRICT_NO_SDPA`	`0`	When `1`, raise `RuntimeError` instead of taking the `D>256` SDPA Python-loop fallback (~1000× slower than FA4/cuda-v9). Recommended for Gemma 4 31B and other `head_dim>256` deployments.
`TQKV_SIMULATE_NO_FA4`	`0`	Test fallback path by pretending FA4 isn't available.
`TQKV_LOG_HOIST`	`""`	When set, log per-layer-builder hoist diagnostics (debug).

FA4 prefill scheduler (advanced)

Variable	Default	Description
`TQKV_FA4_SPLIT_D`	auto	Head-dim splitting in the FA4 split-D kernel. `1`/`0` to override the auto-detect.
`TQKV_FA4_LPT`	`auto`	Longest-processing-time scheduling: `0`, `1`, `auto`. Auto = on for causal-or-local.
`TQKV_FA4_HEAD_SWIZZLE`	`1`	Head-axis tile swizzle for L2-locality. Most impactful for Gemma `H_kv=16`.

MLA (DeepSeek V2/V3/V4)

Variable	Default	Description
`TQKV_MLA_ENABLE`	`0`	Master switch for the MLA backend. When `0`, MLA models route to vLLM's stock backend.
`TQKV_MLA_ROPE_HEAD_DIM`	`64`	RoPE head dimension for MLA latent + RoPE split.

Cold tier (variant 2 — H2O-style attention-weighted pool)

Off by default. See docs/cold_tier_design.md for current shipping status. Master flag must be set for any other cold-tier env to take effect.

Variable	Default	Description
`TQKV_COLD_ENABLE`	`0`	Master switch. When `0`, every other cold-tier env is ignored and the runtime is bit-identical to main.
`TQKV_COLD_HOT_PATH`	`0`	Hot-path opt-in. `1` installs the observer hook; `0` (default) installs state only.
`TQKV_COLD_TIERED_ATTEND`	`0`	γ_merge: full tiered attention with LSE merge. Currently a shipping blocker — leave `0`.
`TQKV_COLD_PROFILE`	`""`	One of `balanced`, `aggressive`, `max_compress`. Presets `M`, `P`, `K_PCT`, `W_track`, and per-layer bits.
`TQKV_COLD_M`	`8192`	Age threshold: tokens older than this enter the cold tier.
`TQKV_COLD_P`	`16` (=block_size)	Pool slice width. Page-aligned — each eviction frees one vLLM page.
`TQKV_COLD_K_PCT`	`1.0`	Percent of each pool slice retained individually as heavy hitters (ranked by pre-softmax Q·K).
`TQKV_COLD_W_TRACK`	`64`	Aging-window size (decode steps a token stays in the ring before eviction). Must be ≥ `TQKV_COLD_P`.
`TQKV_COLD_BITS_K` / `TQKV_COLD_BITS_V`	hot-tier bits	Scalar override for cold-tier K/V bit width.
`TQKV_COLD_LAYER_BITS`	`""`	Per-layer cold-tier override, e.g. `5:8,8;10:4,4`.
`TQKV_COLD_NO_POOL_LAYERS`	`"0 1"`	Space-separated layers that skip pooling entirely (L0/L1 are attention-sink layers).

Profiles (balanced / aggressive / max_compress): set bundles of the above. Expected compression at 1M ctx on Qwen3-4B: ~47× (balanced/aggressive) and ~100× (max_compress, unvalidated).

Status: flag-off-default. γ_nomerge path (archive + drain, no LSE merge) decodes coherently and saves ~5% peak_MiB at batch=4. γ_merge path is gated behind TQKV_COLD_TIERED_ATTEND=1 and is currently a shipping blocker pending the multi-seq capacity bug fix.

Why a vLLM fork (for now)

SGLang doesn't need a fork — its attention-backend plugin surface accepts custom pools via a call-time patch today, so the SGLang integration ships as a pure plugin.

vLLM is structurally different. CacheDType in vllm/config/cache.py is a Pydantic Literal that validates at class-definition time, which blocks runtime plugin registration of any new KV-cache dtype. Until that's relaxed upstream there's no plugin-only path, and we ship a fork.

The fork is a thin overlay on top of vllm/vllm-openai:v0.19.0. The full layout — every file we copy, what each patch does, and the v0.19.0 compat guards — lives in docker/PATCHES.md. At the time of writing, that's roughly 19 files copied across config/, v1/attention/, and v1/core/, plus one append-patch on utils/torch_utils.py. The shape of the patches falls into three groups:

Plugin-registration plumbing (about half the files): adding "tqkv" to CacheDType, the backend enum, the platform selector, the cache-spec dispatch, and the bytes-layout calculator. This is what would collapse into a pure plugin if CacheDType becomes extensible.
Hybrid-model KV bookkeeping (v1/core/kv_cache_*.py): keeps per-group page sizes for attention vs. Mamba/GDN layers so compressed attention pages don't get unified up to a Mamba-state page size. Required for LFM2 and Qwen3.5 MoE+GDN; transparent on dense models.
Capture-mode and OOM-fallback wiring: CUDAGraphMode.FULL_AND_PIECEWISE for chunked-prefill capture, and the determine-available-memory fallback for compressed KV when the profiler returns 0.

We don't currently have an upstream PR open for the CacheDType relaxation — the open question is whether vLLM is willing to accept a plugin-extensible KV-cache-dtype surface; until that conversation happens, the fork is the way. If you want to track or contribute, the canonical state is arbi-dev/vllm and docker/PATCHES.md.

Roadmap

vLLM upstream: explore CacheDType relaxation so tqkv can register as a pure plugin (no PR open yet)
FlashAttention upstream: explore upstreaming the FA4 inline-dequant variant (no PR open yet)
Validated model matrix: Llama-3, Mixtral, DeepSeek-V3, Command-R+, Gemma
TP > 2 validation
RULER and LongBench-v2 at 32K/64K/128K across all bit widths
SGLang upstream: explore a first-class pool-factory hook so register() can become a no-op (no PR open yet)
Hopper and Blackwell prefill optimization (TMA, larger smem budgets)

Citation

If turbo_attn helps your work, please cite both Google's TurboQuant paper and this implementation:

@misc{turbo_attn2026,
  title = {turbo\_attn: Production attention backend for TurboQuant KV cache compression},
  author = {Evseev, Dmitri},
  year = {2026},
  url = {https://github.com/arbi-dev/turbo_attn}
}

@inproceedings{zandieh2026turboquant,
  title = {TurboQuant: Near-optimal KV Cache Quantization for LLM Inference},
  author = {Zandieh, Amir and others},
  booktitle = {ICLR},
  year = {2026}
}

License

Apache 2.0

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.6.4

Jun 13, 2026

0.6.3

Jun 13, 2026

0.6.2

Jun 12, 2026

0.6.1

Jun 12, 2026

0.6.0

Jun 12, 2026

0.5.1

Jun 12, 2026

0.5.0

Jun 11, 2026

0.4.1

Jun 11, 2026

0.4.0

Jun 10, 2026

0.3.2

Jun 10, 2026

0.3.1

Jun 10, 2026

0.3.0

Jun 10, 2026

0.2.0

Jun 2, 2026

0.1.2

Apr 30, 2026

0.1.1

Apr 30, 2026

This version

0.1.0 yanked

Apr 30, 2026

Reason this release was yanked:

wrong licene info

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turbo_attn-0.1.0.tar.gz (532.8 kB view details)

Uploaded Apr 30, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

turbo_attn-0.1.0-py3-none-any.whl (590.7 kB view details)

Uploaded Apr 30, 2026 Python 3

File details

Details for the file turbo_attn-0.1.0.tar.gz.

File metadata

Download URL: turbo_attn-0.1.0.tar.gz
Upload date: Apr 30, 2026
Size: 532.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for turbo_attn-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`5f89c623a2a0a237d03cc3ec90839e81e8598dbbeaf46ce7de03df9950071b69`
MD5	`e2dc90c41adce2a15184427321093e9e`
BLAKE2b-256	`35290e3d8af8215b53f43af6d6d0cc730c535dee9f8d5db91971a53f86f8e335`

See more details on using hashes here.

File details

Details for the file turbo_attn-0.1.0-py3-none-any.whl.

File metadata

Download URL: turbo_attn-0.1.0-py3-none-any.whl
Upload date: Apr 30, 2026
Size: 590.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for turbo_attn-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5746a66d8c7508747a8b5b3d7d597f9749a0402c6570ca64d423f2ffcc2c09f5`
MD5	`ed97b39549f046b490092099da481195`
BLAKE2b-256	`0462a1c3c00d91bbdea987922d8fa7814ca5112050358aff29fb4350b7bdf1fd`

See more details on using hashes here.

turbo-attn 0.1.0

Navigation

Verified details

Owner

Unverified details

Project links

Meta

Classifiers

Project description

Turbo Attention

Install

Quickstart

Repo layout

Why this exists

Headline results

Reproduce the headline numbers

Hardware tested

What turbo_attn is — and isn't

Layers: who owns what

Attention compute by run mode

What's novel

Supported models

How TurboQuant works

Architecture

Decode path

Prefill: three paths, FULL_AND_PIECEWISE capture

CUDAGraph mode

Compress kernel

Install and use

vLLM

Recommended flags for production performance

MLA models (DeepSeek V2/V3)

SGLang

Standalone codec (no serving engine)

HuggingFace Transformers (reference path)

Benchmarking

Configuration

Bit width and calibration

Engine selection

Backend behaviour

FA4 prefill scheduler (advanced)

MLA (DeepSeek V2/V3/V4)

Cold tier (variant 2 — H2O-style attention-weighted pool)

Why a vLLM fork (for now)

Roadmap

Citation

License

Project details

Verified details

Owner

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Prefill: three paths, `FULL_AND_PIECEWISE` capture