Skip to main content

Optimized CUDAgraph-enabled kernels and attention backend for vLLM, SGLang and more based on TurboQuant near-lossless KV cache compression. SOTA performance with Gemma 4, Qwen 3.6 and other modern LLMs.

Project description

Turbo Attention

CI PyPI License

A modular attention backend for vLLM, SGLang, and HuggingFace Transformers. Custom CUDA + Triton kernels with full CUDAGraph capture, asymmetric K/V quantization, hybrid-model support. Built on FlashAttention; based on TurboQuant near-lossless KV cache compression.

PyPI: turbo-attn · Import: tqkv · License: MPL-2.0

Install

pip install turbo-attn                  # codec + CUDA/Triton kernels
pip install "turbo-attn[vllm]"          # + vLLM attention backend
pip install "turbo-attn[all]"           # + SGLang, FlashInfer, flash-attn, eval harness

Quickstart

import torch
from tqkv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")
keys = torch.randn(8, 128, device="cuda")

packed, norms = codec.compress_k(keys)
recon = codec.decompress_k(packed, norms)

See examples/ for runnable snippets and ARCHITECTURE.md for a codebase tour.

Repo layout

  • tqkv/ — the package (codec, kernels, runtime, vLLM/SGLang plugins, calibration pipeline).
  • docs/, docker/, scripts/, examples/, experiments/ — public docs, deploy recipes, helper scripts, runnable examples, research notes.
  • Anything under an internal/ subdirectory is engineering-only and unsupported. The wheel never ships these.

vLLM

vLLM serving currently requires the vllm fork (turbo-attn branch) — it wires "tqkv" through vLLM's CacheDType Literal and adds per-group block-pool bookkeeping for hybrid models. Patch layout in docker/PATCHES.md.

vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom

For MLA models (DeepSeek V2/V3) set TQKV_MLA_ENABLE=1.

SGLang

No fork required — tqkv.integrations.sglang.register() installs a pool factory and wires the tqkv attention backend.

import tqkv.integrations.sglang as tqkv_sglang
tqkv_sglang.register()
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend tqkv

How it works

  1. Rotate each KV vector with a fast Walsh–Hadamard transform.
  2. Normalize — store the magnitude as a single BF16 value.
  3. Quantize each rotated coordinate to a shared codebook.

Attention scores on rotated KV are bit-identical to attention on unrotated KV when the query is rotated by the same matrix; we pre-rotate Q once per request and compute everything in the rotated space.

The decode path is one fused CUDA kernel that unpacks, dequantizes, and runs Q·K, online softmax, and P·V in a single pass — no decompress buffer. Prefill has three paths: an FA4 CuTeDSL subclass that dequantizes inline during the MMA pipeline (default), a hand-written CUDA C++ kernel, and a decompress + stock FlashAttention fallback.

Configuration

All runtime configuration is via TQKV_-prefixed environment variables. The supported surface is below; anything unlisted is internal and may change.

Bit width and calibration

Variable Default Description
TQKV_BITS 4 Symmetric K/V bit width (2–8). Falls through to TQKV_K_BITS/TQKV_V_BITS when those are unset.
TQKV_K_BITS / TQKV_V_BITS inherits TQKV_BITS Asymmetric K/V override.
TQKV_LAYER_BITS "" Per-layer override string (e.g. 0:8,8;5:2,4).
TQKV_CALIBRATION_FILE "" Path to a calibration JSON bundle from python -m tqkv.calibration.calibrate_model.
TQKV_ALLOCATION_FILE "" Path to a per-layer bit-allocation file from python -m tqkv.calibration.solve_bits.
TQKV_AUTO_CALIBRATE_MODEL "" Model path for plugin-side auto-calibration on first init.
TQKV_PROFILE none Calibration profile from the bundle: lossless, balanced, aggressive.

Engine selection

Variable Default Description
TQKV_ENGINE "" (auto) Decode engine: native_tq, flash_attn, or bypass.
TQKV_PREFILL_ENGINE fa4 Prefill path: fa4 or adaptive.
TQKV_PREFILL_BYPASS 1 First-chunk prefill bypass — skip codec on prompt-prefill, then re-rotate to TQ basis for decode.
TQKV_FUSE_QROT "" (auto) Fused Q-rotation prologue. Decode-only.
TQKV_O_PROJ_FOLD on Fold rotate_output into o_proj weights.
TQKV_MTP_SPLITK 1 Use split-K decode kernel for MTP layers.
TQKV_DECODE_SPLITS "" (autotune) Force decode-kernel split count.

Backend behaviour

Variable Default Description
TQKV_NO_JIT 0 Fail if a kernel variant is not pre-compiled.
TQKV_K_NC 1 Apply norm-correction to K reads in the dequant path.
TQKV_DISABLE_PRESCALE 0 Disable per-channel pre-scaling on compress upload.
TQKV_STRICT_NO_SDPA 0 Raise instead of taking the head_dim>256 SDPA fallback. Recommended for head_dim>256 deployments.

MLA (DeepSeek V2/V3/V4)

Variable Default Description
TQKV_MLA_ENABLE 0 Master switch for the MLA backend.
TQKV_MLA_ROPE_HEAD_DIM 64 RoPE head dimension for MLA latent + RoPE split.

Why a vLLM fork (for now)

CacheDType in vllm/config/cache.py is a Pydantic Literal validated at class-definition time, which blocks runtime registration of new KV-cache dtypes. Until that's relaxed upstream, we ship a fork. The fork is a thin overlay; full layout in docker/PATCHES.md. SGLang does not need a fork.

Citation

If Turbo Attention helps your work, please cite both the underlying TurboQuant paper and this implementation:

@misc{turbo_attention2026,
  title = {Turbo Attention: Production attention backend for TurboQuant KV cache compression},
  author = {Evseev, Dmitri},
  year = {2026},
  url = {https://github.com/arbi-dev/turbo_attn}
}

@inproceedings{zandieh2026turboquant,
  title = {TurboQuant: Near-optimal KV Cache Quantization for LLM Inference},
  author = {Zandieh, Amir and others},
  booktitle = {ICLR},
  year = {2026}
}

License

Mozilla Public License 2.0 (MPL-2.0). See LICENSE and NOTICE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turbo_attn-0.1.2.tar.gz (501.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

turbo_attn-0.1.2-py3-none-any.whl (581.2 kB view details)

Uploaded Python 3

File details

Details for the file turbo_attn-0.1.2.tar.gz.

File metadata

  • Download URL: turbo_attn-0.1.2.tar.gz
  • Upload date:
  • Size: 501.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for turbo_attn-0.1.2.tar.gz
Algorithm Hash digest
SHA256 5de624bd351e5d1c435a3ac5c13be86fb4d392d5b9d67b9edbcd2630f8ceffcf
MD5 e25fb118e1fa9b4da9ee8b6e234c01b6
BLAKE2b-256 dc001e2cb59e7a8a8d75ef8018805b53e391585cc8cd4d5762d30318c1585fae

See more details on using hashes here.

File details

Details for the file turbo_attn-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: turbo_attn-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 581.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for turbo_attn-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 a160954dbc74725e300b5556281d0795a334f18f3de6b263d93b51590689ff9d
MD5 6ccd0cf949f1b5aa3b62ea97e431218d
BLAKE2b-256 12256ed012e941b1118ae6ac8f8086238a9fb3f033d20f418caacf41d5ed02eb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page