Optimized CUDAGraph-enabled kernels and an attention backend for vLLM, SGLang, and more, based on TurboQuant near-lossless KV cache compression. SOTA performance with Gemma 4, Qwen 3.6 and other modern LLMs.

Turbo Attention

A modular attention backend for vLLM, SGLang, and HuggingFace Transformers. Custom CUDA + Triton kernels with full CUDAGraph capture, asymmetric K/V quantization, and hybrid-model support. Built on FlashAttention and based on TurboQuant near-lossless KV cache compression.

PyPI: turbo-attn · Import: tqkv · License: MPL-2.0

Install

pip install turbo-attn                  # codec + CUDA/Triton kernels
pip install "turbo-attn[vllm]"          # + vLLM attention backend
pip install "turbo-attn[all]"           # + SGLang, FlashInfer, flash-attn, eval harness

Quickstart

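import torch
from tqkv import TurboKVCodec

# 4-bit codec for 128-dimensional attention heads
codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")
keys = torch.randn(8, 128, device="cuda")  # 8 key vectors

# Compress to packed codes plus norms, then reconstruct
packed, norms = codec.compress_k(keys)
recon = codec.decompress_k(packed, norms)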

See examples/ for runnable snippets and ARCHITECTURE.md for a codebase tour.

Repo layout

tqkv/ — the package (codec, kernels, runtime, vLLM/SGLang plugins, calibration pipeline).
docs/, docker/, scripts/, examples/, experiments/ — public docs, deploy recipes, helper scripts, runnable examples, research notes.
Anything under an internal/ subdirectory is engineering-only and unsupported. The wheel never ships these.

vLLM

vLLM serving currently requires the vLLM fork (turbo-attn branch). The fork wires "tqkv" through vLLM's CacheDType Literal and adds per-group block-pool bookkeeping for hybrid models. The patch layout is described in docker/PATCHES.md.

vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom

For MLA models (DeepSeek V2/V3) set TQKV_MLA_ENABLE=1.

SGLang

No fork required — tqkv.integrations.sglang.register() installs a pool factory and wires the tqkv attention backend.

import tqkv.integrations.sglang as tqkv_sglang
tqkv_sglang.register()

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend tqkv

How it works

Rotate each KV vector with a fast Walsh–Hadamard transform.
Normalize — store the magnitude as a single BF16 value.
Quantize each rotated coordinate to a shared codebook.
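The three steps above can be sketched in pure Python. This is only an illustration under stated assumptions: the uniform codebook, the `limit` parameter, and the function names `fwht`, `make_codebook`, `compress`, and `decompress` are hypothetical and are not the tqkv API; the norm is kept as a Python float rather than BF16, and nothing is bit-packed.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def make_codebook(bits, limit=1.0):
    """Hypothetical uniform codebook with 2**bits levels on [-limit, limit]."""
    levels = 2 ** bits
    step = 2.0 * limit / (levels - 1)
    return [-limit + i * step for i in range(levels)]

def compress(vec, codebook):
    rotated = fwht(vec)                                   # 1. rotate
    norm = math.sqrt(sum(x * x for x in rotated))         # 2. normalize
    unit = [x / norm for x in rotated] if norm > 0 else rotated
    codes = [                                             # 3. nearest codebook entry
        min(range(len(codebook)), key=lambda j: abs(codebook[j] - x))
        for x in unit
    ]
    return codes, norm

def decompress(codes, norm, codebook):
    rotated = [codebook[c] * norm for c in codes]
    return fwht(rotated)  # the orthonormal transform is its own inverse

codebook = make_codebook(bits=4)
codes, norm = compress([1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.5], codebook)
recon = decompress(codes, norm, codebook)
```

A single codebook can be shared across coordinates because, after rotation and normalization, every coordinate lies in the same bounded range.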

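Because the rotation is orthogonal, attention scores computed on rotated keys match those computed on unrotated keys when the query is rotated by the same matrix: (Rq)·(Rk) = q·k exactly in real arithmetic, and up to floating-point rounding in practice. We pre-rotate Q once per request and compute everything in the rotated space.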
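The invariance of the score under a shared rotation can be checked numerically. The sketch below builds an explicit orthonormal Sylvester-Hadamard matrix; the helper names `hadamard`, `matvec`, and `dot` are illustrative only and do not come from tqkv.

```python
import math

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix of size n (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    rows = [[1.0]]
    while len(rows) < n:
        # H_2n = [[H, H], [H, -H]]
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in r] for r in rows]

def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rotation = hadamard(4)
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]

plain_score = dot(q, k)
rotated_score = dot(matvec(rotation, q), matvec(rotation, k))
```

Here `plain_score` and `rotated_score` agree to within floating-point rounding, which is why the query can be rotated once and the keys never need to be rotated back.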

The decode path is one fused CUDA kernel that unpacks, dequantizes, and runs Q·K, online softmax, and P·V in a single pass — no decompress buffer. Prefill has three paths: an FA4 CuTeDSL subclass that dequantizes inline during the MMA pipeline (default), a hand-written CUDA C++ kernel, and a decompress + stock FlashAttention fallback.
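The online-softmax recurrence that makes a single pass possible can be sketched in pure Python. This is not the fused CUDA kernel: it assumes the keys and values are already dequantized floats, handles one query, and the names `attend_online` and `attend_reference` are hypothetical.

```python
import math

def attend_online(q, keys, values):
    """Single-pass attention for one query; needs at least one key/value pair."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(a * b for a, b in zip(q, k))        # Q.K for this position
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)    # rescale earlier partial sums
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * x for a, x in zip(acc, v)]  # P.V update
        running_max = new_max
    return [a / denom for a in acc]

def attend_reference(q, keys, values):
    """Conventional softmax over all scores, used only for comparison."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

q = [0.2, -0.4]
keys = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]]
values = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.0, 2.0, 2.0]]
online = attend_online(q, keys, values)
reference = attend_reference(q, keys, values)
```

Because each key/value pair is consumed once and only a running maximum, a denominator, and an accumulator are kept, no buffer of scores or decompressed vectors has to be materialized.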

Configuration

All runtime configuration is via TQKV_-prefixed environment variables. The supported surface is below; anything unlisted is internal and may change.

Bit width and calibration

Variable	Default	Description
`TQKV_BITS`	`4`	Symmetric K/V bit width (2–8). Used as the fallback for `TQKV_K_BITS`/`TQKV_V_BITS` when those are unset.
`TQKV_K_BITS` / `TQKV_V_BITS`	inherits `TQKV_BITS`	Asymmetric K/V override.
`TQKV_LAYER_BITS`	`""`	Per-layer override string (e.g. `0:8,8;5:2,4`).
`TQKV_CALIBRATION_FILE`	`""`	Path to a calibration JSON bundle from `python -m tqkv.calibration.calibrate_model`.
`TQKV_ALLOCATION_FILE`	`""`	Path to a per-layer bit-allocation file from `python -m tqkv.calibration.solve_bits`.
`TQKV_AUTO_CALIBRATE_MODEL`	`""`	Model path for plugin-side auto-calibration on first init.
`TQKV_PROFILE`	`none`	Calibration profile from the bundle: `lossless`, `balanced`, `aggressive`.
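The inheritance between the bit-width variables can be sketched as follows. This is a hypothetical resolver, not the package's actual parsing code: the function names are made up, and reading the `TQKV_LAYER_BITS` entries as `layer:k_bits,v_bits` is an assumption based only on the example string in the table.

```python
import os

def _check(bits):
    # The table documents a 2-8 range for TQKV_BITS; assumed here for all widths.
    if not 2 <= bits <= 8:
        raise ValueError(f"bit width {bits} is outside 2-8")
    return bits

def resolve_bits(env=None):
    """Return (k_bits, v_bits); K/V overrides inherit TQKV_BITS when unset or empty."""
    env = os.environ if env is None else env
    base = int(env.get("TQKV_BITS") or "4")
    k_bits = int(env.get("TQKV_K_BITS") or base)
    v_bits = int(env.get("TQKV_V_BITS") or base)
    return _check(k_bits), _check(v_bits)

def parse_layer_bits(spec):
    """Parse e.g. '0:8,8;5:2,4' into {0: (8, 8), 5: (2, 4)} (assumed K,V order)."""
    overrides = {}
    for entry in filter(None, (part.strip() for part in spec.split(";"))):
        layer, _, pair = entry.partition(":")
        k_bits, v_bits = (int(x) for x in pair.split(","))
        overrides[int(layer)] = (_check(k_bits), _check(v_bits))
    return overrides

default_bits = resolve_bits({})
asymmetric_bits = resolve_bits({"TQKV_BITS": "3", "TQKV_K_BITS": "8"})
layer_overrides = parse_layer_bits("0:8,8;5:2,4")
```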

Engine selection

Variable	Default	Description
`TQKV_ENGINE`	`""` (auto)	Decode engine: `native_tq`, `flash_attn`, or `bypass`.
`TQKV_PREFILL_ENGINE`	`fa4`	Prefill path: `fa4` or `adaptive`.
`TQKV_PREFILL_BYPASS`	`1`	First-chunk prefill bypass — skip codec on prompt-prefill, then re-rotate to TQ basis for decode.
`TQKV_FUSE_QROT`	`""` (auto)	Fused Q-rotation prologue. Decode-only.
`TQKV_O_PROJ_FOLD`	`on`	Fold `rotate_output` into `o_proj` weights.
`TQKV_MTP_SPLITK`	`1`	Use split-K decode kernel for MTP layers.
`TQKV_DECODE_SPLITS`	`""` (autotune)	Force decode-kernel split count.
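Since the engine choices are plain strings in environment variables, a launcher script may want to validate them before starting a server. The helper below is a hypothetical sketch, not part of tqkv; it checks only the two variables whose allowed values are listed in the table above.

```python
import os

# Allowed values copied from the table above; "" means auto-selection.
ALLOWED = {
    "TQKV_ENGINE": {"", "native_tq", "flash_attn", "bypass"},
    "TQKV_PREFILL_ENGINE": {"fa4", "adaptive"},
}

def engine_env(base=None, **overrides):
    """Return a copy of the environment with validated TQKV_ overrides applied."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        allowed = ALLOWED.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
        env[name] = value
    return env

env = engine_env({}, TQKV_ENGINE="native_tq", TQKV_PREFILL_ENGINE="adaptive")
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the serving process.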

Backend behaviour

Variable	Default	Description
`TQKV_NO_JIT`	`0`	Fail if a kernel variant is not pre-compiled.
`TQKV_K_NC`	`1`	Apply norm-correction to K reads in the dequant path.
`TQKV_DISABLE_PRESCALE`	`0`	Disable per-channel pre-scaling on compress upload.
`TQKV_STRICT_NO_SDPA`	`0`	Raise instead of taking the `head_dim>256` SDPA fallback. Recommended for `head_dim>256` deployments.

MLA (DeepSeek V2/V3/V4)

Variable	Default	Description
`TQKV_MLA_ENABLE`	`0`	Master switch for the MLA backend.
`TQKV_MLA_ROPE_HEAD_DIM`	`64`	RoPE head dimension for MLA latent + RoPE split.

Why a vLLM fork (for now)

CacheDType in vllm/config/cache.py is a Pydantic Literal validated at class-definition time, which blocks runtime registration of new KV-cache dtypes. Until that's relaxed upstream, we ship a fork. The fork is a thin overlay; full layout in docker/PATCHES.md. SGLang does not need a fork.
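The constraint can be illustrated with the standard library alone. In this sketch the two Literal types are illustrative stand-ins (the real CacheDType in vLLM lists other values), and `validate` is a hypothetical substitute for the check Pydantic performs.

```python
from typing import Literal, get_args

# Illustrative subset only; not vLLM's actual list of cache dtypes.
CacheDType = Literal["auto", "fp8"]

# What a source-level patch effectively does: extend the Literal itself.
PatchedCacheDType = Literal["auto", "fp8", "tqkv"]

def validate(value, literal_type):
    """Accept a value only if it is one of the Literal's fixed members."""
    allowed = get_args(literal_type)
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {allowed}")
    return value

try:
    validate("tqkv", CacheDType)
    accepted_upstream = True
except ValueError:
    accepted_upstream = False

accepted_patched = validate("tqkv", PatchedCacheDType) == "tqkv"
```

The members of a Literal are fixed where the type is written, so a plugin loaded later has no hook to add a new one; changing the definition in source, as the fork does, is the direct route.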

Citation

If Turbo Attention helps your work, please cite both the underlying TurboQuant paper and this implementation:

@misc{turbo_attention2026,
  title = {Turbo Attention: Production attention backend for TurboQuant KV cache compression},
  author = {Evseev, Dmitri},
  year = {2026},
  url = {https://github.com/arbi-dev/turbo_attn}
}

@inproceedings{zandieh2026turboquant,
  title = {TurboQuant: Near-optimal KV Cache Quantization for LLM Inference},
  author = {Zandieh, Amir and others},
  booktitle = {ICLR},
  year = {2026}
}

License

Mozilla Public License 2.0 (MPL-2.0). See LICENSE and NOTICE.

Release history

0.6.4 (Jun 13, 2026)
0.6.3 (Jun 13, 2026)
0.6.2 (Jun 12, 2026)
0.6.1 (Jun 12, 2026)
0.6.0 (Jun 12, 2026)
0.5.1 (Jun 12, 2026)
0.5.0 (Jun 11, 2026)
0.4.1 (Jun 11, 2026)
0.4.0 (Jun 10, 2026)
0.3.2 (Jun 10, 2026)
0.3.1 (Jun 10, 2026)
0.3.0 (Jun 10, 2026)
0.2.0 (Jun 2, 2026)
0.1.2 (Apr 30, 2026), this version
0.1.1 (Apr 30, 2026)
0.1.0 (Apr 30, 2026), yanked. Reason: wrong license info

Download files

Source Distribution

turbo_attn-0.1.2.tar.gz (501.3 kB view details)

Uploaded Apr 30, 2026 Source

Built Distribution

turbo_attn-0.1.2-py3-none-any.whl (581.2 kB view details)

Uploaded Apr 30, 2026 Python 3

File details

Details for the file turbo_attn-0.1.2.tar.gz.

File metadata

Download URL: turbo_attn-0.1.2.tar.gz
Upload date: Apr 30, 2026
Size: 501.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for turbo_attn-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`5de624bd351e5d1c435a3ac5c13be86fb4d392d5b9d67b9edbcd2630f8ceffcf`
MD5	`e25fb118e1fa9b4da9ee8b6e234c01b6`
BLAKE2b-256	`dc001e2cb59e7a8a8d75ef8018805b53e391585cc8cd4d5762d30318c1585fae`

File details

Details for the file turbo_attn-0.1.2-py3-none-any.whl.

File metadata

Download URL: turbo_attn-0.1.2-py3-none-any.whl
Upload date: Apr 30, 2026
Size: 581.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for turbo_attn-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a160954dbc74725e300b5556281d0795a334f18f3de6b263d93b51590689ff9d`
MD5	`6ccd0cf949f1b5aa3b62ea97e431218d`
BLAKE2b-256	`12256ed012e941b1118ae6ac8f8086238a9fb3f033d20f418caacf41d5ed02eb`
