Optimized CUDAGraph-enabled kernels and attention backend for vLLM, SGLang, and more, based on TurboQuant near-lossless KV cache compression. SOTA performance with Gemma 4, Qwen 3.6, and other modern LLMs.

Turbo Attention

A modular attention backend for vLLM, SGLang, and HuggingFace Transformers. Custom CUDA + Triton kernels with full CUDAGraph capture, asymmetric K/V quantization, and hybrid-model support. Built on FlashAttention; based on TurboQuant near-lossless KV cache compression.

PyPI: turbo-attn · Import: tqkv · License: MPL-2.0

Install

pip install turbo-attn                  # codec + CUDA/Triton kernels
pip install "turbo-attn[vllm]"          # + vLLM attention backend
pip install "turbo-attn[all]"           # + SGLang, FlashInfer, flash-attn, eval harness

Quickstart

import torch
from tqkv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")
keys = torch.randn(8, 128, device="cuda")

packed, norms = codec.compress_k(keys)
recon = codec.decompress_k(packed, norms)

See examples/ for runnable snippets and ARCHITECTURE.md for a codebase tour.

Two independently-usable pieces

turbo-attn ships two pieces that come as a stack but are designed to be consumed separately:

Codec → any attention backend. TurboKVCodec is a pure, framework-agnostic compressor: compress with TQKV, decompress to bf16 / fp16, hand the result to vanilla flash_attn_varlen_func, FlashInfer, SGLang attention, anything that takes raw KV. See examples/06_tqkv_codec_with_third_party_attention.py.
Kernels → any KV format. The cute-DSL prefill and split-K paged decode kernels are policy-parametric on the K/V format via the Loader extension point. The bundled set is {TqkvLoader, BypassLoader}:
- TqkvLoader: TQKV centroid-based codec dequant (the production path).
- BypassLoader: raw bf16 / fp16 KV, no codec. Useful for apples-to-apples ablations under an otherwise-byte-identical kernel.

Third-party formats (fp8, int8, nvfp4, …) are not shipped; write a sibling Loader for your format. The Loader is the public extension surface; mainloop / scheduler / softmax / epilogue stay turbo-attn's. See docs/writing_a_loader.md for a worked fp8 example.
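To make the Loader split concrete, here is a toy NumPy sketch of the idea: one attention routine that only touches K/V through a loader object. The `Loader` protocol and the `BypassLoader` and `Int8Loader` classes below are hypothetical illustrations, not the turbo-attn interface (that one is described in docs/writing_a_loader.md); they only show why swapping the KV format need not touch the mainloop.

```python
from typing import Protocol

import numpy as np


class Loader(Protocol):
    """Hypothetical extension point: turns stored KV into float tiles."""

    def load(self, stored: object) -> np.ndarray: ...


class BypassLoader:
    """Raw KV, no codec: returns the stored array as float32."""

    def load(self, stored: np.ndarray) -> np.ndarray:
        return stored.astype(np.float32)


class Int8Loader:
    """Toy third-party format: int8 values plus one float scale per row."""

    @staticmethod
    def store(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
        scale = np.where(scale == 0.0, 1.0, scale)  # avoid dividing by zero
        return np.round(x / scale).astype(np.int8), scale.astype(np.float32)

    def load(self, stored: tuple[np.ndarray, np.ndarray]) -> np.ndarray:
        q, scale = stored
        return q.astype(np.float32) * scale


def attention(q, k_stored, v_stored, loader: Loader) -> np.ndarray:
    """Format-agnostic mainloop: softmax(q k^T / sqrt(d)) v."""
    k = loader.load(k_stored)
    v = loader.load(v_stored)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((2, 64)).astype(np.float32)
k = rng.standard_normal((16, 64)).astype(np.float32)
v = rng.standard_normal((16, 64)).astype(np.float32)

exact = attention(q, k, v, BypassLoader())
approx = attention(q, Int8Loader.store(k), Int8Loader.store(v), Int8Loader())
max_err = float(np.abs(exact - approx).max())
print(f"max abs error with the toy int8 loader: {max_err:.4f}")
```

The `attention` function never inspects the storage format, which is the property the bypass-versus-codec ablation relies on.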

Repo layout

tqkv/ — the package (codec, kernels, runtime, vLLM/SGLang plugins, calibration pipeline).
tqkv/kernels/loaders/ — bundled cute-DSL prefill Loaders (tqkv, bypass).
tqkv/kernels/_decode_loader_*.cuh — bundled decode Loaders (TqkvDecodeLoader, BypassDecodeLoader).
docs/, docker/, scripts/, examples/, experiments/ — public docs, deploy recipes, helper scripts, runnable examples, research notes.
The top-level internal/ directory is engineering-only and unsupported — design notes, internal compose files, dev harnesses. The wheel never ships it.

Run with Docker

Three inference servers are supported: vLLM, SGLang, and arbi-serve. Each ships a turn-key Dockerfile. Calibration files for the bit-width / model combo go in a host directory; the TQKV_CALIBRATION_FILE env var inside the container points to one. Examples below use Qwen3.5-0.8B + a K4V4 calibration.

Layout assumed

/path/to/models/Qwen3.5-0.8B/...                       # HF snapshot
/path/to/calibrations/qwen3.5-0.8b_tq4_v3.json         # calibration bundle

All three accept the same CLI flags, --kv-cache-dtype tqkv --attention-backend turbo-attn, plus the environment variables TQKV_BITS=<float> and TQKV_CALIBRATION_FILE=<path>. TQKV_BITS is the average bits-per-element across K and V (e.g. 4.0, 5.0, 6.0); a per-layer Lagrangian solver turns that target into a per-layer (k_bits, v_bits) allocation that lives in the calibration bundle. TQKV_BITS=4.0 does not mean "K and V both at 4 bits"; it means "average 4 bits-per-element under the solver-chosen per-layer allocation".
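As a worked example of that averaging, the sketch below computes the average bits-per-element of a made-up per-layer allocation. The four-layer allocation is invented for illustration; real allocations come from the calibration bundle. Weighting K and V elements equally is an assumption that holds when K and V have the same shape in every layer.

```python
# Hypothetical per-layer (k_bits, v_bits) allocation for a 4-layer model.
allocation = [(6, 4), (5, 3), (4, 3), (4, 3)]


def average_bits(layers: list[tuple[int, int]]) -> float:
    """Mean bits per stored element, counting K and V elements equally."""
    total_bits = sum(k_bits + v_bits for k_bits, v_bits in layers)
    return total_bits / (2 * len(layers))


avg = average_bits(allocation)
print(f"average bits-per-element: {avg}")  # 4.0, although no layer is K4V4
```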

Sibling-checkout layout

All three Dockerfiles COPY from a sibling-repo layout. Clone the relevant repos as siblings of turbo-attn/:

GIT/
├── turbo-attn/        # this repo
├── vllm-fork/         # arbi-dev/vllm        (only needed for vLLM image)
├── sglang-fork/       # arbi-dev/sglang      (only needed for SGLang image)
└── arbi-serve/        # arbi-dev/arbi-serve  (only needed for arbi-serve image)

mkdir -p ~/GIT && cd ~/GIT
git clone https://github.com/arbi-dev/turbo-attn
git clone https://github.com/arbi-dev/vllm       vllm-fork    # for vLLM
git clone https://github.com/arbi-dev/sglang     sglang-fork  # for SGLang
git clone https://github.com/arbi-dev/arbi-serve              # for arbi-serve

vLLM

Uses our vllm-fork rebased onto upstream v0.20.1 (small overlay — CacheDType Literal relaxation, per-group block-pool bookkeeping, named TURBO_ATTN slot in AttentionBackendEnum; full layout in docker/PATCHES.md).

cd ~/GIT/turbo-attn/docker

# build + run
TQKV_MODELS_ROOT=/path/to/models \
  docker compose -f compose.vllm.yaml up -d --build

docker compose -f compose.vllm.yaml logs -f

# serve a request once "Application startup complete" appears:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3.5-0.8B", "prompt": "The capital of France is", "max_tokens": 12, "temperature": 0}'
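The same request can be issued from Python with only the standard library. This is a minimal sketch that assumes the server from the compose file is listening on localhost:8000; `build_completion_request` and `send` are helper names invented here.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 12) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["text"]


req = build_completion_request(
    "http://localhost:8000", "/models/Qwen3.5-0.8B", "The capital of France is"
)
# Once "Application startup complete" has appeared in the logs:
# print(send(req))
```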

Optional env (override on the docker compose command line):

Variable	Default	Purpose
`TQKV_MODEL`	`/models/Qwen3.5-0.8B`	Container-side model path
`TQKV_BITS`	`4.0`	Average bits-per-element target
`TQKV_CALIBRATION_FILE`	unset	Path to a v4 calibration bundle. Unset → uniform K4V4 fallback (tests/CI only; production needs a bundle)
`TQKV_PORT`	`8000`	Host port
`TQKV_GPU_DEVICE`	`0`	`NVIDIA_VISIBLE_DEVICES`
`TQKV_MAX_MODEL_LEN`	`2048`	Max context length

SGLang

Uses our sglang-fork rebased onto upstream v0.5.11 (small overlay — plugin registries for KV-cache dtypes and attention backends; full layout in docker/PATCHES.md).

cd ~/GIT/turbo-attn/docker

TQKV_MODELS_ROOT=/path/to/models \
  docker compose -f compose.sglang.yaml up -d --build

docker compose -f compose.sglang.yaml logs -f

curl http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3.5-0.8B", "prompt": "The capital of France is", "max_tokens": 12, "temperature": 0}'

The env-var contract matches vLLM's, with two differences: TQKV_MAX_MODEL_LEN is replaced by TQKV_CONTEXT_LEN, and TQKV_MEM_FRAC sets --mem-fraction-static (default 0.45).

arbi-serve

Standalone OpenAI-compatible server with TQKV backends as a first-class citizen. Lives in arbi-dev/arbi-serve.

cd ~/GIT/arbi-serve

ARBI_MODELS_ROOT=/path/to/models \
  docker compose up -d --build

docker compose logs -f

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3.5-0.8B", "prompt": "The capital of France is", "max_tokens": 12, "temperature": 0}'

MLA models

For DeepSeek V2/V3/V4, additionally pass -e TQKV_MLA_ENABLE=1 to whichever container you run.

Calibration

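Pre-built calibration bundles for common models are hosted on Hugging Face at arbi-dev/turbo-attn-calibrations. To roll your own, start with centroid calibration; per the Configuration section, a schema-v4 production bundle is then produced by running optimize_quant and migrate_v3_to_v4: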

python -m tqkv.calibration.calibrate_centroids \
    --model Qwen/Qwen3.5-0.8B \
    --output qwen3.5-0.8b_tq4_v3.json \
    --bits 4

How it works

1. Rotate each KV vector with a fast Walsh–Hadamard transform.
2. Normalize: store the magnitude as a single BF16 value.
3. Quantize each rotated coordinate to a shared codebook.

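Attention scores on rotated KV are mathematically identical to attention scores on unrotated KV when the query is rotated by the same matrix, because an orthogonal rotation preserves dot products (up to floating-point rounding); we pre-rotate Q once per request and compute everything in the rotated space.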
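The rotate, normalize, quantize steps and the rotated-space score identity can be sketched in NumPy. This is an illustrative float64 model, not the shipped codec: the uniform 16-level codebook is an assumption standing in for calibrated centroids, norms are kept in float64 rather than BF16, and the helper names are invented here.

```python
import numpy as np


def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform along the last axis."""
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    lead = x.shape[:-1]
    y = x.astype(np.float64)
    h = 1
    while h < n:
        # Butterfly step: combine neighbouring blocks of width h.
        y = y.reshape(*lead, n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = np.stack((a + b, a - b), axis=-2).reshape(*lead, n)
        h *= 2
    return y / np.sqrt(n)


def compress(x: np.ndarray, codebook: np.ndarray):
    rotated = fwht(x)                                         # 1. rotate
    norms = np.linalg.norm(rotated, axis=-1, keepdims=True)   # 2. normalize
    unit = rotated / norms  # assumes no all-zero vectors
    # 3. quantize: nearest shared-codebook entry per coordinate
    idx = np.abs(unit[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), norms


def decompress_rotated(idx, norms, codebook):
    """Reconstruct the vectors in the rotated basis (no inverse rotation)."""
    return codebook[idx] * norms


rng = np.random.default_rng(0)
d = 128
q = rng.standard_normal(d)
keys = rng.standard_normal((8, d))

# Assumed toy codebook: 16 evenly spaced levels (4 bits per coordinate).
codebook = np.linspace(-3.0, 3.0, 16) / np.sqrt(d)
idx, norms = compress(keys, codebook)
keys_rot_hat = decompress_rotated(idx, norms, codebook)

exact = keys @ q                        # scores in the original basis
rotated_exact = fwht(keys) @ fwht(q)    # same scores, both sides rotated
rotated_quant = keys_rot_hat @ fwht(q)  # scores from the compressed keys

print("rotation preserves scores:", np.allclose(exact, rotated_exact))
print("max score error after quantization:", np.abs(exact - rotated_quant).max())
```

Because the keys are never rotated back, only the query pays for a transform at attention time.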

The decode path is one fused CUDA kernel that unpacks, dequantizes, and runs Q·K, online softmax, and P·V in a single pass — no decompress buffer. Prefill has three paths: an FA4 CuTeDSL subclass that dequantizes inline during the MMA pipeline (default), a hand-written CUDA C++ kernel, and a decompress + stock FlashAttention fallback.
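The single-pass structure rests on the online-softmax recurrence, which the following NumPy sketch reproduces for one query. It is a reference model of the technique only, not the CUDA kernel: the block size, function names, and float64 arithmetic are choices made here, and the place where the fused kernel would unpack and dequantize is marked with a comment.

```python
import numpy as np


def attention_reference(q, k, v):
    """Two-pass softmax attention for a single query vector."""
    s = k @ q / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ v


def attention_online(q, k, v, block=4):
    """One pass over KV blocks with a running max, running sum, and output."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    running_max = -np.inf
    running_sum = 0.0
    acc = np.zeros(v.shape[-1])
    for start in range(0, k.shape[0], block):
        # A fused kernel would unpack and dequantize this block here,
        # instead of reading it from a decompressed buffer.
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = (k_blk @ q) * scale
        new_max = max(running_max, s.max())
        correction = np.exp(running_max - new_max)  # rescale earlier partials
        p = np.exp(s - new_max)
        running_sum = running_sum * correction + p.sum()
        acc = acc * correction + p @ v_blk
        running_max = new_max
    return acc / running_sum


rng = np.random.default_rng(0)
q = rng.standard_normal(16)
k = rng.standard_normal((10, 16))  # 10 tokens: the last block is partial
v = rng.standard_normal((10, 16))

out_online = attention_online(q, k, v)
out_ref = attention_reference(q, k, v)
print("max abs difference:", np.abs(out_online - out_ref).max())
```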

Configuration

All runtime configuration is via TQKV_-prefixed environment variables. The supported surface is below; anything unlisted is internal and may change.

Bit width and calibration

Variable	Default	Description
`TQKV_BITS`	`4.0`	Average bits-per-element target across K and V (float in `[2.0, 8.0]`). The runtime looks up the calibration bundle's `byte_budget_table[<TQKV_BITS>]` for the per-layer `(k_bits, v_bits)` allocation. Hard error if the entry is missing — no silent fallback.
`TQKV_CALIBRATION_FILE`	`""`	Path to a schema-v4 calibration bundle (centroids + per-channel scales + `byte_budget_table`). Required for production; bundle generation: `calibrate_centroids` → `optimize_quant` → `migrate_v3_to_v4`. When unset, the plugin falls back to uniform K4V4 (tests/CI only).
`TQKV_AUTO_CALIBRATE_MODEL`	`""`	Model path for plugin-side auto-calibration when `TQKV_CALIBRATION_FILE` doesn't exist on first init.
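The lookup-or-fail behaviour described for `TQKV_BITS` can be sketched as follows. The bundle fragment, the one-decimal string keys, and the `resolve_allocation` helper are all assumptions made for illustration; the real schema-v4 layout may differ.

```python
# Hypothetical bundle fragment: the real schema-v4 layout may differ.
bundle = {
    "byte_budget_table": {
        "4.0": [[6, 4], [5, 3], [4, 3], [4, 3]],
        "5.0": [[6, 5], [6, 4], [5, 4], [5, 5]],
    }
}


def resolve_allocation(bundle: dict, bits_value: str) -> list[tuple[int, int]]:
    """Return per-layer (k_bits, v_bits) for the requested average target."""
    bits = float(bits_value)
    if not 2.0 <= bits <= 8.0:
        raise ValueError(f"TQKV_BITS={bits} is outside [2.0, 8.0]")
    table = bundle["byte_budget_table"]
    key = f"{bits:.1f}"  # assumption: keys are one-decimal strings
    if key not in table:
        # Mirror the documented behaviour: fail loudly, never fall back.
        raise KeyError(f"no byte_budget_table entry for TQKV_BITS={key}")
    return [tuple(pair) for pair in table[key]]


# In a real deployment the value would come from the TQKV_BITS env var.
allocation = resolve_allocation(bundle, "4.0")
print(allocation)
```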

Engine selection

Variable	Default	Description
`TQKV_ENGINE`	`""` (auto)	Decode engine: `native_tq`, `flash_attn`, or `bypass`.
`TQKV_PREFILL_ENGINE`	`fa4`	Prefill path: `fa4`, `adaptive`, or `decomp_fa_main_only` (bench-only — main-token prefill through decomp+FA; decode + MTP verify untouched).
`TQKV_PREFILL_BYPASS`	`1`	First-chunk prefill bypass — skip codec on prompt-prefill, then re-rotate to TQ basis for decode.
`TQKV_FUSE_QROT`	`""` (auto)	Fused Q-rotation prologue. Decode-only.
`TQKV_O_PROJ_FOLD`	`on`	Fold `rotate_output` into `o_proj` weights.
`TQKV_MTP_SPLITK`	`1`	Use split-K decode kernel for MTP layers.
`TQKV_DECODE_SPLITS`	`""` (autotune)	Force decode-kernel split count.
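The idea behind `TQKV_O_PROJ_FOLD` can be illustrated with plain linear algebra: if the attention output lives in a rotated basis, the inverse rotation can be multiplied into the output-projection weights once instead of being applied on every token. The sketch below assumes a single head and a generic orthogonal matrix; how the shipped code lays out heads and which rotation it uses are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, d_model = 8, 16

# Any orthogonal matrix works for the demonstration; QR of a random
# matrix provides one.
rotation, _ = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))
w_o = rng.standard_normal((d_model, head_dim))  # output projection weights
out_rotated = rng.standard_normal(head_dim)     # attention output, rotated basis

# Unfolded: undo the rotation at run time, then project.
unfolded = w_o @ (rotation.T @ out_rotated)

# Folded: bake the inverse rotation into the weights once, offline.
w_o_folded = w_o @ rotation.T
folded = w_o_folded @ out_rotated

print("folded matches unfolded:", np.allclose(unfolded, folded))
```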

Backend behaviour

Variable	Default	Description
`TQKV_NO_JIT`	`0`	Fail if a kernel variant is not pre-compiled.
`TQKV_K_NC`	`1`	Apply norm-correction to K reads in the dequant path.
`TQKV_DISABLE_PRESCALE`	`0`	Disable per-channel pre-scaling on compress upload.
`TQKV_STRICT_NO_SDPA`	`0`	Raise instead of taking the `head_dim>256` SDPA fallback. Recommended for `head_dim>256` deployments.

MLA (DeepSeek V2/V3/V4)

Variable	Default	Description
`TQKV_MLA_ENABLE`	`0`	Master switch for the MLA backend.
`TQKV_MLA_ROPE_HEAD_DIM`	`64`	RoPE head dimension for MLA latent + RoPE split.

Why a vLLM fork (for now)

CacheDType in vllm/config/cache.py is a Pydantic Literal validated at class-definition time, which blocks runtime registration of new KV-cache dtypes. Until that's relaxed upstream, we ship a fork. The fork is a thin overlay; full layout in docker/PATCHES.md. SGLang does not need a fork.

Citation

If Turbo Attention helps your work, please cite both the underlying TurboQuant paper and this implementation:

@misc{turbo_attention2026,
  title = {Turbo Attention: Production attention backend for TurboQuant KV cache compression},
  author = {Evseev, Dmitri},
  year = {2026},
  url = {https://github.com/arbi-dev/turbo-attn}
}

@inproceedings{zandieh2026turboquant,
  title = {TurboQuant: Near-optimal KV Cache Quantization for LLM Inference},
  author = {Zandieh, Amir and others},
  booktitle = {ICLR},
  year = {2026}
}

License

Mozilla Public License 2.0 (MPL-2.0). See LICENSE and NOTICE.
