TurboQuant — PolarQuant KV cache compression for LLM inference
Project description
TurboKV
Compressed KV cache for LLM serving. 4-8x memory reduction with minimal quality loss.
The Problem
When serving an LLM, every token the model has seen is stored in a KV cache that grows with conversation length. Each token stores two vectors (Key and Value) per layer in bf16 (2 bytes per number).
For a 27B model at 250K context, the KV cache alone can consume 40+ GB of GPU memory. This is the main bottleneck for long conversations, large batches, and smaller GPUs.
How It Works
TurboKV compresses KV cache values from 16 bits down to 2-4 bits — a 4-8x memory reduction.
The compression pipeline:
- Rotate the vector with a fast Hadamard transform (spreads information evenly across all dimensions)
- Normalize (store the vector's magnitude separately as a single float)
- Quantize each dimension to one of a small set of values (4 values for TQ2, 16 for TQ4)
The rotation is the key insight. Raw KV values have uneven distributions — some dimensions carry more information than others. After rotation, every dimension follows the same Gaussian distribution, so a single shared codebook achieves near-optimal quantization.
Repository Architecture
The system spans three repositories, each handling a different concern:
turbokv (this repo) — The Compression Library
Framework-agnostic core. Doesn't know about vLLM or FlashInfer.
- Codec — compress/decompress API (
TurboKVCodec) - Codebook computation — optimal quantization levels for Gaussian data via Lloyd's algorithm
- CUDA and Triton kernels — fast compress/decompress with auto-selection (CUDA > Triton > PyTorch)
- Calibration — per-layer centroid optimization from real model data (for models that deviate from Gaussian)
- Header-only C++ library —
include/turbokv/*.cuhcan be included in any CUDA project
flashinfer-fork — Attention Kernels
FlashInfer is a GPU kernel library for attention computation. The fork adds:
- Fused decode kernel — reads compressed KV cache directly with no decompression step. Unpacks bits, looks up centroids, and computes attention in one pass. Runs at 771 GB/s (near hardware ceiling on RTX 5090).
- TQ prefill loader — decompresses KV inline during FlashInfer's existing optimized prefill kernel. No custom prefill attention needed.
vllm-fork — Serving Engine Integration
vLLM is the production serving engine. The fork adds:
- TurboKV attention backend — manages compressed KV cache pages
- CUDAGraph decode — zero CPU overhead, fully graph-captured (385 tok/s on RTX 5090)
- Three engine paths —
native_tq(fused),flash_attn(decompress+FA),flashinfer(decompress+FI)
Key Architecture Decisions
Why three repos?
Each has a different rate of change and different maintainers. Keeping them separate means a vLLM upgrade doesn't break the kernels, and a kernel optimization doesn't require re-testing the serving engine. Thin forwarding shims (shims/) let the forks import from turbokv without code duplication.
Why a fused decode kernel?
Decode is memory-bound — the GPU mostly waits for data, not computes. Decompressing to a buffer first means 3x the memory traffic (read compressed, write decompressed, read again for attention). The fused kernel reads compressed data once and computes in-place. With TQ4, it reads 4x less data than bf16. Result: 2x faster decode.
Why NOT a fused prefill kernel?
Prefill is compute-bound — the GPU is busy doing matrix multiplication on tensor cores, not waiting for memory. Reading 4x less data doesn't help when compute is the bottleneck. We tested a standalone tensor-core prefill kernel and it was 3-10x slower than FlashInfer at realistic lengths. Instead, we plug TQ decompression into FlashInfer's prefill as a custom data loader — FlashInfer handles tiling and pipelining, we just provide decompressed values.
Why are the codebooks Gaussian?
The fast Hadamard transform makes post-rotation KV values converge to a Gaussian distribution (central limit theorem). We verified this empirically — on Qwen3.5, the excess kurtosis is -0.07 (nearly perfect Gaussian). Lloyd's algorithm on Gaussian data gives fixed optimal centroids, so no per-model calibration is needed. The calibration infrastructure exists for models that might deviate, but Qwen3.5 doesn't need it.
Performance
On RTX 5090 (24GB) with Qwen3.5-0.8B, TQ4:
| Metric | Value |
|---|---|
| Decode throughput | 385 tok/s (CUDAGraph) |
| Decode bandwidth | 771 GB/s (43% of peak) |
| Prefill throughput | 15K+ tok/s (FlashInfer inline dequant) |
| KV cache compression | 4x (TQ4) to 8x (TQ2) |
| Max context tested | 250K tokens (needle-in-haystack passing) |
| Quality (TQ4) | cosine similarity > 0.999 vs bf16 |
Quick Start
pip install -e .
from turbokv import TurboKVCodec
codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")
# Compress
k_packed, k_norms = codec.compress_k(key_vectors) # bf16 -> 4-bit packed
v_packed, v_norms = codec.compress_v(value_vectors)
# Decompress
k_recon = codec.decompress_k(k_packed, k_norms) # 4-bit packed -> bf16
For serving with vLLM, see the vllm-fork README.
Project Structure
turbokv/
codec.py # TurboKVCodec API
core.py # Codebook, rotation, bit packing primitives
calibrate_online.py # Lloyd's per-layer optimization
calibrate_static.py # Scale/shift calibration
page_metadata.py # CUDAGraph-safe page metadata (CUDA + Triton)
kernels/
triton_kernels.py # Triton compress/decompress (2-8 bit)
cuda/
include/turbokv/ # Header-only C++ library
standalone_kernels/ # Fused decode kernels (warpspec, wmma)
shims/
flashinfer/ # Forwarding headers for flashinfer-fork
vllm/ # Forwarding imports for vllm-fork
scripts/
calibrate_model.py # Generate per-layer calibration from model weights
gen_codebooks.py # Pre-compute codebook tables
tests/
test_codec.py # Roundtrip, rotation orthogonality
test_kernels.py # CUDA + Triton kernel correctness
Supported Configurations
| Bit Width | Centroids | Compression | Use Case |
|---|---|---|---|
| TQ2 | 4 | 8x | Maximum compression, slight quality trade-off |
| TQ3 | 8 | 5.3x | Balanced |
| TQ4 | 16 | 4x | Recommended default, negligible quality loss |
Supported head dimensions: 64, 128, 256, 512 (zero-padded to next power of 2 for WHT).
License
Apache 2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tqf-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tqf-0.1.0-py3-none-any.whl
- Upload date:
- Size: 113.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b496a2fb6b0a613a3d5914bbd82659da42c0fbadeb5f58c48094afde1bb4baee
|
|
| MD5 |
42fea33f60ad47b8b1d63a6aae4a7826
|
|
| BLAKE2b-256 |
73f1b4a611b4c65950b77c85897d9a5c8183b7c6bed9ee84f9a497210557cd67
|