TurboQuant — PolarQuant KV cache compression for LLM inference

Project description

TurboKV

Compressed KV cache for LLM serving. 4-8x memory reduction with minimal quality loss.

The Problem

When serving an LLM, every token the model has seen is stored in a KV cache that grows with conversation length. Each token stores two vectors (Key and Value) per layer in bf16 (2 bytes per number).

For a 27B model at 250K context, the KV cache alone can consume 40+ GB of GPU memory. This is the main bottleneck for long conversations, large batches, and smaller GPUs.

How It Works

TurboKV compresses KV cache values from 16 bits down to 2-4 bits — a 4-8x memory reduction.

The compression pipeline:

Rotate the vector with a fast Hadamard transform (spreads information evenly across all dimensions)
Normalize (store the vector's magnitude separately as a single float)
Quantize each dimension to one of a small set of values (4 values for TQ2, 16 for TQ4)

The rotation is the key insight. Raw KV values have uneven distributions — some dimensions carry more information than others. After rotation, every dimension follows the same Gaussian distribution, so a single shared codebook achieves near-optimal quantization.

Repository Architecture

The system spans three repositories, each handling a different concern:

turbokv (this repo) — The Compression Library

Framework-agnostic core. Doesn't know about vLLM or FlashInfer.

Codec — compress/decompress API (TurboKVCodec)
Codebook computation — optimal quantization levels for Gaussian data via Lloyd's algorithm
CUDA and Triton kernels — fast compress/decompress with auto-selection (CUDA > Triton > PyTorch)
Calibration — per-layer centroid optimization from real model data (for models that deviate from Gaussian)
Header-only C++ library — include/turbokv/*.cuh can be included in any CUDA project

flashinfer-fork — Attention Kernels

FlashInfer is a GPU kernel library for attention computation. The fork adds:

Fused decode kernel — reads compressed KV cache directly with no decompression step. Unpacks bits, looks up centroids, and computes attention in one pass. Runs at 771 GB/s (near hardware ceiling on RTX 5090).
TQ prefill loader — decompresses KV inline during FlashInfer's existing optimized prefill kernel. No custom prefill attention needed.

vllm-fork — Serving Engine Integration

vLLM is the production serving engine. The fork adds:

TurboKV attention backend — manages compressed KV cache pages
CUDAGraph decode — zero CPU overhead, fully graph-captured (385 tok/s on RTX 5090)
Three engine paths — native_tq (fused), flash_attn (decompress+FA), flashinfer (decompress+FI)

Key Architecture Decisions

Why three repos?

Each has a different rate of change and different maintainers. Keeping them separate means a vLLM upgrade doesn't break the kernels, and a kernel optimization doesn't require re-testing the serving engine. Thin forwarding shims (shims/) let the forks import from turbokv without code duplication.

Why a fused decode kernel?

Decode is memory-bound — the GPU mostly waits for data, not computes. Decompressing to a buffer first means 3x the memory traffic (read compressed, write decompressed, read again for attention). The fused kernel reads compressed data once and computes in-place. With TQ4, it reads 4x less data than bf16. Result: 2x faster decode.

Why NOT a fused prefill kernel?

Prefill is compute-bound — the GPU is busy doing matrix multiplication on tensor cores, not waiting for memory. Reading 4x less data doesn't help when compute is the bottleneck. We tested a standalone tensor-core prefill kernel and it was 3-10x slower than FlashInfer at realistic lengths. Instead, we plug TQ decompression into FlashInfer's prefill as a custom data loader — FlashInfer handles tiling and pipelining, we just provide decompressed values.

Why are the codebooks Gaussian?

The fast Hadamard transform makes post-rotation KV values converge to a Gaussian distribution (central limit theorem). We verified this empirically — on Qwen3.5, the excess kurtosis is -0.07 (nearly perfect Gaussian). Lloyd's algorithm on Gaussian data gives fixed optimal centroids, so no per-model calibration is needed. The calibration infrastructure exists for models that might deviate, but Qwen3.5 doesn't need it.

Performance

On RTX 5090 (24GB) with Qwen3.5-0.8B, TQ4:

Metric	Value
Decode throughput	385 tok/s (CUDAGraph)
Decode bandwidth	771 GB/s (43% of peak)
Prefill throughput	15K+ tok/s (FlashInfer inline dequant)
KV cache compression	4x (TQ4) to 8x (TQ2)
Max context tested	250K tokens (needle-in-haystack passing)
Quality (TQ4)	cosine similarity > 0.999 vs bf16

Quick Start

pip install -e .

from turbokv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")

# Compress
k_packed, k_norms = codec.compress_k(key_vectors)   # bf16 -> 4-bit packed
v_packed, v_norms = codec.compress_v(value_vectors)

# Decompress
k_recon = codec.decompress_k(k_packed, k_norms)     # 4-bit packed -> bf16

For serving with vLLM, see the vllm-fork README.

Project Structure

turbokv/
    codec.py                 # TurboKVCodec API
    core.py                  # Codebook, rotation, bit packing primitives
    calibrate_online.py      # Lloyd's per-layer optimization
    calibrate_static.py      # Scale/shift calibration
    page_metadata.py         # CUDAGraph-safe page metadata (CUDA + Triton)
    kernels/
        triton_kernels.py    # Triton compress/decompress (2-8 bit)
        cuda/
            include/turbokv/   # Header-only C++ library
            standalone_kernels/   # Fused decode kernels (warpspec, wmma)
shims/
    flashinfer/              # Forwarding headers for flashinfer-fork
    vllm/                    # Forwarding imports for vllm-fork
scripts/
    calibrate_model.py       # Generate per-layer calibration from model weights
    gen_codebooks.py         # Pre-compute codebook tables
tests/
    test_codec.py            # Roundtrip, rotation orthogonality
    test_kernels.py          # CUDA + Triton kernel correctness

Supported Configurations

Bit Width	Centroids	Compression	Use Case
TQ2	4	8x	Maximum compression, slight quality trade-off
TQ3	8	5.3x	Balanced
TQ4	16	4x	Recommended default, negligible quality loss

Supported head dimensions: 64, 128, 256, 512 (zero-padded to next power of 2 for WHT).

License

Apache 2.0

Project details

Release history Release notifications | RSS feed

This version

0.1.0

Apr 6, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

tqf-0.1.0-py3-none-any.whl (113.4 kB view details)

Uploaded Apr 6, 2026 Python 3

File details

Details for the file tqf-0.1.0-py3-none-any.whl.

File metadata

Download URL: tqf-0.1.0-py3-none-any.whl
Upload date: Apr 6, 2026
Size: 113.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for tqf-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b496a2fb6b0a613a3d5914bbd82659da42c0fbadeb5f58c48094afde1bb4baee`
MD5	`42fea33f60ad47b8b1d63a6aae4a7826`
BLAKE2b-256	`73f1b4a611b4c65950b77c85897d9a5c8183b7c6bed9ee84f9a497210557cd67`

See more details on using hashes here.

tqf 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta