TurboQuant — PolarQuant KV cache compression for LLM inference

TurboKV

Compressed KV cache for LLM serving. 4-8x memory reduction with minimal quality loss.

The Problem

When serving an LLM, every token the model has seen is stored in a KV cache that grows with conversation length. Each token stores two vectors (Key and Value) per layer in bf16 (2 bytes per number).

For a 27B model at 250K context, the KV cache alone can consume 40+ GB of GPU memory. This is the main bottleneck for long conversations, large batches, and smaller GPUs.
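
As a rough illustration of where a figure of this size comes from, the sketch below multiplies out the per-token KV footprint. The layer count, KV-head count, and head dimension are hypothetical placeholders, not the configuration of any particular 27B model.

```python
# Back-of-the-envelope KV cache size.
# All model dimensions used below are hypothetical, chosen for illustration only.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to cache K and V for num_tokens tokens (bf16 = 2 bytes per value)."""
    # Leading factor 2: one Key vector and one Value vector per head, per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


total = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128, num_tokens=250_000)
print(f"{total / 1e9:.1f} GB in bf16")
```

With these placeholder dimensions each token costs 196,608 bytes, which comes to about 49 GB at 250K tokens.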

How It Works

TurboKV compresses KV cache values from 16 bits down to 2-4 bits — a 4-8x memory reduction.

The compression pipeline:

  1. Rotate the vector with a fast Hadamard transform (spreads information evenly across all dimensions)
  2. Normalize (store the vector's magnitude separately as a single float)
  3. Quantize each dimension to one of a small set of values (4 values for TQ2, 16 for TQ4)

The rotation is the key insight. Raw KV values have uneven distributions: some dimensions carry far more information than others. After rotation, every dimension follows approximately the same Gaussian distribution, so a single shared codebook achieves near-optimal quantization.
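
The three steps can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: the function names are made up for this example, the 2-bit levels are the textbook Lloyd-Max values for a unit-variance Gaussian, and rescaling by sqrt(n) before the codebook lookup is an assumption about how the normalized vector is matched to that codebook.

```python
import math
import random

# Textbook Lloyd-Max levels for a unit-variance Gaussian at 2 bits (4 levels).
# The library's own codebook tables may differ.
GAUSS_2BIT = [-1.510, -0.4528, 0.4528, 1.510]


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def compress(vec, levels=GAUSS_2BIT):
    rotated = fwht(vec)                                # 1. rotate
    norm = math.sqrt(sum(v * v for v in rotated))      # 2. magnitude, stored separately
    # Assumption: rescale so each coordinate has roughly unit variance.
    s = math.sqrt(len(vec)) / norm if norm > 0 else 0.0
    codes = [                                          # 3. nearest codebook level
        min(range(len(levels)), key=lambda k: abs(v * s - levels[k]))
        for v in rotated
    ]
    return codes, norm


def decompress(codes, norm, levels=GAUSS_2BIT):
    s = norm / math.sqrt(len(codes))
    rotated = [levels[c] * s for c in codes]
    return fwht(rotated)  # the orthonormal WHT is its own inverse


random.seed(0)
vec = [random.gauss(0.0, 1.0) for _ in range(128)]
codes, norm = compress(vec)
recon = decompress(codes, norm)
```

Because the transform is orthonormal, the stored norm equals the norm of the original vector, and the same transform undoes the rotation on the way back.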

Repository Architecture

The system spans three repositories, each handling a different concern:

turbokv (this repo) — The Compression Library

Framework-agnostic core. Doesn't know about vLLM or FlashInfer.

  • Codec — compress/decompress API (TurboKVCodec)
  • Codebook computation — optimal quantization levels for Gaussian data via Lloyd's algorithm
  • CUDA and Triton kernels — fast compress/decompress with auto-selection (CUDA > Triton > PyTorch)
  • Calibration — per-layer centroid optimization from real model data (for models that deviate from Gaussian)
  • Header-only C++ library: include/turbokv/*.cuh can be included in any CUDA project
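
To make the codebook computation concrete, here is a small one-dimensional Lloyd's algorithm run on Gaussian samples. It is a sketch under stated assumptions: `lloyd_1d` is a hypothetical name, the quantile initialization and iteration count are arbitrary choices, and the library's own codebook code may work differently (for example, from the analytic Gaussian density rather than samples).

```python
import random


def lloyd_1d(samples, n_levels, iters=60):
    """1-D Lloyd's algorithm: alternate nearest-centroid assignment and cell means."""
    xs = sorted(samples)
    # Start from evenly spaced quantiles of the data.
    cents = [xs[int((i + 0.5) * len(xs) / n_levels)] for i in range(n_levels)]
    for _ in range(iters):
        # In 1-D the nearest-centroid cells are intervals split at centroid midpoints.
        bounds = [(a + b) / 2 for a, b in zip(cents, cents[1:])]
        sums = [0.0] * n_levels
        counts = [0] * n_levels
        k = 0
        for x in xs:  # xs is sorted, so the cell index only moves forward
            while k < len(bounds) and x > bounds[k]:
                k += 1
            sums[k] += x
            counts[k] += 1
        # Move each centroid to the mean of its cell (keep it if the cell is empty).
        cents = [s / c if c else old for s, c, old in zip(sums, counts, cents)]
    return cents


random.seed(0)
gaussian = [random.gauss(0.0, 1.0) for _ in range(20000)]
levels_2bit = lloyd_1d(gaussian, 4)
print([round(c, 3) for c in levels_2bit])
```

On unit-variance Gaussian samples the four levels settle close to the classic Lloyd-Max values of roughly ±0.45 and ±1.51, up to sampling noise.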

flashinfer-fork — Attention Kernels

FlashInfer is a GPU kernel library for attention computation. The fork adds:

  • Fused decode kernel: reads compressed KV cache directly with no decompression step. Unpacks bits, looks up centroids, and computes attention in one pass. Runs at 771 GB/s on RTX 5090 (about 43% of peak memory bandwidth).
  • TQ prefill loader — decompresses KV inline during FlashInfer's existing optimized prefill kernel. No custom prefill attention needed.
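
The "unpack bits, look up centroids" step can be illustrated in scalar Python. This is only a sketch of the idea: the real kernel is CUDA and fuses this with attention, the function names are hypothetical, and the nibble order (low nibble first) is an assumption, not the documented storage layout.

```python
def pack_4bit(codes):
    """Pack 4-bit codes two per byte. Low-nibble-first order is an assumption."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def unpack_and_lookup(packed, centroids, scale):
    """Unpack two codes per byte and map each to its centroid times the stored scale."""
    out = []
    for byte in packed:
        out.append(centroids[byte & 0xF] * scale)  # low nibble
        out.append(centroids[byte >> 4] * scale)   # high nibble
    return out


codes = [0, 15, 7, 8]
packed = pack_4bit(codes)
toy_centroids = [float(i) for i in range(16)]  # placeholder table, not a real codebook
values = unpack_and_lookup(packed, toy_centroids, scale=0.5)
```

A fused kernel would feed each looked-up value straight into the attention dot product instead of building an output list.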

vllm-fork — Serving Engine Integration

vLLM is the production serving engine. The fork adds:

  • TurboKV attention backend — manages compressed KV cache pages
  • CUDAGraph decode — zero CPU overhead, fully graph-captured (385 tok/s on RTX 5090)
  • Three engine paths: native_tq (fused), flash_attn (decompress+FA), flashinfer (decompress+FI)

Key Architecture Decisions

Why three repos?

Each has a different rate of change and different maintainers. Keeping them separate means a vLLM upgrade doesn't break the kernels, and a kernel optimization doesn't require re-testing the serving engine. Thin forwarding shims (shims/) let the forks import from turbokv without code duplication.

Why a fused decode kernel?

Decode is memory-bound: the GPU spends most of its time waiting for data rather than computing. Decompressing to a buffer first takes three memory passes (read compressed, write decompressed, read again for attention). The fused kernel reads the compressed data once and computes attention from it directly. With TQ4, it reads roughly 4x less data than bf16. Result: 2x faster decode.
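
The byte counts behind this argument can be worked out for a single vector. The sketch assumes the separately stored norm is a 4-byte float; the source only says "a single float", so the exact overhead may differ.

```python
def bytes_per_vector(head_dim, bits, norm_bytes=4):
    """Packed payload plus one separately stored norm (4-byte float is an assumption)."""
    return head_dim * bits // 8 + norm_bytes


head_dim = 128
bf16_bytes = head_dim * 2                      # uncompressed: 2 bytes per value
tq4_bytes = bytes_per_vector(head_dim, bits=4)

fused_traffic = tq4_bytes                              # read compressed once
staged_traffic = tq4_bytes + bf16_bytes + bf16_bytes   # read compressed, write bf16, read bf16

print(bf16_bytes, tq4_bytes, fused_traffic, staged_traffic)
```

Under these assumptions a 128-dimensional TQ4 vector occupies 68 bytes against 256 for bf16, so the norm overhead puts the effective ratio slightly below the nominal 4x. The staged path moves more bytes than a plain bf16 read, which is why decompressing to a buffer gives no bandwidth benefit at decode time.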

Why NOT a fused prefill kernel?

Prefill is compute-bound — the GPU is busy doing matrix multiplication on tensor cores, not waiting for memory. Reading 4x less data doesn't help when compute is the bottleneck. We tested a standalone tensor-core prefill kernel and it was 3-10x slower than FlashInfer at realistic lengths. Instead, we plug TQ decompression into FlashInfer's prefill as a custom data loader — FlashInfer handles tiling and pipelining, we just provide decompressed values.

Why are the codebooks Gaussian?

The fast Hadamard transform makes post-rotation KV values converge to a Gaussian distribution (central limit theorem). We verified this empirically — on Qwen3.5, the excess kurtosis is -0.07 (nearly perfect Gaussian). Lloyd's algorithm on Gaussian data gives fixed optimal centroids, so no per-model calibration is needed. The calibration infrastructure exists for models that might deviate, but Qwen3.5 doesn't need it.
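
A check of this kind can be reproduced on synthetic data. The sketch below builds a vector with a few large outlier channels, applies an orthonormal Walsh-Hadamard transform, and compares excess kurtosis before and after. The data is synthetic and the helper names are illustrative; it does not reproduce the Qwen3.5 measurement.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in x]


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3; zero for an exact Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((v - mean) ** 2 for v in xs) / n
    m4 = sum((v - mean) ** 4 for v in xs) / n
    return m4 / (m2 * m2) - 3.0


random.seed(0)
vec = [random.gauss(0.0, 1.0) for _ in range(1024)]
for idx in random.sample(range(1024), 8):      # a few synthetic outlier channels
    vec[idx] = 20.0 * random.choice([-1.0, 1.0])

before = excess_kurtosis(vec)
after = excess_kurtosis(fwht(vec))
print(f"excess kurtosis before: {before:.2f}, after: {after:.2f}")
```

The outlier channels give the raw vector a strongly heavy-tailed profile, while the rotated coordinates land much closer to the Gaussian value of zero.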

Performance

On RTX 5090 (32 GB) with Qwen3.5-0.8B, TQ4:

| Metric | Value |
| --- | --- |
| Decode throughput | 385 tok/s (CUDAGraph) |
| Decode bandwidth | 771 GB/s (43% of peak) |
| Prefill throughput | 15K+ tok/s (FlashInfer inline dequant) |
| KV cache compression | 4x (TQ4) to 8x (TQ2) |
| Max context tested | 250K tokens (needle-in-haystack passing) |
| Quality (TQ4) | cosine similarity > 0.999 vs bf16 |

Quick Start

pip install -e .

from turbokv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")

# Compress
k_packed, k_norms = codec.compress_k(key_vectors)   # bf16 -> 4-bit packed
v_packed, v_norms = codec.compress_v(value_vectors)

# Decompress
k_recon = codec.decompress_k(k_packed, k_norms)     # 4-bit packed -> bf16

For serving with vLLM, see the vllm-fork README.

Project Structure

turbokv/
    codec.py                 # TurboKVCodec API
    core.py                  # Codebook, rotation, bit packing primitives
    calibrate_online.py      # Lloyd's per-layer optimization
    calibrate_static.py      # Scale/shift calibration
    page_metadata.py         # CUDAGraph-safe page metadata (CUDA + Triton)
    kernels/
        triton_kernels.py    # Triton compress/decompress (2-8 bit)
        cuda/
            include/turbokv/   # Header-only C++ library
            standalone_kernels/   # Fused decode kernels (warpspec, wmma)
shims/
    flashinfer/              # Forwarding headers for flashinfer-fork
    vllm/                    # Forwarding imports for vllm-fork
scripts/
    calibrate_model.py       # Generate per-layer calibration from model weights
    gen_codebooks.py         # Pre-compute codebook tables
tests/
    test_codec.py            # Roundtrip, rotation orthogonality
    test_kernels.py          # CUDA + Triton kernel correctness

Supported Configurations

| Bit Width | Centroids | Compression | Use Case |
| --- | --- | --- | --- |
| TQ2 | 4 | 8x | Maximum compression, slight quality trade-off |
| TQ3 | 8 | 5.3x | Balanced |
| TQ4 | 16 | 4x | Recommended default, negligible quality loss |

Supported head dimensions: 64, 128, 256, 512. The Walsh-Hadamard transform (WHT) requires a power-of-2 length, so inputs are zero-padded to the next power of 2.
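
The padding rule amounts to the following sketch. The helper names are hypothetical, and the head dimension of 96 in the example is not one of the listed sizes; it is only there to show a case where padding actually changes the length.

```python
def next_pow2(n):
    """Smallest power of 2 that is >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def pad_for_wht(vec):
    """Zero-pad a vector so its length is a power of 2, as the WHT requires."""
    padded_len = next_pow2(len(vec))
    return list(vec) + [0.0] * (padded_len - len(vec))


print(next_pow2(128), next_pow2(96), len(pad_for_wht([1.0] * 96)))
```

For the listed sizes the padding is a no-op, since each is already a power of 2.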

License

Apache 2.0
