Skip to main content

TurboQuant — PolarQuant KV cache compression for LLM inference

Project description

TurboKV

Compressed KV cache for LLM serving. 4-8x memory reduction with minimal quality loss.

The Problem

When serving an LLM, every token the model has seen is stored in a KV cache that grows with conversation length. Each token stores two vectors (Key and Value) per layer in bf16 (2 bytes per number).

For a 27B model at 250K context, the KV cache alone can consume 40+ GB of GPU memory. This is the main bottleneck for long conversations, large batches, and smaller GPUs.

How It Works

TurboKV compresses KV cache values from 16 bits down to 2-4 bits — a 4-8x memory reduction.

The compression pipeline:

  1. Rotate the vector with a fast Hadamard transform (spreads information evenly across all dimensions)
  2. Normalize (store the vector's magnitude separately as a single float)
  3. Quantize each dimension to one of a small set of values (4 values for TQ2, 16 for TQ4)

The rotation is the key insight. Raw KV values have uneven distributions — some dimensions carry more information than others. After rotation, every dimension follows the same Gaussian distribution, so a single shared codebook achieves near-optimal quantization.

Repository Architecture

The system spans three repositories, each handling a different concern:

turbokv (this repo) — The Compression Library

Framework-agnostic core. Doesn't know about vLLM or FlashInfer.

  • Codec — compress/decompress API (TurboKVCodec)
  • Codebook computation — optimal quantization levels for Gaussian data via Lloyd's algorithm
  • CUDA and Triton kernels — fast compress/decompress with auto-selection (CUDA > Triton > PyTorch)
  • Calibration — per-layer centroid optimization from real model data (for models that deviate from Gaussian)
  • Header-only C++ libraryinclude/turbokv/*.cuh can be included in any CUDA project

flashinfer-fork — Attention Kernels

FlashInfer is a GPU kernel library for attention computation. The fork adds:

  • Fused decode kernel — reads compressed KV cache directly with no decompression step. Unpacks bits, looks up centroids, and computes attention in one pass. Runs at 771 GB/s (near hardware ceiling on RTX 5090).
  • TQ prefill loader — decompresses KV inline during FlashInfer's existing optimized prefill kernel. No custom prefill attention needed.

vllm-fork — Serving Engine Integration

vLLM is the production serving engine. The fork adds:

  • TurboKV attention backend — manages compressed KV cache pages
  • CUDAGraph decode — zero CPU overhead, fully graph-captured (385 tok/s on RTX 5090)
  • Three engine pathsnative_tq (fused), flash_attn (decompress+FA), flashinfer (decompress+FI)

Key Architecture Decisions

Why three repos?

Each has a different rate of change and different maintainers. Keeping them separate means a vLLM upgrade doesn't break the kernels, and a kernel optimization doesn't require re-testing the serving engine. Thin forwarding shims (shims/) let the forks import from turbokv without code duplication.

Why a fused decode kernel?

Decode is memory-bound — the GPU mostly waits for data, not computes. Decompressing to a buffer first means 3x the memory traffic (read compressed, write decompressed, read again for attention). The fused kernel reads compressed data once and computes in-place. With TQ4, it reads 4x less data than bf16. Result: 2x faster decode.

Why NOT a fused prefill kernel?

Prefill is compute-bound — the GPU is busy doing matrix multiplication on tensor cores, not waiting for memory. Reading 4x less data doesn't help when compute is the bottleneck. We tested a standalone tensor-core prefill kernel and it was 3-10x slower than FlashInfer at realistic lengths. Instead, we plug TQ decompression into FlashInfer's prefill as a custom data loader — FlashInfer handles tiling and pipelining, we just provide decompressed values.

Why are the codebooks Gaussian?

The fast Hadamard transform makes post-rotation KV values converge to a Gaussian distribution (central limit theorem). We verified this empirically — on Qwen3.5, the excess kurtosis is -0.07 (nearly perfect Gaussian). Lloyd's algorithm on Gaussian data gives fixed optimal centroids, so no per-model calibration is needed. The calibration infrastructure exists for models that might deviate, but Qwen3.5 doesn't need it.

Performance

On RTX 5090 (24GB) with Qwen3.5-0.8B, TQ4:

Metric Value
Decode throughput 385 tok/s (CUDAGraph)
Decode bandwidth 771 GB/s (43% of peak)
Prefill throughput 15K+ tok/s (FlashInfer inline dequant)
KV cache compression 4x (TQ4) to 8x (TQ2)
Max context tested 250K tokens (needle-in-haystack passing)
Quality (TQ4) cosine similarity > 0.999 vs bf16

Quick Start

pip install -e .
from turbokv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")

# Compress
k_packed, k_norms = codec.compress_k(key_vectors)   # bf16 -> 4-bit packed
v_packed, v_norms = codec.compress_v(value_vectors)

# Decompress
k_recon = codec.decompress_k(k_packed, k_norms)     # 4-bit packed -> bf16

For serving with vLLM, see the vllm-fork README.

Project Structure

turbokv/
    codec.py                 # TurboKVCodec API
    core.py                  # Codebook, rotation, bit packing primitives
    calibrate_online.py      # Lloyd's per-layer optimization
    calibrate_static.py      # Scale/shift calibration
    page_metadata.py         # CUDAGraph-safe page metadata (CUDA + Triton)
    kernels/
        triton_kernels.py    # Triton compress/decompress (2-8 bit)
        cuda/
            include/turbokv/   # Header-only C++ library
            standalone_kernels/   # Fused decode kernels (warpspec, wmma)
shims/
    flashinfer/              # Forwarding headers for flashinfer-fork
    vllm/                    # Forwarding imports for vllm-fork
scripts/
    calibrate_model.py       # Generate per-layer calibration from model weights
    gen_codebooks.py         # Pre-compute codebook tables
tests/
    test_codec.py            # Roundtrip, rotation orthogonality
    test_kernels.py          # CUDA + Triton kernel correctness

Supported Configurations

Bit Width Centroids Compression Use Case
TQ2 4 8x Maximum compression, slight quality trade-off
TQ3 8 5.3x Balanced
TQ4 16 4x Recommended default, negligible quality loss

Supported head dimensions: 64, 128, 256, 512 (zero-padded to next power of 2 for WHT).

License

Apache 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tqkv-0.1.0.tar.gz (61.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tqkv-0.1.0-py3-none-any.whl (58.9 kB view details)

Uploaded Python 3

File details

Details for the file tqkv-0.1.0.tar.gz.

File metadata

  • Download URL: tqkv-0.1.0.tar.gz
  • Upload date:
  • Size: 61.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for tqkv-0.1.0.tar.gz
Algorithm Hash digest
SHA256 0a4e33bd568ed0093e7eed426b24d7d1b26958734ee6aee764170fb7978e0e46
MD5 5e3fbf334af4573e249c43e4ef64dfc7
BLAKE2b-256 84681526722b22b5d8616d8df3c74854c1b8a107a90492f8e55f0f5261f18c01

See more details on using hashes here.

File details

Details for the file tqkv-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tqkv-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 58.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for tqkv-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a74fdea2533ff5ee93efb113e2cf6427edad17cfabbd41424e2ecf6677afdce7
MD5 6302d5b6d86a09467128ecc87c3054a1
BLAKE2b-256 82904c33f69c3f82fc4e737bc29565063dee10b170e6a28960834b0138e85b43

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page