# aither-kvcache

Near-optimal KV cache quantization for LLM inference. Implements the TurboQuant algorithm from Zandieh et al. (arXiv:2504.19874).
Compresses KV cache vectors to 2-4 bits per value with MSE within 2.7x of the information-theoretic lower bound. No calibration data. No retraining. Works on streaming tokens.
## Installation

```shell
pip install aither-kvcache
```
## What This Does
Every transformer layer stores key-value vectors in a KV cache during inference. At FP16, each 128-dim vector costs 256 bytes. At FP8, 128 bytes. This library compresses them to 36-68 bytes (2-4 bit) with near-zero quality loss.
| Bits | Bytes per vector | vs FP16 | vs FP8 |
|---|---|---|---|
| 4 | 68 | 3.8x | 1.9x |
| 3 | 52 | 4.9x | 2.5x |
| 2 | 36 | 7.1x | 3.6x |
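The byte counts in the table follow directly from the layout: `head_dim * bits / 8` packed bytes plus one 4-byte float32 norm per vector. A quick sanity check (plain arithmetic, not library code):

```python
def bytes_per_vector(head_dim: int, bits: int) -> int:
    packed = head_dim * bits // 8  # quantization indices, bit-packed
    norm = 4                       # one float32 norm per vector
    return packed + norm

for bits in (4, 3, 2):
    n = bytes_per_vector(128, bits)
    print(f"{bits}-bit: {n} bytes, {256 / n:.1f}x vs FP16, {128 / n:.1f}x vs FP8")
```

Running this reproduces every row of the table above.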
## Where This Fits

### In a custom inference loop

If you manage your own KV cache, drop `encode()` where you write to the cache and `decode()` where you read from it:
```python
from turboquant import TurboQuant
import torch

tq = TurboQuant(head_dim=128, bits=4, device="cuda")

# Your model produces key/value projections each step:
# key_proj: [batch, num_kv_heads, head_dim]

# BEFORE: store raw FP16
# kv_cache[layer][block, pos] = key_proj  # 256 bytes per vector

# AFTER: store compressed
packed, norms = tq.encode(key_proj)  # 68 bytes per vector
# Store packed (uint8) and norms (float32) instead of the raw tensor

# When attention needs the keys:
key_restored = tq.decode(packed, norms)  # back to FP16
# Use key_restored in your attention computation
```
### In a paged KV cache (like vLLM's block manager)

Paged caches store KV data in fixed-size blocks indexed by a block table. The library handles arbitrary batch dimensions, so it works with paged layouts:
```python
tq = TurboQuant(head_dim=128, bits=4, device="cuda")

# Block-structured cache: [num_blocks, block_size, num_kv_heads, head_dim]
# Compress the entire cache or individual blocks.

# Compress a block of 16 tokens across 8 heads
block = cache[block_idx]             # [16, 8, 128] FP16
packed, norms = tq.encode(block)     # [16, 8, 64] uint8 + [16, 8] f32

# Decompress when needed for attention
restored = tq.decode(packed, norms)  # [16, 8, 128] FP16
```
### Zero-buffer attention (the real goal)

The fused attention kernel computes attention directly from compressed data without decompressing first. This is where the memory savings actually happen: with no decompression buffer, the compressed cache is the only cache.
```python
from turboquant import TurboQuant
from turboquant.fused_attention import TQPagedAttention

tq = TurboQuant(head_dim=128, bits=4, device="cuda")
attn = TQPagedAttention(tq, num_query_heads=32)

# query:        [num_seqs, num_query_heads, head_dim]      -- from the model
# k_packed:     [num_blocks, block_size, num_kv_heads, 64] -- compressed keys
# k_norms:      [num_blocks, block_size, num_kv_heads]     -- key norms
# block_tables: [num_seqs, max_blocks_per_seq]             -- maps sequences to blocks
# context_lens: [num_seqs]                                 -- tokens per sequence

output = attn.forward(
    query, k_packed, k_norms, v_packed, v_norms,
    block_tables, context_lens,
)
```
The math: rotate the query forward once (`Pi @ q`), take dot products against the codebook-decoded keys in the rotated domain, accumulate the attention-weighted values in the rotated domain, then rotate back once (`Pi^T @ acc`). That is two matrix multiplies total, regardless of context length; everything else is codebook lookups and dot products.
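The invariance that makes this work is easy to check: for an orthogonal `Pi`, dot products computed in the rotated domain equal the originals, and the accumulated result can be rotated back at the end. A NumPy sketch (illustrative only, not the library's kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Random orthogonal matrix Pi (QR decomposition of a Gaussian matrix)
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)
k = rng.standard_normal(d)
v = rng.standard_normal(d)

# Attention scores are invariant under a shared orthogonal rotation:
assert np.allclose(q @ k, (Pi @ q) @ (Pi @ k))

# Accumulate weighted values in the rotated domain, rotate back once:
w = 0.37  # some attention weight
acc = w * (Pi @ v)
assert np.allclose(Pi.T @ acc, w * v)
```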
This is a PyTorch reference implementation (correct, tested, works on CPU and GPU). A production Triton kernel is the next step.
### As a research tool

Validate the paper's theoretical bounds on your own data:
```python
tq = TurboQuant(head_dim=128, bits=4)
result = tq.validate(num_vectors=50000)
print(f"MSE: {result['mse']:.6f}")
print(f"Theory range: [{result['mse_theory_lower']:.6f}, {result['mse_theory_upper']:.6f}]")
print(f"Ratio to lower bound: {result['mse_ratio_to_lower']:.2f}x (paper claims <= 2.7x)")
```
Compare compression vs quality at different bit widths:

```shell
python -m turboquant.bench
```
Compute custom codebooks for non-standard dimensions:

```python
from turboquant.codebook import compute_codebook_scipy

centroids, boundaries, mse = compute_codebook_scipy(d=256, bits=3)
```
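For intuition about what such a codebook computes, here is a toy Lloyd-Max quantizer for a standard normal source in plain NumPy (a sketch only; `compute_codebook_scipy` above is the library's actual routine):

```python
import numpy as np

def lloyd_max_gaussian(bits: int, n_samples: int = 200_000,
                       iters: int = 50, seed: int = 0):
    """Toy Lloyd-Max scalar quantizer for samples from N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    k = 2 ** bits
    # Initialize centroids at evenly spaced empirical quantiles
    centroids = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Decision boundaries sit midway between adjacent centroids
        boundaries = (centroids[:-1] + centroids[1:]) / 2
        idx = np.searchsorted(boundaries, x)
        # Each centroid moves to the mean of its cell
        centroids = np.array([x[idx == j].mean() for j in range(k)])
    mse = np.mean((x - centroids[np.searchsorted(boundaries, x)]) ** 2)
    return centroids, mse

centroids, mse = lloyd_max_gaussian(bits=2)
# The classic 2-bit Lloyd-Max levels for N(0, 1) are about +/-0.4528 and +/-1.510
```

Note that the 2-bit distortion this converges to (about 0.1175) matches the 2-bit MSE row in the table below.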
## What This Does NOT Do (Yet)

- Does not integrate with vLLM/llama.cpp/TensorRT-LLM out of the box. The encode/decode primitives are framework-agnostic, but framework integration requires hooking into the specific cache-allocation and attention paths. We have a working vLLM v0.15 integration in our production stack (280K tokens on an RTX 5090), but it is not yet portable.
- The fused attention is a PyTorch reference, not a Triton kernel. It is correct (16/16 tests pass, validated against decompress+FlashAttention) but not optimized for production throughput. The Triton kernel is next.
- Does not handle KV cache management (block allocation, eviction, prefix caching). It compresses and decompresses vectors; your inference framework handles the rest.
## Algorithm

1. Normalize: extract the L2 norm and project onto the unit sphere.
2. Rotate: multiply by a fixed random orthogonal matrix Pi (data-oblivious, generated once from a seed). This makes each coordinate approximately N(0, 1/d).
3. Quantize: quantize each coordinate independently with a precomputed Lloyd-Max codebook.
4. Pack: pack the indices into uint8 bytes (2 values/byte at 4-bit, 4 values/byte at 2-bit).
5. Store: keep the packed bytes plus the float32 norm.

Decoding reverses steps 4-1: unpack the indices, look up centroids, apply Pi^T, and rescale by the stored norm.
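The steps above can be sketched end-to-end in plain PyTorch. This is illustrative only: the helper names are hypothetical, and the codebook here uses simple Gaussian quantiles rather than the library's Lloyd-Max tables:

```python
import torch

d, bits = 128, 4
k = 2 ** bits
gen = torch.Generator().manual_seed(42)

# Fixed random orthogonal matrix Pi (QR of a Gaussian matrix), generated once
Pi, _ = torch.linalg.qr(torch.randn(d, d, generator=gen))

# Toy codebook: quantiles of N(0, 1/d) stand in for Lloyd-Max centroids
centroids = torch.distributions.Normal(0.0, d ** -0.5).icdf((torch.arange(k) + 0.5) / k)

def encode(x):
    norm = x.norm()                                      # 1. extract L2 norm
    r = Pi @ (x / norm)                                  # 2. rotate the unit vector
    idx = (r[:, None] - centroids).abs().argmin(dim=1)   # 3. nearest centroid
    packed = (idx[0::2] << 4) | idx[1::2]                # 4. two 4-bit indices per byte
    return packed.to(torch.uint8), norm                  # 5. packed bytes + norm

def decode(packed, norm):
    idx = torch.stack([packed >> 4, packed & 0xF], dim=1).flatten().long()
    r = centroids[idx]                                   # codebook lookup
    return norm * (Pi.T @ r)                             # un-rotate and rescale

x = torch.randn(d)
x_hat = decode(*encode(x))
# x_hat approximates x; the error shrinks as bits increase
```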
## Validated MSE
| Bits | MSE | Theory Lower | Theory Upper | Ratio to LB |
|---|---|---|---|---|
| 4 | 0.0095 | 0.0039 | 0.0184 | 2.4x |
| 3 | 0.0345 | 0.0156 | 0.0736 | 2.2x |
| 2 | 0.1175 | 0.0625 | 0.2945 | 1.9x |
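As a side observation (ours, not a claim the library makes), the theory-lower column matches the classic per-coordinate distortion-rate bound for a Gaussian source, D(b) = 2^(-2b):

```python
# Theory-lower values from the table vs the distortion-rate bound 2^(-2b)
for bits, lower in [(4, 0.0039), (3, 0.0156), (2, 0.0625)]:
    assert abs(2 ** (-2 * bits) - lower) < 1e-3, (bits, lower)
print("theory-lower column matches 2^(-2b)")
```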
## API Reference

```python
class TurboQuant:
    def __init__(self, head_dim=128, bits=4, seed=42, device="cuda", ...)
    def encode(self, x: Tensor) -> Tuple[Tensor, Tensor]  # -> (packed, norms)
    def decode(self, packed: Tensor, norms: Tensor) -> Tensor
    def validate(self, num_vectors=10000) -> dict
    def benchmark(self, num_vectors=32768) -> dict
    def compression_ratio(self) -> float
    def memory_report(self, seq_len, num_layers=32, num_kv_heads=8) -> dict

class TQPagedAttention:
    def __init__(self, tq: TurboQuant, num_query_heads: int)
    def forward(self, query, k_packed, k_norms, v_packed, v_norms,
                block_tables, context_lens, block_size=16) -> Tensor
```
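For a sense of scale, here is back-of-envelope cache sizing in the spirit of what `memory_report` covers (plain arithmetic with assumed Llama-style defaults; the method's actual output format may differ):

```python
def kv_cache_gib(seq_len: int, bytes_per_vec: int,
                 num_layers: int = 32, num_kv_heads: int = 8) -> float:
    # keys + values: one head_dim vector each, per layer, per head, per token
    total = 2 * num_layers * num_kv_heads * seq_len * bytes_per_vec
    return total / 2**30

for label, b in [("FP16", 256), ("4-bit", 68)]:
    print(f"{label}: {kv_cache_gib(128_000, b):.2f} GiB at 128K context")
```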
## Reference

```bibtex
@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}
```
## License

CC BY 4.0