aither-kvcache
Near-optimal KV cache quantization for LLM inference. Implements the TurboQuant algorithm from Zandieh et al. (arXiv:2504.19874).
Compresses KV cache vectors to 2-4 bits per value with MSE within 2.7x of the information-theoretic lower bound. No calibration data. No retraining. Works on streaming tokens.
Installation
```shell
pip install aither-kvcache           # core library
pip install aither-kvcache[vllm]     # + vLLM plugin (v0.15+)
pip install aither-kvcache[triton]   # + fused GPU kernels
pip install aither-kvcache[all]      # everything
```
Quick Start
```python
from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4, device="cuda")
packed, norms = tq.encode(kv_vectors)  # [..., 128] float16 -> [..., 64] uint8 + [...] f32
decoded = tq.decode(packed, norms)     # [..., 64] uint8 + [...] f32 -> [..., 128] float16
```
vLLM Integration
Works with vLLM v0.15+ via the official plugin system. No monkey-patching.
```shell
pip install aither-kvcache[vllm]
vllm serve your-model --attention-backend CUSTOM
```
The plugin auto-registers at startup in all vLLM processes (API server + engine workers) via Python entry points. It provides:
- TurboQuantBackend: registered as the CUSTOM attention backend
- TurboQuantImpl: fused TQ decode (single-token) + standard Triton prefill (multi-token)
- TQGPUCache: GPU-resident TQ-compressed KV storage with DDR5 cold tier (spill/warm)
- ColdTierCache: Phase 1 fallback, async background GPU-to-CPU TQ encode
Decode reads directly from TQ-compressed GPU storage — no decompression buffer. 3.8x compression vs FP16 at 4-bit, up to 7.1x at 2-bit (1.9x vs FP8 at 4-bit).
Environment variables:

```shell
AITHER_TQ_BITS=4   # 2, 3, or 4 (default: 4)
AITHER_TQ_FUSED=1  # 1 = fused decode (default), 0 = standard fallback
```

Or register manually in your own code:

```python
from turboquant.vllm import register
register()
```
Where This Fits
Custom inference loop
If you manage your own KV cache, drop encode() where you write and decode() where you read:
```python
from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4, device="cuda")

# Write to cache: compress
packed, norms = tq.encode(key_proj)  # [batch, heads, 128] -> [batch, heads, 64] uint8

# Read from cache: decompress
key_restored = tq.decode(packed, norms)  # -> [batch, heads, 128] float16
```
Paged KV cache
Works with block-structured caches (like vLLM's). Handles arbitrary batch dimensions:
```python
# Compress a block of 16 tokens across 8 heads
block = cache[block_idx]             # [16, 8, 128]
packed, norms = tq.encode(block)     # [16, 8, 64] uint8 + [16, 8] f32
restored = tq.decode(packed, norms)  # [16, 8, 128]
```
Zero-buffer fused attention
Compute attention directly from compressed data without ever decompressing:
```python
from turboquant.fused_attention import TQPagedAttention

attn = TQPagedAttention(tq, num_query_heads=32)
output = attn.forward(
    query, k_packed, k_norms, v_packed, v_norms,
    block_tables, context_lens,
)
```
The math: rotate the query forward once, dot-product in the rotated domain against codebook-decoded values, accumulate weighted values in the rotated domain, rotate back once. Two matrix multiplies total regardless of context length.
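The single-rotation trick works because orthogonal matrices preserve inner products: for any orthogonal R, ⟨Rq, Rk⟩ = ⟨q, k⟩, so a query rotated once can be scored directly against rotated-domain codebook values. A minimal 2-D sketch of that identity (the library applies the same fact in head_dim dimensions):

```python
import math

def rotation(theta):
    # 2-D rotation matrix, the smallest orthogonal example
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

def rotate(R, v):
    return [sum(R[i][j] * v[j] for j in range(len(v))) for i in range(len(R))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

R = rotation(0.7)
q, k = [1.0, 2.0], [3.0, -1.0]
print(dot(q, k), dot(rotate(R, q), rotate(R, k)))  # both ~1.0
```

Because the identity holds for every key, the query is rotated once up front and the attention scores come out unchanged, regardless of context length.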
Uses fused Triton kernels on GPU (Ampere through Blackwell). Falls back to PyTorch reference on CPU.
On Blackwell (SM_120) GPUs, set AITHER_TQ_FORCE_TRITON=1; validated on an RTX 5090 at 26 tok/s.
Research / benchmarking
```python
from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4)
print(tq.validate(num_vectors=50000))
```

```shell
python -m turboquant.bench
```
Compression Ratios
For head_dim=128:
| Bits | Bytes/vector | vs FP16 | vs FP8 |
|---|---|---|---|
| 4 | 68 | 3.8x | 1.9x |
| 3 | 52 | 4.9x | 2.5x |
| 2 | 36 | 7.1x | 3.6x |
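The table follows from simple arithmetic: ceil(head_dim × bits / 8) packed bytes plus a 4-byte float32 norm per vector, compared against 2 bytes (FP16) or 1 byte (FP8) per value. A quick check:

```python
def bytes_per_vector(head_dim: int, bits: int) -> int:
    packed = (head_dim * bits + 7) // 8  # bit-packed quantization indices
    return packed + 4                    # plus one float32 norm

for bits in (4, 3, 2):
    b = bytes_per_vector(128, bits)
    vs_fp16 = round(2 * 128 / b, 1)  # FP16: 2 bytes per value
    vs_fp8 = round(128 / b, 1)       # FP8: 1 byte per value
    print(f"{bits}-bit: {b} B/vector, {vs_fp16}x vs FP16, {vs_fp8}x vs FP8")
```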
Validated MSE
| Bits | MSE | Theory Lower | Theory Upper | Ratio to LB |
|---|---|---|---|---|
| 4 | 0.0095 | 0.0039 | 0.0184 | 2.4x |
| 3 | 0.0345 | 0.0156 | 0.0736 | 2.2x |
| 2 | 0.1175 | 0.0625 | 0.2945 | 1.9x |
Algorithm
1. Normalize: extract the L2 norm and project onto the unit sphere
2. Rotate: multiply by a fixed random orthogonal matrix (data-oblivious); each rotated coordinate becomes ~N(0, 1/d)
3. Quantize: map each coordinate to the nearest entry of a precomputed Lloyd-Max codebook
4. Pack: bit-pack the quantization indices into uint8 bytes
5. Store: packed bytes + a float32 norm

Decoding reverses steps 5 through 1: unpack, look up codebook values, apply the inverse (transpose) rotation, and rescale by the stored norm.
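A toy end-to-end sketch of these steps in 2-D, using a hypothetical 4-level (2-bit) codebook with hand-picked levels; the real library uses a d-dimensional random rotation and Lloyd-Max-optimal levels, so this only illustrates the data flow:

```python
import math

# Hypothetical 2-bit codebook: four hand-picked levels standing in for
# the Lloyd-Max-optimal ones the library precomputes.
CODEBOOK = [-1.5, -0.5, 0.5, 1.5]

theta = 0.7  # fixed "random" rotation angle for this 2-D illustration
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

def encode(x):
    norm = math.sqrt(sum(v * v for v in x))          # 1. extract L2 norm,
    unit = [v / norm for v in x]                     #    project to unit sphere
    rot = [sum(R[i][j] * unit[j] for j in range(2))  # 2. rotate
           for i in range(2)]
    idx = [min(range(4), key=lambda c: abs(r - CODEBOOK[c]))
           for r in rot]                             # 3. quantize per coordinate
    packed = idx[0] | (idx[1] << 2)                  # 4. pack two 2-bit indices
    return packed, norm                              # 5. store bytes + norm

def decode(packed, norm):
    rot = [CODEBOOK[packed & 3], CODEBOOK[(packed >> 2) & 3]]
    # An orthogonal matrix's transpose is its inverse, so R^T undoes the rotation
    unit = [sum(R[j][i] * rot[j] for j in range(2)) for i in range(2)]
    return [norm * v for v in unit]
```

Round-tripping a vector recovers the norm exactly and the direction only up to the codebook's per-coordinate error; at 4 levels the reconstruction is deliberately coarse.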
API Reference
```python
class TurboQuant:
    def __init__(self, head_dim=128, bits=4, seed=42, device="cuda", ...)
    def encode(self, x: Tensor) -> Tuple[Tensor, Tensor]
    def decode(self, packed: Tensor, norms: Tensor) -> Tensor
    def validate(self, num_vectors=10000) -> dict
    def benchmark(self, num_vectors=32768) -> dict
    def compression_ratio(self) -> float
    def memory_report(self, seq_len, num_layers=32, num_kv_heads=8) -> dict

class TQPagedAttention:
    def __init__(self, tq: TurboQuant, num_query_heads: int)
    def forward(self, query, k_packed, k_norms, v_packed, v_norms,
                block_tables, context_lens, block_size=16) -> Tensor
```
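For a rough sense of what memory_report summarizes (the exact keys it returns aren't documented here, so treat this as back-of-the-envelope arithmetic rather than the library's API): a KV cache stores one key vector and one value vector per token, per layer, per KV head.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, bytes_per_vector=68):
    # One key vector and one value vector per (token, layer, kv head)
    vectors = 2 * seq_len * num_layers * num_kv_heads
    return vectors * bytes_per_vector

seq_len = 32_768
fp16 = kv_cache_bytes(seq_len, bytes_per_vector=2 * 128)  # 256 B/vector in FP16
tq4 = kv_cache_bytes(seq_len, bytes_per_vector=68)        # 4-bit TurboQuant
print(f"FP16: {fp16 / 2**30:.2f} GiB, TQ 4-bit: {tq4 / 2**30:.2f} GiB")
# -> FP16: 4.00 GiB, TQ 4-bit: 1.06 GiB
```

At these default-like dimensions, a 32K-token context drops from 4 GiB of FP16 KV cache to about 1.06 GiB, matching the 3.8x figure in the compression table.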
Reference
```bibtex
@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}
```
License
CC BY 4.0