Near-optimal KV cache quantization for LLM inference (arXiv:2504.19874)

aither-kvcache

Near-optimal KV cache quantization for LLM inference. Implements the TurboQuant algorithm from Zandieh et al. (arXiv:2504.19874).

Compresses KV cache vectors to 2-4 bits per value with MSE within 2.7x of the information-theoretic lower bound. No calibration data. No retraining. Works on streaming tokens.

Installation

pip install aither-kvcache            # core library
pip install aither-kvcache[vllm]      # + vLLM plugin (v0.15+)
pip install aither-kvcache[triton]    # + fused GPU kernels
pip install aither-kvcache[all]       # everything

Quick Start

from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4, device="cuda")

packed, norms = tq.encode(kv_vectors)   # [..., 128] float16 -> [..., 64] uint8 + [...] f32
decoded = tq.decode(packed, norms)       # [..., 64] uint8 + [...] f32 -> [..., 128] float16

vLLM Integration

Works with vLLM v0.15+ via the official plugin system. The plugin path requires no monkey-patching.

pip install aither-kvcache[vllm]
vllm serve your-model --attention-backend CUSTOM

The plugin auto-registers at startup in all vLLM processes (API server + engine workers) via Python entry points. It provides:

  • TurboQuantBackend: registered as the CUSTOM attention backend
  • TurboQuantImpl: fused TQ decode (single-token) + standard Triton prefill (multi-token)
  • TQGPUCache: GPU-resident TQ-compressed KV storage with DDR5 cold tier (spill/warm)
  • ColdTierCache: Phase 1 fallback — async background GPU-to-CPU TQ encode
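To check that the plugin is visible to Python before starting the server, it can help to list what is registered under vLLM's plugin entry-point group. The sketch below uses only the standard library. The group name `vllm.general_plugins` is the one vLLM documents for general plugins; that aither-kvcache registers under it is an assumption, and `list_plugins` is a hypothetical helper name.

```python
from importlib.metadata import entry_points

def list_plugins(group="vllm.general_plugins"):
    """Return sorted (name, target) pairs for every entry point in `group`."""
    eps = entry_points()
    # Newer Pythons expose .select(); 3.8/3.9 return a plain dict keyed by group.
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)

for name, target in list_plugins():
    print(f"{name} -> {target}")
```

An empty listing would suggest the `[vllm]` extra is not installed in the environment that launches vLLM.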

Decode reads directly from TQ-compressed GPU storage, with no decompression buffer. For head_dim=128 this gives 3.8x compression vs FP16 at 4-bit (1.9x vs FP8) and up to 7.1x vs FP16 at 2-bit.

# Env vars:
AITHER_TQ_BITS=4              # 2, 3, or 4 (default: 4)
AITHER_TQ_FUSED=1             # 1 = fused decode, 0 = standard fallback
AITHER_TQ_EAGER=0             # 0 = torch.compile+CUDA graphs (recommended)
AITHER_TQ_FORCE_TRITON=1      # Required on Blackwell (SM_100+)
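
Because the plugin loads in every vLLM process, these variables need to be set in the environment that launches the server. The following is a minimal sketch that validates the values against the ranges listed above and builds such an environment. `tq_env` is a hypothetical helper, not part of the package.

```python
import os

def tq_env(bits=4, fused=True, eager=False, force_triton=False, base=None):
    """Return a copy of the environment with the AITHER_TQ_* variables set."""
    if bits not in (2, 3, 4):
        raise ValueError(f"AITHER_TQ_BITS must be 2, 3, or 4, got {bits!r}")
    env = dict(os.environ if base is None else base)
    env["AITHER_TQ_BITS"] = str(bits)
    env["AITHER_TQ_FUSED"] = "1" if fused else "0"
    env["AITHER_TQ_EAGER"] = "1" if eager else "0"
    if force_triton:  # listed above as required on Blackwell GPUs
        env["AITHER_TQ_FORCE_TRITON"] = "1"
    return env

# Usage (not executed here): pass the result to the server launch, e.g.
# subprocess.run(["vllm", "serve", "your-model", "--attention-backend", "CUSTOM"],
#                env=tq_env(bits=4, force_triton=True))
```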

Validated: 224 tok/s decode at 5 concurrent on RTX 5090 (Blackwell SM_120), 288 tok/s peak at 10 concurrent.

Hook-based integration (v0.9.1+)

For maximum performance with torch.compile + CUDA graphs, use the hook-based approach instead of the custom backend. Unlike the plugin path, it monkey-patches TritonAttentionImpl.forward() to intercept encode/decode. No custom backend is registered, which avoids Inductor corruption bugs:

from turboquant.vllm.hooks import apply_tq_hooks

# Call AFTER vLLM model is loaded:
apply_tq_hooks()

The hook path merges encode + fused attention into a single @torch.compiler.disable call per layer, eliminating redundant graph breaks and CPU-GPU synchronization. Measured: 40 tok/s single-request decode on RTX 5090, up from 11 tok/s with separate encode/decode calls.

# Or register the plugin-based backend:
from turboquant.vllm import register
register()

Where This Fits

Custom inference loop

If you manage your own KV cache, drop encode() where you write and decode() where you read:

from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4, device="cuda")

# Write to cache: compress
packed, norms = tq.encode(key_proj)       # [batch, heads, 128] -> [batch, heads, 64] uint8

# Read from cache: decompress
key_restored = tq.decode(packed, norms)   # -> [batch, heads, 128] float16

Paged KV cache

Works with block-structured caches (like vLLM's). Handles arbitrary batch dimensions:

# Compress a block of 16 tokens across 8 heads
block = cache[block_idx]                   # [16, 8, 128]
packed, norms = tq.encode(block)           # [16, 8, 64] uint8 + [16, 8] f32
restored = tq.decode(packed, norms)        # [16, 8, 128]
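
For readers new to paged caches, the sketch below shows the generic block-table lookup such caches rely on: a token's logical position splits into a block index and an in-block offset, and the block table maps the logical block to a physical one. This is the general technique, not this library's internals; `locate` and the table contents are illustrative.

```python
def locate(block_table, position, block_size=16):
    """Map a logical token position to (physical_block, offset_in_block)."""
    if position < 0:
        raise IndexError("position must be non-negative")
    logical_block, offset = divmod(position, block_size)
    if logical_block >= len(block_table):
        raise IndexError(f"position {position} is beyond the allocated blocks")
    return block_table[logical_block], offset

# One sequence whose three logical blocks live in physical blocks 7, 2 and 9.
block_table = [7, 2, 9]
print(locate(block_table, 0))    # (7, 0)
print(locate(block_table, 17))   # (2, 1)
print(locate(block_table, 47))   # (9, 15)
```

With the shapes shown above, each physical block would hold a [16, 8, 64] packed tensor and a [16, 8] tensor of norms.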

Zero-buffer fused attention

Compute attention directly from compressed data without ever decompressing:

from turboquant.fused_attention import TQPagedAttention

attn = TQPagedAttention(tq, num_query_heads=32)
output = attn.forward(
    query, k_packed, k_norms, v_packed, v_norms,
    block_tables, context_lens,
)

The math: rotate the query forward once, dot-product in the rotated domain against codebook-decoded values, accumulate weighted values in the rotated domain, rotate back once. Two matrix multiplies total regardless of context length.
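
The identity behind this is that an orthogonal rotation preserves dot products and commutes with weighted sums. The pure-Python sketch below checks it numerically on a toy example. It is not the library's kernel: quantization and per-token norms are left out, and the random vectors only stand in for codebook-decoded keys and values that are still in the rotated domain.

```python
import math
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def random_rotation(d, rng):
    """Random orthogonal matrix: Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows

def attention(q, keys, values):
    """Single-query softmax attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(q))]

rng = random.Random(0)
d, n_tokens = 8, 5
R = random_rotation(d, rng)
Rt = transpose(R)
q = [rng.gauss(0, 1) for _ in range(d)]
# Stand-ins for codebook-decoded keys/values (still in the rotated domain).
keys_rot = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]
vals_rot = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]

# Fused path: rotate the query once, attend in the rotated domain, rotate back once.
fused = matvec(Rt, attention(matvec(R, q), keys_rot, vals_rot))

# Reference path: rotate every cached vector back first, then attend.
keys = [matvec(Rt, k) for k in keys_rot]
vals = [matvec(Rt, v) for v in vals_rot]
reference = attention(q, keys, vals)

max_diff = max(abs(a - b) for a, b in zip(fused, reference))
print(f"max abs difference: {max_diff:.2e}")
```

The reference path performs one rotation per cached vector, while the fused path performs two in total, which is why the rotation cost does not grow with context length.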

Uses fused Triton kernels on GPU (Ampere through Blackwell). Falls back to PyTorch reference on CPU. Set AITHER_TQ_FORCE_TRITON=1 on Blackwell (SM_120) GPUs -- validated on RTX 5090 at 26 tok/s.

Research / benchmarking

from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4)
print(tq.validate(num_vectors=50000))

Or run the bundled benchmark from the shell:

python -m turboquant.bench

Compression Ratios

For head_dim=128:

Bits   Bytes/vector   vs FP16   vs FP8
4      68             3.8x      1.9x
3      52             4.9x      2.5x
2      36             7.1x      3.6x
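
The numbers in the table follow from the storage layout described under Algorithm: packed indices plus one float32 norm per vector. A short sketch of that arithmetic, with hypothetical helper names:

```python
import math

def bytes_per_vector(head_dim, bits, norm_bytes=4):
    """Packed indices plus one float32 norm per vector."""
    return math.ceil(head_dim * bits / 8) + norm_bytes

def ratio(head_dim, bits, baseline_bytes_per_value):
    """Compression ratio against an uncompressed baseline (2 = FP16, 1 = FP8)."""
    return head_dim * baseline_bytes_per_value / bytes_per_vector(head_dim, bits)

for bits in (4, 3, 2):
    size = bytes_per_vector(128, bits)
    print(bits, size, f"{ratio(128, bits, 2):.1f}x", f"{ratio(128, bits, 1):.1f}x")
```

The fixed 4-byte norm is why the ratios fall short of the ideal 16/bits, and why the gap widens for smaller head dimensions.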

Validated MSE

Bits   MSE      Theory Lower   Theory Upper   Ratio to LB
4      0.0095   0.0039         0.0184         2.4x
3      0.0345   0.0156         0.0736         2.2x
2      0.1175   0.0625         0.2945         1.9x
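
The Theory Lower column appears to be 4^(-bits), the MSE lower bound for unit-norm vectors stated in the TurboQuant paper, and the last column is the measured MSE divided by that bound. A quick check using the measured values from the table:

```python
measured = {4: 0.0095, 3: 0.0345, 2: 0.1175}  # MSE values copied from the table

def lower_bound(bits):
    """4**-bits, which reproduces the Theory Lower column."""
    return 4.0 ** -bits

for bits, mse in measured.items():
    lb = lower_bound(bits)
    print(f"{bits}-bit: lower bound {lb:.4f}, ratio {mse / lb:.1f}x")
```

All three ratios stay under the 2.7x figure quoted at the top of this page.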

Algorithm

  1. Normalize: extract L2 norm, project onto unit sphere
  2. Rotate: multiply by a fixed random orthogonal matrix (data-oblivious). Makes each coordinate ~N(0, 1/d).
  3. Quantize: each coordinate via precomputed Lloyd-Max codebook
  4. Pack: indices into uint8 bytes
  5. Store: packed bytes + float32 norm

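Decoding reverses steps 4-1: unpack the indices, look up the codebook values, apply the inverse rotation, and rescale by the stored norm.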
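
The steps above can be sketched in pure Python. This is an illustration under stated assumptions, not the package's implementation: the rotation comes from Gram-Schmidt on Gaussian rows, the codebook from Lloyd-Max iterations on the exact N(0, 1/d) density, and packing uses a single big integer. All function names are hypothetical.

```python
import math
import random

def random_rotation(d, rng):
    """Random orthogonal matrix: Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows

def lloyd_max_gaussian(bits, sigma, iters=100):
    """Lloyd-Max centroids for N(0, sigma^2), using the exact Gaussian density."""
    pdf = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    levels = 2 ** bits
    c = [-2.0 + 4.0 * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        edges = [-math.inf] + [(a + b) / 2 for a, b in zip(c, c[1:])] + [math.inf]
        # Each centroid becomes the conditional mean of its cell.
        c = [(pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo)) for lo, hi in zip(edges, edges[1:])]
    return [sigma * x for x in c]

def encode(x, rot, codebook, bits):
    norm = math.sqrt(sum(a * a for a in x))                     # 1. normalize
    unit = [a / norm for a in x] if norm > 0 else list(x)
    y = [sum(r * u for r, u in zip(row, unit)) for row in rot]  # 2. rotate
    idx = [min(range(len(codebook)), key=lambda j: abs(codebook[j] - v))
           for v in y]                                          # 3. quantize
    acc = 0
    for i, j in enumerate(idx):                                 # 4. pack
        acc |= j << (bits * i)
    packed = acc.to_bytes(math.ceil(len(x) * bits / 8), "little")
    return packed, norm                                         # 5. store

def decode(packed, norm, rot, codebook, bits):
    d = len(rot)
    acc = int.from_bytes(packed, "little")
    mask = (1 << bits) - 1
    y = [codebook[(acc >> (bits * i)) & mask] for i in range(d)]
    # The inverse rotation is the transpose because rot is orthogonal.
    return [norm * sum(rot[k][i] * y[k] for k in range(d)) for i in range(d)]

rng = random.Random(0)
d, bits = 32, 4
rot = random_rotation(d, rng)
codebook = lloyd_max_gaussian(bits, sigma=1.0 / math.sqrt(d))
errors = []
for _ in range(20):
    x = [rng.gauss(0.0, 3.0) for _ in range(d)]
    packed, norm = encode(x, rot, codebook, bits)
    x_hat = decode(packed, norm, rot, codebook, bits)
    errors.append(sum((a - b) ** 2 for a, b in zip(x, x_hat)) / norm ** 2)
mean_rel_mse = sum(errors) / len(errors)
print(len(packed), "bytes of indices per vector; mean relative MSE", round(mean_rel_mse, 4))
```

Because neither the rotation nor the codebook depends on the data, both can be built once from a seed and reused for every token, which is what makes the scheme suitable for streaming.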

API Reference

class TurboQuant:
    def __init__(self, head_dim=128, bits=4, seed=42, device="cuda", ...)
    def encode(self, x: Tensor) -> Tuple[Tensor, Tensor]
    def decode(self, packed: Tensor, norms: Tensor) -> Tensor
    def validate(self, num_vectors=10000) -> dict
    def benchmark(self, num_vectors=32768) -> dict
    def compression_ratio(self) -> float
    def memory_report(self, seq_len, num_layers=32, num_kv_heads=8) -> dict

class TQPagedAttention:
    def __init__(self, tq: TurboQuant, num_query_heads: int)
    def forward(self, query, k_packed, k_norms, v_packed, v_norms,
                block_tables, context_lens, block_size=16) -> Tensor
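
The defaults of memory_report (num_layers=32, num_kv_heads=8) suggest a per-sequence estimate. Its actual return value is not documented here, so the following is only a back-of-the-envelope sketch of the same kind of calculation, using the bytes-per-vector layout from the Compression Ratios section. `kv_cache_bytes` is a hypothetical name.

```python
import math

def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128, bits=None):
    """Bytes for keys plus values of one sequence. bits=None means FP16."""
    if bits is None:
        per_vector = head_dim * 2                         # FP16: 2 bytes per value
    else:
        per_vector = math.ceil(head_dim * bits / 8) + 4   # packed indices + float32 norm
    return seq_len * num_layers * num_kv_heads * 2 * per_vector  # 2 = K and V

fp16 = kv_cache_bytes(32768)
tq4 = kv_cache_bytes(32768, bits=4)
print(f"FP16: {fp16 / 2**30:.2f} GiB, 4-bit: {tq4 / 2**30:.2f} GiB")
```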

Reference

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}

License

CC BY 4.0
