Near-optimal KV cache quantization for LLM inference (arXiv:2504.19874)

aither-kvcache

Near-optimal KV cache quantization for LLM inference. Implements the TurboQuant algorithm from Zandieh et al. (arXiv:2504.19874).

Compresses KV cache vectors to 2-4 bits per value with MSE within 2.7x of the information-theoretic lower bound. No calibration data. No retraining. Works on streaming tokens.

Installation

pip install aither-kvcache            # core library
pip install aither-kvcache[vllm]      # + vLLM plugin (v0.15+)
pip install aither-kvcache[triton]    # + fused GPU kernels
pip install aither-kvcache[all]       # everything

Quick Start

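import torch
from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4, device="cuda")

# Example input: any [..., 128] float16 tensor of key or value vectors
kv_vectors = torch.randn(1024, 128, dtype=torch.float16, device="cuda")

packed, norms = tq.encode(kv_vectors)   # [..., 128] float16 -> [..., 64] uint8 + [...] f32
decoded = tq.decode(packed, norms)       # [..., 64] uint8 + [...] f32 -> [..., 128] float16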

vLLM Integration

Works with vLLM v0.15+ via the official plugin system. No monkey-patching.

pip install aither-kvcache[vllm]
VLLM_ATTENTION_BACKEND=CUSTOM vllm serve your-model

The plugin auto-registers at startup in all vLLM processes (API server + engine workers) via Python entry points. It provides:

TurboQuantBackend: registered as the CUSTOM attention backend
TurboQuantImpl: handles attention using vLLM's Triton kernels + async TQ compression
ColdTierCache: background GPU-to-CPU transfer + TQ encode on a separate thread, zero sync on the attention hot path

Every token is TQ-compressed to a CPU cold tier in the background. The cold tier exposes decompress_blocks(), intended for future block warming (rebuilding the prefix cache from compressed data).

# Or register manually in your own code:
from turboquant.vllm import register
register()
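
The cold-tier pattern described above (encode on a separate thread, no synchronization on the hot path) can be sketched with a queue and a worker thread. This is a hypothetical illustration, not the plugin's ColdTierCache: the class name ColdTierSketch, every method signature (including the one given to decompress_blocks), and the float16 stand-in codec are assumptions, and no GPU transfer is modeled.

```python
import queue
import struct
import threading


class ColdTierSketch:
    """Hypothetical illustration of a background-compressed cold tier."""

    def __init__(self, encode, decode):
        self._encode, self._decode = encode, decode
        self._pending = queue.Queue()      # unbounded, so put() never blocks
        self._store = {}
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, block_id, block):
        """Hot path: enqueue the block and return without waiting."""
        self._pending.put((block_id, block))

    def _run(self):
        while True:
            item = self._pending.get()
            if item is None:               # shutdown sentinel
                self._pending.task_done()
                break
            block_id, block = item
            compressed = self._encode(block)
            with self._lock:
                self._store[block_id] = compressed
            self._pending.task_done()

    def flush(self):
        """Wait until every submitted block has been compressed."""
        self._pending.join()

    def decompress_blocks(self, block_ids):
        with self._lock:
            return [self._decode(self._store[b]) for b in block_ids]

    def close(self):
        self._pending.put(None)
        self._worker.join()


# Stand-in codec (assumption): pack floats as little-endian float16.
def to_fp16(block):
    return struct.pack(f"<{len(block)}e", *block)


def from_fp16(raw):
    return list(struct.unpack(f"<{len(raw) // 2}e", raw))


cold = ColdTierSketch(encode=to_fp16, decode=from_fp16)
cold.submit(0, [0.5, 1.5])
cold.submit(1, [2.0, -4.0])
cold.flush()
restored = cold.decompress_blocks([1, 0])
cold.close()
print(restored)
```

The only call made from the hot path is submit(), which enqueues and returns; everything that can take time happens on the worker thread.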

Where This Fits

Custom inference loop

If you manage your own KV cache, call encode() where you write to it and decode() where you read from it:

from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4, device="cuda")

# Write to cache: compress
packed, norms = tq.encode(key_proj)       # [batch, heads, 128] -> [batch, heads, 64] uint8

# Read from cache: decompress
key_restored = tq.decode(packed, norms)   # -> [batch, heads, 128] float16

Paged KV cache

Works with block-structured caches (like vLLM's). Handles arbitrary batch dimensions:

# Compress a block of 16 tokens across 8 heads
block = cache[block_idx]                   # [16, 8, 128]
packed, norms = tq.encode(block)           # [16, 8, 64] uint8 + [16, 8] f32
restored = tq.decode(packed, norms)        # [16, 8, 128]

Zero-buffer fused attention

Compute attention directly from compressed data, without materializing a decompressed KV buffer:

from turboquant.fused_attention import TQPagedAttention

attn = TQPagedAttention(tq, num_query_heads=32)
output = attn.forward(
    query, k_packed, k_norms, v_packed, v_norms,
    block_tables, context_lens,
)

The math: rotate the query forward once, take dot products in the rotated domain against codebook-decoded keys, accumulate the weighted values in the rotated domain, then rotate the result back once. That is two rotation matrix multiplies in total, regardless of context length.

This is a PyTorch reference implementation. A production Triton kernel is next.
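
The fused path works because an orthogonal rotation preserves dot products, and a weighted sum commutes with the inverse rotation. The sketch below checks this numerically in plain Python for a single head. The random vectors stand in for codebook-decoded keys and values, per-vector norms are omitted for brevity, and all helper names are illustrative assumptions rather than part of turboquant.

```python
import math
import random


def random_orthogonal(d, rng):
    """Gram-Schmidt on Gaussian rows yields a random orthogonal matrix."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        n = math.sqrt(sum(a * a for a in v))
        if n > 1e-9:
            rows.append([a / n for a in v])
    return rows


def rotate(m, v):
    """Forward rotation: m @ v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rotate_back(m, v):
    """Inverse rotation: m.T @ v (the transpose inverts an orthogonal matrix)."""
    return [sum(m[r][c] * v[r] for r in range(len(v))) for c in range(len(v))]


def attend(q, keys, values):
    """Plain softmax attention for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(d)]


rng = random.Random(0)
d, n_tokens = 8, 5
rot = random_orthogonal(d, rng)

# Stand-ins for codebook-decoded vectors, which live in the rotated domain.
k_rot = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_tokens)]
v_rot = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_tokens)]
q = [rng.gauss(0.0, 1.0) for _ in range(d)]

# Reference path: rotate every cached vector back, then attend.
reference = attend(q, [rotate_back(rot, k) for k in k_rot],
                   [rotate_back(rot, v) for v in v_rot])

# Fused path: one forward rotation of q, one inverse rotation of the output.
fused = rotate_back(rot, attend(rotate(rot, q), k_rot, v_rot))

max_diff = max(abs(a - b) for a, b in zip(reference, fused))
print(max_diff)
```

The reference path performs one inverse rotation per cached vector, while the fused path performs two rotations in total, which is where the independence from context length comes from.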

Research / benchmarking

from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4)
print(tq.validate(num_vectors=50000))

# Or run the benchmark module from the shell:
python -m turboquant.bench

Compression Ratios

For head_dim=128, counting the packed indices plus one 4-byte float32 norm per vector:

Bits	Bytes/vector	vs FP16	vs FP8
4	68	3.8x	1.9x
3	52	4.9x	2.5x
2	36	7.1x	3.6x
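
The table entries follow from simple arithmetic, sketched below. The helper names are illustrative, and dense bit-packing (no padding per index) is an assumption that matches the 52-byte figure for 3 bits.

```python
def bytes_per_vector(head_dim: int, bits: int, norm_bytes: int = 4) -> int:
    """Packed index bytes plus one float32 norm per vector."""
    packed = (head_dim * bits + 7) // 8   # round up to whole bytes
    return packed + norm_bytes


def ratio(head_dim: int, bits: int, baseline_bytes_per_value: int) -> float:
    """Size of the uncompressed vector divided by the compressed size."""
    return head_dim * baseline_bytes_per_value / bytes_per_vector(head_dim, bits)


for bits in (4, 3, 2):
    size = bytes_per_vector(128, bits)
    vs_fp16 = round(ratio(128, bits, 2), 1)   # FP16 stores 2 bytes per value
    vs_fp8 = round(ratio(128, bits, 1), 1)    # FP8 stores 1 byte per value
    print(bits, size, vs_fp16, vs_fp8)
```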

Validated MSE

Bits	MSE	Theory Lower	Theory Upper	Ratio to LB
4	0.0095	0.0039	0.0184	2.4x
3	0.0345	0.0156	0.0736	2.2x
2	0.1175	0.0625	0.2945	1.9x
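
The Theory Lower column is consistent with 4^-b for b bits on unit-norm vectors. Taking that formula as an assumption, the last column can be recomputed from the measured MSE values in the table:

```python
# Measured MSE values copied from the table above.
measured = {4: 0.0095, 3: 0.0345, 2: 0.1175}


def lower_bound(bits: int) -> float:
    """Assumed lower bound on MSE for unit-norm vectors: 4^-bits."""
    return 4.0 ** -bits


for bits, mse in measured.items():
    lb = lower_bound(bits)
    print(f"{bits} bits: lower bound {lb:.4f}, ratio {mse / lb:.1f}x")
```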

Algorithm

1. Normalize: extract the L2 norm and project the vector onto the unit sphere.
2. Rotate: multiply by a fixed random orthogonal matrix (data-oblivious), so that each coordinate is approximately N(0, 1/d), where d = head_dim.
3. Quantize: map each coordinate to the nearest entry of a precomputed Lloyd-Max codebook.
4. Pack: pack the codebook indices into uint8 bytes.
5. Store: keep the packed bytes plus a float32 norm.

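Decoding reverses steps 4-1: unpack the indices, look up the codebook values, apply the inverse rotation, and rescale by the stored norm.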
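
The steps above can be sketched in plain Python for a single small vector. This is a toy illustration under stated assumptions, not the library's implementation: the codebook is fitted with Lloyd iterations on Gaussian samples rather than loaded from a precomputed table, the packing is hard-coded for 4 bits and an even dimension, and every function name is hypothetical.

```python
import bisect
import math
import random


def random_orthogonal(d, rng):
    """Gram-Schmidt on Gaussian rows yields a random orthogonal matrix."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        n = math.sqrt(sum(a * a for a in v))
        if n > 1e-9:
            rows.append([a / n for a in v])
    return rows


def lloyd_codebook(bits, d, rng, samples=4000, iters=25):
    """1-D Lloyd iterations on N(0, 1/d) samples (stand-in for a precomputed table)."""
    xs = sorted(rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(samples))
    k = 1 << bits
    cents = [xs[(2 * i + 1) * samples // (2 * k)] for i in range(k)]  # quantile init
    for _ in range(iters):
        edges = [(a + b) / 2 for a, b in zip(cents, cents[1:])]
        sums, counts = [0.0] * k, [0] * k
        for x in xs:
            j = bisect.bisect(edges, x)
            sums[j] += x
            counts[j] += 1
        cents = [s / c if c else m for s, c, m in zip(sums, counts, cents)]
    return cents


def encode(x, rot, cents):
    norm = math.sqrt(sum(a * a for a in x))                  # 1. normalize
    scale = norm if norm > 0 else 1.0
    unit = [a / scale for a in x]
    y = [sum(a * b for a, b in zip(row, unit)) for row in rot]  # 2. rotate
    edges = [(a + b) / 2 for a, b in zip(cents, cents[1:])]
    idx = [bisect.bisect(edges, c) for c in y]               # 3. nearest centroid
    packed = bytes(idx[i] | (idx[i + 1] << 4)                # 4. two 4-bit indices per byte
                   for i in range(0, len(idx), 2))
    return packed, norm                                      # 5. store bytes + norm


def decode(packed, norm, rot, cents):
    idx = [n for b in packed for n in (b & 0x0F, b >> 4)]    # unpack
    y = [cents[i] for i in idx]                              # codebook lookup
    d = len(y)
    unit = [sum(rot[r][c] * y[r] for r in range(d)) for c in range(d)]  # inverse rotation
    return [norm * a for a in unit]                          # rescale


rng = random.Random(0)
d, bits = 16, 4
rot = random_orthogonal(d, rng)
cents = lloyd_codebook(bits, d, rng)

x = [rng.gauss(0.0, 1.0) for _ in range(d)]
packed, norm = encode(x, rot, cents)
x_hat = decode(packed, norm, rot, cents)

rel_err = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / sum(a * a for a in x)
print(len(packed), rel_err)
```

Because the rotation and the codebook depend only on the dimension, the bit width, and the seed, nothing here needs calibration data, and each vector can be encoded as soon as it arrives.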

API Reference

class TurboQuant:
    def __init__(self, head_dim=128, bits=4, seed=42, device="cuda", ...)
    def encode(self, x: Tensor) -> Tuple[Tensor, Tensor]
    def decode(self, packed: Tensor, norms: Tensor) -> Tensor
    def validate(self, num_vectors=10000) -> dict
    def benchmark(self, num_vectors=32768) -> dict
    def compression_ratio(self) -> float
    def memory_report(self, seq_len, num_layers=32, num_kv_heads=8) -> dict

class TQPagedAttention:
    def __init__(self, tq: TurboQuant, num_query_heads: int)
    def forward(self, query, k_packed, k_norms, v_packed, v_norms,
                block_tables, context_lens, block_size=16) -> Tensor
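
For a sense of scale, the per-vector sizes can be multiplied out over a whole model. The sketch below is a back-of-envelope estimate using the defaults shown in the memory_report signature. It is not the implementation or the output format of memory_report, and the function name and dictionary keys are assumptions.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128, bits=4):
    """Estimate FP16 vs compressed KV cache size for one sequence."""
    vectors = seq_len * num_layers * num_kv_heads * 2        # keys and values
    compressed = vectors * ((head_dim * bits + 7) // 8 + 4)  # packed bytes + f32 norm
    fp16 = vectors * head_dim * 2                            # 2 bytes per value
    return {
        "fp16_bytes": fp16,
        "compressed_bytes": compressed,
        "ratio": fp16 / compressed,
    }


report = kv_cache_bytes(seq_len=8192)
print(report)
```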

Reference

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}

License

CC BY 4.0
