# aither-kvcache

Near-optimal KV cache quantization for LLM inference. Implements the TurboQuant algorithm from Zandieh et al. (arXiv:2504.19874).
Compresses KV cache vectors to 2-4 bits per value with MSE within 2.7x of the information-theoretic lower bound. No calibration data. No retraining. Works on streaming tokens.
## Installation

```shell
pip install aither-kvcache
```
## What This Does
Every transformer layer stores key-value vectors in a KV cache during inference. At FP16, each 128-dim vector costs 256 bytes. At FP8, 128 bytes. This library compresses them to 36-68 bytes (2-4 bit) with near-zero quality loss.
| Bits | Bytes per vector | vs FP16 | vs FP8 |
|---|---|---|---|
| 4 | 68 | 3.8x | 1.9x |
| 3 | 52 | 4.9x | 2.5x |
| 2 | 36 | 7.1x | 3.6x |
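The byte counts in the table follow directly from the layout: `head_dim * bits / 8` packed bytes plus one 4-byte float32 norm per vector. A quick sanity check (plain arithmetic, not library code):

```python
def bytes_per_vector(head_dim: int, bits: int) -> int:
    packed = head_dim * bits // 8  # quantization indices, bit-packed
    norm = 4                       # one float32 norm per vector
    return packed + norm

for bits in (4, 3, 2):
    n = bytes_per_vector(128, bits)
    print(f"{bits}-bit: {n} bytes, {256 / n:.1f}x vs FP16, {128 / n:.1f}x vs FP8")
```

Running this reproduces every row of the table above.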
## Where This Fits

### In a custom inference loop

If you manage your own KV cache, drop `encode()` where you write to the cache and `decode()` where you read from it:
```python
from turboquant import TurboQuant
import torch

tq = TurboQuant(head_dim=128, bits=4, device="cuda")

# Your model produces key/value projections each step:
# key_proj: [batch, num_kv_heads, head_dim]

# BEFORE: store raw FP16
# kv_cache[layer][block, pos] = key_proj  # 256 bytes per vector

# AFTER: store compressed
packed, norms = tq.encode(key_proj)  # 68 bytes per vector
# Store packed (uint8) and norms (float32) instead of the raw tensor

# When attention needs the keys:
key_restored = tq.decode(packed, norms)  # back to FP16
# Use key_restored in your attention computation
```
### In a paged KV cache (like vLLM's block manager)

Paged caches store KV data in fixed-size blocks indexed by a block table. The library handles arbitrary batch dimensions, so it works with paged layouts:
```python
tq = TurboQuant(head_dim=128, bits=4, device="cuda")

# Block-structured cache: [num_blocks, block_size, num_kv_heads, head_dim]
# Compress the entire cache or individual blocks.

# Compress a block of 16 tokens across 8 heads
block = cache[block_idx]             # [16, 8, 128] FP16
packed, norms = tq.encode(block)     # [16, 8, 64] uint8 + [16, 8] f32

# Decompress when needed for attention
restored = tq.decode(packed, norms)  # [16, 8, 128] FP16
```
### Zero-buffer attention (the real goal)

The fused attention kernel computes attention directly from compressed data without decompressing first. This is where the memory savings actually happen: with no decompression buffer, the compressed cache is the only cache.
```python
from turboquant import TurboQuant
from turboquant.fused_attention import TQPagedAttention

tq = TurboQuant(head_dim=128, bits=4, device="cuda")
attn = TQPagedAttention(tq, num_query_heads=32)

# query:        [num_seqs, num_query_heads, head_dim]      -- from the model
# k_packed:     [num_blocks, block_size, num_kv_heads, 64] -- compressed keys
# k_norms:      [num_blocks, block_size, num_kv_heads]     -- key norms
# block_tables: [num_seqs, max_blocks_per_seq]             -- maps sequences to blocks
# context_lens: [num_seqs]                                 -- tokens per sequence

output = attn.forward(
    query, k_packed, k_norms, v_packed, v_norms,
    block_tables, context_lens,
)
```
The math: rotate the query forward once (`Pi @ q`), take dot products against the codebook-decoded keys in the rotated domain, accumulate the attention-weighted values in the rotated domain, then rotate back once (`Pi^T @ acc`). That is two matrix multiplies total, regardless of context length; everything else is codebook lookups and dot products.
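The invariance that makes this work is easy to check: for an orthogonal `Pi`, dot products computed in the rotated domain equal the originals, and the accumulated result can be rotated back at the end. A NumPy sketch (illustrative only, not the library's kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Random orthogonal matrix Pi (QR decomposition of a Gaussian matrix)
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)
k = rng.standard_normal(d)
v = rng.standard_normal(d)

# Attention scores are invariant under a shared orthogonal rotation:
assert np.allclose(q @ k, (Pi @ q) @ (Pi @ k))

# Accumulate weighted values in the rotated domain, rotate back once:
w = 0.37  # some attention weight
acc = w * (Pi @ v)
assert np.allclose(Pi.T @ acc, w * v)
```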
This is a PyTorch reference implementation (correct, tested, works on CPU and GPU). A production Triton kernel is the next step.
### As a research tool

Validate the paper's theoretical bounds on your own data:
```python
tq = TurboQuant(head_dim=128, bits=4)
result = tq.validate(num_vectors=50000)
print(f"MSE: {result['mse']:.6f}")
print(f"Theory range: [{result['mse_theory_lower']:.6f}, {result['mse_theory_upper']:.6f}]")
print(f"Ratio to lower bound: {result['mse_ratio_to_lower']:.2f}x (paper claims <= 2.7x)")
```
Compare compression vs quality at different bit widths:

```shell
python -m turboquant.bench
```
Compute custom codebooks for non-standard dimensions:

```python
from turboquant.codebook import compute_codebook_scipy

centroids, boundaries, mse = compute_codebook_scipy(d=256, bits=3)
```
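For intuition about what such a codebook computes, here is a toy Lloyd-Max quantizer for a standard normal source in plain NumPy (a sketch only; `compute_codebook_scipy` above is the library's actual routine):

```python
import numpy as np

def lloyd_max_gaussian(bits: int, n_samples: int = 200_000,
                       iters: int = 50, seed: int = 0):
    """Toy Lloyd-Max scalar quantizer for samples from N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    k = 2 ** bits
    # Initialize centroids at evenly spaced empirical quantiles
    centroids = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Decision boundaries sit midway between adjacent centroids
        boundaries = (centroids[:-1] + centroids[1:]) / 2
        idx = np.searchsorted(boundaries, x)
        # Each centroid moves to the mean of its cell
        centroids = np.array([x[idx == j].mean() for j in range(k)])
    mse = np.mean((x - centroids[np.searchsorted(boundaries, x)]) ** 2)
    return centroids, mse

centroids, mse = lloyd_max_gaussian(bits=2)
# The classic 2-bit Lloyd-Max levels for N(0, 1) are about +/-0.4528 and +/-1.510
```

Note that the 2-bit distortion this converges to (about 0.1175) matches the 2-bit MSE row in the table below.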
## What This Does NOT Do (Yet)

- Does not integrate with vLLM/llama.cpp/TensorRT-LLM out of the box. The encode/decode primitives are framework-agnostic, but framework integration requires hooking into the specific cache-allocation and attention paths. We have a working vLLM v0.15 integration in our production stack (280K tokens on an RTX 5090), but it is not yet portable.
- The fused attention is a PyTorch reference, not a Triton kernel. It is correct (16/16 tests pass, validated against decompress+FlashAttention) but not optimized for production throughput. The Triton kernel is next.
- Does not handle KV cache management (block allocation, eviction, prefix caching). It compresses and decompresses vectors; your inference framework handles the rest.
## Algorithm

1. Normalize: extract the L2 norm and project onto the unit sphere.
2. Rotate: multiply by a fixed random orthogonal matrix Pi (data-oblivious, generated once from a seed). This makes each coordinate approximately N(0, 1/d).
3. Quantize: quantize each coordinate independently with a precomputed Lloyd-Max codebook.
4. Pack: pack the indices into uint8 bytes (2 values/byte at 4-bit, 4 values/byte at 2-bit).
5. Store: keep the packed bytes plus the float32 norm.

Decoding reverses steps 4-1: unpack the indices, look up centroids, apply Pi^T, and rescale by the stored norm.
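The steps above can be sketched end-to-end in plain PyTorch. This is illustrative only: the helper names are hypothetical, and the codebook here uses simple Gaussian quantiles rather than the library's Lloyd-Max tables:

```python
import torch

d, bits = 128, 4
k = 2 ** bits
gen = torch.Generator().manual_seed(42)

# Fixed random orthogonal matrix Pi (QR of a Gaussian matrix), generated once
Pi, _ = torch.linalg.qr(torch.randn(d, d, generator=gen))

# Toy codebook: quantiles of N(0, 1/d) stand in for Lloyd-Max centroids
centroids = torch.distributions.Normal(0.0, d ** -0.5).icdf((torch.arange(k) + 0.5) / k)

def encode(x):
    norm = x.norm()                                      # 1. extract L2 norm
    r = Pi @ (x / norm)                                  # 2. rotate the unit vector
    idx = (r[:, None] - centroids).abs().argmin(dim=1)   # 3. nearest centroid
    packed = (idx[0::2] << 4) | idx[1::2]                # 4. two 4-bit indices per byte
    return packed.to(torch.uint8), norm                  # 5. packed bytes + norm

def decode(packed, norm):
    idx = torch.stack([packed >> 4, packed & 0xF], dim=1).flatten().long()
    r = centroids[idx]                                   # codebook lookup
    return norm * (Pi.T @ r)                             # un-rotate and rescale

x = torch.randn(d)
x_hat = decode(*encode(x))
# x_hat approximates x; the error shrinks as bits increase
```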
## Validated MSE
| Bits | MSE | Theory Lower | Theory Upper | Ratio to LB |
|---|---|---|---|---|
| 4 | 0.0095 | 0.0039 | 0.0184 | 2.4x |
| 3 | 0.0345 | 0.0156 | 0.0736 | 2.2x |
| 2 | 0.1175 | 0.0625 | 0.2945 | 1.9x |
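As a side observation (ours, not a claim the library makes), the theory-lower column matches the classic per-coordinate distortion-rate bound for a Gaussian source, D(b) = 2^(-2b):

```python
# Theory-lower values from the table vs the distortion-rate bound 2^(-2b)
for bits, lower in [(4, 0.0039), (3, 0.0156), (2, 0.0625)]:
    assert abs(2 ** (-2 * bits) - lower) < 1e-3, (bits, lower)
print("theory-lower column matches 2^(-2b)")
```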
## API Reference

```python
class TurboQuant:
    def __init__(self, head_dim=128, bits=4, seed=42, device="cuda", ...)
    def encode(self, x: Tensor) -> Tuple[Tensor, Tensor]  # -> (packed, norms)
    def decode(self, packed: Tensor, norms: Tensor) -> Tensor
    def validate(self, num_vectors=10000) -> dict
    def benchmark(self, num_vectors=32768) -> dict
    def compression_ratio(self) -> float
    def memory_report(self, seq_len, num_layers=32, num_kv_heads=8) -> dict

class TQPagedAttention:
    def __init__(self, tq: TurboQuant, num_query_heads: int)
    def forward(self, query, k_packed, k_norms, v_packed, v_norms,
                block_tables, context_lens, block_size=16) -> Tensor
```
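For a sense of scale, here is back-of-envelope cache sizing in the spirit of what `memory_report` covers (plain arithmetic with assumed Llama-style defaults; the method's actual output format may differ):

```python
def kv_cache_gib(seq_len: int, bytes_per_vec: int,
                 num_layers: int = 32, num_kv_heads: int = 8) -> float:
    # keys + values: one head_dim vector each, per layer, per head, per token
    total = 2 * num_layers * num_kv_heads * seq_len * bytes_per_vec
    return total / 2**30

for label, b in [("FP16", 256), ("4-bit", 68)]:
    print(f"{label}: {kv_cache_gib(128_000, b):.2f} GiB at 128K context")
```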
## Reference

```bibtex
@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}
```
## License

CC BY 4.0