Near-optimal KV cache quantization for LLM inference (arXiv:2504.19874)

These details have been verified by PyPI

Project links

Repository

GitHub Statistics

Maintainers

aither_wzns

These details have not been verified by PyPI

Project links

Paper

Project description

TurboQuant

Near-optimal KV cache quantization for LLM inference. Implements the algorithm from Zandieh et al., "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (arXiv:2504.19874).

Compresses KV cache vectors to 2-4 bits per value with MSE within 2.7x of the information-theoretic lower bound. No calibration data. No retraining. Works online (one vector at a time).

Installation

pip install turboquant

Optional extras:

pip install turboquant[triton]   # GPU-fused quantize/dequantize kernels
pip install turboquant[scipy]    # Custom codebook computation via Lloyd-Max
pip install turboquant[dev]      # pytest for running tests

Quick Start

import torch
from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4, device="cuda")

# Example input (illustrative shape): FP16 KV vectors whose last dim equals head_dim
kv_vectors = torch.randn(1024, 128, dtype=torch.float16, device="cuda")

# Encode: FP16 vectors -> packed uint8 + norms
packed, norms = tq.encode(kv_vectors)   # kv_vectors: [..., 128] float16

# Decode: packed representation -> reconstructed vectors
decoded = tq.decode(packed, norms)      # decoded: [..., 128] float16

# Validate MSE against theory
print(tq.validate())

Algorithm

TurboQuant applies three steps to each KV cache vector:

1. Normalize -- extract the L2 norm and project the vector onto the unit sphere S^{d-1}.
2. Random rotation -- multiply by a fixed orthogonal matrix Pi. This makes each coordinate approximately Gaussian N(0, 1/d), regardless of the input distribution. The rotation is data-oblivious (generated once from a seed).
3. Optimal scalar quantization -- quantize each coordinate independently using a precomputed Lloyd-Max codebook for N(0, 1/d). Pack indices into uint8 bytes.

Storage per vector: ceil(d * bits / 8) bytes for indices + 4 bytes for the float32 norm.
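The index packing can be sketched in NumPy as follows. This is an illustration only: the helper names `pack_indices` and `unpack_indices` are hypothetical, and the MSB-first bit order is an assumption, not necessarily the byte layout the package uses.

```python
import numpy as np

def pack_indices(idx, bits):
    """Pack integer codes in [0, 2**bits) into uint8 bytes, MSB first (assumed layout)."""
    idx = np.asarray(idx, dtype=np.uint8)
    shifts = np.arange(bits - 1, -1, -1, dtype=np.uint8)
    # Expand every code into its `bits` binary digits: shape [..., d, bits]
    bit_planes = (idx[..., None] >> shifts) & np.uint8(1)
    flat_bits = bit_planes.reshape(*idx.shape[:-1], -1)
    # packbits zero-pads the final byte, giving ceil(d * bits / 8) bytes per vector
    return np.packbits(flat_bits, axis=-1)

def unpack_indices(packed, bits, d):
    """Inverse of pack_indices for vectors of length d."""
    flat_bits = np.unpackbits(packed, axis=-1, count=d * bits)
    bit_planes = flat_bits.reshape(*packed.shape[:-1], d, bits)
    weights = (1 << np.arange(bits - 1, -1, -1)).astype(np.uint8)
    return (bit_planes * weights).sum(axis=-1).astype(np.uint8)

rng = np.random.default_rng(0)
for bits in (2, 3, 4):
    codes = rng.integers(0, 2 ** bits, size=(5, 128), dtype=np.uint8)
    packed = pack_indices(codes, bits)
    print(bits, "bits ->", packed.shape[-1], "index bytes per vector")
```

Adding the 4-byte float32 norm to the printed index sizes gives the per-vector storage described above.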

Decoding reverses the process: unpack indices, look up codebook centroids, apply inverse rotation Pi^T, rescale by the stored norm.
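The encode and decode steps above can be sketched end to end in NumPy. This is a minimal sketch under stated assumptions, not the package's implementation: the rotation is drawn via QR of a Gaussian matrix, the codebook uses the textbook 2-bit Lloyd-Max levels for N(0, 1) rescaled by 1/sqrt(d), index packing is skipped, and the function names are hypothetical.

```python
import numpy as np

d, bits = 128, 2
rng = np.random.default_rng(42)

# Fixed, data-oblivious rotation: QR of a Gaussian matrix, with a sign fix
# so that the orthogonal factor is uniformly (Haar) distributed.
q_factor, r_factor = np.linalg.qr(rng.standard_normal((d, d)))
Pi = q_factor * np.sign(np.diag(r_factor))

# Textbook 2-bit Lloyd-Max levels for N(0, 1), rescaled to N(0, 1/d).
codebook = np.array([-1.5104, -0.4528, 0.4528, 1.5104]) / np.sqrt(d)

def encode(x):
    norms = np.linalg.norm(x, axis=-1, keepdims=True)        # step 1: normalize
    y = (x / norms) @ Pi.T                                    # step 2: rotate (row-vector form of Pi @ x)
    idx = np.abs(y[..., None] - codebook).argmin(axis=-1)     # step 3: nearest centroid
    return idx.astype(np.uint8), norms.astype(np.float32)

def decode(idx, norms):
    y_hat = codebook[idx]            # look up centroids
    return (y_hat @ Pi) * norms      # inverse rotation Pi^T, then rescale

# Deliberately non-Gaussian inputs: the rotation should make them look Gaussian.
x = rng.laplace(size=(2000, d))
idx, norms = encode(x)
x_hat = decode(idx, norms)
rel_mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1) / np.sum(x ** 2, axis=-1))
print(f"relative MSE at {bits} bits: {rel_mse:.4f}")
```

The printed value is the squared reconstruction error relative to the squared norm, which is the same quantity as the MSE for unit vectors, so it should be comparable to the 2-bit row of the Validated MSE table.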

Compression Ratios

Ratios for head_dim=128 (256 bytes at FP16, 128 bytes at FP8):

Bits	Packed Size	Ratio vs FP16	Ratio vs FP8
4	68 bytes	3.8x	1.9x
3	52 bytes	4.9x	2.5x
2	36 bytes	7.1x	3.6x
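The table follows directly from the storage formula in the Algorithm section. A short check, using a hypothetical helper `packed_bytes`:

```python
import math

def packed_bytes(head_dim: int, bits: int, norm_bytes: int = 4) -> int:
    """Bytes per vector: packed indices plus one float32 norm."""
    return math.ceil(head_dim * bits / 8) + norm_bytes

fp16_bytes, fp8_bytes = 128 * 2, 128 * 1   # head_dim=128 at 2 and 1 bytes per value
for bits in (4, 3, 2):
    size = packed_bytes(128, bits)
    print(f"{bits} bits: {size} bytes, "
          f"{fp16_bytes / size:.1f}x vs FP16, {fp8_bytes / size:.1f}x vs FP8")
```

Because the 4-byte norm is a fixed overhead, the ratios stay slightly below the ideal 16/bits figure for FP16.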

Validated MSE

MSE for unit vectors on S^{127} (d=128). Theory bounds from the paper:

Bits	MSE (measured)	Theory Lower	Theory Upper	Ratio to LB
4	0.0095	0.0039	0.0184	2.4x
3	0.0345	0.0156	0.0736	2.2x
2	0.1175	0.0625	0.2945	1.9x

All measured values are within the paper's upper bound and well below the worst-case ratio of 3*pi/2 = 4.71x.
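The bound columns can be reproduced with a few lines of arithmetic. The formulas here are inferred from the table itself (lower = 4^-bits, upper = (3*pi/2) * 4^-bits) rather than quoted from the paper, so treat them as an assumption; `mse_bounds` is a hypothetical helper.

```python
import math

def mse_bounds(bits: int):
    """Lower and upper MSE bounds for unit vectors, as tabulated above (inferred formulas)."""
    lower = 4.0 ** -bits
    upper = 1.5 * math.pi * lower
    return lower, upper

measured = {4: 0.0095, 3: 0.0345, 2: 0.1175}   # "MSE (measured)" column
for bits, mse in measured.items():
    lower, upper = mse_bounds(bits)
    print(f"{bits} bits: lower={lower:.4f} upper={upper:.4f} ratio={mse / lower:.1f}x")
```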

Fused Attention (TQPagedAttention)

The key optimization for inference: compute attention scores and accumulate values in the rotated domain without ever materializing a decompression buffer.

from turboquant.fused_attention import TQPagedAttention

attn = TQPagedAttention(tq, num_query_heads=32)
output = attn.forward(
    query,          # [num_seqs, num_query_heads, head_dim]
    k_packed,       # [num_blocks, block_size, num_kv_heads, packed_dim]
    k_norms,        # [num_blocks, block_size, num_kv_heads]
    v_packed,       # same layout as k_packed
    v_norms,        # same layout as k_norms
    block_tables,   # [num_seqs, max_blocks_per_seq]
    context_lens,   # [num_seqs]
)

The math:

q_rot = Pi @ q                                          # rotate query once
score_i = ||k_i|| * dot(q_rot, y_hat_k_i) / sqrt(d)    # score in rotated domain
acc += softmax_weight_i * ||v_i|| * y_hat_v_i           # accumulate rotated V
output = Pi^T @ normalize(acc)                           # rotate back once

This reads packed uint8 indices directly and avoids the O(seq_len * head_dim) decompression buffer. For 4-bit codes, it uses the even/odd nibble split, so dot products are computed on the two unpacked halves without re-interleaving them.
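Why the rotated-domain math gives the same result can be checked numerically. The sketch below is an illustration, not the fused kernel: it uses exact (unquantized) rotated unit vectors in place of y_hat, so the two outputs agree to floating-point precision; with quantized codes the agreement becomes approximate. The softmax here already divides by the sum of weights, which plays the role of normalize(acc).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 10
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))   # any orthogonal matrix works

q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Reference: ordinary attention in the original domain.
ref = softmax(K @ q / np.sqrt(d)) @ V

# Stored form: norm plus rotated unit vector for every key and value.
k_norms = np.linalg.norm(K, axis=1)
v_norms = np.linalg.norm(V, axis=1)
y_k = (K / k_norms[:, None]) @ Pi.T
y_v = (V / v_norms[:, None]) @ Pi.T

q_rot = Pi @ q                                   # rotate query once
scores = k_norms * (y_k @ q_rot) / np.sqrt(d)    # scores in the rotated domain
weights = softmax(scores)
acc = (weights * v_norms) @ y_v                  # accumulate rotated V
out = Pi.T @ acc                                 # rotate back once

print(np.max(np.abs(out - ref)))
```

The equality holds because an orthogonal Pi preserves dot products, so dot(Pi q, Pi k) = dot(q, k), and because the weighted sum of rotated values is the rotation of the weighted sum.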

API Reference

TurboQuant

class TurboQuant:
    def __init__(self, config=None, *, head_dim=128, bits=4, seed=42,
                 use_hadamard=False, device="cuda", dtype=torch.float16,
                 use_triton=True): ...

    def encode(self, x: Tensor) -> Tuple[Tensor, Tensor]: ...
    def decode(self, packed: Tensor, norms: Tensor) -> Tensor: ...
    def validate(self, num_vectors=10000, device=None) -> dict: ...
    def benchmark(self, num_vectors=32768, warmup=10, iters=100) -> dict: ...
    def compression_ratio(self) -> float: ...
    def memory_report(self, seq_len, num_layers=32, num_kv_heads=8) -> dict: ...

TurboQuantConfig

@dataclass
class TurboQuantConfig:
    head_dim: int = 128          # Must be power of 2
    bits: int = 4                # 2, 3, or 4
    seed: int = 42               # RNG seed for rotation matrix
    use_hadamard: bool = False   # True = Randomized Hadamard Transform
    hadamard_rounds: int = 3     # RHT rounds (>= 3 for near-Haar)
    device: str = "cuda"
    dtype: torch.dtype = torch.float16
    use_triton: bool = True      # Try fused Triton kernels on GPU
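
For orientation, a Randomized Hadamard Transform of the kind the `use_hadamard` and `hadamard_rounds` options refer to can be sketched as below. This assumes each round is a random sign flip followed by a normalized Walsh-Hadamard transform; the package's actual construction may differ, and the function names are hypothetical. It also shows why the length must be a power of 2.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    lead, d = x.shape[:-1], x.shape[-1]
    if d & (d - 1):
        raise ValueError("length must be a power of 2")
    h = 1
    while h < d:
        # Butterfly: combine the two halves of every block of size 2h.
        blocks = x.reshape(*lead, d // (2 * h), 2, h)
        a, b = blocks[..., 0, :], blocks[..., 1, :]
        x = np.stack((a + b, a - b), axis=-2).reshape(*lead, d)
        h *= 2
    return x / np.sqrt(d)

def make_signs(d, rounds=3, seed=42):
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(rounds, d))

def rht(x, signs):
    """Apply `rounds` of (sign flip, Hadamard) to x."""
    for s in signs:
        x = fwht(x * s)
    return x

def rht_inverse(y, signs):
    """Undo rht: the normalized Hadamard matrix and the sign flips are self-inverse."""
    for s in signs[::-1]:
        y = fwht(y) * s
    return y

signs = make_signs(128)
x = np.random.default_rng(0).standard_normal((4, 128))
y = rht(x, signs)
print(np.allclose(np.linalg.norm(y, axis=-1), np.linalg.norm(x, axis=-1)))
```

Each round costs O(d log d) operations instead of the O(d^2) of a dense rotation matrix, which is the usual reason to prefer this transform.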

TQPagedAttention

class TQPagedAttention:
    def __init__(self, tq: TurboQuant, num_query_heads: int): ...

    def forward(self, query, k_packed, k_norms, v_packed, v_norms,
                block_tables, context_lens, block_size=16,
                num_kv_heads=None) -> Tensor: ...

Benchmark

Run the built-in benchmark to validate correctness and measure throughput:

python -m turboquant.bench

This reports:

- MSE vs theoretical bounds for each bit-width
- Encode/decode throughput (vectors/second)
- KV cache memory usage for common model configurations
- Maximum context length estimates for given GPU memory
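
The memory and context-length figures can be approximated by hand. The sketch below is a rough estimate, not the benchmark's own code: it assumes one K and one V vector per token, per layer, per KV head, reuses the defaults from the `memory_report` signature (32 layers, 8 KV heads), and ignores block padding and other overheads. Both helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, bits=None):
    """KV cache bytes per token. bits=None means plain FP16 storage."""
    if bits is None:
        per_vector = head_dim * 2                       # 2 bytes per FP16 value
    else:
        per_vector = (head_dim * bits + 7) // 8 + 4     # packed indices + float32 norm
    return 2 * num_layers * num_kv_heads * per_vector   # factor 2: keys and values

def max_context(budget_bytes, **kwargs):
    """Largest token count whose KV cache fits in budget_bytes."""
    return budget_bytes // kv_bytes_per_token(**kwargs)

budget = 8 * 2 ** 30   # 8 GiB set aside for the KV cache (illustrative figure)
for bits in (None, 4, 3, 2):
    label = "FP16" if bits is None else f"{bits}-bit"
    print(label, kv_bytes_per_token(bits=bits), "bytes/token,",
          max_context(budget, bits=bits), "tokens")
```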

Codebook Computation

The package includes hardcoded Lloyd-Max codebooks for 1-4 bit quantization of N(0,1). For custom configurations, compute codebooks from scratch:

from turboquant.codebook import compute_codebook_scipy

centroids, boundaries, mse = compute_codebook_scipy(d=128, bits=3)

Requires scipy (install with pip install turboquant[scipy]).
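
For intuition about what a Lloyd-Max codebook is, here is a standalone sketch for N(0, 1) that needs only the standard library. It is not `compute_codebook_scipy`: it alternates the two Lloyd-Max conditions (boundaries at midpoints, centroids at conditional means) using the closed-form Gaussian expressions, and it returns a tuple in the same order purely for readability. The function names are hypothetical.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lloyd_max_gaussian(bits, iters=1000):
    """Lloyd-Max scalar quantizer for N(0, 1): returns (centroids, boundaries, mse)."""
    k = 2 ** bits
    centroids = [-3.0 + 6.0 * (i + 0.5) / k for i in range(k)]   # even spread on [-3, 3]
    for _ in range(iters):
        # Condition 1: decision boundaries sit halfway between neighbouring centroids.
        edges = ([-math.inf]
                 + [(centroids[i] + centroids[i + 1]) / 2.0 for i in range(k - 1)]
                 + [math.inf])
        # Condition 2: each centroid is the conditional mean of its cell.
        probs = [normal_cdf(edges[i + 1]) - normal_cdf(edges[i]) for i in range(k)]
        centroids = [(normal_pdf(edges[i]) - normal_pdf(edges[i + 1])) / probs[i]
                     for i in range(k)]
    # With conditional-mean centroids, MSE = E[x^2] - sum_i p_i * c_i^2.
    mse = 1.0 - sum(p * c * c for p, c in zip(probs, centroids))
    return centroids, edges[1:-1], mse

for bits in (1, 2, 3, 4):
    centroids, boundaries, mse = lloyd_max_gaussian(bits)
    print(bits, "bits: mse =", round(mse, 4))
```

For 2, 3, and 4 bits the distortion should come out near 0.1175, 0.0345, and 0.0095, the classical Lloyd-Max values for a unit Gaussian and the same figures as the measured column of the Validated MSE table.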

Reference

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Han, Insu and Daliri, Majid and Karbasi, Amin},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}

License

CC BY 4.0 -- see LICENSE file.

Release history

Version	Released
2.1.0	Apr 15, 2026
2.0.1	Apr 7, 2026
2.0.0	Apr 7, 2026
1.3.1	Apr 5, 2026
1.3.0	Apr 5, 2026
1.2.1	Apr 5, 2026
1.2.0	Apr 5, 2026
1.1.1	Apr 4, 2026
1.1.0	Apr 3, 2026
0.9.2	Apr 2, 2026
0.9.1	Apr 2, 2026
0.8.1	Apr 1, 2026
0.8.0	Mar 31, 2026
0.7.0	Mar 30, 2026
0.6.0	Mar 30, 2026
0.5.0	Mar 30, 2026
0.4.0	Mar 28, 2026
0.3.0	Mar 27, 2026
0.2.0	Mar 27, 2026
0.1.0	Mar 27, 2026 (this version)

Download files

Source Distribution

aither_kvcache-0.1.0.tar.gz (26.4 kB view details)

Uploaded Mar 27, 2026 Source

Built Distribution

aither_kvcache-0.1.0-py3-none-any.whl (22.3 kB view details)

Uploaded Mar 27, 2026 Python 3

File details

Details for the file aither_kvcache-0.1.0.tar.gz.

File metadata

Download URL: aither_kvcache-0.1.0.tar.gz
Upload date: Mar 27, 2026
Size: 26.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for aither_kvcache-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6959c6aff3af23f9a40f8779d2b00620edd3e2ddd6031efb80803613a10074a5`
MD5	`c9a5771973c5bad79b068e34e0677747`
BLAKE2b-256	`7a19c3762ca985f71498b9a402541d79afa866eeefa0e4d6d70fc0fca40b4ff8`

Provenance

The following attestation bundles were made for aither_kvcache-0.1.0.tar.gz:

Publisher: publish.yml on Aitherium/aitherkvcache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: aither_kvcache-0.1.0.tar.gz
- Subject digest: 6959c6aff3af23f9a40f8779d2b00620edd3e2ddd6031efb80803613a10074a5
- Sigstore transparency entry: 1187687040
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: Aitherium/aitherkvcache@3959add99cb9a1072e02db502027b366a69e72ed
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Aitherium
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3959add99cb9a1072e02db502027b366a69e72ed
- Trigger Event: release

File details

Details for the file aither_kvcache-0.1.0-py3-none-any.whl.

File metadata

Download URL: aither_kvcache-0.1.0-py3-none-any.whl
Upload date: Mar 27, 2026
Size: 22.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for aither_kvcache-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`49ffb51e2d16a80d280463c24451a50a041fff946f38f5b95c6380f2ef69a4f6`
MD5	`c99141035fcc812030dda84637da81cd`
BLAKE2b-256	`aa5dced68d01164aa1d7ec3da2490d784dd2cb120bdf7e4813bf95d62097c493`

Provenance

The following attestation bundles were made for aither_kvcache-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Aitherium/aitherkvcache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: aither_kvcache-0.1.0-py3-none-any.whl
- Subject digest: 49ffb51e2d16a80d280463c24451a50a041fff946f38f5b95c6380f2ef69a4f6
- Sigstore transparency entry: 1187687056
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: Aitherium/aitherkvcache@3959add99cb9a1072e02db502027b366a69e72ed
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Aitherium
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3959add99cb9a1072e02db502027b366a69e72ed
- Trigger Event: release

aither-kvcache 0.1.0
