TurboQuant KV cache compression for vLLM — fused Triton kernels, 3.76x compression, 3.7x faster decode on RTX 4090
Project description
turboquant-vllm
TurboQuant KV cache compression as a drop-in vLLM plugin. 3.76x compression, near-identical output quality, one CLI flag to enable.
The first open-source TurboQuant implementation: from paper to working vLLM plugin in 72 hours.
Install
```bash
pip install "turboquant-vllm[vllm]"
```
Or with uv:
```bash
uv add turboquant-vllm --extra vllm
```
Quick Start (vLLM)
The TQ4 attention backend registers automatically via vLLM's plugin system:
```bash
vllm serve allenai/Molmo2-4B --attention-backend CUSTOM
```
No code changes required. The plugin compresses KV cache pages to 68 bytes/token/head (vs 256 bytes FP16).
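The 3.76x figure follows directly from the byte counts above. A quick sanity check of the arithmetic (assuming `head_dim=128`, as in the HuggingFace quick start below):

```python
# Per-token, per-head KV footprint: FP16 baseline vs. TQ4 packed format.
head_dim = 128
fp16_bytes = head_dim * 2        # 2 bytes per FP16 coordinate -> 256 bytes
tq4_bytes = head_dim // 2 + 4    # two 4-bit indices per byte + one fp32 norm -> 68 bytes
print(fp16_bytes, tq4_bytes, round(fp16_bytes / tq4_bytes, 2))  # 256 68 3.76
```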
Quick Start (HuggingFace)
```python
from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)

# Pass `cache` (not the wrapper) to model.generate();
# compression happens transparently on every cache.update().
```
Benchmark Results
Molmo2-4B (bfloat16, 36 layers) on RTX 4090 — 11K visual tokens from 2fps video + 256 generation tokens:
| Mode | KV Cache | Compression | Output Quality | Overhead |
|---|---|---|---|---|
| FP16 baseline | 1,639 MiB | 1.0x | -- | -- |
| TQ3 (3-bit) | 845 MiB | 1.94x | ~95% cosine similarity | 2.35x |
| TQ4 (full dequant) | 435 MiB | 3.76x | ~97% cosine similarity | 3.36x |
| TQ4 (incremental) | 435 MiB | 3.76x | ~97% cosine, 100+ matching tokens | 1.78x |
How It Works
Implements Google's TurboQuant algorithm (ICLR 2025):
- Random orthogonal rotation maps each KV vector onto coordinates that follow a known Beta distribution
- Lloyd-Max scalar quantization finds optimal centroids for that distribution at 3-4 bits per coordinate
- Nibble packing stores two 4-bit indices per byte for 3.76x compression
- Incremental dequantization only decompresses new tokens each decode step, keeping overhead at 1.78x
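The rotate-then-quantize pipeline above can be sketched in a few lines of NumPy. This is an illustration of the technique, not the package's API; the real format also stores a per-vector fp32 norm and uses the paper's closed-form codebook, whereas here a toy Lloyd-Max codebook is fitted directly:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# 1. Random orthogonal rotation via QR of a Gaussian matrix (fixed seed).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# 2. Toy Lloyd-Max codebook: 16 centroids (4 bits per coordinate), fitted by
#    alternating nearest-centroid assignment and mean updates on 1-D samples
#    drawn from the post-rotation coordinate distribution.
samples = (rng.standard_normal((1024, d)) @ Q.T).ravel()
centroids = np.quantile(samples, np.linspace(0.03, 0.97, 16))
for _ in range(10):
    idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
    for k in range(16):
        if np.any(idx == k):
            centroids[k] = samples[idx == k].mean()

# 3. Quantize one KV vector: rotate, then map each coordinate to its
#    nearest centroid index (these indices are what gets nibble-packed).
v = rng.standard_normal(d)
r = Q @ v
codes = np.abs(r[:, None] - centroids[None, :]).argmin(axis=1).astype(np.uint8)

# 4. Dequantize: look up centroids and rotate back.
recon = Q.T @ centroids[codes]
cos = float(v @ recon / (np.linalg.norm(v) * np.linalg.norm(recon)))
print(f"cosine similarity: {cos:.3f}")
```

Even this toy 4-bit codebook recovers the original vector with high cosine similarity, which is the effect the "Output Quality" column in the benchmark table measures.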
What Gets Compressed
| Data | Compressed | Format |
|---|---|---|
| Key cache vectors | Yes | uint8 nibble-packed indices + fp32 norms |
| Value cache vectors | Yes | uint8 nibble-packed indices + fp32 norms |
| Rotation matrices | No | Generated once per layer from fixed seed |
| Lloyd-Max codebook | No | Computed once, shared across all layers |
Roadmap
- Core TurboQuant algorithm (Lloyd-Max, MSE quantizer, compressors)
- CompressedDynamicCache with incremental dequantization
- vLLM TQ4 attention backend plugin
- Fused Triton kernels (17.8x Q@K^T speedup, Flash Attention fusion)
- Container image with turboquant-vllm baked in
- Full Flash Attention fusion with fp32 online softmax
- SageAttention-style INT8 path
Documentation
- Architecture -- Module map, dependency DAG, data flow diagrams
- Roadmap -- Detailed implementation status and experiment results
- Development Guide -- Setup, build, test, lint commands
Citation
```bibtex
@inproceedings{zandieh2025turboquant,
  title     = {TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author    = {Zandieh, Amir and Han, Insu and Daliri, Majid and Karbasi, Amin},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025}
}
```
License
File details

Details for the file turboquant_vllm-1.2.1.tar.gz.

File metadata
- Download URL: turboquant_vllm-1.2.1.tar.gz
- Size: 59.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.2 (Ubuntu 24.04, CI)

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e447646e6332fd273a739eff94780403d9466f53b1140d5acbf2e87db0488053 |
| MD5 | 5101b21b6c4f64c5b3665ef61394a083 |
| BLAKE2b-256 | 9cc8853d2cdbdd7cc00a510183d521d943813a73ead2b61b6b3ab62289a105c4 |
File details

Details for the file turboquant_vllm-1.2.1-py3-none-any.whl.

File metadata
- Download URL: turboquant_vllm-1.2.1-py3-none-any.whl
- Size: 77.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.2 (Ubuntu 24.04, CI)

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e8bf8a36fb13fbc8338b99ce6f640e88f2f5313f0cafcfd03f7e6b2c44d0add4 |
| MD5 | 68be57f4a547b272d125290dfecebbbb |
| BLAKE2b-256 | 1a3a1820bb4f85c6c2ed31630ec9c5b89518d3e310375ead0e0be927fc66058c |