TurboQuant KV cache compression for vLLM — fused Triton kernels, 3.76x compression, 3.7x faster decode on RTX 4090


turboquant-vllm

TurboQuant KV cache compression as a drop-in vLLM plugin. 3.76x compression, near-identical output quality, one CLI flag to enable.

First open-source TurboQuant implementation — paper to working vLLM plugin in 72 hours.

Install

pip install turboquant-vllm[vllm]

Or with uv:

uv add turboquant-vllm --extra vllm

Quick Start (vLLM)

The TQ4 attention backend registers automatically via vLLM's plugin system:

vllm serve allenai/Molmo2-4B --attention-backend CUSTOM

No code changes required. The plugin compresses KV cache pages to 68 bytes/token/head (vs 256 bytes FP16).
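
The per-head arithmetic behind those figures, assuming head_dim=128 (which is what makes an FP16 vector 256 bytes) and one fp32 norm per compressed vector:

head_dim = 128

fp16_bytes = head_dim * 2               # 256 B: one fp16 value per coordinate
tq4_index_bytes = head_dim * 4 // 8     # 64 B: two 4-bit indices packed per byte
tq4_bytes = tq4_index_bytes + 4         # + one 4-byte fp32 norm = 68 B/token/head

print(f"{fp16_bytes / tq4_bytes:.2f}x") # 3.76x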

Quick Start (HuggingFace)

from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)

# Pass the original `cache` (not the `compressed` wrapper) to model.generate();
# compression happens transparently on every cache.update()
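
A minimal end-to-end sketch of that path. The checkpoint and head_dim here are illustrative (head_dim must match the model's attention head size); passing a Cache object via past_key_values is standard transformers behavior:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
from turboquant_vllm import CompressedDynamicCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

cache = DynamicCache()
# The wrapper instruments `cache`, so the raw cache is what generate() consumes
compressed = CompressedDynamicCache(cache, head_dim=64, bits=4)  # Qwen2.5-0.5B uses head_dim=64

inputs = tok("Summarize TurboQuant in one sentence.", return_tensors="pt")
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=48)
print(tok.decode(out[0], skip_special_tokens=True))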

Benchmark Results

Molmo2-4B (bfloat16, 36 layers) on RTX 4090 — 11K visual tokens from 2fps video + 256 generation tokens:

Mode                KV Cache   Compression  Output Quality                     Overhead (vs FP16)
FP16 baseline       1,639 MiB  1.0x         --                                 --
TQ3 (3-bit)         845 MiB    1.94x        ~95% cosine similarity             2.35x
TQ4 (full dequant)  435 MiB    3.76x        ~97% cosine similarity             3.36x
TQ4 (incremental)   435 MiB    3.76x        ~97% cosine, 100+ matching tokens  1.78x

How It Works

Implements Google's TurboQuant algorithm (Zandieh et al.; see Citation below):

  1. Random orthogonal rotation maps each KV vector onto coordinates that follow a known Beta distribution
  2. Lloyd-Max scalar quantization finds optimal centroids for that distribution at 3-4 bits per coordinate
  3. Nibble packing stores two 4-bit indices per byte for 3.76x compression
  4. Incremental dequantization only decompresses new tokens each decode step, keeping overhead at 1.78x
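
A toy NumPy sketch of steps 1-3. It fits the 1-D quantizer with plain Lloyd iterations on sampled rotated coordinates as a stand-in for the paper's closed-form solution for the Beta distribution, and handles one vector at a time where the real plugin runs fused Triton kernels over packed pages:

import numpy as np

rng = np.random.default_rng(0)
head_dim, bits = 128, 4
n_levels = 2 ** bits  # 16 centroids at 4 bits per coordinate

# Step 1: random orthogonal rotation (QR of a Gaussian matrix),
# generated once from a fixed seed.
Q, _ = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))

# Step 2: fit a 1-D Lloyd-Max quantizer to the post-rotation coordinate
# distribution, here by Lloyd iterations on unit-norm rotated samples.
coords = rng.standard_normal((4096, head_dim)) @ Q
coords = (coords / np.linalg.norm(coords, axis=1, keepdims=True)).ravel()
centroids = np.quantile(coords, (np.arange(n_levels) + 0.5) / n_levels)
for _ in range(20):
    assign = np.abs(coords[:, None] - centroids[None, :]).argmin(axis=1)
    centroids = np.array([coords[assign == k].mean() for k in range(n_levels)])

def compress(v):
    """Rotate, normalize, quantize each coordinate, nibble-pack the indices."""
    r = v @ Q
    norm = np.linalg.norm(r)
    idx = np.abs(r[:, None] / norm - centroids[None, :]).argmin(axis=1)
    # Step 3: two 4-bit indices per byte -> head_dim // 2 uint8 bytes
    packed = ((idx[0::2] << 4) | idx[1::2]).astype(np.uint8)
    return packed, np.float32(norm)

def decompress(packed, norm):
    """Unpack nibbles, look up centroids, restore the norm, rotate back."""
    idx = np.stack([packed >> 4, packed & 0xF], axis=1).ravel()
    return (centroids[idx] * norm) @ Q.T

v = rng.standard_normal(head_dim)
v_hat = decompress(*compress(v))
cos = v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat))
print(f"cosine similarity: {cos:.3f}")  # typically ~0.99 at 4 bits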

What Gets Compressed

Data                 Compressed  Format
Key cache vectors    Yes         uint8 nibble-packed indices + fp32 norms
Value cache vectors  Yes         uint8 nibble-packed indices + fp32 norms
Rotation matrices    No          Generated once per layer from fixed seed
Lloyd-Max codebook   No          Computed once, shared across all layers
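
Put together, a compressed page might be laid out roughly as below. This is a hypothetical sketch to make the table concrete, not the package's actual data structure:

from dataclasses import dataclass
import numpy as np

@dataclass
class CompressedKVPage:
    """Hypothetical layout of one compressed KV page (field names are
    illustrative, not the package's actual API)."""
    key_idx: np.ndarray     # uint8, (tokens, heads, head_dim // 2), nibble-packed
    key_norm: np.ndarray    # float32, (tokens, heads), one norm per key vector
    value_idx: np.ndarray   # uint8, (tokens, heads, head_dim // 2), nibble-packed
    value_norm: np.ndarray  # float32, (tokens, heads), one norm per value vector

    def bytes_per_token_per_head(self) -> int:
        head_dim = 2 * self.key_idx.shape[-1]
        # head_dim // 2 index bytes + one 4-byte fp32 norm = 68 B at head_dim=128
        return head_dim // 2 + 4

The rotation matrices and codebook stay uncompressed because they are small and shared: one head_dim x head_dim matrix per layer, regenerated from the fixed seed, and a single 16-entry centroid table for the whole model.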

Roadmap

  • Core TurboQuant algorithm (Lloyd-Max, MSE quantizer, compressors)
  • CompressedDynamicCache with incremental dequantization
  • vLLM TQ4 attention backend plugin
  • Fused Triton kernels (17.8x Q@K^T speedup, Flash Attention fusion)
  • Container image with turboquant-vllm baked in
  • Full Flash Attention fusion with fp32 online softmax
  • SageAttention-style INT8 path

Documentation

  • Architecture -- Module map, dependency DAG, data flow diagrams
  • Roadmap -- Detailed implementation status and experiment results
  • Development Guide -- Setup, build, test, lint commands

Citation

@inproceedings{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Han, Insu and Daliri, Majid and Karbasi, Amin},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}

License

Apache 2.0
