TurboQuant KV cache compression for vLLM — fused Triton kernels, 3.76x compression, 3.7x faster decode on RTX 4090

turboquant-vllm

TurboQuant KV cache compression as a drop-in vLLM plugin. 3.76x compression, near-identical output quality, one CLI flag to enable.

First open-source TurboQuant implementation — paper to working vLLM plugin in 72 hours.

Install

pip install turboquant-vllm[vllm]

Or with uv:

uv add turboquant-vllm --extra vllm

Quick Start (vLLM)

The TQ4 attention backend registers automatically via vLLM's plugin system:

vllm serve allenai/Molmo2-4B --attention-backend CUSTOM

No code changes required. The plugin compresses KV cache pages to 68 bytes per token per head, versus 256 bytes for FP16 (128-dimensional heads, 2 bytes per element).
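
The 68-byte figure follows from simple byte accounting. The sketch below assumes 128-dimensional heads (implied by the 256-byte FP16 figure), 4-bit indices packed two per byte, and one fp32 norm per vector, as listed under What Gets Compressed. The TQ3 line additionally assumes each 3-bit index occupies a whole uint8; that layout is not stated in this README, but it reproduces the 1.94x ratio in the benchmark table. `bytes_per_vector` is a hypothetical helper, not part of the package.

```python
def bytes_per_vector(head_dim: int, bits: int, packed: bool, norm_bytes: int = 4) -> int:
    """Bytes for one quantized K or V vector (one token, one head)."""
    # Packed: indices are stored back to back at `bits` bits each.
    # Unpacked: every index occupies a whole uint8.
    index_bytes = head_dim * bits // 8 if packed else head_dim
    return index_bytes + norm_bytes  # plus one fp32 norm per vector


HEAD_DIM = 128
fp16 = HEAD_DIM * 2                                      # 2 bytes per FP16 element
tq4 = bytes_per_vector(HEAD_DIM, bits=4, packed=True)
tq3 = bytes_per_vector(HEAD_DIM, bits=3, packed=False)   # assumption: unpacked

print(fp16, tq4, round(fp16 / tq4, 2))  # 256 68 3.76
print(tq3, round(fp16 / tq3, 2))        # 132 1.94
```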

Quick Start (HuggingFace)

from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)

# Pass cache (not the wrapper) to model.generate()
# Compression happens transparently on every cache.update()

Benchmark Results

Molmo2-4B (bfloat16, 36 layers) on RTX 4090 — 11K visual tokens from 2fps video + 256 generation tokens:

Mode	KV Cache	Compression	Output Quality	Overhead
FP16 baseline	1,639 MiB	1.0x	--	--
TQ3 (3-bit)	845 MiB	1.94x	~95% cosine similarity	2.35x
TQ4 (full dequant)	435 MiB	3.76x	~97% cosine similarity	3.36x
TQ4 (incremental)	435 MiB	3.76x	~97% cosine, 100+ matching tokens	1.78x
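
The gap between the two TQ4 rows comes down to how many tokens are decompressed per decode step. The toy model below is an illustration only, not the plugin's implementation: `ToyCache`, `read_full`, and `read_incremental` are hypothetical names, the "decompression" is a stand-in multiplication, and the prompt is scaled down to 1,000 tokens to keep it fast. It counts how many token vectors each strategy decompresses over 256 decode steps.

```python
class ToyCache:
    """Compressed store plus a running buffer of already-decompressed tokens."""

    def __init__(self, decompress):
        self.decompress = decompress   # callable: compressed token -> vector
        self.compressed = []           # one entry per cached token
        self.buffer = []               # decompressed tokens, grown incrementally
        self.calls = 0                 # tokens decompressed so far

    def append(self, token):
        self.compressed.append(token)

    def read_full(self):
        """Full dequant: decompress the entire cache on every read."""
        self.calls += len(self.compressed)
        return [self.decompress(t) for t in self.compressed]

    def read_incremental(self):
        """Incremental: decompress only tokens added since the last read."""
        new = self.compressed[len(self.buffer):]
        self.calls += len(new)
        self.buffer.extend(self.decompress(t) for t in new)
        return self.buffer


def run(mode, prompt_len=1000, steps=256):
    cache = ToyCache(decompress=lambda t: t * 0.5)  # stand-in for real dequantization
    for t in range(prompt_len):
        cache.append(t)
    for step in range(steps):
        cache.append(prompt_len + step)   # one new token per decode step
        getattr(cache, mode)()            # attention reads the cache
    return cache.calls


full = run("read_full")
inc = run("read_incremental")
print(full, inc)  # 288896 1256
```

Full dequantization touches the whole cache at every step, so its work grows with context length times generated tokens, while the incremental path touches each token once. The measured overhead ratios in the table also include costs this toy model ignores, such as compression and the attention computation itself.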

How It Works

Implements Google's TurboQuant algorithm (ICLR 2026):

Random orthogonal rotation maps each KV vector onto coordinates that follow a known Beta distribution
Lloyd-Max scalar quantization finds optimal centroids for that distribution at 3-4 bits per coordinate
Nibble packing stores two 4-bit indices per byte for 3.76x compression
Incremental dequantization only decompresses new tokens each decode step, keeping overhead at 1.78x
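
The first two steps can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the package's code: the function names are hypothetical, the rotation is built with Gram-Schmidt on Gaussian rows, and the codebook is fitted empirically on sampled unit-vector coordinates rather than derived from the Beta density. The dimension is reduced to 32 so it runs quickly without NumPy.

```python
import math
import random


def random_rotation(dim, seed):
    """Random orthogonal matrix built by Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # strip the components along earlier rows
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda k: abs(centroids[k] - x))


def lloyd_max(samples, levels, iters=20):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and mean update."""
    s = sorted(samples)
    # Initialise at evenly spaced quantiles of the sample.
    cents = [s[(2 * i + 1) * len(s) // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for x in s:
            k = nearest(cents, x)
            sums[k] += x
            counts[k] += 1
        cents = [sums[k] / counts[k] if counts[k] else cents[k]
                 for k in range(levels)]
    return cents


def quantize(vec, rot, cents):
    """Return (indices, norm): rotate the unit vector, snap each coordinate."""
    norm = math.sqrt(sum(a * a for a in vec))
    rotated = [sum(r * a / norm for r, a in zip(row, vec)) for row in rot]
    return [nearest(cents, y) for y in rotated], norm


def dequantize(indices, norm, rot, cents):
    """Look up centroids, undo the rotation (transpose), restore the norm."""
    y = [cents[k] for k in indices]
    dim = len(y)
    return [norm * sum(rot[i][j] * y[i] for i in range(dim)) for j in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


DIM, BITS = 32, 4                      # small DIM keeps pure Python fast
rng = random.Random(0)
rot = random_rotation(DIM, seed=42)    # fixed seed: the matrix can be regenerated

# Fit the codebook on coordinates of random unit vectors.
train = []
for _ in range(200):
    g = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    n = math.sqrt(sum(a * a for a in g))
    train.extend(a / n for a in g)
cents = lloyd_max(train, levels=2 ** BITS)

vec = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
indices, norm = quantize(vec, rot, cents)
restored = dequantize(indices, norm, rot, cents)
print(f"cosine similarity after 4-bit round trip: {cosine(vec, restored):.3f}")
```

Because the rotation depends only on the seed and the codebook only on the coordinate distribution, neither has to be stored per token; only the indices and the norm do.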

What Gets Compressed

Data	Compressed	Format
Key cache vectors	Yes	uint8 nibble-packed indices + fp32 norms
Value cache vectors	Yes	uint8 nibble-packed indices + fp32 norms
Rotation matrices	No	Generated once per layer from fixed seed
Lloyd-Max codebook	No	Computed once, shared across all layers
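
As a rough picture of the "uint8 nibble-packed indices + fp32 norms" format, the sketch below packs 128 four-bit indices into 64 bytes and appends a 4-byte norm. The nibble order (first index in the low nibble), the little-endian norm, and the choice to store the norm right after the indices are assumptions made for illustration; the plugin may well keep norms in a separate tensor. All function names are hypothetical.

```python
import struct


def pack_nibbles(indices):
    """Pack 4-bit indices (0..15) two per byte, first index in the low nibble."""
    if len(indices) % 2:
        raise ValueError("need an even number of indices")
    return bytes(lo | (hi << 4) for lo, hi in zip(indices[0::2], indices[1::2]))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: split every byte back into two indices."""
    out = []
    for b in packed:
        out.append(b & 0x0F)   # low nibble came first
        out.append(b >> 4)     # high nibble came second
    return out


def encode_vector(indices, norm):
    """Packed indices followed by one little-endian fp32 norm."""
    return pack_nibbles(indices) + struct.pack("<f", norm)


def decode_vector(blob):
    """Split a blob back into (indices, norm)."""
    (norm,) = struct.unpack("<f", blob[-4:])
    return unpack_nibbles(blob[:-4]), norm


indices = [i % 16 for i in range(128)]   # one index per coordinate, head_dim=128
blob = encode_vector(indices, 1.5)
print(len(blob))  # 68
assert decode_vector(blob) == (indices, 1.5)
```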

Roadmap

Core TurboQuant algorithm (Lloyd-Max, MSE quantizer, compressors)
CompressedDynamicCache with incremental dequantization
vLLM TQ4 attention backend plugin
Fused Triton kernels (17.8x Q@K^T speedup, Flash Attention fusion)
Container image with turboquant-vllm baked in
Full Flash Attention fusion with fp32 online softmax
SageAttention-style INT8 path

Documentation

Architecture -- Module map, dependency DAG, data flow diagrams
Roadmap -- Detailed implementation status and experiment results
Development Guide -- Setup, build, test, lint commands

Citation

@inproceedings{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

License

Apache 2.0

Project details

Maintainers

Alberto-Codes

Release history

Version	Released
1.5.0	Apr 8, 2026
1.4.1	Apr 4, 2026
1.4.0	Apr 1, 2026
1.3.0	Mar 31, 2026
1.2.2	Mar 30, 2026
1.2.1	Mar 30, 2026
1.2.0 (this version)	Mar 29, 2026
1.1.1	Mar 28, 2026
1.1.0	Mar 27, 2026
1.0.0	Mar 27, 2026
0.1.0	Mar 27, 2026

Download files

Source Distribution

turboquant_vllm-1.2.0.tar.gz (58.1 kB)

Uploaded Mar 29, 2026 Source

Built Distribution

turboquant_vllm-1.2.0-py3-none-any.whl (76.2 kB)

Uploaded Mar 29, 2026 Python 3

File details

Details for the file turboquant_vllm-1.2.0.tar.gz.

File metadata

Download URL: turboquant_vllm-1.2.0.tar.gz
Upload date: Mar 29, 2026
Size: 58.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for turboquant_vllm-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`df1fccd8cb2bc50ce71261fe01f73ce9b058b201624e954521b98fc55a1a2607`
MD5	`14abf4604f483572ef7b120d751cd5a0`
BLAKE2b-256	`edb1fda87255a127bee075764eb30b591277bb0c826bf704cd4840114e440526`

File details

Details for the file turboquant_vllm-1.2.0-py3-none-any.whl.

File metadata

Download URL: turboquant_vllm-1.2.0-py3-none-any.whl
Upload date: Mar 29, 2026
Size: 76.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for turboquant_vllm-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5e6f2c65206b855002b4456c9dc941a524bc369254362c49e96c8a56835c4be3`
MD5	`ce5f0a00e1bd23487ef8457d3251d86a`
BLAKE2b-256	`289ad37a6c26fe2470f8e3538546c62bb182e1b36910cbb23a66c47aaef71095`
