TurboQuant KV cache compression for vLLM — fused Triton kernels, 3.76x compression, 3.7x faster decode on RTX 4090

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

These details have not been verified by PyPI

Project description

TurboQuant Consumer

Implementation of Google's TurboQuant algorithm (arXiv 2504.19874, ICLR 2026) for compressing transformer KV caches on consumer GPUs. Validated on Molmo2 vision-language models with real video inference on an RTX 4090.

Headline Results

3.76x KV cache compression with near-identical output quality on Molmo2-4B processing 11K-token Seinfeld video clips:

Mode	KV Cache	Compression	Output Quality	Overhead
FP16 baseline	1,639 MiB	1.0x	--	--
TQ3 (3-bit uint8)	845 MiB	1.94x	Coherent, different details	2.35x slower
TQ4 full-cache dequant	435 MiB	3.76x	Near-identical (100+ tokens match)	3.36x slower
TQ4 incremental dequant	435 MiB	3.76x	Near-identical (100+ tokens match)	1.78x slower

First TurboQuant implementation validated on a vision-language model (VLM) with video input.

What's Here

Core algorithm -- Lloyd-Max codebook solver, TurboQuantMSE (Stage 1), TurboQuantProd (Stage 2 with QJL correction)
CompressedDynamicCache -- Drop-in KV cache wrapper storing uint8 indices + fp32 norms with incremental dequantization (only new tokens per decode step). At bits=4, indices are nibble-packed (two per byte) for 3.76x compression at 1.78x overhead.
Benchmark harness -- A/B testing CLI comparing baseline vs compressed on any HuggingFace model
62 tests -- Including long-sequence regression tests (36 layers, 1024 tokens) that catch precision bugs

Quickstart

# Install
git clone https://github.com/Alberto-Codes/turboquant-consumer.git
cd turboquant-consumer
uv sync

# Run tests
uv run pytest tests/ -v

# Benchmark on Molmo2-4B (requires GPU + model weights)
uv run python -m turboquant_consumer.benchmark \
    --model allenai/Molmo2-4B \
    --bits 4 --compressed \
    --video /path/to/video.mp4 \
    --max-new-tokens 256

Usage

from transformers import DynamicCache
from turboquant_consumer import CompressedDynamicCache

# Wrap any HuggingFace DynamicCache
cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)

# Pass cache (not the wrapper) to model.generate()
# Compression happens transparently on every cache.update()

Key Findings

FP16 norms are a trap. At 10K+ tokens across 36 layers, fp16 norm precision loss compounds and flips low-confidence logits. Always use fp32.
QJL is invisible in drop-in mode. Standard attention does Q @ K.T on decompressed keys -- QJL correction only helps with a custom attention kernel. Using QJL wastes 1 bit of MSE resolution.
TQ4 nibble beats TQ3 unpacked. 4-bit with nibble packing gives 3.76x compression and ~97% cosine similarity. 3-bit unpacked gives only 1.94x at ~95%. Packing 3-bit indices across byte boundaries is hard and only 30% better.
Peak VRAM is activation-dominated. KV cache is ~9% of peak VRAM during prefill. Compression savings are real in permanent storage but invisible to max_memory_allocated().

Hardware Tested

Component	Spec
GPU	NVIDIA RTX 4090 (24 GB GDDR6X)
CPU	AMD 7800X3D
RAM	128 GB DDR5
Model	Molmo2-4B (bfloat16)
Workload	Seinfeld clips, ~11K visual tokens at 2fps

Docs

docs/ARCHITECTURE.md -- Module map, dependency DAG, data flow diagrams, design decisions
docs/ROADMAP.md -- Implementation status, next steps, key lessons
experiments/logs/ -- All 5 experiment logs with full results

Fused Triton Kernel (WIP)

The current production path uses incremental dequantization (P3): only new tokens are dequantized each decode step, reducing overhead from 3.36x to 1.78x without any custom kernels. The fused Triton kernel below is a future optimization path that fuses nibble unpacking, centroid lookup, and rotation (pre-rotation trick) into a single GPU pass:

Metric	Result
Q@K^T micro-benchmark speedup	17.8x at 11K tokens
Cosine similarity vs unfused reference	1.0 (exact match)
Single-layer Molmo2-4B integration	Correct output
Multi-layer integration	WIP -- needs full Flash Attention-style fusion (fused softmax+V)

Key finding: A fused Q@K^T-only kernel does not maintain SDPA precision when composed across 36 layers. Full Flash Attention-style fusion (Q@K^T + softmax + @V in one kernel) is required for multi-layer correctness.

Status

Pre-alpha / WIP. The implementation is validated end-to-end with 3.76x compression and 1.78x overhead (Experiment 005, incremental dequantization). The fused Triton kernel achieves 17.8x on the Q@K^T micro-benchmark with perfect cosine similarity, and single-layer integration on Molmo2-4B produces correct output. Multi-layer integration is in progress -- it requires full Flash Attention-style fusion (softmax+V) to maintain precision across all 36 layers.

Reference

@inproceedings{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Han, Insu and Daliri, Majid and Karbasi, Amin},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}

License

MIT

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Alberto-Codes

These details have not been verified by PyPI

Release history Release notifications | RSS feed

1.5.0

Apr 8, 2026

1.4.1

Apr 4, 2026

1.4.0

Apr 1, 2026

1.3.0

Mar 31, 2026

1.2.2

Mar 30, 2026

1.2.1

Mar 30, 2026

1.2.0

Mar 29, 2026

1.1.1

Mar 28, 2026

1.1.0

Mar 27, 2026

1.0.0

Mar 27, 2026

This version

0.1.0

Mar 27, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turboquant_vllm-0.1.0.tar.gz (46.4 kB view details)

Uploaded Mar 27, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

turboquant_vllm-0.1.0-py3-none-any.whl (60.2 kB view details)

Uploaded Mar 27, 2026 Python 3

File details

Details for the file turboquant_vllm-0.1.0.tar.gz.

File metadata

Download URL: turboquant_vllm-0.1.0.tar.gz
Upload date: Mar 27, 2026
Size: 46.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for turboquant_vllm-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`92aa67633934bf571413170420df3a60f90274bbd884d1dc3a2c9fedfbe23371`
MD5	`c310cc92f1aa76785769aa30aab66ac5`
BLAKE2b-256	`9121513861f94ae3e009bb9af6b04331d5c99a89128e9092568fa06db3d9407a`

See more details on using hashes here.

File details

Details for the file turboquant_vllm-0.1.0-py3-none-any.whl.

File metadata

Download URL: turboquant_vllm-0.1.0-py3-none-any.whl
Upload date: Mar 27, 2026
Size: 60.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for turboquant_vllm-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6f947e497feb1c51566db6cfc529fc4a5ff2c8c5acdc616447f0eb1656346016`
MD5	`230e217a9bba703d791ceea3161bbff1`
BLAKE2b-256	`d09f85b5aed01886abe47e8f4a401683cabe8a7ab23540e18ec3d31434a1d0ea`

See more details on using hashes here.

turboquant-vllm 0.1.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

TurboQuant Consumer

Headline Results

What's Here

Quickstart

Usage

Key Findings

Hardware Tested

Docs

Fused Triton Kernel (WIP)

Status

Reference

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes