
Project description

TurboQuant-GPU

5.02x KV cache compression for LLM inference — cuTile kernels with automatic PyTorch fallback.

pip install turboquant-gpu

Works on any NVIDIA GPU. Uses cuTile kernels when available, otherwise falls back to PyTorch automatically — no driver upgrades or manual config needed.

quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant_gpu import TurboQuantEngine
import torch

model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(model_id)

# head_dim must match the model (128 for Mistral-7B); total_bits is bits per element
engine = TurboQuantEngine(head_dim=128, total_bits=3, device="cuda")
result = engine.generate(model, tok, "The University of Waterloo is known for ")

print(result["text"])
print(f"{result['tokens']} tokens | {result['stats']['ratio']:.2f}x compression")
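For intuition, the reported ratio can be sanity-checked with back-of-the-envelope arithmetic. The assumptions here are mine, not the package's documented layout: 3-bit codes plus a single fp16 norm per head_dim-sized vector and no other metadata. The engine's real bookkeeping evidently costs slightly more, since it reports 5.02x rather than this idealized figure:

```python
head_dim = 128

fp16_bits = head_dim * 16        # original cache: one fp16 per element
comp_bits = head_dim * 3 + 16    # 3-bit codes + one fp16 norm per vector (assumed)

ratio = fp16_bits / comp_bits    # 2048 / 400 = 5.12
```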

install

pip install turboquant-gpu

For cuTile acceleration (optional, requires CUDA 13.0+ driver):

pip install cuda-tile[tileiras] --extra-index-url https://pypi.nvidia.com

If you skip cuda-tile or your driver is older, everything still works via PyTorch.

how it works

Implements the TurboQuant algorithm:

  1. normalize + rotate — random orthogonal rotation (Pi) makes coordinates near-Gaussian
  2. Lloyd-Max quantize — optimal scalar quantization against N(0, 1/d)

Both keys and values are compressed to 3 bits per element via MSE-optimal Lloyd-Max quantization, then reconstructed for standard attention. The package also includes fused attention kernels with QJL bias correction (a 1-bit sign sketch of the quantization residual), but these are not yet exposed in the high-level API.
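The two steps can be sketched in plain NumPy. This is illustrative only: it mirrors the math, not the package's cuTile kernels, and the codebook is fit empirically with Lloyd's algorithm on Gaussian samples rather than taken from the package's analytic Lloyd-Max levels.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # head_dim

# Step 1 ingredient: a random orthogonal rotation Pi (QR of a Gaussian matrix).
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))

def lloyd_max_codebook(samples, levels=8, iters=25):
    """Fit an MSE-optimal scalar codebook with Lloyd's algorithm."""
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                centroids[k] = samples[idx == k].mean()
    return np.sort(centroids)

# 3 bits -> 8 levels, trained against N(0, 1/d): after normalize + rotate,
# each coordinate of a unit vector is near-Gaussian with variance 1/d.
codebook = lloyd_max_codebook(rng.standard_normal(100_000) / np.sqrt(d))

def compress(x):
    norm = np.linalg.norm(x)
    z = Pi @ (x / norm)  # step 1: normalize + rotate
    codes = np.abs(z[:, None] - codebook[None, :]).argmin(axis=1)  # step 2
    return norm, codes.astype(np.uint8)

def decompress(norm, codes):
    return norm * (Pi.T @ codebook[codes])  # dequantize -> un-rotate -> rescale

x = rng.standard_normal(d)
norm, codes = compress(x)
err = np.linalg.norm(x - decompress(norm, codes)) / np.linalg.norm(x)
```

Because the rotation makes every coordinate statistically alike, one shared scalar codebook is near-optimal for all of them; attention then runs on the reconstructed (approximate) keys and values.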

step-by-step api

engine = TurboQuantEngine(head_dim=128, total_bits=3, device="cuda")

# after model prefill (e.g. out = model(input_ids, use_cache=True)):
compressed = engine.compress_kv_cache(out.past_key_values)
cache      = engine.build_cache(compressed)
stats      = engine.compression_stats(out.past_key_values)

# or just do it all in one call:
result = engine.generate(model, tokenizer, "your prompt here")

# auto-tune for your specific GPU:
engine.auto_tune(seq_len=512)

gpu support

Written in cuTile for cross-architecture portability. Falls back to PyTorch if cuTile or a compatible driver isn't available.

GPU                            cuTile kernels                 PyTorch fallback
A100 (Ampere, sm_80)           CUDA 13.2+ driver              always works
H100 (Hopper, sm_90)           not yet supported by tileiras  always works
RTX 4090 (Ada, sm_89)          CUDA 13.2+ driver              always works
B200/B300 (Blackwell, sm_100)  CUDA 13.0+ driver              always works
Any other CUDA GPU             depends on tileiras            always works

kernels

kernel             what it does                                          status
compress_kv_3bit   fused K+V compression in a single kernel launch       used (default)
compress_keys      key-only: normalize → rotate → Lloyd-Max → QJL signs  fallback
compress_values    value-only: normalize → rotate → Lloyd-Max            fallback
decompress_values  dequantize → un-rotate (Pi) → scale by norms          used
attention_scores   asymmetric dot product with QJL correction            included, not in API
fused_attention    scores + online softmax + V accumulation              included, not in API
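The QJL correction in the attention kernels can be illustrated in NumPy. The dot product is asymmetric: the query stays in full precision while the key is quantized. Storing just the signs of the quantization residual plus one scalar lets the score estimate claw back most of the residual's contribution. This is a stand-in sketch with assumptions of mine (a uniform quantizer instead of Lloyd-Max, a mean-|residual| scale), not the package's kernel logic:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128
k = rng.standard_normal(d)        # a key vector

# Stand-in coarse quantizer for the key (the real kernels use Lloyd-Max codes).
step = 0.5
k_hat = np.round(k / step) * step
r = k - k_hat                     # quantization residual

signs = np.sign(r)                # the 1-bit sign sketch of the residual
scale = np.abs(r).mean()          # one extra scalar per vector

# Compare score error with and without the correction over random queries.
errs = []
for _ in range(300):
    q = rng.standard_normal(d)
    exact = q @ k
    naive = q @ k_hat                             # ignores the residual
    corrected = q @ k_hat + scale * (q @ signs)   # QJL-style correction
    errs.append((abs(naive - exact), abs(corrected - exact)))
mean_naive, mean_corrected = np.mean(errs, axis=0)
```

Since the residual's per-coordinate magnitudes are roughly uniform, `scale * signs` captures most of the residual vector, which is why the corrected score error is consistently smaller on average.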

license

MIT

Download files

Download the file for your platform.

Source Distribution

turboquant_gpu-0.1.4.tar.gz (18.7 kB)


Built Distribution

turboquant_gpu-0.1.4-py3-none-any.whl (13.9 kB)


File details

Details for the file turboquant_gpu-0.1.4.tar.gz.

File metadata

  • Download URL: turboquant_gpu-0.1.4.tar.gz
  • Upload date:
  • Size: 18.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for turboquant_gpu-0.1.4.tar.gz
Algorithm Hash digest
SHA256 f0b9482e3c6c6a272e266116b97b2c7b6ba221b7a37164c67a5406d1cb05f52e
MD5 23388c1fa76120f23f37c5c0e59cfb3e
BLAKE2b-256 a66f5e5ab111ab436c29dd21d6189f383f5d985cc6e14df7cc28d6496d9ae026


File details

Details for the file turboquant_gpu-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: turboquant_gpu-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 13.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for turboquant_gpu-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 2f5da31768f653f6f95623beace174d520045d4f63b142c40d27b874c57c3ee9
MD5 e3ff84bf8a3375d7fe7d22b634994a8a
BLAKE2b-256 c1a95b35ffe845bf94002fb9cb6e0fa206de2509162664330c67201782785c19

