# turboquant-gpu

5.02x KV cache compression for LLM inference — cuTile kernels with automatic PyTorch fallback.

```bash
pip install turboquant-gpu
```

Works on any NVIDIA GPU. Uses cuTile kernels when available, otherwise falls back to PyTorch automatically — no driver upgrades or manual config needed.
## quick start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant_gpu import TurboQuantEngine

model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(model_id)

engine = TurboQuantEngine(head_dim=128, total_bits=3, device="cuda")
result = engine.generate(model, tok, "The University of Waterloo is known for ")

print(result["text"])
print(f"{result['tokens']} tokens | {result['stats']['ratio']:.2f}x compression")
```
## install

```bash
pip install turboquant-gpu
```

For cuTile acceleration (optional, requires a CUDA 13.0+ driver):

```bash
pip install cuda-tile[tileiras] --extra-index-url https://pypi.nvidia.com
```

If you skip cuda-tile or your driver is older, everything still works via the PyTorch fallback.
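If you want to confirm which path your environment will take before running a model, you can probe for the optional package at import time. This is a minimal sketch; the module name `cuda_tile` is an assumption inferred from the pip package name `cuda-tile`, not something this package documents:

```python
import importlib.util

def cutile_available() -> bool:
    # Probe for the optional cuTile package without importing it.
    # "cuda_tile" as the module name is an assumption from the pip name "cuda-tile".
    return importlib.util.find_spec("cuda_tile") is not None

backend = "cuTile kernels" if cutile_available() else "PyTorch fallback"
print(f"turboquant-gpu backend: {backend}")
```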
## how it works

Implements the TurboQuant algorithm:

- **normalize + rotate** — each vector is scaled to unit norm, then a random orthogonal rotation (Pi) makes its coordinates near-Gaussian
- **Lloyd-Max quantize** — MSE-optimal scalar quantization against N(0, 1/d)

Both keys and values are compressed to 3 bits per coordinate via Lloyd-Max quantization, then reconstructed for standard attention. The package also includes fused attention kernels with QJL bias correction (a 1-bit sign sketch of the quantization residual), but these are not yet exposed in the high-level API.
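The pipeline above can be sketched in plain numpy. This is illustrative only — the codebook is fitted with Lloyd's algorithm on Gaussian samples rather than taken from the package, and the real kernels fuse all of these steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def lloyd_max_codebook(samples, bits=3, iters=30):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid re-estimation to minimize MSE on the sample distribution."""
    levels = 2 ** bits
    # initialize centroids at evenly spaced quantiles of the samples
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(levels):
            if np.any(idx == j):
                codebook[j] = samples[idx == j].mean()
    return np.sort(codebook)

def random_rotation(d, rng):
    # QR of a Gaussian matrix gives a random orthogonal matrix (the "Pi" rotation)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform (Haar) rotation

d = 128
k = rng.standard_normal(d) * 3.0            # a toy key vector
norm = np.linalg.norm(k)
rot = random_rotation(d, rng)
x = rot @ (k / norm)                        # normalize + rotate -> coords ~ N(0, 1/d)

# fit a 3-bit codebook against the N(0, 1/d) coordinate distribution
codebook = lloyd_max_codebook(rng.standard_normal(100_000) / np.sqrt(d), bits=3)
codes = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)   # 3-bit codes, 0..7

k_hat = norm * (rot.T @ codebook[codes])    # dequantize -> un-rotate -> rescale
rel_err = np.linalg.norm(k - k_hat) / np.linalg.norm(k)
print(f"relative reconstruction error: {rel_err:.3f}")
```

At 3 bits the Lloyd-Max distortion for a Gaussian source is a few percent of the signal energy, so the relative reconstruction error lands around 0.2 — small enough for attention scores to survive.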
## step-by-step api

```python
engine = TurboQuantEngine(head_dim=128, total_bits=3, device="cuda")

# after a prefill forward pass (e.g. out = model(input_ids, use_cache=True)):
compressed = engine.compress_kv_cache(out.past_key_values)
cache = engine.build_cache(compressed)
stats = engine.compression_stats(out.past_key_values)

# or just do it all in one call:
result = engine.generate(model, tokenizer, "your prompt here")

# auto-tune for your specific GPU:
engine.auto_tune(seq_len=512)
```
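For intuition on the headline ratio: fp16 stores 16 bits per coordinate, so 3-bit codes give an ideal 16/3 ≈ 5.33x, and per-vector metadata (norms, scales) eats into that — consistent with the reported 5.02x. A back-of-envelope sketch, where the overhead breakdown is illustrative and not the package's exact layout:

```python
# Back-of-envelope KV compression ratio (overhead values are illustrative).
head_dim = 128
fp16_bits = 16 * head_dim        # 2048 bits per vector uncompressed
code_bits = 3 * head_dim         # 384 bits of 3-bit codes
overhead_bits = 2 * 16           # e.g. one fp16 norm + one fp16 scale per vector

ideal = fp16_bits / code_bits
actual = fp16_bits / (code_bits + overhead_bits)
print(f"ideal: {ideal:.2f}x, with per-vector overhead: {actual:.2f}x")
```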
## gpu support

Written in cuTile for cross-architecture portability. Falls back to PyTorch if cuTile or a compatible driver isn't available.
| GPU | cuTile kernels | PyTorch fallback |
|---|---|---|
| A100 (Ampere, sm_80) | CUDA 13.2+ driver | always works |
| H100 (Hopper, sm_90) | not yet supported by tileiras | always works |
| RTX 4090 (Ada, sm_89) | CUDA 13.2+ driver | always works |
| B200/B300 (Blackwell, sm_100) | CUDA 13.0+ driver | always works |
| Any other CUDA GPU | depends on tileiras | always works |
## kernels

| kernel | what it does | status |
|---|---|---|
| `compress_kv_3bit` | fused K+V compression in a single kernel launch | used (default) |
| `compress_keys` | key-only: normalize → rotate → Lloyd-Max → QJL signs | fallback |
| `compress_values` | value-only: normalize → rotate → Lloyd-Max | fallback |
| `decompress_values` | dequantize → un-rotate (Pi) → scale by norms | used |
| `attention_scores` | asymmetric dot product with QJL correction | included, not in API |
| `fused_attention` | scores + online softmax + V accumulation | included, not in API |
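The QJL correction idea — keeping a 1-bit sign sketch of the quantization residual to de-bias dot products — can be illustrated with a toy numpy experiment. Coarse rounding stands in for the real 3-bit quantizer here; this is a sketch of the principle, not the package's kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
d, trials, step = 128, 2000, 0.5
err_plain, err_corr = [], []

for _ in range(trials):
    q = rng.standard_normal(d)
    k = rng.standard_normal(d)

    # Coarse rounding stands in for the 3-bit Lloyd-Max quantizer.
    k_hat = np.round(k / step) * step
    r = k - k_hat                      # quantization residual

    # Stored correction state: 1 bit per coordinate (the sign) plus one norm.
    signs = np.sign(r)
    # Reconstruct each residual coordinate as sign * E[|r_i|], with
    # E[|r_i|] estimated as sqrt(2/pi) * ||r|| / sqrt(d) (Gaussian model).
    r_hat = signs * np.sqrt(2 / np.pi) * np.linalg.norm(r) / np.sqrt(d)

    exact = q @ k
    err_plain.append(abs(q @ k_hat - exact))           # dequantized keys only
    err_corr.append(abs(q @ k_hat + q @ r_hat - exact))  # + sign-sketch correction

print(f"mean |score error| without correction: {np.mean(err_plain):.3f}")
print(f"mean |score error| with correction:    {np.mean(err_corr):.3f}")
```

At the cost of one extra bit per coordinate, the sign sketch recovers a large fraction of the residual's contribution to the attention score, which is why the asymmetric-score kernels carry it.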
## license

MIT