
Project description

TurboQuant-GPU

5.02x KV cache compression for LLM inference — cuTile kernels with automatic PyTorch fallback.

pip install turboquant-gpu

Works on any NVIDIA GPU. Uses cuTile kernels when available, otherwise falls back to PyTorch automatically — no driver upgrades or manual config needed.

quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant_gpu import TurboQuantEngine
import torch

model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(model_id)

# head_dim must match the model (128 for Mistral-7B); total_bits is bits per element
engine = TurboQuantEngine(head_dim=128, total_bits=3, device="cuda")
result = engine.generate(model, tok, "The University of Waterloo is known for ")

print(result["text"])
print(f"{result['tokens']} tokens | {result['stats']['ratio']:.2f}x compression")
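For intuition, the reported ratio can be sanity-checked with back-of-the-envelope arithmetic. The assumptions here are mine, not the package's documented layout: 3-bit codes plus a single fp16 norm per head_dim-sized vector and no other metadata. The engine's real bookkeeping evidently costs slightly more, since it reports 5.02x rather than this idealized figure:

```python
head_dim = 128

fp16_bits = head_dim * 16        # original cache: one fp16 per element
comp_bits = head_dim * 3 + 16    # 3-bit codes + one fp16 norm per vector (assumed)

ratio = fp16_bits / comp_bits    # 2048 / 400 = 5.12
```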

install

pip install turboquant-gpu

For cuTile acceleration (optional, requires CUDA 13.0+ driver):

pip install cuda-tile[tileiras] --extra-index-url https://pypi.nvidia.com

If you skip cuda-tile or your driver is older, everything still works via PyTorch.

how it works

Implements the TurboQuant algorithm:

  1. normalize + rotate — random orthogonal rotation (Pi) makes coordinates near-Gaussian
  2. Lloyd-Max quantize — optimal scalar quantization against N(0, 1/d)

Both keys and values are compressed to 3 bits per element via MSE-optimal Lloyd-Max quantization, then reconstructed for standard attention. The package also includes fused attention kernels with QJL bias correction (a 1-bit sign sketch of the quantization residual), but these are not yet exposed in the high-level API.
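The two steps can be sketched in plain NumPy. This is illustrative only: it mirrors the math, not the package's cuTile kernels, and the codebook is fit empirically with Lloyd's algorithm on Gaussian samples rather than taken from the package's analytic Lloyd-Max levels.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # head_dim

# Step 1 ingredient: a random orthogonal rotation Pi (QR of a Gaussian matrix).
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))

def lloyd_max_codebook(samples, levels=8, iters=25):
    """Fit an MSE-optimal scalar codebook with Lloyd's algorithm."""
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                centroids[k] = samples[idx == k].mean()
    return np.sort(centroids)

# 3 bits -> 8 levels, trained against N(0, 1/d): after normalize + rotate,
# each coordinate of a unit vector is near-Gaussian with variance 1/d.
codebook = lloyd_max_codebook(rng.standard_normal(100_000) / np.sqrt(d))

def compress(x):
    norm = np.linalg.norm(x)
    z = Pi @ (x / norm)  # step 1: normalize + rotate
    codes = np.abs(z[:, None] - codebook[None, :]).argmin(axis=1)  # step 2
    return norm, codes.astype(np.uint8)

def decompress(norm, codes):
    return norm * (Pi.T @ codebook[codes])  # dequantize -> un-rotate -> rescale

x = rng.standard_normal(d)
norm, codes = compress(x)
err = np.linalg.norm(x - decompress(norm, codes)) / np.linalg.norm(x)
```

Because the rotation makes every coordinate statistically alike, one shared scalar codebook is near-optimal for all of them; attention then runs on the reconstructed (approximate) keys and values.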

step-by-step api

engine = TurboQuantEngine(head_dim=128, total_bits=3, device="cuda")

# after model prefill (e.g. out = model(input_ids, use_cache=True)):
compressed = engine.compress_kv_cache(out.past_key_values)
cache      = engine.build_cache(compressed)
stats      = engine.compression_stats(out.past_key_values)

# or just do it all in one call:
result = engine.generate(model, tokenizer, "your prompt here")

# auto-tune for your specific GPU:
engine.auto_tune(seq_len=512)

gpu support

Written in cuTile for cross-architecture portability. Falls back to PyTorch if cuTile or a compatible driver isn't available.

GPU                            cuTile kernels                 PyTorch fallback
A100 (Ampere, sm_80)           CUDA 13.2+ driver              always works
H100 (Hopper, sm_90)           not yet supported by tileiras  always works
RTX 4090 (Ada, sm_89)          CUDA 13.2+ driver              always works
B200/B300 (Blackwell, sm_100)  CUDA 13.0+ driver              always works
Any other CUDA GPU             depends on tileiras            always works

kernels

kernel             what it does                                          status
compress_kv_3bit   fused K+V compression in a single kernel launch       used (default)
compress_keys      key-only: normalize → rotate → Lloyd-Max → QJL signs  fallback
compress_values    value-only: normalize → rotate → Lloyd-Max            fallback
decompress_values  dequantize → un-rotate (Pi) → scale by norms          used
attention_scores   asymmetric dot product with QJL correction            included, not in API
fused_attention    scores + online softmax + V accumulation              included, not in API
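The QJL correction in the attention kernels can be illustrated in NumPy. The dot product is asymmetric: the query stays in full precision while the key is quantized. Storing just the signs of the quantization residual plus one scalar lets the score estimate claw back most of the residual's contribution. This is a stand-in sketch with assumptions of mine (a uniform quantizer instead of Lloyd-Max, a mean-|residual| scale), not the package's kernel logic:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128
k = rng.standard_normal(d)        # a key vector

# Stand-in coarse quantizer for the key (the real kernels use Lloyd-Max codes).
step = 0.5
k_hat = np.round(k / step) * step
r = k - k_hat                     # quantization residual

signs = np.sign(r)                # the 1-bit sign sketch of the residual
scale = np.abs(r).mean()          # one extra scalar per vector

# Compare score error with and without the correction over random queries.
errs = []
for _ in range(300):
    q = rng.standard_normal(d)
    exact = q @ k
    naive = q @ k_hat                             # ignores the residual
    corrected = q @ k_hat + scale * (q @ signs)   # QJL-style correction
    errs.append((abs(naive - exact), abs(corrected - exact)))
mean_naive, mean_corrected = np.mean(errs, axis=0)
```

Since the residual's per-coordinate magnitudes are roughly uniform, `scale * signs` captures most of the residual vector, which is why the corrected score error is consistently smaller on average.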

license

MIT

Download files

Download the file for your platform.

Source Distribution

turboquant_gpu-0.1.4.tar.gz (18.7 kB)


Built Distribution

turboquant_gpu-0.1.4-py3-none-any.whl (13.9 kB)


File details

Details for the file turboquant_gpu-0.1.4.tar.gz.

File metadata

  • Download URL: turboquant_gpu-0.1.4.tar.gz
  • Upload date:
  • Size: 18.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for turboquant_gpu-0.1.4.tar.gz
Algorithm Hash digest
SHA256 f0b9482e3c6c6a272e266116b97b2c7b6ba221b7a37164c67a5406d1cb05f52e
MD5 23388c1fa76120f23f37c5c0e59cfb3e
BLAKE2b-256 a66f5e5ab111ab436c29dd21d6189f383f5d985cc6e14df7cc28d6496d9ae026


File details

Details for the file turboquant_gpu-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: turboquant_gpu-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 13.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for turboquant_gpu-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 2f5da31768f653f6f95623beace174d520045d4f63b142c40d27b874c57c3ee9
MD5 e3ff84bf8a3375d7fe7d22b634994a8a
BLAKE2b-256 c1a95b35ffe845bf94002fb9cb6e0fa206de2509162664330c67201782785c19

