TurboQuant KV cache compression for local LLM inference
tqai
Compresses the KV cache to about 3 bits per channel, saving roughly 80% of cache memory (78-84% depending on configuration) with near-zero quality loss on 8B+ models. Supports both PyTorch (CPU/CUDA) and MLX (Apple Silicon).
Based on TurboQuant (Google Research, ICLR 2026).
Installation
# Homebrew (macOS)
brew install alphawavesystems/tap/tqai
# PyPI
pip install tqai
# With PyTorch backend
pip install tqai[torch]
# With MLX backend (Apple Silicon)
pip install tqai[mlx]
Quick Start
HuggingFace Transformers (PyTorch)
import tqai
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
# One line to enable KV cache compression
cache = tqai.patch(model, bits_k=4, bits_v=2)
inputs = tokenizer("Explain quantum entanglement:", return_tensors="pt")
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
MLX (Apple Silicon)
import tqai
import mlx_lm
model, tokenizer = mlx_lm.load("mlx-community/Llama-3.1-8B-Instruct-4bit")
# One line to enable KV cache compression
tqai.patch(model, bits_k=4, bits_v=2, backend="mlx")
response = mlx_lm.generate(model, tokenizer, prompt="Explain quantum entanglement:", max_tokens=200)
print(response)
# Restore original behaviour when done
tqai.unpatch(model)
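Since the patch stays active until `tqai.unpatch` is called, it can help to pair the two calls in a `try`/`finally` so the model is restored even if generation raises. The following is a minimal sketch: the `compressed` helper is hypothetical (it is not part of tqai) and takes the patch and unpatch callables as arguments, so it assumes nothing beyond the two calls shown above.

```python
from contextlib import contextmanager

@contextmanager
def compressed(model, patch, unpatch, **patch_kwargs):
    """Hypothetical helper: apply `patch` on entry, always call `unpatch` on exit."""
    result = patch(model, **patch_kwargs)
    try:
        # Hand back whatever patch() returned (for example a cache object).
        yield result
    finally:
        # Runs even if the body raises, so the model is never left patched.
        unpatch(model)
```

With tqai this could be used as `with compressed(model, tqai.patch, tqai.unpatch, bits_k=4, bits_v=2, backend="mlx"): ...`, with the generation call placed inside the `with` body.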
Compression Configs
| Config | Avg Bits | Memory Saved | Recommended For |
|---|---|---|---|
| bits_k=4, bits_v=2 | 3.0 | 80% | Production (best quality/compression balance) |
| bits_k=3, bits_v=2 | 2.5 | 84% | Extended context windows |
| bits_k=4, bits_v=3 | 3.5 | 78% | Quality-sensitive applications |
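The figures above can be approximated with simple arithmetic. The sketch below assumes an FP16 (16-bit) baseline, equal numbers of key and value coordinates, and one FP16 norm stored per vector of `head_dim` coordinates. The function name and `head_dim=128` are illustrative assumptions, and the estimate lands within about a percentage point of the listed savings rather than exactly on them.

```python
def kv_cache_estimate(bits_k, bits_v, head_dim=128, baseline_bits=16, norm_bits=16):
    """Return (average bits per coordinate, estimated fraction of memory saved)."""
    avg_bits = (bits_k + bits_v) / 2
    # Assumption: one norm per vector, amortised over head_dim coordinates.
    overhead = norm_bits / head_dim
    saved = 1 - (avg_bits + overhead) / baseline_bits
    return avg_bits, saved

for bits_k, bits_v in [(4, 2), (3, 2), (4, 3)]:
    avg, saved = kv_cache_estimate(bits_k, bits_v)
    print(f"bits_k={bits_k}, bits_v={bits_v}: {avg:.1f} avg bits, ~{saved:.0%} saved")
```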
How It Works
tqai implements Stage 1 of TurboQuant (PolarQuant):
- Random orthogonal rotation — Rotates KV vectors by a fixed Haar-distributed matrix to spread information across all coordinates
- Lloyd-Max scalar quantization — Quantizes each coordinate independently using precomputed optimal codebooks
- Norm preservation — Stores vector norms separately in FP16
No training, calibration, or model-specific tuning required. The same codebooks work for any model.
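The three steps can be illustrated with a small pure-Python sketch. This is not tqai's implementation: all function names are hypothetical, the codebook is fitted here on Gaussian samples instead of being loaded precomputed, and scaling coordinates by `sqrt(dim) / norm` before the codebook lookup is an assumption made so that a single unit-variance codebook applies.

```python
import math
import random

def random_orthogonal(dim, rng):
    """Gram-Schmidt on Gaussian vectors yields a Haar-distributed orthogonal matrix."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove components along the rows found so far
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows

def nearest(codebook, x):
    """Index of the codebook level closest to x."""
    return min(range(len(codebook)), key=lambda j: abs(codebook[j] - x))

def lloyd_max(samples, bits, iters=20):
    """Fit 2**bits levels by alternating nearest-level assignment and cell means."""
    levels = 2 ** bits
    ordered = sorted(samples)
    # Start from evenly spaced quantiles of the samples.
    codebook = [ordered[(2 * i + 1) * len(ordered) // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for x in samples:
            j = nearest(codebook, x)
            sums[j] += x
            counts[j] += 1
        codebook = [sums[j] / counts[j] if counts[j] else codebook[j] for j in range(levels)]
    return codebook

def quantize(vec, rotation, codebook):
    """Return (codes, norm): one small integer per coordinate plus the vector norm."""
    norm = math.sqrt(sum(a * a for a in vec))  # assumes vec is not all zeros
    scale = math.sqrt(len(vec)) / norm  # rotated coordinates become roughly N(0, 1)
    rotated = [sum(r * a for r, a in zip(row, vec)) for row in rotation]
    return [nearest(codebook, x * scale) for x in rotated], norm

def dequantize(codes, norm, rotation, codebook):
    """Look up the levels, undo the scaling, and rotate back with the transpose."""
    dim = len(codes)
    scale = norm / math.sqrt(dim)
    rotated = [codebook[c] * scale for c in codes]
    return [sum(rotation[i][j] * rotated[i] for i in range(dim)) for j in range(dim)]

rng = random.Random(0)
dim = 32
rotation = random_orthogonal(dim, rng)
codebook = lloyd_max([rng.gauss(0.0, 1.0) for _ in range(2000)], bits=4)

vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
codes, norm = quantize(vec, rotation, codebook)
restored = dequantize(codes, norm, rotation, codebook)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, restored))) / norm
print(f"relative reconstruction error at 4 bits: {error:.3f}")
```

Because the rotation is fixed and the codebook depends only on the bit width, nothing in this sketch is fitted to a particular model, which mirrors the "no calibration" property described above.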
Quality Results
Tested on Apple Silicon with various model sizes:
| Model | Baseline | + tqai K4/V2 | + tqai K3/V2 |
|---|---|---|---|
| Qwen 0.5B | Good | Degraded | Poor |
| Qwen 3B bf16 | Excellent | Good | Degraded |
| Llama 8B Q4 | Excellent | Excellent | Excellent |
| Qwen 14B Q4 | Excellent | Excellent | Excellent |
Quality is near-identical to baseline on models with 8B or more parameters. Smaller models degrade noticeably, especially at K3/V2.
CLI
tqai includes a command-line tool for quick testing without writing code:
# Show environment and library info
tqai info
# Run quantization accuracy benchmark
tqai benchmark
tqai benchmark --bits-k 3 --bits-v 2 --head-dim 128
# Generate text with TurboQuant compression
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit
tqai run "Explain gravity" --model Qwen/Qwen2.5-3B-Instruct --backend torch
# Compare baseline vs compressed output side by side
tqai compare "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit
# Pre-convert a model for faster startup
tqai convert --model mlx-community/Llama-3.1-8B-Instruct-4bit --output ./llama-tqai/
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit --tqai-config ./llama-tqai/
# Run without compression (baseline)
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit --no-tqai
Advanced Options
cache = tqai.patch(
    model,
    bits_k=4,          # Bits per key coordinate (2, 3, or 4)
    bits_v=2,          # Bits per value coordinate (2, 3, or 4)
    sink_tokens=4,     # Keep first N tokens uncompressed (attention sinks)
    backend="torch",   # Force backend: "torch" or "mlx"
    device="cuda",     # PyTorch device (ignored for MLX)
)
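To get a feel for what `sink_tokens` costs, the effective bits per coordinate can be estimated for a given context length. This sketch assumes the sink tokens are kept in FP16 and ignores the per-vector norms; the function name and both assumptions are illustrative and not taken from tqai.

```python
def effective_bits(seq_len, bits_k=4, bits_v=2, sink_tokens=4, full_bits=16):
    """Average bits per KV coordinate when the first sink_tokens stay uncompressed."""
    sinks = min(sink_tokens, seq_len)
    avg_compressed = (bits_k + bits_v) / 2
    total = sinks * full_bits + (seq_len - sinks) * avg_compressed
    return total / seq_len

for seq_len in (128, 4096):
    print(seq_len, round(effective_bits(seq_len), 3))
```

Under these assumptions the overhead of a few sink tokens fades quickly as the context grows.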
Running Tests
# Install dev dependencies
pip install tqai[dev]
# Unit + accuracy tests (175 tests, <1s)
pytest tests/ --ignore=tests/test_e2e_models.py --ignore=tests/test_e2e_large_models.py
# End-to-end with real models (requires model downloads)
pytest tests/test_e2e_models.py -v -s
Project Structure
src/tqai/
├── __init__.py # patch(), unpatch(), TurboQuantConfig
├── config.py # Configuration dataclass
├── quantizer.py # PolarQuantizer (core algorithm)
├── backend/ # PyTorch + MLX abstraction
├── codebook/ # Lloyd-Max codebooks (precomputed)
└── cache/ # HuggingFace + mlx-lm integrations
License
MIT
File details
Details for the file tqai-0.1.0.tar.gz.
File metadata
- Download URL: tqai-0.1.0.tar.gz
- Upload date:
- Size: 34.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d6e1fa0d21d6d5e426f5d33bd08811f745fd0bb33b708ec09852b380d2a4b1c7 |
| MD5 | 7abea73fe52a674b37e14e22ce98f7fd |
| BLAKE2b-256 | 0b39cfdf72bf1a4cde5d1ed65470e5cc2643f95a8a13537368cf035a4bd60b38 |
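A downloaded archive can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a generic sketch; the helper name and the local file path in the usage note are assumptions.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of("tqai-0.1.0.tar.gz")` should equal the SHA256 digest listed for that file if the download is intact.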
Provenance
The following attestation bundles were made for tqai-0.1.0.tar.gz:
Publisher: release.yml on AlphaWaveSystems/tqai
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tqai-0.1.0.tar.gz
- Subject digest: d6e1fa0d21d6d5e426f5d33bd08811f745fd0bb33b708ec09852b380d2a4b1c7
- Sigstore transparency entry: 1237305558
- Sigstore integration time:
- Permalink: AlphaWaveSystems/tqai@2bcf7ac8f86f6881ba9e1f93286d14061d2a4092
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/AlphaWaveSystems
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2bcf7ac8f86f6881ba9e1f93286d14061d2a4092
- Trigger Event: push
File details
Details for the file tqai-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tqai-0.1.0-py3-none-any.whl
- Upload date:
- Size: 29.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 375328ad51d96614446bc1f3575fce7d3a98f5c98d816fd7e269b324c86982de |
| MD5 | 8f7ddbfde5798673ef85fd3a3748ab27 |
| BLAKE2b-256 | 7294e6a3faf08c5143a74b46c0f6f19b7c411ab828b754f31bdb4a83620e3fb2 |
Provenance
The following attestation bundles were made for tqai-0.1.0-py3-none-any.whl:
Publisher: release.yml on AlphaWaveSystems/tqai
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tqai-0.1.0-py3-none-any.whl
- Subject digest: 375328ad51d96614446bc1f3575fce7d3a98f5c98d816fd7e269b324c86982de
- Sigstore transparency entry: 1237305577
- Sigstore integration time:
- Permalink: AlphaWaveSystems/tqai@2bcf7ac8f86f6881ba9e1f93286d14061d2a4092
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/AlphaWaveSystems
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2bcf7ac8f86f6881ba9e1f93286d14061d2a4092
- Trigger Event: push