TurboQuant KV cache compression for local LLM inference

tqai

License: MIT | Python 3.10+

Compresses the KV cache to ~3 bits per channel, saving roughly 78–84% of KV cache memory depending on the configuration, with near-zero quality loss on 8B+ models. Supports both PyTorch (CPU/CUDA) and MLX (Apple Silicon).

Based on TurboQuant (Google Research, ICLR 2026).

Installation

# Homebrew (macOS)
brew install alphawavesystems/tap/tqai

# PyPI
pip install tqai

# With PyTorch backend (quoted so shells such as zsh do not glob the brackets)
pip install "tqai[torch]"

# With MLX backend (Apple Silicon)
pip install "tqai[mlx]"

Quick Start

HuggingFace Transformers (PyTorch)

import tqai
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# One line to enable KV cache compression
cache = tqai.patch(model, bits_k=4, bits_v=2)

inputs = tokenizer("Explain quantum entanglement:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))

MLX (Apple Silicon)

import tqai
import mlx_lm

model, tokenizer = mlx_lm.load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# One line to enable KV cache compression
tqai.patch(model, bits_k=4, bits_v=2, backend="mlx")

response = mlx_lm.generate(model, tokenizer, prompt="Explain quantum entanglement:", max_tokens=200)
print(response)

# Restore original behaviour when done
tqai.unpatch(model)

Compression Configs

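| Config             | Avg Bits | Memory Saved | Recommended For                               |
|--------------------|----------|--------------|-----------------------------------------------|
| bits_k=4, bits_v=2 | 3.0      | 80%          | Production (best quality/compression balance) |
| bits_k=3, bits_v=2 | 2.5      | 84%          | Extended context windows                      |
| bits_k=4, bits_v=3 | 3.5      | 78%          | Quality-sensitive applications                |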
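
The figures in the table can be roughly reproduced with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes an FP16 (16-bit) baseline cache, a head dimension of 128, and one FP16 norm stored per key and per value vector. The helper name `kv_savings` is hypothetical and not part of tqai.

```python
def kv_savings(bits_k, bits_v, head_dim=128, baseline_bits=16, norm_bits=16):
    """Return (average payload bits per coordinate, fraction of memory saved).

    Assumptions (not taken from tqai's code): FP16 baseline, and one
    norm of `norm_bits` stored per vector of `head_dim` coordinates.
    """
    avg_bits = (bits_k + bits_v) / 2
    # Amortize the per-vector norm over all coordinates of that vector.
    effective_bits = avg_bits + norm_bits / head_dim
    saved = 1 - effective_bits / baseline_bits
    return avg_bits, saved


for bits_k, bits_v in [(4, 2), (3, 2), (4, 3)]:
    avg, saved = kv_savings(bits_k, bits_v)
    print(f"bits_k={bits_k}, bits_v={bits_v}: avg {avg:.1f} bits, ~{saved:.0%} saved")
```

Under these assumptions the computed savings land within about one percentage point of the table; the exact numbers depend on head dimension and on how norms and sink tokens are stored.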

How It Works

TurboQuant consists of two stages: PolarQuant (the core compression) and QJL (a residual correction). tqai implements PolarQuant and deliberately omits QJL, which independent research found to degrade softmax-based attention quality. PolarQuant itself has three steps:

  1. Random orthogonal rotation — Rotates KV vectors by a fixed Haar-distributed matrix to spread information uniformly across all coordinates
  2. Lloyd-Max scalar quantization — Quantizes each coordinate independently using precomputed optimal codebooks derived from the known post-rotation distribution
  3. Norm preservation — Stores vector norms separately in FP16 for lossless magnitude reconstruction

No training, calibration, or model-specific tuning required. Fully data-oblivious — the same codebooks work for any model.
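
The three steps above can be sketched in plain Python. This is a toy illustration, not tqai's implementation: the function names are hypothetical, the rotation is built with Gram-Schmidt on Gaussian rows, and the codebook is fitted with Lloyd's algorithm on samples from N(0, 1/dim), which is assumed here as an approximation of the post-rotation coordinate distribution of a unit vector.

```python
import math
import random


def random_rotation(dim, rng):
    """Random orthogonal matrix: Gram-Schmidt applied to Gaussian rows."""
    rows = []
    for _ in range(dim):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in rows:  # remove components along earlier rows
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def lloyd_max_codebook(bits, sigma, rng, samples=5000, iters=20):
    """Fit 2**bits levels to samples of N(0, sigma^2) with Lloyd's algorithm."""
    data = sorted(rng.gauss(0.0, sigma) for _ in range(samples))
    k = 2 ** bits
    levels = [data[(2 * i + 1) * samples // (2 * k)] for i in range(k)]  # quantile init
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for x in data:
            j = min(range(k), key=lambda i: abs(x - levels[i]))
            sums[j] += x
            counts[j] += 1
        levels = [sums[i] / counts[i] if counts[i] else levels[i] for i in range(k)]
    return levels


def quantize(vec, rotation, levels):
    """Return (per-coordinate codes, norm); the norm is kept separately."""
    norm = math.sqrt(sum(a * a for a in vec))
    scale = norm if norm > 0 else 1.0
    unit = [a / scale for a in vec]
    rotated = [sum(r * a for r, a in zip(row, unit)) for row in rotation]
    codes = [min(range(len(levels)), key=lambda i: abs(x - levels[i])) for x in rotated]
    return codes, norm


def dequantize(codes, norm, rotation, levels):
    """Look up levels, undo the rotation (transpose), and rescale by the norm."""
    rotated = [levels[c] for c in codes]
    dim = len(rotated)
    unit = [sum(rotation[i][j] * rotated[i] for i in range(dim)) for j in range(dim)]
    return [norm * a for a in unit]


rng = random.Random(0)
dim = 64
rotation = random_rotation(dim, rng)
levels = lloyd_max_codebook(bits=4, sigma=1 / math.sqrt(dim), rng=rng)

vec = [rng.uniform(-1, 1) for _ in range(dim)]
codes, norm = quantize(vec, rotation, levels)
restored = dequantize(codes, norm, rotation, levels)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, restored))) / norm
print(f"relative reconstruction error at 4 bits: {err:.3f}")
```

Note that the codebook is built without looking at `vec` or any model data, which is what "data-oblivious" means here: only the dimension and the bit width are needed.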

Quality Results

Tested on Apple Silicon with various model sizes:

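| Model        | Baseline  | + tqai K4/V2 | + tqai K3/V2 |
|--------------|-----------|--------------|--------------|
| Qwen 0.5B    | Good      | Degraded     | Poor         |
| Qwen 3B bf16 | Excellent | Good         | Degraded     |
| Llama 8B Q4  | Excellent | Excellent    | Excellent    |
| Qwen 14B Q4  | Excellent | Excellent    | Excellent    |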

Quality is near-identical to baseline on 8B+ parameter models.

CLI

tqai includes a command-line tool for quick testing without writing code:

# Show environment and library info
tqai info

# Run quantization accuracy benchmark
tqai benchmark
tqai benchmark --bits-k 3 --bits-v 2 --head-dim 128

# Generate text with TurboQuant compression
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit
tqai run "Explain gravity" --model Qwen/Qwen2.5-3B-Instruct --backend torch

# Compare baseline vs compressed output side by side
tqai compare "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit

# Pre-convert a model for faster startup
tqai convert --model mlx-community/Llama-3.1-8B-Instruct-4bit --output ./llama-tqai/
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit --tqai-config ./llama-tqai/

# Run without compression (baseline)
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit --no-tqai

Advanced Options

cache = tqai.patch(
    model,
    bits_k=4,           # Bits per key coordinate (2, 3, or 4)
    bits_v=2,           # Bits per value coordinate (2, 3, or 4)
    sink_tokens=4,      # Keep first N tokens uncompressed (attention sinks)
    backend="torch",    # Force backend: "torch" or "mlx"
    device="cuda",      # PyTorch device (ignored for MLX)
)
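
To get a feel for what `sink_tokens` costs, the sketch below estimates the average bits per coordinate when the first few tokens stay uncompressed. It is an assumption-laden illustration: it treats uncompressed tokens as FP16 (16 bits), ignores norm overhead, and uses a hypothetical helper `effective_bits` that is not part of tqai.

```python
def effective_bits(seq_len, bits_k=4, bits_v=2, sink_tokens=4, full_bits=16):
    """Average bits per coordinate when the first `sink_tokens` stay uncompressed.

    Assumes seq_len > 0 and that uncompressed tokens use `full_bits` per coordinate.
    """
    kept = min(sink_tokens, seq_len)          # tokens stored at full precision
    compressed = seq_len - kept               # tokens stored quantized
    avg_compressed = (bits_k + bits_v) / 2    # mean of key and value bit widths
    return (kept * full_bits + compressed * avg_compressed) / seq_len


for n in (16, 1024, 32768):
    print(f"seq_len={n}: {effective_bits(n):.3f} bits per coordinate")
```

The overhead of the sink tokens is noticeable only for very short sequences and becomes negligible as the context grows.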

Running Tests

# Install dev dependencies (quoted so shells such as zsh do not glob the brackets)
pip install "tqai[dev]"

# Unit + accuracy tests (175 tests, <1s)
pytest tests/ --ignore=tests/test_e2e_models.py --ignore=tests/test_e2e_large_models.py

# End-to-end with real models (requires model downloads)
pytest tests/test_e2e_models.py -v -s

Project Structure

src/tqai/
├── __init__.py          # patch(), unpatch(), TurboQuantConfig
├── config.py            # Configuration dataclass
├── quantizer.py         # PolarQuantizer (core algorithm)
├── backend/             # PyTorch + MLX abstraction
├── codebook/            # Lloyd-Max codebooks (precomputed)
└── cache/               # HuggingFace + mlx-lm integrations

Paper

This library implements the TurboQuant algorithm from Google Research:

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
ICLR 2026 | arXiv:2504.19874 | Google Research Blog

Related work:

  • PolarQuant (AISTATS 2026) — Random rotation + polar coordinate quantization (the core of tqai)
  • QJL (AAAI 2025) — Quantized Johnson-Lindenstrauss residual correction (omitted in tqai — hurts softmax attention)

Contributing

See CONTRIBUTING.md. All commits require a DCO sign-off (git commit -s).

License

MIT
