TurboQuant KV cache compression for local LLM inference

tqai

Compresses the KV cache to about 3 bits per coordinate on average, saving roughly 80% of cache memory relative to a 16-bit cache, with near-zero quality loss on 8B+ models. Supports both PyTorch (CPU/CUDA) and MLX (Apple Silicon).

Based on TurboQuant (Google Research, ICLR 2026).

Installation

# Homebrew (macOS)
brew install alphawavesystems/tap/tqai

# PyPI
pip install tqai

# With PyTorch backend (quoted so shells such as zsh do not expand the brackets)
pip install "tqai[torch]"

# With MLX backend (Apple Silicon)
pip install "tqai[mlx]"

Quick Start

HuggingFace Transformers (PyTorch)

import tqai
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# One line to enable KV cache compression
cache = tqai.patch(model, bits_k=4, bits_v=2)

inputs = tokenizer("Explain quantum entanglement:", return_tensors="pt")
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))

MLX (Apple Silicon)

import tqai
import mlx_lm

model, tokenizer = mlx_lm.load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# One line to enable KV cache compression
tqai.patch(model, bits_k=4, bits_v=2, backend="mlx")

response = mlx_lm.generate(model, tokenizer, prompt="Explain quantum entanglement:", max_tokens=200)
print(response)

# Restore original behaviour when done
tqai.unpatch(model)
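
If you want the restore step to run even when generation raises, the patch/unpatch pair can be wrapped in a context manager. This is a minimal sketch, not part of tqai: `compressed_kv` is a hypothetical helper, and it takes the patch and unpatch functions as arguments so it only relies on the two calls shown above. `tqai.unpatch` is only demonstrated here with the MLX backend, so treat its use with PyTorch as an assumption.

```python
from contextlib import contextmanager


@contextmanager
def compressed_kv(model, patch_fn, unpatch_fn, **patch_kwargs):
    """Apply patch_fn on entry and always call unpatch_fn on exit."""
    cache = patch_fn(model, **patch_kwargs)
    try:
        # Yield whatever patch_fn returned (a cache object for the HF path).
        yield cache
    finally:
        # Runs on normal exit and on exceptions alike.
        unpatch_fn(model)


# Intended use with the calls shown above (requires tqai and mlx_lm):
#   with compressed_kv(model, tqai.patch, tqai.unpatch,
#                      bits_k=4, bits_v=2, backend="mlx"):
#       response = mlx_lm.generate(model, tokenizer,
#                                  prompt="Explain quantum entanglement:",
#                                  max_tokens=200)
```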

Compression Configs

Config                Avg Bits   Memory Saved   Recommended For
bits_k=4, bits_v=2    3.0        80%            Production (best quality/compression balance)
bits_k=3, bits_v=2    2.5        84%            Extended context windows
bits_k=4, bits_v=3    3.5        78%            Quality-sensitive applications
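
The table's figures can be approximated with simple arithmetic. The sketch below assumes that "Avg Bits" is the mean of `bits_k` and `bits_v` and that savings are measured against a 16-bit cache; both are assumptions inferred from the numbers, and the helper names are illustrative only. It ignores the separately stored norms, which is one plausible reason the first row comes out slightly above the 80% listed.

```python
BASELINE_BITS = 16  # assumption: savings are relative to an FP16/BF16 cache


def avg_bits(bits_k, bits_v):
    """Mean bits per coordinate across keys and values."""
    return (bits_k + bits_v) / 2


def memory_saved(bits_k, bits_v, baseline_bits=BASELINE_BITS):
    """Fraction of cache memory saved, ignoring per-vector norm overhead."""
    return 1 - avg_bits(bits_k, bits_v) / baseline_bits


for bits_k, bits_v in [(4, 2), (3, 2), (4, 3)]:
    print(f"bits_k={bits_k}, bits_v={bits_v}: "
          f"{avg_bits(bits_k, bits_v):.1f} avg bits, "
          f"{memory_saved(bits_k, bits_v):.0%} saved")
```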

How It Works

tqai implements Stage 1 of TurboQuant (PolarQuant):

  1. Random orthogonal rotation: rotates KV vectors by a fixed Haar-distributed matrix to spread information across all coordinates
  2. Lloyd-Max scalar quantization: quantizes each coordinate independently using precomputed optimal codebooks
  3. Norm preservation: stores each vector's norm separately in FP16 so the original scale can be restored on dequantization

No training, calibration, or model-specific tuning required. The same codebooks work for any model.
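
The three steps above can be sketched in NumPy. This is an illustration under stated assumptions, not tqai's implementation: all function names are hypothetical, the codebook is fitted on the fly to standard normal samples (a Gaussian approximation of the rotated coordinates) rather than loaded from precomputed tables, and codes are kept one per byte instead of being bit-packed.

```python
import numpy as np


def haar_rotation(dim, seed=0):
    """Sample a fixed orthogonal matrix from the Haar distribution via QR."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix makes the distribution uniform


def lloyd_max_levels(bits, n_samples=50_000, iters=30, seed=0):
    """Fit 2**bits reconstruction levels to N(0, 1) samples (Lloyd's algorithm)."""
    samples = np.random.default_rng(seed).standard_normal(n_samples)
    n_levels = 2 ** bits
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2      # decision boundaries
        cell = np.searchsorted(edges, samples)      # nearest level per sample
        for i in range(n_levels):
            members = samples[cell == i]
            if members.size:
                levels[i] = members.mean()          # centroid update
    return levels


def quantize(x, rotation, levels):
    """Return per-coordinate codes and FP16 norms for row vectors in x."""
    dim = x.shape[-1]
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    # Rotate unit vectors and rescale so coordinates are roughly N(0, 1).
    rotated = (x / np.maximum(norms, 1e-12)) @ rotation.T * np.sqrt(dim)
    edges = (levels[:-1] + levels[1:]) / 2
    codes = np.searchsorted(edges, rotated).astype(np.uint8)
    return codes, norms.astype(np.float16)


def dequantize(codes, norms, rotation, levels):
    """Look up levels, undo the rotation, and restore the stored norms."""
    dim = codes.shape[-1]
    rotated = levels[codes] / np.sqrt(dim)
    return (rotated @ rotation) * norms.astype(np.float32)


rotation = haar_rotation(128)
levels = lloyd_max_levels(bits=4)
keys = np.random.default_rng(1).standard_normal((64, 128))
codes, norms = quantize(keys, rotation, levels)
restored = dequantize(codes, norms, rotation, levels)
rel_err = np.linalg.norm(keys - restored) / np.linalg.norm(keys)
print(f"4-bit relative reconstruction error: {rel_err:.3f}")
```

Because the rotation makes every coordinate follow approximately the same distribution regardless of the input, one codebook per bit width is enough, which is consistent with the claim that no calibration is needed.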

Quality Results

Tested on Apple Silicon with various model sizes:

Model           Baseline    + tqai K4/V2   + tqai K3/V2
Qwen 0.5B       Good        Degraded       Poor
Qwen 3B bf16    Excellent   Good           Degraded
Llama 8B Q4     Excellent   Excellent      Excellent
Qwen 14B Q4     Excellent   Excellent      Excellent

Quality is near-identical to baseline on 8B+ parameter models.

CLI

tqai includes a command-line tool for quick testing without writing code:

# Show environment and library info
tqai info

# Run quantization accuracy benchmark
tqai benchmark
tqai benchmark --bits-k 3 --bits-v 2 --head-dim 128

# Generate text with TurboQuant compression
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit
tqai run "Explain gravity" --model Qwen/Qwen2.5-3B-Instruct --backend torch

# Compare baseline vs compressed output side by side
tqai compare "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit

# Pre-convert a model for faster startup
tqai convert --model mlx-community/Llama-3.1-8B-Instruct-4bit --output ./llama-tqai/
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit --tqai-config ./llama-tqai/

# Run without compression (baseline)
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit --no-tqai

Advanced Options

cache = tqai.patch(
    model,
    bits_k=4,           # Bits per key coordinate (2, 3, or 4)
    bits_v=2,           # Bits per value coordinate (2, 3, or 4)
    sink_tokens=4,      # Keep first N tokens uncompressed (attention sinks)
    backend="torch",    # Force backend: "torch" or "mlx"
    device="cuda",      # PyTorch device (ignored for MLX)
)
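
To get a feel for how `bits_k`, `bits_v`, and `sink_tokens` interact, a rough size estimate can be computed by hand. The sketch below is not a tqai API: `kv_cache_bytes` is a hypothetical helper, and it assumes sink tokens stay at FP16, every compressed vector carries one FP16 norm, and codes are packed with no padding. The model shape used in the example (32 layers, 8 KV heads, head dimension 128) is what I understand the Llama 3.1 8B configuration to be.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bits_k=None, bits_v=None, sink_tokens=0):
    """Estimate KV cache size in bytes; bits_k/bits_v of None means plain FP16."""
    heads = n_layers * n_kv_heads
    fp16_pair = 2 * head_dim * 16  # bits for one K plus one V vector at FP16
    if bits_k is None or bits_v is None:
        return seq_len * heads * fp16_pair // 8
    sinks = min(sink_tokens, seq_len)
    # Assumed layout: packed codes plus one FP16 norm each for K and V.
    packed_pair = head_dim * (bits_k + bits_v) + 2 * 16
    total_bits = heads * (sinks * fp16_pair + (seq_len - sinks) * packed_pair)
    return total_bits // 8


shape = dict(seq_len=8192, n_layers=32, n_kv_heads=8, head_dim=128)
baseline = kv_cache_bytes(**shape)
compressed = kv_cache_bytes(**shape, bits_k=4, bits_v=2, sink_tokens=4)
print(f"baseline:   {baseline / 2**20:.0f} MiB")
print(f"compressed: {compressed / 2**20:.0f} MiB "
      f"({1 - compressed / baseline:.1%} saved)")
```

Under these assumptions a handful of sink tokens has a negligible effect on total size at long context lengths, while the per-vector norms cost a small fraction of a bit per coordinate.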

Running Tests

# Install dev dependencies (quoted so shells such as zsh do not expand the brackets)
pip install "tqai[dev]"

# Unit + accuracy tests (175 tests, <1s)
pytest tests/ --ignore=tests/test_e2e_models.py --ignore=tests/test_e2e_large_models.py

# End-to-end with real models (requires model downloads)
pytest tests/test_e2e_models.py -v -s

Project Structure

src/tqai/
├── __init__.py          # patch(), unpatch(), TurboQuantConfig
├── config.py            # Configuration dataclass
├── quantizer.py         # PolarQuantizer (core algorithm)
├── backend/             # PyTorch + MLX abstraction
├── codebook/            # Lloyd-Max codebooks (precomputed)
└── cache/               # HuggingFace + mlx-lm integrations

License

MIT
