TurboQuant KV cache compression for local LLM inference

Maintainers

pbertsch

tqai

TurboQuant KV cache compression for local LLM inference.

Compresses the KV cache to about 3 bits per channel on average, saving roughly 80% of cache memory (78-84% depending on the configuration) with near-zero quality loss on 8B+ models. Supports both PyTorch (CPU/CUDA) and MLX (Apple Silicon).

Based on TurboQuant (Google Research, ICLR 2026).

Installation

# Homebrew (macOS)
brew install alphawavesystems/tap/tqai

# PyPI
pip install tqai

# With PyTorch backend
pip install tqai[torch]

# With MLX backend (Apple Silicon)
pip install tqai[mlx]

Quick Start

HuggingFace Transformers (PyTorch)

import tqai
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# One line to enable KV cache compression
cache = tqai.patch(model, bits_k=4, bits_v=2)

inputs = tokenizer("Explain quantum entanglement:", return_tensors="pt")
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))

MLX (Apple Silicon)

import tqai
import mlx_lm

model, tokenizer = mlx_lm.load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# One line to enable KV cache compression
tqai.patch(model, bits_k=4, bits_v=2, backend="mlx")

response = mlx_lm.generate(model, tokenizer, prompt="Explain quantum entanglement:", max_tokens=200)
print(response)

# Restore original behaviour when done
tqai.unpatch(model)
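
If generation raises an exception, the unpatch call above never runs. One way to make the restore step unconditional is a small context manager. The sketch below is a hypothetical helper (`compressed_kv` is not part of tqai); it takes the patch and unpatch functions as arguments, so it assumes nothing about tqai beyond the two calls shown above.

```python
from contextlib import contextmanager

@contextmanager
def compressed_kv(model, patch, unpatch, **patch_kwargs):
    """Patch `model` on entry and always unpatch it on exit (hypothetical helper)."""
    cache = patch(model, **patch_kwargs)  # whatever patch() returns is passed through
    try:
        yield cache
    finally:
        unpatch(model)  # runs even if the body raises

# Intended usage with tqai (not executed here):
# with compressed_kv(model, tqai.patch, tqai.unpatch, bits_k=4, bits_v=2, backend="mlx"):
#     response = mlx_lm.generate(model, tokenizer, prompt="...", max_tokens=200)
```

Passing the functions in, rather than importing tqai inside the helper, also makes the wrapper easy to exercise with stand-ins in tests.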

Compression Configs

Config	Avg Bits	Memory Saved	Recommended For
`bits_k=4, bits_v=2`	3.0	80%	Production (best quality/compression balance)
`bits_k=3, bits_v=2`	2.5	84%	Extended context windows
`bits_k=4, bits_v=3`	3.5	78%	Quality-sensitive applications
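
The savings column can be roughly reproduced with back-of-envelope arithmetic. The sketch below assumes a 16-bit (FP16/BF16) baseline cache, one FP16 norm stored per key and per value vector, and a head dimension of 128. These are assumptions for illustration, `kv_savings` is a hypothetical helper rather than part of tqai, and the library's actual accounting may differ.

```python
def kv_savings(bits_k, bits_v, head_dim=128, baseline_bits=16, norm_bits=16):
    """Fraction of KV cache memory saved versus an uncompressed baseline.

    Assumes one norm of `norm_bits` per key vector and per value vector.
    """
    compressed = (bits_k + bits_v) * head_dim + 2 * norm_bits
    baseline = 2 * baseline_bits * head_dim
    return 1.0 - compressed / baseline

for bits_k, bits_v in [(4, 2), (3, 2), (4, 3)]:
    print(f"K{bits_k}/V{bits_v}: {kv_savings(bits_k, bits_v):.1%} saved")
```

Under these assumptions the three configurations come out at 80.5%, 83.6%, and 77.3%, each within about a point of the table.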

How It Works

TurboQuant consists of two stages: PolarQuant (the core compression) and QJL (a residual correction). tqai implements PolarQuant and deliberately omits QJL, which independent research found to degrade softmax-based attention quality. The compression path has three steps:

- Random orthogonal rotation: rotates KV vectors by a fixed Haar-distributed matrix to spread information uniformly across all coordinates.
- Lloyd-Max scalar quantization: quantizes each coordinate independently using precomputed optimal codebooks derived from the known post-rotation distribution.
- Norm preservation: stores vector norms separately in FP16 so that magnitudes can be reconstructed.

No training, calibration, or model-specific tuning required. Fully data-oblivious — the same codebooks work for any model.
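
The three steps can be illustrated with a short NumPy sketch. This is not tqai's implementation: the function names are hypothetical, the codebook is fitted at runtime with Lloyd's algorithm on Gaussian samples (a large-dimension approximation of the post-rotation distribution) instead of being precomputed, and codes are left as one byte each rather than bit-packed.

```python
import numpy as np

def haar_rotation(dim, rng):
    """Sample a Haar-distributed orthogonal matrix via sign-corrected QR."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # flip columns so diag(R) is positive

def lloyd_max_levels(bits, rng, n_samples=50_000, iters=30):
    """Fit a 2**bits-level scalar codebook to N(0, 1) samples (Lloyd's algorithm)."""
    samples = rng.standard_normal(n_samples)
    n_levels = 2 ** bits
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2           # nearest-level boundaries
        cells = np.searchsorted(edges, samples)          # cell index per sample
        levels = np.array([samples[cells == k].mean() for k in range(n_levels)])
    return levels

def compress(x, rotation, levels):
    """Return (codes, fp16 norm) for a single nonzero vector x."""
    norm = np.linalg.norm(x)
    # Rotated unit vector, rescaled so coordinates are roughly N(0, 1).
    y = rotation @ (x / norm) * np.sqrt(x.size)
    edges = (levels[:-1] + levels[1:]) / 2
    codes = np.searchsorted(edges, y).astype(np.uint8)
    return codes, np.float16(norm)

def decompress(codes, norm, rotation, levels):
    """Look up levels, undo the scaling and rotation, restore the norm."""
    y_hat = levels[codes] / np.sqrt(codes.size)
    return float(norm) * (rotation.T @ y_hat)

rng = np.random.default_rng(0)
dim = 128
rotation = haar_rotation(dim, rng)
x = rng.standard_normal(dim)
for bits in (4, 2):
    levels = lloyd_max_levels(bits, rng)
    x_hat = decompress(*compress(x, rotation, levels), rotation, levels)
    rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
    print(f"{bits}-bit relative reconstruction error: {rel_err:.3f}")
```

Because the rotation and codebook depend only on the dimension and bit width, nothing in the sketch looks at model weights or calibration data, which is the sense in which the scheme is data-oblivious.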

Quality Results

Tested on Apple Silicon with various model sizes:

Model	Baseline	+ tqai K4/V2	+ tqai K3/V2
Qwen 0.5B	Good	Degraded	Poor
Qwen 3B bf16	Excellent	Good	Degraded
Llama 8B Q4	Excellent	Excellent	Excellent
Qwen 14B Q4	Excellent	Excellent	Excellent

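In these tests, output quality on the 8B and 14B models was rated the same as baseline (Excellent) under both configurations, while the 0.5B and 3B models degraded.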

CLI

tqai includes a command-line tool for quick testing without writing code:

# Show environment and library info
tqai info

# Run quantization accuracy benchmark
tqai benchmark
tqai benchmark --bits-k 3 --bits-v 2 --head-dim 128

# Generate text with TurboQuant compression
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit
tqai run "Explain gravity" --model Qwen/Qwen2.5-3B-Instruct --backend torch

# Compare baseline vs compressed output side by side
tqai compare "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit

# Pre-convert a model for faster startup
tqai convert --model mlx-community/Llama-3.1-8B-Instruct-4bit --output ./llama-tqai/
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit --tqai-config ./llama-tqai/

# Run without compression (baseline)
tqai run "Explain gravity" --model mlx-community/Llama-3.1-8B-Instruct-4bit --no-tqai

Advanced Options

cache = tqai.patch(
    model,
    bits_k=4,           # Bits per key coordinate (2, 3, or 4)
    bits_v=2,           # Bits per value coordinate (2, 3, or 4)
    sink_tokens=4,      # Keep first N tokens uncompressed (attention sinks)
    backend="torch",    # Force backend: "torch" or "mlx"
    device="cuda",      # PyTorch device (ignored for MLX)
)
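
Because sink tokens stay uncompressed, the effective bit rate is slightly above the nominal average and depends on sequence length. The sketch below assumes uncompressed tokens cost 16 bits per coordinate and ignores norm storage; `effective_bits` is a hypothetical helper, not a tqai function.

```python
def effective_bits(seq_len, avg_bits=3.0, sink_tokens=4, full_bits=16):
    """Average bits per coordinate when the first `sink_tokens` stay uncompressed."""
    sinks = min(sink_tokens, seq_len)  # short sequences may be all sinks
    return (sinks * full_bits + (seq_len - sinks) * avg_bits) / seq_len

for seq_len in (64, 1024, 8192):
    print(seq_len, f"{effective_bits(seq_len):.2f}")
```

With the defaults shown, the overhead is visible at 64 tokens (about 3.81 bits) and negligible at 8192 tokens (about 3.01 bits), so the sink setting matters mainly for short contexts.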

Running Tests

# Install dev dependencies
pip install tqai[dev]

# Unit + accuracy tests (175 tests, <1s)
pytest tests/ --ignore=tests/test_e2e_models.py --ignore=tests/test_e2e_large_models.py

# End-to-end with real models (requires model downloads)
pytest tests/test_e2e_models.py -v -s

Project Structure

src/tqai/
├── __init__.py          # patch(), unpatch(), TurboQuantConfig
├── config.py            # Configuration dataclass
├── quantizer.py         # PolarQuantizer (core algorithm)
├── backend/             # PyTorch + MLX abstraction
├── codebook/            # Lloyd-Max codebooks (precomputed)
└── cache/               # HuggingFace + mlx-lm integrations

Paper

This library implements the TurboQuant algorithm from Google Research:

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
ICLR 2026 | arXiv:2504.19874 | Google Research Blog

Related work:

PolarQuant (AISTATS 2026) — Random rotation + polar coordinate quantization (the core of tqai)
QJL (AAAI 2025) — Quantized Johnson-Lindenstrauss residual correction (omitted in tqai — hurts softmax attention)

Contributing

See CONTRIBUTING.md. All commits require a DCO sign-off (git commit -s).

License

MIT

Release history

Version	Release date
0.6.0	Apr 22, 2026
0.5.0	Apr 22, 2026
0.4.1	Apr 22, 2026
0.4.0	Apr 7, 2026
0.3.1	Apr 5, 2026
0.2.0 (this version)	Apr 5, 2026
0.1.0	Apr 4, 2026

Download files

Source Distribution

tqai-0.2.0.tar.gz (72.9 kB), uploaded Apr 5, 2026 (Source)

Built Distribution

tqai-0.2.0-py3-none-any.whl (52.1 kB), uploaded Apr 5, 2026 (Python 3)

File details

Details for the file tqai-0.2.0.tar.gz.

File metadata

Download URL: tqai-0.2.0.tar.gz
Upload date: Apr 5, 2026
Size: 72.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tqai-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`b744b298433f6b6a6fa7325b1f734a5e8ad1bcc6ec817da407e0fc9850cdaf7b`
MD5	`0ffa47a450c95a1908e4f0c31b8d9b65`
BLAKE2b-256	`d6295b65d9e54cd245587568b6c459a266254ea58b31fb93177913fa5177274d`
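
A downloaded archive can be checked against the SHA256 digest above with the standard library. This is a minimal sketch: `sha256_of` and `matches` are hypothetical helper names, and the file path in the commented usage is assumed to be wherever the archive was saved.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path, expected_hex):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.lower()

# Digest listed above for the source distribution.
EXPECTED_SDIST = "b744b298433f6b6a6fa7325b1f734a5e8ad1bcc6ec817da407e0fc9850cdaf7b"
# Usage, assuming the archive sits in the current directory:
# assert matches("tqai-0.2.0.tar.gz", EXPECTED_SDIST)
```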

Provenance

The following attestation bundles were made for tqai-0.2.0.tar.gz:

Publisher: release.yml on AlphaWaveSystems/tqai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tqai-0.2.0.tar.gz
- Subject digest: b744b298433f6b6a6fa7325b1f734a5e8ad1bcc6ec817da407e0fc9850cdaf7b
- Sigstore transparency entry: 1238540774
- Sigstore integration time: Apr 5, 2026
Source repository:
- Permalink: AlphaWaveSystems/tqai@7e9790aa32330a7f6ced97575b679b00997fb713
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/AlphaWaveSystems
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@7e9790aa32330a7f6ced97575b679b00997fb713
- Trigger Event: push

File details

Details for the file tqai-0.2.0-py3-none-any.whl.

File metadata

Download URL: tqai-0.2.0-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 52.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tqai-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5b8d53c6f6f7f7a64bb3a0798d3ba004226d8a71b1f3e0fadaead78560e44843`
MD5	`01c3059c74da6358284f9d6b93bfc85b`
BLAKE2b-256	`d9a757dbb31513f818c4e721ba4c33138fd21cf74973a221f40e590eac6a7dd2`

Provenance

The following attestation bundles were made for tqai-0.2.0-py3-none-any.whl:

Publisher: release.yml on AlphaWaveSystems/tqai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tqai-0.2.0-py3-none-any.whl
- Subject digest: 5b8d53c6f6f7f7a64bb3a0798d3ba004226d8a71b1f3e0fadaead78560e44843
- Sigstore transparency entry: 1238540779
- Sigstore integration time: Apr 5, 2026
Source repository:
- Permalink: AlphaWaveSystems/tqai@7e9790aa32330a7f6ced97575b679b00997fb713
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/AlphaWaveSystems
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@7e9790aa32330a7f6ced97575b679b00997fb713
- Trigger Event: push
