TurboQuant KV cache compression for local LLM inference

Maintainers

pbertsch

Project description

tqai

TurboQuant KV cache compression for local LLM inference.

Compresses the KV cache to ~3 bits per channel with 80%+ memory savings and zero perplexity change on 8B+ models. Supports both PyTorch (CPU/CUDA) and MLX (Apple Silicon).

Based on TurboQuant (Google Research, ICLR 2026).

Installation

# PyPI
pip install tqai

# With PyTorch backend (quotes stop shells such as zsh from expanding the brackets)
pip install "tqai[torch]"

# With MLX backend (Apple Silicon)
pip install "tqai[mlx]"

# Global CLI install (no venv management)
pipx install tqai
pipx inject tqai mlx mlx-lm   # add MLX backend
pipx inject tqai torch         # or PyTorch backend

Quick Start

HuggingFace Transformers (PyTorch)

import tqai
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# One line to enable KV cache compression
cache = tqai.patch(model, bits_k=4, bits_v=2)

inputs = tokenizer("Explain quantum entanglement:", return_tensors="pt")
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))

MLX (Apple Silicon)

import tqai
import mlx_lm

model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-8bit")

# One line to enable KV cache compression
tqai.patch(model, bits_k=4, bits_v=2, backend="mlx")

response = mlx_lm.generate(model, tokenizer, prompt="Explain quantum entanglement:", max_tokens=200)
print(response)

# Restore original behaviour when done
tqai.unpatch(model)

Benchmark Results

All results measured on Apple Silicon (MLX). Full data in benchmarks/results/.

Perplexity — zero change across every model and config

Model	Baseline PPL	+ tqai K4/V2	+ tqai K3/V2
Qwen2.5-0.5B bf16	4.34	4.34	4.34
Qwen2.5-3B bf16	2.49	2.49	2.49
Llama-3.1-8B Q4	2.95	2.95	2.95
Qwen2.5-7B Q8	2.40	2.40	2.40
Qwen2.5-14B Q4	2.22	2.22	2.22
Gemma 4 E4B Q4	126.94	126.94	—

Δppl = 0.00, at the two-decimal precision reported above, across all models and compression configs tested (6 models, 12 quantization variants).
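
For reference, perplexity is the exponential of the mean negative log-likelihood per token, so a Δppl of 0.00 means the baseline and compressed runs assign almost the same average log-probability to the evaluation text. The sketch below is a generic illustration with made-up log-probabilities. It is not taken from benchmarks/eval_perplexity.py, and the function name is hypothetical.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Made-up per-token log-probabilities for a baseline and a compressed run.
baseline = [-0.91, -1.20, -0.35, -1.05]
compressed = [-0.91, -1.21, -0.35, -1.05]

delta = perplexity(compressed) - perplexity(baseline)
print(f"delta ppl = {delta:.2f}")
```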

Throughput (MLX, v0.3.1 — fused Metal kernels + incremental cache)

Model	Baseline	kv-only	Retention	vs v0.2
Qwen2.5-0.5B bf16	326 tok/s	118 tok/s	36%	was 10%
Qwen2.5-3B bf16	71 tok/s	66 tok/s	93%	was 26%
Llama-3.1-8B Q4	102 tok/s	85 tok/s	84%	was 22%
Qwen2.5-7B Q8	63 tok/s	59 tok/s	93%	was 38%
Qwen2.5-14B Q4	56 tok/s	50 tok/s	89%	was 25%
Gemma 4 E4B Q4	113 tok/s	47 tok/s	41%	new

v0.3.1 removed the reconstruction overhead, which cost O(n) per generated token and therefore O(n²) over a sequence, by keeping an incremental dequantized buffer (KIVI-inspired). Fused Metal kernels (mx.fast.metal_kernel) handle quantize/dequantize in single GPU dispatches. The Qwen and Llama models at 3B and above now retain 84–93% of baseline throughput; Qwen2.5-0.5B and Gemma 4 E4B retain 36% and 41%.

Compression Configs

KV Cache

Config	Avg Bits	Memory Saved	Recommended For
`bits_k=4, bits_v=2`	3.0	80%	Production — best quality/compression balance
`bits_k=3, bits_v=2`	2.5	84%	Extended context windows
`bits_k=4, bits_v=3`	3.5	78%	Quality-sensitive applications
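
The "Avg Bits" and "Memory Saved" columns can be approximated with simple arithmetic. The sketch below assumes an FP16 (16-bit) baseline cache and ignores per-vector overhead such as the stored norm, so its percentages can differ from the table by about a point. The function names are illustrative and not part of tqai; the shape used in the example (32 layers, 8 KV heads, head dimension 128) is the published Llama-3.1-8B configuration.

```python
def avg_bits(bits_k, bits_v):
    """Average bits per cached coordinate across keys and values."""
    return (bits_k + bits_v) / 2

def memory_saved(bits_k, bits_v, baseline_bits=16):
    """Fraction of KV memory saved relative to the baseline precision."""
    return 1 - avg_bits(bits_k, bits_v) / baseline_bits

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes for keys plus values of one sequence."""
    values = 2 * layers * kv_heads * head_dim * seq_len
    return values * bits_per_value / 8

for k, v in [(4, 2), (3, 2), (4, 3)]:
    print(f"K{k}/V{v}: {avg_bits(k, v)} bits, {memory_saved(k, v):.1%} saved")

# Llama-3.1-8B shape at a 32K-token context.
fp16 = kv_cache_bytes(32, 8, 128, 32_768, 16)
tq = kv_cache_bytes(32, 8, 128, 32_768, avg_bits(4, 2))
print(f"32K context: {fp16 / 2**30:.2f} GiB -> {tq / 2**30:.2f} GiB")
```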

Named Configs (CLI)

Config	KV	Hidden	FFN	Use Case
`kv-only`	K4/V2	—	—	KV memory savings only
`kv+hidden8`	K4/V2	8-bit	—	KV + hidden state compression
`kv+hidden6`	K4/V2	6-bit	—	More aggressive hidden compression
`kv+ffn8`	K4/V2	—	8-bit	KV + FFN activation compression
`all8`	K4/V2	8-bit	8-bit	Full compression at 8-bit
`all6`	K4/V2	6-bit	6-bit	Full compression at 6-bit
`aggressive`	K3/V2	6-bit	6-bit	Maximum compression

How It Works

tqai implements PolarQuant, the core of TurboQuant Stage 1, via three steps applied to each KV vector at generation time:

1. Random orthogonal rotation: rotates KV vectors by a fixed Haar-distributed matrix to spread information uniformly across all coordinates.
2. Lloyd-Max scalar quantization: quantizes each coordinate independently using precomputed optimal codebooks derived from the known post-rotation distribution.
3. Norm preservation: stores the vector norm separately in FP16, so the magnitude is restored without coordinate quantization error (up to FP16 precision).

No training, calibration, or model-specific tuning required. Fully data-oblivious — the same codebooks work for any model.
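
The three steps can be sketched in pure Python for a small dimension. This is an illustration under stated assumptions, not tqai's PolarQuantizer: the codebook is fit here by Lloyd iterations on standard-normal samples instead of being loaded from precomputed tables, the norm is kept as a Python float instead of FP16, the final renormalization is a choice made for this sketch, and all function names are hypothetical.

```python
import math
import random

def random_rotation(dim, rng):
    """Haar-distributed orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove components along the rows found so far
            d = sum(a * b for a, b in zip(v, r))
            v = [a - d * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows

def lloyd_max_codebook(bits, rng, samples=4000, iters=20):
    """Fit 2**bits levels to standard-normal samples with Lloyd iterations."""
    data = sorted(rng.gauss(0.0, 1.0) for _ in range(samples))
    levels = 2 ** bits
    # Start from evenly spaced sample quantiles.
    cents = [data[(2 * i + 1) * samples // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for x in data:
            j = min(range(levels), key=lambda i: abs(x - cents[i]))
            sums[j] += x
            counts[j] += 1
        cents = [sums[i] / counts[i] if counts[i] else cents[i]
                 for i in range(levels)]
    return cents

def quantize(vec, rotation, codebook):
    """Return one code per coordinate plus the separately stored norm."""
    dim = len(vec)
    norm = math.sqrt(sum(a * a for a in vec))
    safe = norm if norm > 0 else 1.0
    unit = [a / safe for a in vec]
    rotated = [sum(r * u for r, u in zip(row, unit)) for row in rotation]
    scale = math.sqrt(dim)  # rotated unit-vector coordinates are roughly N(0, 1/dim)
    codes = [min(range(len(codebook)), key=lambda i: abs(c * scale - codebook[i]))
             for c in rotated]
    return codes, norm

def dequantize(codes, norm, rotation, codebook):
    """Look up levels, undo the rotation (transpose), restore the stored norm."""
    dim = len(codes)
    rotated = [codebook[c] / math.sqrt(dim) for c in codes]
    unit = [sum(rotation[i][j] * rotated[i] for i in range(dim)) for j in range(dim)]
    n = math.sqrt(sum(a * a for a in unit)) or 1.0
    return [norm * a / n for a in unit]

rng = random.Random(0)
rotation = random_rotation(32, rng)
codebook = lloyd_max_codebook(4, rng)
vec = [rng.gauss(0.0, 3.0) for _ in range(32)]
codes, norm = quantize(vec, rotation, codebook)
restored = dequantize(codes, norm, rotation, codebook)
cosine = sum(a * b for a, b in zip(vec, restored)) / (norm * norm)
print(f"cosine similarity after 4-bit round trip: {cosine:.4f}")
```

Because the rotation and the codebook depend only on the dimension and bit width, nothing in the sketch looks at model weights or calibration data, which is the sense in which the method is data-oblivious.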

Cache Strategies (v0.3.1)

tqai supports three cache reconstruction strategies to balance speed and quality:

Strategy	Per-token cost	Quality	Use case
`incremental` (default)	O(1)	Same as full	Production — 2–3x faster than v0.2
`residual`	O(1)	Better (recent tokens exact)	Quality-sensitive, long context
`full`	O(n)	Baseline	Debugging, compatibility

# Use residual strategy — last 128 tokens kept uncompressed (KIVI-style)
tqai.patch(model, bits_k=4, bits_v=2, cache_strategy="residual", residual_window=128)
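
The per-token costs in the table can be illustrated with a toy cache that counts dequantize calls. This is a sketch of the general idea only: the class, its rounding "quantizer", and its scalar values are stand-ins invented for this example and do not reflect tqai's internals.

```python
class ToyCache:
    """Counts dequantize calls per strategy; cached values are plain floats."""

    def __init__(self, strategy, residual_window=0, step=0.25):
        self.strategy = strategy
        self.window = residual_window
        self.step = step
        self.codes = []    # compressed history
        self.buffer = []   # dequantized history reused across steps
        self.recent = []   # exact recent values (residual strategy only)
        self.dequant_calls = 0

    def _compress(self, x):
        code = round(x / self.step)  # stand-in for a real quantizer
        self.codes.append(code)
        return code

    def _dequantize(self, code):
        self.dequant_calls += 1
        return code * self.step

    def append(self, x):
        """Add one token's value and return the history attention would read."""
        if self.strategy == "full":
            self._compress(x)
            # Rebuild everything on every step: O(n) per token.
            return [self._dequantize(c) for c in self.codes]
        if self.strategy == "incremental":
            # Dequantize only the new entry: O(1) per token.
            self.buffer.append(self._dequantize(self._compress(x)))
            return self.buffer
        if self.strategy == "residual":
            # Keep the newest values exact; compress one as it leaves the window.
            self.recent.append(x)
            if len(self.recent) > self.window:
                oldest = self.recent.pop(0)
                self.buffer.append(self._dequantize(self._compress(oldest)))
            return self.buffer + self.recent  # the toy copies; a real cache would not
        raise ValueError(f"unknown strategy: {self.strategy}")

values = [0.1 * i for i in range(100)]
caches = {name: ToyCache(name, residual_window=8)
          for name in ("full", "incremental", "residual")}
for v in values:
    outputs = {name: cache.append(v) for name, cache in caches.items()}
for name, cache in caches.items():
    print(name, cache.dequant_calls)
```

In this toy the full and incremental strategies return identical histories, and the residual strategy differs only in that its last residual_window entries are the exact inputs.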

Codebook Solvers (v0.3.1)

Beyond the default Lloyd-Max solver, tqai offers evolutionary and fuzzy codebook optimizers for build-time codebook generation:

CMA-ES (arXiv:1710.05311) — Evolutionary refinement of Lloyd-Max codebooks, ~0.5% MSE improvement
Fuzzy C-means (arXiv:1908.05033) — Soft assignment with temperature annealing
Attention-aware objective (arXiv:2402.14866) — Optimizes codebooks to preserve softmax attention scores rather than raw MSE
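
As a rough illustration of soft assignment with temperature annealing, the sketch below refines a one-dimensional codebook with softmax-weighted centroid updates and a decreasing temperature. It is a soft k-means style update chosen for brevity, not necessarily the fuzzy C-means formulation tqai uses, and the function names, temperature schedule, and sample data are all assumptions made for this example.

```python
import math
import random

def uniform_init(data, levels):
    """Evenly spaced starting levels between the sample min and max."""
    lo, hi = min(data), max(data)
    return [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]

def soft_codebook(data, levels, temps=(0.2, 0.1, 0.05, 0.02, 0.005), iters=10):
    """Anneal from soft to nearly hard assignments, updating weighted means."""
    cents = uniform_init(data, levels)
    for t in temps:
        for _ in range(iters):
            num, den = [0.0] * levels, [0.0] * levels
            for x in data:
                d = [(x - c) ** 2 for c in cents]
                dmin = min(d)  # subtract the minimum for numerical stability
                w = [math.exp(-(di - dmin) / t) for di in d]
                total = sum(w)
                for j in range(levels):
                    num[j] += w[j] / total * x
                    den[j] += w[j] / total
            cents = [num[j] / den[j] if den[j] > 1e-12 else cents[j]
                     for j in range(levels)]
    return sorted(cents)

def mse(data, codebook):
    """Mean squared error under nearest-level (hard) quantization."""
    return sum(min((x - c) ** 2 for c in codebook) for x in data) / len(data)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
start = uniform_init(data, 4)
refined = soft_codebook(data, 4)
print(f"MSE before: {mse(data, start):.4f}, after: {mse(data, refined):.4f}")
```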

QJL Stage 2 (opt-in)

tqai optionally implements QJL (Quantized Johnson-Lindenstrauss), a residual sketch that corrects the systematic inner-product bias left by Stage 1:

cache = tqai.patch(model, bits_k=4, bits_v=2, use_qjl=True, qjl_sketch_size=64)

QJL trades bias reduction for added variance. For softmax-based attention, variance typically dominates — this is why QJL is off by default. Enable it for very low bit-widths, non-softmax attention, or research use.
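
The idea of a sign-based Johnson-Lindenstrauss correction can be sketched as follows: the residual left by a coarse quantizer is projected with a Gaussian matrix, only the sign bits and the residual norm are stored, and the query side uses them to estimate the missing part of the inner product. This follows the sign-sketch estimator as described in the QJL paper, to the best of this example's understanding, and is not tqai's implementation. The rounding quantizer is a stand-in for Stage 1, the function names are hypothetical, and the sketch size of 4096 is far larger than the qjl_sketch_size=64 shown above so that the estimate visibly converges.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_sketch_matrix(sketch_size, dim, rng):
    """Gaussian projection shared by keys and queries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(sketch_size)]

def sketch_vector(vec, S):
    """Keep one sign bit per projection row plus the vector norm."""
    signs = [1 if dot(row, vec) >= 0 else -1 for row in S]
    return signs, math.sqrt(dot(vec, vec))

def estimate_inner_product(query, signs, norm, S):
    """Unbiased estimate of <query, vec> from the sign sketch of vec."""
    acc = sum(dot(row, query) * s for row, s in zip(S, signs))
    return math.sqrt(math.pi / 2) / len(S) * norm * acc

rng = random.Random(0)
dim, sketch_size = 16, 4096
key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]

coarse = [round(x / 0.5) * 0.5 for x in key]      # stand-in for Stage 1
residual = [k - c for k, c in zip(key, coarse)]

S = make_sketch_matrix(sketch_size, dim, rng)
signs, residual_norm = sketch_vector(residual, S)

exact = dot(query, key)
stage1_only = dot(query, coarse)
corrected = stage1_only + estimate_inner_product(query, signs, residual_norm, S)
print(f"exact={exact:.3f} stage1={stage1_only:.3f} corrected={corrected:.3f}")
```

The estimator's variance shrinks as the sketch grows, so a small sketch adds noise to every attention score, which is consistent with the variance trade-off described above.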

CLI

# Show environment and library info
tqai info

# Quantization accuracy benchmark
tqai benchmark
tqai benchmark --bits-k 3 --bits-v 2 --head-dim 128

# Generate text with compression
tqai run "Explain gravity" --model mlx-community/Qwen2.5-7B-Instruct-8bit
tqai run "Explain gravity" --model Qwen/Qwen2.5-3B-Instruct --backend torch
tqai run "Explain gravity" --model mlx-community/Qwen2.5-7B-Instruct-8bit --config aggressive

# Run with QJL Stage 2
tqai run "Explain gravity" --model Qwen/Qwen2.5-3B-Instruct --use-qjl

# Compare baseline vs compressed side by side
tqai compare "Explain gravity" --model mlx-community/Qwen2.5-7B-Instruct-8bit

# Pre-convert a model for faster startup
tqai convert --model mlx-community/Qwen2.5-7B-Instruct-8bit --output ./qwen7b-tqai/

# Baseline (no compression)
tqai run "Explain gravity" --model mlx-community/Qwen2.5-7B-Instruct-8bit --no-tqai

Advanced Options

cache = tqai.patch(
    model,
    bits_k=4,              # Bits per key coordinate (2–8)
    bits_v=2,              # Bits per value coordinate (2–8)
    sink_tokens=4,         # Keep first N tokens uncompressed (attention sinks)
    backend="torch",       # Force backend: "torch" or "mlx"
    device="cuda",         # PyTorch device (ignored for MLX)
    use_qjl=False,         # Enable QJL Stage 2 residual correction (research)
    qjl_sketch_size=64,    # JL sketch dimension (tradeoff: quality vs memory)
    cache_strategy="auto", # "auto" (incremental), "residual", or "full"
    residual_window=128,   # Recent tokens kept uncompressed (residual strategy)
)
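
To make the sink_tokens and residual_window comments concrete, the helper below lists which token positions would stay uncompressed if the first N tokens and the most recent W tokens are both kept exact. This is an assumption about how the two options combine, inferred from the comments above; the helper is hypothetical and not part of tqai.

```python
def uncompressed_positions(seq_len, sink_tokens=4, residual_window=0):
    """Positions kept exact: leading attention sinks plus the trailing window."""
    sinks = set(range(min(sink_tokens, seq_len)))
    recent = set(range(max(seq_len - residual_window, 0), seq_len))
    return sorted(sinks | recent)

# A 10-token sequence with 4 sink tokens and a 2-token residual window.
print(uncompressed_positions(10, sink_tokens=4, residual_window=2))
```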

Running Tests

# Install dev dependencies
pip install "tqai[dev]"

# Unit + accuracy tests (~293 tests, <40s)
pytest tests/ --ignore=tests/test_e2e_models.py --ignore=tests/test_e2e_large_models.py

# End-to-end with real models (requires model downloads)
pytest tests/test_e2e_models.py -v -s

# Large model E2E (7B–14B, requires ~20GB disk)
pytest tests/test_e2e_large_models.py -v -s

Project Structure

src/tqai/
├── __init__.py          # patch(), unpatch(), TurboQuantConfig
├── config.py            # Configuration dataclass
├── quantizer.py         # PolarQuantizer (core algorithm + QJL Stage 2)
├── kernels/             # Fused Metal GPU kernels (quantize + dequantize)
├── hooks.py             # Forward-pass activation compression hooks
├── module_utils.py      # Transformer layer inspection utilities
├── backend/             # PyTorch + MLX abstraction layer
├── codebook/            # Codebook solvers (Lloyd-Max, CMA-ES, fuzzy) + precomputed data
└── cache/               # HuggingFace DynamicCache + mlx-lm KVCache integrations

benchmarks/
├── benchmark_forward.py # KV + activation compression throughput benchmark
├── benchmark_metal.py   # Metal kernel vs Python path microbenchmark
├── eval_perplexity.py   # Perplexity evaluation helper
└── results/             # Benchmark JSON results + FINDINGS.md

Paper

This library implements the TurboQuant algorithm from Google Research:

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni. ICLR 2026 | arXiv:2504.19874 | Google Research Blog

Related work:

PolarQuant (AISTATS 2026) — Random rotation + polar coordinate quantization (the core of tqai)
QJL (AAAI 2025) — Quantized Johnson-Lindenstrauss residual correction (available in tqai as use_qjl=True)
KIVI (ICML 2024) — Residual buffer strategy for KV cache compression
KVQuant (NeurIPS 2024) — Fused dequant-attention kernel design
APTQ (2024) — Attention-aware post-training quantization (attention-aware codebook objective)
DSQ (2019) — Differentiable soft quantization (fuzzy codebook solver)
IDE-LBG (2017) — Evolutionary codebook optimization (CMA-ES solver)

Contributing

See CONTRIBUTING.md. All commits require a DCO sign-off (git commit -s).

License

MIT

Release history

Version	Released
0.6.0	Apr 22, 2026
0.5.0	Apr 22, 2026
0.4.1	Apr 22, 2026
0.4.0	Apr 7, 2026
0.3.1 (this version)	Apr 5, 2026
0.2.0	Apr 5, 2026
0.1.0	Apr 4, 2026

Download files

Source Distribution

tqai-0.3.1.tar.gz (79.8 kB)

Uploaded Apr 5, 2026 Source

Built Distribution

tqai-0.3.1-py3-none-any.whl (61.5 kB)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file tqai-0.3.1.tar.gz.

File metadata

Download URL: tqai-0.3.1.tar.gz
Upload date: Apr 5, 2026
Size: 79.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tqai-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`23e9e2a281742d85a8af0a8bae005b2b79ed4d5178366a7a358177cf0a9c5a27`
MD5	`40668dec89ccaf0c2e02b2e139f48fad`
BLAKE2b-256	`dd06610169c99e2e8b88d8ba9757a2b2f4453b55c4a7bf944b24720a330fbd44`
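
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. The sketch below is a generic example; the helper names are illustrative, and the file path in the commented-out call is a placeholder for wherever your copy was saved. The same approach applies to the wheel with its own digest.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large downloads are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_hex):
    """True when the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()

EXPECTED_SDIST = "23e9e2a281742d85a8af0a8bae005b2b79ed4d5178366a7a358177cf0a9c5a27"
# verify("tqai-0.3.1.tar.gz", EXPECTED_SDIST)  # placeholder path to your download
```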

Provenance

The following attestation bundles were made for tqai-0.3.1.tar.gz:

Publisher: release.yml on AlphaWaveSystems/tqai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tqai-0.3.1.tar.gz
- Subject digest: 23e9e2a281742d85a8af0a8bae005b2b79ed4d5178366a7a358177cf0a9c5a27
- Sigstore transparency entry: 1239298308
- Sigstore integration time: Apr 5, 2026
Source repository:
- Permalink: AlphaWaveSystems/tqai@31ca14945d222c481795ee144a2ea663493d664c
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/AlphaWaveSystems
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@31ca14945d222c481795ee144a2ea663493d664c
- Trigger Event: push

File details

Details for the file tqai-0.3.1-py3-none-any.whl.

File metadata

Download URL: tqai-0.3.1-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 61.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tqai-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b0440e2dfb247d01a14e31ad072803cda982526dc882088a25581d3d88c2f18d`
MD5	`72fc938ddcf5e62e2eca8c231a529521`
BLAKE2b-256	`908f3de5e8710955982cd303eb97d7089d011a0b3f30fdb7ce24431a81eed04d`

Provenance

The following attestation bundles were made for tqai-0.3.1-py3-none-any.whl:

Publisher: release.yml on AlphaWaveSystems/tqai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tqai-0.3.1-py3-none-any.whl
- Subject digest: b0440e2dfb247d01a14e31ad072803cda982526dc882088a25581d3d88c2f18d
- Sigstore transparency entry: 1239298310
- Sigstore integration time: Apr 5, 2026
Source repository:
- Permalink: AlphaWaveSystems/tqai@31ca14945d222c481795ee144a2ea663493d664c
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/AlphaWaveSystems
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@31ca14945d222c481795ee144a2ea663493d664c
- Trigger Event: push
