TurboQuant: Extreme compression for AI models with near-optimal distortion rates

These details have not been verified by PyPI

Project links

Project description

TurboQuant Clone

A production-quality implementation of Google's TurboQuant algorithm for extreme compression of AI models.

Overview

TurboQuant is a novel quantization algorithm from Google Research (ICLR 2026) that achieves:

3-bit compression with near-zero accuracy loss
6x memory reduction for KV caches in LLMs
8x speedup on GPU with optimized kernels
Zero memory overhead - no per-block quantization constants needed

This implementation includes:

PolarQuant: Random rotation + polar transformation + Lloyd-Max quantization
QJL: 1-bit Johnson-Lindenstrauss transform with asymmetric estimator
TurboQuant: Combined two-stage compression with error correction
Full KV cache integration for transformer models
Comprehensive benchmarking suite

Key Features

Core Algorithms

PolarQuant - Converts pairs of coordinates to polar coordinates and quantizes angles
- Eliminates normalization overhead
- Achieves 4.2x+ compression ratio
- Near-optimal distortion rates
QJL (Quantized JL) - 1-bit quantization with zero memory overhead
- Unbiased inner product estimation
- Perfect for attention mechanisms
- Error correction for residuals
TurboQuant - Two-stage compression combining both methods
- Stage 1: PolarQuant for main compression
- Stage 2: QJL on residual for error correction
- 3-bit compression with quality neutrality

Performance

Bit-width    Compression    MSE          Cosine Similarity
3-bit        4.9x           0.715        0.64
4-bit        3.8x           0.193        0.90

Memory Savings (70B model, 72GB VRAM)

Configuration	Memory	Max Context
FP16	3.05 GB	1,037,597 tokens
TurboQuant 3-bit	0.62 GB	5,108,173 tokens
Savings	4.9x	4.9x

Installation

cd turboquant-clone
python -c "import turboquant; print('TurboQuant installed successfully!')"

Quick Start

Basic Compression

import numpy as np
from turboquant import TurboQuant, TurboQuantConfig

# Create sample data
x = np.random.randn(1, 8, 1024, 128).astype(np.float32)

# Configure TurboQuant
config = TurboQuantConfig(
    bit_width=3,           # 3-bit compression
    block_size=128,        # Block size
    use_qjl=True,          # Enable QJL error correction
    use_polar=True,        # Enable PolarQuant
)

# Create quantizer
tq = TurboQuant(config)

# Quantize
quantized = tq.quantize(x)

# Dequantize
reconstructed = tq.dequantize(quantized)

# Check quality
mse = np.mean((x - reconstructed) ** 2)
print(f"MSE: {mse:.6f}")

KV Cache Compression

from turboquant.kv_cache import KVCacheCompressor

# Create compressor
compressor = KVCacheCompressor(bit_width=3, block_size=128)

# During inference, compress KV cache
key_states = np.random.randn(1, 32, 1000, 128).astype(np.float32)
value_states = np.random.randn(1, 32, 1000, 128).astype(np.float32)

k_comp, v_comp = compressor.compress_kv(key_states, value_states)

# Later, decompress for attention
k_full = compressor.decompress_k(k_comp)
v_full = compressor.decompress_v(v_comp)

# Check memory savings
stats = compressor.compute_memory_stats(
    seq_len=100000,
    batch_size=1,
    n_heads=32,
    head_dim=128,
)
print(f"Compression ratio: {stats['compression_ratio']:.2f}x")
print(f"Memory saved: {stats['memory_saved_gb']:.2f} GB")

Demo

Run the demonstration script:

python demo.py

This will show:

Basic compression with different bit-widths
KV cache memory analysis
Performance benchmarks

Algorithm Details

PolarQuant Algorithm

Random Rotation: Apply random orthogonal transformation to spread information evenly
Pairwise Polar Transform: Convert (x, y) pairs to (radius, angle)
Angle Quantization: Quantize angles using Lloyd-Max optimal codebook
Efficient Storage: Store norms + quantized indices (no per-block scales)

QJL Algorithm

JL Projection: Project to lower dimension using random Gaussian matrix
Sign Quantization: Keep only sign bit (+1/-1) of each projected value
Asymmetric Estimator: Use different reconstruction for queries vs keys
Unbiased IP: Inner products remain unbiased for attention computation

TurboQuant Combination

Original Vector:     x
↓ Normalize:         x̂ = x / ||x||
↓ PolarQuant:        q_polar = quantize_polar(x̂)
↓ Residual:          r = x̂ - dequantize_polar(q_polar)
↓ QJL:               q_qjl = sign(JL(r))

Compressed:          (||x||, q_polar, q_qjl)

Project Structure

turboquant-clone/
├── turboquant/
│   ├── __init__.py          # Main exports
│   ├── turboquant.py        # TurboQuant implementation
│   ├── polarquant.py        # PolarQuant implementation
│   ├── qjl.py               # QJL implementation
│   ├── transforms.py        # Walsh-Hadamard & polar transforms
│   ├── codebooks.py         # Lloyd-Max codebook generation
│   ├── utils.py             # Bit-packing & utilities
│   └── kv_cache.py          # KV cache integration
├── demo.py                  # Demonstration script
└── README.md               # This file

Implementation Notes

Improvements Over Paper

This implementation includes several enhancements:

Simplified Polar Transform: Uses pairwise transformation instead of full recursive
Optimized Codebooks: Pre-computed Lloyd-Max centroids for common dimensions
Flexible Block Sizes: Supports 32, 64, and 128-dimensional blocks
KV Cache Utilities: Full streaming and batch compression support

Limitations

Byte compression (compress/decompress) is not fully implemented - use quantize/dequantize
SIMD optimizations (AVX2, AVX-512, NEON) not yet implemented
CUDA kernels not yet implemented
PyTorch integration not yet implemented

References

Papers

TurboQuant (Zandieh et al., ICLR 2026)
- "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate"
- https://arxiv.org/abs/2504.19874
QJL (Zandieh et al., 2024)
- "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead"
- https://arxiv.org/abs/2406.03482
PolarQuant (Han et al., AISTATS 2026)
- "PolarQuant: Quantizing KV Caches with Polar Transformation"
- https://arxiv.org/abs/2502.02617

Related Work

GPTQ: Post-training quantization with Hessian-based error compensation
AWQ: Activation-aware weight quantization
SmoothQuant: Migrating quantization difficulty from activations to weights
QLoRA: 4-bit NormalFloat quantization for fine-tuning

License

This is a research implementation for educational purposes. Please cite the original TurboQuant paper if you use this in your research.

Citation

@inproceedings{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle={ICLR},
  year={2025}
}

Future Work

SIMD optimizations (AVX2, AVX-512, NEON)
CUDA kernels for GPU acceleration
PyTorch integration with autograd support
Integration with transformers library
Support for model weight quantization (not just KV cache)
Streaming compression for long sequences
Adaptive bit-width selection per layer

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.2

Apr 13, 2026

0.1.1

Apr 13, 2026

This version

0.1.0

Apr 13, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastvq-0.1.0.tar.gz (285.1 kB view details)

Uploaded Apr 13, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

fastvq-0.1.0-py3-none-any.whl (55.9 kB view details)

Uploaded Apr 13, 2026 Python 3

File details

Details for the file fastvq-0.1.0.tar.gz.

File metadata

Download URL: fastvq-0.1.0.tar.gz
Upload date: Apr 13, 2026
Size: 285.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for fastvq-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`318176cac01a01b77eea2e09d575f039f6cb5a47938361e5f541f592b40593f1`
MD5	`71ca12e4b8fe51a4d726ee0bc5d6f584`
BLAKE2b-256	`c625170498768f04227b646563cab51ddea640d8fad6dbf898656aab26cdde68`

See more details on using hashes here.

File details

Details for the file fastvq-0.1.0-py3-none-any.whl.

File metadata

Download URL: fastvq-0.1.0-py3-none-any.whl
Upload date: Apr 13, 2026
Size: 55.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for fastvq-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d67ab08781a73667f5438091a3e6b8baec1bfb4de49559b7aa47b0b97e1de060`
MD5	`1d7a43a148b450fbe89245046bc8147f`
BLAKE2b-256	`47e7d490efdfc181a690330668bb03a34165492792c986ff3e64c56f5b3b5796`

See more details on using hashes here.

fastvq 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

TurboQuant Clone

Overview

Key Features

Core Algorithms

Performance

Memory Savings (70B model, 72GB VRAM)

Installation

Quick Start

Basic Compression

KV Cache Compression

Demo

Algorithm Details

PolarQuant Algorithm

QJL Algorithm

TurboQuant Combination

Project Structure

Implementation Notes

Improvements Over Paper

Limitations

References

Papers

Related Work

License

Citation

Future Work

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes