Python implementation of the TurboQuant and QJL vector quantization algorithms
turboquant-py
turboquant-py implements the TurboQuant and QJL vector quantization algorithms from Google Research (ICLR 2026 / AISTATS 2026). It compresses high-dimensional floating-point vectors to 1-4 bits per coordinate while preserving inner products and distances with provably near-optimal distortion. The library offers two quantization modes — MSE mode for reconstruction fidelity and inner-product mode for unbiased similarity search — plus a standalone 1-bit QJL quantizer, all built on a NumPy-first core with optional PyTorch acceleration.
Installation
```bash
pip install turboquant-py

# With PyTorch acceleration (optional; supports CUDA and Apple Silicon MPS)
pip install "turboquant-py[torch]"

# From source
pip install -e .
pip install -e ".[torch]"
```
Quick Start
TurboQuant MSE mode
```python
import numpy as np
from turboquant import CompressedVectors, TurboQuant

vectors = np.random.randn(1000, 384)  # 1000 vectors, dim=384
tq = TurboQuant(dim=384, bit_width=2, mode="mse", seed=42)
compressed = tq.quantize(vectors)
reconstructed = tq.dequantize(compressed)
mse = float(np.mean((vectors - reconstructed) ** 2))
print(f"Reconstruction MSE: {mse:.6f}")

# Save and reload
compressed.save("my_index")
reloaded = CompressedVectors.load("my_index")
```
TurboQuant inner product mode
```python
import numpy as np
from turboquant import TurboQuant

db = np.random.randn(10000, 768)
query = np.random.randn(768)
tq = TurboQuant(dim=768, bit_width=3, seed=42)  # inner_product mode is the default
compressed = tq.quantize(db)

# Estimate inner products against all compressed database vectors
scores = tq.inner_product(query, compressed)
top10 = np.argsort(scores)[::-1][:10]
print(f"Top-10 indices: {top10}")
```
QJL 1-bit quantization
```python
import numpy as np
from turboquant import QJL

db = np.random.randn(10000, 1536)
query = np.random.randn(1536)
qjl = QJL(dim=1536, seed=42)
compressed = qjl.quantize(db)
scores = qjl.inner_product(query, compressed)
top10 = np.argsort(scores)[::-1][:10]
print(f"Top-10 indices: {top10}")
```
API Reference
TurboQuant(dim, bit_width, mode, seed, outlier_channels, outlier_bit_width)
Multi-bit vector quantizer supporting MSE and inner-product modes.
| Parameter | Type | Default | Description |
|---|---|---|---|
| dim | int | required | Input vector dimensionality |
| bit_width | int | required | Bits per coordinate (1-4; inner-product mode requires 2+) |
| mode | str | "inner_product" | "mse" for reconstruction quality; "inner_product" for unbiased similarity search |
| seed | int \| None | None | Random seed for the rotation matrix and QJL projection |
| outlier_channels | int | 0 | Number of high-magnitude channels to quantize at higher precision |
| outlier_bit_width | int \| None | None | Bit-width for outlier channels |
Methods:
- quantize(vectors): compress an (n, dim) array; returns CompressedVectors
- dequantize(compressed): reconstruct approximate originals; returns an (n, dim) array
- inner_product(query, compressed): estimate inner products; returns (n,) scores
- quantize_batched(vectors, batch_size, output_path, entropy_encode): stream-quantize large collections to disk
QJL(dim, projection_dim, seed)
1-bit quantizer using the Quantized Johnson-Lindenstrauss transform. Applies a random Gaussian projection followed by sign-bit quantization to produce an unbiased inner-product estimator with no reconstruction capability.
| Parameter | Type | Default | Description |
|---|---|---|---|
| dim | int | required | Input vector dimensionality |
| projection_dim | int \| None | dim | Projection dimension (must be <= dim); higher values improve accuracy at the cost of storage |
| seed | int \| None | None | Random seed for the projection matrix |
Methods:
- quantize(vectors): compress to 1-bit sign representation; returns CompressedVectors
- inner_product(query, compressed): estimate inner products; returns (n,) scores
CompressedVectors
In-memory container for quantized vectors. Holds the bit-packed quantization indices, per-vector L2 norms, and any auxiliary arrays (QJL signs, residual norms for inner-product mode). Supports slicing with [start:end] and merging with CompressedVectors.concatenate(parts).
Save/load:
```python
compressed.save("path/to/dir")                       # bit-packed on disk
compressed.save("path/to/dir", entropy_encode=True)  # Huffman-encoded (~5% smaller at 4-bit)
loaded = CompressedVectors.load("path/to/dir")
loaded = CompressedVectors.load("path/to/dir", mmap_mode="r")  # memory-mapped
```
Key attributes: indices, norms, dim, bit_width, num_vectors, metadata, extra_arrays
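To make "bit-packed" concrete, here is a minimal sketch that packs 2-bit indices four to a byte with NumPy. The helpers pack_indices and unpack_indices are hypothetical and the bit order is an assumption; the library's actual in-memory and on-disk layout may differ.

```python
import numpy as np

def pack_indices(indices: np.ndarray, bit_width: int) -> np.ndarray:
    """Pack small unsigned integers into a flat uint8 array (hypothetical helper)."""
    flat = np.asarray(indices, dtype=np.uint8).ravel()
    # Expand each index into its bit_width binary digits, most significant first.
    shifts = np.arange(bit_width - 1, -1, -1, dtype=np.uint8)
    bits = (flat[:, None] >> shifts) & 1
    return np.packbits(bits.ravel())

def unpack_indices(packed: np.ndarray, bit_width: int, count: int) -> np.ndarray:
    """Inverse of pack_indices; count is the number of indices originally packed."""
    bits = np.unpackbits(packed)[: count * bit_width].reshape(count, bit_width)
    weights = 1 << np.arange(bit_width - 1, -1, -1)
    return (bits * weights).sum(axis=1).astype(np.uint8)

rng = np.random.default_rng(0)
idx = rng.integers(0, 4, size=(3, 8))  # 2-bit indices for 3 vectors of dim 8
packed = pack_indices(idx, bit_width=2)
restored = unpack_indices(packed, bit_width=2, count=idx.size).reshape(idx.shape)
print(packed.nbytes, "bytes for", idx.size, "indices")
```

The 24 indices occupy 6 bytes instead of the 24 a one-byte-per-index array would need.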
CompressedStore
On-disk vector store backed by memory-mapped arrays. Reconstructs the original quantizer from saved metadata (using the stored seed to regenerate rotation/projection matrices) and supports brute-force top-k search without loading all vectors into RAM.
```python
store = CompressedStore.load("path/to/dir")
results = store.search(query, k=10)  # returns list[tuple[int, float]], sorted descending by score
```
Supports all modes: MSE, inner-product, QJL, and outlier configurations are all reconstructed automatically from the saved metadata.
Properties: dim, num_vectors, bit_width, mode, vectors
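For readers curious about what brute-force top-k involves once scores are available, the sketch below selects the k best scores and returns them in the same (index, score) shape that search is documented to return. The top_k helper is hypothetical and is not the library's implementation.

```python
import numpy as np

def top_k(scores: np.ndarray, k: int) -> list[tuple[int, float]]:
    """Return the k highest-scoring (index, score) pairs, best first."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return []
    # argpartition finds the k largest in linear time; only those k are sorted.
    candidates = np.argpartition(scores, -k)[-k:]
    order = candidates[np.argsort(scores[candidates])[::-1]]
    return [(int(i), float(scores[i])) for i in order]

scores = np.array([0.1, 0.9, -0.3, 0.5, 0.7])
results = top_k(scores, k=3)
print(results)
```

Partial selection avoids a full sort of all scores, which matters when the database is much larger than k.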
compute_codebook(dim, bit_width) / get_codebook(dim, bit_width)
compute_codebook runs Lloyd-Max optimization on the Beta distribution that describes coordinates of randomly rotated unit vectors, returning (centroids, boundaries) arrays of sizes 2^bit_width and 2^bit_width + 1.
get_codebook is a cached wrapper around compute_codebook (via lru_cache). Precomputed codebooks ship for common dimensions (64-4096); codebooks for other dimensions are computed on the fly.
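As a rough illustration of what Lloyd-Max optimization on this distribution looks like, the sketch below iterates on a discretized density. It assumes the standard marginal density of one coordinate of a uniformly random unit vector, proportional to (1 - x^2)^((dim - 3) / 2) on [-1, 1], and needs dim >= 3. The function lloyd_max_sphere, its grid size, and its iteration count are hypothetical choices, not the library's implementation.

```python
import numpy as np

def lloyd_max_sphere(dim: int, bit_width: int, grid: int = 20001, iters: int = 200):
    """Lloyd-Max codebook for one coordinate of a uniform random unit vector."""
    x = np.linspace(-1.0, 1.0, grid)
    # Unnormalized marginal density of a coordinate on the unit sphere in R^dim.
    pdf = (1.0 - x**2) ** ((dim - 3) / 2.0)
    pdf /= pdf.sum()
    levels = 2**bit_width
    # Start from evenly spaced centroids within about 3 standard deviations.
    sigma = 1.0 / np.sqrt(dim)
    centroids = np.linspace(-3 * sigma, 3 * sigma, levels)
    for _ in range(iters):
        # Interior boundaries are midpoints between neighboring centroids.
        inner = (centroids[:-1] + centroids[1:]) / 2.0
        cell = np.searchsorted(inner, x)
        # Each centroid moves to the conditional mean of its cell.
        mass = np.bincount(cell, weights=pdf, minlength=levels)
        mean = np.bincount(cell, weights=pdf * x, minlength=levels)
        centroids = np.where(mass > 0, mean / np.maximum(mass, 1e-300), centroids)
    inner = (centroids[:-1] + centroids[1:]) / 2.0
    boundaries = np.concatenate(([-1.0], inner, [1.0]))
    return centroids, boundaries

centroids, boundaries = lloyd_max_sphere(dim=384, bit_width=2)
print(centroids * np.sqrt(384))  # rescaled to unit variance for readability
```

The returned arrays have 2^bit_width and 2^bit_width + 1 entries, matching the sizes described above. For large dim the rescaled centroids approach the classical Gaussian Lloyd-Max levels.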
compute_theoretical_savings(dim, bit_width)
Compute the theoretical entropy and Huffman coding savings for a given configuration, as described in Section 3.1 of the TurboQuant paper.
```python
from turboquant import compute_theoretical_savings

savings = compute_theoretical_savings(dim=256, bit_width=4)
# {'entropy': 3.742, 'avg_bits_huffman': 3.779, 'savings_pct': 5.5}
```
Storage
On-disk format uses a directory containing:
| File | Contents |
|---|---|
| meta.json | Dimensions, bit-width, mode, seed, outlier config, encoding flags |
| indices.npy | Bit-packed quantization indices (or indices.huffman if entropy-encoded) |
| norms.npy | Per-vector L2 norms |
| huffman_table.json | Huffman coding table (only when entropy-encoded) |
| Additional .npy files | QJL sign vectors, residual norms, outlier indices (mode-dependent) |
Rotation and projection matrices are not stored — they are reconstructed deterministically from the saved seed, reducing storage overhead.
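The idea of regenerating a matrix from a seed can be sketched as follows. This is one common construction (QR decomposition of a seeded Gaussian matrix); the helper rotation_from_seed is hypothetical, and whether the library uses this exact construction is an assumption.

```python
import numpy as np

def rotation_from_seed(dim: int, seed: int) -> np.ndarray:
    """Sample a random orthogonal matrix that depends only on (dim, seed)."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((dim, dim))
    q, r = np.linalg.qr(gaussian)
    # Fix column signs so the result is uniformly (Haar) distributed.
    return q * np.sign(np.diag(r))

first = rotation_from_seed(64, seed=42)
second = rotation_from_seed(64, seed=42)
print(np.array_equal(first, second))
```

Storing only the seed replaces dim x dim floats with a single integer. The trade-off is that reproducibility then depends on the random generator and linear algebra backend producing the same values on the loading side.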
Entropy encoding
Indices can optionally be Huffman-encoded when saving, providing lossless compression. Savings are measured vs fixed-width encoding (not vs bit-packing):
| Bit-width | Shannon entropy | Huffman avg bits | Savings |
|---|---|---|---|
| 1 | 1.000 | 1.000 | 0.0% |
| 2 | 1.911 | 1.989 | 0.5% |
| 3 | 2.819 | 2.876 | 4.1% |
| 4 | 3.742 | 3.779 | 5.5% |
```python
compressed.save("my_index", entropy_encode=True)
# Loading is automatic: the encoding is detected from metadata
loaded = CompressedVectors.load("my_index")
```
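The relationship between the entropy, Huffman, and savings columns can be reproduced with a few lines. The sketch below computes Shannon entropy and the expected Huffman code length for a symbol distribution; the probabilities are made up for illustration and are not the library's codebook cell probabilities.

```python
import heapq
import numpy as np

def entropy_bits(probs: np.ndarray) -> float:
    """Shannon entropy in bits per symbol."""
    p = probs[probs > 0]
    return float(-(p * np.log2(p)).sum())

def huffman_avg_bits(probs: np.ndarray) -> float:
    """Expected code length of a Huffman code for the given probabilities."""
    heap = [(float(p), i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    total, counter = 0.0, len(heap)
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        # Each merge adds one bit to every symbol beneath the merged node.
        total += p1 + p2
        heapq.heappush(heap, (p1 + p2, counter))
        counter += 1
    return total

probs = np.array([0.4, 0.3, 0.2, 0.1])  # illustrative 2-bit index frequencies
fixed_bits = 2
avg = huffman_avg_bits(probs)
savings_pct = 100.0 * (fixed_bits - avg) / fixed_bits
print(f"entropy={entropy_bits(probs):.3f} huffman={avg:.3f} savings={savings_pct:.1f}%")
```

The Huffman average always lies between the entropy and the entropy plus one bit, which is why the savings shrink toward zero when the index distribution is close to uniform.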
Supported Bit-Widths
Results on synthetic unit vectors (dim=768, n=1000).
| Bit-width | Compression ratio | Reconstruction MSE | Recall@1 (MSE mode) | Recall@10 (MSE mode) |
|---|---|---|---|---|
| 1 | 32x | 0.000473 | 0.19 | 0.65 |
| 2 | 16x | 0.000153 | 0.46 | 0.91 |
| 3 | 10.7x | 0.000045 | 0.67 | 0.99 |
| 4 | 8x | 0.000013 | 0.73 | 1.00 |
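The compression ratios in the table match 32 / bit_width, which suggests a float32 baseline. The sketch below reproduces that arithmetic and also shows the effect of storing one norm per vector; the 32-bit norm size is an assumption, and both helper functions are hypothetical.

```python
FLOAT32_BITS = 32

def nominal_ratio(bit_width: int) -> float:
    """Compression ratio counting only the quantized coordinates."""
    return FLOAT32_BITS / bit_width

def ratio_with_norm(bit_width: int, dim: int, norm_bits: int = 32) -> float:
    """Compression ratio when each vector also stores one norm (assumed float32)."""
    return FLOAT32_BITS * dim / (bit_width * dim + norm_bits)

for b in (1, 2, 3, 4):
    print(b, round(nominal_ratio(b), 1), round(ratio_with_norm(b, dim=768), 2))
```

At dim=768 the per-vector norm costs little: the 1-bit ratio drops from 32x to about 30.7x, and the gap narrows at higher bit-widths.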
Benchmarks
All results use all-MiniLM-L6-v2 embeddings (dim=384, n=2000 database, n=200 queries, seed=42).
MSE distortion and recall on neural embeddings
| Method | Bit-width | MSE | Recall@1 | Recall@10 |
|---|---|---|---|---|
| NaiveUniform | 2 | 0.001079 | 0.675 | 0.721 |
| NaiveUniform | 3 | 0.000195 | 0.830 | 0.850 |
| NaiveUniform | 4 | 0.000042 | 0.895 | 0.913 |
| TurboQuant-mse | 2 | 0.000305 | 0.755 | 0.817 |
| TurboQuant-mse | 3 | 0.000090 | 0.895 | 0.878 |
| TurboQuant-mse | 4 | 0.000025 | 0.895 | 0.918 |
| TurboQuant-inner_product | 2 | 0.000946 | 0.605 | 0.713 |
| TurboQuant-inner_product | 3 | 0.000305 | 0.760 | 0.820 |
| TurboQuant-inner_product | 4 | 0.000090 | 0.895 | 0.881 |
| QJL (1-bit) | 1 | n/a | 0.580 | 0.703 |
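TurboQuant-mse achieves 3.5x lower MSE than naive uniform quantization at 2 bits (0.000305 vs. 0.001079). At 2 bits, TurboQuant-inner_product is close to QJL in recall (Recall@1 of 0.605 vs. 0.580); at 4 bits it is well ahead (0.895 vs. 0.580), since it stores MSE quantization indices alongside the QJL bits for the residual, which enables a better approximation.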
Comparison vs. uniform quantization at 2 bits (dim=384)
| Method | MSE | Recall@1 | Recall@10 |
|---|---|---|---|
| NaiveUniform | 0.000856 | 0.30 | 0.80 |
| PerChannelUniform | 0.001032 | 0.29 | 0.82 |
| RandProj+Uniform | 0.001027 | 0.28 | 0.76 |
| TurboQuant-mse | 0.000304 | 0.44 | 0.95 |
TurboQuant-mse reduces MSE by 2.8x over NaiveUniform and improves Recall@10 from 80% to 95% at the same 2-bit budget.
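The exact recall definition used for these tables is not stated here. One common definition, the mean overlap between the exact and approximate top-k sets, can be sketched as follows; the recall_at_k helper and the noisy scores standing in for a quantizer are illustrative assumptions.

```python
import numpy as np

def recall_at_k(true_scores: np.ndarray, approx_scores: np.ndarray, k: int) -> float:
    """Mean overlap between exact and approximate top-k sets, one row per query."""
    true_top = np.argsort(-true_scores, axis=1)[:, :k]
    approx_top = np.argsort(-approx_scores, axis=1)[:, :k]
    hits = [len(set(t) & set(a)) for t, a in zip(true_top, approx_top)]
    return float(np.mean(hits)) / k

rng = np.random.default_rng(0)
db = rng.standard_normal((500, 32))
queries = rng.standard_normal((20, 32))
exact = queries @ db.T
# Additive noise stands in for quantization error in this toy example.
noisy = exact + 0.5 * rng.standard_normal(exact.shape)
print(recall_at_k(exact, exact, 10), recall_at_k(exact, noisy, 10))
```

Under this definition Recall@10 can be lower than Recall@1, because every one of the ten exact neighbors has to survive quantization to count.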
Acceleration
All core operations are implemented in NumPy. When PyTorch is installed, matrix operations (rotation, projection, batch inner products) dispatch to PyTorch tensors automatically.
| Backend | Device detection | Notes |
|---|---|---|
| NumPy (default) | Always available | Fastest on CPU for most workloads |
| PyTorch + CUDA | Auto-detected | Best for large-scale GPU workloads |
| PyTorch + MPS | Auto-detected | Apple Silicon GPU; uses float32 internally (MPS does not support float64) |
On CPU, NumPy and PyTorch perform similarly. GPU acceleration benefits primarily the matrix operations (rotation, projection); the scalar quantization is already vectorized in NumPy.
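Automatic backend selection of this kind can be sketched with PyTorch's public availability checks. The pick_device helper is hypothetical and only illustrates the detection order in the table; the library's own dispatch logic may differ.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what is usable."""
    try:
        import torch
    except ImportError:
        return "cpu"  # NumPy-only path when PyTorch is not installed
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
# MPS does not support float64, so fall back to float32 there.
dtype_name = "float32" if device == "mps" else "float64"
print(device, dtype_name)
```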
How It Works
TurboQuant MSE mode applies a random orthogonal rotation to each input vector before scalar quantization. Because the coordinates of a randomly rotated unit vector follow a known Beta distribution, a Lloyd-Max codebook can be computed for that distribution ahead of time rather than estimated from data. A Lloyd-Max codebook minimizes mean squared error among scalar quantizers with the same number of levels for the distribution it was designed for, and the rotation ensures the coordinates match that distribution. Rotating back after dequantization reconstructs the original vector with near-optimal MSE.
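The rotate, quantize, and rotate-back pipeline can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the library's code: it uses the classical 2-bit Lloyd-Max levels for a standard Gaussian, rescaled to variance 1/dim, as a stand-in for the exact Beta codebook, and the quantize and dequantize functions here are local to the example.

```python
import numpy as np

dim = 128
rng = np.random.default_rng(0)

# Random orthogonal rotation from the QR decomposition of a Gaussian matrix.
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

# Assumption: Gaussian 2-bit Lloyd-Max centroids, rescaled to variance 1/dim.
centroids = np.array([-1.510, -0.4528, 0.4528, 1.510]) / np.sqrt(dim)
boundaries = (centroids[:-1] + centroids[1:]) / 2.0

def quantize(vectors):
    """Normalize, rotate, and map each coordinate to its nearest centroid index."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    rotated = (vectors / norms) @ rotation.T
    return np.searchsorted(boundaries, rotated).astype(np.uint8), norms

def dequantize(indices, norms):
    """Look up centroids, rotate back, and restore the stored norms."""
    return (centroids[indices] @ rotation) * norms

vectors = rng.standard_normal((200, dim))
indices, norms = quantize(vectors)
reconstructed = dequantize(indices, norms)
rel_err = np.sum((vectors - reconstructed) ** 2) / np.sum(vectors**2)
print(f"relative squared error: {rel_err:.3f}")
```

The relative error should land near 0.12, the distortion of a 4-level Lloyd-Max quantizer on a unit-variance Gaussian source.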
TurboQuant inner-product mode extends MSE quantization to produce an unbiased inner-product estimator. It quantizes each vector at (b-1) bits using the MSE codebook, computes the residual between the original and the MSE reconstruction, and then applies the QJL transform (random Gaussian projection followed by sign extraction) to that residual. The stored representation consists of the MSE quantization indices plus the sign bits and L2 norm of the residual. At query time, the estimate is the sum of the MSE dot product and a QJL correction term scaled by sqrt(pi/2) / d, which cancels, in expectation, the bias introduced by the MSE quantizer. Spending one bit on QJL for the residual and (b-1) bits on MSE therefore gives an unbiased inner-product estimate, which spending all b bits on MSE quantization alone does not, at the cost of the higher reconstruction error seen in the Benchmarks table.
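The bias and its correction can be seen in a small experiment. The sketch below is an illustration under assumptions, not the library's code: it uses a 1-bit MSE stage (the b = 2 case) with a Gaussian-approximate centroid, a square Gaussian projection, and queries each vector with itself so that the exact inner product is the squared norm.

```python
import numpy as np

dim, n = 128, 2000
rng = np.random.default_rng(0)
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
projection = rng.standard_normal((dim, dim))  # QJL matrix S, with m = dim rows

def mse_reconstruct(x):
    """1-bit MSE stage, i.e. the (b-1)-bit part of a b = 2 configuration."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    rotated = (x / norms) @ rotation.T
    # Assumption: Gaussian-approximate 1-bit centroid, sqrt(2 / pi) / sqrt(dim).
    centroid = np.sqrt(2.0 / np.pi) / np.sqrt(dim)
    return (np.sign(rotated) * centroid) @ rotation * norms

keys = rng.standard_normal((n, dim))
x_hat = mse_reconstruct(keys)
residual = keys - x_hat
signs = np.sign(residual @ projection.T)  # the extra bit per coordinate
residual_norms = np.linalg.norm(residual, axis=1)

# Query each key with itself, so the exact inner product is ||x||^2.
queries = keys
exact = np.sum(queries * keys, axis=1)
mse_only = np.sum(queries * x_hat, axis=1)
correction = (np.sqrt(np.pi / 2.0) / dim) * residual_norms * np.sum(
    (queries @ projection.T) * signs, axis=1
)
ratio_mse = float(np.mean(mse_only / exact))
ratio_full = float(np.mean((mse_only + correction) / exact))
print(f"MSE stage only: {ratio_mse:.3f}, with QJL correction: {ratio_full:.3f}")
```

The MSE stage alone systematically underestimates the inner product (a ratio near 2/pi in this 1-bit setup), while the corrected estimate averages out close to 1. Individual corrected estimates are still noisy; only their mean is unbiased.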
QJL (Quantized Johnson-Lindenstrauss) is a 1-bit scheme that projects each key vector through a random Gaussian matrix S and stores only the sign vector sign(S·k) along with the vector norm. For a query q, the inner product is estimated as sqrt(pi/2) / m * ||k|| * <S·q, sign(S·k)>, where m is the projection dimension. This estimator is unbiased and requires only 1 bit per projected coordinate, making it suitable for extreme compression where reconstruction is not needed.
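The estimator formula translates directly into NumPy. The sketch below is a plain reference illustration, not the library's implementation; the functions qjl_quantize and qjl_inner_product are hypothetical names, and the projection dimension is set equal to the input dimension. To make the unbiasedness visible, it averages the estimate over many independent draws of S.

```python
import numpy as np

def qjl_quantize(keys, projection):
    """Store only sign(S k) and ||k|| for each key."""
    signs = np.sign(keys @ projection.T).astype(np.int8)
    return signs, np.linalg.norm(keys, axis=1)

def qjl_inner_product(query, signs, norms, projection):
    """sqrt(pi/2) / m * ||k|| * <S q, sign(S k)> for every stored key."""
    m = projection.shape[0]
    return np.sqrt(np.pi / 2.0) / m * norms * (signs @ (projection @ query))

rng = np.random.default_rng(0)
dim = 256
keys = rng.standard_normal((100, dim))
query = rng.standard_normal(dim)
exact = keys @ query

# A single draw of S is noisy; the mean over independent draws approaches
# the exact inner products, which is what unbiasedness means in practice.
draws = []
for _ in range(200):
    projection = rng.standard_normal((dim, dim))  # m = dim
    signs, norms = qjl_quantize(keys, projection)
    draws.append(qjl_inner_product(query, signs, norms, projection))
draws = np.array(draws)

corr_single = np.corrcoef(draws[0], exact)[0, 1]
corr_mean = np.corrcoef(draws.mean(axis=0), exact)[0, 1]
print(f"one draw: {corr_single:.3f}, mean of 200 draws: {corr_mean:.3f}")
```

In real use only one projection is drawn, so the single-draw correlation is the relevant accuracy figure; the averaged result only demonstrates that the noise has zero mean.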
References
- TurboQuant: "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", arXiv:2504.19874
- QJL: Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead" (AISTATS 2026) — arXiv:2406.03482
- PolarQuant: arXiv:2502.02617