turbo-quant-lite
Numpy-only vector quantization based on Google's TurboQuant algorithm. Compresses float32 vectors to 1-4 bit indices with near-optimal quality. No PyTorch, no CUDA, no model dependencies.
from turbo_quant_lite import TurboQuant
tq = TurboQuant(dim=768, bits=4)
indices, norm = tq.encode(embedding) # 3072 bytes → 388 bytes once packed
restored = tq.decode(indices, norm) # < 1.1% MSE distortion
Why this exists
There are two existing TurboQuant implementations on PyPI:
- turboquant: PyTorch-based, focused on LLM KV cache compression. Full HuggingFace integration and GPU support. Requires PyTorch (~2 GB install).
- turboquant-vectors: Numpy-only, focused on batch vector compression and embedding privacy. Includes a PrivateEncoder for protecting embeddings against inversion attacks. Designed for the "compress a collection, save to disk, search the collection" workflow.
This package fills a different niche: per-vector compression for database storage. It is for applications that store embeddings row by row in PostgreSQL, SQLite, or Redis and need to compress each vector into a compact binary blob (388 bytes at 4-bit, dim=768) that fits in a bytea column or a cache value.
Key differences from turboquant-vectors:
- Per-vector binary serialization: pack()/unpack() produce compact bytes for database row storage. No file I/O required.
- Zero per-vector overhead: the quantizer is shared (initialized once), and the compressed data is just indices + norm. No 2MB wrapper per vector.
- Direct single-vector similarity: similarity(query, indices, norm) works on raw indices without wrapping in a collection object.
Same algorithm, same quality, designed for the database storage use case.
For embedding privacy (protecting against inversion attacks like Vec2Text), see the PrivateEncoder in turboquant-vectors. You can apply a secret rotation before compression; the two compose naturally as separate layers.
Install
pip install turbo-quant-lite
Or just copy turbo_quant_lite/core.py into your project. It's one file.
What is TurboQuant?
TurboQuant is a data-oblivious vector quantization algorithm from Google Research. It compresses vectors without needing training data or calibration — it works instantly on any vector from any source.
The key insight: randomly rotate a vector and each coordinate becomes approximately Gaussian with known variance. Since the distribution is known in advance, you can precompute the optimal quantization grid. This turns a hard problem (data-dependent codebook learning) into a table lookup.
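The effect described above can be checked numerically. The following is a minimal sketch, not the package's internals: `random_rotation` is a hypothetical helper that builds a random orthogonal matrix from a QR decomposition, which is an assumption about how such a rotation could be generated. It rotates a deliberately non-Gaussian unit vector and shows that the coordinate variance lands near 1/dim.

```python
import numpy as np

def random_rotation(dim: int, seed: int) -> np.ndarray:
    """Hypothetical helper: random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs using diag(r) so the matrix is uniformly distributed.
    return q * np.sign(np.diag(r))

dim = 768
rotation = random_rotation(dim, seed=0)

# A deliberately non-Gaussian input: all energy in a single coordinate.
x = np.zeros(dim)
x[0] = 1.0

y = rotation @ x  # rotated unit vector

# The norm is preserved, and the energy is now spread across coordinates
# with variance close to 1/dim, whatever the input looked like.
rotated_norm = float(np.linalg.norm(y))
coord_variance = float(np.var(y))
print(rotated_norm, coord_variance, 1.0 / dim)
```

Because the post-rotation variance depends only on the dimension and the norm, a quantization grid can be fixed ahead of time and reused for every vector.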
Results:
- 4-bit: ~8x compression, < 1.1% MSE distortion
- 3-bit: ~10x compression, < 4.3% MSE distortion
- 2-bit: ~16x compression, < 17% MSE distortion
MSE distortion is within a factor of about 2.72 of the information-theoretic optimum (Shannon lower bound) at every bit width. This bound is provable and data-independent: it holds in expectation over the random rotation for any input vector, not just your benchmark.
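The distortion figures listed above follow from that constant. A short arithmetic sketch, assuming the paper's closed form sqrt(3)·π/2 for the 2.72 factor and a lower bound of 4^(-bits) on relative MSE (both taken from the TurboQuant paper rather than from this package):

```python
import math

# Assumed closed form for the "2.72x" constant quoted in the text.
GAP = math.sqrt(3) * math.pi / 2

def mse_upper_bound(bits: int) -> float:
    """Upper bound on relative MSE, assuming a 4**-bits lower bound."""
    return GAP * 4.0 ** (-bits)

for bits in (2, 3, 4):
    print(f"{bits}-bit: relative MSE bound {100 * mse_upper_bound(bits):.2f}%")
```

Each extra bit divides the bound by four, which is why the step from 3-bit to 4-bit buys a large quality gain for a modest size increase.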
Paper: Zandieh, Daliri, Hadian, Mirrokni. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874
Reference implementations:
- turboquant — PyTorch, KV cache focus, GPU support
- turboquant-rs — Rust, research/verification focus
Usage
Basic encode/decode
import numpy as np
from turbo_quant_lite import TurboQuant
tq = TurboQuant(dim=768, bits=4, seed=42)
# Any float array — from OpenAI, Nebius, Cohere, local model, etc.
embedding = np.random.randn(768).astype(np.float32)
# Compress
indices, norm = tq.encode(embedding)
# indices: uint8 array (768 values, each 0-15 for 4-bit)
# norm: float (the vector's L2 norm)
# Decompress
restored = tq.decode(indices, norm)
# Quality check
mse = np.mean((embedding - restored) ** 2) / np.mean(embedding ** 2)
# Relative MSE: expected to stay below about 0.011 for 4-bit (the theoretical bound holds in expectation)
Batch operations
embeddings = np.random.randn(1000, 768)
all_indices, all_norms = tq.encode_batch(embeddings)
restored_batch = tq.decode_batch(all_indices, all_norms)
Approximate similarity (fast, no full decompression)
query = np.random.randn(768)
score = tq.similarity(query, indices, norm)
# Approximates cosine similarity with the original vector, but skips the inverse rotation
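Skipping the inverse rotation is possible because orthogonal rotations preserve inner products and norms. The sketch below illustrates only that property with a stand-in rotation matrix; it does not reproduce how `similarity` is implemented, and the `cosine` helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Random orthogonal matrix (stand-in for the quantizer's rotation).
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

stored = rng.standard_normal(dim)
query = rng.standard_normal(dim)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Cosine similarity in the original space...
direct = cosine(query, stored)
# ...equals cosine similarity between the rotated vectors, so a stored
# vector kept in rotated coordinates never has to be rotated back.
rotated = cosine(rotation @ query, rotation @ stored)
print(direct, rotated)
```

In practice this means the query is rotated once, and each stored vector only needs its indices mapped back to grid values.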
Binary serialization for storage
from turbo_quant_lite import pack, unpack
# Pack to bytes (388 bytes for 4-bit, dim=768)
data = pack(indices, norm, bits=4)
# Unpack from bytes
indices, norm = unpack(data, dim=768, bits=4)
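To make the 388-byte figure concrete, here is a standard-library sketch of one possible layout (a float32 norm followed by two 4-bit indices per byte) round-tripped through a SQLite BLOB column. `pack_4bit`, `unpack_4bit`, and the table schema are hypothetical and are not the package's actual wire format; use the real `pack`/`unpack` for stored data.

```python
import sqlite3
import struct

def pack_4bit(indices, norm):
    """Hypothetical layout: float32 norm, then two 4-bit indices per byte."""
    if len(indices) % 2:
        raise ValueError("this sketch expects an even number of indices")
    body = bytes((hi << 4) | lo for hi, lo in zip(indices[0::2], indices[1::2]))
    return struct.pack("<f", norm) + body

def unpack_4bit(blob):
    """Inverse of pack_4bit: returns (list of indices, norm)."""
    (norm,) = struct.unpack_from("<f", blob, 0)
    indices = []
    for byte in blob[4:]:
        indices.extend((byte >> 4, byte & 0x0F))
    return indices, norm

# Placeholder data standing in for the output of an encode step.
indices = [i % 16 for i in range(768)]
blob = pack_4bit(indices, 1.5)

# Store the blob in one row and read it back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (id INTEGER PRIMARY KEY, code BLOB NOT NULL)")
conn.execute("INSERT INTO embeddings (id, code) VALUES (?, ?)", (1, blob))
(stored,) = conn.execute("SELECT code FROM embeddings WHERE id = ?", (1,)).fetchone()
restored_indices, restored_norm = unpack_4bit(stored)
conn.close()

print(len(blob))
```

The same bytes could go into a PostgreSQL bytea column or a Redis value unchanged, since nothing in the blob depends on the storage engine.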
Use with any embedding provider
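# OpenAI (text-embedding-3-small returns 1536 dims by default; request 768 to match tq)
import openai
response = openai.embeddings.create(model="text-embedding-3-small", input="hello", dimensions=768)
embedding = np.array(response.data[0].embedding, dtype=np.float32)
compressed = tq.encode(embedding)
# Nebius (nebius_embedder stands for your own async client; call from inside an async function,
# and make sure the vector length equals the quantizer's dim)
embedding = await nebius_embedder.embed_text("hello")
compressed = tq.encode(np.array(embedding, dtype=np.float32))
# Sentence Transformers (all-MiniLM-L6-v2 produces 384-dim vectors, so use a matching quantizer)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
tq_384 = TurboQuant(dim=384, bits=4, seed=42)
embedding = model.encode("hello")
compressed = tq_384.encode(embedding)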
When to use this
Good fit:
- Storing embeddings in a database (8x size reduction)
- Caching embeddings in Redis/Valkey (8x memory reduction)
- Shipping embeddings over the network (8x bandwidth reduction)
- Local vector search where you control the storage format
- Cold storage / backups of embedding collections
- Edge devices with limited memory
Not a good fit:
- pgvector search (pgvector needs float32/halfvec, no native 4-bit support yet)
- LLM KV cache compression (use turboquant with PyTorch)
- Sub-millisecond latency requirements at dim > 2048 (the rotation matmul becomes the bottleneck)
Performance
On a typical CPU (M-series Mac, modern x86), dim=768:
| Operation | Time | Notes |
|---|---|---|
| encode (single) | ~1.5ms | Dominated by rotation matmul |
| decode (single) | ~1.5ms | Same matmul |
| encode_batch(1000) | ~400ms | Amortized 0.4ms/vector |
| similarity | ~0.3ms | Skips inverse rotation |
| pack | ~0.1ms | Bit packing |
Storage sizes (dim=768)
| Format | Bytes per vector | Compression |
|---|---|---|
| float32 | 3,072 | 1x |
| float16 | 1,536 | 2x |
| 4-bit TurboQuant | 388 | 7.9x |
| 3-bit TurboQuant | 292 | 10.5x |
| 2-bit TurboQuant | 196 | 15.7x |
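The TurboQuant rows in the table can be reproduced with simple arithmetic, assuming each packed vector is the bit-packed indices plus a 4-byte float32 norm. That layout is inferred from the sizes above, and `packed_size` is a hypothetical helper, not part of the package.

```python
import math

def packed_size(dim: int, bits: int, norm_bytes: int = 4) -> int:
    """Bytes per vector: bit-packed indices plus the stored norm (assumed float32)."""
    return math.ceil(dim * bits / 8) + norm_bytes

float32_size = 768 * 4  # 3,072 bytes uncompressed

for bits in (4, 3, 2):
    size = packed_size(768, bits)
    print(f"{bits}-bit: {size} bytes, {float32_size / size:.1f}x")
```

The fixed 4-byte norm is why the ratios fall slightly short of the round 8x, 10.7x, and 16x figures, and the gap grows for smaller dimensions.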
Important: seed must match
The rotation matrix is generated from the seed. Encoding with seed=42 and decoding with seed=43 produces garbage. Use the same seed everywhere, or serialize the TurboQuant instance.
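A small sketch of why a mismatched seed is fatal, leaving quantization out entirely. `rotation_from_seed` is a hypothetical stand-in for however the package derives its rotation; only the rotate and un-rotate steps are shown.

```python
import numpy as np

def rotation_from_seed(dim: int, seed: int) -> np.ndarray:
    """Hypothetical stand-in: seeded random orthogonal matrix via QR."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def relative_error(original: np.ndarray, restored: np.ndarray) -> float:
    return float(np.sum((original - restored) ** 2) / np.sum(original ** 2))

dim = 256
x = np.random.default_rng(123).standard_normal(dim)

# "Encode" with seed 42 (rotation only, no quantization in this sketch).
encoded = rotation_from_seed(dim, 42) @ x

# Undo the rotation with the matching seed and with a different one.
good = rotation_from_seed(dim, 42).T @ encoded
bad = rotation_from_seed(dim, 43).T @ encoded

good_error = relative_error(x, good)  # essentially zero
bad_error = relative_error(x, bad)    # large: the result is an unrelated vector
print(good_error, bad_error)
```

Storing the seed (and the dim and bits settings) next to the compressed data, for example in a table-level metadata row, is one way to keep it from drifting between writers and readers.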
License
MIT