
turbo-quant-lite

Numpy-only vector quantization based on Google's TurboQuant algorithm. Compresses float32 vectors to 1-4 bit indices with near-optimal quality. No PyTorch, no CUDA, no model dependencies.

from turbo_quant_lite import TurboQuant

tq = TurboQuant(dim=768, bits=4)

indices, norm = tq.encode(embedding)   # 3072 bytes → 388 bytes
restored = tq.decode(indices, norm)    # < 1.1% MSE distortion

Why this exists

There are two existing TurboQuant implementations on PyPI:

  • turboquant — PyTorch-based, focused on LLM KV cache compression. Full HuggingFace integration and GPU support. Requires PyTorch (~2 GB install).
  • turboquant-vectors — Numpy-only, focused on batch vector compression and embedding privacy. Includes a PrivateEncoder for protecting embeddings against inversion attacks. Designed for the "compress a collection, save to disk, search the collection" workflow.

This package fills a different niche: per-vector compression for database storage. It's for applications that store embeddings row-by-row in PostgreSQL, SQLite, or Redis and need to compress each vector into a compact binary blob (388 bytes at 4-bit, dim=768) that can be stored in a bytea column or cache key.
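The row-per-row workflow described above can be sketched with the standard-library sqlite3 module; the zero-filled 388-byte placeholder below simply stands in for a pack() output blob:

```python
import sqlite3

# Placeholder standing in for pack() output: 388 bytes at 4-bit, dim=768.
blob = bytes(388)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (id INTEGER PRIMARY KEY, vec BLOB)")
conn.execute("INSERT INTO embeddings (id, vec) VALUES (?, ?)", (1, blob))

row = conn.execute("SELECT vec FROM embeddings WHERE id = 1").fetchone()
assert len(row[0]) == 388  # one compact blob per row, ready for unpack()
```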

Key differences from turboquant-vectors:

  • Per-vector binary serialization — pack() / unpack() produce compact bytes for database row storage. No file I/O required.
  • Zero per-vector overhead — the quantizer is shared (initialized once), compressed data is just indices + norm. No 2MB wrapper per vector.
  • Direct single-vector similarity — similarity(query, indices, norm) works on raw indices without wrapping in a collection object.

Same algorithm, same quality, designed for the database storage use case.

For embedding privacy (protecting against inversion attacks like Vec2Text), see turboquant-vectors PrivateEncoder. You can apply a secret rotation before compression — the two compose naturally as separate layers.
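A minimal sketch of the layering idea (not the actual PrivateEncoder API): a secret orthogonal rotation keyed by a seed is applied before encoding and inverted after decoding.

```python
import numpy as np

def secret_rotation(dim, seed):
    # Orthogonal matrix derived from a secret seed via QR decomposition.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

dim = 64
x = np.random.default_rng(0).standard_normal(dim)
R = secret_rotation(dim, seed=1234)   # the secret key

protected = R @ x            # layer 1: secret rotation (privacy)
# ... layer 2 would be TurboQuant encode/decode on `protected` ...
recovered = R.T @ protected  # invert with the same secret seed

assert np.allclose(recovered, x)
```

Without the seed, `protected` reveals only the norm of `x`, which is why the two layers compose cleanly.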

Install

pip install turbo-quant-lite

Or just copy turbo_quant_lite/core.py into your project. It's one file.

What is TurboQuant?

TurboQuant is a data-oblivious vector quantization algorithm from Google Research. It compresses vectors without needing training data or calibration — it works instantly on any vector from any source.

The key insight: randomly rotate a vector and each coordinate becomes approximately Gaussian with known variance. Since the distribution is known in advance, you can precompute the optimal quantization grid. This turns a hard problem (data-dependent codebook learning) into a table lookup.
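A minimal numpy illustration of this insight, using a QR-based random orthogonal matrix (the library may use a faster structured rotation):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

# A highly non-Gaussian input: all mass on one coordinate.
x = np.zeros(dim)
x[0] = 1.0

# Random orthogonal rotation via QR of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
y = Q @ x

# After rotation, each coordinate is approximately N(0, ||x||^2 / dim):
# the variance is known in advance, so the quantization grid can be
# precomputed once instead of learned from data.
assert abs(np.var(y) * dim - np.linalg.norm(x) ** 2) < 0.05
```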

Results:

  • 4-bit: 8x compression, < 1.1% MSE distortion
  • 3-bit: ~10x compression, < 4.3% MSE distortion
  • 2-bit: 16x compression, < 17% MSE distortion

Quality is within 2.72x of the information-theoretic optimum (Shannon lower bound) at every bit width. This bound is provable and data-independent — it holds for any vector, not just your benchmark.

Paper: Zandieh, Daliri, Hadian, Mirrokni. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874


Usage

Basic encode/decode

import numpy as np
from turbo_quant_lite import TurboQuant

tq = TurboQuant(dim=768, bits=4, seed=42)

# Any float array — from OpenAI, Nebius, Cohere, local model, etc.
embedding = np.random.randn(768).astype(np.float32)

# Compress
indices, norm = tq.encode(embedding)
# indices: uint8 array (768 values, each 0-15 for 4-bit)
# norm: float (the vector's L2 norm)

# Decompress
restored = tq.decode(indices, norm)

# Quality check
mse = np.mean((embedding - restored) ** 2) / np.mean(embedding ** 2)
# mse < 0.011 for 4-bit (guaranteed by theory)

Batch operations

embeddings = np.random.randn(1000, 768)

all_indices, all_norms = tq.encode_batch(embeddings)
restored_batch = tq.decode_batch(all_indices, all_norms)

Approximate similarity (fast, no full decompression)

query = np.random.randn(768)
score = tq.similarity(query, indices, norm)
# Equivalent to cosine similarity but skips the inverse rotation
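The shortcut works because orthogonal rotations preserve inner products, so the query can be rotated once and compared directly in rotated space. A minimal numpy check of the underlying identity:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # orthogonal rotation

q = rng.standard_normal(dim)
x = rng.standard_normal(dim)

# <q, x> == <Qq, Qx>: comparing in rotated space gives the same score,
# so the inverse rotation on the dequantized values can be skipped.
assert np.allclose(q @ x, (Q @ q) @ (Q @ x))
```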

Binary serialization for storage

from turbo_quant_lite import pack, unpack

# Pack to bytes (388 bytes for 4-bit, dim=768)
data = pack(indices, norm, bits=4)

# Unpack from bytes
indices, norm = unpack(data, dim=768, bits=4)
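As a plausible sketch of what 4-bit packing involves (two indices per byte plus a float32 norm, which reproduces the 388-byte figure; the library's actual wire format may differ):

```python
import struct
import numpy as np

def pack4(indices, norm):
    # Two 4-bit indices per byte: high nibble first, then low nibble.
    idx = np.asarray(indices, dtype=np.uint8)
    packed = (idx[0::2] << 4) | idx[1::2]
    return packed.tobytes() + struct.pack("<f", norm)

def unpack4(data, dim):
    body, norm = data[:-4], struct.unpack("<f", data[-4:])[0]
    packed = np.frombuffer(body, dtype=np.uint8)
    idx = np.empty(dim, dtype=np.uint8)
    idx[0::2] = packed >> 4
    idx[1::2] = packed & 0x0F
    return idx, norm

indices = np.random.default_rng(0).integers(0, 16, 768, dtype=np.uint8)
data = pack4(indices, 3.5)
assert len(data) == 768 // 2 + 4  # 388 bytes
out, norm = unpack4(data, 768)
assert np.array_equal(out, indices) and norm == 3.5
```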

Use with any embedding provider

# OpenAI
response = openai.embeddings.create(model="text-embedding-3-small", input="hello")
embedding = np.array(response.data[0].embedding)
compressed = tq.encode(embedding)

# Nebius
embedding = await nebius_embedder.embed_text("hello")
compressed = tq.encode(np.array(embedding))

# Sentence Transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("hello")
compressed = tq.encode(embedding)

When to use this

Good fit:

  • Storing embeddings in a database (8x size reduction)
  • Caching embeddings in Redis/Valkey (8x memory reduction)
  • Shipping embeddings over the network (8x bandwidth reduction)
  • Local vector search where you control the storage format
  • Cold storage / backups of embedding collections
  • Edge devices with limited memory

Not a good fit:

  • pgvector search (pgvector needs float32/halfvec, no native 4-bit support yet)
  • LLM KV cache compression (use turboquant with PyTorch)
  • Sub-millisecond latency requirements at dim > 2048 (the rotation matmul becomes the bottleneck)

Performance

On a typical CPU (M-series Mac, modern x86), dim=768:

Operation            Time     Notes
encode (single)      ~1.5ms   Dominated by rotation matmul
decode (single)      ~1.5ms   Same matmul
encode_batch(1000)   ~400ms   Amortized 0.4ms/vector
similarity           ~0.3ms   Skips inverse rotation
pack                 ~0.1ms   Bit packing

Storage sizes (dim=768)

Format             Bytes per vector   Compression
float32            3,072              1x
float16            1,536              2x
4-bit TurboQuant   388                7.9x
3-bit TurboQuant   292                10.5x
2-bit TurboQuant   196                15.7x
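These sizes follow from ceil(dim x bits / 8) packed bytes plus, assuming the layout implied by the 388-byte figure, a 4-byte float32 norm:

```python
import math

def blob_size(dim, bits):
    # packed indices + 4-byte float32 norm (assumed layout)
    return math.ceil(dim * bits / 8) + 4

for bits, expected in [(4, 388), (3, 292), (2, 196)]:
    assert blob_size(768, bits) == expected
    print(f"{bits}-bit: {blob_size(768, bits)} bytes, "
          f"{3072 / blob_size(768, bits):.1f}x compression")
```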

Important: seed must match

The rotation matrix is generated from the seed. Encoding with seed=42 and decoding with seed=43 produces garbage. Use the same seed everywhere, or serialize the TurboQuant instance.
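A toy illustration of why, using a seeded random rotation in place of the library's internals:

```python
import numpy as np

def rotation(dim, seed):
    # Stand-in for the seed-derived rotation matrix.
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((dim, dim)))
    return q

dim = 128
x = np.random.default_rng(0).standard_normal(dim)

encoded = rotation(dim, 42) @ x     # encode with seed 42
ok = rotation(dim, 42).T @ encoded  # same seed: round-trips
bad = rotation(dim, 43).T @ encoded # different seed: unrelated output

assert np.allclose(ok, x)
assert not np.allclose(bad, x)
```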

License

MIT
