
turbo-quant-lite

Numpy-only vector quantization based on Google's TurboQuant algorithm. Compresses float32 vectors to 1-4 bit indices with near-optimal quality. No PyTorch, no CUDA, no model dependencies.

from turbo_quant_lite import TurboQuant

tq = TurboQuant(dim=768, bits=4)

indices, norm = tq.encode(embedding)   # 3072 bytes → 388 bytes
restored = tq.decode(indices, norm)    # < 1.1% MSE distortion

Why this exists

There are two existing TurboQuant implementations on PyPI:

  • turboquant — PyTorch-based, focused on LLM KV cache compression. Full HuggingFace integration and GPU support. Requires PyTorch (~2 GB install).
  • turboquant-vectors — Numpy-only, focused on batch vector compression and embedding privacy. Includes a PrivateEncoder for protecting embeddings against inversion attacks. Designed for the "compress a collection, save to disk, search the collection" workflow.

This package fills a different niche: per-vector compression for database storage. It's for applications that store embeddings row-by-row in PostgreSQL, SQLite, or Redis and need to compress each vector into a compact binary blob (388 bytes at 4-bit, dim=768) that can be stored in a bytea column or cache key.
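The row-per-row workflow described above can be sketched with the standard-library sqlite3 module; the zero-filled 388-byte placeholder below simply stands in for a pack() output blob:

```python
import sqlite3

# Placeholder standing in for pack() output: 388 bytes at 4-bit, dim=768.
blob = bytes(388)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (id INTEGER PRIMARY KEY, vec BLOB)")
conn.execute("INSERT INTO embeddings (id, vec) VALUES (?, ?)", (1, blob))

row = conn.execute("SELECT vec FROM embeddings WHERE id = 1").fetchone()
assert len(row[0]) == 388  # one compact blob per row, ready for unpack()
```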

Key differences from turboquant-vectors:

  • Per-vector binary serialization — pack() / unpack() produce compact bytes for database row storage. No file I/O required.
  • Zero per-vector overhead — the quantizer is shared (initialized once), compressed data is just indices + norm. No 2MB wrapper per vector.
  • Direct single-vector similarity — similarity(query, indices, norm) works on raw indices without wrapping in a collection object.

Same algorithm, same quality, designed for the database storage use case.

For embedding privacy (protecting against inversion attacks like Vec2Text), see turboquant-vectors PrivateEncoder. You can apply a secret rotation before compression — the two compose naturally as separate layers.
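A minimal sketch of the layering idea (not the actual PrivateEncoder API): a secret orthogonal rotation keyed by a seed is applied before encoding and inverted after decoding.

```python
import numpy as np

def secret_rotation(dim, seed):
    # Orthogonal matrix derived from a secret seed via QR decomposition.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

dim = 64
x = np.random.default_rng(0).standard_normal(dim)
R = secret_rotation(dim, seed=1234)   # the secret key

protected = R @ x            # layer 1: secret rotation (privacy)
# ... layer 2 would be TurboQuant encode/decode on `protected` ...
recovered = R.T @ protected  # invert with the same secret seed

assert np.allclose(recovered, x)
```

Without the seed, `protected` reveals only the norm of `x`, which is why the two layers compose cleanly.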

Install

pip install turbo-quant-lite

Or just copy turbo_quant_lite/core.py into your project. It's one file.

What is TurboQuant?

TurboQuant is a data-oblivious vector quantization algorithm from Google Research. It compresses vectors without needing training data or calibration — it works instantly on any vector from any source.

The key insight: randomly rotate a vector and each coordinate becomes approximately Gaussian with known variance. Since the distribution is known in advance, you can precompute the optimal quantization grid. This turns a hard problem (data-dependent codebook learning) into a table lookup.
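A minimal numpy illustration of this insight, using a QR-based random orthogonal matrix (the library may use a faster structured rotation):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

# A highly non-Gaussian input: all mass on one coordinate.
x = np.zeros(dim)
x[0] = 1.0

# Random orthogonal rotation via QR of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
y = Q @ x

# After rotation, each coordinate is approximately N(0, ||x||^2 / dim):
# the variance is known in advance, so the quantization grid can be
# precomputed once instead of learned from data.
assert abs(np.var(y) * dim - np.linalg.norm(x) ** 2) < 0.05
```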

Results:

  • 4-bit: 8x compression, < 1.1% MSE distortion
  • 3-bit: ~10x compression, < 4.3% MSE distortion
  • 2-bit: 16x compression, < 17% MSE distortion

Quality is within 2.72x of the information-theoretic optimum (Shannon lower bound) at every bit width. This bound is provable and data-independent — it holds for any vector, not just your benchmark.

Paper: Zandieh, Daliri, Hadian, Mirrokni. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874


Usage

Basic encode/decode

import numpy as np
from turbo_quant_lite import TurboQuant

tq = TurboQuant(dim=768, bits=4, seed=42)

# Any float array — from OpenAI, Nebius, Cohere, local model, etc.
embedding = np.random.randn(768).astype(np.float32)

# Compress
indices, norm = tq.encode(embedding)
# indices: uint8 array (768 values, each 0-15 for 4-bit)
# norm: float (the vector's L2 norm)

# Decompress
restored = tq.decode(indices, norm)

# Quality check
mse = np.mean((embedding - restored) ** 2) / np.mean(embedding ** 2)
# mse < 0.011 for 4-bit (guaranteed by theory)

Batch operations

embeddings = np.random.randn(1000, 768)

all_indices, all_norms = tq.encode_batch(embeddings)
restored_batch = tq.decode_batch(all_indices, all_norms)

Approximate similarity (fast, no full decompression)

query = np.random.randn(768)
score = tq.similarity(query, indices, norm)
# Equivalent to cosine similarity but skips the inverse rotation
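The shortcut works because orthogonal rotations preserve inner products, so the query can be rotated once and compared directly in rotated space. A minimal numpy check of the underlying identity:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # orthogonal rotation

q = rng.standard_normal(dim)
x = rng.standard_normal(dim)

# <q, x> == <Qq, Qx>: comparing in rotated space gives the same score,
# so the inverse rotation on the dequantized values can be skipped.
assert np.allclose(q @ x, (Q @ q) @ (Q @ x))
```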

Binary serialization for storage

from turbo_quant_lite import pack, unpack

# Pack to bytes (388 bytes for 4-bit, dim=768)
data = pack(indices, norm, bits=4)

# Unpack from bytes
indices, norm = unpack(data, dim=768, bits=4)
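As a plausible sketch of what 4-bit packing involves (two indices per byte plus a float32 norm, which reproduces the 388-byte figure; the library's actual wire format may differ):

```python
import struct
import numpy as np

def pack4(indices, norm):
    # Two 4-bit indices per byte: high nibble first, then low nibble.
    idx = np.asarray(indices, dtype=np.uint8)
    packed = (idx[0::2] << 4) | idx[1::2]
    return packed.tobytes() + struct.pack("<f", norm)

def unpack4(data, dim):
    body, norm = data[:-4], struct.unpack("<f", data[-4:])[0]
    packed = np.frombuffer(body, dtype=np.uint8)
    idx = np.empty(dim, dtype=np.uint8)
    idx[0::2] = packed >> 4
    idx[1::2] = packed & 0x0F
    return idx, norm

indices = np.random.default_rng(0).integers(0, 16, 768, dtype=np.uint8)
data = pack4(indices, 3.5)
assert len(data) == 768 // 2 + 4  # 388 bytes
out, norm = unpack4(data, 768)
assert np.array_equal(out, indices) and norm == 3.5
```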

Use with any embedding provider

# OpenAI
response = openai.embeddings.create(model="text-embedding-3-small", input="hello")
embedding = np.array(response.data[0].embedding)
compressed = tq.encode(embedding)

# Nebius
embedding = await nebius_embedder.embed_text("hello")
compressed = tq.encode(np.array(embedding))

# Sentence Transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("hello")
compressed = tq.encode(embedding)

When to use this

Good fit:

  • Storing embeddings in a database (8x size reduction)
  • Caching embeddings in Redis/Valkey (8x memory reduction)
  • Shipping embeddings over the network (8x bandwidth reduction)
  • Local vector search where you control the storage format
  • Cold storage / backups of embedding collections
  • Edge devices with limited memory

Not a good fit:

  • pgvector search (pgvector needs float32/halfvec, no native 4-bit support yet)
  • LLM KV cache compression (use turboquant with PyTorch)
  • Sub-millisecond latency requirements at dim > 2048 (the rotation matmul becomes the bottleneck)

Performance

On a typical CPU (M-series Mac, modern x86), dim=768:

Operation            Time     Notes
encode (single)      ~1.5ms   Dominated by rotation matmul
decode (single)      ~1.5ms   Same matmul
encode_batch(1000)   ~400ms   Amortized 0.4ms/vector
similarity           ~0.3ms   Skips inverse rotation
pack                 ~0.1ms   Bit packing

Storage sizes (dim=768)

Format             Bytes per vector   Compression
float32            3,072              1x
float16            1,536              2x
4-bit TurboQuant   388                7.9x
3-bit TurboQuant   292                10.5x
2-bit TurboQuant   196                15.7x
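These sizes follow from ceil(dim x bits / 8) packed bytes plus, assuming the layout implied by the 388-byte figure, a 4-byte float32 norm:

```python
import math

def blob_size(dim, bits):
    # packed indices + 4-byte float32 norm (assumed layout)
    return math.ceil(dim * bits / 8) + 4

for bits, expected in [(4, 388), (3, 292), (2, 196)]:
    assert blob_size(768, bits) == expected
    print(f"{bits}-bit: {blob_size(768, bits)} bytes, "
          f"{3072 / blob_size(768, bits):.1f}x compression")
```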

Important: seed must match

The rotation matrix is generated from the seed. Encoding with seed=42 and decoding with seed=43 produces garbage. Use the same seed everywhere, or serialize the TurboQuant instance.
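A toy illustration of why, using a seeded random rotation in place of the library's internals:

```python
import numpy as np

def rotation(dim, seed):
    # Stand-in for the seed-derived rotation matrix.
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((dim, dim)))
    return q

dim = 128
x = np.random.default_rng(0).standard_normal(dim)

encoded = rotation(dim, 42) @ x     # encode with seed 42
ok = rotation(dim, 42).T @ encoded  # same seed: round-trips
bad = rotation(dim, 43).T @ encoded # different seed: unrelated output

assert np.allclose(ok, x)
assert not np.allclose(bad, x)
```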

License

MIT
