Retrieval-validated embedding compression. 4-8x smaller vectors, proven recall.
Project description
remex
Retrieval-validated embedding compression. 2-16x smaller vectors with measured recall.
Formerly known as polar-embed.
Based on the rotation + Lloyd-Max scalar quantization insight from TurboQuant (Zandieh et al., ICLR 2026), focused on the use case that matters most: embedding storage and retrieval for RAG systems.
Quick start
from remex import Quantizer
# Compress embeddings — no training data needed
pq = Quantizer(d=384, bits=4) # d = your embedding dimension
compressed = pq.encode(embeddings) # (n, 384) float32 → compressed
indices, scores = pq.search(compressed, query, k=10)
# Save/load (bit-packed on disk)
compressed.save("index.npz")
from remex import CompressedVectors
loaded = CompressedVectors.load("index.npz")
The quantizer is fully determined by (d, bits, seed) — no training, no fitting, no index to ship.
How it works
Three steps, each with a clear purpose:
-
Random rotation — A fixed orthogonal matrix (Haar-distributed via QR decomposition) transforms any embedding distribution so that coordinates become approximately i.i.d. N(0, 1/d). This is the key insight from TurboQuant: it makes quantization data-oblivious, meaning no training data is required.
-
Lloyd-Max scalar quantization — Each coordinate is independently quantized using optimal boundaries for the N(0, 1/d) distribution. The codebook is computed from the theoretical Gaussian CDF, not from data. This produces the minimum mean-squared-error scalar quantizer for Gaussian inputs.
-
Bit-packing — Indices are stored at their actual bit width (not wasteful uint8), giving honest compression ratios. A 4-bit codebook uses 4 bits per coordinate on disk.
Norms are stored separately as float32, preserving inner-product ranking up to quantization error.
Why not QJL? TurboQuant includes a QJL (quantized Johnson-Lindenstrauss) residual correction stage for unbiased inner product estimation. We omit it because QJL adds variance that hurts retrieval — when only ranking order matters (not absolute scores), the MSE-optimal rotation + Lloyd-Max stage empirically dominates.
Matryoshka bit precision
An n-bit quantized index's top k bits are a valid k-bit code. remex exploits this: encode once at full bit-width, search at any lower precision by right-shifting indices. Centroid tables are precomputed for all bit levels.
This enables two-stage coarse-to-fine retrieval from a single encoded representation:
pq = Quantizer(d=384, bits=8)
compressed = pq.encode(corpus)
# Two-stage: coarse ADC scan at reduced bits, then full-precision rerank
indices, scores = pq.search_twostage(
compressed, query, k=10,
candidates=200, # coarse pass returns 200 candidates
coarse_precision=4, # coarse scan at 4-bit (default: bits-2)
)
The nesting incurs a small penalty vs independently optimized codebooks: ~1.2% at 4-bit, up to ~10% at 2-bit. In practice this matters little for the coarse stage, which only needs to identify the right neighborhood.
Benchmarks
Recall vs bit level (synthetic, d=384, 10k corpus, 200 queries)
| Method | Compression | MSE | R@10 | R@100 |
|---|---|---|---|---|
| remex 8-bit | 4.0x | 0.0000 | 0.987 | 0.991 |
| remex 4-bit | 7.8x | 0.0094 | 0.850 | 0.895 |
| remex 3-bit | 10.4x | 0.0343 | 0.719 | 0.800 |
| remex 2-bit | 15.4x | 0.1171 | 0.538 | 0.634 |
Real embeddings (all-MiniLM-L6-v2, d=384, 10k corpus, 500 queries)
| Method | Compression | MSE | R@10 | R@100 |
|---|---|---|---|---|
| remex 8-bit | 2.0x | 0.0000 | 0.974 | 0.995 |
| remex 4-bit | 7.8x | 0.0093 | 0.707 | 0.932 |
| remex 3-bit | 10.4x | 0.0341 | 0.599 | 0.897 |
| remex 2-bit | 16x | 0.1164 | 0.517 | 0.860 |
| FAISS PQ (m=96, trained) | 16x | 0.0341 | 0.816 | 0.946 |
| FAISS PQ (m=48, trained) | 32x | 0.0636 | 0.618 | 0.877 |
Scaling with corpus size (synthetic, 4-bit)
| Corpus | R@10 | R@100 | Encode (ms) | Search (ms) |
|---|---|---|---|---|
| 1k | 0.880 | 0.930 | 12 | 4 |
| 5k | 0.862 | 0.905 | 63 | 13 |
| 10k | 0.850 | 0.895 | 134 | 21 |
| 50k | 0.839 | 0.872 | 689 | 140 |
Full benchmark details and distribution sensitivity analysis in bench/RESULTS.md.
When to use remex / when not to
Use remex when
- You want zero training. The quantizer is deterministic and portable — just
(d, bits, seed). No codebook to train, no index to ship, no retraining when your corpus changes. - You need fast encode. Encoding is ~20μs/vector (rotation + searchsorted). Adding new vectors never requires retraining.
- 8-bit caching is enough. At 8-bit (4x compression), R@10 = 0.974 on real embeddings. Near-lossless and much cheaper than float32.
- You want coarse retrieval + reranking. 4-bit R@10=0.707 is enough for a first pass if you rerank the top candidates with a cross-encoder or full-precision search.
Do not use remex when
- You need high recall at aggressive compression on real data. At 4-bit, FAISS PQ (m=96) achieves R@10=0.816 vs remex's 0.707 on real embeddings. Data-adaptive methods exploit structure that data-oblivious methods cannot.
- Your embeddings form very tight clusters. When cluster spread σ < 0.05, 4-bit R@10 drops to 0.53 (from 0.85 at normal spread). Quantization errors flip rankings among near-identical vectors. 8-bit is much more robust (R@10 stays above 0.95).
- You need sublinear search. remex is brute-force only. For >100k vectors, consider FAISS IVF, HNSW, or similar ANN indices. remex's compact encoding can feed into an external ANN index.
Distribution sensitivity (10k corpus, 4-bit, varying cluster tightness)
| Cluster spread (σ) | 2-bit R@10 | 4-bit R@10 | 8-bit R@10 |
|---|---|---|---|
| 0.01 (very tight) | 0.163 | 0.533 | 0.954 |
| 0.05 | 0.478 | 0.831 | 0.984 |
| 0.10 | 0.532 | 0.846 | 0.987 |
| 0.30 (typical) | 0.538 | 0.850 | 0.987 |
| 1.00 (diffuse) | 0.525 | 0.848 | 0.984 |
Detection: If your 4-bit R@10 is significantly below 0.80 on a held-out set, your embeddings likely have tight clusters. Use 8-bit, or switch to a data-adaptive method.
Compression ratios
Honest packed sizes (bit-packed on disk, d=384):
| Bits | Bytes per vector | vs float32 | File size per 10k vectors |
|---|---|---|---|
| 2 | 100 | 15.4x | 0.93 MB |
| 3 | 148 | 10.4x | 1.42 MB |
| 4 | 196 | 7.8x | 1.83 MB |
| 8 | 388 | 4.0x | 3.61 MB |
Float32 baseline: 1,536 bytes/vector (15.36 MB per 10k vectors).
In-memory, indices are stored as uint8 for fast search. The PackedVectors class keeps them bit-packed in memory too, using 2-4x less RAM for sub-byte widths.
API reference
Quantizer(d, bits=4, seed=42)
Main quantizer class (formerly PolarQuantizer, which remains available as a deprecated alias).
d— Vector dimension (must match your embeddings).bits— Bits per coordinate: 1-4 or 8. Sweet spot is 3-4. Use 8 for near-lossless.seed— Random seed for the rotation matrix. Same seed = same quantizer.
Methods
encode(X) — Quantize (n, d) float32 array. Returns CompressedVectors.
decode(compressed, precision=None) — Reconstruct (n, d) float32 from compressed. Optional precision (1 to bits) for Matryoshka decode.
search(compressed, query, k=10, precision=None) — Find k nearest neighbors by approximate inner product. Caches a dequantized float32 matrix for fast repeated queries. Returns (indices, scores).
search_batch(compressed, queries, k=10, precision=None) — Batch version of search() using matrix multiplication for better throughput. Returns (indices, scores) where both are (n_queries, k).
search_adc(compressed, query, k=10, precision=None, chunk_size=4096) — Memory-efficient search via lookup-table scoring. No float32 cache — peak memory is chunk_size * d * 4 bytes (~6 MB). Slower per-query but uses ~5x less RAM. Returns (indices, scores).
search_twostage(compressed, query, k=10, candidates=500, coarse_precision=None) — Two-stage Matryoshka retrieval: ADC coarse scan (no cache) then full-precision rerank on candidates only. Memory-efficient: only the small candidate set is dequantized. Returns (indices, scores).
mse(X, precision=None) — Mean per-vector reconstruction error (L2 squared).
CompressedVectors
Container for quantized data. Created by Quantizer.encode(). Stores indices as uint8 in memory for fast search/decode.
Properties
n— Number of vectors.nbytes— Bit-packed size in bytes (honest compression).nbytes_unpacked— In-memory size (uint8 indices + float32 norms).compression_ratio—(n * d * 4) / nbytes.resident_bytes— Actual RAM including any active caches.
Methods
save(path)/load(path)— Save/load to.npzwith bit-packed indices.save_arrow(path)/load_arrow(path)— Save/load to Arrow IPC (Feather v2) format. Requirespyarrow.subset(idx)— Return a newCompressedVectorswith only the given row indices.drop_cache()— Free the dequantized float32 cache to reclaim memory.
PackedVectors
Memory-efficient packed storage. Keeps indices bit-packed in memory, unpacking on demand. Uses 2-4x less RAM than CompressedVectors for sub-byte widths.
from remex import PackedVectors
packed = PackedVectors.from_compressed(compressed) # pack in memory
packed = PackedVectors.from_rows(rows, norms, d=384, bits=4) # from DB rows
# ADC and two-stage search work directly on PackedVectors
indices, scores = pq.search_adc(packed, query, k=10)
indices, scores = pq.search_twostage(packed, query, k=10)
# Matryoshka precision reduction
packed_2bit = packed.at_precision(2)
# Convert back if needed
compressed = packed.to_compressed()
Cached search() is not supported on PackedVectors — use search_adc() or search_twostage(), or convert with to_compressed().
GPUSearcher (optional)
GPU-accelerated search wrapper. Requires CuPy or PyTorch with CUDA. Falls back to NumPy.
from remex.gpu import GPUSearcher
searcher = GPUSearcher(pq, compressed)
indices, scores = searcher.search(query, k=10)
indices, scores = searcher.search_adc(query, k=10)
indices, scores = searcher.search_twostage(query, k=10, candidates=200)
Memory profiles (100k vectors, d=384, 8-bit)
| Strategy | Resident RAM | ms/query |
|---|---|---|
search() (cached) |
192 MB | 3.9 |
search() (cold) |
39 MB | 137 |
search_adc() (no cache) |
39 MB | 152 |
search_twostage() (no cache) |
39 MB | 152 |
Choose search() when latency matters and RAM is available. Choose search_adc() or search_twostage() when memory is constrained (serverless, edge, or very large corpora).
Low-level utilities
from remex import pack, unpack, packed_nbytes
from remex import lloyd_max_codebook, nested_codebooks
pack(indices, bits)/unpack(packed, bits, n_values)— Bit-pack/unpack uint8 arrays.packed_nbytes(n_values, d, bits)— Compute packed byte count.lloyd_max_codebook(d, bits)— Generate optimal boundaries and centroids for N(0, 1/d).nested_codebooks(d, max_bits)— Build Matryoshka centroid tables for all bit levels 1..max_bits.
vs TurboQuant
TurboQuant (Zandieh et al., ICLR 2026) adds QJL (quantized Johnson-Lindenstrauss) residual correction for unbiased inner product estimates. This is important for KV cache attention, where unbiased estimation matters. For retrieval (ranking by approximate inner product), the QJL variance hurts more than the debiasing helps. remex implements only the MSE-optimal rotation + Lloyd-Max stage, which empirically dominates for nearest-neighbor search.
vs FAISS Product Quantization
| remex | FAISS PQ | |
|---|---|---|
| Training | None | Required (trains on corpus) |
| Recall at matched compression | Lower on real data | Higher (learns structure) |
| Encode speed | ~20μs/vec | ~200μs+/vec |
| Corpus updates | Re-encode only new vectors | Retrain or accept stale codebook |
| Index portability | Quantizer is (d, bits, seed) |
Must ship trained index |
| Sublinear search | No (brute-force) | Yes (IVF, HNSW) |
| GPU support | NumPy/CuPy/PyTorch fallback | Native CUDA |
Use FAISS when: You have a stable, large corpus, need sublinear search, and can afford training time.
Use remex when: You want zero training, fast encode, frequently changing corpora, or near-lossless 8-bit caching (R@10=0.974 at 4x compression).
vs scalar quantization (naive rounding)
Without the rotation step, scalar quantization on raw embeddings is catastrophically bad — embeddings are highly anisotropic (variance ratios of 10^7x across dimensions). The random rotation spreads information uniformly across coordinates, making scalar quantization viable.
At 3-bit, remex achieves 72-80% R@10 vs ~40% for naive scalar quantization on the same data.
Installation
pip install remex # from PyPI (when published)
pip install -e ".[dev]" # development: + pytest, pytest-cov
pip install -e ".[bench]" # benchmarking: + faiss-cpu, sentence-transformers
Testing
pytest # 126 tests (~6 min)
pytest tests/test_polar_embed.py -v # core tests
pytest tests/test_matryoshka.py -v # Matryoshka/nested codebook tests
pytest tests/test_adc_gpu.py -v # ADC and GPU searcher tests
pytest tests/test_packed_vectors.py -v # PackedVectors tests
References
- Zandieh et al. (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file remex-0.5.0.tar.gz.
File metadata
- Download URL: remex-0.5.0.tar.gz
- Upload date:
- Size: 36.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a812d0535a13a166859a9782bcd4085ad7916babb404f0d218271022357454f4
|
|
| MD5 |
4c1dbe0f91b7b56c0c8637c8cbb0403b
|
|
| BLAKE2b-256 |
c020e42e2b352e047ce3c0f07030272fc181ad915927f2fbed2ad404941ae590
|
Provenance
The following attestation bundles were made for remex-0.5.0.tar.gz:
Publisher:
publish.yml on oaustegard/remex
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
remex-0.5.0.tar.gz -
Subject digest:
a812d0535a13a166859a9782bcd4085ad7916babb404f0d218271022357454f4 - Sigstore transparency entry: 1243453360
- Sigstore integration time:
-
Permalink:
oaustegard/remex@9421743b2db95ddf9ab832f3722a86a33b027fca -
Branch / Tag:
refs/tags/v0.5.0 - Owner: https://github.com/oaustegard
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@9421743b2db95ddf9ab832f3722a86a33b027fca -
Trigger Event:
release
-
Statement type:
File details
Details for the file remex-0.5.0-py3-none-any.whl.
File metadata
- Download URL: remex-0.5.0-py3-none-any.whl
- Upload date:
- Size: 24.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ed5f1cdb3439509fcff54ac27475139f3ac4e5032117d95e229ff5cfba8f81a3
|
|
| MD5 |
35acd62a51e62c355597912a1ce55a93
|
|
| BLAKE2b-256 |
e6eba25539194691dc825bc43b6f9012286df774e8fa1cf5c666ffa745876e1a
|
Provenance
The following attestation bundles were made for remex-0.5.0-py3-none-any.whl:
Publisher:
publish.yml on oaustegard/remex
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
remex-0.5.0-py3-none-any.whl -
Subject digest:
ed5f1cdb3439509fcff54ac27475139f3ac4e5032117d95e229ff5cfba8f81a3 - Sigstore transparency entry: 1243453363
- Sigstore integration time:
-
Permalink:
oaustegard/remex@9421743b2db95ddf9ab832f3722a86a33b027fca -
Branch / Tag:
refs/tags/v0.5.0 - Owner: https://github.com/oaustegard
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@9421743b2db95ddf9ab832f3722a86a33b027fca -
Trigger Event:
release
-
Statement type: