Retrieval-validated embedding compression. 4-8x smaller vectors, proven recall.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Project description

remex

Retrieval-validated embedding compression. 2-16x smaller vectors with measured recall.

Based on the rotation + Lloyd-Max scalar quantization insight from TurboQuant (Zandieh et al., ICLR 2026), focused on the use case that matters most to most AI Engineers: embedding storage and retrieval for RAG systems.

Quick start

from remex import Quantizer

# Compress embeddings — no training data needed
pq = Quantizer(d=384, bits=4)            # d = your embedding dimension
compressed = pq.encode(embeddings)       # (n, 384) float32 → compressed
indices, scores = pq.search(compressed, query, k=10)

# Save/load (bit-packed on disk)
compressed.save("index.npz")
from remex import CompressedVectors
loaded = CompressedVectors.load("index.npz")

The quantizer is fully determined by (d, bits, seed) — no training, no fitting, no index to ship.

How it works

Three steps, each with a clear purpose:

Random rotation — A fixed orthogonal matrix (Haar-distributed via QR decomposition) transforms any embedding distribution so that coordinates become approximately i.i.d. N(0, 1/d). This is the key insight from TurboQuant: it makes quantization data-oblivious, meaning no training data is required.
Lloyd-Max scalar quantization — Each coordinate is independently quantized using optimal boundaries for the N(0, 1/d) distribution. The codebook is computed from the theoretical Gaussian CDF, not from data. This produces the minimum mean-squared-error scalar quantizer for Gaussian inputs.
Bit-packing — Indices are stored at their actual bit width (not wasteful uint8), giving honest compression ratios. A 4-bit codebook uses 4 bits per coordinate on disk.

Norms are stored separately as float32, preserving inner-product ranking up to quantization error.

Why not QJL? TurboQuant includes a QJL (quantized Johnson-Lindenstrauss) residual correction stage for unbiased inner product estimation. We omit it because QJL adds variance that hurts retrieval — when only ranking order matters (not absolute scores), the MSE-optimal rotation + Lloyd-Max stage empirically dominates.

Matryoshka bit precision

An n-bit quantized index's top k bits are a valid k-bit code. remex exploits this: encode once at full bit-width, search at any lower precision by right-shifting indices. Centroid tables are precomputed for all bit levels.

This enables two-stage coarse-to-fine retrieval from a single encoded representation:

pq = Quantizer(d=384, bits=8)
compressed = pq.encode(corpus)

# Two-stage: coarse ADC (Asymmetric Distance Computation) scan at reduced bits,
# then full-precision rerank
indices, scores = pq.search_twostage(
    compressed, query, k=10,
    candidates=200,          # coarse pass returns 200 candidates
    coarse_precision=4,      # coarse scan at 4-bit (default: bits-2)
)

The nesting incurs a small penalty vs independently optimized codebooks: ~1.2% at 4-bit, up to ~10% at 2-bit. In practice this matters little for the coarse stage, which only needs to identify the right neighborhood.

Benchmarks

Recall vs bit level (synthetic, d=384, 10k corpus, 200 queries)

Method	Compression	MSE	R@10	R@100
remex 8-bit	4.0x	0.0000	0.987	0.991
remex 4-bit	7.8x	0.0094	0.850	0.895
remex 3-bit	10.4x	0.0343	0.719	0.800
remex 2-bit	15.4x	0.1171	0.538	0.634

Real embeddings (all-MiniLM-L6-v2, d=384, 10k corpus, 500 queries)

Method	Compression	MSE	R@10	R@100
remex 8-bit	2.0x	0.0000	0.974	0.995
remex 4-bit	7.8x	0.0093	0.707	0.932
remex 3-bit	10.4x	0.0341	0.599	0.897
remex 2-bit	16x	0.1164	0.517	0.860
FAISS PQ (m=96, trained)	16x	0.0341	0.816	0.946
FAISS PQ (m=48, trained)	32x	0.0636	0.618	0.877

Scaling with corpus size (synthetic, 4-bit)

Corpus	R@10	R@100	Encode (ms)	Search (ms)
1k	0.880	0.930	12	4
5k	0.862	0.905	63	13
10k	0.850	0.895	134	21
50k	0.839	0.872	689	140

Full benchmark details and distribution sensitivity analysis in bench/RESULTS.md.

When to use remex / when not to

Use remex when

You want zero training. The quantizer is deterministic and portable — just (d, bits, seed). No codebook to train, no index to ship, no retraining when your corpus changes.
You need fast encode. Encoding is ~20μs/vector (rotation + searchsorted). Adding new vectors never requires retraining.
8-bit caching is enough. At 8-bit (4x compression), R@10 = 0.974 on real embeddings. Near-lossless and much cheaper than float32.
You want coarse retrieval + reranking. 4-bit R@10=0.707 is enough for a first pass if you rerank the top candidates with a cross-encoder or full-precision search.

Do not use remex when

You need high recall at aggressive compression on real data. At 4-bit, FAISS PQ (m=96) achieves R@10=0.816 vs remex's 0.707 on real embeddings. Data-adaptive methods exploit structure that data-oblivious methods cannot.
Your embeddings form very tight clusters. When cluster spread σ < 0.05, 4-bit R@10 drops to 0.53 (from 0.85 at normal spread). Quantization errors flip rankings among near-identical vectors. 8-bit is much more robust (R@10 stays above 0.95).
You need sublinear search. remex is brute-force only. For >100k vectors, consider FAISS IVF, HNSW, or similar ANN indices. remex's compact encoding can feed into an external ANN index.

Distribution sensitivity (10k corpus, 4-bit, varying cluster tightness)

Cluster spread (σ)	2-bit R@10	4-bit R@10	8-bit R@10
0.01 (very tight)	0.163	0.533	0.954
0.05	0.478	0.831	0.984
0.10	0.532	0.846	0.987
0.30 (typical)	0.538	0.850	0.987
1.00 (diffuse)	0.525	0.848	0.984

Detection: If your 4-bit R@10 is significantly below 0.80 on a held-out set, your embeddings likely have tight clusters. Use 8-bit, or switch to a data-adaptive method.

Compression ratios

Honest packed sizes (bit-packed on disk, d=384):

Bits	Bytes per vector	vs float32	File size per 10k vectors
2	100	15.4x	0.93 MB
3	148	10.4x	1.42 MB
4	196	7.8x	1.83 MB
8	388	4.0x	3.61 MB

Float32 baseline: 1,536 bytes/vector (15.36 MB per 10k vectors).

In-memory, indices are stored as uint8 for fast search. The PackedVectors class keeps them bit-packed in memory too, using 2-4x less RAM for sub-byte widths.

API reference

`Quantizer(d, bits=4, seed=42)`

Main quantizer class (formerly PolarQuantizer, which remains available as a deprecated alias).

d — Vector dimension (must match your embeddings).
bits — Bits per coordinate: 1-4 or 8. Sweet spot is 3-4. Use 8 for near-lossless.
seed — Random seed for the rotation matrix. Same seed = same quantizer.

Methods

encode(X) — Quantize (n, d) float32 array. Returns CompressedVectors.

decode(compressed, precision=None) — Reconstruct (n, d) float32 from compressed. Optional precision (1 to bits) for Matryoshka decode.

search(compressed, query, k=10, precision=None) — Find k nearest neighbors by approximate inner product. Caches a dequantized float32 matrix for fast repeated queries. Returns (indices, scores).

search_batch(compressed, queries, k=10, precision=None) — Batch version of search() using matrix multiplication for better throughput. Returns (indices, scores) where both are (n_queries, k).

search_adc(compressed, query, k=10, precision=None, chunk_size=4096) — Memory-efficient search via ADC (Asymmetric Distance Computation) lookup-table scoring. No float32 cache — peak memory is chunk_size * d * 4 bytes (~6 MB). Slower per-query but uses ~5x less RAM. Returns (indices, scores).

search_twostage(compressed, query, k=10, candidates=500, coarse_precision=None) — Two-stage Matryoshka retrieval: ADC coarse scan (no cache) then full-precision rerank on candidates only. Memory-efficient: only the small candidate set is dequantized. Returns (indices, scores).

mse(X, precision=None) — Mean per-vector reconstruction error (L2 squared).

`CompressedVectors`

Container for quantized data. Created by Quantizer.encode(). Stores indices as uint8 in memory for fast search/decode.

Properties

n — Number of vectors.
nbytes — Bit-packed size in bytes (honest compression).
nbytes_unpacked — In-memory size (uint8 indices + float32 norms).
compression_ratio — (n * d * 4) / nbytes.
resident_bytes — Actual RAM including any active caches.

Methods

save(path) / load(path) — Save/load to .npz with bit-packed indices.
save_arrow(path) / load_arrow(path) — Save/load to Arrow IPC (Feather v2) format. Requires pyarrow.
subset(idx) — Return a new CompressedVectors with only the given row indices.
drop_cache() — Free the dequantized float32 cache to reclaim memory.

`PackedVectors`

Memory-efficient packed storage. Keeps indices bit-packed in memory, unpacking on demand. Uses 2-4x less RAM than CompressedVectors for sub-byte widths.

from remex import PackedVectors

packed = PackedVectors.from_compressed(compressed)  # pack in memory
packed = PackedVectors.from_rows(rows, norms, d=384, bits=4)  # from DB rows

# ADC and two-stage search work directly on PackedVectors
indices, scores = pq.search_adc(packed, query, k=10)
indices, scores = pq.search_twostage(packed, query, k=10)

# Matryoshka precision reduction
packed_2bit = packed.at_precision(2)

# Convert back if needed
compressed = packed.to_compressed()

Cached search() is not supported on PackedVectors — use search_adc() or search_twostage(), or convert with to_compressed().

`GPUSearcher` (optional)

GPU-accelerated search wrapper. Requires CuPy or PyTorch with CUDA. Falls back to NumPy.

from remex.gpu import GPUSearcher

searcher = GPUSearcher(pq, compressed)
indices, scores = searcher.search(query, k=10)
indices, scores = searcher.search_adc(query, k=10)
indices, scores = searcher.search_twostage(query, k=10, candidates=200)

Memory profiles (100k vectors, d=384, 8-bit)

Strategy	Resident RAM	ms/query
`search()` (cached)	192 MB	3.9
`search()` (cold)	39 MB	137
`search_adc()` (no cache)	39 MB	152
`search_twostage()` (no cache)	39 MB	152

Choose search() when latency matters and RAM is available. Choose search_adc() or search_twostage() when memory is constrained (serverless, edge, or very large corpora).

Low-level utilities

from remex import pack, unpack, packed_nbytes
from remex import lloyd_max_codebook, nested_codebooks

pack(indices, bits) / unpack(packed, bits, n_values) — Bit-pack/unpack uint8 arrays.
packed_nbytes(n_values, d, bits) — Compute packed byte count.
lloyd_max_codebook(d, bits) — Generate optimal boundaries and centroids for N(0, 1/d).
nested_codebooks(d, max_bits) — Build Matryoshka centroid tables for all bit levels 1..max_bits.

vs TurboQuant

TurboQuant (Zandieh et al., ICLR 2026) adds QJL (quantized Johnson-Lindenstrauss) residual correction for unbiased inner product estimates. This is important for KV cache attention, where unbiased estimation matters. For retrieval (ranking by approximate inner product), the QJL variance hurts more than the debiasing helps. remex implements only the MSE-optimal rotation + Lloyd-Max stage, which empirically dominates for nearest-neighbor search.

vs FAISS Product Quantization

	remex	FAISS PQ
Training	None	Required (trains on corpus)
Recall at matched compression	Lower on real data	Higher (learns structure)
Encode speed	~20μs/vec	~200μs+/vec
Corpus updates	Re-encode only new vectors	Retrain or accept stale codebook
Index portability	Quantizer is `(d, bits, seed)`	Must ship trained index
Sublinear search	No (brute-force)	Yes (IVF, HNSW)
GPU support	NumPy/CuPy/PyTorch fallback	Native CUDA

Use FAISS when: You have a stable, large corpus, need sublinear search, and can afford training time.

Use remex when: You want zero training, fast encode, frequently changing corpora, or near-lossless 8-bit caching (R@10=0.974 at 4x compression).

vs scalar quantization (naive rounding)

Without the rotation step, scalar quantization on raw embeddings is catastrophically bad — embeddings are highly anisotropic (variance ratios of 10^7x across dimensions). The random rotation spreads information uniformly across coordinates, making scalar quantization viable.

At 3-bit, remex achieves 72-80% R@10 vs ~40% for naive scalar quantization on the same data.

Installation

pip install remex                       # from PyPI (when published)
pip install -e ".[dev]"                 # development: + pytest, pytest-cov
pip install -e ".[bench]"              # benchmarking: + faiss-cpu, sentence-transformers

Testing

pytest                                  # 126 tests (~6 min)
pytest tests/test_polar_embed.py -v     # core tests
pytest tests/test_matryoshka.py -v      # Matryoshka/nested codebook tests
pytest tests/test_adc_gpu.py -v         # ADC and GPU searcher tests
pytest tests/test_packed_vectors.py -v  # PackedVectors tests

References

Zandieh et al. (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874
Jégou, Douze & Schmid (2011). Product Quantization for Nearest Neighbor Search. IEEE TPAMI 33(1):117–128. IEEE Xplore — introduces Product Quantization, ADC (Asymmetric Distance Computation), and SDC for approximate nearest neighbor search.
Kusupati et al. (2022). Matryoshka Representation Learning. NeurIPS 2022. arXiv:2205.13147 — the nested/coarse-to-fine representation idea that inspires remex's bit-level nesting.
Mezzadri (2007). How to Generate Random Matrices from the Classical Compact Groups. Notices of the AMS 54(5):592–604. arXiv:math-ph/0609050 — the QR-of-Gaussian method for Haar-distributed orthogonal matrices used in remex/rotation.py.
Lloyd (1982). Least Squares Quantization in PCM. IEEE Trans. Information Theory 28(2):129–137. IEEE Xplore — optimal scalar quantization (Lloyd-Max algorithm) for minimum MSE.

License

MIT

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

austegard

Release history Release notifications | RSS feed

This version

0.5.1

Apr 6, 2026

0.5.0

Apr 6, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

remex-0.5.1.tar.gz (37.8 kB view details)

Uploaded Apr 6, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

remex-0.5.1-py3-none-any.whl (24.9 kB view details)

Uploaded Apr 6, 2026 Python 3

File details

Details for the file remex-0.5.1.tar.gz.

File metadata

Download URL: remex-0.5.1.tar.gz
Upload date: Apr 6, 2026
Size: 37.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for remex-0.5.1.tar.gz
Algorithm	Hash digest
SHA256	`dac503280849d7e06ed745ebdd624e8b135c73b9d98e8cc34d5bbfb276d44210`
MD5	`70e94f2e596d13851a84f2ddc3c9de82`
BLAKE2b-256	`98eaeb3b7a59ae43220ff1b75120af49d3efa3901c0d42eed0d766a8b299cc44`

See more details on using hashes here.

Provenance

The following attestation bundles were made for remex-0.5.1.tar.gz:

Publisher: publish.yml on oaustegard/remex

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: remex-0.5.1.tar.gz
- Subject digest: dac503280849d7e06ed745ebdd624e8b135c73b9d98e8cc34d5bbfb276d44210
- Sigstore transparency entry: 1243603799
- Sigstore integration time: Apr 6, 2026
Source repository:
- Permalink: oaustegard/remex@b9af18b22a896db83dcff9c260b2a5a413453745
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/oaustegard
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b9af18b22a896db83dcff9c260b2a5a413453745
- Trigger Event: release

File details

Details for the file remex-0.5.1-py3-none-any.whl.

File metadata

Download URL: remex-0.5.1-py3-none-any.whl
Upload date: Apr 6, 2026
Size: 24.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for remex-0.5.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f2f81e11c4e2bd95f7b15d677942bae5b00ad74f2e127a39e6f844647996dc82`
MD5	`9f9d19f8a21835d74e12e41ccba05dc2`
BLAKE2b-256	`9c937f957e332c7ee2e260e89adb166774ab88851b2669c6a0bf7660ed9580c9`

See more details on using hashes here.

Provenance

The following attestation bundles were made for remex-0.5.1-py3-none-any.whl:

Publisher: publish.yml on oaustegard/remex

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: remex-0.5.1-py3-none-any.whl
- Subject digest: f2f81e11c4e2bd95f7b15d677942bae5b00ad74f2e127a39e6f844647996dc82
- Sigstore transparency entry: 1243603815
- Sigstore integration time: Apr 6, 2026
Source repository:
- Permalink: oaustegard/remex@b9af18b22a896db83dcff9c260b2a5a413453745
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/oaustegard
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b9af18b22a896db83dcff9c260b2a5a413453745
- Trigger Event: release

remex 0.5.1

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Project description

remex

Quick start

How it works

Matryoshka bit precision

Benchmarks

Recall vs bit level (synthetic, d=384, 10k corpus, 200 queries)

Real embeddings (all-MiniLM-L6-v2, d=384, 10k corpus, 500 queries)

Scaling with corpus size (synthetic, 4-bit)

When to use remex / when not to

Use remex when

Do not use remex when

Distribution sensitivity (10k corpus, 4-bit, varying cluster tightness)

Compression ratios

API reference

Quantizer(d, bits=4, seed=42)

Methods

CompressedVectors

Properties

Methods

PackedVectors

GPUSearcher (optional)

Memory profiles (100k vectors, d=384, 8-bit)

Low-level utilities

vs TurboQuant

vs FAISS Product Quantization

vs scalar quantization (naive rounding)

Installation

Testing

References

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

`Quantizer(d, bits=4, seed=42)`

`CompressedVectors`

`PackedVectors`

`GPUSearcher` (optional)