Fast compressed ANN search via randomized Hadamard transform and optimal Gaussian scalar quantization. Pure NumPy, no heavy dependencies.
Project description
snapvec
Fast compressed approximate nearest-neighbor search. Pure NumPy. No heavy dependencies.
snapvec ships four index types for embedding vector search, each targeting a different point on the accuracy / storage / latency frontier:
SnapIndex— training-free scalar. Implements TurboQuant (randomized Hadamard transform + Lloyd-Max scalar quantization). Works out-of-the-box on any vector distribution, no calibration or corpus sample required. ~6× / ~8× / ~12× compression at 4 / 3 / 2 bits with >0.92 recall@10 on real embeddings.ResidualSnapIndex— training-free, two-stage scalar. Cascades coarse + residual Lloyd-Max quantization to reach operating pointsSnapIndexalone cannot (5 / 6 / 7 bits/coord with recall up to 0.96), and offers a coarse-pass + rerank search mode that converges to full-reconstruction recall at O(rerank_M) candidates instead of O(N) (e.g.rerank_M = 100already saturates on the tested corpora).PQSnapIndex— train-once product quantization. Learns per-subspace k-means codebooks; delivers +15–18 pp recall@10 overSnapIndexat matched bytes/vec on modern LLM embeddings, and opens ultra-compressed modes (16 / 32 / 64 B/vec) that scalar quantization cannot reach. Cost is one offlinefit(sample)call.IVFPQSnapIndex— sub-linear search at scale. Adds an inverted-file coarse partition on top of residual PQ. Visits onlynprobe / nlistof the corpus per query: ~9× faster thanPQSnapIndexfull-scan at -0.6 pp recall, or actually exceeds full-scan recall whennprobe ≥ nlist / 8because per-cluster residuals have smaller variance than globally-centred vectors.
pip install snapvec
Which bits should I use?
| If you are… | Pick | Why |
|---|---|---|
| Building RAG / semantic search (the default) | bits=4 |
~95% recall@10 on real embeddings, 5.9× smaller than float32 |
| Squeezing massive corpora where scale beats precision | bits=2 |
11.6× smaller; recall@10 ≈ 0.83 on synthetic, higher on clustered real data |
| Willing to pay a bit of accuracy for ~25% more compression vs 4-bit | bits=3 |
7.8× smaller, now tightly packed — a genuine middle ground as of v0.3.0 |
| Need unbiased inner-product estimates (KV-cache, attention) | bits=3 or 4 + use_prod=True |
QJL correction at the cost of ~2× search latency |
Which Index should I use?
| If you need… | Use | Why |
|---|---|---|
| Maximum accuracy on LLM embeddings | PQSnapIndex |
Exploits the embedding model's natural cluster structure. 128 B/vec with PQ matches 256 B/vec with scalar Lloyd-Max on BGE-small. |
| Zero setup / plug-and-play | SnapIndex |
No .fit() needed. RHT Gaussianizes any distribution, so fixed Lloyd-Max codebooks work out of the box. |
| Massive scale (N ≫ 10⁶) | PQSnapIndex |
Enables 16 B and 32 B per-vector modes that are not physically reachable with scalar quantization (SnapIndex floors at ~128 B/vec for 384-dim embeddings). |
| Sub-linear search latency (N ≳ 10⁴) | IVFPQSnapIndex |
Inverted-file partition + residual PQ. Visits only nprobe / nlist of the corpus per query — 5–13× faster than PQSnapIndex full-scan at the same recall on BGE-small at N=20k. |
| Unbiased inner-product estimates (KV-cache, attention) | SnapIndex(use_prod=True) |
QJL residual correction; see the TurboQuant_prod section below. |
On BGE-small / SciFact (3-seed mean, disjoint train/eval split):
| Mode | Bytes/vec | Recall@10 | Δ vs scalar baseline |
|---|---|---|---|
SnapIndex(bits=2) |
128 + 4 | 0.686 | — |
PQSnapIndex(M=128, K=256, normalized=True) |
128 | 0.871 | +18.5 pp |
SnapIndex(bits=3) |
192 + 4 | 0.781 | — |
PQSnapIndex(M=192, K=256, normalized=True) |
192 | 0.937 | +15.6 pp |
SnapIndex(bits=4) |
256 + 4 | 0.870 | — |
PQSnapIndex(M=256, K=256, use_rht=True) |
256 | 0.948 | +7.8 pp |
On augmented BGE-small @ N = 20 000 (for a regime where IVF is worth it; M=192, K=256):
| Mode | Visits | Recall@10 | ms / query | Speedup vs PQ full-scan |
|---|---|---|---|---|
PQSnapIndex full-scan |
100 % | 0.937 | 5.82 | 1.0× |
IVFPQSnapIndex(nlist=256, nprobe=8) |
3.1 % | 0.910 | 0.60 | 9.7× |
IVFPQSnapIndex(nlist=256, nprobe=16) |
6.2 % | 0.931 | 0.96 | 6.1× |
IVFPQSnapIndex(nlist=256, nprobe=32) |
12.5 % | 0.940 | 1.66 | 3.5× & higher recall |
Full sweeps in experiments/bench_pq_scaleup_validation.py (scalar vs PQ, K ∈ {16, 64, 256}, 3 seeds) and experiments/bench_ivf_pq_contiguous.py (IVF-PQ, nlist ∈ {128, 256, 512}).
Quick start
import numpy as np
from snapvec import SnapIndex
# Build index
idx = SnapIndex(dim=384, bits=4) # 4-bit, ~6x compression
# ⚠ pass float32 arrays — see "Input dtype" below
embeddings = np.asarray(embeddings, dtype=np.float32)
idx.add_batch(ids=list(range(N)), vectors=embeddings)
# Query
results = idx.search(query_vector.astype(np.float32), k=10) # [(id, score), ...]
# Persist
idx.save("my_index.snpv")
idx2 = SnapIndex.load("my_index.snpv") # atomic save, v1/v2/v3 compatible
Quick start — PQSnapIndex (higher recall, one-off fit)
from snapvec import PQSnapIndex
# Build: choose M subspaces dividing dim (or pdim if use_rht=True) and K ≤ 256 centroids.
# Storage = M bytes/vec when normalized=True (+4 bytes otherwise).
idx = PQSnapIndex(dim=384, M=192, K=256, normalized=True) # 192 B/vec
# Train once on a representative sample (≥ K, ~10–50 k is plenty)
idx.fit(training_vectors.astype(np.float32))
# Index and query exactly like SnapIndex
idx.add_batch(ids=list(range(N)), vectors=corpus.astype(np.float32))
results = idx.search(query.astype(np.float32), k=10)
# Persist to its own format (atomic save, .snpq magic SNPQ v1)
idx.save("my_index.snpq")
idx2 = PQSnapIndex.load("my_index.snpq")
Pick PQSnapIndex when you can afford the one-off fit step and want the recall lift documented above; pick SnapIndex when you need a truly training-free, drop-in index.
Quick start — IVFPQSnapIndex (sub-linear search at scale)
from snapvec import IVFPQSnapIndex
# Rule-of-thumb: nlist ≈ 4·√N (e.g. nlist = 256 for N = 20 000).
idx = IVFPQSnapIndex(dim=384, nlist=256, M=192, K=256, normalized=True)
# fit() trains both the coarse centroids and the residual PQ codebook
idx.fit(training_vectors.astype(np.float32))
idx.add_batch(ids=list(range(N)), vectors=corpus.astype(np.float32))
# nprobe trades recall for latency. Defaults to max(1, nlist // 16).
results = idx.search(query.astype(np.float32), k=10, nprobe=16)
idx.save("my_index.snpi")
idx2 = IVFPQSnapIndex.load("my_index.snpi")
On BGE-small (N = 20 000, M=192, K=256, nlist=256) vs. PQSnapIndex full-scan (recall 0.937, 5.82 ms/q):
nprobe |
% corpus probed | recall@10 | ms/q | speedup |
|---|---|---|---|---|
| 8 | 3.1 % | 0.910 | 0.60 | 9.7× |
| 16 | 6.2 % | 0.931 | 0.96 | 6.1× |
| 32 | 12.5 % | 0.940 | 1.66 | 3.5× (↑ recall) |
Residual encoding (stored codes are for x − centroid_c rather than x itself) lets the IVF configuration exceed full-scan recall at nprobe ≥ nlist / 8 because per-cluster residuals have smaller variance than globally-centred vectors. See experiments/bench_ivf_pq_contiguous.py for the sweep.
Why PQSnapIndex defaults to use_rht=False
The Randomized Hadamard Transform (RHT) is essential for scalar quantization: it Gaussianizes each coordinate so that a single fixed Lloyd-Max codebook is optimal regardless of the input distribution. That is the entire point of SnapIndex.
For Product Quantization the same rotation is actively harmful:
- Structure preservation. PQ learns a codebook per subspace by finding clusters inside that subspace. RHT spreads every coordinate's variance uniformly across the whole vector, which increases entropy per subspace and destroys the clustered geometry k-means is trying to exploit.
- Empirical cost. Turning the RHT off and running k-means on the raw embedding subspaces buys +15.6 pp recall@10 at 192 B/vec on BGE-small / SciFact (0.937 vs 0.781 for
SnapIndex(bits=3), and 0.781 vs 0.776 for PQ-with-RHT at the same storage). Validated across K ∈ {16, 64, 256} with 3-seed mean and std ≤ 0.01.
use_rht=True remains available for the rare cases where the embedding distribution is genuinely isotropic already, or for sanity-checking against the scalar pipeline. See experiments/bench_pq_no_rht_ablation.py for the controlled A/B.
Input dtype: pass np.float32
add_batch, add, and search accept any array-like input but cast
internally to float32. If you pass float64 (the NumPy default),
a full-size temporary copy is allocated during the cast — for
add_batch(1M × 384) that's a transient 1.5 GB allocation on top of
your input array, gone only after quantization completes.
To avoid this:
# Models that return float32 directly (most modern embeddings)
embeddings = model.encode(texts) # already float32 → no copy
idx.add_batch(ids, embeddings)
# Arrays from np.array([...]) default to float64 — cast explicitly
embeddings = np.asarray(my_list, dtype=np.float32) # cast once upfront
idx.add_batch(ids, embeddings)
# Loading from disk — respect the original dtype or force float32
embeddings = np.load("vecs.npy").astype(np.float32, copy=False)
The cast is a correctness convenience, not a design choice — the whole
pipeline (normalize, RHT, quantize) operates in float32.
Technical background
The problem: embedding vectors are expensive
Modern embedding models produce float32 vectors of dimension d ∈ {384, 768, 1536}.
Storing N vectors requires 4·N·d bytes; brute-force search costs O(N·d) per query.
For N = 1M, d = 384: 1.5 GB RAM, with inner products dominating inference time.
Product Quantization (PQ) splits vectors into M sub-vectors and quantizes each independently. It is effective but requires training a K-means codebook per dataset. Random Binary Quantization (RaBitQ, 1-bit) is fast but coarse.
TurboQuant achieves near-optimal distortion at b bits per coordinate without training codebooks, by first rotating the space with a randomized Hadamard transform to make coordinates approximately Gaussian, then quantizing each coordinate independently with the optimal scalar quantizer for N(0,1).
📄 Paper: Zandieh, Daliri, Hadian, Mirrokni (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026 — arXiv:2504.19874
snapvecis an independent, pure-NumPy implementation of the algorithm described in that paper. If you use snapvec in academic work, please cite the TurboQuant paper.
Algorithm
Step 1 — Normalize
Given a raw embedding v ∈ ℝᵈ, compute the unit vector v̂ = v / ‖v‖ and store
‖v‖ separately (float32, 4 bytes per vector).
Step 2 — Randomized Hadamard Transform (RHT)
Pad v̂ to the next power of 2 (d' = 2^⌈log₂ d⌉), then apply:
x = (1/√d') · H · D · v̂
where:
D = diag(σ₁, …, σ_d')— diagonal matrix of i.i.d. ±1 random signs (seed-deterministic)H— unnormalized Walsh-Hadamard matrix (butterfly pattern)
By the Johnson-Lindenstrauss lemma, each coordinate xᵢ ≈ N(0, 1/d').
After rescaling x̃ = x · √d', the coordinates are approximately N(0,1)
regardless of the original distribution of v.
Complexity: O(d log d) — no matrix multiplication, no codebook training.
Implementation note (v0.3.0): the butterfly is fully vectorised via a
reshape view — each level becomes one pair of NumPy ops on the whole
array instead of a Python for loop over n / (2h) slices. The diff is
tiny and the speedup is ~24× on a single query at padded_dim = 512:
# Before: Python loop over n/(2h) slices per level → O(d) dispatches
while h < n:
for i in range(0, n, h * 2):
a = x[..., i : i + h].copy()
b = x[..., i + h : i + 2 * h]
x[..., i : i + h] = a + b
x[..., i + h : i + 2 * h] = a - b
h *= 2
# After: reshape view → one pair of ops per level → O(log d) dispatches
while h < n:
view = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
a = view[..., 0, :].copy()
view[..., 0, :] += view[..., 1, :] # view[0] ← a + b
view[..., 1, :] = a - view[..., 1, :] # view[1] ← a - b
h *= 2
reshape returns a view only when the array is C-contiguous, so writes
to view propagate back to x — no extra allocation per level beyond
the single a.copy(). In the RHT pipeline we enforce this explicitly
(via np.ascontiguousarray at the call site and a defensive assert
inside _fwht_inplace); .astype alone does not guarantee C-order
when the input is Fortran-contiguous.
Step 3 — Lloyd-Max scalar quantization
The optimal scalar quantizer for N(0,1) at b bits partitions ℝ into 2^b intervals
and assigns each the conditional mean as reconstruction value.
These boundaries and centroids are precomputed and hardcoded in snapvec._codebooks
(no scipy required at runtime):
| bits | levels | distortion (MSE) | bytes/coord |
|---|---|---|---|
| 2 | 4 | 0.1175 | 0.25 |
| 3 | 8 | 0.0311 | 0.375 |
| 4 | 16 | 0.0077 | 0.50 |
The quantized vector is stored as a uint8 index matrix, tightly
bit-packed to b/8 bytes per coordinate both on disk and in RAM. For 3-bit
this means packing 8 indices into 3 bytes (24 bits); for 2-bit and 4-bit
the packing is byte-aligned (4 per byte, 2 per byte respectively).
Step 4 — Approximate inner product
At search time the query q is rotated (not quantized) and the approximate
cosine similarity is computed as:
score(q, v) = (1/d') · Σᵢ centroid[idx_qᵢ] · centroid[idx_vᵢ]
This is a single float16 matrix–vector product against the cached centroid expansions.
TurboQuant_prod: unbiased estimator with QJL correction
The MSE quantizer introduces a small systematic downward bias. The use_prod=True mode
corrects this using a Quantized Johnson-Lindenstrauss (QJL) residual:
Build time (per stored vector):
- Quantize at
(b-1)bits MSE, compute residualr = x̃ - x̃_MSE - Store
sign(S·r)as a 1-bit vector (int8 ±1 in practice), whereS ∈ ℝ^(d'×d')is a fixed random Gaussian matrix - Store
‖r‖ / √d'(one float32 per vector)
Query time (correction term):
correctionᵢ = √(π/2) / d' · ‖rᵢ‖ · dot(S·q̂, sign(S·rᵢ))
final_scoreᵢ = mse_scoreᵢ + correctionᵢ
This follows from Lemma 4 of Zandieh et al. (2025):
E[sign(S·r)] = √(2/π) · S·r / ‖S·r‖, giving an unbiased estimate of ⟨r, q̂⟩.
When to use use_prod=True:
- When you need accurate inner product magnitudes (KV-cache, attention approximation)
- Not recommended for pure ranking/NNS — the added QJL variance degrades recall@k relative to MSE-only at equal total bits
Compression ratios
Measured on d = 384 (BGE-small); the RHT pads to the next power of 2 so
padded_dim = 512. All per-vector byte counts below include the padded
dimensions — they are the numbers you actually pay in RAM and on disk.
The trailing + 4 is the float32 norm; in normalized=True mode it is
still persisted (as 1.0) to keep the on-disk layout stable across modes.
| Backend | Bytes/vec (disk) | Bytes/vec (RAM, idle) | Disk ratio | RAM ratio |
|---|---|---|---|---|
| float32 | 1 536 | 1 536 | 1.0× | 1.0× |
| 4-bit snapvec | 256 + 4 | 256 + 4 | 5.9× | 5.9× |
| 3-bit snapvec | 192 + 4 | 192 + 4 | 7.8× | 7.8× |
| 2-bit snapvec | 128 + 4 | 128 + 4 | 11.6× | 11.6× |
| int8 (naïve) | 384 + 4 | 384 + 4 | 4.0× | 4.0× |
RAM = disk: indices are tightly bit-packed in both. The same byte
layout is used everywhere, so save / load copy bytes directly without
an intermediate unpack/repack step. This halves indices RAM vs the
pre-v0.2 behaviour (which stored uint8 in RAM and only packed on disk).
Bit-packing scheme:
- 4-bit (
0.5 bytes/coord): 2 indices per byte (byte-aligned). - 3-bit (
0.375 bytes/coord): 8 indices → 3 bytes, cross-byte tight packing (v3 file format; v1/v2 files are read transparently via the legacy byte-aligned decoder). - 2-bit (
0.25 bytes/coord): 4 indices per byte (byte-aligned).
Unpacking happens when the float16 centroid cache is built (cached
full-scan path — once, off the hot matmul) or per-chunk/per-query in
chunk_size / filter_ids modes.
Search cache: a lazy float16 centroid expansion of shape
(N, padded_dim) is materialised on first query for fast matmul (~5 ms
at N = 100 k). It is evicted on writes and can be avoided entirely via
chunk_size for memory-constrained deployments.
Real-world footprint: 1 M vectors at d = 768
Typical for BGE-base, E5-base, nomic-embed-text-v1, and other 768-dim
models. The RHT pads to 1024.
| Backend | Idle RAM | + cache (float16) | Warm peak |
|---|---|---|---|
| float32 | 2.86 GiB | — | 2.86 GiB |
| int8 (naïve) | 0.72 GiB | — | 0.72 GiB |
| 4-bit snapvec | 0.48 GiB | +1.91 GiB | 2.39 GiB |
| 3-bit snapvec | 0.36 GiB | +1.91 GiB | 2.27 GiB |
| 2-bit snapvec | 0.24 GiB | +1.91 GiB | 2.15 GiB |
Numbers use binary units (1 GiB = 2³⁰ bytes); e.g. float32 is
1M × 768 × 4 B = 2.86 GiB.
The cache is materialised only during active search and is evicted on
any write. With chunk_size set, warm peak drops to roughly
idle RAM + chunk_size × padded_dim × 2 B (the per-chunk float16
scratch) — e.g. chunk_size=10_000 at padded_dim=1024 adds ~20 MiB,
at the cost of ~10× query latency. This is the usual memory/latency
trade-off, exposed as a first-class flag.
For a single-server RAG index at this scale, 4-bit snapvec idles at half a GiB where float32 idles at ~3 GiB.
Recall benchmarks
Measured on synthetic unit-sphere vectors (d=384, N=10 000, 100 queries).
Baseline: exact cosine float32 brute-force.
| bits | recall@1 | recall@10 | recall@50 |
|---|---|---|---|
| 2 | 0.72 | 0.83 | 0.91 |
| 3 | 0.81 | 0.91 | 0.96 |
| 4 | 0.86 | 0.93 | 0.95 |
Recall improves with clustered (real-world) data. On BGE-small-en embeddings from mixed document corpora, 4-bit achieves recall@10 ≈ 0.95.
Note on published results: The TurboQuant paper (Zandieh et al., 2025) reports recall up to 0.99, measured against HNSW graph navigation (not brute-force float32), on GloVe
d=200data, using recall@1 with largek_probe. These conditions differ from the above; both results are correct under their respective definitions.
File format (.snpv)
Offset Size Field
──────────────────────────────────────────────────
0 4 B magic: "SNPV"
4 4 B version: uint32 (1, 2, or 3)
8 4 B dim: uint32 — original embedding dimension
12 4 B bits: uint32 — total bits (2, 3, or 4)
16 4 B seed: uint32 — rotation seed
20 4 B n: uint32 — number of stored vectors
24 4 B flags: uint32 — bit-0: use_prod, bit-1: normalized [v2/v3 only]
──────────────────────────────────────────────────
28 4 B packed_len: uint32
32 * indices: bit-packed uint8 MSE indices
n×4 B norms: float32 per-vector original norms
[prod only]
n×d' B qjl_signs: int8 sign(S·r) per vector
n×4 B rnorms: float32 ‖r‖/√d per vector
──────────────────────────────────────────────────
n×(2+L) ids: uint16-length-prefixed UTF-8 strings
Saves are atomic on POSIX: writes to .snpv.tmp then os.replace().
Backward compatible: v1 (mse-only, pre-flags), v2 (flags + byte-aligned
3-bit), and v3 (tight 3-bit packing) files all load correctly — the
reader dispatches the 3-bit decoder on version.
API reference
SnapIndex(dim, bits=4, seed=0, use_prod=False, chunk_size=None, normalized=False)
| Parameter | Type | Default | Description |
|---|---|---|---|
dim |
int | — | Embedding dimension |
bits |
int | 4 | Bits per coordinate: 2, 3, or 4 |
seed |
int | 0 | Rotation seed — must be consistent across build and query |
use_prod |
bool | False | Enable QJL unbiased estimator (requires bits ≥ 3) |
chunk_size |
int | None | None | Stream search in chunks without the float16 cache (for N > 500k) |
normalized |
bool | False | Skip norm computation: trust that input vectors are unit-length |
Methods
idx.add(id, vector) # Add one vector
idx.add_batch(ids, vectors) # Add N vectors (~50x faster than loop)
idx.delete(id) -> bool # Remove by id, O(1) lookup
idx.search(query, k=10, filter_ids=None) # [(id, score), ...] descending
idx.save(path) # Atomic binary save to .snpv
SnapIndex.load(path) # Load from .snpv file
idx.stats() -> dict # Compression / memory diagnostics
len(idx) # Number of stored vectors
repr(idx) # SnapIndex(dim=384, bits=4, mode=mse, n=1000)
All vector inputs should be
np.float32. Passingfloat64triggers a full-size temporary cast insideadd_batch/search(see "Input dtype" in Quick start). Most embedding models already returnfloat32; only ad-hoc arrays built from Python lists vianp.array(...)default tofloat64.
Relation to TurboQuant / PolarQuant
snapvec implements the core compression pipeline from:
Zandieh, A., Daliri, M., Hadian, A., & Mirrokni, V. (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874
The same algorithm was published concurrently as "PolarQuant" at AISTATS 2026.
Both names were already taken on PyPI; snapvec = Hadamard + Lloyd-Max,
named after its two core operations.
Note on
PQSnapIndex. WhileSnapIndexfollows the TurboQuant paper strictly (RHT + Lloyd-Max),PQSnapIndexdeliberately departs from that pipeline: it skips the RHT and learns per-subspace k-means codebooks on the raw embedding space, which lets it exploit the natural clustered structure of modern LLM embeddings. The trade-off is that a training-free guarantee is replaced with a one-offfit(sample)call — in exchange for a strictly better recall / storage Pareto frontier on the retrieval workloads we've measured. See the decision table at the top of this README.
Key contributions of this implementation over the reference:
- No scipy — codebooks hardcoded, numpy is the only runtime dependency
- Batch WHT — single O(n·d·log d) call for bulk inserts (~50x faster than loop)
- Float16 cache — centroid expansions in half precision, ~2x faster matmul
- Packed RAM indices — 2/4-bit indices stored bit-packed in RAM (2× less memory), unpacked lazily when the float16 cache is built — zero impact on warm query latency
- Pre-filtering —
filter_idsrestricts search to a subset with O(|filter| · d) cost - Chunked streaming search —
chunk_sizeavoids the float16 cache for large N - O(1) delete —
_id_to_posdict + position compaction - Atomic saves —
.snpv.tmp→os.replace()pattern - Versioned format — v1/v2/v3 all loadable, forward-compatible flags field, legacy 3-bit decoder kept around for pre-v0.3 indices
Roadmap
The current implementation comfortably hits the 10–20 ms interactive budget
for N ≤ 200 k, d ∈ {384, 768}, with p50 warm-query latency of ~5 ms at
N=100 k, d=384. The float16 centroid cache carries ~77% of that time as a
BLAS-bound gemv; the remaining 23% is Python glue (RHT, normalization,
argpartition, result assembly).
Future work is scoped around this measured profile — we don't optimize what's already fast.
Pure-Python improvements (no new dependencies)
- Quantized norms on disk — store per-vector norms as uint8 with a (min, max) header. Saves 3 bytes/vec on disk with <0.1% precision loss.
- Typed ID storage — persist integer IDs as fixed-width uint32/uint64 instead of UTF-8 strings when all IDs are numeric. Saves ~3 bytes/vec on disk for sequential integer IDs.
- Positional-ID fast path — when IDs are
0..N-1, skip_idslist and_id_to_posdict entirely; position is the ID. Saves ~23 bytes/vec of Python object overhead.
Previously planned and now shipped: Tight 3-bit packing (v0.3.0); vectorized FWHT (v0.3.0, ~25× faster per query in isolation, ~10% E2E).
Hybrid Python + Rust core (opt-in accelerator)
The following phases would ship as an optional snapvec-core wheel. The
pure-Python path remains fully functional — Rust is detected at import
time and used automatically when available.
Phase 1 — Cold-start / cache build (highest ROI)
- Move centroid expansion +
float16conversion to Rust with SIMD. - Target: cold first-query 150 ms → ~20 ms on N=100 k, d=384.
- Smallest API surface; trivial Python fallback.
Phase 2 — LUT-based scan (eliminates the float16 cache)
- Implement PQ-style Asymmetric Distance Computation: per-query LUT (16 entries × pdim) + SIMD gather over packed indices.
- Target: warm query 5 ms → ~1.5–2 ms; cache RAM: 102 MB → 0.
- Enables N > 500 k with flat RAM growth.
Phase 3 — Batch RHT + quantization
- Rust FWHT +
searchsortedforadd_batch. - Target:
add_batch(100k, d=384)2.5 s → ~200 ms. - Lowest priority — indexing is typically amortized offline.
Non-goals
- Trained codebooks (PQ / OPQ / RaBitQ-trained). Keeps the "no training required" guarantee; the data-agnostic Lloyd-Max tables make snapvec safe to use without a representative training sample.
- Graph indices (HNSW / NSG). Different trade-off space; snapvec targets flat indices where compression and predictable latency matter more than sub-linear scan.
- GPU acceleration. The cache-matmul path is memory-bound on modern CPUs; GPU would help only at very large N where network/transfer cost already dominates.
Changelog
See CHANGELOG.md for the per-release history. Recent
highlights:
- v0.3.0 — Tight 3-bit packing (5.9× → 7.8× compression for 3-bit), vectorised FWHT (~24× faster single-query RHT), file format v3 with transparent v1/v2 backward-compat.
- v0.2.0 — RAM-packed indices for 2/4-bit (idle RAM cut in half),
normalized=Trueflag, measured-accurate compression docs.
Installation
pip install snapvec
Requirements: Python >= 3.10, NumPy >= 1.24. No other runtime dependencies.
For development:
git clone https://github.com/stffns/snapvec
cd snapvec
pip install -e .
pytest tests/ -v
License
MIT © 2025 Jayson Steffens.
The TurboQuant algorithm is described in arXiv:2504.19874 by Zandieh et al. (Google Research / ICLR 2026). This package is an independent implementation.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file snapvec-0.4.0.tar.gz.
File metadata
- Download URL: snapvec-0.4.0.tar.gz
- Upload date:
- Size: 69.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1490509eff301c7b5c89e813933ca3abe57b119440eb8fe3e7f63728211daacd
|
|
| MD5 |
bf773ab8da944f4fb3bfef77779de539
|
|
| BLAKE2b-256 |
370daab6e58203d5c01988424a583185610fef11fd8721d926a36f4c84a216a7
|
File details
Details for the file snapvec-0.4.0-py3-none-any.whl.
File metadata
- Download URL: snapvec-0.4.0-py3-none-any.whl
- Upload date:
- Size: 43.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
38b951bd196ab793bb780388eac9fb20d8192c6e8ed07bc62537eb05bdf0c603
|
|
| MD5 |
8a140ed3822b4cd17fd330026f30046f
|
|
| BLAKE2b-256 |
9a615bec18d85d3b2750154ef54543d08d04513eeefed04718a9914210343cd2
|