Build static, range-fetchable full-text search datasets (RRS/RRSF/RRSR) from Python — search millions of records in the browser with no backend
roaringrange (Python)
Build static, range-fetchable search datasets from Python, then search
millions of records in the browser with no backend. These bindings wrap the
core Rust build module, so the files they emit are byte-identical to the Go and
Rust builders and are read by the same WASM reader. Two index types: a trigram
text index (Builder) and a similarity / vector index (VectorBuilder).
What it produces
Builder.build(out_dir) writes the four files the text reader serves over HTTP
Range; VectorBuilder.build(path) writes one .rrvi similarity index:
| file | format | contents |
|---|---|---|
| index.rrs | RRSI | trigram text index (popularity-split postings) |
| index.rrf | RRSF | facet sidecar (field → category → doc-ID bitmap, with counts) |
| records.idx / records.bin | RRSR | per-doc record bytes (your encoding) |
| *.rrvi | RRVI | IVFPQ similarity index (range-fetched coarse clusters + PQ codes) |
Upload them to S3/CloudFront and point the WASM reader at the URLs.
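To make "range-fetchable" concrete, here is a minimal sketch of what a ranged read looks like. It is not the WASM reader's code: `range_header` and `read_range` are hypothetical helpers, and a throwaway local file stands in for an object on S3/CloudFront. The only fact it relies on is that HTTP byte ranges are inclusive on both ends.

```python
import os
import tempfile


def range_header(offset, length):
    """Format an HTTP Range header value for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


def read_range(path, offset, length):
    """Local stand-in for a ranged GET: read only the requested slice of a file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Demo on a throwaway file (hypothetical data, not a real index file).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(bytes(range(256)))
    chunk = read_range(demo_path, offset=16, length=4)

print(range_header(16, 4))  # bytes=16-19
print(list(chunk))          # [16, 17, 18, 19]
```

A reader that works this way only ever downloads the slices a query touches, which is why the static host must honor Range requests.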
Install
Prebuilt abi3 wheels (one wheel for CPython 3.8+) are published to PyPI:
pip install roaringrange
CI builds and tests the extension on CPython 3.12, 3.13, and 3.14.
From source (dev)
cd python
maturin develop --release # builds + installs into the active venv
# or: maturin build --release # produces a wheel in target/wheels/
Requires a Rust toolchain and pip install maturin.
Usage
import json
import roaringrange as rr

b = rr.Builder(gram_size=3)
for row in rows:  # rows from a DataFrame, DB, JSONL, …
    b.add(
        rank=row["citations"],  # higher rank = listed first (doc-ID order)
        text=f'{row["title"]} {row["abstract"]}',  # tokenized into trigram keys
        record=json.dumps({"t": row["title"], "y": row["year"]}).encode(),
        facets={"year": [str(row["year"])], "type": [row["type"]]},  # field → categories
    )
stats = b.build("out/")  # writes out/index.rrs, index.rrf, records.idx, records.bin
print(stats)  # BuildStats(docs=..., ngrams=..., fields=...)
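The facets argument maps each field to a list of categories, and the sidecar stores a doc-ID bitmap per category with counts. A rough sketch of that shape, assuming plain Python sets in place of the on-disk bitmaps (`build_facets` and `facet_counts` are hypothetical helpers, not part of roaringrange):

```python
from collections import defaultdict


def build_facets(docs):
    """docs: iterable of (doc_id, {field: [category, ...]}) pairs.

    Returns field -> category -> set of doc IDs. The set is only a stand-in
    for the bitmap the real sidecar stores.
    """
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, facets in docs:
        for field, categories in facets.items():
            for category in categories:
                index[field][category].add(doc_id)
    return index


def facet_counts(index, field):
    """Number of documents in each category of one field."""
    return {category: len(ids) for category, ids in index[field].items()}


docs = [
    (0, {"year": ["2020"], "type": ["article"]}),
    (1, {"year": ["2021"], "type": ["article"]}),
    (2, {"year": ["2020"], "type": ["book"]}),
]
index = build_facets(docs)

# Filtering on two facets is a set (bitmap) intersection.
filtered = index["year"]["2020"] & index["type"]["article"]
print(facet_counts(index, "year"))  # {'2020': 2, '2021': 1}
print(sorted(filtered))             # [0]
```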
rr.tokenize(text, gram_size=3) returns the n-gram keys a string maps to — useful
for understanding why a query does or doesn't match.
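For intuition about why a query does or doesn't match, here is a simplified character n-gram sketch. The normalization (lowercasing, whitespace collapsing) and the helper names `char_ngrams` and `could_match` are assumptions made for illustration; the real key set comes from rr.tokenize and may differ.

```python
def char_ngrams(text, gram_size=3):
    """Distinct character n-grams of `text`, in first-seen order.

    Assumed normalization: lowercase and collapse runs of whitespace.
    """
    normalized = " ".join(text.lower().split())
    seen = set()
    grams = []
    for i in range(len(normalized) - gram_size + 1):
        gram = normalized[i:i + gram_size]
        if gram not in seen:
            seen.add(gram)
            grams.append(gram)
    return grams


def could_match(query, document, gram_size=3):
    """A document is a candidate only if it contains every n-gram of the query."""
    doc_grams = set(char_ngrams(document, gram_size))
    query_grams = char_ngrams(query, gram_size)
    return bool(query_grams) and all(g in doc_grams for g in query_grams)


print(char_ngrams("Roaring"))                  # ['roa', 'oar', 'ari', 'rin', 'ing']
print(could_match("roar", "Roaring bitmaps"))  # True
print(could_match("roam", "Roaring bitmaps"))  # False ('oam' is missing)
```

Note that in this sketch a query shorter than gram_size produces no keys at all and therefore cannot match anything.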
Vector / similarity search
VectorBuilder trains an IVFPQ index over your embeddings and writes a single
.rrvi file that the WASM reader range-fetches like the text index. Use the
same doc_id as the text index so a vector hit maps to the same record (and
can hybridize with trigram search). Vectors are L2-normalized for the default
"ip" (cosine) metric.
import roaringrange as rr

vb = rr.VectorBuilder(dim=256, nlist=4096, m=32, metric="ip")  # m must divide dim
for doc_id, embedding in enumerate(embeddings):  # embeddings: e.g. rows of a 2-D numpy array
    vb.add(doc_id, embedding.tolist())  # numpy row → list of floats
# or in one call: vb.add_many([(i, e.tolist()) for i, e in enumerate(embeddings)])
stats = vb.build("out/vectors.rrvi")
print(stats)  # VectorBuildStats(vectors=..., dim=256, nlist=..., m=32, nbits=8)
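The reason L2 normalization turns the "ip" metric into cosine similarity can be checked in a few lines. This is a standalone illustration in pure Python, not the builder's code; the helper names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a = [3.0, 4.0]
b = [4.0, 3.0]

# Inner product of the normalized vectors equals the cosine of the raw ones.
ip_of_normalized = inner_product(l2_normalize(a), l2_normalize(b))
print(math.isclose(ip_of_normalized, cosine(a, b)))  # True
```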
Parameters: nlist coarse clusters (≈ 4·√N, clamped to the vector count),
m PQ subquantizers (must divide dim), nbits (1–8) → 2^nbits codes per
subspace, metric "ip"/"cosine" or "l2". Training is deterministic
(seed, kmeans_iters). One .rrvi per embedding model — each model is a
different vector space. See ../VECTORS.md for the byte layout.
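The constraints above can be checked before a long training run. The sketch below only restates the rules from this section (nlist ≈ 4·√N clamped to the vector count, m dividing dim, nbits in 1–8); `suggest_nlist` and `check_pq_params` are hypothetical helpers, and the exact rounding the builder applies is an assumption.

```python
import math


def suggest_nlist(n_vectors):
    """Roughly 4 * sqrt(N) coarse clusters, clamped to [1, N]."""
    if n_vectors <= 0:
        raise ValueError("need at least one vector")
    return max(1, min(n_vectors, round(4 * math.sqrt(n_vectors))))


def check_pq_params(dim, m, nbits=8):
    """Validate PQ settings and report the derived sizes."""
    if dim % m != 0:
        raise ValueError(f"m={m} must divide dim={dim}")
    if not 1 <= nbits <= 8:
        raise ValueError("nbits must be between 1 and 8")
    return {
        "subvector_dim": dim // m,          # dimensions handled by each subquantizer
        "codes_per_subspace": 2 ** nbits,   # centroids per subspace
    }


print(suggest_nlist(1_000_000))  # 4000
print(suggest_nlist(9))          # 9 (clamped to the vector count)
print(check_pq_params(256, 32))  # {'subvector_dim': 8, 'codes_per_subspace': 256}
```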
This pure-Rust trainer suits small/medium corpora and tests; at very large scale
train with FAISS and export the same RRVI layout (the reader is identical).
Scale: train with FAISS, export to RRVI
For large corpora, train OPQ,IVF,PQ with FAISS and export the trained parts —
no retraining in Rust. python/scripts/faiss_to_rrvi.py does this end to end
(install the extra: pip install 'roaringrange[train]' for numpy + faiss-cpu):
from faiss_to_rrvi import export_to_rrvi
stats = export_to_rrvi(vectors, doc_ids, "vectors.rrvi", nlist=4096, m=32, metric="ip")
Under the hood it calls the low-level roaringrange.write_rrvi_from_faiss(...),
which takes the FAISS arrays (OPQ rotation, coarse centroids, PQ codebooks,
per-vector cluster + 8-bit codes) as little-endian byte buffers — so the wheel
needs no numpy dependency. The export is verified against the Rust reader
(recall@10 ≈ 0.9995 vs FAISS's own search on the same index).
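Passing arrays as little-endian byte buffers needs nothing beyond the standard library. The sketch below shows one way to serialize and round-trip such buffers with struct; it deliberately does not call write_rrvi_from_faiss, whose argument order and per-argument dtypes are not described here, and the float32 / uint8 choices are assumptions for illustration.

```python
import struct


def f32_le_bytes(values):
    """Pack floats as consecutive little-endian 32-bit floats."""
    values = list(values)
    return struct.pack(f"<{len(values)}f", *values)


def f32_from_le_bytes(buf):
    """Inverse of f32_le_bytes."""
    count = len(buf) // 4
    return list(struct.unpack(f"<{count}f", buf))


def u8_bytes(codes):
    """8-bit codes need no byte-order handling: one code, one byte."""
    return bytes(codes)


centroid = [1.0, -2.5, 0.25]   # toy values, exactly representable in float32
buf = f32_le_bytes(centroid)

print(len(buf))                 # 12 (3 floats x 4 bytes)
print(f32_from_le_bytes(buf))   # [1.0, -2.5, 0.25]
print(u8_bytes([0, 17, 255]))   # b'\x00\x11\xff'
```

With numpy available, `array.astype("<f4").tobytes()` produces the same float32 layout from an array.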
Embedding text (mode 2: model2vec, no backend)
python/scripts/model2vec_embed.py embeds text with a model2vec static model
(minishlab/potion-retrieval-32M, 512-d, mean-pooled token vectors — no
transformer, fast on CPU) and builds a .rrvi. Install the extra:
pip install 'roaringrange[embed]'.
from model2vec_embed import build_rrvi_from_texts
stats, _ = build_rrvi_from_texts(titles, doc_ids, "vectors.rrvi", nlist=256, m=32)
It's "mode 2" because the same model2vec recipe can run in the browser at query time, so similarity search needs no backend at all. The query embedding must use the identical model + pooling as the corpus, or the spaces won't match.
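Mean pooling over static token vectors is simple enough to sketch directly. This toy version uses a whitespace tokenizer and a three-word, 2-d lookup table, both invented for illustration; the real model uses its own tokenizer and 512-d vectors. The point is that one `embed` function serves both corpus and query, which is what keeps the two in the same space.

```python
def mean_pool(token_vectors):
    """Average a non-empty list of equal-length vectors, component by component."""
    if not token_vectors:
        raise ValueError("no known tokens to pool")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def embed(text, table):
    """Look up each known token's static vector and mean-pool them."""
    tokens = text.lower().split()  # toy tokenizer (assumption)
    return mean_pool([table[t] for t in tokens if t in table])


# Hypothetical static embedding table.
TOY_TABLE = {
    "static": [1.0, 0.0],
    "search": [0.0, 1.0],
    "index": [1.0, 1.0],
}

doc_vec = embed("Static search", TOY_TABLE)    # build time
query_vec = embed("search static", TOY_TABLE)  # query time, same function
print(doc_vec)               # [0.5, 0.5]
print(doc_vec == query_vec)  # True (mean pooling ignores token order)
```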
Notes
- Ranking is baked in. Doc IDs are assigned in descending rank, so the top-K of any query is free at read time (no query-time scoring). Pick a good rank signal (citations, holdings, popularity, …).
- Records are opaque. record= is raw bytes; the format never dictates your schema. Decode them however you like on the client.
- In-memory build. This builds the whole index in RAM, which is ideal for up to many millions of records. For corpora whose index exceeds memory, the core crate's chunked path (build::chunk) is the route; exposing it here is a follow-up.
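The "ranking is baked in" note can be illustrated with a short sketch: if doc IDs are handed out in descending rank, the best K matches are simply the K smallest matching IDs. The helper names are hypothetical, and the tie-breaking rule (original order for equal ranks) is an assumption, not a statement about the builder.

```python
def assign_doc_ids(ranks):
    """Return a list where position = doc ID and value = original row index.

    Rows are ordered by descending rank; Python's sort is stable, so equal
    ranks keep their input order (an assumed tie-break).
    """
    return sorted(range(len(ranks)), key=lambda i: -ranks[i])


def top_k(matching_doc_ids, k):
    """With rank-ordered IDs, the top-K hits are the K smallest IDs."""
    return sorted(matching_doc_ids)[:k]


ranks = [10, 500, 42, 7]          # e.g. citation counts per input row
doc_id_to_row = assign_doc_ids(ranks)
print(doc_id_to_row)              # [1, 2, 0, 3]

matches = {3, 1, 2}               # doc IDs returned by some query
best = top_k(matches, 2)
print(best)                                   # [1, 2]
print([ranks[doc_id_to_row[d]] for d in best])  # [42, 10]
```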
MIT — see ../LICENSE.