Build static, range-fetchable full-text search datasets (RRS/RRSF/RRSR) from Python — search millions of records in the browser with no backend

roaringrange (Python)

Build static, range-fetchable search datasets from Python, then search millions of records in the browser with no backend. These bindings wrap the core Rust build module, so the files they emit are byte-identical to the Go and Rust builders and are read by the same WASM reader. Two index types: a trigram text index (Builder) and a similarity / vector index (VectorBuilder).

What it produces

Builder.build(out_dir) writes the four files the text reader serves over HTTP Range; VectorBuilder.build(path) writes one .rrvi similarity index:

| file | format | contents |
|---|---|---|
| index.rrs | RRSI | trigram text index (popularity-split postings) |
| index.rrf | RRSF | facet sidecar (field → category → doc-ID bitmap, with counts) |
| records.idx / records.bin | RRSR | per-doc record bytes (your encoding) |
| *.rrvi | RRVI | IVFPQ similarity index (range-fetched coarse clusters + PQ codes) |

Upload them to S3/CloudFront and point the WASM reader at the URLs.
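
To make "served over HTTP Range" concrete, here is a minimal sketch of the kind of request a reader issues against one of the uploaded files. It uses only the Python standard library; the URL is a hypothetical placeholder, and the offset and length are arbitrary values chosen for illustration, not anything defined by the file formats.

```python
from urllib.request import Request


def range_request(url: str, offset: int, length: int) -> Request:
    """Build a GET request for `length` bytes starting at byte `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    last = offset + length - 1
    return Request(url, headers={"Range": f"bytes={offset}-{last}"})


# Hypothetical URL: substitute wherever you uploaded index.rrs.
req = range_request("https://cdn.example.com/search/index.rrs", 0, 4096)
print(req.get_header("Range"))  # prints: bytes=0-4095
```

Passing the request to urllib.request.urlopen would send it; a server that honors the range answers with status 206 (Partial Content) and only the requested bytes.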

Install

Prebuilt abi3 wheels (one wheel for CPython 3.8+) are published to PyPI:

pip install roaringrange

CI builds and tests the extension on CPython 3.12, 3.13, and 3.14.

From source (dev)

cd python
maturin develop --release      # builds + installs into the active venv
# or: maturin build --release   # produces a wheel in target/wheels/

Requires a Rust toolchain and maturin (pip install maturin).

Usage

import json

import roaringrange as rr

b = rr.Builder(gram_size=3)
for row in rows:                              # rows from a DataFrame, DB, JSONL, …
    b.add(
        rank=row["citations"],                # higher rank = listed first (doc-ID order)
        text=f'{row["title"]} {row["abstract"]}',   # tokenized into trigram keys
        record=json.dumps({"t": row["title"], "y": row["year"]}).encode(),
        facets={"year": [str(row["year"])], "type": [row["type"]]},  # field → categories
    )

stats = b.build("out/")        # writes out/index.rrs, index.rrf, records.idx, records.bin
print(stats)                   # BuildStats(docs=..., ngrams=..., fields=...)
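
The record argument in the example above is plain bytes, so the encoding is entirely up to you. As a sketch of one possible round trip, the helpers below (hypothetical names, not part of roaringrange) encode a record as compact UTF-8 JSON and decode it again the way a client might after fetching the bytes.

```python
import json


def encode_record(title: str, year: int) -> bytes:
    """Serialize one record as compact UTF-8 JSON (one possible encoding)."""
    return json.dumps({"t": title, "y": year}, separators=(",", ":")).encode("utf-8")


def decode_record(raw: bytes) -> dict:
    """Inverse of encode_record: bytes back to a dict."""
    return json.loads(raw.decode("utf-8"))


raw = encode_record("Roaring bitmaps", 2016)
print(raw)                 # b'{"t":"Roaring bitmaps","y":2016}'
print(decode_record(raw))  # {'t': 'Roaring bitmaps', 'y': 2016}
```

Short keys and compact separators keep per-record bytes small, which matters when records are fetched individually.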

rr.tokenize(text, gram_size=3) returns the n-gram keys a string maps to — useful for understanding why a query does or doesn't match.
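
For intuition about why a query does or doesn't match, here is a pure-Python sketch of character n-gram matching. It is not the library's tokenizer: the lowercasing and whitespace collapsing are assumptions made for illustration, and rr.tokenize remains the authority on the keys a string actually maps to.

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Sliding-window character n-grams over a normalized string."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def may_match(query: str, doc: str, n: int = 3) -> bool:
    """True when every n-gram of the query also occurs in the document."""
    query_grams = set(char_ngrams(query, n))
    # A query shorter than n characters yields no keys and cannot match.
    return bool(query_grams) and query_grams <= set(char_ngrams(doc, n))


print(char_ngrams("Roaring"))                  # ['roa', 'oar', 'ari', 'rin', 'ing']
print(may_match("roar", "Roaring bitmaps"))    # True
print(may_match("range", "Roaring bitmaps"))   # False
```

In this sketch a query matches only if all of its n-grams are present, which is why a single-character typo can knock out several keys at once.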

Vector / similarity search

VectorBuilder trains an IVFPQ index over your embeddings and writes a single .rrvi file that the WASM reader range-fetches like the text index. Use the same doc_id as the text index so a vector hit maps to the same record (and can hybridize with trigram search). Vectors are L2-normalized for the default "ip" (cosine) metric.
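
The link between L2 normalization and the "ip" (cosine) metric can be checked by hand: once two vectors have unit length, their inner product equals their cosine similarity. The sketch below uses plain Python lists and hypothetical helper names; it illustrates the math only and does not call the library.

```python
import math


def dot(a, b):
    """Inner product of two equal-length float sequences."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
ip_after_normalizing = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(ip_after_normalizing, cosine(a, b)))  # True
```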

import roaringrange as rr

vb = rr.VectorBuilder(dim=256, nlist=4096, m=32, metric="ip")  # m must divide dim
for doc_id, embedding in enumerate(embeddings):     # embeddings: e.g. a NumPy array, one row per document
    vb.add(doc_id, embedding.tolist())              # numpy row → list of floats
# or in one call: vb.add_many([(i, e.tolist()) for i, e in enumerate(embeddings)])

stats = vb.build("out/vectors.rrvi")
print(stats)   # VectorBuildStats(vectors=..., dim=256, nlist=..., m=32, nbits=8)

Parameters:

  • nlist: number of coarse clusters (≈ 4·√N, clamped to the vector count).
  • m: number of PQ subquantizers (must divide dim).
  • nbits: bits per PQ code (1–8), giving 2^nbits codes per subspace.
  • metric: "ip"/"cosine" or "l2".

Training is deterministic (seed, kmeans_iters). Build one .rrvi per embedding model, since each model is a different vector space. See ../VECTORS.md for the byte layout.
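
The parameter rules above can be sanity-checked before a long build. The following is a sketch under stated assumptions: the helper names are hypothetical, and the rounding in suggest_nlist is a guess, since only the ≈ 4·√N heuristic and the clamp to the vector count are documented.

```python
import math


def suggest_nlist(n_vectors: int) -> int:
    """Roughly 4 * sqrt(N) coarse clusters, never more than the vector count."""
    if n_vectors <= 0:
        raise ValueError("n_vectors must be positive")
    # Rounding to the nearest integer is an assumption of this sketch.
    return max(1, min(n_vectors, round(4 * math.sqrt(n_vectors))))


def check_pq(dim: int, m: int, nbits: int = 8) -> dict:
    """Validate PQ settings and report the derived sizes."""
    if dim % m != 0:
        raise ValueError(f"m={m} must divide dim={dim}")
    if not 1 <= nbits <= 8:
        raise ValueError("nbits must be between 1 and 8")
    return {"subvector_dim": dim // m, "codes_per_subspace": 2 ** nbits}


print(suggest_nlist(1_000_000))  # 4000
print(suggest_nlist(9))          # 9 (clamped to the vector count)
print(check_pq(256, 32))         # {'subvector_dim': 8, 'codes_per_subspace': 256}
```

With dim=256 and m=32, each subquantizer covers 8 dimensions and, at nbits=8, picks one of 256 codes per subspace.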

This pure-Rust trainer suits small/medium corpora and tests; at very large scale train with FAISS and export the same RRVI layout (the reader is identical).

Scale: train with FAISS, export to RRVI

For large corpora, train OPQ,IVF,PQ with FAISS and export the trained parts — no retraining in Rust. python/scripts/faiss_to_rrvi.py does this end to end (install the extra: pip install 'roaringrange[train]' for numpy + faiss-cpu):

from faiss_to_rrvi import export_to_rrvi
stats = export_to_rrvi(vectors, doc_ids, "vectors.rrvi", nlist=4096, m=32, metric="ip")

Under the hood it calls the low-level roaringrange.write_rrvi_from_faiss(...), which takes the FAISS arrays (OPQ rotation, coarse centroids, PQ codebooks, per-vector cluster + 8-bit codes) as little-endian byte buffers — so the wheel needs no numpy dependency. The export is verified against the Rust reader (recall@10 ≈ 0.9995 vs FAISS's own search on the same index).
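
As an illustration of what "little-endian byte buffers" means in practice, here is a standard-library sketch that packs floats and 8-bit codes into bytes without numpy. The helper names are hypothetical and the float32 element type is an assumption; the sketch deliberately does not call write_rrvi_from_faiss, whose exact arguments are not described here.

```python
import struct


def f32_le_bytes(values) -> bytes:
    """Pack floats as consecutive little-endian 32-bit values (assumed dtype)."""
    vals = list(values)
    return struct.pack(f"<{len(vals)}f", *vals)


def u8_bytes(codes) -> bytes:
    """Pack integer codes in 0..255 as one byte each."""
    return bytes(codes)


print(f32_le_bytes([1.0]).hex())       # 0000803f
print(len(f32_le_bytes([0.0] * 256)))  # 1024 (256 floats x 4 bytes)
print(u8_bytes([0, 7, 255]).hex())     # 0007ff
```

When numpy is available on the training side, arr.astype("<f4").tobytes() produces the same float layout for a C-contiguous array.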

Embedding text (mode 2: model2vec, no backend)

python/scripts/model2vec_embed.py embeds text with a model2vec static model (minishlab/potion-retrieval-32M, 512-d, mean-pooled token vectors — no transformer, fast on CPU) and builds a .rrvi. Install the extra: pip install 'roaringrange[embed]'.

from model2vec_embed import build_rrvi_from_texts
stats, _ = build_rrvi_from_texts(titles, doc_ids, "vectors.rrvi", nlist=256, m=32)

It's "mode 2" because the same model2vec recipe can run in the browser at query time, so similarity search needs no backend at all. The query embedding must use the identical model + pooling as the corpus, or the spaces won't match.
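
To show what "identical model + pooling" means, here is a toy sketch of mean pooling over static token vectors. The two-dimensional lookup table and the whitespace tokenizer are invented for illustration; this is not the model2vec API, only the shape of the computation that both the corpus side and the query side have to share.

```python
def mean_pool(token_vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    if not token_vectors:
        raise ValueError("no token vectors to pool")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / count for i in range(dim)]


def embed_text(text, table):
    """Look up each known token and mean-pool the results (toy tokenizer)."""
    vectors = [table[tok] for tok in text.lower().split() if tok in table]
    return mean_pool(vectors)


# Toy "model": a static token -> vector table (hypothetical values).
table = {"static": [1.0, 0.0], "search": [0.0, 1.0]}

corpus_vec = embed_text("Static search", table)
query_vec = embed_text("search static", table)
print(corpus_vec)               # [0.5, 0.5]
print(corpus_vec == query_vec)  # True: same table and pooling, same space
```

Swapping the table or the pooling function on only one side would place queries and documents in different spaces, which is the mismatch the paragraph above warns about.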

Notes

  • Ranking is baked in. Doc IDs are assigned in descending rank, so the top-K of any query is free at read time (no query-time scoring). Pick a good rank signal (citations, holdings, popularity, …).
  • Records are opaque. record= is raw bytes; the format never dictates your schema. Decode them however you like on the client.
  • In-memory build. The whole index is built in RAM, which is practical up to many millions of records. For corpora whose index exceeds memory, the core crate's chunked path (build::chunk) is the route; exposing it in these bindings is a follow-up.
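
The "ranking is baked in" note can be sketched in a few lines. Assuming doc IDs start at 0 and ties keep their input order (both are assumptions of this sketch, not documented behavior), sorting by descending rank means the smallest matching doc IDs are always the best-ranked hits.

```python
def assign_doc_ids(rows, rank_key):
    """Number rows in descending rank order: doc ID 0 is the top-ranked row."""
    ordered = sorted(rows, key=lambda row: row[rank_key], reverse=True)
    return dict(enumerate(ordered))


def top_k(matching_doc_ids, k):
    """The k best hits are simply the k smallest doc IDs."""
    # A real posting list is already in doc-ID order; sorting here only
    # makes the sketch work for an arbitrary set of IDs.
    return sorted(matching_doc_ids)[:k]


rows = [
    {"title": "a", "citations": 5},
    {"title": "b", "citations": 50},
    {"title": "c", "citations": 20},
]
docs = assign_doc_ids(rows, "citations")
print([docs[i]["title"] for i in sorted(docs)])     # ['b', 'c', 'a']
print([docs[i]["title"] for i in top_k({2, 0}, 1)])  # ['b']
```

Because the order is fixed at build time, a reader can stop after the first K matching IDs instead of scoring every hit.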

MIT — see ../LICENSE.
