The complete BM25 engine for Python: production-scale, Rust-native
retrievalx
The complete BM25 engine for Python — all major scoring variants, multiple retrieval strategies, Rust-native performance.
37-89x faster than rank-bm25 on the BEIR SciFact benchmark below, with equal or better nDCG@10.
Installation
uv add retrievalx
or with pip:
pip install retrievalx
Pre-built wheels available for:
| Platform | Architectures |
|---|---|
| Linux | x86_64, aarch64 |
| macOS | x86_64, ARM64 (Apple Silicon) |
| Windows | x86_64 |
Supports Python 3.9 through 3.15 (including 3.14 and 3.15 pre-releases). Zero dependencies.
Quickstart
from retrievalx import BM25Index

# Index documents
index = BM25Index.from_documents([
    "rust and python",
    "information retrieval with bm25",
    "search engine internals",
])

# Search
for hit in index.search("rust retrieval", top_k=2):
    print(f"{hit.doc_id}: {hit.score:.4f}")
Features
8 Scoring Variants
BM25 Okapi, Plus, L, Adpt, F (field-weighted), T (term-specific k1), Atire, and Tf-Idf.
from retrievalx import BM25Config, BM25Index, ScoringVariant

# docs is a list of document strings, as in the Quickstart
config = BM25Config(scoring=ScoringVariant.plus(k1=1.5, b=0.8, delta=1.0))
index = BM25Index.from_documents(docs, config=config)
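To make the `k1`, `b`, and `delta` parameters concrete, here is a minimal pure-Python sketch of the textbook per-term score for BM25 Okapi and BM25+. It does not call retrievalx: the function names and the non-negative IDF variant are assumptions chosen for illustration, and the library's exact formulas may differ.

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Non-negative IDF variant (an assumption; other variants exist)."""
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term(tf, doc_len, avg_len, n_docs, doc_freq, k1=1.5, b=0.8, delta=0.0):
    """Score of one query term in one document.

    delta=0.0 gives Okapi BM25; delta>0 gives BM25+, which adds a
    constant floor for every term that actually occurs in the document.
    """
    if tf == 0:
        return 0.0  # a term that is absent contributes nothing
    # b controls how strongly long documents are penalized
    norm = 1.0 - b + b * doc_len / avg_len
    # k1 controls how quickly repeated occurrences saturate
    tf_part = tf * (k1 + 1.0) / (tf + k1 * norm)
    return idf(n_docs, doc_freq) * (tf_part + delta)
```

With the same inputs, the BM25+ score exceeds the Okapi score by exactly `delta * idf`, which is why very long documents that contain the term are no longer pushed toward zero.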
5 Retrieval Strategies
Exhaustive DAAT/TAAT, WAND, Block-Max WAND, and MaxScore.
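from retrievalx import BM25Config, RetrievalStrategy
# Exhaustive retrieval: scores every matching document
config = BM25Config(retrieval=RetrievalStrategy.exhaustive_taat())
# Fast top-k with dynamic pruning (same nDCG@10 as exhaustive in the benchmarks below)
config = BM25Config(retrieval=RetrievalStrategy.block_max_wand())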
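The difference between the two exhaustive strategies is only the traversal order. The toy sketch below is not retrievalx internals: the `POSTINGS` data and the function names are hypothetical. It shows term-at-a-time (TAAT) and document-at-a-time (DAAT) scoring over the same posting lists and returning the same top-k.

```python
import heapq
from collections import defaultdict

# Toy inverted index: term -> [(doc_id, precomputed term score)], sorted by doc_id.
POSTINGS = {
    "rust": [(0, 1.2), (2, 0.4)],
    "retrieval": [(1, 0.9), (2, 0.7)],
}

def top_k(scores, k):
    # Highest score first; ties broken by the smaller doc_id.
    return heapq.nsmallest(k, scores.items(), key=lambda kv: (-kv[1], kv[0]))

def search_taat(terms, k):
    """Term-at-a-time: finish one posting list before starting the next."""
    acc = defaultdict(float)
    for term in terms:
        for doc_id, score in POSTINGS.get(term, []):
            acc[doc_id] += score
    return top_k(acc, k)

def search_daat(terms, k):
    """Document-at-a-time: merge all posting lists in doc_id order."""
    lists = [POSTINGS.get(term, []) for term in terms]
    scores = {}
    current, total = None, 0.0
    for doc_id, score in heapq.merge(*lists):
        if doc_id != current:
            if current is not None:
                scores[current] = total  # previous document is fully scored
            current, total = doc_id, 0.0
        total += score
    if current is not None:
        scores[current] = total
    return top_k(scores, k)
```

Pruning strategies such as WAND and MaxScore build on the DAAT traversal and skip documents whose score upper bound cannot reach the current top-k threshold.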
Advanced Query Types
from retrievalx import BooleanQuery, PhraseQuery, WeightedQuery
# Boolean: must/should/must_not
index.search_boolean(BooleanQuery(must=["python"], should=["fast"], must_not=["slow"]))
# Phrase with proximity window
index.search_phrase(PhraseQuery(terms=["information", "retrieval"], window=2))
# Weighted terms
index.search_weighted(WeightedQuery(weights={"python": 2.0, "search": 1.0}))
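As a rough mental model for the boolean query, the sketch below filters a toy corpus using the conventional must/should/must_not semantics: `must` terms are required, `must_not` terms exclude a document, and `should` terms only affect ranking. Whether retrievalx ranks `should` matches exactly this way is an assumption; `DOCS` and `boolean_filter` are hypothetical names.

```python
DOCS = {
    "a": "python is fast",
    "b": "python is slow",
    "c": "rust is fast",
}

def boolean_filter(docs, must=(), should=(), must_not=()):
    """Return (doc_id, should_matches) pairs, best match first."""
    hits = []
    for doc_id, text in docs.items():
        tokens = set(text.split())
        if not all(term in tokens for term in must):
            continue  # a required term is missing
        if any(term in tokens for term in must_not):
            continue  # an excluded term is present
        bonus = sum(term in tokens for term in should)
        hits.append((doc_id, bonus))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
```

Applied to the query from the example above (`must=["python"], should=["fast"], must_not=["slow"]`), only document `"a"` survives.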
Persistence & Crash Recovery
# Save and load
index.save("index.bin")
loaded = BM25Index.load("index.bin") # in-memory
loaded = BM25Index.load("index.bin", mode="mmap") # memory-mapped
# Write-ahead log for crash recovery
index.enable_wal("index.wal")
index.insert_batch(new_docs)
index.compact_and_flush("index.bin")
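The idea behind a write-ahead log is to record each mutation durably before applying it, so that state can be rebuilt after a crash by replaying the log. The sketch below illustrates that idea with JSON lines; it says nothing about retrievalx's actual on-disk format, and `TinyWal` and `recover` are hypothetical names.

```python
import json
import os

class TinyWal:
    """Append-only log of operations, one JSON object per line."""

    def __init__(self, path):
        self.path = path

    def append(self, op):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the record to disk before returning

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

def recover(wal):
    """Rebuild the in-memory document map from the logged operations."""
    docs = {}
    for op in wal.replay():
        if op["op"] == "insert":
            docs[op["id"]] = op["text"]
        elif op["op"] == "delete":
            docs.pop(op["id"], None)
    return docs
```

Once the rebuilt state has been written to a fresh snapshot, the log can be truncated, which is presumably the role `compact_and_flush` plays in the example above.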
Incremental Updates
# Add documents (with optional IDs)
index.insert_batch(["new document text"])
index.insert_batch([("doc-id", "document with custom ID")])
# Delete and compact
index.delete("doc-id")
index.compact()
Score Fusion
Combine BM25 with dense retrieval or other signals:
from retrievalx import rrf, linear_combination, min_max_normalize
fused = rrf(bm25_results, dense_results, k=60)
fused = linear_combination(bm25_results, dense_results, alpha=0.7)
normalized = min_max_normalize(scores)
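For readers unfamiliar with these fusion methods, here is a pure-Python sketch of the standard definitions. The `_sketch` functions are hypothetical stand-ins, not the retrievalx implementations, and their input shapes (ranked ID lists for RRF, `{doc_id: score}` dicts for the others) are assumptions.

```python
def rrf_sketch(*rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

def min_max_normalize_sketch(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def linear_combination_sketch(a, b, alpha=0.7):
    """alpha * a + (1 - alpha) * b; a document missing from one side scores 0 there."""
    doc_ids = set(a) | set(b)
    fused = {d: alpha * a.get(d, 0.0) + (1.0 - alpha) * b.get(d, 0.0) for d in doc_ids}
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

RRF only looks at ranks, so it needs no normalization. A linear combination mixes raw scores, so BM25 and dense scores are usually normalized to a common range first.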
Evaluation Metrics
Built-in IR metrics with native acceleration:
from retrievalx import ndcg_at_k, recall_at_k, precision_at_k, mrr, average_precision_at_k
ndcg = ndcg_at_k(ranked_ids, relevant_ids, k=10)
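As a reference point for what the metric measures, the sketch below computes nDCG@k with binary relevance in pure Python. It is a hypothetical stand-in, not the native implementation, and it assumes `ranked_ids` contains no duplicates; graded relevance is not covered.

```python
import math

def dcg_at_k(ranked_ids, relevant_ids, k):
    """Discounted cumulative gain with gain 1 for relevant, 0 otherwise."""
    relevant = set(relevant_ids)
    return sum(
        1.0 / math.log2(position + 2)  # position 0 has discount log2(2) = 1
        for position, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )

def ndcg_at_k_sketch(ranked_ids, relevant_ids, k=10):
    """DCG divided by the DCG of an ideal ranking (all relevant docs first)."""
    ideal_hits = min(len(set(relevant_ids)), k)
    ideal = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg_at_k(ranked_ids, relevant_ids, k) / ideal if ideal > 0 else 0.0
```

A single relevant document ranked second scores `1 / log2(3)`, roughly 0.63, which shows how strongly the metric rewards putting relevant results at the very top.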
Custom Tokenization
from retrievalx import BM25Config, TokenizerConfig, Tokenizer, Filter, Stemmer
config = BM25Config(
    tokenizer=TokenizerConfig(
        tokenizer=Tokenizer.UNICODE,
        filters=[Filter.LOWERCASE, Filter.stopwords("en"), Filter.length(min_len=2)],
        stemmer=Stemmer.snowball("en"),
    )
)
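Conceptually, the configuration above describes a pipeline: split the text into tokens, then apply the filters in order. The pure-Python sketch below mimics the first three stages so the effect is easy to see. The stopword list is a tiny hypothetical sample, stemming is omitted, and the library's actual tokenization rules may differ.

```python
import re

# Tiny illustrative stopword list (an assumption, not the library's "en" list).
STOPWORDS = frozenset({"a", "an", "and", "is", "of", "the", "with"})

def tokenize_sketch(text, min_len=2):
    tokens = re.findall(r"\w+", text)                     # Unicode-aware word split
    tokens = [t.lower() for t in tokens]                  # lowercase filter
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword filter
    return [t for t in tokens if len(t) >= min_len]       # length filter
```

The same pipeline must run on both documents and queries, otherwise query terms will not match the indexed terms.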
Benchmarks
On BEIR SciFact (5,183 documents, 300 queries):
| Engine | QPS | Speedup | nDCG@10 | P50 (ms) |
|---|---|---|---|---|
| rank-bm25 (Okapi) | 134 | 1x | 0.5618 | 6.964 |
| retrievalx (Exhaustive DAAT) | 6,505 | 48x | 0.5723 | 0.152 |
| retrievalx (Block-Max WAND) | 4,919 | 37x | 0.5723 | 0.151 |
| retrievalx (Exhaustive TAAT) | 11,935 | 89x | 0.5723 | 0.083 |
| retrievalx (MaxScore) | 7,351 | 55x | 0.5723 | 0.099 |
All four retrievalx strategies in the table reach the same nDCG@10 (0.5723), so choosing a faster strategy costs no accuracy on this dataset.
Full 40-configuration matrix (8 scorers x 5 strategies): docs/benchmarks.md
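The Speedup column is the ratio of each engine's QPS to the rank-bm25 baseline. The snippet below recomputes it from the rounded QPS figures in the table. Because the published QPS values are themselves rounded, the recomputed ratios can differ from the table by up to one unit.

```python
# QPS values copied from the SciFact table above.
QPS = {
    "rank-bm25 (Okapi)": 134,
    "retrievalx (Exhaustive DAAT)": 6505,
    "retrievalx (Block-Max WAND)": 4919,
    "retrievalx (Exhaustive TAAT)": 11935,
    "retrievalx (MaxScore)": 7351,
}

def speedup(engine, baseline="rank-bm25 (Okapi)"):
    """Throughput ratio of an engine relative to the baseline."""
    return QPS[engine] / QPS[baseline]

for name in QPS:
    print(f"{name}: {speedup(name):.1f}x")
```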
Architecture
Rust workspace with five crates:
| Crate | Purpose |
|---|---|
| retrievalx-core | Indexing, scoring, retrieval, query execution, fusion |
| retrievalx-tokenize | Unicode tokenization, stemming, stopword filtering |
| retrievalx-persist | Binary serialization, mmap, write-ahead log |
| retrievalx-eval | IR metrics, BEIR benchmark runner |
| retrievalx-py | PyO3 bindings |
Details: docs/architecture.md | docs/algorithms.md
Examples
| Example | Description |
|---|---|
| quickstart.py | Basic indexing and search |
| it_ticket_search.py | IT ticket triage |
| legal_clause_discovery.py | Legal document search |
| ecommerce_query_tuning.py | E-commerce product search |
| security_log_hunt.py | Security log analysis |
| wal_crash_recovery.py | WAL crash recovery |
| production_hybrid_reranking.py | Hybrid BM25 + dense reranking |
| query_expansion_prf.py | Pseudo-relevance feedback |
| custom_tokenizer.py | Custom tokenizer pipeline |
| bm25f_structured_docs.py | BM25F field-weighted scoring |
| rag_hybrid_pipeline.py | RAG hybrid retrieval pipeline |
| benchmark_retrievalx_vs_rank_bm25.py | Benchmark vs rank-bm25 |
Development
# Setup with uv
uv sync --extra dev
# Or with pip
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
# Run all checks (mirrors CI)
./scripts/check_all.sh
# Run benchmarks
uv sync --extra bench
uv run python examples/benchmark_retrievalx_vs_rank_bm25.py --dataset scifact
See CONTRIBUTING.md for guidelines.
License
Apache-2.0