The complete BM25 engine for Python: production-scale, Rust-native

retrievalx

The complete BM25 engine for Python — all major scoring variants, multiple retrieval strategies, Rust-native performance.

48-92x faster than rank-bm25 with equal or better retrieval quality on BEIR benchmarks.

Installation

uv add retrievalx

or with pip:

pip install retrievalx

Pre-built wheels available for:

Platform   Architectures
Linux      x86_64, aarch64
macOS      x86_64, ARM64 (Apple Silicon)
Windows    x86_64

Supports Python 3.9 through 3.15 (including 3.14 and 3.15 pre-releases). Zero dependencies.

Quickstart

from retrievalx import BM25Index

# Index documents
index = BM25Index.from_documents([
    "rust and python",
    "information retrieval with bm25",
    "search engine internals",
])

# Search
for hit in index.search("rust retrieval", top_k=2):
    print(f"{hit.doc_id}: {hit.score:.4f}")

Features

8 Scoring Variants

BM25 Okapi, BM25+ (Plus), BM25L, BM25-adpt, BM25F (field-weighted), BM25T (term-specific k1), ATIRE BM25, and TF-IDF.

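from retrievalx import BM25Config, BM25Index, ScoringVariant

docs = ["rust and python", "information retrieval with bm25"]

# BM25+ with explicit k1, b, and delta
config = BM25Config(scoring=ScoringVariant.plus(k1=1.5, b=0.8, delta=1.0))
index = BM25Index.from_documents(docs, config=config)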
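
To make the k1, b, and delta parameters concrete, here is a minimal pure-Python sketch of the textbook per-term formulas for BM25 Okapi and BM25+. It is not retrievalx's implementation: the function names (`idf`, `bm25_term_score`) and the smoothed IDF variant are assumptions chosen for illustration.

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed IDF that is always positive (assumed variant)."""
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.5, b=0.8, delta=0.0):
    """Score one query term against one document.

    delta=0.0 gives Okapi BM25; delta>0 gives BM25+, which adds a
    lower bound to the term-frequency component of matching terms.
    """
    if tf == 0:
        return 0.0
    # Length normalization: b=0 ignores length, b=1 normalizes fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    tf_part = (tf * (k1 + 1.0)) / (tf + k1 * norm)
    return idf(n_docs, doc_freq) * (tf_part + delta)

# A long document (4x the average length) containing the term once.
stats = dict(tf=1, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)
okapi = bm25_term_score(**stats)
plus = bm25_term_score(**stats, delta=1.0)
```

In this sketch `plus` exceeds `okapi` by exactly one IDF unit, which is the effect BM25+ is designed to have on long documents that Okapi tends to over-penalize.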

5 Retrieval Strategies

Exhaustive DAAT/TAAT, WAND, Block-Max WAND, and MaxScore.

from retrievalx import BM25Config, RetrievalStrategy

# Exhaustive term-at-a-time scoring
config = BM25Config(retrieval=RetrievalStrategy.exhaustive_taat())

# Dynamic pruning for faster top-k
config = BM25Config(retrieval=RetrievalStrategy.block_max_wand())
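
As a rough illustration of what "exhaustive term-at-a-time" means, the sketch below walks each query term's posting list in full and accumulates per-document scores before selecting the top k. The posting-list layout and the name `search_taat` are hypothetical and say nothing about how retrievalx stores its index.

```python
import heapq
from collections import defaultdict

def search_taat(postings, query_terms, top_k=10):
    """Term-at-a-time: finish one posting list before starting the next.

    `postings` maps term -> list of (doc_id, term_score) pairs, where
    term_score is a precomputed BM25 contribution (hypothetical layout).
    """
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, term_score in postings.get(term, []):
            accumulators[doc_id] += term_score
    # Highest score first; ties broken by smaller doc_id.
    return heapq.nsmallest(top_k, accumulators.items(),
                           key=lambda item: (-item[1], item[0]))

postings = {
    "rust": [(0, 1.2)],
    "retrieval": [(1, 0.9)],
    "python": [(0, 0.4)],
}
hits = search_taat(postings, ["rust", "python", "retrieval"], top_k=2)
```

Document-at-a-time traversal computes the same sums but advances all posting lists in doc-ID order, which is what lets pruning strategies such as WAND and MaxScore skip documents that cannot reach the current top-k threshold.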

Advanced Query Types

from retrievalx import BooleanQuery, PhraseQuery, WeightedQuery

# Boolean: must/should/must_not
index.search_boolean(BooleanQuery(must=["python"], should=["fast"], must_not=["slow"]))

# Phrase with proximity window
index.search_phrase(PhraseQuery(terms=["information", "retrieval"], window=2))

# Weighted terms
index.search_weighted(WeightedQuery(weights={"python": 2.0, "search": 1.0}))
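
The must/should/must_not clauses follow a convention common to search engines. The sketch below shows one plausible reading of it over plain Python sets: `must` terms are intersected, `must_not` terms are excluded, and `should` terms only widen the candidate set when no `must` clause is given. Whether retrievalx applies exactly these semantics is an assumption, and `boolean_candidates` is a hypothetical helper.

```python
def boolean_candidates(postings, must=(), should=(), must_not=()):
    """Return the doc IDs that satisfy a must/should/must_not query.

    `postings` maps term -> set of doc IDs (hypothetical layout).
    """
    if must:
        # Every `must` term has to be present.
        candidates = set.intersection(*(set(postings.get(t, ())) for t in must))
    else:
        # With no `must` clause, any `should` term qualifies a document.
        candidates = set().union(*(postings.get(t, ()) for t in should))
    for term in must_not:
        candidates -= set(postings.get(term, ()))
    return candidates

postings = {"python": {1, 2, 3}, "fast": {2}, "slow": {3}}
matches = boolean_candidates(postings, must=["python"],
                             should=["fast"], must_not=["slow"])
```

In this reading, `should` terms would raise the score of document 2 during ranking but do not remove document 1 from the result set.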

Persistence & Crash Recovery

# Save and load
index.save("index.bin")
loaded = BM25Index.load("index.bin")              # in-memory
loaded = BM25Index.load("index.bin", mode="mmap")  # memory-mapped

# Write-ahead log for crash recovery
index.enable_wal("index.wal")
index.insert_batch(new_docs)
index.compact_and_flush("index.bin")
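
For readers unfamiliar with write-ahead logging, the idea can be sketched with the standard library alone: every mutation is appended and synced to a log before it is acknowledged, and after a crash the log is replayed to rebuild state. The JSON-lines record format and the helper names below are assumptions for illustration, not retrievalx's on-disk format.

```python
import json
import os
import tempfile

def wal_append(path, op, payload):
    """Append one operation and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"op": op, "payload": payload}) + "\n")
        f.flush()
        os.fsync(f.fileno())

def wal_replay(path):
    """Rebuild in-memory state by re-applying logged operations in order."""
    docs = {}
    if not os.path.exists(path):
        return docs
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                break  # torn final write from a crash; ignore the tail
            if record["op"] == "insert":
                doc_id, text = record["payload"]
                docs[doc_id] = text
            elif record["op"] == "delete":
                docs.pop(record["payload"], None)
    return docs

with tempfile.TemporaryDirectory() as tmp:
    wal = os.path.join(tmp, "index.wal")
    wal_append(wal, "insert", ["doc-1", "rust and python"])
    wal_append(wal, "insert", ["doc-2", "search engine internals"])
    wal_append(wal, "delete", "doc-1")
    recovered = wal_replay(wal)
```

Compaction in this model would write the recovered state to a fresh snapshot and truncate the log, so replay time stays bounded.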

Incremental Updates

# Add documents (with optional IDs)
index.insert_batch(["new document text"])
index.insert_batch([("doc-id", "document with custom ID")])

# Delete and compact
index.delete("doc-id")
index.compact()

Score Fusion

Combine BM25 with dense retrieval or other signals:

from retrievalx import rrf, linear_combination, min_max_normalize

fused = rrf(bm25_results, dense_results, k=60)
fused = linear_combination(bm25_results, dense_results, alpha=0.7)
normalized = min_max_normalize(scores)
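
The three fusion operations are simple enough to sketch in plain Python. The versions below follow the standard definitions (reciprocal rank fusion sums 1 / (k + rank) across rankings). The `_sketch` function names and the input shapes (ranked ID lists, `{doc_id: score}` dicts) are assumptions, and the library's own functions may accept different types.

```python
def rrf_sketch(*rankings, k=60):
    """Reciprocal rank fusion over ranked lists of doc IDs (best first)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

def min_max_sketch(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def linear_sketch(a, b, alpha=0.7):
    """alpha * a + (1 - alpha) * b over the union of doc IDs; missing = 0."""
    ids = set(a) | set(b)
    return {d: alpha * a.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in ids}

bm25_ranked = ["d1", "d2", "d3"]
dense_ranked = ["d3", "d1", "d4"]
fused_ids = [doc_id for doc_id, _ in rrf_sketch(bm25_ranked, dense_ranked)]
scaled = min_max_sketch({"a": 2.0, "b": 4.0, "c": 3.0})
```

RRF uses only ranks, so it needs no score normalization. A linear combination mixes raw scores, which is why BM25 and dense scores are usually normalized first.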

Evaluation Metrics

Built-in IR metrics with native acceleration:

from retrievalx import ndcg_at_k, recall_at_k, precision_at_k, mrr, average_precision_at_k

ndcg = ndcg_at_k(ranked_ids, relevant_ids, k=10)
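
For reference, nDCG@k with binary relevance can be written in a few lines. This sketch assumes `ranked_ids` contains no duplicates and treats every ID in `relevant_ids` as equally relevant. The name `ndcg_at_k_sketch` is hypothetical, and the library's function may handle graded relevance differently.

```python
import math

def ndcg_at_k_sketch(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary gains: DCG of the ranking over the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant)
    # The ideal ranking puts every relevant document first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k_sketch(["a", "b"], ["a", "b"], k=2)
second_place = ndcg_at_k_sketch(["x", "a"], ["a"], k=2)
```

A single relevant document at rank 2 scores 1 / log2(3), roughly 0.63, which shows how quickly the logarithmic discount penalizes lower positions.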

Custom Tokenization

from retrievalx import BM25Config, TokenizerConfig, Tokenizer, Filter, Stemmer

config = BM25Config(
    tokenizer=TokenizerConfig(
        tokenizer=Tokenizer.UNICODE,
        filters=[Filter.LOWERCASE, Filter.stopwords("en"), Filter.length(min_len=2)],
        stemmer=Stemmer.snowball("en"),
    )
)
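
The configuration above describes a pipeline: split text into tokens, then apply filters in order. A stdlib-only approximation of the same stages is sketched below. The stopword list is a tiny illustrative sample, stemming is omitted because it needs a third-party stemmer, and none of this reflects how the Rust tokenizer is implemented.

```python
import re

# Tiny illustrative stopword list, not the library's "en" list.
STOPWORDS = frozenset({"a", "an", "and", "of", "the", "with"})

def tokenize(text, min_len=2):
    """Unicode word split -> lowercase -> stopword filter -> length filter."""
    tokens = re.findall(r"\w+", text)  # \w is Unicode-aware for str patterns
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t for t in tokens if len(t) >= min_len]

tokens = tokenize("Information Retrieval with BM25 and a Résumé")
```

Filter order matters: lowercasing before the stopword check is what lets "And" match the lowercase entry "and".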

Benchmarks

On BEIR SciFact (5,183 documents, 300 queries):

Engine                         QPS      Speedup   nDCG@10   P50 (ms)
rank-bm25 (Okapi)              134      1x        0.5618    6.964
retrievalx (Exhaustive DAAT)   6,505    48x       0.5723    0.152
retrievalx (Block-Max WAND)    4,919    37x       0.5723    0.151
retrievalx (Exhaustive TAAT)   11,935   89x       0.5723    0.083
retrievalx (MaxScore)          7,351    55x       0.5723    0.099

All retrieval strategies produce identical quality metrics — no accuracy tradeoff.

Full 40-configuration matrix (8 scorers x 5 strategies): docs/benchmarks.md
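
The QPS and P50 columns can be reproduced for any search callable with a small timing harness. The one below is a generic sketch using only the standard library. It is not the project's benchmark script, and `measure` is a hypothetical helper.

```python
import statistics
import time

def measure(search_fn, queries):
    """Return (queries per second, median latency in ms) for search_fn."""
    latencies_ms = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        search_fn(query)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed, statistics.median(latencies_ms)

# Stand-in workload; replace the lambda with a real search call.
qps, p50_ms = measure(lambda q: sorted(q), ["rust retrieval", "bm25 search"] * 50)
```

Speedup is then the ratio of two QPS values measured on the same machine and query set. Running a few warm-up queries before timing helps avoid counting one-time setup costs.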

Architecture

Rust workspace with five crates:

Crate                 Purpose
retrievalx-core       Indexing, scoring, retrieval, query execution, fusion
retrievalx-tokenize   Unicode tokenization, stemming, stopword filtering
retrievalx-persist    Binary serialization, mmap, write-ahead log
retrievalx-eval       IR metrics, BEIR benchmark runner
retrievalx-py         PyO3 bindings

Details: docs/architecture.md | docs/algorithms.md

Examples

Example                                Description
quickstart.py                          Basic indexing and search
it_ticket_search.py                    IT ticket triage
legal_clause_discovery.py              Legal document search
ecommerce_query_tuning.py              E-commerce product search
security_log_hunt.py                   Security log analysis
wal_crash_recovery.py                  WAL crash recovery
production_hybrid_reranking.py         Hybrid BM25 + dense reranking
query_expansion_prf.py                 Pseudo-relevance feedback
custom_tokenizer.py                    Custom tokenizer pipeline
bm25f_structured_docs.py               BM25F field-weighted scoring
rag_hybrid_pipeline.py                 RAG hybrid retrieval pipeline
benchmark_retrievalx_vs_rank_bm25.py   Benchmark vs rank-bm25

Development

# Setup with uv
uv sync --extra dev

# Or with pip
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"  # quoted so shells like zsh do not glob the brackets

# Run all checks (mirrors CI)
./scripts/check_all.sh

# Run benchmarks
uv sync --extra bench
uv run python examples/benchmark_retrievalx_vs_rank_bm25.py --dataset scifact

See CONTRIBUTING.md for guidelines.

License

Apache-2.0
