High-performance BM25 implementation in Rust with Python bindings
BM25-RS: High-Performance BM25 for Python
A blazingly fast BM25 implementation in Rust with Python bindings. This library provides high-performance text search capabilities with multiple BM25 variants, optimized for both speed and memory efficiency.
🚀 Features
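- 🔥 High Performance: 4,000+ queries per second with sub-millisecond latency on a 10,000-document corpus (see Performance Benchmarks)
- 🧵 Thread-Safe: Concurrent queries scale linearly in the benchmark (17,600 QPS on 4 threads versus 4,400 QPS on one)
- 💾 Memory Efficient: Optimized data structures using about 30% less memory than pure Python implementations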
- 🎯 Multiple Variants: BM25Okapi, BM25Plus, and BM25L implementations
- 🐍 Python Integration: Seamless integration with Python via PyO3
- ⚡ Batch Operations: Efficient batch scoring for multiple documents
- 🔧 Custom Tokenization: Support for custom tokenizers via Python callbacks
📦 Installation
Install from PyPI:
pip install bm25-rs
🏃♂️ Quick Start
from bm25_rs import BM25Okapi
# Sample corpus
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "never gonna give you up never gonna let you down",
    "the answer to life the universe and everything is 42",
    "to be or not to be that is the question",
    "may the force be with you",
]
# Initialize BM25
bm25 = BM25Okapi(corpus)
# Search query
query = "the quick brown"
query_tokens = query.lower().split()
# Get relevance scores for all documents
scores = bm25.get_scores(query_tokens)
print(f"Scores: {scores}")
# Get top-k most relevant documents
top_docs = bm25.get_top_n(query_tokens, corpus, n=3)
print(f"Top documents: {top_docs}")
🎯 Advanced Usage
Custom Tokenization
def custom_tokenizer(text):
    # Your custom tokenization logic
    return text.lower().split()

bm25 = BM25Okapi(corpus, tokenizer=custom_tokenizer)
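As a slightly fuller illustration of a tokenizer callback, the sketch below lowercases the text, extracts alphanumeric runs with a regular expression, and drops a small stopword set. The names `regex_tokenizer` and `STOPWORDS` are hypothetical and not part of bm25-rs; the only assumption about the library is the one shown above, that `tokenizer` accepts a callable mapping a string to a list of tokens.

```python
import re

# Hypothetical stopword set for illustration; tune it for your corpus.
STOPWORDS = frozenset({"the", "a", "an", "to", "of", "and", "or", "is"})

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def regex_tokenizer(text):
    """Lowercase, split on non-alphanumeric characters, drop stopwords."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


print(regex_tokenizer("The quick, brown fox jumps over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```

Queries should be tokenized with the same function so that query tokens match the indexed tokens, for example `bm25.get_scores(regex_tokenizer(query))`.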
Batch Operations
# Score specific documents efficiently
doc_ids = [0, 2, 4] # Document indices to score
scores = bm25.get_batch_scores(query_tokens, doc_ids)
Multiple BM25 Variants
from bm25_rs import BM25Okapi, BM25Plus, BM25L
# Standard BM25Okapi
bm25_okapi = BM25Okapi(corpus, k1=1.5, b=0.75, epsilon=0.25)
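# BM25Plus (adds delta as a lower bound on the term-frequency component, so long documents are not over-penalized)
bm25_plus = BM25Plus(corpus, k1=1.5, b=0.75, delta=1.0)
# BM25L (shifts the length-normalized term frequency by delta to soften the penalty on long documents)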
bm25_l = BM25L(corpus, k1=1.5, b=0.75, delta=0.5)
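To make the difference between the three variants concrete, here is a minimal sketch of their term-frequency components, using the textbook formulas for Okapi BM25, BM25+ and BM25L. The function names are hypothetical, and whether bm25-rs implements exactly these forms internally is an assumption. Each value would be multiplied by the term's IDF and summed over the query terms.

```python
def tf_okapi(tf, doc_len, avg_len, k1=1.5, b=0.75):
    """Okapi BM25 term-frequency component."""
    norm = 1.0 - b + b * doc_len / avg_len
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def tf_plus(tf, doc_len, avg_len, k1=1.5, b=0.75, delta=1.0):
    """BM25+: Okapi component plus a constant lower bound delta."""
    if tf == 0:
        return 0.0  # only terms present in the document contribute
    return tf_okapi(tf, doc_len, avg_len, k1, b) + delta


def tf_l(tf, doc_len, avg_len, k1=1.5, b=0.75, delta=0.5):
    """BM25L: shift the length-normalized tf by delta before saturation."""
    if tf == 0:
        return 0.0  # only terms present in the document contribute
    c = tf / (1.0 - b + b * doc_len / avg_len)
    return (k1 + 1.0) * (c + delta) / (k1 + c + delta)


# One occurrence of a term in a document four times the average length.
for name, fn in [("okapi", tf_okapi), ("plus", tf_plus), ("l", tf_l)]:
    print(f"{name:>5}: {fn(1, 400, 100):.3f}")
```

For a long document the Okapi component shrinks toward zero, while the two delta-based variants keep a matching term's contribution bounded away from zero.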
Performance Optimization
# For large corpora, use chunked processing
scores = bm25.get_scores_chunked(query_tokens, chunk_size=1000)
# Get only top-k indices (faster when you don't need full documents)
top_indices = bm25.get_top_n_indices(query_tokens, n=10)
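The idea behind returning only the top-k indices can be sketched in pure Python with a heap-based partial selection, which avoids fully sorting all scores. This is an illustration only, not the library's internal implementation; the helper name `top_n_indices` is hypothetical, and its return shape mirrors the `List[Tuple[int, float]]` signature listed in the API reference.

```python
import heapq


def top_n_indices(scores, n=5):
    """Return (index, score) pairs for the n highest scores, best first."""
    return heapq.nlargest(n, enumerate(scores), key=lambda pair: pair[1])


scores = [0.0, 2.4, 0.7, 3.1, 1.2]
print(top_n_indices(scores, n=3))
# [(3, 3.1), (1, 2.4), (4, 1.2)]
```

The returned indices can then be used to look up only the documents you need, for example `[corpus[i] for i, _ in top_n_indices(scores, n=3)]`.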
📊 Performance Benchmarks
Benchmark results on a corpus of 10,000 documents:
| Operation | Throughput | Latency |
|---|---|---|
| Initialization | 190K docs/sec | - |
| Single Query | 4,400 QPS | 0.23ms |
| Batch Queries | 73K ops/sec | 0.01ms |
| Concurrent (4 threads) | 17,600 QPS | 0.06ms |
Memory usage: ~30% less than pure Python implementations.
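The latency figures in the table are consistent with the reciprocal of throughput (for example, 1 / 4,400 QPS ≈ 0.23 ms). A minimal sketch of how such numbers could be measured on your own hardware follows. The helper `measure_qps` and the stand-in `dummy_score` are hypothetical, not the project's benchmark script; in practice you would pass something like `bm25.get_scores`. Whether threads improve throughput depends on whether the scoring call releases the GIL, which this README does not state.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_qps(score_fn, queries, workers=1):
    """Run score_fn over queries; return (queries/sec, wall time per query in ms)."""
    if not queries:
        raise ValueError("queries must not be empty")
    start = time.perf_counter()
    if workers == 1:
        for query in queries:
            score_fn(query)
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces every call to finish before the timer stops.
            list(pool.map(score_fn, queries))
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed, 1000.0 * elapsed / len(queries)


# Stand-in scorer so the sketch runs without bm25-rs installed.
def dummy_score(query_tokens):
    return [float(len(token)) for token in query_tokens]


queries = [["the", "quick", "brown"]] * 1000
qps, latency_ms = measure_qps(dummy_score, queries)
print(f"{qps:,.0f} QPS, {latency_ms:.4f} ms per query")
```

With `workers` greater than 1 the second value is wall time divided by query count, not the latency of an individual request.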
🔧 API Reference
BM25Okapi
class BM25Okapi:
    def __init__(
        self,
        corpus: List[str],
        tokenizer: Optional[Callable] = None,
        k1: float = 1.5,
        b: float = 0.75,
        epsilon: float = 0.25
    )
    def get_scores(self, query: List[str]) -> List[float]
    def get_batch_scores(self, query: List[str], doc_ids: List[int]) -> List[float]
    def get_top_n(self, query: List[str], documents: List[str], n: int = 5) -> List[Tuple[str, float]]
    def get_top_n_indices(self, query: List[str], n: int = 5) -> List[Tuple[int, float]]
    def get_scores_chunked(self, query: List[str], chunk_size: int = 1000) -> List[float]
Parameters
- k1 (float): Controls term frequency saturation (default: 1.5)
- b (float): Controls length normalization (default: 0.75)
- epsilon (float): IDF normalization parameter for BM25Okapi (default: 0.25)
- delta (float): Lower-bound shift applied to the term-frequency component in BM25Plus and BM25L (default: 1.0 for BM25Plus, 0.5 for BM25L)
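To show where `k1`, `b` and `epsilon` enter the computation, here is a minimal pure-Python sketch of Okapi BM25 scoring. It follows the textbook formula, and the epsilon handling copies rank-bm25 (cited in the acknowledgments), which floors negative IDF values at `epsilon` times the average IDF. Whether bm25-rs matches this exactly is an assumption, and the function name `bm25_okapi_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_okapi_scores(corpus_tokens, query_tokens, k1=1.5, b=0.75, epsilon=0.25):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    term_freqs = [Counter(doc) for doc in corpus_tokens]

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tf in term_freqs for term in tf)

    # Classic IDF; it goes negative for terms in more than half the documents.
    idf = {
        term: math.log(n_docs - df + 0.5) - math.log(df + 0.5)
        for term, df in doc_freq.items()
    }
    # Assumed epsilon handling (as in rank-bm25): floor negative IDF values.
    floor = epsilon * sum(idf.values()) / len(idf)
    idf = {term: (value if value >= 0 else floor) for term, value in idf.items()}

    scores = []
    for doc, tf in zip(corpus_tokens, term_freqs):
        norm = 1.0 - b + b * len(doc) / avg_len  # b scales the length penalty
        score = 0.0
        for term in query_tokens:
            f = tf.get(term, 0)
            # k1 controls how quickly repeated occurrences saturate.
            score += idf.get(term, 0.0) * f * (k1 + 1.0) / (f + k1 * norm)
        scores.append(score)
    return scores


docs = [
    "the quick brown fox".split(),
    "may the force be with you".split(),
    "to be or not to be".split(),
]
print(bm25_okapi_scores(docs, ["quick", "fox"]))
```

Raising `k1` lets repeated terms keep adding to the score for longer, and setting `b=0` disables length normalization entirely.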
🛠️ Development
Building from Source
# Clone the repository
git clone https://github.com/amiyamandal-dev/bm25_pyrs.git
cd bm25_pyrs
# Install development dependencies
pip install -e ".[dev]"
# Build the Rust extension
maturin develop --release
Running Tests
pytest tests/
Benchmarking
python benchmarks/benchmark.py
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built with PyO3 for Python-Rust interoperability
- Uses Rayon for parallel processing
- Inspired by the rank-bm25 Python library
📈 Changelog
See CHANGELOG.md for a detailed history of changes.