High-performance BM25 implementation in Rust with Python bindings

BM25-RS: High-Performance BM25 for Python

A fast BM25 implementation written in Rust with Python bindings. The library provides BM25 text ranking in three variants (BM25Okapi, BM25Plus, and BM25L) and is optimized for both speed and memory use.

🚀 Features

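  • 🔥 High Performance: 4,000+ queries per second with sub-millisecond latency on a 10,000-document corpus
  • 🧵 Thread-Safe: Concurrent queries scale close to linearly (17,600 QPS on 4 threads versus 4,400 QPS on one)
  • 💾 Memory Efficient: Optimized data structures that use about 30% less memory than pure Python implementations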
  • 🎯 Multiple Variants: BM25Okapi, BM25Plus, and BM25L implementations
  • 🐍 Python Integration: Seamless integration with Python via PyO3
  • ⚡ Batch Operations: Efficient batch scoring for multiple documents
  • 🔧 Custom Tokenization: Support for custom tokenizers via Python callbacks

📦 Installation

Install from PyPI:

pip install bm25-rs

🏃‍♂️ Quick Start

from bm25_rs import BM25Okapi

# Sample corpus
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "never gonna give you up never gonna let you down",
    "the answer to life the universe and everything is 42",
    "to be or not to be that is the question",
    "may the force be with you",
]

# Initialize BM25
bm25 = BM25Okapi(corpus)

# Search query
query = "the quick brown"
query_tokens = query.lower().split()

# Get relevance scores for all documents
scores = bm25.get_scores(query_tokens)
print(f"Scores: {scores}")

# Get top-k most relevant documents
top_docs = bm25.get_top_n(query_tokens, corpus, n=3)
print(f"Top documents: {top_docs}")

🎯 Advanced Usage

Custom Tokenization

def custom_tokenizer(text):
    # Your custom tokenization logic
    return text.lower().split()

bm25 = BM25Okapi(corpus, tokenizer=custom_tokenizer)
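
The tokenizer is any callable that takes a string and returns a list of tokens. A minimal sketch of a slightly richer tokenizer, assuming you want punctuation stripped and a few stop words removed; `STOP_WORDS` and `regex_tokenizer` are illustrative names and not part of bm25_rs:

```python
import re
from typing import List

# Hypothetical stop-word list for illustration; tune it for your corpus.
STOP_WORDS = frozenset({"the", "a", "an", "and", "or", "to", "is"})

# Runs of lowercase letters and digits count as tokens.
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def regex_tokenizer(text: str) -> List[str]:
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    return [tok for tok in _TOKEN_RE.findall(text.lower()) if tok not in STOP_WORDS]


print(regex_tokenizer("The quick, brown fox!"))  # ['quick', 'brown', 'fox']
```

Such a function could be passed as `tokenizer=regex_tokenizer`. Applying the same function to queries keeps query tokens consistent with the indexed ones.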

Batch Operations

# Score specific documents efficiently
doc_ids = [0, 2, 4]  # Document indices to score
scores = bm25.get_batch_scores(query_tokens, doc_ids)

Multiple BM25 Variants

from bm25_rs import BM25Okapi, BM25Plus, BM25L

# Standard BM25Okapi
bm25_okapi = BM25Okapi(corpus, k1=1.5, b=0.75, epsilon=0.25)

# BM25Plus (adds delta as a lower bound on each matching term's contribution)
bm25_plus = BM25Plus(corpus, k1=1.5, b=0.75, delta=1.0)

# BM25L (shifts the length-normalized term frequency by delta to ease the penalty on long documents)
bm25_l = BM25L(corpus, k1=1.5, b=0.75, delta=0.5)
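
The three variants differ mainly in how a term's frequency contributes to the score. The sketch below uses the textbook formulas from the BM25+ and BM25L papers (Lv and Zhai), not code taken from bm25_rs. The function name and arguments are illustrative, and the library's internals may differ in detail:

```python
def tf_component(variant: str, tf: float, doc_len: float, avg_len: float,
                 k1: float = 1.5, b: float = 0.75, delta: float = 1.0) -> float:
    """Term-frequency part of the score for one term in one document (IDF omitted)."""
    if tf <= 0:
        return 0.0  # a term that does not occur contributes nothing
    norm = 1.0 - b + b * doc_len / avg_len  # length normalization factor
    if variant == "okapi":
        return tf * (k1 + 1.0) / (tf + k1 * norm)
    if variant == "plus":
        # BM25+: a matching term always contributes at least delta
        return delta + tf * (k1 + 1.0) / (tf + k1 * norm)
    if variant == "l":
        # BM25L: shift the length-normalized frequency by delta
        ctd = tf / norm
        return (k1 + 1.0) * (ctd + delta) / (k1 + ctd + delta)
    raise ValueError(f"unknown variant: {variant}")


# One occurrence of a term in a document ten times the average length.
for name in ("okapi", "plus", "l"):
    print(name, round(tf_component(name, tf=1, doc_len=100, avg_len=10), 3))
```

For a very long document the Okapi component shrinks toward zero, while the other two stay bounded away from it, which is the behavior `delta` is meant to control.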

Performance Optimization

# For large corpora, use chunked processing
scores = bm25.get_scores_chunked(query_tokens, chunk_size=1000)

# Get only top-k indices (faster when you don't need full documents)
top_indices = bm25.get_top_n_indices(query_tokens, n=10)
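
Returning only the top-k indices avoids sorting the full score list and copying document text. The idea can be sketched in pure Python with `heapq.nlargest`. This illustrates the technique only and does not claim to show how bm25_rs implements `get_top_n_indices` internally; the helper name is hypothetical:

```python
import heapq
from typing import List, Sequence, Tuple


def top_n_indices(scores: Sequence[float], n: int = 5) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the n highest scores, best first."""
    # nlargest keeps a heap of size n, so it avoids a full sort of all scores.
    return heapq.nlargest(n, enumerate(scores), key=lambda pair: pair[1])


scores = [0.1, 2.5, 0.0, 1.7, 0.9]
print(top_n_indices(scores, n=2))  # [(1, 2.5), (3, 1.7)]
```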

📊 Performance Benchmarks

Performance comparison on a corpus of 10,000 documents:

| Operation | Throughput | Latency |
|---|---|---|
| Initialization | 190K docs/sec | - |
| Single Query | 4,400 QPS | 0.23 ms |
| Batch Queries | 73K ops/sec | 0.01 ms |
| Concurrent (4 threads) | 17,600 QPS | 0.06 ms |

Memory usage: ~30% less than pure Python implementations.
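
Throughput and latency are two views of the same measurement: at 4,400 queries per second, the mean time per query is 1/4400 s, or about 0.23 ms. A minimal timing harness, assuming a `score_fn` callable that stands in for a call such as `bm25.get_scores`; the harness and its names are illustrative and are not the project's `benchmarks/benchmark.py`:

```python
import time
from typing import Callable, List, Sequence, Tuple


def measure_qps(score_fn: Callable[[List[str]], object],
                queries: Sequence[List[str]],
                repeats: int = 100) -> Tuple[float, float]:
    """Return (queries per second, mean latency in milliseconds)."""
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            score_fn(query)
    elapsed = time.perf_counter() - start
    total = repeats * len(queries)
    return total / elapsed, 1000.0 * elapsed / total


# Stand-in scorer so the sketch runs without bm25_rs installed.
def dummy_score(query: List[str]) -> List[float]:
    return [float(len(token)) for token in query]


qps, latency_ms = measure_qps(dummy_score, [["quick", "brown"], ["lazy", "dog"]])
print(f"{qps:,.0f} QPS, {latency_ms:.4f} ms per query")
```

Numbers from a harness like this depend on hardware, corpus size, and query length, so they are only comparable when measured under the same conditions.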

🔧 API Reference

BM25Okapi

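from typing import Callable, List, Optional, Tuple

class BM25Okapi:
    def __init__(
        self,
        corpus: List[str],
        tokenizer: Optional[Callable[[str], List[str]]] = None,
        k1: float = 1.5,
        b: float = 0.75,
        epsilon: float = 0.25,
    ) -> None: ...

    # Scores for every document in the corpus, in corpus order
    def get_scores(self, query: List[str]) -> List[float]: ...
    # Scores for the documents at the given indices only
    def get_batch_scores(self, query: List[str], doc_ids: List[int]) -> List[float]: ...
    # The n best (document, score) pairs
    def get_top_n(self, query: List[str], documents: List[str], n: int = 5) -> List[Tuple[str, float]]: ...
    # The n best (index, score) pairs
    def get_top_n_indices(self, query: List[str], n: int = 5) -> List[Tuple[int, float]]: ...
    # Same result as get_scores, computed chunk_size documents at a time
    def get_scores_chunked(self, query: List[str], chunk_size: int = 1000) -> List[float]: ...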

Parameters

  • k1 (float): Controls term frequency saturation (default: 1.5)
  • b (float): Controls length normalization (default: 0.75)
  • epsilon (float): IDF normalization parameter for BM25Okapi (default: 0.25)
  • delta (float): Term frequency normalization for BM25Plus and BM25L (default: 1.0 for BM25Plus, 0.5 for BM25L)
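
For intuition on `epsilon`: in rank-bm25, which this library names as its inspiration, terms that occur in more than half of the documents get a negative IDF, and those values are replaced by `epsilon` times the average IDF. The sketch below reproduces that rank-bm25 behavior in pure Python. Whether bm25_rs computes IDF in exactly this way is an assumption, and the function name is illustrative:

```python
import math
from typing import Dict


def okapi_idf(doc_freqs: Dict[str, int], corpus_size: int,
              epsilon: float = 0.25) -> Dict[str, float]:
    """IDF per term, with negative values replaced by epsilon * average IDF."""
    idf = {
        term: math.log(corpus_size - df + 0.5) - math.log(df + 0.5)
        for term, df in doc_freqs.items()
    }
    floor = epsilon * sum(idf.values()) / len(idf)
    return {term: (floor if value < 0 else value) for term, value in idf.items()}


# In a 5-document corpus, "the" appears in 4 documents, "fox" and "dog" in 1 each.
print(okapi_idf({"the": 4, "fox": 1, "dog": 1}, corpus_size=5))
```

With this scheme a very common term such as "the" still contributes a small positive weight instead of lowering the score of documents that contain it.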

🛠️ Development

Building from Source

# Clone the repository
git clone https://github.com/amiyamandal-dev/bm25_pyrs.git
cd bm25_pyrs

# Install development dependencies
pip install -e ".[dev]"

# Build the Rust extension
maturin develop --release

Running Tests

pytest tests/

Benchmarking

python benchmarks/benchmark.py

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with PyO3 for Python-Rust interoperability
  • Uses Rayon for parallel processing
  • Inspired by the rank-bm25 Python library

📈 Changelog

See CHANGELOG.md for a detailed history of changes.


Download files

Source Distribution

  • bm25_rs-1.0.4.tar.gz (226.9 kB): Source

Built Distributions

  • bm25_rs-1.0.4-cp313-cp313-win_amd64.whl (310.3 kB): CPython 3.13, Windows x86-64
  • bm25_rs-1.0.4-cp313-cp313-win32.whl (283.9 kB): CPython 3.13, Windows x86
  • bm25_rs-1.0.4-cp313-cp313-macosx_11_0_arm64.whl (410.5 kB): CPython 3.13, macOS 11.0+ ARM64
  • bm25_rs-1.0.4-cp313-cp313-macosx_10_12_x86_64.whl (428.5 kB): CPython 3.13, macOS 10.12+ x86-64
  • bm25_rs-1.0.4-cp312-cp312-win_amd64.whl (310.3 kB): CPython 3.12, Windows x86-64
  • bm25_rs-1.0.4-cp312-cp312-macosx_11_0_arm64.whl (410.5 kB): CPython 3.12, macOS 11.0+ ARM64
  • bm25_rs-1.0.4-cp312-cp312-macosx_10_12_x86_64.whl (428.4 kB): CPython 3.12, macOS 10.12+ x86-64
  • bm25_rs-1.0.4-cp311-cp311-win_amd64.whl (311.5 kB): CPython 3.11, Windows x86-64
  • bm25_rs-1.0.4-cp311-cp311-macosx_11_0_arm64.whl (414.9 kB): CPython 3.11, macOS 11.0+ ARM64
  • bm25_rs-1.0.4-cp311-cp311-macosx_10_12_x86_64.whl (435.2 kB): CPython 3.11, macOS 10.12+ x86-64
  • bm25_rs-1.0.4-cp310-cp310-win_amd64.whl (313.1 kB): CPython 3.10, Windows x86-64
  • bm25_rs-1.0.4-cp39-cp39-win_amd64.whl (315.5 kB): CPython 3.9, Windows x86-64

File details

Details for the file bm25_rs-1.0.4.tar.gz.

File metadata

  • Download URL: bm25_rs-1.0.4.tar.gz
  • Upload date:
  • Size: 226.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_rs-1.0.4.tar.gz
Algorithm Hash digest
SHA256 0f1bfc3eaa39705221dd8dab6efc7f02c210bc50bc3061c7318eb19780fcbb30
MD5 a7f2c59a2db4f45ad4857fecc8894b70
BLAKE2b-256 57dc2b30945578e84f061520a623b8b5ddbfc3ae75fb6cb899a674f80ac57b27

File details

Details for the file bm25_rs-1.0.4-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: bm25_rs-1.0.4-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 310.3 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_rs-1.0.4-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 181461091ca08cab02e2b33b390163f20d3cd6a2bdcde7656ae4418626af6c06
MD5 fbf0739d4092f47dd461b50e8977bb59
BLAKE2b-256 b3d82d7787a5aa92e0dabe66c60f3aaf00b41ad392dcd7834aecc7ed36a2ae7b

File details

Details for the file bm25_rs-1.0.4-cp313-cp313-win32.whl.

File metadata

  • Download URL: bm25_rs-1.0.4-cp313-cp313-win32.whl
  • Upload date:
  • Size: 283.9 kB
  • Tags: CPython 3.13, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_rs-1.0.4-cp313-cp313-win32.whl
Algorithm Hash digest
SHA256 f319c84f9f47b41d5c121af518e5bc09e8343015c3cc2e25273be436d64863df
MD5 ff5c3da405753820a06b667499e2751e
BLAKE2b-256 6f4e6c81c5c4efddf93ee0478d276ec8b1ea18edd0b5c3906ce1868041a73536

File details

Details for the file bm25_rs-1.0.4-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for bm25_rs-1.0.4-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d9b456638d8822d822b96fe11328012b89c8579e712f4fb1f787f5c9a9fa404a
MD5 5896dda3d054bdfbe4d5e41093705c50
BLAKE2b-256 dae61630c108aa73f1b6c5d4ab2ea85a83f5871ab11326f92dd510891295e078

File details

Details for the file bm25_rs-1.0.4-cp313-cp313-macosx_10_12_x86_64.whl.

File hashes

Hashes for bm25_rs-1.0.4-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 baffdc0973db9086593c8924366e8dce9eb1f5d82b290b7ca545eb31f9f65c47
MD5 5ba15cb4e49fdf043825d455866880ce
BLAKE2b-256 eed07b1f71e50828940c7ead8780c6edd2baa383ebcfe33bdfc15ea68a1e3d1e

File details

Details for the file bm25_rs-1.0.4-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: bm25_rs-1.0.4-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 310.3 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_rs-1.0.4-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 31631ca728af705a91ca2b09c0424d3da662bb37f781e72b5e644181bfcf8cc9
MD5 ab2c3cb566bf13b3223485defd1d8b05
BLAKE2b-256 2a852be621517d91f05230f65c0232110f20203aa4eebe1f12e6b3d1df2a6ae9

File details

Details for the file bm25_rs-1.0.4-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for bm25_rs-1.0.4-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 440220378abd74afd2afdb1959314733a1e39cc8d3a9869a0d0700321cabdcf0
MD5 f00281f4eb77e98e09390710699c6784
BLAKE2b-256 3080463866746704b3da81149d396296a90f95e6c4aa08b7632836399093cbfb

File details

Details for the file bm25_rs-1.0.4-cp312-cp312-macosx_10_12_x86_64.whl.

File hashes

Hashes for bm25_rs-1.0.4-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 bcfde006446549d807d2b800d87d38f82524c6eda19c926ee56b90ad32969200
MD5 6257667df5afdd33e7578c1574fa1ee0
BLAKE2b-256 ba6bcd5bde80a8a20aabe362ad05d7cfcce677f084d602b87e830148f7e3868e

File details

Details for the file bm25_rs-1.0.4-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: bm25_rs-1.0.4-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 311.5 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_rs-1.0.4-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 cff73e97e8b2b19762113d99616d8d6e07c6b9b2ab3ef9f9348dff2ed042c6a1
MD5 9ed08a48c9bf1a957a4db21f9c1a6dd8
BLAKE2b-256 dde347091cb3426c2cfe425cd2c56ea1a9fcdc970453024e91c487b12be54d7c

File details

Details for the file bm25_rs-1.0.4-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for bm25_rs-1.0.4-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 259079645617207c932915c1336bbc11950f68b39cf91bb9c817f273d2a29a7e
MD5 d8a4f8170c4f9a4703aa2028e145696f
BLAKE2b-256 413466dbe725bd2a8fecf5867bf4f741ff00511495dd71d0bcd9a94ee7b37fde

File details

Details for the file bm25_rs-1.0.4-cp311-cp311-macosx_10_12_x86_64.whl.

File hashes

Hashes for bm25_rs-1.0.4-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 5126854600fbc479df57e17399daf41d05d93b2d88b9ee37439c7d824a554576
MD5 569f652b76b49eea48de978fa91e54da
BLAKE2b-256 37ffa37de48ec385563e27a24d56332af50327111c468ee4fb20ff19c7fcbbe7

File details

Details for the file bm25_rs-1.0.4-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: bm25_rs-1.0.4-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 313.1 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_rs-1.0.4-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 3196e912b43716223c211b77536c1d6c0fb185b6b7dd3c983fbf8dac779c678d
MD5 b232d8d26fd98515a2901b9da4cb00a9
BLAKE2b-256 29488e679a48db369d74b8d3404dbe19b8478a600bf4c17f95ca6235a8c8136d

File details

Details for the file bm25_rs-1.0.4-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: bm25_rs-1.0.4-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 315.5 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_rs-1.0.4-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 9aa2b4aab78addb21633b05ef1c51f73dbd6a1545c806f0f4fb2fa16378bf922
MD5 3a0c4d393071a3f6857f0b1c11d30a70
BLAKE2b-256 c7307ba4017f40eabd6c641fcbb1a65d990b84d86f36b32df871ddd4d4797c61
