Fast BM25 search engine for Python with RAG support
YaBM25 - Python BM25 Search Engine
YaBM25 is a fast, scalable BM25 search engine implemented in Python, with both in-memory and disk-based indexing. It is suited to RAG (retrieval-augmented generation), information retrieval, and search applications.
Key Features
- 🚀 High Performance: Optimized implementation with vectorized operations
- 💾 Memory Efficient: Optional disk-based indexing for large datasets
- 🔄 rank_bm25 Compatible: Drop-in replacement for rank_bm25 with extended features
- 📊 Multiple Variants: Supports BM25, BM25L, BM25Adpt
- 🛠 Production Ready: Thread-safe with proper resource management
- 📦 Easy Integration: Works with LangChain, LlamaIndex, and other RAG frameworks
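For readers new to the ranking function, here is a minimal sketch of textbook Okapi BM25 scoring in plain Python. It is not YaBM25's implementation (which the feature list describes as vectorized); the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF form are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with textbook Okapi BM25.

    Illustrative only: k1, b, and the IDF smoothing are common choices,
    not values confirmed for YaBM25.
    """
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


corpus = [
    "the cat sat on the mat".split(),
    "dogs chase cats".split(),
    "the weather is windy today".split(),
]
scores = bm25_scores(["windy", "weather"], corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

Documents that share no term with the query score exactly zero, and longer documents are penalized through the `b` term. The sketch assumes a non-empty corpus.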
Benchmarks
| Dataset Size | Memory Usage | Index Time | Query Time |
|---|---|---|---|
| x | y | z | qt |
Installation
```bash
pip install yabm25
```
Quick Start
Simple In-Memory Usage
```python
from yabm25 import BM25Indexer

# Initialize with corpus
corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Indexer(tokenized_corpus)

# Search
query = "windy London"
doc_scores = bm25.get_scores(query.split(" "))
print(doc_scores)  # array([0., 0.93729472, 0.])

# Get top document
top_docs = bm25.get_top_n(query.split(" "), corpus, n=1)
print(top_docs)  # ['It is quite windy in London']
```
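Note that `doc.split(" ")` keeps punctuation and case, so `"man!"` and `"London"` are indexed exactly as written. Assuming tokens are compared exactly, a query for `"london"` would likely not match. A minimal normalization sketch follows; the `tokenize` helper is hypothetical and not part of YaBM25.

```python
import re

# Runs of word characters (letters, digits, underscore).
_TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return _TOKEN_RE.findall(text.lower())


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
tokenized_corpus = [tokenize(doc) for doc in corpus]
query_tokens = tokenize("Windy london?")

print(tokenized_corpus[0])
print(query_tokens)
```

Apply the same function to documents and queries so both sides are normalized identically.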
Large-Scale Usage
```python
from yabm25 import BM25Indexer, BM25Config

# Configure disk-based index
config = BM25Config(
    index_dir="my_index",
    doc_chunk_size=500_000,
    compression="ZSTD"
)

# Build index (large_corpus is a placeholder for your own corpus)
indexer = BM25Indexer(config)
indexer.build_index(large_corpus)

# Search
results = indexer.query(["term1", "term2"])
```
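For corpora that do not fit in memory, one option is to tokenize documents lazily. The sketch below assumes a one-document-per-line source and a hypothetical `iter_tokenized` helper. Whether `build_index` accepts a lazy iterable is not stated here, so treat that part as an assumption.

```python
import io
from typing import Iterable, Iterator, List


def iter_tokenized(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one token list per non-empty line, one document at a time."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens


# Stand-in for a large file; in practice pass an open file handle, e.g.
#     with open("corpus.txt", encoding="utf-8") as fh: ...
sample = io.StringIO("first document here\n\nsecond document\n")
docs = list(iter_tokenized(sample))
print(docs)

# Assumption: if build_index() requires a list rather than an iterator,
# materialize it first:
#     indexer.build_index(list(iter_tokenized(fh)))
```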
Documentation
Use Cases
- 🤖 RAG Applications: Enhance LLM responses with relevant context
- 🔍 Search Systems: Build powerful document search engines
- 📚 Information Retrieval: Academic and research applications
- 📊 Text Analysis: Document similarity and ranking
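As a rough illustration of the RAG use case, retrieved passages can be packed into an LLM prompt within a size budget. This is a sketch only: `build_prompt`, the prompt layout, and the character budget are assumptions, and the passage list stands in for the output of a retrieval call such as `get_top_n`.

```python
def build_prompt(question, passages, max_chars=1000):
    """Pack retrieved passages into a prompt, best first, within a character budget."""
    selected = []
    used = 0
    for passage in passages:
        # Stop at the first passage that would exceed the budget.
        if used + len(passage) > max_chars:
            break
        selected.append(passage)
        used += len(passage)
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(selected, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Stand-in for retrieval results, ordered best first.
top_docs = ["It is quite windy in London", "How is the weather today?"]
prompt = build_prompt("Is it windy in London?", top_docs, max_chars=40)
print(prompt)
```

With a 40-character budget only the first passage fits, so the second is dropped rather than truncated.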
Comparison with Alternatives
| Feature | YaBM25 | rank_bm25 | Elasticsearch |
|---|---|---|---|
| Memory Efficient | ✅ | ❌ | ✅ |
| Disk-based | ✅ | ❌ | ✅ |
| Easy Setup | ✅ | ✅ | ❌ |
| Python Native | ✅ | ✅ | ❌ |
| RAG Optimized | ✅ | ❌ | ❌ |
Citation
```bibtex
@software{yabm25,
  title = {YaBM25: Yet Another BM25 Implementation},
  author = {Muhammad, Ali},
  year = {2025},
  url = {https://github.com/alimuhammadofficial/yabm25}
}
```
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
License
MIT License. See LICENSE for details.
Download files
File details
Details for the file yabm25-0.1.1.tar.gz.
File metadata
- Download URL: yabm25-0.1.1.tar.gz
- Upload date:
- Size: 13.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 04352a2fee6cf0e2ba53c6782ed9c6a63f9540c686bbf4dad510bdacab8323ed |
| MD5 | 28388ef6badb15a41f11346a822f70af |
| BLAKE2b-256 | 204aa76cd8da938d51231f0e48f0a52133d53631943e17c045cb9341435fa328 |
File details
Details for the file yabm25-0.1.1-py3-none-any.whl.
File metadata
- Download URL: yabm25-0.1.1-py3-none-any.whl
- Upload date:
- Size: 14.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fd2a5be891a87896d76e1cede905b2bcab5f8db68424149b729a36fb6481440f |
| MD5 | a8b635b70df3570c5995ef1e95575fbd |
| BLAKE2b-256 | a8c86bad88886e52ab8c21e3b4a2d0dc5ccefd3f6a9c1c155b964ea7f23aca47 |