Fast BM25 search engine for Python with RAG support

YaBM25 - Python BM25 Search Engine

YaBM25 is a fast, scalable BM25 search engine for Python that supports both in-memory and disk-based indexing. It is aimed at retrieval-augmented generation (RAG), information retrieval, and general search applications.

Key Features

  • 🚀 High Performance: Optimized implementation with vectorized operations
  • 💾 Memory Efficient: Optional disk-based indexing for large datasets
  • 🔄 rank_bm25 Compatible: Drop-in replacement for rank_bm25 with extended features
  • 📊 Multiple Variants: Supports BM25, BM25L, BM25Adpt
  • 🛠 Production Ready: Thread-safe with proper resource management
  • 📦 Easy Integration: Works with LangChain, LlamaIndex, and other RAG frameworks

Benchmarks

Dataset Size | Memory Usage | Index Time | Query Time
x            | y            | z          | qt

Installation

pip install yabm25

Quick Start

Simple In-Memory Usage

from yabm25 import BM25Indexer

# Initialize with corpus
corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Indexer(tokenized_corpus)

# Search
query = "windy London"
doc_scores = bm25.get_scores(query.split(" "))
print(doc_scores)  # array([0., 0.93729472, 0.])

# Get top document
top_docs = bm25.get_top_n(query.split(" "), corpus, n=1)
print(top_docs)  # ['It is quite windy in London']
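
To make the scores above less opaque, here is a minimal pure-Python sketch of textbook Okapi BM25 scoring. It is not YaBM25's implementation: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are all assumptions chosen for illustration, so the numbers it produces need not match the library's output shown above.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with textbook Okapi BM25."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Non-negative IDF variant (an assumption; libraries differ here).
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
tokenized = [doc.split(" ") for doc in corpus]
scores = bm25_scores("windy London".split(" "), tokenized)
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])  # It is quite windy in London
```

Only the second document contains either query term, so it is the only one with a non-zero score, which mirrors the ranking in the Quick Start example.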

Large-Scale Usage

from yabm25 import BM25Indexer, BM25Config

# Configure disk-based index
config = BM25Config(
    index_dir="my_index",
    doc_chunk_size=500_000,
    compression="ZSTD"
)

# Build index (large_corpus is a placeholder for your own collection of documents)
indexer = BM25Indexer(config)
indexer.build_index(large_corpus)

# Search
results = indexer.query(["term1", "term2"])
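
The snippet above leaves `large_corpus` undefined. One possible way to supply it without holding every document in memory is a generator that reads a text file one line at a time. The helper `iter_tokenized_lines` below is hypothetical and uses only the standard library. Whether `build_index` accepts a generator rather than a list, and whether it expects token lists or raw strings, is not stated here, so treat both as assumptions to check against the package.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List


def iter_tokenized_lines(path: Path) -> Iterator[List[str]]:
    """Yield one whitespace-tokenized document per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:  # skip blank lines
                yield tokens


# Small demonstration with a temporary file standing in for a large corpus.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "corpus.txt"
    sample.write_text("first document\n\nsecond longer document\n", encoding="utf-8")
    docs = list(iter_tokenized_lines(sample))

print(docs)  # [['first', 'document'], ['second', 'longer', 'document']]
```

If the indexer turns out to require a list, `list(iter_tokenized_lines(path))` gives one at the cost of loading everything at once.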

Documentation

Use Cases

  • 🤖 RAG Applications: Enhance LLM responses with relevant context
  • 🔍 Search Systems: Build powerful document search engines
  • 📚 Information Retrieval: Academic and research applications
  • 📊 Text Analysis: Document similarity and ranking
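
For the RAG use case, the retrieval step usually ends with the top-scoring documents being pasted into an LLM prompt. The sketch below shows that step in plain Python. The function `build_prompt`, the prompt layout, and the example scores are illustrative assumptions; in practice the scores would come from a call such as `get_scores` in the Quick Start.

```python
from typing import Sequence


def build_prompt(question: str, docs: Sequence[str],
                 scores: Sequence[float], n: int = 2) -> str:
    """Pick the n highest-scoring documents and format them as LLM context."""
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    # Keep only documents that matched at least one query term.
    top = [docs[i] for i in ranked[:n] if scores[i] > 0]
    context = "\n".join(f"- {doc}" for doc in top)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


docs = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
scores = [0.0, 0.94, 0.0]  # illustrative values, one score per document
prompt = build_prompt("Is it windy in London?", docs, scores)
print(prompt)
```

Filtering out zero-score documents keeps unrelated text out of the context even when fewer than `n` documents match.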

Comparison with Alternatives

Feature YaBM25 rank_bm25 Elasticsearch
Memory Efficient
Disk-based
Easy Setup
Python Native
RAG Optimized

Citation

@software{yabm25,
  title = {YaBM25: Yet Another BM25 Implementation},
  author = {Muhammad, Ali},
  year = {2025},
  url = {https://github.com/alimuhammadofficial/yabm25}
}

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

License

MIT License. See LICENSE for details.
