BM-25 Chroma Hybrid Retriever
A fast, memory-efficient hybrid search system combining BM25 and vector search with Reciprocal Rank Fusion (RRF).
Features
- BM25: Memory-efficient with integer indices and pre-sorted postings
  - Preprocessing: lemmatizes and lowercases tokens, replaces punctuation with spaces, and applies document-length normalization
- Vector Search: Semantic similarity using ChromaDB and sentence transformers
- Hybrid Fusion: Industry-standard Reciprocal Rank Fusion (RRF)
- ChromaDB Drop-in Replacement: Compatible interface with hybrid search capabilities
- Dual Processing Modes: Sequential or unified batch processing
- State Persistence: Automatic save/load of BM25 index
- Document Management: Add, remove, and update documents (chunks) with inverted index consistency
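As a rough illustration of how Reciprocal Rank Fusion can blend two rankings, here is a minimal sketch. The function name `weighted_rrf`, the constant `k=60` (a commonly used default), and the way `bm25_ratio` weights each list are assumptions made for this example, not necessarily how the package fuses results internally.

```python
from collections import defaultdict

def weighted_rrf(bm25_ranked, vector_ranked, bm25_ratio=0.5, k=60):
    """Fuse two ranked lists of document IDs (hypothetical helper).

    Each list contributes weight / (k + rank) per document, with rank starting at 1.
    """
    scores = defaultdict(float)
    weighted_lists = ((bm25_ratio, bm25_ranked), (1.0 - bm25_ratio, vector_ranked))
    for weight, ranked in weighted_lists:
        if weight == 0:
            continue  # a ratio of 0.0 or 1.0 ignores one list entirely
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

fused = weighted_rrf(["a", "b", "c"], ["b", "c", "d"], bm25_ratio=0.5)
print([doc_id for doc_id, _ in fused])  # ['b', 'c', 'a', 'd']
```

Because RRF uses only ranks, the BM25 scores and vector distances never need to be put on a common scale.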
Data Structure Notes
- Storage: Inverted index format, word -> [(frequency, doc_id), ...]
- BM25 Scoring: Accesses document term frequencies by inverting the lookup (query term → posting list → document frequencies)
- Avoids: Storing redundant document -> [(freq, word), ...] mappings
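To make the lookup path concrete, the sketch below scores a query by walking posting lists only. It uses a common BM25 formulation; the helper names (`tokenize`, `build_index`, `bm25_scores`) and the parameters `k1=1.5` and `b=0.75` are assumptions for illustration, and lemmatization is left out to keep the example dependency-free.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and replace punctuation with spaces (no lemmatization in this sketch)
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def build_index(docs):
    """docs: dict of doc_id -> text. Returns (index, doc_lengths)."""
    index = defaultdict(list)
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        doc_lengths[doc_id] = len(tokens)
        for word, freq in Counter(tokens).items():
            index[word].append((freq, doc_id))
    for postings in index.values():
        postings.sort(key=lambda p: -p[0])  # frequency descending
    return index, doc_lengths

def bm25_scores(query, index, doc_lengths, k1=1.5, b=0.75):
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = index.get(term, [])
        if not postings:
            continue
        df = len(postings)  # document frequency is the posting list length
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        for freq, doc_id in postings:
            norm = 1 - b + b * doc_lengths[doc_id] / avg_len
            scores[doc_id] += idf * freq * (k1 + 1) / (freq + k1 * norm)
    return sorted(scores.items(), key=lambda item: -item[1])

docs = {
    "d1": "Machine learning helps analyze data patterns.",
    "d2": "Deep learning uses neural networks.",
    "d3": "Cats sleep.",
}
index, doc_lengths = build_index(docs)
ranked = bm25_scores("machine learning", index, doc_lengths)
print([doc_id for doc_id, _ in ranked])  # ['d1', 'd2']
```

Only documents that appear in a query term's posting list are ever touched, which is why no document -> terms mapping is needed for scoring.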
Quickstart
from bm25_chroma import HybridRetriever
import hashlib
# Initialize
retriever = HybridRetriever(
chroma_path="./my_db",
collection_name="my_docs"
)
# Add documents with deterministic, unique IDs
documents = [
"Machine learning helps analyze data patterns.",
"Natural language processing understands human text.",
"Deep learning uses neural networks for complex tasks."
]
# Content-based surrogate keys via hashlib - avoids order dependency
# Alternatively use natural keys when available
doc_ids = [hashlib.sha256(doc.encode()).hexdigest() for doc in documents]
# Add documents
retriever.add_documents_batch(
documents,
doc_ids=doc_ids, # Optional: auto-generated (using Chroma's UUID) if not provided
mode="unified", # or "sequential"
show_progress=True
)
# Search with ChromaDB-compatible interface
results = retriever.query(
query_texts=["machine learning"],
n_results=5,
bm25_ratio=0.5, # 0.0 = vector only, 1.0 = BM25 only, 0.5 = balanced
include=['documents', 'metadatas', 'distances']
)
# Process results
for doc, meta, dist in zip(results['documents'][0], results['metadatas'][0], results['distances'][0]):
    print(f"Score: {1-dist:.3f} - {doc[:100]}...")
# Legacy interface also available
legacy_results = retriever.hybrid_search("machine learning", top_k=5, bm25_ratio=0.5)
for doc_id, score, metadata in legacy_results:
    print(f"{doc_id[:16]}...: {score:.3f} - {metadata['text'][:100]}...")
Why use hashlib for document IDs?
- Deterministic: Same content always produces the same ID
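- Unique: SHA-256 collisions between different contents are practically impossible, so distinct documents get distinct IDs
- Content-based: ID reflects the actual document content, so identical documents share one ID
- Database-safe: IDs stay consistent across systems without any coordination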
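The properties listed above can be checked directly with the standard library. In this short sketch, `content_id` is a hypothetical helper name; it is equivalent to the `hashlib.sha256(...)` expression used in the Quickstart.

```python
import hashlib

def content_id(text: str) -> str:
    # UTF-8 is spelled out for clarity; it is also the default for str.encode()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

docs = [
    "Machine learning helps analyze data patterns.",
    "Machine learning helps analyze data patterns.",  # exact duplicate
    "machine learning helps analyze data patterns.",  # differs only in case
]
ids = [content_id(doc) for doc in docs]

print(ids[0] == ids[1])  # True: same content, same ID
print(ids[0] == ids[2])  # False: any change in content changes the ID

# Keying by content ID collapses exact duplicates before ingestion
unique_docs = dict(zip(ids, docs))
print(len(unique_docs))  # 2
```

Note that the hash is sensitive to whitespace and case, so normalize text first if near-duplicates should share an ID.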
ChromaDB Interface Compatibility
HybridRetriever provides a drop-in replacement for ChromaDB collections with hybrid BM25+vector search:
# ChromaDB-compatible interface
results = retriever.query(
query_texts=["machine learning algorithms"],
n_results=5,
bm25_ratio=0.5, # 0.0 = vector only, 1.0 = BM25 only, 0.5 = balanced
include=['documents', 'metadatas', 'distances']
)
# Returns ChromaDB format
{
'documents': [['Machine learning helps analyze...', '...']],
'metadatas': [[{'document_id': 'abc123...'}, {...}]],
'distances': [[0.234, 0.456, ...]],
'embeddings': [[...], [...]] # if requested in include
}
# Single query string (automatically converted to list)
results = retriever.query("deep learning", n_results=3)
# Use as drop-in ChromaDB replacement in existing code
# Just replace: collection.query() with: retriever.query()
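Since the result is a dict of parallel nested lists (one inner list per query), downstream code often wants one record per hit instead. The sketch below shows one way to reshape it; `flatten_results` is a hypothetical helper and the sample dict is made-up data in the format shown above, so no retriever is needed to run it.

```python
def flatten_results(results):
    """Turn a ChromaDB-style result dict into one list of row dicts per query."""
    flattened = []
    per_query = zip(results["documents"], results["metadatas"], results["distances"])
    for docs, metas, dists in per_query:
        rows = [
            {"document": doc, "metadata": meta, "distance": dist}
            for doc, meta, dist in zip(docs, metas, dists)
        ]
        flattened.append(rows)
    return flattened

# Made-up sample in the documented return format
sample = {
    "documents": [["Machine learning helps analyze data patterns.",
                   "Deep learning uses neural networks."]],
    "metadatas": [[{"document_id": "abc123"}, {"document_id": "def456"}]],
    "distances": [[0.234, 0.456]],
}

rows = flatten_results(sample)[0]  # hits for the first (and only) query
for row in rows:
    print(row["metadata"]["document_id"], row["distance"])
```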
Installation
pip install bm25-chroma
Core Architecture
The BM25 component maintains an inverted index with the following structure:
Data Structure:
- Vocabulary Set: set(words) containing all unique terms
- Inverted Index: dict[word] = [(frequency, document_id), ...]
- Posting Lists: Tuples ordered by frequency in descending order
Inverted Index Consistency:
# Example inverted index structure
{
"machine": [(3, doc_1), (2, doc_5), (1, doc_3)], # frequency descending
"learning": [(2, doc_1), (2, doc_2), (1, doc_4)],
"data": [(1, doc_1), (1, doc_3)]
}
When documents are added or removed:
- Addition: Terms added to vocabulary, posting lists updated and re-sorted
- Removal: Orphaned terms removed from vocabulary, posting lists cleaned
- Consistency: Document statistics and averages recalculated incrementally
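The add and remove behavior described above can be sketched with a toy class. `TinyInvertedIndex` is a hypothetical illustration, not the package's `BM25` class: it derives the vocabulary from the index keys, removes a document by scanning posting lists, and does not guard against adding the same ID twice.

```python
from collections import Counter

class TinyInvertedIndex:
    """Toy index: word -> [(frequency, doc_id), ...], frequency descending."""

    def __init__(self):
        self.index = {}
        self.doc_lengths = {}
        self.total_length = 0  # running sum, so the average never needs a full recount

    @property
    def vocabulary(self):
        return set(self.index)

    @property
    def avg_doc_length(self):
        return self.total_length / len(self.doc_lengths) if self.doc_lengths else 0.0

    def add(self, doc_id, tokens):
        self.doc_lengths[doc_id] = len(tokens)
        self.total_length += len(tokens)
        for word, freq in Counter(tokens).items():
            postings = self.index.setdefault(word, [])
            postings.append((freq, doc_id))
            postings.sort(key=lambda p: -p[0])  # only touched lists are re-sorted

    def remove(self, doc_id):
        self.total_length -= self.doc_lengths.pop(doc_id)
        for word in list(self.index):
            kept = [p for p in self.index[word] if p[1] != doc_id]
            if kept:
                self.index[word] = kept
            else:
                del self.index[word]  # orphaned term leaves the vocabulary

idx = TinyInvertedIndex()
idx.add("doc_1", ["machine", "machine", "learning", "data"])
idx.add("doc_2", ["learning", "learning", "machine"])
print(idx.index["machine"])       # [(2, 'doc_1'), (1, 'doc_2')]
idx.remove("doc_1")
print("data" in idx.vocabulary)   # False: no remaining document contains it
print(idx.avg_doc_length)         # 3.0
```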
Document Management
Adding Documents
# Single batch with auto-generated IDs
retriever.add_documents_batch(documents, mode="unified")
# Single batch with custom IDs
retriever.add_documents_batch(
documents,
doc_ids=["custom_1", "custom_2", "custom_3"],
mode="unified"
)
# Multiple batches for large datasets
for batch in document_batches:
    retriever.add_documents_batch(batch, mode="unified", show_progress=True)
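The loop above assumes `document_batches` already exists. One way to produce it from any iterable of documents is a small chunking helper; `iter_batches` is a hypothetical name, and the batch size is something to tune for your own memory budget.

```python
from itertools import islice

def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch_size items; an empty list ends the loop
    while batch := list(islice(iterator, batch_size)):
        yield batch

document_batches = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
print(document_batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Because the helper is a generator, it also works on lazily loaded documents without holding the whole corpus in memory.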
Removing Documents
# Remove single document
retriever.remove_document("doc_id_1")
# Batch removal (efficient for multiple documents)
retriever.remove_documents_batch(["doc_1", "doc_2", "doc_3"])
# Check system state after removal
stats = retriever.get_system_stats()
print(f"Documents remaining: {stats['chunks']}")
System Management
Reset Collection
# Clear all documents and start fresh
retriever.reset_collection()
# Verify clean state
stats = retriever.get_system_stats()
print(f"Documents after reset: {stats['chunks']}") # Should be 0
State Persistence
# BM25 state automatically saved to disk
# Reload existing index on initialization
retriever = HybridRetriever(
chroma_path="./my_db",
collection_name="my_docs",
bm25_state_path="./my_bm25_index.pkl" # Auto-loads if exists
)
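The `.pkl` extension suggests the state is stored with `pickle`. Under that assumption, a save/load round trip might look like the sketch below. The function names, the state layout, and the write-to-temp-then-rename step are illustrative choices, not the package's `_save_state()`/`_load_state()` implementation.

```python
import os
import pickle
import tempfile

def save_state(state, path):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(state, handle, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, path)  # readers never observe a half-written file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_state(path, default=None):
    """Return the saved state, or default when no file exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as handle:
        return pickle.load(handle)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_bm25_index.pkl")
    # Hypothetical state layout for the example
    state = {"index": {"machine": [(3, "doc_1")]}, "doc_lengths": {"doc_1": 7}}
    save_state(state, path)
    restored = load_state(path)

print(restored == state)  # True
```

Pickle files can execute arbitrary code when loaded, so only load index files that you created yourself.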
Search Methods
# ChromaDB-compatible interface (recommended)
results = retriever.query(
query_texts=["machine learning"],
n_results=10,
bm25_ratio=0.5, # Hybrid ratio: 0.0=vector only, 1.0=BM25 only
include=['documents', 'metadatas', 'distances']
)
# Legacy hybrid search interface
hybrid_results = retriever.hybrid_search(
"deep learning neural networks",
top_k=10,
bm25_ratio=0.5
)
# BM25 only (keyword-based)
bm25_results = retriever.search_bm25("machine learning", top_k=10)
# Vector only (semantic similarity)
vector_results = retriever.search_vector("artificial intelligence", top_k=10)
Testing
Run tests to verify functionality:
pytest tests/
Or run directly:
python tests/test_examples.py
Test Coverage:
- ChromaDB interface compatibility
- Inverted index consistency validation
- Document addition/removal workflows
- Cross-document term tracking
- Posting list ordering verification
- Vocabulary cleanup on document removal
- Reset collection functionality
- Critical method existence validation
Examples
- examples/basic_usage.py - Document management workflow with custom documents
- examples/brown_corpus_w_ratio.py - Brown corpus analysis with ratio testing
Processing Modes
Unified Mode (Recommended)
- Processes both BM25 and ChromaDB together
- Usually faster for large datasets
- Better for production use
Sequential Mode
- Processes ChromaDB first, then BM25
- Better for debugging and optimization
- Separate timing for each system
Performance
The system is designed for efficiency through incremental operations:
No Full Recalculation: Adding or removing documents updates only affected components:
- Vocabulary set adds/removes only new/orphaned terms
- Inverted index updates only posting lists for changed terms
- Document statistics incrementally adjust averages and counts
Python Native Libraries: Heavy lifting handled by optimized built-ins:
- Counter.most_common() provides pre-sorted frequency lists
- heapq.merge() efficiently combines sorted posting lists
- set operations for O(1) vocabulary lookups and updates
- defaultdict(Counter) for sparse term-document matrices
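A short sketch of how these standard-library pieces fit together is shown below. The sample tokens and posting lists are made up, and the way the package actually combines these calls may differ.

```python
import heapq
from collections import Counter

tokens = ["machine", "learning", "machine", "data", "machine", "learning"]

# most_common() returns (term, count) pairs already sorted by count, descending
term_freqs = Counter(tokens).most_common()
print(term_freqs)  # [('machine', 3), ('learning', 2), ('data', 1)]

# Two posting lists for the same term, each already sorted by frequency descending
existing = [(3, "doc_1"), (1, "doc_3")]
incoming = [(2, "doc_5"), (1, "doc_7")]

# reverse=True tells merge() that the inputs are sorted largest-first
merged = list(heapq.merge(existing, incoming, key=lambda p: p[0], reverse=True))
print([freq for freq, _ in merged])  # [3, 2, 1, 1]

# Set membership and updates are O(1) on average
vocabulary = {"machine", "learning"}
vocabulary.update(term for term, _ in term_freqs)
print("data" in vocabulary)  # True
```

Merging two sorted lists this way is linear in their combined length, which avoids a full re-sort when a batch of postings is added.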
Batch Processing: Configurable batch sizes balance memory usage and processing speed:
- Pending additions buffer reduces index update frequency
- Automatic flush mechanism maintains data consistency
- Progress tracking for large document collections
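The pending-buffer idea can be sketched as follows. `BufferedIndexer` and its `apply_batch` callback are hypothetical names for this example; the package's own buffer and flush logic are internal and may work differently.

```python
class BufferedIndexer:
    """Collect additions and hand them to apply_batch in groups (illustrative only)."""

    def __init__(self, apply_batch, batch_size=100):
        self.apply_batch = apply_batch  # callable that receives a list of pending items
        self.batch_size = batch_size
        self.pending = []

    def add(self, doc_id, text):
        self.pending.append((doc_id, text))
        if len(self.pending) >= self.batch_size:
            self.flush()  # automatic flush once the buffer is full

    def flush(self):
        if self.pending:
            batch, self.pending = self.pending, []
            self.apply_batch(batch)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.flush()  # leftovers are written when the block exits
        return False

applied = []  # stands in for the real index update
with BufferedIndexer(applied.append, batch_size=2) as indexer:
    for i in range(5):
        indexer.add(f"doc_{i}", "some text")

print([len(batch) for batch in applied])  # [2, 2, 1]
```

A larger `batch_size` means fewer index updates at the cost of holding more pending documents in memory.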
API
Main Classes
- BM25: Fast BM25 implementation with inverted index
- HybridRetriever: Main hybrid search interface with ChromaDB compatibility
Key Methods
Search Methods:
- query(): ChromaDB-compatible hybrid search interface
- hybrid_search(): Legacy hybrid search with RRF
- search_bm25(): BM25-only search
- search_vector(): Vector-only search
Document Management:
- add_documents_batch(): Add documents in batches
- remove_document(): Remove single document
- remove_documents_batch(): Remove multiple documents
System Methods:
- reset_collection(): Clear all documents and restart fresh
- get_system_stats(): Performance statistics and document counts
- _save_state() / _load_state(): Automatic BM25 persistence
License
MIT