A fast, memory-efficient hybrid search system combining BM25 and vector search

These details have not been verified by PyPI

Project links

Project description

BM-25 Chroma Hybrid Retriever

A fast, memory-efficient hybrid search system combining BM25 and vector search with Reciprocal Rank Fusion (RRF).

Features

BM25: Memory-efficient with integer indices and pre-sorted postings
- lemmatizes, lowercase, no punctuation (replaced with space), len norm
Vector Search: Semantic similarity using ChromaDB and sentence transformers
Hybrid Fusion: Industry-standard Reciprocal Rank Fusion (RRF)
ChromaDB Drop-in Replacement: Compatible interface with hybrid search capabilities
Dual Processing Modes: Sequential or unified batch processing
State Persistence: Automatic save/load of BM25 index
Document Management: Add, remove, and update documents (chunks) with inverted index consistency

Data Structure Notes

Storage: Inverted index format word -> [(frequency, doc_id), ...]
BM25 Scoring: Accesses document term frequencies by inverting the lookup (query term → posting list → document frequencies)
Avoids: Storing redundant document -> [(freq, word), ...] mappings

Quickstart

from bm25_chroma import HybridRetriever
import hashlib

# Initialize
retriever = HybridRetriever(
    chroma_path="./my_db",
    collection_name="my_docs"
)

# Add documents with deterministic, unique IDs
documents = [
    "Machine learning helps analyze data patterns.",
    "Natural language processing understands human text.",
    "Deep learning uses neural networks for complex tasks."
]

# Content-based surrogate keys via hashlib - avoids order dependency
# Alternatively use natural keys when available
doc_ids = [hashlib.sha256(doc.encode()).hexdigest() for doc in documents]

# Add documents
retriever.add_documents_batch(
    documents,
    doc_ids=doc_ids,  # Optional: auto-generated--using chroma's UUID--if not provided
    mode="unified",   # or "sequential"
    show_progress=True
)

# Search with ChromaDB-compatible interface
results = retriever.query(
    query_texts=["machine learning"],
    n_results=5,
    bm25_ratio=0.5,  # 0.0 = vector only, 1.0 = BM25 only, 0.5 = balanced
    include=['documents', 'metadatas', 'distances']
)

# Process results
for doc, meta, dist in zip(results['documents'][0], results['metadatas'][0], results['distances'][0]):
    print(f"Score: {1-dist:.3f} - {doc[:100]}...")

# Legacy interface also available
legacy_results = retriever.hybrid_search("machine learning", top_k=5, bm25_ratio=0.5)
for doc_id, score, metadata in legacy_results:
    print(f"{doc_id[:16]}...: {score:.3f} - {metadata['text'][:100]}...")

Why use hashlib for document IDs?

Deterministic: Same content always produces the same ID
Unique: SHA256 hash collisions are extremely rare
Content-based: ID reflects the actual document content
Database-safe: Perfect for ensuring uniqueness across systems

ChromaDB Interface Compatibility

HybridRetriever provides a drop-in replacement for ChromaDB collections with hybrid BM25+vector search:

# ChromaDB-compatible interface
results = retriever.query(
    query_texts=["machine learning algorithms"],
    n_results=5,
    bm25_ratio=0.5,  # 0.0 = vector only, 1.0 = BM25 only, 0.5 = balanced
    include=['documents', 'metadatas', 'distances']
)

# Returns ChromaDB format
{
    'documents': [['Machine learning helps analyze...', '...']],
    'metadatas': [[{'document_id': 'abc123...'}, {...}]],
    'distances': [[0.234, 0.456, ...]],
    'embeddings': [[...], [...]]  # if requested in include
}

# Single query string (automatically converted to list)
results = retriever.query("deep learning", n_results=3)

# Use as drop-in ChromaDB replacement in existing code
# Just replace: collection.query() with: retriever.query()

Installation

pip install bm25-chroma

Core Architecture

The BM25 component maintains an inverted index with the following structure:

Data Structure:

Vocabulary Set: set(words) containing all unique terms
Inverted Index: dict[word] = [(frequency, document_id), ...]
Posting Lists: Tuples ordered by frequency in descending order

Inverted Index Consistency:

# Example inverted index structure
{
    "machine": [(3, doc_1), (2, doc_5), (1, doc_3)],  # frequency descending
    "learning": [(2, doc_1), (2, doc_2), (1, doc_4)],
    "data": [(1, doc_1), (1, doc_3)]
}

When documents are added or removed:

Addition: Terms added to vocabulary, posting lists updated and re-sorted
Removal: Orphaned terms removed from vocabulary, posting lists cleaned
Consistency: Document statistics and averages recalculated incrementally

Document Management

Adding Documents

# Single batch with auto-generated IDs
retriever.add_documents_batch(documents, mode="unified")

# Single batch with custom IDs
retriever.add_documents_batch(
    documents, 
    doc_ids=["custom_1", "custom_2", "custom_3"],
    mode="unified"
)

# Multiple batches for large datasets
for batch in document_batches:
    retriever.add_documents_batch(batch, mode="unified", show_progress=True)

Removing Documents

# Remove single document
retriever.remove_document("doc_id_1")

# Batch removal (efficient for multiple documents)
retriever.remove_documents_batch(["doc_1", "doc_2", "doc_3"])

# Check system state after removal
stats = retriever.get_system_stats()
print(f"Documents remaining: {stats['chunks']}")

System Management

Reset Collection

# Clear all documents and start fresh
retriever.reset_collection()

# Verify clean state
stats = retriever.get_system_stats()
print(f"Documents after reset: {stats['chunks']}")  # Should be 0

State Persistence

# BM25 state automatically saved to disk
# Reload existing index on initialization
retriever = HybridRetriever(
    chroma_path="./my_db",
    collection_name="my_docs",
    bm25_state_path="./my_bm25_index.pkl"  # Auto-loads if exists
)

Search Methods

# ChromaDB-compatible interface (recommended)
results = retriever.query(
    query_texts=["machine learning"],
    n_results=10,
    bm25_ratio=0.5,  # Hybrid ratio: 0.0=vector only, 1.0=BM25 only
    include=['documents', 'metadatas', 'distances']
)

# Legacy hybrid search interface
hybrid_results = retriever.hybrid_search(
    "deep learning neural networks",
    top_k=10,
    bm25_ratio=0.5
)

# BM25 only (keyword-based)
bm25_results = retriever.search_bm25("machine learning", top_k=10)

# Vector only (semantic similarity)
vector_results = retriever.search_vector("artificial intelligence", top_k=10)

Testing

Run tests to verify functionality:

pytest tests/

Or run directly:

python tests/test_examples.py

Test Coverage:

ChromaDB interface compatibility
Inverted index consistency validation
Document addition/removal workflows
Cross-document term tracking
Posting list ordering verification
Vocabulary cleanup on document removal
Reset collection functionality
Critical method existence validation

Examples

examples/basic_usage.py - Document management workflow with custom documents
examples/brown_corpus_w_ratio.py - Brown corpus analysis with ratio testing

Processing Modes

Unified Mode (Recommended)

Processes both BM25 and ChromaDB together
Usually faster for large datasets
Better for production use

Sequential Mode

Processes ChromaDB first, then BM25
Better for debugging and optimization
Separate timing for each system

Performance

The system is designed for efficiency through incremental operations:

No Full Recalculation: Adding or removing documents updates only affected components:

Vocabulary set adds/removes only new/orphaned terms
Inverted index updates only posting lists for changed terms
Document statistics incrementally adjust averages and counts

Python Native Libraries: Heavy lifting handled by optimized built-ins:

Counter.most_common() provides pre-sorted frequency lists
heapq.merge() efficiently combines sorted posting lists
set operations for O(1) vocabulary lookups and updates
defaultdict(Counter) for sparse term-document matrices

Batch Processing: Configurable batch sizes balance memory usage and processing speed:

Pending additions buffer reduces index update frequency
Automatic flush mechanism maintains data consistency
Progress tracking for large document collections

API

Main Classes

BM25: Fast BM25 implementation with inverted index
HybridRetriever: Main hybrid search interface with ChromaDB compatibility

Key Methods

Search Methods:

query(): ChromaDB-compatible hybrid search interface
hybrid_search(): Legacy hybrid search with RRF
search_bm25(): BM25-only search
search_vector(): Vector-only search

Document Management:

add_documents_batch(): Add documents in batches
remove_document(): Remove single document
remove_documents_batch(): Remove multiple documents

System Methods:

reset_collection(): Clear all documents and restart fresh
get_system_stats(): Performance statistics and document counts
_save_state() / _load_state(): Automatic BM25 persistence

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.6.1

Sep 15, 2025

0.6.0

Sep 15, 2025

0.5.3

Sep 14, 2025

0.5.1

Sep 14, 2025

0.5.0

Sep 14, 2025

0.4.1

Sep 14, 2025

0.3.2

Sep 13, 2025

0.3.1

Sep 13, 2025

0.3.0

Sep 13, 2025

0.2.2

Sep 13, 2025

0.2.1

Sep 13, 2025

0.2.0

Sep 13, 2025

0.1.0

Sep 13, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bm25_chroma-0.6.1.tar.gz (25.5 kB view details)

Uploaded Sep 15, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

bm25_chroma-0.6.1-py3-none-any.whl (20.0 kB view details)

Uploaded Sep 15, 2025 Python 3

File details

Details for the file bm25_chroma-0.6.1.tar.gz.

File metadata

Download URL: bm25_chroma-0.6.1.tar.gz
Upload date: Sep 15, 2025
Size: 25.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.0

File hashes

Hashes for bm25_chroma-0.6.1.tar.gz
Algorithm	Hash digest
SHA256	`91b7a380eb36b3a39810db1fddebbc32645c7e26c496f9717407409cc021dc91`
MD5	`951580a8e21b5eb3c98dddb0e80a5f95`
BLAKE2b-256	`0798fa9f7899dfcbee5afe4f1f14181d8834e2dd5bea180584de0805deee3113`

See more details on using hashes here.

File details

Details for the file bm25_chroma-0.6.1-py3-none-any.whl.

File metadata

Download URL: bm25_chroma-0.6.1-py3-none-any.whl
Upload date: Sep 15, 2025
Size: 20.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.0

File hashes

Hashes for bm25_chroma-0.6.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`01a31014e71b34334002966a4b0794f2a0ced1d088d3d6c1a566cfa85cc71c5b`
MD5	`c0f0f5eb96a01b60dc58b24ad998318b`
BLAKE2b-256	`9af5a367ec3f49badad220c729b52a554b471e5bb3ef32848a9b16c72aeb86b8`

See more details on using hashes here.

bm25-chroma 0.6.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

BM-25 Chroma Hybrid Retriever

Features

Data Structure Notes

Quickstart

Why use hashlib for document IDs?

ChromaDB Interface Compatibility

Installation

Core Architecture

Document Management

Adding Documents

Removing Documents

System Management

Reset Collection

State Persistence

Search Methods

Testing

Examples

Processing Modes

Performance

API

Main Classes

Key Methods

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes