A production-grade custom LangChain retriever combining Chroma vector search and local BM25 search with Reciprocal Rank Fusion (RRF).
🚀 Chroma-Hybrid-RRF
chroma-hybrid-rrf is a production-grade, LangChain-compatible custom retriever that merges dense semantic vector search (using ChromaDB) and sparse keyword search (using BM25) using Reciprocal Rank Fusion (RRF).
By combining keyword matching and vector embeddings, this retriever improves query precision and robustness: BM25 catches exact terms and rare tokens that embeddings can blur, while dense search catches synonyms and paraphrases that keyword matching misses.
📐 How It Works: Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion is a highly reliable algorithm that scores documents solely based on their rank order from different retrievers (rather than comparing raw similarity scores or distances, which vary widely between vector spaces and keyword counters).
The RRF score for a document $d$ across retrieval models $M$ is calculated as:
$$\mathrm{RRF\_Score}(d \in D) = \sum_{m \in M} \frac{1}{k + r_m(d)}$$
Where:
- $M$: The set of retrievers (Dense vector search + Sparse BM25 keyword search).
- $r_m(d)$: The 1-based rank position of document $d$ in the result list returned by retriever $m$.
- $k$: A constant smoothing parameter (default 60) that dampens the advantage of top-ranked positions, so a single retriever ranking a document very highly cannot dominate the fused score.
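The formula can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's internal code: the function name `rrf_fuse`, the document IDs, and the two example rankings are all hypothetical, and each ranking is assumed to contain no duplicate IDs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    top_n: int = 4,
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, matching r_m(d) in the formula.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for deterministic output.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


dense_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector search order
sparse_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical BM25 order

fused = rrf_fuse([dense_ranking, sparse_ranking], k=60, top_n=4)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.6f}")
```

Here `doc_b` (ranks 2 and 1) edges out `doc_a` (ranks 1 and 3), and documents found by both retrievers outrank those found by only one, even though no raw similarity scores are compared.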
🛠️ Key Features
- Dual-retrieval pipelines: Performs dense search via ChromaDB and sparse keyword search via BM25.
- Auto-Sync indexing: Dynamically pulls and indexes documents from ChromaDB to construct the BM25 search corpus automatically.
- Metadata preservation: Retains all original source metadata and appends the calculated `rrf_score` for debugging and evaluation.
- LangChain BaseRetriever compliance: Full drop-in integration with LangChain chains (`|`) and LCEL (LangChain Expression Language).
- Async-ready: Supports standard async calling conventions (`ainvoke`).
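To make the sparse side of the pipeline concrete, here is a rough sketch of ranking documents with Okapi BM25 over a payload shaped like the dictionary a Chroma collection `get()` call returns (`ids` and `documents` keys). It is only an illustration under stated assumptions: the function names, the whitespace tokenizer, the `k1` and `b` defaults, and the sample documents are hypothetical, the IDF uses the common always-positive variant, and the package itself may rely on a dedicated BM25 library instead.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Naive lowercase whitespace tokenizer; an assumption for this sketch.
    return text.lower().split()


def bm25_rank(
    payload: Dict[str, list], query: str, k1: float = 1.5, b: float = 0.75
) -> List[str]:
    """Return document IDs ordered by Okapi BM25 score, best first.

    Assumes the payload holds at least one non-empty document.
    """
    corpus = [tokenize(doc) for doc in payload["documents"]]
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc_id, doc in zip(payload["ids"], corpus):
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append((doc_id, score))
    # sorted() is stable, so zero-score documents keep their original order.
    return [doc_id for doc_id, _ in sorted(scores, key=lambda pair: -pair[1])]


# Hypothetical payload shaped like the result of a Chroma `get()` call.
payload = {
    "ids": ["a", "b", "c"],
    "documents": [
        "LangGraph orchestrates multi-agent workflows",
        "BM25 is a sparse keyword ranking function",
        "Chroma stores dense vector embeddings",
    ],
}
ranked_ids = bm25_rank(payload, "keyword ranking")
print(ranked_ids)
```

Only the order of `ranked_ids` would feed into RRF; the raw BM25 scores are discarded, which is what lets them be fused with vector distances on a different scale.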
📦 Installation
To install chroma-hybrid-rrf locally in editable mode for development:
git clone https://github.com/Raj2001A/chroma-hybrid-rrf.git
cd chroma-hybrid-rrf
python -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate
pip install -e ".[dev]" # Quotes keep shells such as zsh from expanding the brackets
⚡ Quick Start
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from chroma_hybrid_rrf import ChromaHybridRRFRetriever
# 1. Initialize dense Chroma Vector Store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(
    collection_name="my_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
# 2. Create the Custom Hybrid RRF Retriever
retriever = ChromaHybridRRFRetriever(
    chroma_vectorstore=vectorstore,
    rrf_k=60,  # RRF constant k
    top_n=4    # Return top 4 fused documents
)
# 3. Retrieve fused documents
query = "Explain LangGraph multi-agent orchestration"
fused_docs = retriever.invoke(query)
for rank, doc in enumerate(fused_docs):
    print(f"Rank {rank + 1} | Score: {doc.metadata['rrf_score']:.6f}")
    print(f"Content: {doc.page_content}\n")
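Because the retriever exposes `ainvoke`, several queries can be issued concurrently. The sketch below shows one way this might look; `retrieve_many` and `EchoRetriever` are hypothetical names, and the stub stands in for a real retriever so the example runs without a vector store or API key.

```python
import asyncio
from typing import Any, List, Sequence


async def retrieve_many(retriever: Any, queries: Sequence[str]) -> List[List[Any]]:
    """Run one ainvoke call per query concurrently; results keep query order."""
    tasks = [retriever.ainvoke(query) for query in queries]
    return await asyncio.gather(*tasks)


class EchoRetriever:
    """Hypothetical stand-in so the sketch runs without a vector store."""

    async def ainvoke(self, query: str) -> List[str]:
        await asyncio.sleep(0)  # Yield control, as a real I/O call would.
        return [f"result for: {query}"]


results = asyncio.run(
    retrieve_many(EchoRetriever(), ["first query", "second query"])
)
print(results)
```

In an application, the stub would be replaced by the `retriever` created in the Quick Start, and each inner list would hold the fused documents for one query.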
🧪 Evaluation via RAGAS
Evaluating retrieval precision is critical for building production-grade RAG systems. Using the RAGAS framework, you can evaluate the effectiveness of this retriever across key retrieval metrics:
- Context Precision: Measures how well the retriever ranks relevant documents at the top.
- Context Recall: Measures whether all relevant ground-truth facts are successfully retrieved.
Set up RAGAS evaluation:
from ragas import evaluate
from ragas.metrics import context_precision, context_recall
from datasets import Dataset
# Construct your evaluation dataset
eval_data = {
    "question": ["How do you orchestrate agents?"],
    # Each contexts entry should hold the documents retrieved for the matching question
    "contexts": [[doc.page_content for doc in fused_docs]],
    "ground_truth": ["LangGraph is used for building stateful, multi-actor applications with LLMs."]
}
dataset = Dataset.from_dict(eval_data)
results = evaluate(dataset, metrics=[context_precision, context_recall])
print(results)
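For intuition about what Context Precision rewards, the rank-weighted idea can be approximated by hand. The sketch below assumes binary relevance labels that you supply yourself (RAGAS derives its judgments with an LLM) and follows the average-of-precision@k form described in the RAGAS documentation; the function name and the example labels are hypothetical, and the numbers it yields are not a substitute for a real RAGAS run.

```python
from typing import Sequence


def context_precision_at_k(relevance: Sequence[int]) -> float:
    """Average precision@k over the positions k that hold a relevant chunk."""
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k at this relevant position
    return weighted / hits if hits else 0.0


# Hand-labeled relevance of four retrieved chunks, in rank order (1 = relevant).
relevant_first = context_precision_at_k([1, 0, 1, 0])
relevant_later = context_precision_at_k([0, 1, 1, 0])
print(f"{relevant_first:.4f} vs {relevant_later:.4f}")
```

Both lists contain two relevant chunks, but the one that places a relevant chunk at rank 1 scores higher, which is the behavior a good fusion step should produce.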
🧪 Testing
To run the test suite and verify calculation correctness:
pytest tests/
🤝 Contributing
Contributions are highly welcome! To contribute:
- Fork the repository.
- Create a new feature branch: `git checkout -b feat/your-feature`.
- Write your changes and add tests.
- Run `pytest` to make sure all tests pass.
- Push to your branch and open a Pull Request.
📜 License
Distributed under the MIT License. See LICENSE for details.
File details
Details for the file chroma_hybrid_rrf-0.1.0.tar.gz.
File metadata
- Download URL: chroma_hybrid_rrf-0.1.0.tar.gz
- Upload date:
- Size: 8.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 907ed75d4a06baf145169b7df9235c3892087c5df80d04d693d6f800afb373f0 |
| MD5 | 3e571ce8f859e930e5e90b9750721abb |
| BLAKE2b-256 | 0a155c808702461f61fdb48648808bfb9bebf4ba364dd30ebd7662a084632cc0 |
File details
Details for the file chroma_hybrid_rrf-0.1.0-py3-none-any.whl.
File metadata
- Download URL: chroma_hybrid_rrf-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed6de9458f7e96b5a682fd05a2b00413b268bcbb9a8ee9df618f0d8523ca42a4 |
| MD5 | 4be562b9cdf61335b02af64538066ad4 |
| BLAKE2b-256 | 6873327d55911fcdeadb06f10d885322215cdceaf5d21c0554857ad1600e8ece |