A production-grade custom LangChain retriever combining Chroma vector search and local BM25 search with Reciprocal Rank Fusion (RRF).

🚀 Chroma-Hybrid-RRF

License: MIT | Python 3.9+ | LangChain | Chroma

chroma-hybrid-rrf is a production-grade custom LangChain-compatible retriever that merges dense semantic vector search (via ChromaDB) with sparse keyword search (via BM25) using Reciprocal Rank Fusion (RRF).

Keyword matching catches exact terms (identifiers, rare names) that embeddings can blur, while vector embeddings catch synonyms and paraphrases that keyword matching misses. Combining the two improves query precision and robustness and reduces gaps in the retrieved context.


📐 How It Works: Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion is a highly reliable algorithm that scores documents solely based on their rank order from different retrievers (rather than comparing raw similarity scores or distances, which vary widely between vector spaces and keyword counters).

The RRF score for a document $d$ across retrieval models $M$ is calculated as:

$$\mathrm{RRF\_Score}(d) = \sum_{m \in M} \frac{1}{k + r_m(d)}, \quad d \in D$$

Where:

  • $M$: The set of retrievers (Dense vector search + Sparse BM25 keyword search).
  • $r_m(d)$: The 1-based rank position of document $d$ in the result list returned by retriever $m$.
  • $k$: A constant smoothing parameter (default 60) that dampens the advantage of top-ranked positions, so that an outlier ranking from a single retriever cannot dominate the fused score.
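
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's actual implementation: the function name `rrf_fuse` and the sample document IDs are hypothetical, each ranking is assumed to be a list of unique document IDs ordered best-first, and a document missing from a list is assumed to contribute nothing for that retriever.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def rrf_fuse(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Tuple[Hashable, float]]:
    """Fuse best-first ranked lists of document IDs with Reciprocal Rank Fusion.

    Hypothetical helper for illustration only.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # r_m(d) is the 1-based rank of the document in this retriever's list.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up rankings standing in for the dense and BM25 result lists.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_c", "doc_a", "doc_d"]

fused = rrf_fuse([dense_ranking, bm25_ranking], k=60)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.6f}")
```

Here doc_a (ranks 1 and 2) edges out doc_c (ranks 3 and 1) because only rank positions matter, not the raw similarity or BM25 scores.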

🛠️ Key Features

  • Dual-retrieval pipelines: Performs dense search via ChromaDB and sparse keyword search via BM25.
  • Auto-Sync indexing: Dynamically pulls and indexes documents from ChromaDB to construct the BM25 search corpus automatically.
  • Metadata preservation: Retains all original source metadata and appends the calculated rrf_score for debugging and evaluation.
  • LangChain BaseRetriever compliance: Full drop-in integration with LangChain chains (|) and LCEL (LangChain Expression Language).
  • Async-ready: Supports standard async calling conventions (ainvoke).
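
To make the sparse side of the dual-retrieval pipeline more concrete, here is a rough, standalone sketch of Okapi BM25 scoring. The README does not show how the package implements BM25 internally, so everything here is an assumption for illustration: the function name `bm25_scores` is hypothetical, tokenization is plain lowercase whitespace splitting, the IDF uses the non-negative variant with `+ 1` inside the logarithm, and `k1=1.5`, `b=0.75` are common defaults rather than the package's settings.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: str, corpus: List[str], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score each document in `corpus` against `query` with Okapi BM25.

    Hypothetical helper for illustration only.
    """
    docs = [doc.lower().split() for doc in corpus]
    if not docs:
        return []
    n_docs = len(docs)
    # Fall back to 1.0 so empty documents do not cause a division by zero.
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "LangGraph orchestrates multi-agent workflows",
    "Chroma stores dense vector embeddings",
    "BM25 ranks documents by keyword overlap",
]
scores = bm25_scores("keyword ranks", corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])
```

Sorting documents by these scores gives the sparse ranking that would then be fused with the dense ranking via RRF.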

📦 Installation

To install chroma-hybrid-rrf locally in editable mode for development:

git clone https://github.com/Raj2001A/chroma-hybrid-rrf.git
cd chroma-hybrid-rrf
python -m venv venv
source venv/bin/activate  # On Windows: .\venv\Scripts\activate
pip install -e ".[dev]"  # Quotes keep shells such as zsh from globbing [dev]

⚡ Quick Start

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from chroma_hybrid_rrf import ChromaHybridRRFRetriever

# 1. Initialize dense Chroma Vector Store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(
    collection_name="my_docs", 
    embedding_function=embeddings, 
    persist_directory="./chroma_db"
)

# 2. Create the Custom Hybrid RRF Retriever
retriever = ChromaHybridRRFRetriever(
    chroma_vectorstore=vectorstore,
    rrf_k=60,       # RRF constant k
    top_n=4         # Return top 4 fused documents
)

# 3. Retrieve fused documents
query = "Explain LangGraph multi-agent orchestration"
fused_docs = retriever.invoke(query)

for rank, doc in enumerate(fused_docs, start=1):
    print(f"Rank {rank} | Score: {doc.metadata['rrf_score']:.6f}")
    print(f"Content: {doc.page_content}\n")

🧪 Evaluation via RAGAS

Evaluating retrieval quality is critical for building production-grade RAG systems. Using the RAGAS framework, you can evaluate this retriever on key retrieval metrics:

  • Context Precision: Measures how well the retriever ranks relevant documents at the top.
  • Context Recall: Verifies if all relevant ground-truth facts are successfully retrieved.

Set up the RAGAS evaluation:

from ragas import evaluate
from ragas.metrics import context_precision, context_recall
from datasets import Dataset

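# Construct your evaluation dataset, reusing `query` and `fused_docs` from the Quick Start
# so the question matches the retrieved contexts.
# Column names can differ between RAGAS versions; check the docs for your installed release.
eval_data = {
    "question": [query],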
    "contexts": [[doc.page_content for doc in fused_docs]],
    "ground_truth": ["LangGraph is used for building stateful, multi-actor applications with LLMs."]
}

dataset = Dataset.from_dict(eval_data)
results = evaluate(dataset, metrics=[context_precision, context_recall])
print(results)
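
To build intuition for what Context Precision rewards, the following sketch computes a rank-weighted precision from binary relevance flags. It is an approximation of the idea only: RAGAS decides relevance with an LLM judge and its exact computation may differ between versions. The function name `context_precision_at_k` and the sample flags are hypothetical.

```python
from typing import Sequence


def context_precision_at_k(relevance: Sequence[int]) -> float:
    """Average of precision@k taken at each relevant position.

    `relevance` holds 1 (relevant) or 0 (not relevant) for each retrieved
    document, in rank order. Hypothetical helper for illustration only.
    """
    hits = 0
    weighted_sum = 0.0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # precision@k at this position, counted only where a hit occurs
            weighted_sum += hits / position
    return weighted_sum / hits if hits else 0.0


# Same two relevant documents, placed early versus late in the ranking.
early = context_precision_at_k([1, 0, 1, 0])
late = context_precision_at_k([0, 1, 0, 1])
print(f"early: {early:.4f}, late: {late:.4f}")
```

The same set of relevant documents scores higher when they appear near the top, which is the behavior a well-tuned fused ranking should exhibit.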

🧪 Testing

To run the test suite and verify calculation correctness:

pytest tests/

🤝 Contributing

Contributions are highly welcome! To contribute:

  1. Fork the repository.
  2. Create a new feature branch: git checkout -b feat/your-feature.
  3. Write your changes and add tests.
  4. Run pytest to make sure all tests pass.
  5. Push to your branch and open a Pull Request.

📜 License

Distributed under the MIT License. See LICENSE for details.

Download files

Source Distribution

chroma_hybrid_rrf-0.1.0.tar.gz (8.1 kB)

Uploaded Source

Built Distribution

chroma_hybrid_rrf-0.1.0-py3-none-any.whl (7.4 kB)

Uploaded Python 3

File details

Details for the file chroma_hybrid_rrf-0.1.0.tar.gz.

File metadata

  • Download URL: chroma_hybrid_rrf-0.1.0.tar.gz
  • Upload date:
  • Size: 8.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for chroma_hybrid_rrf-0.1.0.tar.gz
Algorithm Hash digest
SHA256 907ed75d4a06baf145169b7df9235c3892087c5df80d04d693d6f800afb373f0
MD5 3e571ce8f859e930e5e90b9750721abb
BLAKE2b-256 0a155c808702461f61fdb48648808bfb9bebf4ba364dd30ebd7662a084632cc0

File details

Details for the file chroma_hybrid_rrf-0.1.0-py3-none-any.whl.

File hashes

Hashes for chroma_hybrid_rrf-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ed6de9458f7e96b5a682fd05a2b00413b268bcbb9a8ee9df618f0d8523ca42a4
MD5 4be562b9cdf61335b02af64538066ad4
BLAKE2b-256 6873327d55911fcdeadb06f10d885322215cdceaf5d21c0554857ad1600e8ece
