RAG Eval Pro 🎯
Production-Grade Evaluation Framework for RAG Systems
A modular, extensible, and production-ready evaluation framework for Retrieval-Augmented Generation (RAG) systems.
🎯 What It Evaluates
- 📄 Retrieval Quality - Recall@k, Precision@k, nDCG, MRR
- 🧾 Generation Quality - Semantic similarity, relevance, completeness
- 🔗 Grounding / Faithfulness - Answer grounding in retrieved context
- ⚙️ System Performance - Latency, token usage, cost estimation
🚀 Why This Library Exists
Most RAG systems fail silently because:
- ❌ Retrieval is not evaluated
- ❌ Hallucinations go undetected
- ❌ Metrics are inconsistent across teams
This library solves that by providing:
- ✅ Unified evaluation across retrieval + generation + grounding
- ✅ Pluggable metric system
- ✅ CI/CD integration for regression detection
- ✅ LLM-as-a-judge paradigm with consistent scoring
🏗️ Architecture Overview
Dataset → RAG Execution → Metrics → Aggregation → Reporting → CI/CD
🔧 Works With
- Custom RAG pipelines
- LangChain / LlamaIndex
- Any LLM provider (OpenAI, Anthropic, Cohere, etc.)
📦 Installation
# Basic offline installation
pip install rag-eval-pro
# Local semantic metrics
pip install rag-eval-pro[semantic]
# LLM-backed metrics
pip install rag-eval-pro[llm]
# Everything
pip install rag-eval-pro[all]
Installation Modes
- rag-eval-pro: fully offline core library for datasets, retrieval metrics, aggregation, caching, and reporting
- rag-eval-pro[semantic]: enables embedding-backed semantic similarity
- rag-eval-pro[llm]: enables provider-backed LLM judge metrics
- rag-eval-pro[all]: installs every optional feature
🎯 Quick Start
from rag_eval.pipeline.runner import RAGEvaluator
from rag_eval.metrics.retrieval import RecallAtK
from rag_eval.metrics.semantic import SemanticSimilarity
from rag_eval.dataset.schema import RAGDataset, RAGSample
# Create dataset
dataset = RAGDataset(
name="demo",
version="v1",
samples=[
RAGSample(
query="What is refund policy?",
retrieved_docs=["refund doc"],
generated_answer="Refund allowed in 30 days",
ground_truth="Refund allowed in 30 days",
relevant_docs=["refund doc"]
)
]
)
# Initialize evaluator with metrics
metrics = [
RecallAtK(k=5),
SemanticSimilarity(
backend="lexical",
compare_to="ground_truth"
)
]
evaluator = RAGEvaluator(metrics)
results = evaluator.evaluate(dataset)
print(results)
LLM Metrics
import os
from rag_eval.metrics.llm_metrics import FaithfulnessMetric
metric = FaithfulnessMetric(
provider="openai",
model="gpt-4o-mini",
api_key=os.getenv("OPENAI_API_KEY")
)
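For tests, examples, and CI where you do not want live API calls, the same metric can run against the mock provider mentioned in the Runtime Requirements and Publishing Notes below; a minimal sketch, assuming the mock provider needs no model or credentials:
from rag_eval.metrics.llm_metrics import FaithfulnessMetric
# Mock provider: no API key required, suitable for offline demos and CI
metric = FaithfulnessMetric(provider="mock")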
Semantic Backends
from rag_eval.metrics.semantic import SemanticSimilarity
# Offline fallback
SemanticSimilarity(backend="lexical")
# Prefer local/downloaded sentence-transformers model, but fall back if unavailable
SemanticSimilarity(
backend="auto",
model="all-MiniLM-L6-v2"
)
# Require embedding model explicitly
SemanticSimilarity(
backend="sentence-transformers",
model="all-MiniLM-L6-v2",
allow_fallback=False
)
Output Format
{
"mean": {
"faithfulness": 0.89,
"recall@5": 0.76
},
"p95": {
"faithfulness": 0.95
}
}
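Because the aggregated results are plain nested scores, they can gate a CI job directly; a minimal sketch, reusing the evaluator and dataset from Quick Start and assuming results is a dict shaped like the JSON above (the 0.85 / 0.70 thresholds are placeholder values):
results = evaluator.evaluate(dataset)
# Fail the build when a tracked metric regresses below its threshold
assert results["mean"]["faithfulness"] >= 0.85, "faithfulness regressed"
assert results["mean"]["recall@5"] >= 0.70, "recall@5 regressed"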
📊 Dataset Format
The library is dataset-first. All evaluations start with a structured dataset.
Required Schema
{
"query": "What is refund policy?",
"retrieved_docs": ["doc1", "doc2"],
"generated_answer": "You can request refund within 30 days.",
"ground_truth": "Refund allowed within 30 days.",
"relevant_docs": ["doc1"]
}
Python Schema
from rag_eval.dataset.schema import RAGSample
sample = RAGSample(
query="What is refund policy?",
retrieved_docs=["doc1", "doc2"],
generated_answer="You can request refund within 30 days.",
ground_truth="Refund allowed within 30 days.",
relevant_docs=["doc1"]
)
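Records stored on disk in the JSON schema above can be loaded into a RAGDataset with a few lines of standard-library code; a minimal sketch, assuming a JSONL file (one record per line, hypothetical path samples.jsonl) and the constructors shown earlier:
import json
from rag_eval.dataset.schema import RAGDataset, RAGSample

# Each line of samples.jsonl holds one record in the required schema
with open("samples.jsonl") as f:
    samples = [RAGSample(**json.loads(line)) for line in f if line.strip()]

dataset = RAGDataset(name="support-faq", version="v1", samples=samples)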
🔧 Advanced Usage
Custom Metrics
import numpy as np
from rag_eval.metrics import BaseMetric

class CustomMetric(BaseMetric):
    def compute(self, sample):
        # Per-sample scoring: your custom logic goes here
        score = your_evaluation_logic(sample)
        return {"score": score, "details": {...}}

    def aggregate(self, scores):
        # Aggregate per-sample scores into summary statistics
        return {"mean": np.mean(scores), "std": np.std(scores)}
Async Evaluation
from rag_eval.pipeline import AsyncRAGEvaluator
evaluator = AsyncRAGEvaluator(
metrics=[...],
max_concurrent=10,
cache_enabled=True
)
results = await evaluator.evaluate_async(dataset)
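evaluate_async is a coroutine, so outside an existing event loop it has to be driven with asyncio; a minimal sketch:
import asyncio

async def main():
    # Up to max_concurrent samples are evaluated in parallel
    return await evaluator.evaluate_async(dataset)

results = asyncio.run(main())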
LangChain Integration
from rag_eval.integrations import LangChainAdapter
from langchain.chains import RetrievalQA
# Your LangChain RAG system
qa_chain = RetrievalQA.from_chain_type(...)
# Wrap with adapter
adapter = LangChainAdapter(qa_chain)
# Evaluate
results = evaluator.evaluate(dataset, rag_system=adapter)
📈 Metrics Overview
Retrieval Metrics
- Recall@K: Measures retrieval coverage
- Precision@K: Measures retrieval accuracy
- nDCG: Normalized Discounted Cumulative Gain
- MRR: Mean Reciprocal Rank
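To make the definitions concrete, here is how Recall@K and MRR come out for a toy ranked result list, computed by hand with the standard formulas (independent of the library's internals):
retrieved = ["doc2", "doc5", "doc1"]   # ranked retrieval output
relevant = {"doc1", "doc3"}            # ground-truth relevant documents

# Recall@3: relevant docs in the top 3 / total relevant = 1 / 2 = 0.5
recall_at_3 = len(set(retrieved[:3]) & relevant) / len(relevant)

# MRR: reciprocal rank of the first relevant hit (doc1 at rank 3) = 1/3
mrr = next((1 / (i + 1) for i, doc in enumerate(retrieved) if doc in relevant), 0.0)

print(recall_at_3, mrr)  # 0.5 0.333...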
Generation Metrics
- Faithfulness: Answer grounding in retrieved context
- Relevance: Answer relevance to query
- Coherence: Answer logical consistency
- Semantic Similarity: Embedding-based similarity
Runtime Requirements
- Retrieval metrics work offline after pip install rag-eval-pro
- SemanticSimilarity(backend="sentence-transformers") needs pip install rag-eval-pro[semantic]
- LLM judge metrics need pip install rag-eval-pro[llm] plus provider credentials
- For local-only demos and CI, use provider="mock" for LLM metrics
Hallucination Detection
- Context Contradiction: Detects contradictions with context
- Factual Consistency: Checks factual alignment
- Unsupported Claims: Identifies unsupported statements
🛠️ Configuration
Create a config.yaml:
evaluator:
cache_enabled: true
cache_dir: ".cache/rag_eval"
max_retries: 3
timeout: 30
llm:
provider: "openai"
model: "gpt-4"
temperature: 0.0
max_tokens: 500
embeddings:
model: "all-MiniLM-L6-v2"
batch_size: 32
logging:
level: "INFO"
format: "json"
Load configuration:
from rag_eval.utils import load_config
config = load_config("config.yaml")
evaluator = RAGEvaluator.from_config(config)
🧪 Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=rag_eval --cov-report=html
# Run specific test categories
pytest -m unit
pytest -m integration
pytest -m "not slow"
🚢 Publishing Notes
- Keep the base package lightweight and offline-safe
- Treat cloud-backed metrics and embedding models as optional extras
- Use provider="mock" in tests, examples, and CI when you do not want live API calls
- Document required environment variables for provider-backed metrics
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
# Setup development environment
git clone https://github.com/shivangis22/rag-eval.git
cd rag-eval
pip install -e ".[dev]"
pre-commit install
# Run tests
pytest
# Format code
black .
ruff check --fix .
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Inspired by RAGAS, TruLens, and other RAG evaluation frameworks
- Built with modern Python best practices
- Community-driven development
📧 Contact
- Issues: GitHub Issues
- Email: shivangis2208@gmail.com
🗺️ Roadmap
- Support for more LLM providers (Cohere, Gemini)
- Multi-modal RAG evaluation
- Real-time evaluation dashboard
- A/B testing framework
- Automated dataset generation
- Integration with MLOps platforms
Star ⭐ this repo if you find it useful!
Download files
File details
Details for the file rag_eval_pro-0.1.1.tar.gz.
File metadata
- Download URL: rag_eval_pro-0.1.1.tar.gz
- Upload date:
- Size: 51.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 155b116040cf7b4b11f47abbfa33fafdb9e748782ac515859611e1ce03ef7ac2 |
| MD5 | bc60544b9d716aad42b7ef31976e061b |
| BLAKE2b-256 | 77f9e8fdce1a5f62b72299f6aaee2c175e1adcdb9bd3aa9cf0c9e414847da78b |
File details
Details for the file rag_eval_pro-0.1.1-py3-none-any.whl.
File metadata
- Download URL: rag_eval_pro-0.1.1-py3-none-any.whl
- Upload date:
- Size: 51.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bd2feeaed42dec4ecfdf6cb29827b65fdb430d08524f39098e870af02974c3e9 |
| MD5 | a8a01e60cc104bfd2391992c4cc05ec1 |
| BLAKE2b-256 | 76907deb0cec791d8207c641ef47b9eba3ada7b4c064209928ebe86461ec1598 |