RAG Eval Pro 🎯
Production-Grade Evaluation Framework for RAG Systems
A modular, extensible, and production-ready evaluation framework for Retrieval-Augmented Generation (RAG) systems.
🎯 What It Evaluates
- 📄 Retrieval Quality - Recall@k, Precision@k, nDCG, MRR
- 🧾 Generation Quality - Semantic similarity, relevance, completeness
- 🔗 Grounding / Faithfulness - Answer grounding in retrieved context
- ⚙️ System Performance - Latency, token usage, cost estimation
🚀 Why This Library Exists
Most RAG systems fail silently because:
- ❌ Retrieval is not evaluated
- ❌ Hallucinations go undetected
- ❌ Metrics are inconsistent across teams
This library solves that by providing:
- ✅ Unified evaluation across retrieval + generation + grounding
- ✅ Pluggable metric system
- ✅ CI/CD integration for regression detection
- ✅ LLM-as-a-judge paradigm with consistent scoring
🏗️ Architecture Overview
Dataset → RAG Execution → Metrics → Aggregation → Reporting → CI/CD
🔧 Works With
- Custom RAG pipelines
- LangChain / LlamaIndex
- Any LLM provider (OpenAI, Anthropic, Cohere, etc.)
📦 Installation
# Basic offline installation
pip install rag-eval-pro
# Local semantic metrics
pip install rag-eval-pro[semantic]
# LLM-backed metrics
pip install rag-eval-pro[llm]
# Everything
pip install rag-eval-pro[all]
Installation Modes
- rag-eval-pro: fully offline core library for datasets, retrieval metrics, aggregation, caching, and reporting
- rag-eval-pro[semantic]: enables embedding-backed semantic similarity
- rag-eval-pro[llm]: enables provider-backed LLM judge metrics
- rag-eval-pro[all]: installs every optional feature
🎯 Quick Start
from rag_eval.pipeline.runner import RAGEvaluator
from rag_eval.metrics.retrieval import RecallAtK
from rag_eval.metrics.semantic import SemanticSimilarity
from rag_eval.dataset.schema import RAGDataset, RAGSample
# Create dataset
dataset = RAGDataset(
name="demo",
version="v1",
samples=[
RAGSample(
query="What is refund policy?",
retrieved_docs=["refund doc"],
generated_answer="Refund allowed in 30 days",
ground_truth="Refund allowed in 30 days",
relevant_docs=["refund doc"]
)
]
)
# Initialize evaluator with metrics
metrics = [
RecallAtK(k=5),
SemanticSimilarity(
backend="lexical",
compare_to="ground_truth"
)
]
evaluator = RAGEvaluator(metrics)
results = evaluator.evaluate(dataset)
print(results)
LLM Metrics
import os
from rag_eval.metrics.llm_metrics import FaithfulnessMetric
metric = FaithfulnessMetric(
provider="openai",
model="gpt-4o-mini",
api_key=os.getenv("OPENAI_API_KEY")
)
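For tests, examples, and CI where you do not want live API calls, the same metric can run against the mock provider mentioned in the Runtime Requirements and Publishing Notes below; a minimal sketch, assuming the mock provider needs no model or credentials:
from rag_eval.metrics.llm_metrics import FaithfulnessMetric
# Mock provider: no API key required, suitable for offline demos and CI
metric = FaithfulnessMetric(provider="mock")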
Semantic Backends
from rag_eval.metrics.semantic import SemanticSimilarity
# Offline fallback
SemanticSimilarity(backend="lexical")
# Prefer local/downloaded sentence-transformers model, but fall back if unavailable
SemanticSimilarity(
backend="auto",
model="all-MiniLM-L6-v2"
)
# Require embedding model explicitly
SemanticSimilarity(
backend="sentence-transformers",
model="all-MiniLM-L6-v2",
allow_fallback=False
)
Output Format
{
"mean": {
"faithfulness": 0.89,
"recall@5": 0.76
},
"p95": {
"faithfulness": 0.95
}
}
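Because the aggregated results are plain nested scores, they can gate a CI job directly; a minimal sketch, reusing the evaluator and dataset from Quick Start and assuming results is a dict shaped like the JSON above (the 0.85 / 0.70 thresholds are placeholder values):
results = evaluator.evaluate(dataset)
# Fail the build when a tracked metric regresses below its threshold
assert results["mean"]["faithfulness"] >= 0.85, "faithfulness regressed"
assert results["mean"]["recall@5"] >= 0.70, "recall@5 regressed"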
📊 Dataset Format
The library is dataset-first. All evaluations start with a structured dataset.
Required Schema
{
"query": "What is refund policy?",
"retrieved_docs": ["doc1", "doc2"],
"generated_answer": "You can request refund within 30 days.",
"ground_truth": "Refund allowed within 30 days.",
"relevant_docs": ["doc1"]
}
Python Schema
from rag_eval.dataset.schema import RAGSample
sample = RAGSample(
query="What is refund policy?",
retrieved_docs=["doc1", "doc2"],
generated_answer="You can request refund within 30 days.",
ground_truth="Refund allowed within 30 days.",
relevant_docs=["doc1"]
)
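Records stored on disk in the JSON schema above can be loaded into a RAGDataset with a few lines of standard-library code; a minimal sketch, assuming a JSONL file (one record per line, hypothetical path samples.jsonl) and the constructors shown earlier:
import json
from rag_eval.dataset.schema import RAGDataset, RAGSample

# Each line of samples.jsonl holds one record in the required schema
with open("samples.jsonl") as f:
    samples = [RAGSample(**json.loads(line)) for line in f if line.strip()]

dataset = RAGDataset(name="support-faq", version="v1", samples=samples)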
🔧 Advanced Usage
Custom Metrics
import numpy as np
from rag_eval.metrics import BaseMetric

class CustomMetric(BaseMetric):
    def compute(self, sample):
        # Per-sample scoring: your custom logic goes here
        score = your_evaluation_logic(sample)
        return {"score": score, "details": {...}}

    def aggregate(self, scores):
        # Aggregate per-sample scores into summary statistics
        return {"mean": np.mean(scores), "std": np.std(scores)}
Async Evaluation
from rag_eval.pipeline import AsyncRAGEvaluator
evaluator = AsyncRAGEvaluator(
metrics=[...],
max_concurrent=10,
cache_enabled=True
)
results = await evaluator.evaluate_async(dataset)
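evaluate_async is a coroutine, so outside an existing event loop it has to be driven with asyncio; a minimal sketch:
import asyncio

async def main():
    # Up to max_concurrent samples are evaluated in parallel
    return await evaluator.evaluate_async(dataset)

results = asyncio.run(main())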
LangChain Integration
from rag_eval.integrations import LangChainAdapter
from langchain.chains import RetrievalQA
# Your LangChain RAG system
qa_chain = RetrievalQA.from_chain_type(...)
# Wrap with adapter
adapter = LangChainAdapter(qa_chain)
# Evaluate
results = evaluator.evaluate(dataset, rag_system=adapter)
📈 Metrics Overview
Retrieval Metrics
- Recall@K: Measures retrieval coverage
- Precision@K: Measures retrieval accuracy
- nDCG: Normalized Discounted Cumulative Gain
- MRR: Mean Reciprocal Rank
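To make the definitions concrete, here is how Recall@K and MRR come out for a toy ranked result list, computed by hand with the standard formulas (independent of the library's internals):
retrieved = ["doc2", "doc5", "doc1"]   # ranked retrieval output
relevant = {"doc1", "doc3"}            # ground-truth relevant documents

# Recall@3: relevant docs in the top 3 / total relevant = 1 / 2 = 0.5
recall_at_3 = len(set(retrieved[:3]) & relevant) / len(relevant)

# MRR: reciprocal rank of the first relevant hit (doc1 at rank 3) = 1/3
mrr = next((1 / (i + 1) for i, doc in enumerate(retrieved) if doc in relevant), 0.0)

print(recall_at_3, mrr)  # 0.5 0.333...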
Generation Metrics
- Faithfulness: Answer grounding in retrieved context
- Relevance: Answer relevance to query
- Coherence: Answer logical consistency
- Semantic Similarity: Embedding-based similarity
Runtime Requirements
- Retrieval metrics work offline after pip install rag-eval-pro
- SemanticSimilarity(backend="sentence-transformers") needs pip install rag-eval-pro[semantic]
- LLM judge metrics need pip install rag-eval-pro[llm] plus provider credentials
- For local-only demos and CI, use provider="mock" for LLM metrics
Hallucination Detection
- Context Contradiction: Detects contradictions with context
- Factual Consistency: Checks factual alignment
- Unsupported Claims: Identifies unsupported statements
🛠️ Configuration
Create a config.yaml:
evaluator:
cache_enabled: true
cache_dir: ".cache/rag_eval"
max_retries: 3
timeout: 30
llm:
provider: "openai"
model: "gpt-4"
temperature: 0.0
max_tokens: 500
embeddings:
model: "all-MiniLM-L6-v2"
batch_size: 32
logging:
level: "INFO"
format: "json"
Load configuration:
from rag_eval.utils import load_config
config = load_config("config.yaml")
evaluator = RAGEvaluator.from_config(config)
🧪 Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=rag_eval --cov-report=html
# Run specific test categories
pytest -m unit
pytest -m integration
pytest -m "not slow"
🚢 Publishing Notes
- Keep the base package lightweight and offline-safe
- Treat cloud-backed metrics and embedding models as optional extras
- Use provider="mock" in tests, examples, and CI when you do not want live API calls
- Document required environment variables for provider-backed metrics
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
# Setup development environment
git clone https://github.com/shivangis22/rag-eval.git
cd rag-eval
pip install -e ".[dev]"
pre-commit install
# Run tests
pytest
# Format code
black .
ruff check --fix .
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Inspired by RAGAS, TruLens, and other RAG evaluation frameworks
- Built with modern Python best practices
- Community-driven development
📧 Contact
- Issues: GitHub Issues
- Email: shivangis2208@gmail.com
🗺️ Roadmap
- Support for more LLM providers (Cohere, Gemini)
- Multi-modal RAG evaluation
- Real-time evaluation dashboard
- A/B testing framework
- Automated dataset generation
- Integration with MLOps platforms
Star ⭐ this repo if you find it useful!
Download files
File details
Details for the file rag_eval_pro-0.1.1.tar.gz.
File metadata
- Download URL: rag_eval_pro-0.1.1.tar.gz
- Upload date:
- Size: 51.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 155b116040cf7b4b11f47abbfa33fafdb9e748782ac515859611e1ce03ef7ac2 |
| MD5 | bc60544b9d716aad42b7ef31976e061b |
| BLAKE2b-256 | 77f9e8fdce1a5f62b72299f6aaee2c175e1adcdb9bd3aa9cf0c9e414847da78b |
File details
Details for the file rag_eval_pro-0.1.1-py3-none-any.whl.
File metadata
- Download URL: rag_eval_pro-0.1.1-py3-none-any.whl
- Upload date:
- Size: 51.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bd2feeaed42dec4ecfdf6cb29827b65fdb430d08524f39098e870af02974c3e9 |
| MD5 | a8a01e60cc104bfd2391992c4cc05ec1 |
| BLAKE2b-256 | 76907deb0cec791d8207c641ef47b9eba3ada7b4c064209928ebe86461ec1598 |