CI/CD-integrated RAG evaluation pipeline — quality gate for AI chatbots using Ragas + Groq LLM judge
rag-eval-gate
A CI/CD-integrated evaluation pipeline that acts as a quality gate for RAG (Retrieval-Augmented Generation) systems. Block bad PRs before they ship hallucinating AI to production.
The Problem
You ship a RAG chatbot. A teammate changes the prompt template. The retriever now returns irrelevant context. The LLM starts hallucinating. Nobody catches it until users complain.
rag-eval-gate prevents this by running automated evaluations on every Pull Request — just like unit tests, but for AI output quality.
How It Works
When a pull request is opened, the GitHub Action:
- Loads a curated test dataset (from Hugging Face or a local .jsonl file)
- Runs each question through your RAG pipeline
- Evaluates outputs using Ragas metrics with a Groq LLM judge
- Computes a custom Token Efficiency metric (quality per output token)
- Checks scores against configurable thresholds in eval_config.yaml
- Pushes metrics to Grafana Cloud for trend tracking
- Posts a formatted score table as a PR comment
- Fails the CI job if any metric drops below threshold, blocking the merge
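The threshold check in the last steps can be pictured with a small sketch. This is not the package's actual implementation: the function names `failed_metrics` and `gate_exit_code` are hypothetical, and only the threshold keys are taken from eval_config.yaml.

```python
from typing import Dict, List

# Keys mirror the thresholds section of eval_config.yaml.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "faithfulness_min": 0.75,
    "context_relevance_min": 0.70,
    "answer_correctness_min": 0.65,
    "token_efficiency_min": 0.50,
}


def failed_metrics(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or below their minimum."""
    failures = []
    for key, minimum in thresholds.items():
        metric = key.removesuffix("_min")  # requires Python 3.9+
        score = scores.get(metric)
        if score is None or score < minimum:
            failures.append(metric)
    return failures


def gate_exit_code(scores: Dict[str, float], thresholds: Dict[str, float]) -> int:
    """Return 0 when every metric passes and 1 otherwise (hypothetical helper)."""
    return 1 if failed_metrics(scores, thresholds) else 0


# Made-up scores for illustration: faithfulness is below its 0.75 minimum.
example_scores = {
    "faithfulness": 0.62,
    "context_relevance": 0.81,
    "answer_correctness": 0.70,
    "token_efficiency": 0.55,
}
print(failed_metrics(example_scores, DEFAULT_THRESHOLDS))
```

A CI step that passes such an exit code to `sys.exit` would fail the job on any regression, which is what blocks the merge.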
Evaluation Metrics
| Metric | What It Measures | Default Threshold |
|---|---|---|
| Faithfulness | Are answers grounded in retrieved context? | ≥ 0.75 |
| Context Relevance | Is the retrieved context relevant to the question? | ≥ 0.70 |
| Answer Correctness | How accurate is the answer vs ground truth? | ≥ 0.65 |
| Token Efficiency | Quality per output token (correctness / log(1 + tokens)) | ≥ 0.50 |
The default LLM Judge is groq/llama-3.3-70b-versatile via LiteLLM — fast, free, and swappable.
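The Token Efficiency formula in the table can be worked through with a short sketch. The table does not state the logarithm base, so the `log_base` parameter below is an assumption, as is the function name; the package's own computation may differ.

```python
import math


def token_efficiency(correctness: float, output_tokens: int, log_base: float = math.e) -> float:
    """Compute correctness / log(1 + tokens) for a chosen log base (assumed)."""
    if output_tokens < 1:
        # log(1 + 0) is 0, so the ratio is undefined for empty answers.
        raise ValueError("output_tokens must be at least 1")
    return correctness / math.log(1 + output_tokens, log_base)


# The same answer scores differently depending on the base.
print(round(token_efficiency(0.8, 99), 2))               # natural log
print(round(token_efficiency(0.8, 99, log_base=10), 2))  # base 10
```

The logarithm in the denominator penalizes long answers only mildly: doubling the answer length lowers the score far less than halving it would under a plain correctness-per-token ratio.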
Quick Start
# Install from PyPI
pip install rag-eval-gate
# Set your Groq API key (free at console.groq.com)
export GROQ_API_KEY="your_api_key"
# Run evaluation
rag-eval run
# View formatted report
rag-eval report
Try the Hallucination Demo 🚨
See rag-eval-gate catch a hallucinating AI in real time. This demo intentionally forces the mock RAG pipeline to hallucinate an answer about "RLHF", showing the quality gate failing the run:
python examples/demo.py
GitHub Actions Setup
Add this workflow to .github/workflows/rag_eval.yml:
name: RAG Evaluation
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read        # needed by actions/checkout in private repos; unlisted permissions default to none
      pull-requests: write  # lets the job post the score table as a PR comment
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install rag-eval-gate
      - run: rag-eval run --config eval_config.yaml
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
Set GROQ_API_KEY in your GitHub repository secrets (Settings → Secrets and variables → Actions).
Configuration
Customize thresholds and model settings in eval_config.yaml:
thresholds:
  faithfulness_min: 0.75
  context_relevance_min: 0.70
  answer_correctness_min: 0.65
  token_efficiency_min: 0.50
model:
  judge: "groq/llama-3.3-70b-versatile"
  rag_generator: "groq/llama-3.3-70b-versatile"
  embeddings: "sentence-transformers/all-MiniLM-L6-v2"
dataset:
  hf_repo: "Manik24/rag-eval-golden"
Architecture
┌─────────────────────────────────────────────────────┐
│ GitHub Actions CI │
├─────────────────────────────────────────────────────┤
│ │
│ Test Dataset (HF Hub / local JSONL) │
│ │ │
│ ▼ │
│ RAG Pipeline (FAISS + Groq LLM via LiteLLM) │
│ │ │
│ ▼ │
│ Ragas Evaluation (Faithfulness, Relevance, etc.) │
│ │ │
│ ▼ │
│ Regression Gate (pass/fail vs thresholds) │
│ │ │
│ ┌────┴────┐ │
│ ▼ ▼ │
│ ✅ Pass ❌ Fail → Block PR merge │
│ │ │ │
│ ▼ ▼ │
│ PR Comment + Grafana Metrics Push │
│ │
└─────────────────────────────────────────────────────┘
Bring Your Own Pipeline
The library ships with a demo RAG pipeline, but you can plug in your own. Subclass BaseRAGPipeline, implement two methods, and point your config at it:
# my_pipeline.py
from rag_eval import BaseRAGPipeline, RAGResult


class MyPipeline(BaseRAGPipeline):
    def init(self):
        """Called once before evaluation starts. Load your models here."""
        self.db = load_my_vectorstore()
        self.llm = load_my_llm()

    def query(self, question: str) -> RAGResult:
        """Called for each question in the test dataset."""
        docs = self.db.search(question, k=3)
        answer = self.llm.generate(question, docs)
        return RAGResult(
            question=question,
            answer=answer,
            contexts=[d.text for d in docs],
            input_tokens=...,   # optional, for token efficiency metric
            output_tokens=...,  # optional, for token efficiency metric
        )
Then set pipeline.class in your eval_config.yaml:
pipeline:
  class: "my_pipeline.MyPipeline"
thresholds:
  faithfulness_min: 0.75
  context_relevance_min: 0.70
  answer_correctness_min: 0.65
  token_efficiency_min: 0.50
Run it:
export GROQ_API_KEY="..."
rag-eval run --config eval_config.yaml
The evaluator will import your class, call init() once, then call query() for each test question.
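For intuition, a dotted path such as "my_pipeline.MyPipeline" is commonly resolved with `importlib`. The sketch below is an assumption about how that could look, not the evaluator's real code; `load_class` and `run_pipeline` are hypothetical names.

```python
import importlib
from typing import Any, Iterable, List


def load_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    obj = getattr(module, class_name)
    if not isinstance(obj, type):
        raise TypeError(f"{dotted_path!r} does not name a class")
    return obj


def run_pipeline(dotted_path: str, questions: Iterable[str]) -> List[Any]:
    """Instantiate the class, call init() once, then query() per question."""
    pipeline = load_class(dotted_path)()
    pipeline.init()
    return [pipeline.query(question) for question in questions]


# Any importable class resolves the same way, for example one from the stdlib.
print(load_class("collections.OrderedDict"))
```

Under this approach my_pipeline.py has to be importable, for example by placing its directory on PYTHONPATH. Whether rag-eval-gate adds the working directory to the import path itself is not stated here.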
Tech Stack
- Evaluation: Ragas for LLM-as-judge metrics
- LLM Provider: Groq via LiteLLM (hot-swappable to OpenAI, Anthropic, etc.)
- Embeddings: sentence-transformers (local, no API calls)
- Vector Store: FAISS (CPU, local)
- Dataset: Hugging Face Datasets
- Observability: Grafana Cloud via Influx Line Protocol
- CLI: Click + Rich
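To make the observability entry concrete, here is a minimal sketch of formatting scores as one Influx Line Protocol line. The measurement name, tag names, and the function `to_line_protocol` are illustrative assumptions; the package's real metric names and the authenticated push to a Grafana Cloud endpoint are not shown.

```python
from typing import Mapping, Optional


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, float],
    timestamp_ns: Optional[int] = None,
) -> str:
    """Format: measurement[,tag=value...] field=value[,field=value...] [timestamp]."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    name = _escape(measurement, ", ")  # measurements escape commas and spaces
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={float(v)!r}" for k, v in fields.items()
    )
    line = f"{name}{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"  # nanoseconds since the Unix epoch
    return line


# Hypothetical measurement and tag names, made-up scores.
print(to_line_protocol(
    "rag_eval",
    {"branch": "main"},
    {"faithfulness": 0.82, "answer_correctness": 0.71},
    1700000000000000000,
))
```

One line per evaluation run, tagged by branch or commit, is enough for a dashboard to plot each metric over time.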
Local Development
git clone https://github.com/ManikBodamwad/RAG-EVAL.git
cd RAG-EVAL
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
# Run local evaluation
rag-eval run
# View formatted report
rag-eval report
# Run unit tests
python -m pytest tests/
Test Dataset
The default test set is hosted at Manik24/rag-eval-golden on Hugging Face. To use your own dataset, create a JSONL file with the following schema:
{"question": "What is X?", "ground_truth": "X is ...", "reference_context": "The passage that answers this..."}
Then specify the local path or your own HF repo in eval_config.yaml.
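Before pointing the config at a custom file, it can help to check that every line matches the schema above. The following is a minimal validation sketch; treating all three keys as required is an assumption, and `parse_jsonl` is a hypothetical helper rather than part of the package.

```python
import json
from typing import Dict, Iterable, List

# Keys taken from the example record; assumed to be required on every line.
REQUIRED_KEYS = ("question", "ground_truth", "reference_context")


def parse_jsonl(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse JSONL lines and fail fast on any record missing a required key."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {number}: missing keys {missing}")
        records.append(record)
    return records


sample = [
    '{"question": "What is X?", "ground_truth": "X is ...", "reference_context": "The passage that answers this..."}',
]
print(len(parse_jsonl(sample)))
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `parse_jsonl(open("my_dataset.jsonl", encoding="utf-8"))` with a file name of your choosing.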
License
MIT License.