CI/CD-integrated RAG evaluation pipeline — quality gate for AI chatbots using Ragas + Groq LLM judge

rag-eval-gate

A CI/CD-integrated evaluation pipeline that acts as a quality gate for RAG (Retrieval-Augmented Generation) systems. Block bad PRs before they ship hallucinating AI to production.

Requires Python 3.10+. License: MIT.


The Problem

You ship a RAG chatbot. A teammate changes the prompt template. The retriever now returns irrelevant context. The LLM starts hallucinating. Nobody catches it until users complain.

rag-eval-gate prevents this by running automated evaluations on every Pull Request — just like unit tests, but for AI output quality.

How It Works

When a pull request is opened, the GitHub Action:

  1. Loads a curated test dataset (from Hugging Face or a local .jsonl file)
  2. Runs each question through your RAG pipeline
  3. Evaluates outputs using Ragas metrics with a Groq LLM judge
  4. Computes a custom Token Efficiency metric (quality per output token)
  5. Checks scores against configurable thresholds in eval_config.yaml
  6. Pushes metrics to Grafana Cloud for trend tracking
  7. Posts a formatted score table as a PR comment
  8. Fails the CI job if any metric drops below its threshold, which blocks the merge when the check is marked as required in branch protection
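
Steps 5 and 8 above amount to a threshold comparison followed by a non-zero exit code. The sketch below is a minimal illustration of that idea, not the library's actual implementation: the function names check_thresholds and gate_exit_code and the sample scores are hypothetical.

```python
def check_thresholds(scores, thresholds):
    """Return (metric, score, minimum) for every metric below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        # A missing score counts as a failure so a metric cannot silently drop out.
        if score is None or score < minimum:
            failures.append((metric, score, minimum))
    return failures


def gate_exit_code(failures):
    """Map the failure list to a process exit code (0 = pass, 1 = fail)."""
    return 1 if failures else 0


# Hypothetical scores for illustration; minimums mirror the documented defaults.
scores = {"faithfulness": 0.81, "context_relevance": 0.66, "answer_correctness": 0.70}
thresholds = {"faithfulness": 0.75, "context_relevance": 0.70, "answer_correctness": 0.65}

failures = check_thresholds(scores, thresholds)
for metric, score, minimum in failures:
    print(f"FAIL {metric}: {score} < {minimum}")
# A CI entry point would pass this value to sys.exit() to fail the job.
print("exit code:", gate_exit_code(failures))
```

In this sample only context_relevance falls below its minimum, so the gate reports one failure and a non-zero exit code.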

Evaluation Metrics

| Metric | What It Measures | Default Threshold |
| --- | --- | --- |
| Faithfulness | Are answers grounded in retrieved context? | ≥ 0.75 |
| Context Relevance | Is the retrieved context relevant to the question? | ≥ 0.70 |
| Answer Correctness | How accurate is the answer vs ground truth? | ≥ 0.65 |
| Token Efficiency | Quality per output token (correctness / log(1 + tokens)) | ≥ 0.50 |

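The default LLM judge is groq/llama-3.3-70b-versatile, called through LiteLLM. It is fast, and a free API key is available from Groq. To use a different judge, change model.judge in eval_config.yaml.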
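
To make the Token Efficiency formula in the table concrete, here is a small sketch. The README does not state which logarithm base the library uses, so the base parameter (natural log by default) and the function name token_efficiency are assumptions made for illustration.

```python
import math


def token_efficiency(correctness: float, output_tokens: int, base: float = math.e) -> float:
    """Compute correctness / log(1 + tokens).

    The log base is an assumption; the formula in the metrics table does not specify it.
    """
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return correctness / math.log(1 + output_tokens, base)


# Same correctness, different answer lengths: the longer answer scores lower.
short_answer = token_efficiency(0.8, output_tokens=10)
long_answer = token_efficiency(0.8, output_tokens=100)
print(short_answer > long_answer)
```

Because the denominator grows with the logarithm of the token count, a longer answer with the same correctness scores lower. For example, with base 10 and 99 output tokens the denominator is log10(100) = 2, so a correctness of 0.8 yields 0.4.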

Quick Start

# Install from PyPI
pip install rag-eval-gate

# Set your Groq API key (free at console.groq.com)
export GROQ_API_KEY="your_api_key"

# Run evaluation
rag-eval run

# View formatted report
rag-eval report

Try the Hallucination Demo 🚨

See rag-eval-gate catch a hallucinating AI in real time. This demo intentionally forces the mock RAG pipeline to hallucinate an answer about "RLHF", so you can watch the quality gate fail the run:

python examples/demo.py

GitHub Actions Setup

Add this workflow to .github/workflows/rag_eval.yml:

name: RAG Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
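    permissions:
      contents: read        # needed by actions/checkout once permissions are restricted
      pull-requests: write  # lets the job post the score table as a PR comment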
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install rag-eval-gate
      - run: rag-eval run --config eval_config.yaml
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}

Set GROQ_API_KEY in your GitHub repository secrets (Settings → Secrets and variables → Actions).

Configuration

Customize thresholds and model settings in eval_config.yaml:

thresholds:
  faithfulness_min: 0.75
  context_relevance_min: 0.70
  answer_correctness_min: 0.65
  token_efficiency_min: 0.50

model:
  judge: "groq/llama-3.3-70b-versatile"
  rag_generator: "groq/llama-3.3-70b-versatile"
  embeddings: "sentence-transformers/all-MiniLM-L6-v2"

dataset:
  hf_repo: "Manik24/rag-eval-golden"

Architecture

┌─────────────────────────────────────────────────────┐
│                  GitHub Actions CI                   │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Test Dataset (HF Hub / local JSONL)                │
│         │                                           │
│         ▼                                           │
│  RAG Pipeline (FAISS + Groq LLM via LiteLLM)       │
│         │                                           │
│         ▼                                           │
│  Ragas Evaluation (Faithfulness, Relevance, etc.)   │
│         │                                           │
│         ▼                                           │
│  Regression Gate (pass/fail vs thresholds)          │
│         │                                           │
│    ┌────┴────┐                                      │
│    ▼         ▼                                      │
│  ✅ Pass   ❌ Fail → Block PR merge                 │
│    │         │                                      │
│    ▼         ▼                                      │
│  PR Comment + Grafana Metrics Push                  │
│                                                     │
└─────────────────────────────────────────────────────┘

Bring Your Own Pipeline

The library ships with a demo RAG pipeline, but you can plug in your own. Subclass BaseRAGPipeline, implement two methods, and point your config at it:

# my_pipeline.py
from rag_eval import BaseRAGPipeline, RAGResult

class MyPipeline(BaseRAGPipeline):
    def init(self):
        """Called once before evaluation starts. Load your models here."""
        self.db = load_my_vectorstore()
        self.llm = load_my_llm()

    def query(self, question: str) -> RAGResult:
        """Called for each question in the test dataset."""
        docs = self.db.search(question, k=3)
        answer = self.llm.generate(question, docs)
        return RAGResult(
            question=question,
            answer=answer,
            contexts=[d.text for d in docs],
            input_tokens=...,   # optional, for token efficiency metric
            output_tokens=...,  # optional, for token efficiency metric
        )

Then set pipeline.class in your eval_config.yaml:

pipeline:
  class: "my_pipeline.MyPipeline"

thresholds:
  faithfulness_min: 0.75
  context_relevance_min: 0.70
  answer_correctness_min: 0.65
  token_efficiency_min: 0.50

Run it:

export GROQ_API_KEY="..."
rag-eval run --config eval_config.yaml

The evaluator will import your class, call init() once, then call query() for each test question.
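
The import-then-call sequence described above can be sketched with the standard library's importlib. This is an illustration of the general technique under stated assumptions, not the evaluator's actual code: load_class and run_pipeline are hypothetical names, and the class is assumed to take no constructor arguments.

```python
import importlib


def load_class(dotted_path: str):
    """Resolve a 'module.ClassName' string to the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def run_pipeline(dotted_path: str, questions):
    """Instantiate the class, call init() once, then query() per question."""
    pipeline = load_class(dotted_path)()
    pipeline.init()
    return [pipeline.query(question) for question in questions]


# Resolving a standard-library class shows the lookup without extra files.
print(load_class("collections.OrderedDict").__name__)
```

With this kind of lookup, a value such as "my_pipeline.MyPipeline" only resolves if my_pipeline.py is importable, for example because it sits in the directory the command is run from or is installed as a package.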

Local Development

git clone https://github.com/ManikBodamwad/RAG-EVAL.git
cd RAG-EVAL
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

cp .env.example .env

# Run local evaluation
rag-eval run

# View formatted report
rag-eval report

# Run unit tests
python -m pytest tests/

Test Dataset

The default test set is hosted at Manik24/rag-eval-golden on Hugging Face. To use your own dataset, create a JSONL file with the following schema:

{"question": "What is X?", "ground_truth": "X is ...", "reference_context": "The passage that answers this..."}

Then reference it in eval_config.yaml, either as a local file path or by setting dataset.hf_repo to your own Hugging Face repo.
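
Before pointing the gate at a custom dataset, it can help to check that every line parses and carries the three fields from the schema above. The following is a minimal sketch using only the standard library. It assumes all three fields are required, non-empty strings, which the README implies but does not state, and validate_jsonl_lines is a hypothetical helper, not part of rag-eval-gate.

```python
import json

# Field names taken from the example record; treating them all as required is an assumption.
REQUIRED_FIELDS = ("question", "ground_truth", "reference_context")


def validate_jsonl_lines(lines):
    """Parse JSONL lines and return (valid_records, error_messages)."""
    records, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped, not reported
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [
            field for field in REQUIRED_FIELDS
            if not isinstance(record.get(field), str) or not record[field].strip()
        ]
        if missing:
            errors.append(f"line {lineno}: missing or empty fields: {', '.join(missing)}")
        else:
            records.append(record)
    return records, errors


sample = [
    '{"question": "What is X?", "ground_truth": "X is ...", "reference_context": "The passage that answers this..."}',
    '{"question": "Incomplete record"}',
]
records, errors = validate_jsonl_lines(sample)
print(len(records), "valid;", errors)
```

To check a file on disk, pass an open file handle, since iterating over it yields lines: validate_jsonl_lines(open("my_dataset.jsonl", encoding="utf-8")), where my_dataset.jsonl stands in for your own file name.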

License

MIT License.
