CI/CD-integrated RAG evaluation pipeline — quality gate for AI chatbots using Ragas + Groq LLM judge

Project description

rag-eval-gate

A CI/CD-integrated evaluation pipeline that acts as a quality gate for RAG (Retrieval-Augmented Generation) systems. Block bad PRs before they ship hallucinating AI to production.

The Problem

You ship a RAG chatbot. A teammate changes the prompt template. The retriever now returns irrelevant context. The LLM starts hallucinating. Nobody catches it until users complain.

rag-eval-gate prevents this by running automated evaluations on every Pull Request — just like unit tests, but for AI output quality.

How It Works

When a pull request is opened, the GitHub Action:

Loads a curated test dataset (from Hugging Face or a local .jsonl file)
Runs each question through your RAG pipeline
Evaluates outputs using Ragas metrics with a Groq LLM judge
Computes a custom Token Efficiency metric (quality per output token)
Checks scores against configurable thresholds in eval_config.yaml
Pushes metrics to Grafana Cloud for trend tracking
Posts a formatted score table as a PR comment
Fails the CI job if any metric drops below threshold — blocking the merge
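The final gating step above can be sketched in a few lines of Python. This is an illustrative sketch, not the package's actual implementation: the function name `check_thresholds` and the example scores are hypothetical, and only the `*_min` key names come from the eval_config.yaml shown in this README.

```python
def check_thresholds(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for key, minimum in thresholds.items():
        metric = key.removesuffix("_min")  # "faithfulness_min" -> "faithfulness"
        score = scores.get(metric)
        if score is None:
            failures.append(f"{metric}: no score reported")
        elif score < minimum:
            failures.append(f"{metric}: {score:.2f} < {minimum:.2f}")
    return failures


# Default thresholds as listed in eval_config.yaml.
thresholds = {
    "faithfulness_min": 0.75,
    "context_relevance_min": 0.70,
    "answer_correctness_min": 0.65,
    "token_efficiency_min": 0.50,
}

# Made-up scores for illustration only.
scores = {
    "faithfulness": 0.81,
    "context_relevance": 0.74,
    "answer_correctness": 0.58,
    "token_efficiency": 0.55,
}

failures = check_thresholds(scores, thresholds)
exit_code = 1 if failures else 0  # a CI step would pass this to sys.exit()
for message in failures:
    print("FAIL", message)
```

A non-zero exit code is what makes the CI job fail, which in turn blocks the merge when the check is required by branch protection.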

Evaluation Metrics

Metric	What It Measures	Default Threshold
Faithfulness	Are answers grounded in retrieved context?	≥ 0.75
Context Relevance	Is the retrieved context relevant to the question?	≥ 0.70
Answer Correctness	How accurate is the answer vs ground truth?	≥ 0.65
Token Efficiency	Quality per output token (`correctness / log(1 + tokens)`)	≥ 0.50
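
To make the Token Efficiency formula concrete, here is a small sketch of `correctness / log(1 + tokens)`. The README does not state the logarithm base or any scaling, so this assumes the natural logarithm; the function name `token_efficiency` and the zero-token handling are also assumptions. Under that assumption, a score of 0.50 is only reachable for answers of about six tokens or fewer, so the package may apply a base or scaling that is not documented here.

```python
import math


def token_efficiency(correctness: float, output_tokens: int) -> float:
    """Correctness discounted by the log of the answer length (natural log assumed)."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if output_tokens == 0:
        return 0.0  # assumption: an empty answer earns no efficiency credit
    return correctness / math.log(1 + output_tokens)


# Same correctness, different answer lengths: the shorter answer scores higher.
short_answer = token_efficiency(0.9, 50)
long_answer = token_efficiency(0.9, 400)
print(f"50 tokens:  {short_answer:.3f}")
print(f"400 tokens: {long_answer:.3f}")
```

The logarithm keeps the penalty gentle: going from 50 to 400 tokens lowers the score, but far less than a linear per-token penalty would.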

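The default LLM judge is groq/llama-3.3-70b-versatile, called through LiteLLM. It is fast, usable with a free Groq API key, and can be swapped for another LiteLLM-supported model in eval_config.yaml.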

Quick Start

# Install from PyPI
pip install rag-eval-gate

# Set your Groq API key (free at console.groq.com)
export GROQ_API_KEY="your_api_key"

# Run evaluation
rag-eval run

# View formatted report
rag-eval report

Try the Hallucination Demo 🚨

See rag-eval-gate catch a hallucinating AI in real time. This demo intentionally forces the mock RAG pipeline to hallucinate an answer about "RLHF" to show that the quality gate catches it:

python examples/demo.py

GitHub Actions Setup

Add this workflow to .github/workflows/rag_eval.yml:

name: RAG Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
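    permissions:
      contents: read        # needed by actions/checkout once permissions are set explicitly
      pull-requests: write  # lets the job post the score table as a PR comment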
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install rag-eval-gate
      - run: rag-eval run --config eval_config.yaml
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}

Add GROQ_API_KEY to your GitHub repository secrets (Settings → Secrets and variables → Actions).

Configuration

Customize thresholds and model settings in eval_config.yaml:

thresholds:
  faithfulness_min: 0.75
  context_relevance_min: 0.70
  answer_correctness_min: 0.65
  token_efficiency_min: 0.50

model:
  judge: "groq/llama-3.3-70b-versatile"
  rag_generator: "groq/llama-3.3-70b-versatile"
  embeddings: "sentence-transformers/all-MiniLM-L6-v2"

dataset:
  hf_repo: "Manik24/rag-eval-golden"

Architecture

┌─────────────────────────────────────────────────────┐
│                  GitHub Actions CI                   │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Test Dataset (HF Hub / local JSONL)                │
│         │                                           │
│         ▼                                           │
│  RAG Pipeline (FAISS + Groq LLM via LiteLLM)       │
│         │                                           │
│         ▼                                           │
│  Ragas Evaluation (Faithfulness, Relevance, etc.)   │
│         │                                           │
│         ▼                                           │
│  Regression Gate (pass/fail vs thresholds)          │
│         │                                           │
│    ┌────┴────┐                                      │
│    ▼         ▼                                      │
│  ✅ Pass   ❌ Fail → Block PR merge                 │
│    │         │                                      │
│    ▼         ▼                                      │
│  PR Comment + Grafana Metrics Push                  │
│                                                     │
└─────────────────────────────────────────────────────┘

Bring Your Own Pipeline

The library ships with a demo RAG pipeline, but you can plug in your own. Subclass BaseRAGPipeline, implement two methods, and point your config at it:

# my_pipeline.py
from rag_eval import BaseRAGPipeline, RAGResult

class MyPipeline(BaseRAGPipeline):
    def init(self):
        """Called once before evaluation starts. Load your models here."""
        # load_my_vectorstore() and load_my_llm() are placeholders for your own loaders
        self.db = load_my_vectorstore()
        self.llm = load_my_llm()

    def query(self, question: str) -> RAGResult:
        """Called for each question in the test dataset."""
        docs = self.db.search(question, k=3)
        answer = self.llm.generate(question, docs)
        return RAGResult(
            question=question,
            answer=answer,
            contexts=[d.text for d in docs],
            input_tokens=...,   # optional, for token efficiency metric
            output_tokens=...,  # optional, for token efficiency metric
        )

Then set pipeline.class in your eval_config.yaml:

pipeline:
  class: "my_pipeline.MyPipeline"

thresholds:
  faithfulness_min: 0.75
  context_relevance_min: 0.70
  answer_correctness_min: 0.65
  token_efficiency_min: 0.50

Run it:

export GROQ_API_KEY="..."
rag-eval run --config eval_config.yaml

The evaluator will import your class, call init() once, then call query() for each test question.
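
The load-and-call sequence described above can be sketched with the standard library. This is a guess at the general mechanism, not the package's actual loader: `load_class`, `run_pipeline`, and the stand-in `EchoPipeline` are hypothetical names, and the stand-in returns plain dicts instead of RAGResult so the snippet runs without rag-eval-gate installed.

```python
import importlib


def load_class(dotted_path: str) -> type:
    """Resolve a "module.ClassName" string, like the pipeline.class config value."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def run_pipeline(pipeline, questions: list[str]) -> list:
    """Call init() once, then query() for each test question."""
    pipeline.init()
    return [pipeline.query(question) for question in questions]


class EchoPipeline:
    """Stand-in pipeline used only to exercise the call order."""

    def init(self):
        self.init_calls = getattr(self, "init_calls", 0) + 1

    def query(self, question: str) -> dict:
        return {"question": question, "answer": f"echo: {question}", "contexts": []}


pipeline = EchoPipeline()
results = run_pipeline(pipeline, ["What is RAG?", "What is RLHF?"])
print(pipeline.init_calls, len(results))
```

With a real config, the class would come from `load_class("my_pipeline.MyPipeline")`, which requires my_pipeline.py to be importable from the working directory or PYTHONPATH.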

Tech Stack

Evaluation: Ragas for LLM-as-judge metrics
LLM Provider: Groq via LiteLLM (hot-swappable to OpenAI, Anthropic, etc.)
Embeddings: sentence-transformers (local, no API calls)
Vector Store: FAISS (CPU, local)
Dataset: Hugging Face Datasets
Observability: Grafana Cloud via InfluxDB line protocol
CLI: Click + Rich

Local Development

git clone https://github.com/ManikBodamwad/RAG-EVAL.git
cd RAG-EVAL
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

cp .env.example .env

# Run local evaluation
rag-eval run

# View formatted report
rag-eval report

# Run unit tests
python -m pytest tests/

Test Dataset

The default test set is hosted at Manik24/rag-eval-golden on Hugging Face. To use your own dataset, create a JSONL file with the following schema:

{"question": "What is X?", "ground_truth": "X is ...", "reference_context": "The passage that answers this..."}

Then specify the local path or your own HF repo in eval_config.yaml.
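
Before pointing the evaluator at a custom dataset, it can help to check that every line matches the schema above. The following is a minimal standalone validator, not part of rag-eval-gate; the function name `parse_jsonl` is hypothetical, and only the three field names come from this README.

```python
import json

REQUIRED_KEYS = ("question", "ground_truth", "reference_context")


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL text and verify each record carries the required keys."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        records.append(record)
    return records


sample = (
    '{"question": "What is X?", "ground_truth": "X is ...", '
    '"reference_context": "The passage that answers this..."}\n'
)
records = parse_jsonl(sample)
print(len(records), "valid record(s)")
```

To validate a file on disk, pass its contents, for example `parse_jsonl(open("my_dataset.jsonl", encoding="utf-8").read())`, where the file name is a placeholder.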

License

MIT License.

Project details

Maintainers

manik24

Release history

0.2.1 (this version): Jun 15, 2026
0.2.0: Jun 15, 2026
0.1.0: Jun 15, 2026

Download files

Source Distribution

rag_eval_gate-0.2.1.tar.gz (24.7 kB)

Uploaded Jun 15, 2026 Source

Built Distribution

rag_eval_gate-0.2.1-py3-none-any.whl (19.9 kB)

Uploaded Jun 15, 2026 Python 3

File details

Details for the file rag_eval_gate-0.2.1.tar.gz.

File metadata

Download URL: rag_eval_gate-0.2.1.tar.gz
Upload date: Jun 15, 2026
Size: 24.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rag_eval_gate-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`86925001f084fe0cfe8f3ffca2cac194ca4e17bded73863cc9eba3e6ba61a575`
MD5	`16671df831110d6f4b9517fd682734e7`
BLAKE2b-256	`ace9aec90ebb7a12ff3d9a02b9f386c16d2f9c8ff02d2e03fc60de0ef779e9a5`

Provenance

The following attestation bundles were made for rag_eval_gate-0.2.1.tar.gz:

Publisher: publish.yml on ManikBodamwad/RAG-EVAL

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rag_eval_gate-0.2.1.tar.gz
- Subject digest: 86925001f084fe0cfe8f3ffca2cac194ca4e17bded73863cc9eba3e6ba61a575
- Sigstore transparency entry: 1824367809
- Sigstore integration time: Jun 15, 2026
Source repository:
- Permalink: ManikBodamwad/RAG-EVAL@a07d528455d75bfaf9ce16b94e8d7c6ba4d5d6ca
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/ManikBodamwad
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a07d528455d75bfaf9ce16b94e8d7c6ba4d5d6ca
- Trigger Event: push

File details

Details for the file rag_eval_gate-0.2.1-py3-none-any.whl.

File metadata

Download URL: rag_eval_gate-0.2.1-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 19.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rag_eval_gate-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`92096f33061b68a8ff23c2beb5def3bc820dc6a7b386d290b1de8d06e47d1bb1`
MD5	`bdeb07951c43b4cd211093619dc22bb0`
BLAKE2b-256	`dadffe93f54900f5f89915886848a545a7bcd4ea906baa82ececd351da64becd`

Provenance

The following attestation bundles were made for rag_eval_gate-0.2.1-py3-none-any.whl:

Publisher: publish.yml on ManikBodamwad/RAG-EVAL

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rag_eval_gate-0.2.1-py3-none-any.whl
- Subject digest: 92096f33061b68a8ff23c2beb5def3bc820dc6a7b386d290b1de8d06e47d1bb1
- Sigstore transparency entry: 1824367881
- Sigstore integration time: Jun 15, 2026
Source repository:
- Permalink: ManikBodamwad/RAG-EVAL@a07d528455d75bfaf9ce16b94e8d7c6ba4d5d6ca
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/ManikBodamwad
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a07d528455d75bfaf9ce16b94e8d7c6ba4d5d6ca
- Trigger Event: push
