CI/CD-integrated RAG evaluation pipeline — quality gate for AI chatbots using Ragas + Groq LLM judge

rag-eval

A CI/CD-integrated evaluation pipeline for RAG systems.

Python 3.10+ | License: MIT

rag-eval acts as a quality gate for your RAG applications. It evaluates pull requests and can block merges when output quality drops below the defined thresholds.

How it works

When a pull request is opened, the GitHub Action:

  1. Installs the rag-eval-gate package, which provides the rag-eval command.
  2. Loads a golden evaluation dataset (from Hugging Face or a local file).
  3. Runs the dataset through your mock RAG pipeline.
  4. Evaluates the outputs using Ragas metrics.
  5. Checks scores against your defined thresholds in eval_config.yaml.
  6. Pushes metrics to Grafana for trend tracking.
  7. Posts a summary comment on the pull request.
  8. Fails the CI job if any metric drops below the threshold.
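
Steps 5 and 8 above come down to comparing each metric with its minimum and returning a non-zero exit code on any failure. The following is a minimal sketch of that gate logic, not rag-eval's actual implementation. The names `check_thresholds` and `exit_code` are hypothetical, as is the convention of treating a missing metric as a failure; the threshold keys mirror those in eval_config.yaml.

```python
# Threshold keys mirror eval_config.yaml; values are the documented defaults.
THRESHOLDS = {
    "faithfulness_min": 0.75,
    "context_relevance_min": 0.70,
    "answer_correctness_min": 0.65,
    "token_efficiency_min": 0.50,
}


def check_thresholds(scores, thresholds):
    """Return (metric, score, minimum) for every metric below its minimum.

    Hypothetical helper: a metric missing from `scores` counts as a failure.
    """
    failures = []
    for key, minimum in thresholds.items():
        metric = key.removesuffix("_min")  # "faithfulness_min" -> "faithfulness"
        score = scores.get(metric)
        if score is None or score < minimum:
            failures.append((metric, score, minimum))
    return failures


def exit_code(failures):
    """Map the failure list to a process exit code: 0 passes, 1 fails the job."""
    return 1 if failures else 0
```

A CI step could pass `exit_code(check_thresholds(scores, THRESHOLDS))` to `sys.exit`, so that any failing metric fails the job. A score exactly equal to its minimum passes, matching the "≥" in the defaults.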

Evaluation Metrics

| Metric | What It Measures | Default Threshold |
|---|---|---|
| Faithfulness | Answers are grounded in retrieved context | ≥ 0.75 |
| Context Relevance | Retrieved context quality | ≥ 0.70 |
| Answer Correctness | Accuracy vs ground truth | ≥ 0.65 |
| Token Efficiency | correctness / log(1 + tokens) | ≥ 0.50 |
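
To make the Token Efficiency formula concrete, here is a small sketch. It assumes the natural logarithm and that "tokens" is the token count of the generated answer; the README states neither, so treat both as assumptions. The function name `token_efficiency` is illustrative, not part of the package's documented API.

```python
import math


def token_efficiency(correctness: float, tokens: int) -> float:
    """Compute correctness / log(1 + tokens), using the natural log (an assumption)."""
    if tokens < 1:
        # log(1 + 0) is 0, which would divide by zero.
        raise ValueError("tokens must be at least 1")
    return correctness / math.log1p(tokens)  # log1p(x) == log(1 + x)
```

Under these assumptions the score falls as the answer gets longer at fixed correctness, which rewards concise answers. The package's actual log base or normalization may differ.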

The default LLM judge is groq/llama-3.3-70b-versatile, called through LiteLLM.

Quick Start

# Install
pip install rag-eval-gate

# Set API key
export GROQ_API_KEY="your_api_key"

# Run evaluation
rag-eval run

# View report
rag-eval report

Try the Hallucination Demo 🚨

Want to see rag-eval catch a hallucinating model in real time? The terminal demo deliberately forces the mock RAG pipeline to hallucinate an answer about "RLHF" to show the quality gate catching it:

# Make sure GROQ_API_KEY is exported, then run:
python examples/demo.py

GitHub Actions Setup

Add this workflow to .github/workflows/rag_eval.yml:

name: RAG Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install rag-eval-gate
      - run: rag-eval run --config eval_config.yaml
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}

Ensure you set GROQ_API_KEY in your GitHub repository secrets.

Configuration

You can customize the passing thresholds and the dataset source in eval_config.yaml:

thresholds:
  faithfulness_min: 0.75
  context_relevance_min: 0.70
  answer_correctness_min: 0.65
  token_efficiency_min: 0.50

dataset:
  hf_repo: "manikbodamwad/rag-eval-golden" 

Local Development

git clone https://github.com/manikbodamwad/rag-eval
cd rag-eval
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

cp .env.example .env

# Run local evaluation
rag-eval run

# View formatted report
rag-eval report

# Run unit tests
python -m pytest tests/

Golden Dataset

The default test set is published as manikbodamwad/rag-eval-golden on Hugging Face. To use your own dataset, create a JSONL file with one JSON object per line, each following this schema:

{"question": "What is X?", "ground_truth": "X is ...", "reference_context": "The passage that answers this..."}

Then specify the local path or your own HF repo in eval_config.yaml.
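
Before pointing the config at a custom dataset, it can help to check that every line carries the three fields shown in the schema. The sketch below does that with the standard library only. `parse_golden_jsonl` is a hypothetical helper, not something rag-eval provides, and rejecting empty field values is an assumption about what a usable record looks like.

```python
import json

# The three fields shown in the schema above.
REQUIRED_FIELDS = ("question", "ground_truth", "reference_context")


def parse_golden_jsonl(text):
    """Parse JSONL text into a list of records, checking the documented fields."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        # A field counts as missing if it is absent or empty.
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            raise ValueError(
                f"line {line_no}: missing or empty field(s): {', '.join(missing)}"
            )
        records.append(record)
    return records
```

Read your dataset file as UTF-8 text and pass the contents in; a `ValueError` names the first offending line.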

License

MIT License.
