
CI evaluation framework for RAG and AI agents: groundedness, retrieval quality, hallucination, citations, latency, cost, and regression gating before production.


🛠️ rag-agent-eval-ci

Stop shipping RAG systems you can't test.

A CI evaluation framework for RAG and AI agents — gate every deploy on groundedness, retrieval quality, hallucination, citations, latency, cost, and regression.



⚠️ The Problem

Teams are shipping RAG assistants and AI agents into production, but most have no automated way to answer the one question that matters before a deploy:

[!IMPORTANT] "Is this version safe to ship, or did we just make it worse?"

A prompt tweak silently drops retrieval recall. A model swap doubles cost. A new chunking strategy starts hallucinating. Today these regressions are caught by users, not by CI.

rag-agent-eval-ci turns RAG quality into a pull-request gate. Developers write test questions in YAML; the tool measures groundedness, retrieval accuracy, hallucination, citations, latency, and cost, compares against a baseline, and fails the build when quality drops.


📺 Demo

Demo Screenshot


⚡ Quick Start (under 5 minutes, no API keys)

[!TIP] The default config uses a built-in mock provider + in-memory retriever, so the first run works with zero keys and zero network dependency.

Installation

pip install rag-agent-eval-ci
# or, from source:
git clone https://github.com/martian7777/rag-agent-eval-ci && cd rag-agent-eval-ci && pip install -e .

Run the bundled visa-enrollment example end-to-end:

rag-eval gate \
  --tests examples/visa/tests.yaml \
  -c examples/visa/rag_eval.yaml \
  -d examples/visa/docs

The exit code is 0 if the gate passes and 1 if it fails, so the command can be used directly as a CI pipeline step.
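
If the gate is driven from a Python script rather than a shell step, the exit-code contract above can be checked as in the following minimal sketch. The helper name `gate_passed` and the constant `QUICK_START_CMD` are hypothetical and not part of the tool. The command list only repeats the quick-start invocation.

```python
import subprocess
import sys

# The quick-start invocation from above, as an argument list.
QUICK_START_CMD = [
    "rag-eval", "gate",
    "--tests", "examples/visa/tests.yaml",
    "-c", "examples/visa/rag_eval.yaml",
    "-d", "examples/visa/docs",
]


def gate_passed(command: list[str]) -> bool:
    """Run a command and report whether it exited with status 0.

    Hypothetical helper: it relies only on the documented contract
    that exit code 0 means pass and 1 means fail.
    """
    result = subprocess.run(command, check=False)
    return result.returncode == 0
```

A wrapper script could then end with `sys.exit(0 if gate_passed(QUICK_START_CMD) else 1)` to pass the result on to the CI runner.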

[!NOTE] Prefer Docker? docker compose up brings the API, dashboard, Postgres, and a local Ollama up together.


📄 Example Input

A developer just writes questions and expectations in tests.yaml:

suite: visa-docs
questions:
  - question: "What documents are required for visa enrollment?"
    expected_sources: ["visa_checklist.pdf"]   # retrieval must surface this
    must_include: ["passport", "admission letter", "insurance"]  # answer completeness
    must_not_include: ["bank statement is required"]             # hallucination trap
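
To make the two phrase lists concrete, here is a rough sketch of how such checks could be scored. It is not the tool's implementation. The function name `phrase_checks` and the case-insensitive substring matching are assumptions.

```python
def phrase_checks(answer: str,
                  must_include: list[str],
                  must_not_include: list[str]) -> dict:
    """Score an answer against required and forbidden phrases.

    Assumption: matching is a case-insensitive substring test.
    """
    text = answer.lower()
    missing = [p for p in must_include if p.lower() not in text]
    forbidden_hits = [p for p in must_not_include if p.lower() in text]
    if must_include:
        completeness = (len(must_include) - len(missing)) / len(must_include)
    else:
        completeness = 1.0  # nothing was required
    return {
        "completeness": completeness,
        "missing": missing,
        "forbidden_hits": forbidden_hits,
    }
```

Under these assumptions, an answer that mentions a passport and an admission letter but not insurance would score two thirds on completeness and report `insurance` as missing.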

📊 Example Output

Every run produces clean console output and CI-friendly artifacts in .rag_eval/reports/:

| File | Description |
| --- | --- |
| report.json | Machine-readable full evaluation results. |
| junit.xml | Standard format that renders as native test results in GitHub/GitLab CI. |
| summary.md | Clean markdown summary intended to be auto-posted to PR comments or job summaries. |
| report.html | Self-contained, premium visual report (perfect as a CI build artifact). |
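
A downstream script can read the JUnit file with the standard library. The sketch below assumes the conventional JUnit layout (`testcase` elements with optional `failure` or `error` children). The exact attributes this tool writes are not documented here, so the sample XML and the name `summarize_junit` are illustrative only.

```python
import xml.etree.ElementTree as ET

# Illustrative sample in the conventional JUnit layout (not real tool output).
SAMPLE_JUNIT = """<testsuite name="visa-docs" tests="2" failures="1">
  <testcase name="q1"/>
  <testcase name="q2"><failure message="groundedness below threshold"/></testcase>
</testsuite>"""


def summarize_junit(xml_text: str) -> dict:
    """Count test cases and list the names of failed or errored ones."""
    root = ET.fromstring(xml_text)
    cases = list(root.iter("testcase"))
    failed = [
        case.get("name", "")
        for case in cases
        if case.find("failure") is not None or case.find("error") is not None
    ]
    return {"total": len(cases), "failed": failed}
```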

🎯 What it Measures

| Metric | Question it answers | Core Evaluation Method |
| --- | --- | --- |
| Retrieval accuracy | Did we retrieve the documents we expected? | Substring checking on returned sources vs expected. |
| Groundedness | Is the answer actually supported by the retrieved context? | LLM-judge verification or fallback token overlap. |
| Hallucination | Did the model invent unsupported or forbidden claims? | LLM-judge analysis plus must_not_include penalty checks. |
| Citation | Are the cited sources real and the right ones? | Compares cited sources vs retrieved context. |
| Answer completeness | Does the answer contain the key facts? | Validates presence of must_include phrases. |
| Latency | How fast is the system? | Tracks wall-clock time (gates on suite-level p95). |
| Cost | What is the financial footprint? | Live calculations (gates on suite-level total USD). |
| Regression | Did quality drop since the last release? | Auto-compares metrics against a tagged baseline run. |
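
Three of the non-LLM methods named in the table can be approximated in a few lines. This is a sketch under stated assumptions, not the tool's code: the function names are hypothetical, the tokenizer is a simple lowercase alphanumeric split, and the percentile uses the nearest-rank definition, which may differ from what the tool computes.

```python
import re


def retrieval_accuracy(expected: list[str], retrieved: list[str]) -> float:
    """Fraction of expected sources found as a substring of a retrieved source."""
    if not expected:
        return 1.0
    hits = sum(any(exp in got for got in retrieved) for exp in expected)
    return hits / len(expected)


def token_overlap(answer: str, context: str) -> float:
    """Share of distinct answer tokens that also occur in the context."""
    def tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(context)) / len(answer_tokens)


def p95(latencies: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not latencies:
        raise ValueError("p95 requires at least one value")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]
```

Substring matching means an expected source such as `visa_checklist.pdf` still counts when the retriever returns a longer identifier such as a full path.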

🏗️ Architecture

flowchart TD
    classDef main fill:#3b82f6,stroke:#1d4ed8,stroke-width:2px,color:#fff,font-weight:bold;
    classDef input fill:#f8fafc,stroke:#64748b,stroke-width:1.5px;
    classDef step fill:#fff,stroke:#cbd5e1,stroke-width:1px;
    classDef eval fill:#fef2f2,stroke:#f87171,stroke-width:1.5px;
    classDef storage fill:#ecfdf5,stroke:#34d399,stroke-width:1.5px;
    classDef report fill:#fff7ed,stroke:#fb923c,stroke-width:1.5px;
    classDef gate fill:#fef08a,stroke:#eab308,stroke-width:1.5px;

    tests["tests.yaml<br><i>(Questions & Expectations)</i>"]:::input --> runner["Runner Engine"]:::main

    runner --> target["Target System"]:::step
    runner --> evals["Evaluator Suite"]:::step
    runner --> reporting["Reporting System"]:::step

    target --> target_desc["• Local RAG Pipeline<br>• HTTP Endpoint"]:::step
    target_desc --> providers["Providers"]:::step
    providers --> providers_list["• Mock (Offline)<br>• Ollama / OpenAI<br>• Gemini / OpenRouter"]:::step
    providers_list --> vector["Vector Stores"]:::step
    vector --> vector_list["• Memory / Chroma / Qdrant"]:::step

    evals --> eval_list["• Retrieval Accuracy<br>• Groundedness <i>(LLM)</i><br>• Hallucination Penalty<br>• Citation & Source Precision<br>• Answer Completeness<br>• Latency & Cost"]:::eval
    eval_list --> storage["Storage DB<br><i>(SQLite / Postgres)</i>"]:::storage
    storage --> dashboard["Streamlit Dashboard<br>& FastAPI Backend"]:::storage

    reporting --> report_formats["• Rich Console Output<br>• report.json<br>• junit.xml<br>• summary.md<br>• report.html"]:::report
    report_formats --> gate_check["Gate Check"]:::gate
    gate_check --> threshold_desc["Thresholds & Regression Checks"]:::gate
    threshold_desc --> exit_code["CI Exit Code<br><i>(0 = Pass, 1 = Fail)</i>"]:::gate

    storage -.-> gate_check

  • Providers: mock, ollama, openai, gemini, openrouter (any model via OpenRouter with live cost tracking). See docs/providers.md.
  • Vector stores: memory (zero dependencies), chroma, qdrant.
  • Targets: evaluate the built-in pipeline, or point the runner at your own RAG HTTP endpoint to evaluate an existing system.
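
For the HTTP target, the exchange with your endpoint might look like the sketch below. The wire format is an assumption (a JSON POST with a JSON reply) and the helper names are hypothetical. Only the field names `question`, `answer`, and `sources` come from the target configuration in this README.

```python
import json
import urllib.request


def build_request(url: str, question: str,
                  question_field: str = "question") -> urllib.request.Request:
    """Build a JSON POST carrying the question (assumed wire format)."""
    body = json.dumps({question_field: question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(raw: bytes,
                   answer_field: str = "answer",
                   sources_field: str = "sources") -> tuple[str, list[str]]:
    """Extract the answer text and the source list from a JSON reply."""
    payload = json.loads(raw.decode("utf-8"))
    return payload[answer_field], list(payload.get(sources_field, []))


def ask(url: str, question: str, timeout: float = 30.0) -> tuple[str, list[str]]:
    """Send one question to the endpoint and return (answer, sources)."""
    with urllib.request.urlopen(build_request(url, question), timeout=timeout) as resp:
        return parse_response(resp.read())
```

Keeping request building and response parsing separate from the network call makes the adapter easy to unit test without a running server.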

💼 Use Cases

  • Pull-Request Gate — Automatically block merges that regress retrieval/groundedness metrics or exceed cost limits.
  • Model & Prompt Optimization — Run side-by-side comparison matrices (e.g. gpt-4o-mini vs local LLM) on quality and cost.
  • Nightly Regression Gates — Schedule automated cron runs against a tagged production baseline to capture silent drift.
  • Provider Migration Assurance — Benchmark a new model/provider to prove compatibility before going live.

🚀 How to Use It in Your Company

  1. Seed a Test Suite: Write 10–20 high-value user questions in a tests.yaml.
  2. Configure Your Target: Point the runner to your pipeline. If using your own server, set target.type: http:
    target:
      type: http
      url: https://your-rag.internal/answer
      question_field: question
      answer_field: answer
      sources_field: sources
    
  3. Select Your Judge: Configure a judging provider. Use ollama for a free local judge, or supply API keys for openrouter, openai, or gemini.
  4. Define Quality Thresholds: Set performance/cost bounds in rag_eval.yaml.
  5. Drop into CI: Copy the prebuilt .github/workflows/rag-eval.yml to auto-run on every PR.
  6. Freeze a Baseline: Run rag-eval baseline save <run_id> on your main branch to establish a comparison reference.
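
Steps 4 and 6 combine into the gate decision: absolute thresholds plus a comparison with the frozen baseline. The sketch below shows one plausible way to combine them. The function name `gate_failures`, the `max_drop` tolerance, and the metric dictionaries are assumptions, not the tool's actual configuration schema.

```python
def gate_failures(current: dict[str, float],
                  baseline: dict[str, float],
                  minimums: dict[str, float],
                  max_drop: float = 0.02) -> list[str]:
    """Return human-readable reasons the gate should fail (empty means pass).

    Assumptions: higher is better for every metric, and a drop of more
    than `max_drop` against the baseline counts as a regression.
    """
    failures = []
    for metric, floor in minimums.items():
        value = current.get(metric, 0.0)
        if value < floor:
            failures.append(f"{metric}: {value:.2f} below threshold {floor:.2f}")
    for metric, base in baseline.items():
        value = current.get(metric, 0.0)
        if base - value > max_drop:
            failures.append(f"{metric}: {value:.2f} regressed from baseline {base:.2f}")
    return failures
```

A CI wrapper would exit with 1 when the returned list is non-empty. Lower-is-better metrics such as latency and cost would need the comparison inverted.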

Run history is persisted to Postgres/SQLite and can be monitored visually using the Streamlit dashboard (docker compose up).


🐍 Python SDK

from rag_eval import Evaluator

# Run evaluations directly from your custom pipeline or test script
report = Evaluator.from_config("rag_eval.yaml").run(
    "tests.yaml", ingest_dir="examples/visa/docs"
)

print(report.summary.quality)     # e.g., {'groundedness': 0.98, ...}
assert report.passed              # fails with AssertionError when the gate does not pass

🗺️ Roadmap

  • Additional Evaluators: Context precision/recall, answer relevancy, and toxicity filters.
  • Native Integrations: Direct connectors for LangChain / LlamaIndex pipelines.
  • Agentic Evaluation: Support multi-turn agent conversations and tool usage tracking.
  • Notification Exporters: Built-in Slack, Discord, and email alerts on gate failures.
  • Visual Diffing: Comprehensive run-to-run comparisons on the dashboard.
  • PyPI & Docker Images: Hosted pre-builds for zero setup.

🤝 Contributing

Contributions are very welcome! Please check out CONTRIBUTING.md to get started.

[!TIP] Looking for entry points? Check out our Good First Issues list.


📄 License

This project is licensed under the terms of the Apache-2.0 License.
