CI evaluation framework for RAG and AI agents: groundedness, retrieval quality, hallucination, citations, latency, cost, and regression gating before production.


🛠️ rag-agent-eval-ci

Stop shipping RAG systems you can't test.

A CI evaluation framework for RAG and AI agents — gate every deploy on groundedness, retrieval quality, hallucination, citations, latency, cost, and regression.

⚠️ The Problem

Teams are shipping RAG assistants and AI agents into production, but most have no automated way to answer the one question that matters before a deploy:

[!IMPORTANT] "Is this version safe to ship, or did we just make it worse?"

A prompt tweak silently drops retrieval recall. A model swap doubles cost. A new chunking strategy starts hallucinating. Today these regressions are caught by users, not by CI.

rag-agent-eval-ci turns RAG quality into a pull-request gate. Developers write test questions in YAML; the tool measures groundedness, retrieval accuracy, hallucination, citations, latency, and cost, compares against a baseline, and fails the build when quality drops.

📺 Demo

Demo Screenshot

⚡ Quick Start (under 5 minutes, no API keys)

[!TIP] The default config uses a built-in mock provider + in-memory retriever, so the first run works with zero keys and zero network dependency.

Installation

pip install rag-agent-eval-ci          # or: git clone https://github.com/martian7777/rag-agent-eval-ci && cd rag-agent-eval-ci && pip install -e .

Run the bundled visa-enrollment example end-to-end:

rag-eval gate \
  --tests examples/visa/tests.yaml \
  -c examples/visa/rag_eval.yaml \
  -d examples/visa/docs

The exit code is 0 if the gate passes and 1 if it fails, so any CI system that fails a job on a nonzero exit status can use it as-is.
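If you drive the gate from a Python script rather than a shell step, the exit-code contract can be checked with `subprocess`. This is a minimal sketch: `gate_passed` and `VISA_GATE_CMD` are hypothetical names introduced here, and the command arguments simply mirror the visa example above.

```python
import subprocess

# Arguments copied from the visa example; adjust the paths for your project.
VISA_GATE_CMD = [
    "rag-eval", "gate",
    "--tests", "examples/visa/tests.yaml",
    "-c", "examples/visa/rag_eval.yaml",
    "-d", "examples/visa/docs",
]


def gate_passed(command: list[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    # check=False: a failing gate is an expected outcome, not an exception.
    result = subprocess.run(command, check=False)
    return result.returncode == 0
```

Calling `gate_passed(VISA_GATE_CMD)` returns `True` only when the gate exits with 0, so a wrapper script can decide what to do next (for example, post a summary before failing the job).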

[!NOTE] Prefer Docker? docker compose up brings the API, dashboard, Postgres, and a local Ollama up together.

📄 Example Input

A developer just writes questions and expectations in tests.yaml:

suite: visa-docs
questions:
  - question: "What documents are required for visa enrollment?"
    expected_sources: ["visa_checklist.pdf"]   # retrieval must surface this
    must_include: ["passport", "admission letter", "insurance"]  # answer completeness
    must_not_include: ["bank statement is required"]             # hallucination trap
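
To make the `must_include` and `must_not_include` expectations concrete, here is a rough sketch of how such phrase checks could be scored. It assumes case-insensitive substring matching, which is a guess at the behavior and not the tool's confirmed implementation; `check_phrases` is a hypothetical helper.

```python
def check_phrases(answer: str, must_include: list[str], must_not_include: list[str]) -> dict:
    """Score an answer against required and forbidden phrases (illustrative only)."""
    text = answer.lower()
    # Required phrases that never appear in the answer.
    missing = [p for p in must_include if p.lower() not in text]
    # Forbidden phrases that do appear in the answer.
    forbidden_hits = [p for p in must_not_include if p.lower() in text]
    if must_include:
        completeness = (len(must_include) - len(missing)) / len(must_include)
    else:
        completeness = 1.0
    return {
        "completeness": completeness,
        "missing": missing,
        "forbidden_hits": forbidden_hits,
    }


required = ["passport", "admission letter", "insurance"]
forbidden = ["bank statement is required"]

good = check_phrases(
    "You need a valid passport, your admission letter, and proof of insurance.",
    required,
    forbidden,
)
bad = check_phrases("A passport and a bank statement is required.", required, forbidden)
```

Here `good` contains every required phrase and no forbidden one, while `bad` misses two required phrases and trips the hallucination trap.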

📊 Example Output

Every run produces clean console output and CI-friendly artifacts in .rag_eval/reports/:

File	Use case / Description
`report.json`	Machine-readable full evaluation results.
`junit.xml`	Standard format that renders as native test results in GitHub/GitLab CI.
`summary.md`	Clean markdown summary intended to be auto-posted to PR comments or job summaries.
`report.html`	Self-contained HTML report, suitable for upload as a CI build artifact.

🎯 What it Measures

Metric	Question it answers	Core Evaluation Method
Retrieval accuracy	Did we retrieve the documents we expected?	Substring checking on returned sources vs expected.
Groundedness	Is the answer actually supported by the retrieved context?	LLM-judge verification or fallback token overlap.
Hallucination	Did the model invent unsupported or forbidden claims?	LLM-judge analysis plus `must_not_include` penalty checks.
Citation	Are the cited sources real and the right ones?	Compares cited sources vs retrieved context.
Answer completeness	Does the answer contain the key facts?	Validates presence of `must_include` phrases.
Latency	How fast is the system?	Tracks wall-clock time (gates on suite-level p95).
Cost	What is the financial footprint?	Live calculations (gates on suite-level total USD).
Regression	Did quality drop since the last release?	Auto-compares metrics against a tagged baseline run.
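
Three of the non-LLM methods in the table are simple enough to sketch directly: substring matching on sources, token overlap as a groundedness fallback, and a suite-level p95 latency. The functions below are assumptions about how such checks might look (tokenization and the nearest-rank percentile method in particular are choices made here), not the package's actual code.

```python
import re


def retrieval_accuracy(expected_sources: list[str], retrieved_sources: list[str]) -> float:
    """Fraction of expected sources that appear as a substring of any retrieved source."""
    if not expected_sources:
        return 1.0
    hits = sum(
        any(expected.lower() in got.lower() for got in retrieved_sources)
        for expected in expected_sources
    )
    return hits / len(expected_sources)


def token_overlap(answer: str, context: str) -> float:
    """Share of distinct answer tokens that also occur in the retrieved context."""
    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokenize(context)) / len(answer_tokens)


def p95(latencies: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not latencies:
        raise ValueError("p95 needs at least one latency sample")
    ordered = sorted(latencies)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Token overlap is a crude proxy: it rewards answers that reuse context wording and cannot detect a fluent but unsupported claim, which is why it is described above only as a fallback to the LLM judge.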

🏗️ Architecture

flowchart TD
    classDef main fill:#3b82f6,stroke:#1d4ed8,stroke-width:2px,color:#fff,font-weight:bold;
    classDef input fill:#f8fafc,stroke:#64748b,stroke-width:1.5px;
    classDef step fill:#fff,stroke:#cbd5e1,stroke-width:1px;
    classDef eval fill:#fef2f2,stroke:#f87171,stroke-width:1.5px;
    classDef storage fill:#ecfdf5,stroke:#34d399,stroke-width:1.5px;
    classDef report fill:#fff7ed,stroke:#fb923c,stroke-width:1.5px;
    classDef gate fill:#fef08a,stroke:#eab308,stroke-width:1.5px;

    tests["tests.yaml<br><i>(Questions & Expectations)</i>"]:::input --> runner["Runner Engine"]:::main

    runner --> target["Target System"]:::step
    runner --> evals["Evaluator Suite"]:::step
    runner --> reporting["Reporting System"]:::step

    target --> target_desc["• Local RAG Pipeline<br>• HTTP Endpoint"]:::step
    target_desc --> providers["Providers"]:::step
    providers --> providers_list["• Mock (Offline)<br>• Ollama / OpenAI<br>• Gemini / OpenRouter"]:::step
    providers_list --> vector["Vector Stores"]:::step
    vector --> vector_list["• Memory / Chroma / Qdrant"]:::step

    evals --> eval_list["• Retrieval Accuracy<br>• Groundedness <i>(LLM)</i><br>• Hallucination Penalty<br>• Citation & Source Precision<br>• Answer Completeness<br>• Latency & Cost"]:::eval
    eval_list --> storage["Storage DB<br><i>(SQLite / Postgres)</i>"]:::storage
    storage --> dashboard["Streamlit Dashboard<br>& FastAPI Backend"]:::storage

    reporting --> report_formats["• Rich Console Output<br>• report.json<br>• junit.xml<br>• summary.md<br>• report.html"]:::report
    report_formats --> gate_check["Gate Check"]:::gate
    gate_check --> threshold_desc["Thresholds & Regression Checks"]:::gate
    threshold_desc --> exit_code["CI Exit Code<br><i>(0 = Pass, 1 = Fail)</i>"]:::gate

    storage -.-> gate_check

Providers: mock, ollama, openai, gemini, openrouter (any model via OpenRouter with live cost tracking). See docs/providers.md.
Vector stores: memory (zero dependencies), chroma, qdrant.
Targets: Evaluate the built-in pipeline, or point the runner at your own RAG HTTP endpoint to evaluate an existing system without changing it.

💼 Use Cases

Pull-Request Gate — Automatically block merges that regress retrieval/groundedness metrics or exceed cost limits.
Model & Prompt Optimization — Run side-by-side comparison matrices (e.g. gpt-4o-mini vs local LLM) on quality and cost.
Nightly Regression Gates — Schedule automated cron runs against a tagged production baseline to capture silent drift.
Provider Migration Assurance — Benchmark a new model/provider to prove compatibility before going live.
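
The regression-oriented use cases above all come down to comparing a run's metrics with a baseline and failing when a metric drops by more than some tolerance. A minimal sketch of that comparison follows; the metric names, the `tolerance` parameter, and `regression_failures` itself are hypothetical, and it assumes every metric is higher-is-better.

```python
def regression_failures(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """List metrics that dropped more than `tolerance` below the baseline."""
    failures = []
    for metric, base_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None:
            # A metric that disappeared is treated as a failure, not a pass.
            failures.append(f"{metric}: missing from current run")
        elif new_value < base_value - tolerance:
            failures.append(f"{metric}: {base_value:.2f} -> {new_value:.2f}")
    return failures


baseline_run = {"groundedness": 0.95, "retrieval_accuracy": 0.90}
candidate_run = {"groundedness": 0.96, "retrieval_accuracy": 0.80}
failures = regression_failures(baseline_run, candidate_run)
```

A gate built on this would exit nonzero whenever `failures` is non-empty. Lower-is-better metrics such as latency and cost would need the comparison reversed.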

🚀 How to Use It in Your Company

Seed a Test Suite: Write 10–20 high-value user questions in a tests.yaml.

Configure Your Target: Point the runner to your pipeline. If using your own server, set target.type: http:

target:
  type: http
  url: https://your-rag.internal/answer
  question_field: question
  answer_field: answer
  sources_field: sources

Select Your Judge: Configure a judging provider. Use ollama for a free local judge, or supply API keys for openrouter, openai, or gemini.
Define Quality Thresholds: Set performance/cost bounds in rag_eval.yaml.
Drop into CI: Copy the prebuilt .github/workflows/rag-eval.yml to auto-run on every PR.
Freeze a Baseline: Run rag-eval baseline save <run_id> on your main branch to establish a comparison reference.
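
For the HTTP target in step 2, your service needs to accept a question and return an answer plus its sources under the configured field names. The stub below sketches such an endpoint with the standard library. It assumes the runner sends a JSON body via POST, which the configuration above does not confirm, so check the request shape before relying on it. `AnswerHandler` and `make_server` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class AnswerHandler(BaseHTTPRequestHandler):
    """Stub RAG endpoint using the field names from the target config."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        question = payload.get("question", "")  # question_field
        body = json.dumps({
            "answer": f"Stub answer for: {question}",  # answer_field
            "sources": ["visa_checklist.pdf"],         # sources_field
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep request logging out of test output


def make_server(host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Create the stub server; port 0 lets the OS pick a free port."""
    return HTTPServer((host, port), AnswerHandler)
```

Running `make_server(port=8000).serve_forever()` gives you a local endpoint to point `target.url` at while you wire up the real pipeline.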

Run history is persisted to Postgres/SQLite and can be monitored visually using the Streamlit dashboard (docker compose up).

🐍 Python SDK

from rag_eval import Evaluator

# Run evaluations directly from your custom pipeline or test script
report = Evaluator.from_config("rag_eval.yaml").run(
    "tests.yaml", ingest_dir="examples/visa/docs"
)

print(report.summary.quality)     # e.g., {'groundedness': 0.98, ...}
assert report.passed              # raises AssertionError when the gate fails, so test runners such as pytest report it

🗺️ Roadmap

Additional Evaluators: Context precision/recall, answer relevancy, and toxicity filters.
Native Integrations: Direct connectors for LangChain / LlamaIndex pipelines.
Agentic Evaluation: Support multi-turn agent conversations and tool usage tracking.
Notification Exporters: Built-in Slack, Discord, and email alerts on gate failures.
Visual Diffing: Comprehensive run-to-run comparisons on the dashboard.
PyPI & Docker Images: Hosted pre-builds for zero setup.

🤝 Contributing

Contributions are very welcome! Please check out CONTRIBUTING.md to get started.

[!TIP] Looking for entry points? Check out our Good First Issues list.

📄 License

This project is licensed under the terms of the Apache-2.0 License.

This version

0.1.0

Jun 13, 2026

Download files

Source Distribution

rag_agent_eval_ci-0.1.0.tar.gz (46.3 kB view details)

Uploaded Jun 13, 2026 Source

Built Distribution

rag_agent_eval_ci-0.1.0-py3-none-any.whl (55.7 kB view details)

Uploaded Jun 13, 2026 Python 3

File details

Details for the file rag_agent_eval_ci-0.1.0.tar.gz.

File metadata

Download URL: rag_agent_eval_ci-0.1.0.tar.gz
Upload date: Jun 13, 2026
Size: 46.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for rag_agent_eval_ci-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`363627b31b29447dc752ec817a7bfe28b40a0f3dd4eec69f4a2f4b31cec517df`
MD5	`2f8c844ce88b77f1309a1f1e71716348`
BLAKE2b-256	`929bbb976fbf69f2633dc3f13a048804ba47b3845b28e794f959d02fadc79c12`

File details

Details for the file rag_agent_eval_ci-0.1.0-py3-none-any.whl.

File metadata

Download URL: rag_agent_eval_ci-0.1.0-py3-none-any.whl
Upload date: Jun 13, 2026
Size: 55.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for rag_agent_eval_ci-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bac4d0863c2c386a7901403989664f14ba7ebf982d5d85bc4169e62e2dabb876`
MD5	`dd663faea9a322790396973bee74b283`
BLAKE2b-256	`f8d274bbbd9c8851d752a3a20da9956b8ea33c3c33a33a0c594e84dfe709ecea`
