Framework-agnostic evaluation harness for RAG and agentic AI systems

Maintainers

aiexponent

RAG Benchmarking

A framework-agnostic evaluation harness for RAG and agentic AI systems.

Bring your own RAG pipeline — LangChain, LlamaIndex, or custom — and benchmark it against standard classic and agentic-era metrics. Built for teams who need to prove their AI systems work before they ship, not hope they do.

Part of the AiExponent open source portfolio. Maps to EU AI Act Article 15 (accuracy requirements).

Quick Start

Install

# Install from PyPI (recommended)
pip install rag-benchmarking

# Or install from source
git clone https://github.com/aiexponenthq/rag-benchmarking.git
cd rag-benchmarking
pip install -e ".[test]"

Evaluate your existing RAG system in 5 minutes

from app.sdk.client import RagEval

client = RagEval(api_url="http://localhost:5001", api_key="your-key")

# LangChain integration
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(...)
result = chain.invoke({"query": "What is RAG?"})
sample = RagEval.from_langchain(result)

# LlamaIndex integration
engine = index.as_query_engine()
response = engine.query("What is RAG?")
sample = RagEval.from_llamaindex(response, "What is RAG?")

# Or any dict with question / contexts / answer
sample = {
    "question": "What is RAG?",
    "contexts": ["RAG stands for Retrieval-Augmented Generation."],
    "answer": "RAG combines retrieval with LLM generation.",
}

report = client.evaluate([sample], metrics=["faithfulness", "answer_relevancy"])
print(report["scores"])
# {"faithfulness": 0.92, "answer_relevancy": 0.88}
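Before sending samples, it can help to check that each dict has the expected shape. The helper below is a minimal sketch and not part of the SDK: `validate_sample` is a hypothetical name, and the rule that `contexts` must be a list of strings is an assumption drawn from the example above.

```python
# Keys shown in the plain-dict example above.
REQUIRED_KEYS = ("question", "contexts", "answer")


def validate_sample(sample: dict) -> list[str]:
    """Return a list of problems found in a sample dict (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in sample:
            problems.append(f"missing key: {key}")
    contexts = sample.get("contexts")
    # Assumption: contexts is a list of plain strings, as in the example.
    if contexts is not None and not (
        isinstance(contexts, list) and all(isinstance(c, str) for c in contexts)
    ):
        problems.append("contexts must be a list of strings")
    return problems
```

An empty return value means the sample matches the shape used in this README; it says nothing about what the server itself accepts.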

Start the evaluation server

# With Docker Compose
docker compose up

# Or directly
uvicorn app.main:app --port 5001

Interactive API docs

Once the server is running, the full OpenAPI reference is available at:

http://localhost:5001/docs

LLM Backend for Evaluation

Several metrics (faithfulness, context_precision, context_recall, agent_faithfulness, tool_call_accuracy, retrieval_necessity) use an LLM as a judge. The harness supports Gemini (recommended) and OpenAI.

# Set in .env:
LLM_PROVIDER=gemini
GEMINI_API_KEY=your-gemini-key

# Or OpenAI:
LLM_PROVIDER=openai
OPENAI_API_KEY=your-openai-key

Determinism: Judge calls run at temperature=0.0 to minimise variance across evaluation runs. For CI/CD integration, run evaluations at least twice and flag changes beyond a ±0.05 threshold rather than asserting exact scores.
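The threshold check described above could be sketched as follows. This is an illustration only: `flag_score_changes` is a hypothetical helper, not an SDK function, and it assumes each run is reduced to a flat dict of metric name to score like the `report["scores"]` example earlier.

```python
def flag_score_changes(baseline: dict, current: dict, threshold: float = 0.05) -> dict:
    """Return {metric: delta} for metrics whose score moved by more than threshold."""
    flagged = {}
    # Only compare metrics present in both runs.
    for metric in sorted(baseline.keys() & current.keys()):
        delta = current[metric] - baseline[metric]
        if abs(delta) > threshold:
            flagged[metric] = round(delta, 4)
    return flagged


baseline = {"faithfulness": 0.92, "answer_relevancy": 0.88}
current = {"faithfulness": 0.84, "answer_relevancy": 0.90}
print(flag_score_changes(baseline, current))
# {'faithfulness': -0.08}
```

In a CI job, a non-empty result would be the signal to fail or warn, instead of an equality assertion on the raw scores.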

Cost guidance: A full evaluation pass (all classic metrics) on 50 samples costs approximately $0.05–$0.15 with Gemini Flash or GPT-4o-mini. Source attribution accuracy is deterministic and costs nothing.
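For rough budgeting, the 50-sample figure can be extrapolated to other dataset sizes. The sketch below assumes cost scales linearly with sample count, which the source does not state; `estimate_cost_range` is a hypothetical helper.

```python
def estimate_cost_range(n_samples: int) -> tuple[float, float]:
    """Extrapolate the quoted $0.05 to $0.15 per 50 samples to n_samples (USD)."""
    low_per_sample = 0.05 / 50   # lower bound from the guidance above
    high_per_sample = 0.15 / 50  # upper bound from the guidance above
    return (
        round(n_samples * low_per_sample, 2),
        round(n_samples * high_per_sample, 2),
    )


print(estimate_cost_range(1000))
# (1.0, 3.0)
```

Actual spend depends on context length and provider pricing, so treat the result as an order-of-magnitude estimate.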

Metrics

Classic RAG Metrics

Metric	What it measures	Requires
`faithfulness`	Are all claims in the answer supported by context?	question, contexts, answer
`answer_relevancy`	Does the answer address the question?	question, answer
`context_precision`	Are retrieved chunks relevant to the query?	+ ground_truth
`context_recall`	Does context contain enough to answer correctly?	+ ground_truth
`precision_at_k`	Fraction of top-K retrieved docs that are relevant	+ retrieved_doc_ids, relevant_doc_ids
`recall_at_k`	Fraction of relevant docs found in top-K	+ retrieved_doc_ids, relevant_doc_ids
`mrr`	Reciprocal rank of first relevant doc	+ retrieved_doc_ids, relevant_doc_ids
`ndcg_at_k`	Rank-weighted retrieval quality	+ retrieved_doc_ids, relevant_doc_ids
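The four ID-based retrieval metrics have standard textbook definitions, sketched below with binary relevance. These functions are not taken from the package and its implementation may differ in edge cases (for example, how fewer than K retrieved documents are handled); the function names are illustrative.

```python
import math


def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant (denominator is k)."""
    if k <= 0:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / k


def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant ID, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 3), recall_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant), ndcg_at_k(retrieved, relevant, 3))
```

Averaging `reciprocal_rank` over all queries in a dataset gives the mean reciprocal rank (MRR).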

Agentic-Era Metrics

For multi-step agents, tool-using systems, and autonomous RAG pipelines:

Metric	What it measures	LLM needed?
`source_attribution_accuracy`	Did the agent cite sources it actually retrieved?	No (deterministic)
`agent_faithfulness`	Is every reasoning step faithful to retrieved sources?	Yes
`tool_call_accuracy`	Did the agent choose the right tool at the right time?	Yes
`retrieval_necessity`	Was retrieval actually needed for this query?	Yes
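To illustrate why source attribution needs no LLM, here is one plausible deterministic formulation: the share of cited source IDs that the agent actually retrieved. This is an assumption about the metric's definition, not the package's code; the function name, the ID-based inputs, and the score for an empty citation list are all hypothetical choices.

```python
def attribution_accuracy_sketch(cited_ids: list, retrieved_ids: list) -> float:
    """Fraction of cited source IDs that appear among the retrieved IDs."""
    if not cited_ids:
        # Assumption: with nothing cited, nothing is misattributed.
        return 1.0
    retrieved = set(retrieved_ids)
    return sum(cid in retrieved for cid in cited_ids) / len(cited_ids)


# Two of the three cited sources were actually retrieved.
print(attribution_accuracy_sketch(["a", "b", "x"], ["a", "b", "c"]))
```

Because the score is a set-membership count, repeated runs on the same trace always give the same value.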

Metric Groups

Group	Metrics included
`classic`	faithfulness, answer_relevancy
`retrieval`	precision_at_k, recall_at_k, mrr, ndcg_at_k
`agentic_v1`	source_attribution_accuracy, retrieval_necessity, agent_faithfulness, tool_call_accuracy
`agentic_v2`	multihop_faithfulness, agent_trajectory_efficiency, reasoning_hallucination, context_coherence_across_turns
`full`	all classic + retrieval + agentic_v1 metrics

# Use a pre-defined group instead of listing metrics individually
report = client.evaluate(samples, metric_group="classic")
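The group table can be read as a simple lookup from group name to metric list. The sketch below mirrors the table for illustration; `METRIC_GROUPS` and `expand_group` are hypothetical names, and the server's own resolution logic may differ.

```python
# Group contents copied from the Metric Groups table above.
METRIC_GROUPS = {
    "classic": ["faithfulness", "answer_relevancy"],
    "retrieval": ["precision_at_k", "recall_at_k", "mrr", "ndcg_at_k"],
    "agentic_v1": [
        "source_attribution_accuracy",
        "retrieval_necessity",
        "agent_faithfulness",
        "tool_call_accuracy",
    ],
    "agentic_v2": [
        "multihop_faithfulness",
        "agent_trajectory_efficiency",
        "reasoning_hallucination",
        "context_coherence_across_turns",
    ],
}
# "full" is documented as classic + retrieval + agentic_v1.
METRIC_GROUPS["full"] = (
    METRIC_GROUPS["classic"] + METRIC_GROUPS["retrieval"] + METRIC_GROUPS["agentic_v1"]
)


def expand_group(name: str) -> list:
    """Return the metric names for a group, or raise ValueError if unknown."""
    if name not in METRIC_GROUPS:
        raise ValueError(f"unknown metric group: {name!r}")
    return list(METRIC_GROUPS[name])


print(expand_group("full"))
```

Note that `full` does not include the `agentic_v2` metrics.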

API Reference

REST API

# Classic RAG evaluation
curl -X POST http://localhost:5001/v1/evaluate \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "samples": [{
      "question": "What is RAG?",
      "contexts": ["RAG is Retrieval-Augmented Generation."],
      "answer": "RAG combines retrieval with LLM generation."
    }],
    "metrics": ["faithfulness", "answer_relevancy"]
  }'

# Agentic trace evaluation
curl -X POST http://localhost:5001/v1/evaluate/agent \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "trace": {
      "question": "What is the GPAI deadline?",
      "final_answer": "GPAI obligations apply from August 2025.",
      "tool_calls": [{
        "tool_name": "retrieve",
        "tool_input": {"query": "GPAI deadline"},
        "tool_output": "Article 53 obligations apply from August 2025.",
        "step_index": 0
      }]
    },
    "metrics": ["source_attribution_accuracy", "tool_call_accuracy"]
  }'

# List and compare runs
curl http://localhost:5001/v1/runs -H "X-API-Key: your-key"
curl -X POST http://localhost:5001/v1/runs/compare \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '["run-id-a", "run-id-b"]'
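For callers who prefer plain HTTP from Python over the SDK, the first curl call above can be mirrored with the standard library. This is a sketch: `build_evaluate_request` is a hypothetical helper, and the endpoint path, header names, and payload keys are copied from the curl example.

```python
import json
import urllib.request


def build_evaluate_request(api_url: str, api_key: str, samples: list, metrics: list):
    """Build (but do not send) a POST request for /v1/evaluate."""
    payload = json.dumps({"samples": samples, "metrics": metrics}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_url.rstrip('/')}/v1/evaluate",
        data=payload,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_evaluate_request(
    "http://localhost:5001",
    "your-key",
    [{
        "question": "What is RAG?",
        "contexts": ["RAG is Retrieval-Augmented Generation."],
        "answer": "RAG combines retrieval with LLM generation.",
    }],
    ["faithfulness", "answer_relevancy"],
)

# Sending requires a running evaluation server:
# with urllib.request.urlopen(req) as resp:
#     report = json.loads(resp.read())
```

Building the request separately from sending it keeps the payload easy to inspect and log before any network call is made.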

Python SDK

from app.sdk.client import RagEval

client = RagEval(api_url="http://localhost:5001", api_key="your-key")

# Classic evaluation
report = client.evaluate(samples, metrics=["faithfulness"])
report = client.evaluate(samples, metric_group="agentic_v1")

# Agentic trace
report = client.evaluate_agent(trace_dict, metrics=["agent_faithfulness"])

# Run history and comparison
runs = client.list_runs()
comparison = client.compare_runs(["run-a", "run-b"])

Architecture

flowchart TD
    A["Your RAG System\n(LangChain / LlamaIndex / Custom)"]
    B["SDK Adapters\nRagEval.from_langchain()\nRagEval.from_llamaindex()"]
    C["EvalSample / AgentTrace\nharness/schemas.py"]
    D["EvaluationRunner\nharness/runner.py"]
    E["RAGAS Metrics\nfaithfulness · answer_relevancy\ncontext_precision · context_recall"]
    F["Retrieval Metrics\nPrecision@K · Recall@K · MRR · NDCG"]
    G["Agentic Metrics\nagent_faithfulness · tool_call_accuracy\nretrieval_necessity · source_attribution"]
    H["BenchmarkReport"]
    I["SQLite ResultStore\nharness/result_store.py"]
    J["REST API\n/v1/evaluate · /v1/evaluate/agent\n/v1/runs · /v1/runs/compare"]

    A --> B --> C --> D
    D --> E
    D --> F
    D --> G
    E --> H
    F --> H
    G --> H
    H --> I --> J

Plug-in contract

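To integrate, a RAG system implements a single method:

class MyRAG:
    def __init__(self, chain):
        # chain: e.g. a LangChain RetrievalQA built with return_source_documents=True
        self.chain = chain

    def run(self, question: str, contexts_override=None) -> dict:
        # contexts_override is part of the method signature; this minimal example ignores it
        result = self.chain.invoke({"query": question})
        return {
            "answer": result["result"],
            "contexts": [d.page_content for d in result["source_documents"]],
            "retrieved_doc_ids": [d.metadata.get("id") for d in result["source_documents"]],
        }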
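The `run` contract above can be expressed as a structural type, so any class with a matching method qualifies without inheriting from anything. The sketch below is an approximation: `RAGEvaluableSketch`, `EchoRAG`, and `to_sample` are hypothetical names, and treating `contexts_override` as "use these contexts instead of retrieving" is an assumption about its meaning.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class RAGEvaluableSketch(Protocol):
    """Structural type: anything with a compatible run() method."""

    def run(self, question: str, contexts_override: Optional[list] = None) -> dict: ...


class EchoRAG:
    """Toy system with a fixed context; stands in for a real pipeline."""

    def run(self, question: str, contexts_override: Optional[list] = None) -> dict:
        # Assumption: an override replaces whatever retrieval would return.
        contexts = contexts_override or ["RAG stands for Retrieval-Augmented Generation."]
        return {
            "answer": contexts[0],
            "contexts": contexts,
            "retrieved_doc_ids": ["doc-0"],
        }


def to_sample(system: RAGEvaluableSketch, question: str) -> dict:
    """Run the system once and shape the output as an evaluation sample dict."""
    out = system.run(question)
    return {"question": question, "contexts": out["contexts"], "answer": out["answer"]}


print(isinstance(EchoRAG(), RAGEvaluableSketch))
print(to_sample(EchoRAG(), "What is RAG?"))
```

The `isinstance` check only confirms that a `run` attribute exists; it does not validate the argument types or the keys of the returned dict.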

Configuration

Copy .env.example to .env and set:

# LLM provider for LLM-as-judge metrics
LLM_PROVIDER=gemini          # or openai
GEMINI_API_KEY=...
OPENAI_API_KEY=...

# Vector store (built-in RAG pipeline only)
QDRANT_URL=https://your-cluster.qdrant.io
QDRANT_API_KEY=...

# API authentication
API_KEY=your-secret-key
ENFORCE_API_KEY=true
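A quick way to catch a misconfigured judge before starting the server is to check that the provider and its matching key are both set. The following is a sketch, not the package's loader: `load_judge_config` is a hypothetical helper, and defaulting to `gemini` when `LLM_PROVIDER` is unset is an assumption.

```python
import os

# Which API key variable each documented provider needs.
KEY_VARS = {"gemini": "GEMINI_API_KEY", "openai": "OPENAI_API_KEY"}


def load_judge_config(env=None) -> dict:
    """Read provider settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Assumption: fall back to gemini when LLM_PROVIDER is not set.
    provider = env.get("LLM_PROVIDER", "gemini").lower()
    key_var = KEY_VARS.get(provider)
    if key_var is None:
        raise ValueError(f"unsupported LLM_PROVIDER: {provider!r}")
    api_key = env.get(key_var)
    if not api_key:
        raise ValueError(f"{key_var} must be set when LLM_PROVIDER={provider}")
    return {"provider": provider, "api_key": api_key}


print(load_judge_config({"LLM_PROVIDER": "openai", "OPENAI_API_KEY": "your-openai-key"}))
```

Passing a plain dict instead of the real environment keeps the check easy to unit-test.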

Project Structure

src/
  harness/                   # Framework-agnostic evaluation harness
    schemas.py               # EvalSample, AgentTrace, BenchmarkReport, RunConfig
    protocol.py              # RAGEvaluable Protocol — the plug-in contract
    runner.py                # EvaluationRunner — orchestrates metrics
    result_store.py          # SQLite persistence for BenchmarkReport
  app/
    api/                     # FastAPI endpoints
    eval/
      ragas_runner.py        # RAGAS classic metrics
      retrieval_metrics.py   # Precision@K, Recall@K, MRR, NDCG
      faithfulness.py        # Claim-decomposition faithfulness (LLM-as-judge)
      agentic_metrics.py     # source_attribution_accuracy (deterministic)
      agentic_llm_metrics.py # LLM-as-judge agentic metrics
    sdk/                     # Python SDK (RagEval client)
    engine/                  # Built-in RAG pipeline (optional)

data/
  golden/qa.jsonl            # 50-sample golden dataset (10 domains)

tests/
  unit/                      # Unit tests (no LLM, no network)
  integration/               # SDK integration tests
  e2e/                       # Full API endpoint tests

EU AI Act Context

Maps to Article 15 of the EU AI Act (accuracy, robustness and cybersecurity), which applies to high-risk AI systems. Systematic RAG evaluation supports the technical testing needed to demonstrate accuracy under Article 15.

Known Limitations

- Benchmark datasets are English-only; no multilingual evaluation support.
- Custom dataset integration requires manual formatting to the JSONL schema.
- Accuracy metrics only; latency and throughput are not measured.
- LLM-as-judge metrics depend on the quality of the configured judge model.
- Rate limiting is in-memory and resets on server restart.
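Since custom datasets must be formatted by hand, a small converter can reduce errors. The sketch below serialises records as JSON Lines (one JSON object per line). The field names `question`, `contexts`, `answer`, and `ground_truth` come from the metric tables in this README; whether the golden dataset schema matches them exactly is an assumption, and `dumps_jsonl` and `loads_jsonl` are hypothetical helpers.

```python
import json


def dumps_jsonl(records: list) -> str:
    """Serialise a list of dicts as JSON Lines text."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def loads_jsonl(text: str) -> list:
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = [
    {
        "question": "What is RAG?",
        "contexts": ["RAG stands for Retrieval-Augmented Generation."],
        "answer": "RAG combines retrieval with LLM generation.",
        "ground_truth": "Retrieval-Augmented Generation.",
    }
]
text = dumps_jsonl(records)
print(loads_jsonl(text) == records)
```

The resulting text can be written to disk with `pathlib.Path("my_dataset.jsonl").write_text(text, encoding="utf-8")`; the file name here is only an example.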

Version Compatibility

Dependency	Tested version
Python	3.11, 3.12
RAGAS	0.4.x
LangChain	≥ 0.1
LlamaIndex	≥ 0.10
FastAPI	≥ 0.110

Contributing

See CONTRIBUTING.md. Apache 2.0 licensed.

Built by AiExponent — Building AI that deserves to be trusted.

Release history

1.0.0 (May 10, 2026)
1.0.0rc1, pre-release (Apr 12, 2026), this version

Download files

Source Distribution

rag_benchmarking-1.0.0rc1.tar.gz (47.5 kB)

Uploaded Apr 12, 2026 Source

Built Distribution

rag_benchmarking-1.0.0rc1-py3-none-any.whl (47.2 kB)

Uploaded Apr 12, 2026 Python 3

File details

Details for the file rag_benchmarking-1.0.0rc1.tar.gz.

File metadata

Download URL: rag_benchmarking-1.0.0rc1.tar.gz
Upload date: Apr 12, 2026
Size: 47.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rag_benchmarking-1.0.0rc1.tar.gz
Algorithm	Hash digest
SHA256	`f131b986025fb00beb2a8464c6beaeb413a5269b55ab5069282721e3b5353fc1`
MD5	`6dbbc10ad5686ba0aa490eaf72bdf74e`
BLAKE2b-256	`e516e92793000e96a4999ed6be62b5fa35580f0c11d9ff84a2cdf54b2713ef72`
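A downloaded file can be checked against the SHA256 digest above with the standard library. This is a minimal sketch; `sha256_hex` and `file_matches` are hypothetical helper names, and the path in the commented usage line assumes the archive sits in the current directory.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def file_matches(path: str, expected_sha256: str) -> bool:
    """True if the file's SHA256 digest equals the expected hex string."""
    # The distributions are under 50 kB, so reading the whole file is fine.
    with open(path, "rb") as f:
        return sha256_hex(f.read()) == expected_sha256.lower()


EXPECTED = "f131b986025fb00beb2a8464c6beaeb413a5269b55ab5069282721e3b5353fc1"
# file_matches("rag_benchmarking-1.0.0rc1.tar.gz", EXPECTED)
```

pip can enforce the same check at install time through hash-pinned requirements files (`--require-hashes`).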

Provenance

The following attestation bundles were made for rag_benchmarking-1.0.0rc1.tar.gz:

Publisher: publish-pypi.yml on aiexponenthq/rag-benchmarking

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rag_benchmarking-1.0.0rc1.tar.gz
- Subject digest: f131b986025fb00beb2a8464c6beaeb413a5269b55ab5069282721e3b5353fc1
- Sigstore transparency entry: 1280642013
- Sigstore integration time: Apr 12, 2026
Source repository:
- Permalink: aiexponenthq/rag-benchmarking@71de180ee722eb2617b8d12c08baa78fb178003a
- Branch / Tag: refs/tags/v1.0.0-rc1
- Owner: https://github.com/aiexponenthq
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@71de180ee722eb2617b8d12c08baa78fb178003a
- Trigger Event: push

File details

Details for the file rag_benchmarking-1.0.0rc1-py3-none-any.whl.

File metadata

Download URL: rag_benchmarking-1.0.0rc1-py3-none-any.whl
Upload date: Apr 12, 2026
Size: 47.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rag_benchmarking-1.0.0rc1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`41812b2ba6354d033c8a7e5d89f546ec052aedbe42f5e2194a130392b10799fe`
MD5	`d41f63b371af59a805fb84933bd8b493`
BLAKE2b-256	`613a7fc0d5ac69d499b879cdfec7734382c8633f5416360ebf7e02d9b194325b`

Provenance

The following attestation bundles were made for rag_benchmarking-1.0.0rc1-py3-none-any.whl:

Publisher: publish-pypi.yml on aiexponenthq/rag-benchmarking

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rag_benchmarking-1.0.0rc1-py3-none-any.whl
- Subject digest: 41812b2ba6354d033c8a7e5d89f546ec052aedbe42f5e2194a130392b10799fe
- Sigstore transparency entry: 1280642017
- Sigstore integration time: Apr 12, 2026
Source repository:
- Permalink: aiexponenthq/rag-benchmarking@71de180ee722eb2617b8d12c08baa78fb178003a
- Branch / Tag: refs/tags/v1.0.0-rc1
- Owner: https://github.com/aiexponenthq
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@71de180ee722eb2617b8d12c08baa78fb178003a
- Trigger Event: push
