Skip to main content

Pluggable RAG evaluation framework

Project description

rag-eval-kit

PyPI version Python versions CI License: MIT

Pluggable RAG evaluation framework. Run any RAG system against a labeled Q/A dataset and get accuracy, latency, and cost metrics across a configuration sweep.

Why rag-eval-kit?

Most teams end up writing one-off eval scripts per project: a loop that calls their RAG pipeline, compares outputs to a golden set, and prints a number. That works once — but the moment you want to sweep across top_k, temperatures, or chunking strategies, or compare LangChain vs LlamaIndex on the same dataset, those scripts turn into glue code. rag-eval-kit replaces that glue with a YAML config + a single RAGSystem protocol (one method: query), so the same eval loop drives any system, any metric, any config matrix — and spits out CSVs and charts teams can actually share.

Installation

pip install rag-eval-kit

# With optional dependencies:
pip install rag-eval-kit[anthropic]     # Anthropic Claude
pip install rag-eval-kit[openai]        # OpenAI
pip install rag-eval-kit[langchain]     # LangChain adapter
pip install rag-eval-kit[llamaindex]    # LlamaIndex adapter
pip install rag-eval-kit[r2r]           # SciPhi R2R server client
pip install rag-eval-kit[haystack]      # Haystack 2.x adapter
pip install rag-eval-kit[huggingface]   # HuggingFace datasets
pip install rag-eval-kit[cost]          # tiktoken for real pre-run cost estimates
pip install rag-eval-kit[all]           # Everything

For development:

git clone https://github.com/DennisKoshta/rag-eval-kit.git
cd rag-eval-kit
uv venv && uv pip install -e ".[dev,anthropic]"

Quick Start

1. Create a dataset (JSONL)

{"question": "What is the capital of France?", "expected_answer": "Paris"}
{"question": "Who wrote Romeo and Juliet?", "expected_answer": "William Shakespeare"}

JSONL and CSV files use fixed field names: question, expected_answer, and optionally expected_docs (list of strings) and tags (dict). For HuggingFace datasets, field names are configurable via question_field / answer_field / docs_field.

2. Write a config

dataset:
  source: jsonl
  path: ./data/questions.jsonl

system:
  adapter: raw
  adapter_config:
    llm_provider: anthropic
    llm_model: claude-sonnet-4-20250514

sweep:
  top_k: [3, 5, 10]
  temperature: [0.0, 0.3]

metrics:
  - exact_match
  - latency_p50
  - latency_p95
  - token_cost:
      pricing:
        claude-sonnet-4-20250514:
          input_per_1k: 0.003
          output_per_1k: 0.015

output:
  csv: ./results/run.csv
  charts: ./results/charts/
  html: ./results/report.html

3. Set API keys

rag-eval-kit reads API keys from environment variables or a .env file in the working directory:

export ANTHROPIC_API_KEY=sk-ant-...   # for Claude models
export OPENAI_API_KEY=sk-...          # for GPT / o-series models

Or copy .env.example to .env and fill in your keys. Shell-exported values take precedence over .env.

4. Run

rag-eval-kit run config.yaml

This expands the sweep matrix (3 top_k x 2 temperature = 6 configs), runs each against every question, and outputs:

  • CSV with per-question scores and aggregate summary
  • Charts (accuracy bars, latency box plots, cost vs accuracy scatter)
  • Summary table printed to stdout

CLI Reference

rag-eval-kit run CONFIG        Run an evaluation sweep
  --dry-run                  Print plan without executing
  --output-dir DIR           Override output directory
  --filter TEXT              Filter configs (e.g. "top_k=5")
  --no-confirm               Skip cost confirmation prompt
  --verbose                  Show per-question results
  --seed N                   Inject reproducibility seed into adapter_config
  --concurrency N            Parallel queries per config (default 1)
  --checkpoint PATH          JSONL checkpoint for resumable runs

rag-eval-kit validate CONFIG   Validate a config file without running

rag-eval-kit report CSV_PATH   Re-generate charts from existing results
  --output-dir DIR           Output directory for charts
  --html PATH                Generate a self-contained HTML report

rag-eval-kit compare CSV_A CSV_B  Compare two results_summary.csv files
  --output PATH              Write comparison CSV
  --threshold FLOAT          Min absolute delta to flag (default 0.05)
  --html PATH                Write an HTML comparison report

Parallel sweeps

LLM calls are I/O-bound, so --concurrency 8 usually gets a ~6–8× speedup on real datasets with no code changes. Sweep configs still run sequentially (one at a time) so progress output stays readable; parallelism fans out across dataset items within each config. All bundled adapters are thread-safe — if you write a custom adapter, see docs/writing_adapters.md.

Resumable runs

Pass --checkpoint path.jsonl (or set output.checkpoint in YAML) and every completed (config, question) pair appends to that file immediately. If the process dies or you Ctrl-C out, re-running with the same checkpoint skips the completed items — no wasted tokens. The checkpoint records each row's config_params; if you edit your sweep between runs, mismatched rows are flagged and re-run.

HTML reports

Set output.html: ./results/report.html in your config (or pass --html to rag-eval-kit report) and get a single self-contained HTML file with sortable tables, inline charts, and a text filter for per-question results. No external CSS/JS — the file is fully portable and can be shared as-is.

Comparing runs

After tweaking a retriever and re-running, use rag-eval-kit compare to diff two result CSVs:

rag-eval-kit compare results_v1/results_summary.csv results_v2/results_summary.csv --html diff.html

Configs are matched by parameter equality. Each metric gets an absolute delta, percentage change, and directional indicator (improved/regressed/unchanged). Latency and cost metrics are direction-aware — a decrease is an improvement.

Tag-based breakdown

If your dataset includes tags on EvalItem (e.g. {"topic": "physics"}), rag-eval-kit automatically groups per-question scores by tag and reports per-tag averages in results_tags.csv and the HTML report's "Tag Breakdown" section. No config changes needed — tags are detected from the data.

Python API

from rag_eval_kit import RAGSystem, RAGResult, EvalDataset
from rag_eval_kit.config import load_config
from rag_eval_kit.orchestrator import run_sweep
from rag_eval_kit.reporters import write_csv, write_charts, write_html

config = load_config("config.yaml")
result = run_sweep(config, no_confirm=True)
write_csv(result, "results/")
write_charts(result, "results/charts/")
write_html(result, "results/report.html")

Custom RAG Systems

Implement the RAGSystem protocol -- a single query method:

from rag_eval_kit import RAGSystem, RAGResult

class MyRAGSystem:
    def query(self, question: str) -> RAGResult:
        # Your retrieval + generation logic here
        docs = my_retriever.search(question, top_k=5)
        answer = my_llm.generate(question, context=docs)
        return RAGResult(
            answer=answer,
            retrieved_docs=docs,
            metadata={
                "latency_ms": elapsed_ms,
                "prompt_tokens": usage.input_tokens,
                "completion_tokens": usage.output_tokens,
                "model": "my-model",
                "top_k": 5,
            },
        )

No inheritance required. Any object with a conforming query method works.

Metrics

Metric Type Description
exact_match Per-question 1.0 if answer matches expected (case-insensitive)
contains Per-question 1.0 if expected answer appears as a substring of the answer
f1_token Per-question SQuAD-style token F1 between answer and expected
rouge_l Per-question ROUGE-L F-measure based on longest common subsequence
llm_judge Per-question LLM scores correctness 0.0-1.0
llm_faithfulness Per-question LLM scores how grounded the answer is in retrieved docs
precision_at_k Per-question Fraction of retrieved docs in expected set
recall_at_k Per-question Fraction of expected docs found in top-k retrieved
hit_rate_at_k Per-question 1.0 if any expected doc appears in top-k
mrr Per-question Reciprocal rank of the first retrieved doc that hits
ndcg_at_k Per-question Binary-relevance nDCG over the top-k retrieved
latency_p50 Aggregate Median query latency
latency_p95 Aggregate 95th percentile query latency
token_cost Aggregate Total estimated cost from token counts

Configuration Reference

Section Key Default Description
dataset source jsonl jsonl, csv, or huggingface
dataset path Path to dataset file (required for jsonl/csv)
dataset name HuggingFace dataset name (required for huggingface)
dataset split validation HuggingFace split to load
dataset config_name HuggingFace dataset config/subset name
dataset limit Max questions to evaluate
dataset question_field question HuggingFace only — field name for the question text
dataset answer_field answer HuggingFace only — field name for the expected answer
dataset docs_field HuggingFace only — field name for ground-truth docs (retrieval metrics)
dataset trust_remote_code false HuggingFace only — allow datasets to run arbitrary code
system adapter raw, langchain, llamaindex, r2r, or haystack
system adapter_config Adapter-specific parameters (e.g. llm_provider, llm_model)
sweep (any key) Lists of values to sweep (Cartesian product)
metrics [exact_match, latency_p50, latency_p95] List of metric names (see below)
output csv ./results/run_{timestamp}.csv CSV output path
output charts ./results/charts/ Charts output directory
output html Self-contained HTML report path
output checkpoint JSONL checkpoint path for resumable runs
concurrency 1 Parallel queries per config

Parametrized metrics

Most metrics are plain strings. Metrics that accept parameters use a single-key dict:

metrics:
  - exact_match                        # simple
  - precision_at_k:                    # with parameter
      k: 10
  - llm_judge:                         # LLM-based scorer
      provider: anthropic
      model: claude-sonnet-4-20250514
  - token_cost:
      pricing:
        claude-sonnet-4-20250514:
          input_per_1k: 0.003
          output_per_1k: 0.015

Retrieval metrics (precision_at_k, recall_at_k, hit_rate_at_k, mrr, ndcg_at_k) require dataset.docs_field to be set so ground-truth documents are available for comparison.

Adapters

Adapter Status Install
raw Implemented pip install rag-eval-kit[anthropic] or [openai]
langchain Implemented pip install rag-eval-kit[langchain]
llamaindex Implemented pip install rag-eval-kit[llamaindex]
r2r Implemented pip install rag-eval-kit[r2r]
haystack Implemented pip install rag-eval-kit[haystack]

Development

uv venv && uv pip install -e ".[dev]"
pytest                    # Run tests
ruff check .              # Lint
mypy rag_eval_kit/          # Type check

License

MIT © 2026 Dennis Koshta

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rag_eval_kit-0.4.0.tar.gz (75.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

rag_eval_kit-0.4.0-py3-none-any.whl (55.6 kB view details)

Uploaded Python 3

File details

Details for the file rag_eval_kit-0.4.0.tar.gz.

File metadata

  • Download URL: rag_eval_kit-0.4.0.tar.gz
  • Upload date:
  • Size: 75.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rag_eval_kit-0.4.0.tar.gz
Algorithm Hash digest
SHA256 3db4bab2018969ca3a3d8f9c4d42574435fdff5383472d84b3892480d34c18bf
MD5 37de474f914214fea7dae26ccdaa90d5
BLAKE2b-256 17263235321672a5779bd08268e0886737ab422c71c9a0e03b4bffb140cdf6cb

See more details on using hashes here.

Provenance

The following attestation bundles were made for rag_eval_kit-0.4.0.tar.gz:

Publisher: publish.yml on DennisKoshta/rag-eval-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rag_eval_kit-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: rag_eval_kit-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 55.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rag_eval_kit-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 464424696cd567ee26dca153bcde97f99a01d2dfe5f6305a10b7df78ae0dba09
MD5 f21ece4a3612ed45bbe9c58ad31aa5c2
BLAKE2b-256 88a054f6e6ddea10cb5b5f73f46f5bf3f1ed3848f2b404a70dd18004f0379958

See more details on using hashes here.

Provenance

The following attestation bundles were made for rag_eval_kit-0.4.0-py3-none-any.whl:

Publisher: publish.yml on DennisKoshta/rag-eval-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page