Pluggable RAG evaluation framework

rag-eval-kit

Pluggable RAG evaluation framework. Run any RAG system against a labeled Q/A dataset and get accuracy, latency, and cost metrics across a configuration sweep.

Why rag-eval-kit?

Most teams end up writing one-off eval scripts per project: a loop that calls their RAG pipeline, compares outputs to a golden set, and prints a number. That works once. But the moment you want to sweep across top_k, temperatures, or chunking strategies, or compare LangChain vs LlamaIndex on the same dataset, those scripts turn into glue code. rag-eval-kit replaces that glue with a YAML config and a single RAGSystem protocol (one method: query), so the same eval loop drives any system, any metric, and any config matrix, and produces CSVs and charts that teams can share.

Installation

pip install rag-eval-kit

# With optional dependencies:
pip install "rag-eval-kit[anthropic]"     # Anthropic Claude
pip install "rag-eval-kit[openai]"        # OpenAI
pip install "rag-eval-kit[langchain]"     # LangChain adapter
pip install "rag-eval-kit[llamaindex]"    # LlamaIndex adapter
pip install "rag-eval-kit[r2r]"           # SciPhi R2R server client
pip install "rag-eval-kit[haystack]"      # Haystack 2.x adapter
pip install "rag-eval-kit[huggingface]"   # HuggingFace datasets
pip install "rag-eval-kit[cost]"          # tiktoken for real pre-run cost estimates
pip install "rag-eval-kit[all]"           # Everything (quotes keep shells such as zsh from expanding the brackets)

For development:

git clone https://github.com/DennisKoshta/rag-eval-kit.git
cd rag-eval-kit
uv venv && uv pip install -e ".[dev,anthropic]"

Quick Start

1. Create a dataset (JSONL)

{"question": "What is the capital of France?", "expected_answer": "Paris"}
{"question": "Who wrote Romeo and Juliet?", "expected_answer": "William Shakespeare"}

JSONL and CSV files use fixed field names: question, expected_answer, and optionally expected_docs (list of strings) and tags (dict). For HuggingFace datasets, field names are configurable via question_field / answer_field / docs_field.
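
As a minimal sketch of the JSONL layout described above, the following standard-library snippet writes two rows and reads them back, checking that the required fields are present. The `load_rows` helper, the `REQUIRED` set, and the example doc name and tag are illustrative assumptions, not part of rag-eval-kit.

```python
import json
import tempfile
from pathlib import Path

REQUIRED = {"question", "expected_answer"}  # fixed field names for JSONL/CSV

rows = [
    {
        "question": "What is the capital of France?",
        "expected_answer": "Paris",
        "expected_docs": ["france.md"],      # optional: list of strings (made-up name)
        "tags": {"topic": "geography"},      # optional: dict (made-up tag)
    },
    {"question": "Who wrote Romeo and Juliet?", "expected_answer": "William Shakespeare"},
]


def load_rows(path):
    """Hypothetical helper: parse one JSON object per line and check required fields."""
    loaded = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = REQUIRED - set(row)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        loaded.append(row)
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "questions.jsonl"
    dataset_path.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    loaded_rows = load_rows(dataset_path)

print(len(loaded_rows), "rows loaded")
```

Running a check like this before a sweep catches misnamed fields early, before any tokens are spent.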

2. Write a config

dataset:
  source: jsonl
  path: ./data/questions.jsonl

system:
  adapter: raw
  adapter_config:
    llm_provider: anthropic
    llm_model: claude-sonnet-4-20250514

sweep:
  top_k: [3, 5, 10]
  temperature: [0.0, 0.3]

metrics:
  - exact_match
  - latency_p50
  - latency_p95
  - token_cost:
      pricing:
        claude-sonnet-4-20250514:
          input_per_1k: 0.003
          output_per_1k: 0.015

output:
  csv: ./results/run.csv
  charts: ./results/charts/
  html: ./results/report.html

3. Set API keys

rag-eval-kit reads API keys from environment variables or a .env file in the working directory:

export ANTHROPIC_API_KEY=sk-ant-...   # for Claude models
export OPENAI_API_KEY=sk-...          # for GPT / o-series models

Or copy .env.example to .env and fill in your keys. Shell-exported values take precedence over .env.

4. Run

rag-eval-kit run config.yaml

This expands the sweep matrix (3 top_k x 2 temperature = 6 configs), runs each against every question, and outputs:

- CSV with per-question scores and aggregate summary
- Charts (accuracy bars, latency box plots, cost vs accuracy scatter)
- Summary table printed to stdout
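
The 3 × 2 expansion described above is a plain Cartesian product. The sketch below reproduces it with `itertools.product`; the `expand_sweep` helper is a hypothetical name used for illustration and may not match how rag-eval-kit builds the matrix internally.

```python
from itertools import product

# Same sweep values as the Quick Start config.
sweep = {"top_k": [3, 5, 10], "temperature": [0.0, 0.3]}


def expand_sweep(sweep):
    """Illustrative helper: one dict per combination of swept values."""
    keys = list(sweep)
    return [dict(zip(keys, values)) for values in product(*(sweep[k] for k in keys))]


configs = expand_sweep(sweep)
for cfg in configs:
    print(cfg)
```

Each added sweep key multiplies the number of configs, so the total number of queries is the number of configs times the number of questions.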

CLI Reference

rag-eval-kit run CONFIG        Run an evaluation sweep
  --dry-run                  Print plan without executing
  --output-dir DIR           Override output directory
  --filter TEXT              Filter configs (e.g. "top_k=5")
  --no-confirm               Skip cost confirmation prompt
  --verbose                  Show per-question results
  --seed N                   Inject reproducibility seed into adapter_config
  --concurrency N            Parallel queries per config (default 1)
  --checkpoint PATH          JSONL checkpoint for resumable runs

rag-eval-kit validate CONFIG   Validate a config file without running

rag-eval-kit report CSV_PATH   Re-generate charts from existing results
  --output-dir DIR           Output directory for charts
  --html PATH                Generate a self-contained HTML report

rag-eval-kit compare CSV_A CSV_B  Compare two results_summary.csv files
  --output PATH              Write comparison CSV
  --threshold FLOAT          Min absolute delta to flag (default 0.05)
  --html PATH                Write an HTML comparison report

Parallel sweeps

Because LLM calls are I/O-bound, --concurrency 8 typically yields a roughly 6–8× speedup on real datasets with no code changes. Sweep configs still run sequentially (one at a time) so progress output stays readable; parallelism fans out across dataset items within each config. All bundled adapters are thread-safe. If you write a custom adapter, its query method will be called from multiple threads; see docs/writing_adapters.md.
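
To illustrate why threads help with I/O-bound calls, here is a rough sketch of fanning dataset items out over a thread pool within a single config. `fake_query` and `run_config` are stand-ins invented for this example (the sleep simulates network latency); they are not rag-eval-kit functions, and the real scheduler may work differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(question):
    """Stand-in for an I/O-bound LLM call (sleeps instead of using the network)."""
    time.sleep(0.05)
    return f"answer to {question}"


def run_config(questions, concurrency=1):
    """Hypothetical helper: run all questions for one config, preserving input order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fake_query, questions))


questions = [f"q{i}" for i in range(16)]

start = time.perf_counter()
serial = run_config(questions, concurrency=1)
serial_s = time.perf_counter() - start

start = time.perf_counter()
parallel = run_config(questions, concurrency=8)
parallel_s = time.perf_counter() - start

print(f"serial {serial_s:.2f}s, parallel {parallel_s:.2f}s")
```

Because the worker threads share one system object, any mutable state inside a custom adapter (clients, caches, counters) needs to be safe under concurrent calls.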

Resumable runs

Pass --checkpoint path.jsonl (or set output.checkpoint in YAML) and every completed (config, question) pair appends to that file immediately. If the process dies or you Ctrl-C out, re-running with the same checkpoint skips the completed items — no wasted tokens. The checkpoint records each row's config_params; if you edit your sweep between runs, mismatched rows are flagged and re-run.
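
The resume behaviour can be pictured as an append-only JSONL log keyed by (config, question). The following is a sketch under stated assumptions: only the `config_params` field name comes from the description above, while the `question` and `answer` keys and the `config_key`, `load_done`, and `run` helpers are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def config_key(params):
    """Canonical string for a config's parameters, so equal dicts compare equal."""
    return json.dumps(params, sort_keys=True)


def load_done(checkpoint):
    """Return the set of (config, question) pairs already recorded."""
    if not checkpoint.exists():
        return set()
    done = set()
    for line in checkpoint.read_text(encoding="utf-8").splitlines():
        row = json.loads(line)
        done.add((config_key(row["config_params"]), row["question"]))
    return done


def run(configs, questions, checkpoint, answer_fn):
    """Run every pending pair, appending each result as soon as it completes."""
    done = load_done(checkpoint)
    executed = 0
    with checkpoint.open("a", encoding="utf-8") as fh:
        for params in configs:
            for question in questions:
                if (config_key(params), question) in done:
                    continue  # finished in an earlier run: skip it
                row = {"config_params": params, "question": question,
                       "answer": answer_fn(question, params)}
                fh.write(json.dumps(row) + "\n")
                fh.flush()  # push the row out before starting the next query
                executed += 1
    return executed


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "run.jsonl"
    configs = [{"top_k": 3}, {"top_k": 5}]
    questions = ["q1", "q2", "q3"]
    first = run(configs, questions[:1], ckpt, lambda q, p: "stub")  # simulates an interrupted run
    second = run(configs, questions, ckpt, lambda q, p: "stub")     # resumes with the rest

print(first, second)
```

Keying on the full parameter dict means that rows written under an edited sweep no longer match any pending pair, so those pairs run again.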

HTML reports

Set output.html: ./results/report.html in your config (or pass --html to rag-eval-kit report) and get a single self-contained HTML file with sortable tables, inline charts, and a text filter for per-question results. No external CSS/JS — the file is fully portable and can be shared as-is.

Comparing runs

After tweaking a retriever and re-running, use rag-eval-kit compare to diff two result CSVs:

rag-eval-kit compare results_v1/results_summary.csv results_v2/results_summary.csv --html diff.html

Configs are matched by parameter equality. Each metric gets an absolute delta, percentage change, and directional indicator (improved/regressed/unchanged). Latency and cost metrics are direction-aware — a decrease is an improvement.
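
A minimal sketch of this comparison logic follows. The row layout (`params` and `metrics` keys), the `LOWER_IS_BETTER` prefix list, and the use of the threshold to decide "unchanged" are assumptions made for illustration; the actual CSV columns and rules in rag-eval-kit may differ.

```python
LOWER_IS_BETTER = ("latency", "token_cost")  # assumed metric-name prefixes


def params_key(params):
    """Order-independent key so configs match by parameter equality."""
    return tuple(sorted(params.items()))


def compare_runs(rows_a, rows_b, threshold=0.05):
    index_b = {params_key(r["params"]): r["metrics"] for r in rows_b}
    report = []
    for row in rows_a:
        metrics_b = index_b.get(params_key(row["params"]))
        if metrics_b is None:
            continue  # config exists only in run A
        for name, old in row["metrics"].items():
            delta = metrics_b[name] - old
            pct = delta / old * 100.0 if old else None
            if delta == 0 or abs(delta) < threshold:
                direction = "unchanged"
            elif (delta < 0) == name.startswith(LOWER_IS_BETTER):
                direction = "improved"   # moved in the metric's good direction
            else:
                direction = "regressed"
            report.append({"metric": name, "delta": delta,
                           "pct_change": pct, "direction": direction})
    return report


run_a = [{"params": {"top_k": 5, "temperature": 0.0},
          "metrics": {"exact_match": 0.70, "latency_p95": 1200.0, "token_cost": 0.42}}]
run_b = [{"params": {"temperature": 0.0, "top_k": 5},
          "metrics": {"exact_match": 0.80, "latency_p95": 900.0, "token_cost": 0.43}}]

report = compare_runs(run_a, run_b)
for entry in report:
    print(entry["metric"], entry["direction"])
```

Note that a single absolute threshold treats a 0.05 change in accuracy and a 0.05 ms change in latency alike, so it is worth choosing the threshold with the metric's scale in mind.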

Tag-based breakdown

If your dataset includes tags on EvalItem (e.g. {"topic": "physics"}), rag-eval-kit automatically groups per-question scores by tag and reports per-tag averages in results_tags.csv and the HTML report's "Tag Breakdown" section. No config changes needed — tags are detected from the data.
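
The grouping itself is simple to picture. The sketch below averages one metric per (tag key, tag value) pair; the row layout and the `tag_breakdown` helper are hypothetical and only meant to show the idea, not the format of results_tags.csv.

```python
from collections import defaultdict

# Per-question results with made-up scores.
results = [
    {"tags": {"topic": "physics"}, "exact_match": 1.0},
    {"tags": {"topic": "physics"}, "exact_match": 0.0},
    {"tags": {"topic": "history"}, "exact_match": 1.0},
    {"tags": {}, "exact_match": 0.0},  # untagged rows contribute to no group
]


def tag_breakdown(rows, metric):
    """Illustrative helper: mean of `metric` for every (tag key, tag value) pair."""
    buckets = defaultdict(list)
    for row in rows:
        for key, value in row.get("tags", {}).items():
            buckets[(key, value)].append(row[metric])
    return {tag: sum(scores) / len(scores) for tag, scores in buckets.items()}


breakdown = tag_breakdown(results, "exact_match")
for tag, mean in sorted(breakdown.items()):
    print(tag, mean)
```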

Python API

from rag_eval_kit import RAGSystem, RAGResult, EvalDataset
from rag_eval_kit.config import load_config
from rag_eval_kit.orchestrator import run_sweep
from rag_eval_kit.reporters import write_csv, write_charts, write_html

config = load_config("config.yaml")
result = run_sweep(config, no_confirm=True)
write_csv(result, "results/")
write_charts(result, "results/charts/")
write_html(result, "results/report.html")

Custom RAG Systems

Implement the RAGSystem protocol -- a single query method:

import time

from rag_eval_kit import RAGSystem, RAGResult

class MyRAGSystem:
    def query(self, question: str) -> RAGResult:
        # my_retriever and my_llm are placeholders for your own retrieval + generation logic
        start = time.perf_counter()
        docs = my_retriever.search(question, top_k=5)
        # Placeholder call: assumed to return the answer text plus a token-usage object
        answer, usage = my_llm.generate(question, context=docs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        return RAGResult(
            answer=answer,
            retrieved_docs=docs,
            metadata={
                "latency_ms": elapsed_ms,
                "prompt_tokens": usage.input_tokens,
                "completion_tokens": usage.output_tokens,
                "model": "my-model",
                "top_k": 5,
            },
        )

No inheritance required. Any object with a conforming query method works.
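
This is ordinary structural typing. The self-contained sketch below shows the mechanism with `typing.Protocol`; `Result` and `System` are stand-ins written for this example, not the real `RAGResult` and `RAGSystem` definitions, which may differ (for instance, the real protocol may not be runtime-checkable).

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@dataclass
class Result:
    """Stand-in for RAGResult with the three fields used in the example above."""
    answer: str
    retrieved_docs: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


@runtime_checkable
class System(Protocol):
    """Stand-in for the RAGSystem protocol: a single query method."""
    def query(self, question: str) -> Result: ...


class EchoSystem:  # note: no base class
    def query(self, question: str) -> Result:
        return Result(answer=question.upper(), metadata={"model": "echo"})


class NotASystem:
    def ask(self, question: str) -> str:
        return question


def evaluate(system: System, questions: list[str]) -> list[str]:
    """Works with any object that has a conforming query method."""
    return [system.query(q).answer for q in questions]


answers = evaluate(EchoSystem(), ["hello"])
print(answers, isinstance(EchoSystem(), System), isinstance(NotASystem(), System))
```

A runtime `isinstance` check against a protocol only verifies that the method exists, not its signature, so a static type checker such as mypy is the better place to catch mismatched arguments or return types.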

Metrics

Metric	Type	Description
`exact_match`	Per-question	1.0 if answer matches expected (case-insensitive)
`contains`	Per-question	1.0 if expected answer appears as a substring of the answer
`f1_token`	Per-question	SQuAD-style token F1 between answer and expected
`rouge_l`	Per-question	ROUGE-L F-measure based on longest common subsequence
`llm_judge`	Per-question	LLM scores correctness 0.0-1.0
`llm_faithfulness`	Per-question	LLM scores how grounded the answer is in retrieved docs
`precision_at_k`	Per-question	Fraction of retrieved docs in expected set
`recall_at_k`	Per-question	Fraction of expected docs found in top-k retrieved
`hit_rate_at_k`	Per-question	1.0 if any expected doc appears in top-k
`mrr`	Per-question	Reciprocal rank of the first retrieved doc that hits
`ndcg_at_k`	Per-question	Binary-relevance nDCG over the top-k retrieved
`latency_p50`	Aggregate	Median query latency
`latency_p95`	Aggregate	95th percentile query latency
`token_cost`	Aggregate	Total estimated cost from token counts
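
To make the string-based scores concrete, here is a rough sketch of `exact_match`, `contains`, and `f1_token`. The normalization (lowercasing and whitespace collapsing) and the `\w+` tokenizer are assumptions; the table only states that matching is case-insensitive, and the original SQuAD script also strips punctuation and articles, which this sketch does not.

```python
import re
from collections import Counter


def normalize(text):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def tokens(text):
    """Assumed tokenizer: lowercase word characters only."""
    return re.findall(r"\w+", text.lower())


def exact_match(answer, expected):
    return 1.0 if normalize(answer) == normalize(expected) else 0.0


def contains(answer, expected):
    return 1.0 if normalize(expected) in normalize(answer) else 0.0


def f1_token(answer, expected):
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred, gold = tokens(answer), tokens(expected)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(exact_match("paris", "Paris"))
print(contains("The capital is Paris.", "paris"))
print(round(f1_token("The capital is Paris", "Paris"), 3))
```

The last call shows why token F1 is useful alongside exact match: a verbose but correct answer scores 0.0 on `exact_match` yet still earns partial credit (one of four predicted tokens matches, and the single expected token is recalled).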

Configuration Reference

Section	Key	Default	Description
`dataset`	`source`	`jsonl`	`jsonl`, `csv`, or `huggingface`
`dataset`	`path`		Path to dataset file (required for jsonl/csv)
`dataset`	`name`		HuggingFace dataset name (required for huggingface)
`dataset`	`split`	`validation`	HuggingFace split to load
`dataset`	`config_name`		HuggingFace dataset config/subset name
`dataset`	`limit`		Max questions to evaluate
`dataset`	`question_field`	`question`	HuggingFace only — field name for the question text
`dataset`	`answer_field`	`answer`	HuggingFace only — field name for the expected answer
`dataset`	`docs_field`		HuggingFace only — field name for ground-truth docs (retrieval metrics)
`dataset`	`trust_remote_code`	`false`	HuggingFace only — allow datasets to run arbitrary code
`system`	`adapter`		`raw`, `langchain`, `llamaindex`, `r2r`, or `haystack`
`system`	`adapter_config`		Adapter-specific parameters (e.g. `llm_provider`, `llm_model`)
`sweep`	(any key)		Lists of values to sweep (Cartesian product)
`metrics`		`[exact_match, latency_p50, latency_p95]`	List of metric names (see Metrics above and Parametrized metrics below)
`output`	`csv`	`./results/run_{timestamp}.csv`	CSV output path
`output`	`charts`	`./results/charts/`	Charts output directory
`output`	`html`		Self-contained HTML report path
`output`	`checkpoint`		JSONL checkpoint path for resumable runs
`concurrency`		`1`	Parallel queries per config

Parametrized metrics

Most metrics are plain strings. Metrics that accept parameters use a single-key dict:

metrics:
  - exact_match                        # simple
  - precision_at_k:                    # with parameter
      k: 10
  - llm_judge:                         # LLM-based scorer
      provider: anthropic
      model: claude-sonnet-4-20250514
  - token_cost:
      pricing:
        claude-sonnet-4-20250514:
          input_per_1k: 0.003
          output_per_1k: 0.015
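
As a worked example of how a pricing table like this can turn token counts into a cost, the sketch below assumes the cost is derived from the `prompt_tokens`, `completion_tokens`, and `model` metadata keys shown in the custom system example. The `query_cost` helper and the token counts are made up for illustration.

```python
# Same rates as the YAML above (USD per 1,000 tokens).
pricing = {
    "claude-sonnet-4-20250514": {"input_per_1k": 0.003, "output_per_1k": 0.015},
}


def query_cost(metadata, pricing):
    """Hypothetical helper: cost of one query from its token counts."""
    rates = pricing[metadata["model"]]
    input_cost = metadata["prompt_tokens"] / 1000 * rates["input_per_1k"]
    output_cost = metadata["completion_tokens"] / 1000 * rates["output_per_1k"]
    return input_cost + output_cost


# Made-up token counts for two queries.
runs = [
    {"model": "claude-sonnet-4-20250514", "prompt_tokens": 2000, "completion_tokens": 1000},
    {"model": "claude-sonnet-4-20250514", "prompt_tokens": 1000, "completion_tokens": 0},
]

total_cost = sum(query_cost(m, pricing) for m in runs)
print(f"${total_cost:.3f}")
```

The first query costs 2 × 0.003 + 1 × 0.015 = 0.021 and the second 1 × 0.003 = 0.003, for a total of 0.024.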

Retrieval metrics (precision_at_k, recall_at_k, hit_rate_at_k, mrr, ndcg_at_k) need ground-truth documents to compare against: include expected_docs in each JSONL/CSV row, or set dataset.docs_field for HuggingFace datasets.
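
For reference, the retrieval metrics can be sketched in a few lines each. These follow the one-line descriptions in the Metrics table with binary relevance; edge-case choices (such as dividing precision by the number of docs actually retrieved, or returning 0.0 for empty inputs) are assumptions and may not match rag-eval-kit exactly.

```python
import math


def precision_at_k(retrieved, expected, k):
    top = retrieved[:k]
    return sum(doc in expected for doc in top) / len(top) if top else 0.0


def recall_at_k(retrieved, expected, k):
    return len(set(retrieved[:k]) & set(expected)) / len(expected) if expected else 0.0


def hit_rate_at_k(retrieved, expected, k):
    return 1.0 if set(retrieved[:k]) & set(expected) else 0.0


def mrr(retrieved, expected):
    """Reciprocal rank of the first relevant doc (ranks start at 1)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in expected:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, expected, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the best possible DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1) if doc in expected)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(expected), k) + 1))
    return dcg / ideal if ideal else 0.0


retrieved = ["d3", "d1", "d7", "d2"]   # ranked output of a retriever (made-up IDs)
expected = {"d1", "d2"}                # ground-truth docs for the question

print(precision_at_k(retrieved, expected, 3))
print(recall_at_k(retrieved, expected, 3))
print(hit_rate_at_k(retrieved, expected, 3))
print(mrr(retrieved, expected))
print(round(ndcg_at_k(retrieved, expected, 3), 4))
```

In this example only d1 lands in the top 3, at rank 2, so precision is 1/3, recall is 1/2, hit rate is 1.0, and MRR is 1/2.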

Adapters

Adapter	Status	Install
`raw`	Implemented	`pip install rag-eval-kit[anthropic]` or `[openai]`
`langchain`	Implemented	`pip install rag-eval-kit[langchain]`
`llamaindex`	Implemented	`pip install rag-eval-kit[llamaindex]`
`r2r`	Implemented	`pip install rag-eval-kit[r2r]`
`haystack`	Implemented	`pip install rag-eval-kit[haystack]`

Development

uv venv && uv pip install -e ".[dev]"
pytest                    # Run tests
ruff check .              # Lint
mypy rag_eval_kit/          # Type check

License

Maintainers

DennisKoshta

This version

0.4.0

Apr 17, 2026

Download files

Source Distribution

rag_eval_kit-0.4.0.tar.gz (75.4 kB view details)

Uploaded Apr 17, 2026 Source

Built Distribution

rag_eval_kit-0.4.0-py3-none-any.whl (55.6 kB view details)

Uploaded Apr 17, 2026 Python 3

File details

Details for the file rag_eval_kit-0.4.0.tar.gz.

File metadata

Download URL: rag_eval_kit-0.4.0.tar.gz
Upload date: Apr 17, 2026
Size: 75.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rag_eval_kit-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`3db4bab2018969ca3a3d8f9c4d42574435fdff5383472d84b3892480d34c18bf`
MD5	`37de474f914214fea7dae26ccdaa90d5`
BLAKE2b-256	`17263235321672a5779bd08268e0886737ab422c71c9a0e03b4bffb140cdf6cb`

Provenance

The following attestation bundles were made for rag_eval_kit-0.4.0.tar.gz:

Publisher: publish.yml on DennisKoshta/rag-eval-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rag_eval_kit-0.4.0.tar.gz
- Subject digest: 3db4bab2018969ca3a3d8f9c4d42574435fdff5383472d84b3892480d34c18bf
- Sigstore transparency entry: 1330388668
- Sigstore integration time: Apr 17, 2026
Source repository:
- Permalink: DennisKoshta/rag-eval-kit@88bf8127cc784cb556c3691b48ebf49463414aaf
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/DennisKoshta
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@88bf8127cc784cb556c3691b48ebf49463414aaf
- Trigger Event: push

File details

Details for the file rag_eval_kit-0.4.0-py3-none-any.whl.

File metadata

Download URL: rag_eval_kit-0.4.0-py3-none-any.whl
Upload date: Apr 17, 2026
Size: 55.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rag_eval_kit-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`464424696cd567ee26dca153bcde97f99a01d2dfe5f6305a10b7df78ae0dba09`
MD5	`f21ece4a3612ed45bbe9c58ad31aa5c2`
BLAKE2b-256	`88a054f6e6ddea10cb5b5f73f46f5bf3f1ed3848f2b404a70dd18004f0379958`

Provenance

The following attestation bundles were made for rag_eval_kit-0.4.0-py3-none-any.whl:

Publisher: publish.yml on DennisKoshta/rag-eval-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rag_eval_kit-0.4.0-py3-none-any.whl
- Subject digest: 464424696cd567ee26dca153bcde97f99a01d2dfe5f6305a10b7df78ae0dba09
- Sigstore transparency entry: 1330388763
- Sigstore integration time: Apr 17, 2026
Source repository:
- Permalink: DennisKoshta/rag-eval-kit@88bf8127cc784cb556c3691b48ebf49463414aaf
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/DennisKoshta
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@88bf8127cc784cb556c3691b48ebf49463414aaf
- Trigger Event: push
