Pluggable RAG evaluation framework
Project description
rag-eval-kit
Pluggable RAG evaluation framework. Run any RAG system against a labeled Q/A dataset and get accuracy, latency, and cost metrics across a configuration sweep.
Why rag-eval-kit?
Most teams end up writing one-off eval scripts per project: a loop that calls their RAG pipeline, compares outputs to a golden set, and prints a number. That works once — but the moment you want to sweep across top_k, temperatures, or chunking strategies, or compare LangChain vs LlamaIndex on the same dataset, those scripts turn into glue code. rag-eval-kit replaces that glue with a YAML config + a single RAGSystem protocol (one method: query), so the same eval loop drives any system, any metric, any config matrix — and spits out CSVs and charts teams can actually share.
Installation
pip install rag-eval-kit
# With optional dependencies:
pip install rag-eval-kit[anthropic] # Anthropic Claude
pip install rag-eval-kit[openai] # OpenAI
pip install rag-eval-kit[langchain] # LangChain adapter
pip install rag-eval-kit[llamaindex] # LlamaIndex adapter
pip install rag-eval-kit[r2r] # SciPhi R2R server client
pip install rag-eval-kit[haystack] # Haystack 2.x adapter
pip install rag-eval-kit[huggingface] # HuggingFace datasets
pip install rag-eval-kit[cost] # tiktoken for real pre-run cost estimates
pip install rag-eval-kit[all] # Everything
For development:
git clone https://github.com/DennisKoshta/rag-eval-kit.git
cd rag-eval-kit
uv venv && uv pip install -e ".[dev,anthropic]"
Quick Start
1. Create a dataset (JSONL)
{"question": "What is the capital of France?", "expected_answer": "Paris"}
{"question": "Who wrote Romeo and Juliet?", "expected_answer": "William Shakespeare"}
JSONL and CSV files use fixed field names: question, expected_answer, and optionally expected_docs (list of strings) and tags (dict). For HuggingFace datasets, field names are configurable via question_field / answer_field / docs_field.
2. Write a config
dataset:
source: jsonl
path: ./data/questions.jsonl
system:
adapter: raw
adapter_config:
llm_provider: anthropic
llm_model: claude-sonnet-4-20250514
sweep:
top_k: [3, 5, 10]
temperature: [0.0, 0.3]
metrics:
- exact_match
- latency_p50
- latency_p95
- token_cost:
pricing:
claude-sonnet-4-20250514:
input_per_1k: 0.003
output_per_1k: 0.015
output:
csv: ./results/run.csv
charts: ./results/charts/
html: ./results/report.html
3. Set API keys
rag-eval-kit reads API keys from environment variables or a .env file in the working directory:
export ANTHROPIC_API_KEY=sk-ant-... # for Claude models
export OPENAI_API_KEY=sk-... # for GPT / o-series models
Or copy .env.example to .env and fill in your keys. Shell-exported values take precedence over .env.
4. Run
rag-eval-kit run config.yaml
This expands the sweep matrix (3 top_k x 2 temperature = 6 configs), runs each against every question, and outputs:
- CSV with per-question scores and aggregate summary
- Charts (accuracy bars, latency box plots, cost vs accuracy scatter)
- Summary table printed to stdout
CLI Reference
rag-eval-kit run CONFIG Run an evaluation sweep
--dry-run Print plan without executing
--output-dir DIR Override output directory
--filter TEXT Filter configs (e.g. "top_k=5")
--no-confirm Skip cost confirmation prompt
--verbose Show per-question results
--seed N Inject reproducibility seed into adapter_config
--concurrency N Parallel queries per config (default 1)
--checkpoint PATH JSONL checkpoint for resumable runs
rag-eval-kit validate CONFIG Validate a config file without running
rag-eval-kit report CSV_PATH Re-generate charts from existing results
--output-dir DIR Output directory for charts
--html PATH Generate a self-contained HTML report
rag-eval-kit compare CSV_A CSV_B Compare two results_summary.csv files
--output PATH Write comparison CSV
--threshold FLOAT Min absolute delta to flag (default 0.05)
--html PATH Write an HTML comparison report
Parallel sweeps
LLM calls are I/O-bound, so --concurrency 8 usually gets a ~6–8× speedup on real datasets with no code changes. Sweep configs still run sequentially (one at a time) so progress output stays readable; parallelism fans out across dataset items within each config. All bundled adapters are thread-safe — if you write a custom adapter, see docs/writing_adapters.md.
Resumable runs
Pass --checkpoint path.jsonl (or set output.checkpoint in YAML) and every completed (config, question) pair appends to that file immediately. If the process dies or you Ctrl-C out, re-running with the same checkpoint skips the completed items — no wasted tokens. The checkpoint records each row's config_params; if you edit your sweep between runs, mismatched rows are flagged and re-run.
HTML reports
Set output.html: ./results/report.html in your config (or pass --html to rag-eval-kit report) and get a single self-contained HTML file with sortable tables, inline charts, and a text filter for per-question results. No external CSS/JS — the file is fully portable and can be shared as-is.
Comparing runs
After tweaking a retriever and re-running, use rag-eval-kit compare to diff two result CSVs:
rag-eval-kit compare results_v1/results_summary.csv results_v2/results_summary.csv --html diff.html
Configs are matched by parameter equality. Each metric gets an absolute delta, percentage change, and directional indicator (improved/regressed/unchanged). Latency and cost metrics are direction-aware — a decrease is an improvement.
Tag-based breakdown
If your dataset includes tags on EvalItem (e.g. {"topic": "physics"}), rag-eval-kit automatically groups per-question scores by tag and reports per-tag averages in results_tags.csv and the HTML report's "Tag Breakdown" section. No config changes needed — tags are detected from the data.
Python API
from rag_eval_kit import RAGSystem, RAGResult, EvalDataset
from rag_eval_kit.config import load_config
from rag_eval_kit.orchestrator import run_sweep
from rag_eval_kit.reporters import write_csv, write_charts, write_html
config = load_config("config.yaml")
result = run_sweep(config, no_confirm=True)
write_csv(result, "results/")
write_charts(result, "results/charts/")
write_html(result, "results/report.html")
Custom RAG Systems
Implement the RAGSystem protocol -- a single query method:
from rag_eval_kit import RAGSystem, RAGResult
class MyRAGSystem:
def query(self, question: str) -> RAGResult:
# Your retrieval + generation logic here
docs = my_retriever.search(question, top_k=5)
answer = my_llm.generate(question, context=docs)
return RAGResult(
answer=answer,
retrieved_docs=docs,
metadata={
"latency_ms": elapsed_ms,
"prompt_tokens": usage.input_tokens,
"completion_tokens": usage.output_tokens,
"model": "my-model",
"top_k": 5,
},
)
No inheritance required. Any object with a conforming query method works.
Metrics
| Metric | Type | Description |
|---|---|---|
exact_match |
Per-question | 1.0 if answer matches expected (case-insensitive) |
contains |
Per-question | 1.0 if expected answer appears as a substring of the answer |
f1_token |
Per-question | SQuAD-style token F1 between answer and expected |
rouge_l |
Per-question | ROUGE-L F-measure based on longest common subsequence |
llm_judge |
Per-question | LLM scores correctness 0.0-1.0 |
llm_faithfulness |
Per-question | LLM scores how grounded the answer is in retrieved docs |
precision_at_k |
Per-question | Fraction of retrieved docs in expected set |
recall_at_k |
Per-question | Fraction of expected docs found in top-k retrieved |
hit_rate_at_k |
Per-question | 1.0 if any expected doc appears in top-k |
mrr |
Per-question | Reciprocal rank of the first retrieved doc that hits |
ndcg_at_k |
Per-question | Binary-relevance nDCG over the top-k retrieved |
latency_p50 |
Aggregate | Median query latency |
latency_p95 |
Aggregate | 95th percentile query latency |
token_cost |
Aggregate | Total estimated cost from token counts |
Configuration Reference
| Section | Key | Default | Description |
|---|---|---|---|
dataset |
source |
jsonl |
jsonl, csv, or huggingface |
dataset |
path |
Path to dataset file (required for jsonl/csv) | |
dataset |
name |
HuggingFace dataset name (required for huggingface) | |
dataset |
split |
validation |
HuggingFace split to load |
dataset |
config_name |
HuggingFace dataset config/subset name | |
dataset |
limit |
Max questions to evaluate | |
dataset |
question_field |
question |
HuggingFace only — field name for the question text |
dataset |
answer_field |
answer |
HuggingFace only — field name for the expected answer |
dataset |
docs_field |
HuggingFace only — field name for ground-truth docs (retrieval metrics) | |
dataset |
trust_remote_code |
false |
HuggingFace only — allow datasets to run arbitrary code |
system |
adapter |
raw, langchain, llamaindex, r2r, or haystack |
|
system |
adapter_config |
Adapter-specific parameters (e.g. llm_provider, llm_model) |
|
sweep |
(any key) | Lists of values to sweep (Cartesian product) | |
metrics |
[exact_match, latency_p50, latency_p95] |
List of metric names (see below) | |
output |
csv |
./results/run_{timestamp}.csv |
CSV output path |
output |
charts |
./results/charts/ |
Charts output directory |
output |
html |
Self-contained HTML report path | |
output |
checkpoint |
JSONL checkpoint path for resumable runs | |
concurrency |
1 |
Parallel queries per config |
Parametrized metrics
Most metrics are plain strings. Metrics that accept parameters use a single-key dict:
metrics:
- exact_match # simple
- precision_at_k: # with parameter
k: 10
- llm_judge: # LLM-based scorer
provider: anthropic
model: claude-sonnet-4-20250514
- token_cost:
pricing:
claude-sonnet-4-20250514:
input_per_1k: 0.003
output_per_1k: 0.015
Retrieval metrics (precision_at_k, recall_at_k, hit_rate_at_k, mrr, ndcg_at_k) require dataset.docs_field to be set so ground-truth documents are available for comparison.
Adapters
| Adapter | Status | Install |
|---|---|---|
raw |
Implemented | pip install rag-eval-kit[anthropic] or [openai] |
langchain |
Implemented | pip install rag-eval-kit[langchain] |
llamaindex |
Implemented | pip install rag-eval-kit[llamaindex] |
r2r |
Implemented | pip install rag-eval-kit[r2r] |
haystack |
Implemented | pip install rag-eval-kit[haystack] |
Development
uv venv && uv pip install -e ".[dev]"
pytest # Run tests
ruff check . # Lint
mypy rag_eval_kit/ # Type check
License
MIT © 2026 Dennis Koshta
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file rag_eval_kit-0.4.0.tar.gz.
File metadata
- Download URL: rag_eval_kit-0.4.0.tar.gz
- Upload date:
- Size: 75.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3db4bab2018969ca3a3d8f9c4d42574435fdff5383472d84b3892480d34c18bf
|
|
| MD5 |
37de474f914214fea7dae26ccdaa90d5
|
|
| BLAKE2b-256 |
17263235321672a5779bd08268e0886737ab422c71c9a0e03b4bffb140cdf6cb
|
Provenance
The following attestation bundles were made for rag_eval_kit-0.4.0.tar.gz:
Publisher:
publish.yml on DennisKoshta/rag-eval-kit
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rag_eval_kit-0.4.0.tar.gz -
Subject digest:
3db4bab2018969ca3a3d8f9c4d42574435fdff5383472d84b3892480d34c18bf - Sigstore transparency entry: 1330388668
- Sigstore integration time:
-
Permalink:
DennisKoshta/rag-eval-kit@88bf8127cc784cb556c3691b48ebf49463414aaf -
Branch / Tag:
refs/tags/v0.4.0 - Owner: https://github.com/DennisKoshta
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@88bf8127cc784cb556c3691b48ebf49463414aaf -
Trigger Event:
push
-
Statement type:
File details
Details for the file rag_eval_kit-0.4.0-py3-none-any.whl.
File metadata
- Download URL: rag_eval_kit-0.4.0-py3-none-any.whl
- Upload date:
- Size: 55.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
464424696cd567ee26dca153bcde97f99a01d2dfe5f6305a10b7df78ae0dba09
|
|
| MD5 |
f21ece4a3612ed45bbe9c58ad31aa5c2
|
|
| BLAKE2b-256 |
88a054f6e6ddea10cb5b5f73f46f5bf3f1ed3848f2b404a70dd18004f0379958
|
Provenance
The following attestation bundles were made for rag_eval_kit-0.4.0-py3-none-any.whl:
Publisher:
publish.yml on DennisKoshta/rag-eval-kit
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rag_eval_kit-0.4.0-py3-none-any.whl -
Subject digest:
464424696cd567ee26dca153bcde97f99a01d2dfe5f6305a10b7df78ae0dba09 - Sigstore transparency entry: 1330388763
- Sigstore integration time:
-
Permalink:
DennisKoshta/rag-eval-kit@88bf8127cc784cb556c3691b48ebf49463414aaf -
Branch / Tag:
refs/tags/v0.4.0 - Owner: https://github.com/DennisKoshta
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@88bf8127cc784cb556c3691b48ebf49463414aaf -
Trigger Event:
push
-
Statement type: