Point your agent at your docs and your RAG app; get a golden test set + an LLM-as-judge & retrieval scorecard, in one command.
proofrag
Point your agent at your docs and your RAG app. Get a golden test set, an LLM-as-judge + retrieval scorecard, and a CI gate — in one command.
Evaluation is the #1 unmet pain in production RAG/LLM work, and the hardest part
is building a good test set in the first place. proofrag generates one from
your own corpus, judges your system on it, and emits a shareable HTML scorecard.
It's an Agent Skill (works in Claude Code, Codex, Cursor)
and a plain Python CLI — wrapping the eval loop, not reinventing the metrics.
See a scorecard in 5 seconds — no API key needed:
pipx install "proofrag[anthropic]" # or: pip install / uv tool install / uvx
proofrag demo --out scorecard.html && open scorecard.html
Use [openai] instead of [anthropic] for an OpenAI-compatible or local (Ollama) backend. No install? Run it ad-hoc: uvx "proofrag[anthropic]" demo.
Install as an Agent Skill
proofrag is a skill (the agentskills.io open standard) backed
by a real CLI — so any agent can run "evaluate my RAG" and get a reproducible scorecard.
Claude Code (plugin):
/plugin marketplace add unshDee/proofrag
/plugin install proofrag@proofrag
Then ask "evaluate my RAG" (auto-triggered) or type /proofrag.
Claude Code (manual) — cp -r skills/proofrag ~/.claude/skills/
Codex / other agents — cp -r skills/proofrag .agents/skills/
The skill drives the proofrag CLI; install it with uv tool install "proofrag[anthropic]"
(or pipx install, or run ad-hoc via uvx). See AGENTS.md for details.
Why this exists
"Running evals aren't the problem — the problem is acquiring or building a high-quality, non-contaminated dataset."
Most RAG systems reach production with no evals because writing a balanced golden set by hand is tedious. So teams ship prompt and model changes blind. This closes that loop: change something → re-run → see if quality moved → gate the merge.
The loop
# 1. Generate a golden set from YOUR docs (questions + gold answers + gold contexts)
proofrag generate --corpus ./docs --out goldenset.jsonl --n 20
# 2. Run your RAG over each question -> predictions.jsonl (one line per question)
# {"id": "q000", "answer": "...", "retrieved_contexts": ["...", "..."]}
# See examples/docs-rag/naive_rag.py for a runnable driver.
# 3. Judge: groundedness, correctness, completeness, citation quality + retrieval metrics
proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl --out results.json
# 4. Shareable HTML scorecard
proofrag report --results results.json --out scorecard.html
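Step 2 is the only part you write yourself. As a rough sketch of what such a driver could look like (not the bundled examples/docs-rag/naive_rag.py), the function below reads a golden set and writes predictions in the line format shown above. The golden-set field names "id" and "question" and the `rag` callable are assumptions made for illustration; check your generated goldenset.jsonl for the real keys.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for your own pipeline:
# takes a question, returns (answer, retrieved_contexts).
RagFn = Callable[[str], tuple[str, list[str]]]


def write_predictions(goldenset_path: str, out_path: str, rag: RagFn) -> int:
    """Run `rag` over every golden question; write one JSON object per line."""
    count = 0
    with Path(goldenset_path).open(encoding="utf-8") as src, \
            Path(out_path).open("w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines in the golden set
            item = json.loads(line)
            # Assumption: each golden-set row exposes "id" and "question".
            answer, contexts = rag(item["question"])
            row = {
                "id": item["id"],
                "answer": answer,
                "retrieved_contexts": contexts,
            }
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling write_predictions("goldenset.jsonl", "predictions.jsonl", my_rag) with your own my_rag function would then produce the file that step 3 consumes.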
Run the whole thing end-to-end against the bundled example:
uv sync --extra anthropic && export ANTHROPIC_API_KEY=...
uv run proofrag generate --corpus examples/docs-rag/corpus --out goldenset.jsonl --n 8
uv run python examples/docs-rag/naive_rag.py --goldenset goldenset.jsonl --corpus examples/docs-rag/corpus --out predictions.jsonl
uv run proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl --out results.json
uv run proofrag report --results results.json --out scorecard.html
CI gate
Two kinds of gate. An absolute floor:
proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl \
--out results.json --fail-under 0.7 # non-zero exit if overall score drops below 0.7
…and a regression gate against a committed baseline (a known-good results.json):
proofrag diff --baseline baseline.json --candidate results.json --tolerance 0.02
# prints a per-metric delta table; exits 1 if any metric dropped > tolerance.
# Refuses to compare across different judge models unless --allow-judge-mismatch.
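To make the regression rule concrete, here is a minimal sketch of the logic the comments above describe: compute a per-metric delta, flag any metric that dropped by more than the tolerance, and refuse to compare across judges unless told otherwise. The input shape (a "judge" string plus a "metrics" mapping) is a hypothetical simplification, not the actual results.json schema, and the function name is made up for illustration.

```python
def regression_gate(
    baseline: dict,
    candidate: dict,
    tolerance: float = 0.02,
    allow_judge_mismatch: bool = False,
) -> tuple[dict[str, float], list[str]]:
    """Return (per-metric deltas, names of metrics that regressed)."""
    # Scores from different judges are not comparable, so fail loudly.
    if baseline["judge"] != candidate["judge"] and not allow_judge_mismatch:
        raise ValueError(
            f"judge mismatch: {baseline['judge']!r} vs {candidate['judge']!r}"
        )

    deltas: dict[str, float] = {}
    regressed: list[str] = []
    # Only metrics present in both runs can be compared.
    shared = baseline["metrics"].keys() & candidate["metrics"].keys()
    for name in sorted(shared):
        delta = candidate["metrics"][name] - baseline["metrics"][name]
        deltas[name] = delta
        if -delta > tolerance:  # a drop larger than the tolerance
            regressed.append(name)
    return deltas, regressed
```

A CI wrapper around this would exit non-zero whenever the second return value is non-empty.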
GitHub Action
Drop proofrag into any repo's CI in a few lines — it installs the CLI, evaluates, writes the scorecard, and gates on both the floor and the baseline:
- uses: unshDee/proofrag@v0
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
with:
goldenset: eval/goldenset.jsonl
predictions: predictions.jsonl # produced by your RAG earlier in the job
baseline: eval/baseline.json # optional regression gate
fail-under: "0.7" # optional absolute gate
Full runnable workflow (with artifact upload): examples/ci/proofrag-eval.yml.
A/B: compare two RAG variants
Vector vs GraphRAG? Two prompts? Two models? Run both over the same golden set, then let the same judge pick the better answer per question — blind (answers shown in randomized order, so position bias is shuffled out):
proofrag compare --goldenset goldenset.jsonl \
--a vector_preds.jsonl --a-name vector \
--b graphrag_preds.jsonl --b-name graphrag \
--out comparison.json --html comparison.html
Deterministic retrieval metrics for each variant sit beside the verdict, so you can tell whether a win came from better retrieval or better generation.
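The blind comparison can be pictured with a short sketch: shuffle which variant appears first, ask the judge about "first" versus "second", then map the verdict back to the variant. Everything here is illustrative; the `judge` callable, the verdict strings, and the function names are assumptions, not proofrag's internals.

```python
import random
from typing import Callable

# Hypothetical judge: (question, first_answer, second_answer) -> verdict,
# where verdict is "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def blind_pair(answer_a: str, answer_b: str, rng: random.Random):
    """Return (first, second, order); order maps display slot to variant."""
    if rng.random() < 0.5:
        return answer_a, answer_b, ("a", "b")
    return answer_b, answer_a, ("b", "a")


def unblind(verdict: str, order: tuple[str, str]) -> str:
    """Translate the judge's slot-based verdict back to a variant label."""
    if verdict == "tie":
        return "tie"
    return order[0] if verdict == "first" else order[1]


def compare(pairs: list[tuple[str, str, str]], judge: Judge, seed: int = 0) -> dict:
    """Tally wins over (question, answer_a, answer_b) triples."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    wins = {"a": 0, "b": 0, "tie": 0}
    for question, ans_a, ans_b in pairs:
        first, second, order = blind_pair(ans_a, ans_b, rng)
        wins[unblind(judge(question, first, second), order)] += 1
    return wins
```

Because the judge only ever sees slot positions, a judge that systematically favors the first answer spreads that preference across both variants instead of rewarding one of them.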
What makes it different
- Golden set from your corpus (the wedge). Difficulty tiers: single-doc, multi-doc, and unanswerable (so you catch hallucination-instead-of-refusal).
- Retriever vs generator split: rank-aware retrieval metrics (Recall@k, Precision@k, NDCG@k, MRR) separate "the context never arrived / ranked too low" from "the model fluffed it." Lexical by default; --semantic for embedding match.
- Pinned, fingerprinted judge: every scorecard records its judge model, so you never compare scores produced by different judges.
- Cheap & portable: defaults to a small model; Anthropic, OpenAI, or local/Ollama (OPENAI_BASE_URL). Self-contained HTML, zero JS, zero external assets.
- Agent-native: drop it in as a skill and say "evaluate my RAG"; the agent wires your pipeline to the kit.
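For readers who want to see what the rank-aware retrieval metrics named above measure, here is a small self-contained sketch using their textbook definitions with binary relevance. The lexical matching rule (case- and whitespace-insensitive substring containment, each gold context counted once) is an assumption chosen for illustration; proofrag's actual matching rule may differ.

```python
import math


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace for a crude lexical comparison."""
    return " ".join(text.lower().split())


def relevance_flags(retrieved: list[str], gold: list[str]) -> list[bool]:
    """Mark each retrieved chunk as relevant if it matches an unmatched gold context."""
    remaining = [g for g in (_norm(g) for g in gold) if g]
    flags = []
    for chunk in retrieved:
        c = _norm(chunk)
        hit = next((g for g in remaining if c and (g in c or c in g)), None)
        if hit is not None:
            remaining.remove(hit)  # count each gold context at most once
        flags.append(hit is not None)
    return flags


def recall_at_k(flags: list[bool], n_gold: int, k: int) -> float:
    return sum(flags[:k]) / n_gold if n_gold else 0.0


def precision_at_k(flags: list[bool], k: int) -> float:
    return sum(flags[:k]) / k if k else 0.0


def mrr(flags: list[bool]) -> float:
    """Reciprocal rank of the first relevant chunk, 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(flags: list[bool], n_gold: int, k: int) -> float:
    """Binary-gain NDCG: discounted hits divided by the best achievable ordering."""
    dcg = sum(1.0 / math.log2(i + 2) for i, f in enumerate(flags[:k]) if f)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, n_gold)))
    return dcg / idcg if idcg else 0.0
```

A gold context that shows up at rank 2 of 3 scores full Recall@3 but only 0.5 MRR, which is the "ranked too low" signal: retrieval found the passage, yet not where the generator is most likely to use it.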
Configuration
| Env | Default | Purpose |
|---|---|---|
| ANTHROPIC_API_KEY | (none) | Anthropic backend (default) |
| OPENAI_API_KEY / OPENAI_BASE_URL | (none) | OpenAI-compatible / local |
| PROOFRAG_PROVIDER | auto | anthropic or openai |
| PROOFRAG_MODEL | Haiku / gpt-4o-mini | judge & generator model |
| PROOFRAG_EMBED_MODEL | text-embedding-3-small | embeddings for --semantic retrieval match |
Contributing
Issues and PRs welcome — see CONTRIBUTING.md. MIT licensed.