proofrag

CI Python 3.11+ License: MIT

Point your agent at your docs and your RAG app. Get a golden test set, an LLM-as-judge + retrieval scorecard, and a CI gate — in one command.

Evaluation is the #1 unmet pain in production RAG/LLM work, and the hardest part is building a good test set in the first place. proofrag generates one from your own corpus, judges your system on it, and emits a shareable HTML scorecard. It's an Agent Skill (works in Claude Code, Codex, Cursor) and a plain Python CLI — wrapping the eval loop, not reinventing the metrics.

[Image: proofrag generating a golden set, judging, and scoring in one loop]

…and the scorecard it produces:

[Image: RAG eval scorecard]

Try it now — no API key needed:

git clone https://github.com/unshDee/proofrag && cd proofrag
uv run proofrag demo --out scorecard.html && open scorecard.html

These commands use uv: uv run creates the environment on first call, so there is nothing else to install. Prefer pip tooling? Run pipx install proofrag.

Install as an Agent Skill

proofrag is a skill (the agentskills.io open standard) backed by a real CLI — so any agent can run "evaluate my RAG" and get a reproducible scorecard.

Claude Code (plugin):

/plugin marketplace add unshDee/proofrag
/plugin install proofrag@proofrag

Then ask "evaluate my RAG" (auto-triggered) or type /proofrag.

Claude Code (manual):

cp -r skills/proofrag ~/.claude/skills/

Codex / other agents:

cp -r skills/proofrag .agents/skills/

The skill drives the proofrag CLI; install it with uv tool install "proofrag[anthropic]" (or pipx install, or run ad-hoc via uvx). See AGENTS.md for details.

Why this exists

"Running evals aren't the problem — the problem is acquiring or building a high-quality, non-contaminated dataset."

Most RAG systems reach production with no evals because writing a balanced golden set by hand is tedious. So teams ship prompt and model changes blind. This closes that loop: change something → re-run → see if quality moved → gate the merge.

The loop

# 1. Generate a golden set from YOUR docs (questions + gold answers + gold contexts)
proofrag generate --corpus ./docs --out goldenset.jsonl --n 20

# 2. Run your RAG over each question -> predictions.jsonl  (one line per question)
#    {"id": "q000", "answer": "...", "retrieved_contexts": ["...", "..."]}
#    See examples/docs-rag/naive_rag.py for a runnable driver.

# 3. Judge: groundedness, correctness, completeness, citation quality + retrieval metrics
proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl --out results.json

# 4. Shareable HTML scorecard
proofrag report --results results.json --out scorecard.html
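Step 2 is the only part you write yourself, so a minimal sketch of a driver may help. It assumes each golden set line is a JSON object with "id" and "question" fields (the "question" field name is an assumption, not confirmed by the schema shown above), and `rag` is a hypothetical stand-in for your own pipeline. The bundled examples/docs-rag/naive_rag.py remains the reference driver.

```python
import json
from typing import Callable

# Hypothetical signature for your own RAG entry point:
# takes a question, returns (answer, retrieved_contexts).
RagFn = Callable[[str], tuple[str, list[str]]]


def write_predictions(goldenset_path: str, out_path: str, rag: RagFn) -> int:
    """Run `rag` over every golden question and write one JSON line per answer."""
    count = 0
    with open(goldenset_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:  # tolerate blank lines in the JSONL file
                continue
            item = json.loads(line)
            # ASSUMPTION: the question text lives under "question".
            answer, contexts = rag(item["question"])
            record = {
                "id": item["id"],
                "answer": answer,
                "retrieved_contexts": list(contexts),
            }
            dst.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Keeping the retrieved contexts in ranked order matters, since the retrieval metrics are rank-aware.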

Run the whole thing end-to-end against the bundled example:

uv sync --extra anthropic && export ANTHROPIC_API_KEY=...
uv run proofrag generate --corpus examples/docs-rag/corpus --out goldenset.jsonl --n 8
uv run python examples/docs-rag/naive_rag.py --goldenset goldenset.jsonl --corpus examples/docs-rag/corpus --out predictions.jsonl
uv run proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl --out results.json
uv run proofrag report --results results.json --out scorecard.html

CI gate

Two kinds of gate. An absolute floor:

proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl \
  --out results.json --fail-under 0.7      # non-zero exit if overall score drops below 0.7

…and a regression gate against a committed baseline (a known-good results.json):

proofrag diff --baseline baseline.json --candidate results.json --tolerance 0.02
# prints a per-metric delta table; exits 1 if any metric dropped > tolerance.
# Refuses to compare across different judge models unless --allow-judge-mismatch.
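The logic behind the two gates can be sketched roughly as follows. This is an illustration of the idea, not proofrag's implementation: it assumes scores are available as a flat mapping of metric name to float, which may not match the real results.json layout, and all function and parameter names here are hypothetical.

```python
from typing import Mapping, Optional


def floor_gate(overall: float, fail_under: float) -> bool:
    """Absolute floor: pass only if the overall score is not below the threshold."""
    return overall >= fail_under


def regression_gate(
    baseline: Mapping[str, float],
    candidate: Mapping[str, float],
    tolerance: float = 0.02,
    baseline_judge: Optional[str] = None,
    candidate_judge: Optional[str] = None,
    allow_judge_mismatch: bool = False,
) -> tuple[dict[str, float], list[str]]:
    """Return (per-metric deltas, metrics that dropped by more than `tolerance`)."""
    # Scores from different judge models are not comparable, so refuse by default.
    if baseline_judge != candidate_judge and not allow_judge_mismatch:
        raise ValueError(
            f"judge mismatch: {baseline_judge!r} vs {candidate_judge!r}"
        )
    # Compare only metrics present in both runs (an assumption of this sketch).
    shared = sorted(baseline.keys() & candidate.keys())
    deltas = {name: candidate[name] - baseline[name] for name in shared}
    regressed = [name for name, delta in deltas.items() if delta < -tolerance]
    return deltas, regressed


def exit_code(regressed: list[str]) -> int:
    """Map the gate result to a process exit status for CI."""
    return 1 if regressed else 0
```

The tolerance absorbs judge noise between runs; a metric has to fall by more than that margin before the merge is blocked.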

GitHub Action

Drop proofrag into any repo's CI in a few lines — it installs the CLI, evaluates, writes the scorecard, and gates on both the floor and the baseline:

- uses: unshDee/proofrag@v0
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  with:
    goldenset: eval/goldenset.jsonl
    predictions: predictions.jsonl     # produced by your RAG earlier in the job
    baseline: eval/baseline.json        # optional regression gate
    fail-under: "0.7"                   # optional absolute gate

Full runnable workflow (with artifact upload): examples/ci/proofrag-eval.yml.

What makes it different

  • Golden set from your corpus — the wedge. Difficulty tiers: single-doc, multi-doc, and unanswerable (so you catch hallucination-instead-of-refusal).
  • Retriever vs generator split — rank-aware retrieval metrics (Recall@k, Precision@k, NDCG@k, MRR) separate "the context never arrived / ranked too low" from "the model fluffed it." Lexical by default; --semantic for embedding match.
  • Pinned, fingerprinted judge — every scorecard records its judge model, so you never compare scores produced by different judges.
  • Cheap & portable — defaults to a small model; Anthropic, OpenAI, or local/Ollama (OPENAI_BASE_URL). Self-contained HTML, zero JS, zero external assets.
  • Agent-native — drop it in as a skill and say "evaluate my RAG"; the agent wires your pipeline to the kit.
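For readers unfamiliar with the rank-aware retrieval metrics named above, here is a minimal sketch using their standard binary-relevance definitions. The matching rule (exact match after lowercasing and whitespace normalization) is an assumption made for illustration; proofrag's own lexical and --semantic matching may work differently, and the function names are hypothetical.

```python
import math


def _norm(text: str) -> str:
    # ASSUMPTION: a context "matches" on exact text after case/whitespace normalization.
    return " ".join(text.lower().split())


def relevance_flags(retrieved: list[str], gold: list[str], k: int) -> list[bool]:
    """Mark each of the top-k retrieved contexts as relevant or not.

    A gold context is counted only the first time it appears, so duplicates
    in the retrieved list cannot inflate the scores.
    """
    gold_norm = {_norm(g) for g in gold}
    seen: set[str] = set()
    flags = []
    for ctx in retrieved[:k]:
        key = _norm(ctx)
        hit = key in gold_norm and key not in seen
        if hit:
            seen.add(key)
        flags.append(hit)
    return flags


def recall_at_k(retrieved: list[str], gold: list[str], k: int) -> float:
    n_gold = len({_norm(g) for g in gold})
    if n_gold == 0:
        return 0.0
    return sum(relevance_flags(retrieved, gold, k)) / n_gold


def precision_at_k(retrieved: list[str], gold: list[str], k: int) -> float:
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance_flags(retrieved, gold, k)) / k


def mrr(retrieved: list[str], gold: list[str]) -> float:
    """Reciprocal rank of the first relevant context (0.0 if none is retrieved)."""
    flags = relevance_flags(retrieved, gold, len(retrieved))
    for rank, hit in enumerate(flags, start=1):
        if hit:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], gold: list[str], k: int) -> float:
    flags = relevance_flags(retrieved, gold, k)
    dcg = sum(1.0 / math.log2(rank + 1) for rank, hit in enumerate(flags, start=1) if hit)
    # Ideal ranking puts every gold context at the top of the list.
    ideal_hits = min(len({_norm(g) for g in gold}), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Low recall points at the retriever (the gold context never arrived), while high recall with low NDCG or MRR suggests it arrived but ranked too low; if both look healthy and answers are still wrong, the generator is the likelier culprit.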

Configuration

| Env | Default | Purpose |
| --- | --- | --- |
| ANTHROPIC_API_KEY | | Anthropic backend (default) |
| OPENAI_API_KEY / OPENAI_BASE_URL | | OpenAI-compatible / local |
| PROOFRAG_PROVIDER | auto | anthropic or openai |
| PROOFRAG_MODEL | Haiku / gpt-4o-mini | judge & generator model |
| PROOFRAG_EMBED_MODEL | text-embedding-3-small | embeddings for --semantic retrieval match |
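One plausible reading of how these variables interact is sketched below. The resolution order for auto (prefer Anthropic when its key is set, otherwise fall back to an OpenAI-compatible endpoint) is an assumption inferred from the table, not documented behaviour, and `resolve_provider` is a hypothetical helper rather than part of the proofrag API.

```python
import os
from typing import Mapping


def resolve_provider(env: Mapping[str, str] = os.environ) -> str:
    """Pick a backend name from environment variables (illustrative only)."""
    choice = env.get("PROOFRAG_PROVIDER", "auto").strip().lower()
    if choice in ("anthropic", "openai"):
        return choice  # an explicit setting always wins
    if choice != "auto":
        raise ValueError(f"unknown PROOFRAG_PROVIDER: {choice!r}")
    # ASSUMPTION: "auto" prefers Anthropic, since the table marks it as the default.
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    # An OpenAI-compatible server (for example a local one) may need only a base URL.
    if env.get("OPENAI_API_KEY") or env.get("OPENAI_BASE_URL"):
        return "openai"
    raise RuntimeError("no provider credentials found in the environment")
```

For a local Ollama server, OPENAI_BASE_URL would typically point at its OpenAI-compatible endpoint, http://localhost:11434/v1.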

Roadmap

  • v0.1 — golden-set generator, LLM-as-judge, retrieval recall, HTML scorecard, CI gate
  • v0.2 — rank-aware retrieval metrics (Recall@k / Precision@k / NDCG@k / MRR), lexical + optional embedding match
  • v0.3 — GitHub Action + baseline diffing (regression-aware gate)
  • v0.4 — A/B comparator (vector vs GraphRAG) with blind judging
  • v0.5 — Ragas / DeepEval backends as pluggable scorers

Issues and PRs welcome. MIT licensed.
