Point your agent at your docs and your RAG app; get a golden test set + an LLM-as-judge & retrieval scorecard, in one command.

Maintainers

unshDee

proofrag

Point your agent at your docs and your RAG app. Get a golden test set, an LLM-as-judge + retrieval scorecard, and a CI gate — in one command.

Evaluation is the #1 unmet pain in production RAG/LLM work, and the hardest part is building a good test set in the first place. proofrag generates one from your own corpus, judges your system on it, and emits a shareable HTML scorecard. It's an Agent Skill (works in Claude Code, Codex, Cursor) and a plain Python CLI — wrapping the eval loop, not reinventing the metrics.

[Image: proofrag generates a golden set, judges, and scores in one loop]

…and the scorecard it produces:

[Image: RAG eval scorecard]

See a scorecard in 5 seconds — no API key needed:

pipx install "proofrag[anthropic]"        # or: pip install / uv tool install / uvx
proofrag demo --out scorecard.html && open scorecard.html   # "open" is macOS; use xdg-open on Linux

Use [openai] instead of [anthropic] for an OpenAI-compatible or local (Ollama) backend. No install? Run it ad-hoc: uvx "proofrag[anthropic]" demo.

Install as an Agent Skill

proofrag is a skill (the agentskills.io open standard) backed by a real CLI — so any agent can run "evaluate my RAG" and get a reproducible scorecard.

Claude Code (plugin):

/plugin marketplace add unshDee/proofrag
/plugin install proofrag@proofrag

Then ask "evaluate my RAG" (auto-triggered) or type /proofrag.

Claude Code (manual): cp -r skills/proofrag ~/.claude/skills/
Codex / other agents: cp -r skills/proofrag .agents/skills/

The skill drives the proofrag CLI; install it with uv tool install "proofrag[anthropic]" (or pipx install, or run ad-hoc via uvx). See AGENTS.md for details.

Why this exists

"Running evals aren't the problem — the problem is acquiring or building a high-quality, non-contaminated dataset."

Most RAG systems reach production with no evals because writing a balanced golden set by hand is tedious, so teams ship prompt and model changes blind. proofrag closes that loop: change something → re-run → see if quality moved → gate the merge.

The loop

# 1. Generate a golden set from YOUR docs (questions + gold answers + gold contexts)
proofrag generate --corpus ./docs --out goldenset.jsonl --n 20

# 2. Validate it before committing it
proofrag validate --goldenset goldenset.jsonl --corpus ./docs --out validation.json

# 3. Run your RAG over each question -> predictions.jsonl
proofrag run --goldenset goldenset.jsonl --endpoint http://localhost:8000/ask --out predictions.jsonl
# or: proofrag run --goldenset goldenset.jsonl --callable myapp.rag:answer --out predictions.jsonl

# 4. Judge: groundedness, correctness, completeness, citation quality + retrieval metrics
proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl --out results.json

# 5. Shareable HTML scorecard
proofrag report --results results.json --out scorecard.html

# Optional: Markdown summary for CI logs / job summaries
proofrag summary --results results.json

Run the whole thing end-to-end against the bundled example:

uv sync --extra anthropic && export ANTHROPIC_API_KEY=...
uv run proofrag generate --corpus examples/docs-rag/corpus --out goldenset.jsonl --n 8
uv run proofrag validate --goldenset goldenset.jsonl --corpus examples/docs-rag/corpus
uv run python examples/docs-rag/naive_rag.py --goldenset goldenset.jsonl --corpus examples/docs-rag/corpus --out predictions.jsonl
uv run proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl --out results.json
uv run proofrag report --results results.json --out scorecard.html

Corpus loading

Before generating a golden set, inspect what proofrag will actually read:

proofrag corpus ./docs
proofrag corpus ./docs --include "**/*.md" --exclude "drafts/**"

Corpus loading skips noisy directories by default (.git, .venv, node_modules, dist, build, caches) and honors .gitignore patterns. Use --no-gitignore to disable .gitignore filtering. The same --include, --exclude, --no-gitignore, and --chunk-chars flags work on proofrag generate.
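To make the filtering rules concrete, here is a minimal Python sketch of directory pruning plus include/exclude globs. It is an illustration under stated assumptions, not proofrag's loader: the names `iter_corpus_files` and `SKIP_DIRS` are hypothetical, .gitignore handling is left out, and `fnmatch` globs are simpler than the `**` patterns the CLI accepts (here `*` also matches `/`).

```python
import fnmatch
import os
import tempfile
from pathlib import Path

# Hypothetical skip list modelled on the directories named above.
SKIP_DIRS = {".git", ".venv", "node_modules", "dist", "build", "__pycache__"}


def iter_corpus_files(root, include=("*",), exclude=()):
    """Yield file paths under root (relative, POSIX-style) that pass the filters."""
    root = Path(root)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune noisy directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            rel = (Path(dirpath) / name).relative_to(root).as_posix()
            if not any(fnmatch.fnmatch(rel, pat) for pat in include):
                continue
            if any(fnmatch.fnmatch(rel, pat) for pat in exclude):
                continue
            yield rel


# Build a throwaway tree and list it twice: unfiltered, then Markdown only.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("guide.md", "notes.txt", "drafts/wip.md", "node_modules/pkg/readme.md"):
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder")
    everything = list(iter_corpus_files(tmp))
    markdown_only = list(iter_corpus_files(tmp, include=("*.md",), exclude=("drafts/*",)))

print(everything)
print(markdown_only)
```

Pruning `dirnames` in place is what keeps the walk from ever opening `node_modules`; filtering afterwards would give the same list but read far more of the tree.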

Supported inputs include Markdown, plain text, reStructuredText, MDX, common code files, and HTML. PDF loading is optional:

pip install "proofrag[pdf]"
proofrag corpus ./docs

Generated golden sets include context_metadata for each gold context, preserving source path, chunk id, chunk index, character count, and extension.

Golden set validation

Generated eval sets should be reviewed before they become a committed baseline. proofrag validate checks the JSONL schema, duplicate ids/questions, answerable cases without gold contexts, unanswerable cases that still cite context, difficulty tiers, source coverage, and a stable file fingerprint:

proofrag validate --goldenset goldenset.jsonl --corpus ./docs --out validation.json

It exits non-zero on hard errors. Add --strict to fail on warnings too when you want CI to enforce review hygiene.
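A few of these checks can be sketched in plain Python to show what "hard error" versus "warning" might look like. This is an assumption-laden illustration: the field names `id`, `question`, `answerable`, and `gold_contexts` are hypothetical, and which findings count as errors rather than warnings is a guess, so check your own goldenset.jsonl and the CLI output for the real behavior.

```python
import json
from collections import Counter


def load_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def validate_records(records):
    """Return (errors, warnings) for a list of golden-set records."""
    errors, warnings = [], []
    # Duplicate ids or questions (hypothetical field names).
    for field in ("id", "question"):
        counts = Counter(r.get(field) for r in records)
        errors += [f"duplicate {field}: {v!r} (x{n})" for v, n in counts.items() if n > 1]
    for r in records:
        contexts = r.get("gold_contexts") or []
        answerable = r.get("answerable", True)
        if answerable and not contexts:
            errors.append(f"{r.get('id')}: answerable case has no gold contexts")
        if not answerable and contexts:
            warnings.append(f"{r.get('id')}: unanswerable case still cites context")
    return errors, warnings


def exit_code(errors, warnings, strict=False):
    """Non-zero on errors; in strict mode, warnings fail too."""
    return 1 if errors or (strict and warnings) else 0


sample = "\n".join(json.dumps(r) for r in [
    {"id": "q1", "question": "Which port does the server use?",
     "answerable": True, "gold_contexts": ["The server listens on port 8000."]},
    {"id": "q2", "question": "What is the office dress code?",
     "answerable": False, "gold_contexts": ["Unrelated passage."]},
])
errors, warnings = validate_records(load_jsonl(sample))
print(errors, warnings)
print(exit_code(errors, warnings), exit_code(errors, warnings, strict=True))
```

The sample passes by default and fails in strict mode, which mirrors the split between a normal run and a run with --strict.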

Prediction adapters

The only app-specific step is producing predictions.jsonl. You can still write your own driver, but most projects can start with proofrag run:

# HTTP: proofrag POSTs {"id": "...", "question": "..."}
proofrag run --goldenset goldenset.jsonl \
  --endpoint http://localhost:8000/ask \
  --header "Authorization: Bearer $TOKEN" \
  --out predictions.jsonl

# Python: calls myapp.rag.answer(question)
proofrag run --goldenset goldenset.jsonl \
  --callable myapp.rag:answer \
  --out predictions.jsonl

# Python record mode: calls myapp.rag.answer(full_golden_record)
proofrag run --goldenset goldenset.jsonl \
  --callable myapp.rag:answer --call-style record \
  --out predictions.jsonl

Adapters may return an answer string, a tuple like (answer, contexts), or a dict like {"answer": "...", "retrieved_contexts": ["...", "..."]}. The endpoint form accepts the same JSON response shape. See examples/docs-rag/naive_rag.py for a fully custom driver.
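As a rough illustration of the two adapter shapes, the sketch below shows a callable that returns the dict form and a request handler that accepts the POSTed JSON and returns the same shape. Everything app-side is hypothetical: `DOCS`, `retrieve`, and `handle_post` are toy stand-ins, and the word-overlap ranking only exists to make the example runnable. Only the request keys (`id`, `question`) and response keys (`answer`, `retrieved_contexts`) come from the description above.

```python
import json

# Toy in-memory "index"; a real app would query its own retriever here.
DOCS = {
    "ports": "The API server listens on port 8000.",
    "auth": "Requests are authenticated with a bearer token.",
}


def retrieve(question, k=2):
    """Rank documents by naive word overlap with the question."""
    words = set(question.lower().split())
    ranked = sorted(DOCS.values(), key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]


def answer(question):
    """Callable-style adapter: takes the question text, returns the dict shape."""
    contexts = retrieve(question)
    # A real app would generate an answer from the contexts; we echo the top one.
    return {"answer": contexts[0], "retrieved_contexts": contexts}


def handle_post(body):
    """Endpoint-style adapter: JSON request bytes in, JSON response bytes out."""
    payload = json.loads(body)  # expected keys: "id" and "question"
    return json.dumps(answer(payload["question"])).encode("utf-8")


result = answer("Which port does the server use?")
print(result["answer"])
print(handle_post(b'{"id": "q1", "question": "Which port does the server use?"}'))
```

If this module lived at myapp/rag.py, the callable form would line up with --callable myapp.rag:answer. Wiring `handle_post` into an actual web framework is left to the app.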

CI gate

Two kinds of gate. An absolute floor:

proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl \
  --out results.json --fail-under 0.7      # non-zero exit if overall score drops below 0.7

…and a regression gate against a committed baseline (a known-good results.json):

proofrag diff --baseline baseline.json --candidate results.json --tolerance 0.02
# prints a per-metric delta table; exits 1 if any metric dropped > tolerance.
# Refuses to compare across different judge models unless --allow-judge-mismatch.
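The logic behind the two gates is simple enough to sketch. The code below is a minimal illustration, assuming a results file that holds a `judge_model` string and a flat `metrics` mapping; that layout, the metric names, and the function names are hypothetical and are not taken from proofrag's actual results.json.

```python
def passes_floor(overall, fail_under):
    """Absolute gate: pass when the overall score is at or above the floor."""
    return overall >= fail_under


def diff_results(baseline, candidate, tolerance=0.02, allow_judge_mismatch=False):
    """Regression gate: return rows of (metric, base, cand, delta, regressed)."""
    if baseline["judge_model"] != candidate["judge_model"] and not allow_judge_mismatch:
        raise ValueError("judge models differ; scores are not comparable")
    rows = []
    shared = sorted(baseline["metrics"].keys() & candidate["metrics"].keys())
    for name in shared:
        base, cand = baseline["metrics"][name], candidate["metrics"][name]
        # Rounding removes float noise so a drop of exactly `tolerance` passes.
        delta = round(cand - base, 9)
        rows.append((name, base, cand, delta, delta < -tolerance))
    return rows


# Hypothetical results; "judge-a" is a placeholder, not a real model name.
baseline = {"judge_model": "judge-a",
            "metrics": {"correctness": 0.80, "groundedness": 0.90, "recall_at_5": 0.75}}
candidate = {"judge_model": "judge-a",
             "metrics": {"correctness": 0.74, "groundedness": 0.91, "recall_at_5": 0.74}}

rows = diff_results(baseline, candidate)
for name, base, cand, delta, regressed in rows:
    print(f"{name:<14} {base:.2f} -> {cand:.2f}  {delta:+.2f}  {'FAIL' if regressed else 'ok'}")

regressed_metrics = [row[0] for row in rows if row[4]]
exit_code = 1 if regressed_metrics else 0
print(exit_code)
```

A floor catches a system that is bad in absolute terms, while the baseline diff catches a system that is still above the floor but has quietly become worse, so the two gates complement each other.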

GitHub Action

Drop proofrag into any repo's CI in a few lines — it installs the CLI, evaluates, writes the scorecard, adds a GitHub Actions job summary, uploads the scorecard and results as an artifact, and gates on both the floor and the baseline:

- uses: unshDee/proofrag@v0
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  with:
    goldenset: eval/goldenset.jsonl
    predictions: predictions.jsonl     # produced by your RAG earlier in the job
    baseline: eval/baseline.json        # optional regression gate
    fail-under: "0.7"                   # optional absolute gate

Full runnable workflow: examples/ci/proofrag-eval.yml.

The artifact and job summary are on by default. Disable them with upload-artifact: "false" or summary: "false" if your workflow handles those separately.

A/B: compare two RAG variants

Vector vs GraphRAG? Two prompts? Two models? Run both over the same golden set, then let the same judge pick the better answer per question. The comparison is blind: answers are shown in randomized order, so position bias averages out across questions:

proofrag compare --goldenset goldenset.jsonl \
  --a vector_preds.jsonl  --a-name vector \
  --b graphrag_preds.jsonl --b-name graphrag \
  --out comparison.json --html comparison.html

[Image: blind A/B comparison report]

Deterministic retrieval metrics for each variant sit beside the verdict, so you can tell whether a win came from better retrieval or better generation.
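The blinding step can be sketched as follows. This is an illustration of the general technique, not proofrag's implementation: `blind_pair`, `compare`, and `stub_judge` are hypothetical names, and the stub judge (which simply prefers the longer answer) stands in for the LLM judge so the example runs offline.

```python
import random


def blind_pair(answer_a, answer_b, rng):
    """Randomize presentation order; the key maps each slot back to its variant."""
    swap = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swap else (answer_a, answer_b)
    key = {"first": "b", "second": "a"} if swap else {"first": "a", "second": "b"}
    return first, second, key


def stub_judge(first, second):
    """Stand-in for an LLM judge: prefers the longer answer, ties on equal length."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


def compare(pairs, judge, seed=0):
    """Judge each pair blind, then unblind the verdicts into a per-variant tally."""
    rng = random.Random(seed)
    tally = {"a": 0, "b": 0, "tie": 0}
    for answer_a, answer_b in pairs:
        first, second, key = blind_pair(answer_a, answer_b, rng)
        verdict = judge(first, second)  # the judge only ever sees slot positions
        tally[key.get(verdict, "tie")] += 1
    return tally


pairs = [
    ("Port 8000.", "The server listens on port 8000 by default."),
    ("Tokens expire after one hour, then must be refreshed.", "One hour."),
    ("Yes.", "Nope"),
]
print(compare(pairs, stub_judge))
```

Because the judge never learns which variant sits in which slot, a judge that systematically favors the first position would spread that preference roughly evenly over both variants instead of rewarding one of them.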

What makes it different

Golden set from your corpus — the wedge. Difficulty tiers: single-doc, multi-doc, and unanswerable (so you catch hallucination-instead-of-refusal).
Golden set validation — schema checks, duplicate detection, source coverage, and a stable fingerprint help teams review generated evals before committing them.
Retriever vs generator split — rank-aware retrieval metrics (Recall@k, Precision@k, NDCG@k, MRR) separate "the context never arrived / ranked too low" from "the model fluffed it." Lexical by default; --semantic for embedding match.
Pinned, fingerprinted judge — every scorecard records its judge model, so you never compare scores produced by different judges.
Cheap & portable — defaults to a small model; Anthropic, OpenAI, or local/Ollama (OPENAI_BASE_URL). Self-contained HTML, zero JS, zero external assets.
Prediction adapters — proofrag run can call an HTTP endpoint or Python callable so teams do not need to hand-write predictions.jsonl glue on day one.
CI-native output — the GitHub Action writes a markdown job summary and uploads the HTML scorecard/results artifact automatically, including when a gate fails.
Agent-native — drop it in as a skill and say "evaluate my RAG"; the agent wires your pipeline to the kit.
Pluggable scoring backends — swap proofrag's own judge for DeepEval or Ragas without changing the workflow, scorecard, CI gate, or A/B flow.
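For readers unfamiliar with the rank-aware retrieval metrics named above, here is a minimal sketch using their standard binary-relevance definitions. It is not proofrag's code: it matches contexts by exact id, whereas proofrag matches lexically or by embedding, and the chunk ids are made up.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Share of the gold contexts that appear in the top k."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(retrieved, relevant, k):
    """Share of the top k slots that hold a gold context."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def mrr(retrieved, relevant):
    """Reciprocal rank of the first gold context (0 if none was retrieved)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Discounted gain of the top k, normalized by the best possible ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


retrieved = ["c3", "c1", "c7", "c2"]   # ranked output of a hypothetical retriever
relevant = {"c1", "c2"}                # gold contexts for the question

print(recall_at_k(retrieved, relevant, 3))
print(precision_at_k(retrieved, relevant, 3))
print(mrr(retrieved, relevant))
print(ndcg_at_k(retrieved, relevant, 3))
```

In this example one gold context arrives at rank 2 and the other only at rank 4, so Recall@3 is 0.5 and MRR is 0.5. Low recall points at "the context never arrived", while decent recall with low MRR or NDCG points at "ranked too low".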

Scoring backends

By default proofrag judges generation with its own pinned LLM-as-judge. You can swap in an external library instead — the retrieval metrics, scorecard, diff, and compare all stay the same; only the generation metrics change.

pip install "proofrag[deepeval]"
proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl \
  --backend deepeval --out results.json
# generation metrics become: faithfulness, answer_relevancy, correctness (GEval)

pip install "proofrag[ragas]"
proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl \
  --backend ragas --out results.json
# generation metrics become: faithfulness, factual_correctness
# plus answer_relevancy when OpenAI-compatible embeddings are configured

The DeepEval judge uses the same model config as proofrag (ANTHROPIC_API_KEY → AnthropicModel, OPENAI_API_KEY → GPTModel). Verified against deepeval 4.0.6. Metric reasons are preserved in the scorecard's weakest-case notes when DeepEval provides them.

The Ragas backend is verified against ragas 0.4.3. It uses proofrag's configured LLM provider for faithfulness and factual correctness. Ragas answer relevancy needs embeddings, so it is enabled when OPENAI_API_KEY or OPENAI_BASE_URL is set.

Providers

proofrag is provider-agnostic. Set one of these and everything — generate, judge, compare, and the DeepEval/Ragas backends — uses it:

Provider	How to enable	Notes
Anthropic (default)	`ANTHROPIC_API_KEY`	cheap Haiku judge by default
OpenAI	`OPENAI_API_KEY`
OpenAI-compatible / local	`OPENAI_BASE_URL` (e.g. Ollama, vLLM, LM Studio)	API key optional — local servers accept any token

--semantic retrieval matching uses embeddings, which only exist on the OpenAI-compatible path (Anthropic has no embeddings API), so it needs OPENAI_API_KEY or OPENAI_BASE_URL even when your judge is Anthropic.

Environment

Env	Default	Purpose
`ANTHROPIC_API_KEY`	—	Anthropic provider
`OPENAI_API_KEY`	—	OpenAI provider
`OPENAI_BASE_URL`	—	OpenAI-compatible / local endpoint (key optional)
`PROOFRAG_PROVIDER`	auto	force `anthropic` or `openai`
`PROOFRAG_MODEL`	Haiku / gpt-4o-mini	judge & generator model
`PROOFRAG_EMBED_MODEL`	text-embedding-3-small	embedding model for `--semantic`
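One way to picture how these variables might interact is the sketch below. The precedence shown (explicit `PROOFRAG_PROVIDER` first, then Anthropic, then the OpenAI-compatible path) is an assumption based on Anthropic being listed as the default; the function names are hypothetical and the real resolution order may differ.

```python
def resolve_provider(env):
    """Pick a provider from an environment mapping (assumed precedence)."""
    forced = env.get("PROOFRAG_PROVIDER", "auto")
    if forced != "auto":
        return forced
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_API_KEY") or env.get("OPENAI_BASE_URL"):
        return "openai"
    return None  # nothing configured


def can_use_semantic(env):
    """Embeddings need the OpenAI-compatible path, whatever the judge provider is."""
    return bool(env.get("OPENAI_API_KEY") or env.get("OPENAI_BASE_URL"))


# A local Ollama-style setup: base URL only, no API key.
local_env = {"OPENAI_BASE_URL": "http://localhost:11434/v1"}
# Anthropic judge only: fine for judging, but no embeddings for --semantic.
anthropic_env = {"ANTHROPIC_API_KEY": "placeholder"}

print(resolve_provider(local_env), can_use_semantic(local_env))
print(resolve_provider(anthropic_env), can_use_semantic(anthropic_env))
```

In practice you would pass `os.environ` instead of the literal dicts; plain mappings are used here so the example has no side effects.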

Contributing

Issues and PRs welcome — see CONTRIBUTING.md. MIT licensed.

Release history

Version	Release date
0.7.0 (this version)	Jun 14, 2026
0.6.0	Jun 13, 2026
0.5.2	Jun 1, 2026
0.5.1	Jun 1, 2026
0.5.0	Jun 1, 2026
0.4.0	Jun 1, 2026
0.3.0	Jun 1, 2026

Download files

Source Distribution

proofrag-0.7.0.tar.gz (275.2 kB)

Uploaded Jun 14, 2026 Source

Built Distribution

proofrag-0.7.0-py3-none-any.whl (46.2 kB)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file proofrag-0.7.0.tar.gz.

File metadata

Download URL: proofrag-0.7.0.tar.gz
Upload date: Jun 14, 2026
Size: 275.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.21

File hashes

Hashes for proofrag-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`993349d0e013975b6e448b63698f1f3b7eb4a9b352895fe2737966ff0f03b667`
MD5	`3060bd39c18bdd4e1257fc8d7214496e`
BLAKE2b-256	`9000df814669f8ad6ffc2407c15244bceb843d123fee8b9be179bd4a410d83d7`

File details

Details for the file proofrag-0.7.0-py3-none-any.whl.

File metadata

Download URL: proofrag-0.7.0-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 46.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.21

File hashes

Hashes for proofrag-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b08234cddacdd9c4c81d17634ee8a797fa585c29a913b1ec32547ee6eba24b43`
MD5	`457ce4949e4ed71ad4047fc1e578a549`
BLAKE2b-256	`720819cbd17cb068432c8104a766c8f42c11c53edce9be63d7bfd03c6d983b23`
