pytest for RAG agents — behavioral audits with PASS/FAIL/WEAK verdicts
ragverdict
pytest for RAG agents. Behavioral audits of any RAG system — tool coverage, retrieval quality, citation verification, hallucination guardrails — with PASS / FAIL / WEAK verdicts, not floating-point metric averages.
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test ┃ Evaluator ┃ Verdict ┃ Latency ┃ Detail ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ tool_coverage_all │ tool_coverage │ PASS │ 8ms │ 2/2 tools fired cleanly │
│ direct_retrieval │ rag_quality │ PASS │ 12662ms │ 3/3 cases passed │
│ hallucination_g… │ rag_quality │ PASS │ 3985ms │ 2/2 cases passed │
│ citation_audit │ citation_audit │ PASS │ 5535ms │ mean support_score=1.00 │
│ edge_cases_battery │ edge_cases │ PASS │ 2380ms │ 4/4 cases passed │
└────────────────────┴────────────────┴─────────┴─────────┴─────────────────────────────────┘
Why ragverdict
Existing RAG evaluation tools score metrics. RAGAs, DeepEval, TruLens, and Arize Phoenix all answer "how faithful was the response on average?" via LLM-as-judge: they report the mean of a set of scores. They do not answer the question "does the agent actually work end to end?"
| Tool | Does | Doesn't |
|---|---|---|
| RAGAs | LLM-as-judge metric scores (faithfulness, context precision/recall) | No tool-call testing, no citation-vs-corpus verification, no assertions |
| DeepEval | pytest-style assertions on the RAGAs metric family | Same metric-centric model |
| TruLens | RAG Triad + OpenTelemetry tracing | Observability-centric |
| Phoenix | Tracing platform that wraps the above | Heavy infra, not a CLI |
The gap ragverdict fills: behavioral audits of RAG agents — assertions about whether the system behaves correctly, with PASS/FAIL/WEAK verdicts that map cleanly to CI.
What it checks
- tool_coverage: Fires every tool the agent exposes and confirms it returns without error. Reports per-tool pass/fail + latency. None of the four competitors do this.
- rag_quality: Hard assertions (must_mention, must_not_cite, must_refuse, expects_citations) plus LLM-as-judge faithfulness + relevance scoring for WEAK/FAIL verdicts when hard checks pass.
- citation_audit: Verifies every [src:ID] citation resolves to a real document in the agent's corpus, then asks the judge whether the cited claim is actually supported by the source. Dangling citations are a hard FAIL.
- edge_cases (v0.2): Covers input-boundary failure modes: long_input (10K-char prompts, timeout-bounded), multi_turn (conversation-context recall), contradiction (false premises must be pushed back on, judge-graded with a --no-judge heuristic fallback), and empty_input (clean rejection). None of the four competitors do this either.
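The dangling-citation part of citation_audit can be pictured with a short sketch. This is not ragverdict's implementation: the regex, the function name `find_dangling_citations`, and the sample IDs are assumptions made for illustration. It only shows the idea of extracting `[src:ID]` markers and checking each ID against the set of corpus document IDs.

```python
import re

# Assumed marker format, taken from the "[src:ID]" notation above.
CITATION_RE = re.compile(r"\[src:([^\]\s]+)\]")


def find_dangling_citations(answer: str, corpus_ids: set) -> list:
    """Return cited IDs that do not resolve to any document in the corpus."""
    cited = CITATION_RE.findall(answer)
    # Drop duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(cited))
    return [cid for cid in unique if cid not in corpus_ids]


# Hypothetical answer text and corpus IDs.
answer = "Revenue was $5.2M [src:q1-report]. Headcount doubled [src:org-chart-9]."
print(find_dangling_citations(answer, {"q1-report", "handbook"}))
```

Any non-empty result would correspond to the hard FAIL described above; the judge-based support scoring is a separate step that this sketch does not cover.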
Quickstart
1. Install
pip install -e ".[dev]" # from a clone; PyPI release pending
export ANTHROPIC_API_KEY=sk-ant-… # required for the LLM judge
2. Run the bundled demo
ragverdict run examples/demo_rag/config.yaml
This runs five tests against a tiny reference RAG agent (DemoAdapter) over a fictional
"Acme Corp" corpus, exercising all four evaluators.
To run without burning API tokens:
ragverdict run examples/demo_rag/config.yaml --no-judge
(Hard assertions still run; WEAK verdicts and citation support scoring are skipped.)
3. Write a config for your own RAG system
config.yaml:
adapter:
  type: python
  module: my_app.rag_adapter
  class: MyRagAdapter
judge:
  provider: anthropic
  model: claude-sonnet-4-6
tests:
  - name: tool_coverage_all
    evaluator: tool_coverage
  - name: golden_path
    evaluator: rag_quality
    cases:
      - query: "What was Q1 2025 revenue?"
        must_mention: ["$5.2M"]
        expects_citations: true
  - name: out_of_corpus
    evaluator: rag_quality
    cases:
      - query: "Predict 2030 revenue."
        must_refuse: true
        must_not_cite: true
  - name: citations
    evaluator: citation_audit
    sample_queries:
      - "Summarize Q1 2025 risks."
      - "Who is the CTO?"
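To make the case keys in the config concrete, here is a minimal sketch of how hard assertions such as must_mention, expects_citations, and must_not_cite could be checked against a response. The `Case` dataclass and `check_hard_assertions` function are hypothetical names, not part of ragverdict's API, and must_refuse is left out because the source does not say how refusals are detected.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """Hypothetical mirror of one entry under `cases:` in config.yaml."""
    query: str
    must_mention: list = field(default_factory=list)
    expects_citations: bool = False
    must_not_cite: bool = False


def check_hard_assertions(case: Case, text: str, citation_ids: list) -> list:
    """Return failure messages; an empty list means every hard check held."""
    failures = []
    for needle in case.must_mention:
        # Plain substring match (assumed to be case-sensitive).
        if needle not in text:
            failures.append(f"missing required substring: {needle!r}")
    if case.expects_citations and not citation_ids:
        failures.append("expected at least one citation, got none")
    if case.must_not_cite and citation_ids:
        failures.append(f"expected no citations, got {len(citation_ids)}")
    return failures


case = Case(query="What was Q1 2025 revenue?", must_mention=["$5.2M"], expects_citations=True)
print(check_hard_assertions(case, "Q1 2025 revenue was $5.2M.", ["q1-report"]))
```

Returning a list of messages rather than a boolean keeps the reason for a FAIL available for the report detail column.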
4. Write your adapter
Subclass RagAdapter and implement query():
from ragverdict import RagAdapter, RagResponse, Citation, ToolCall, ToolSpec, SourceDoc

# my_pipeline below stands in for your own retrieval and generation code.


class MyRagAdapter(RagAdapter):
    def query(self, prompt, *, conversation=None) -> RagResponse:
        # Call your real RAG pipeline:
        text, retrieved, citations, tool_calls = my_pipeline.run(prompt)
        return RagResponse(
            text=text,
            citations=[Citation(id=c.id, source_id=c.source, span=c.span) for c in citations],
            tool_calls=[ToolCall(name=t.name, args=t.args, latency_ms=t.ms) for t in tool_calls],
            retrieved_context=retrieved,
        )

    def available_tools(self) -> list[ToolSpec]:
        # Tools listed here are the ones tool_coverage will fire.
        return [ToolSpec(name="search_kb", description="Knowledge-base lookup")]

    def corpus(self):
        # Documents yielded here are what citation_audit resolves citations against.
        for doc in my_pipeline.iter_docs():
            yield SourceDoc(source_id=doc.id, content=doc.text, title=doc.title)
The runner inserts the current working directory into sys.path before importing the
module named under module:, so a project-local my_app/ package just works.
See examples/demo_rag/adapter.py for a complete
reference adapter and examples/README.md for a walkthrough.
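The adapter loading described above can be sketched with the standard library alone. This is a guess at the general shape, not the runner's actual code; `load_adapter_class` is a hypothetical name.

```python
import importlib
import os
import sys


def load_adapter_class(module_name: str, class_name: str) -> type:
    """Resolve a class from the `module:` and `class:` values in config.yaml."""
    cwd = os.getcwd()
    # Put the working directory first so a project-local package wins the lookup.
    if cwd not in sys.path:
        sys.path.insert(0, cwd)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a stdlib module so the sketch runs anywhere.
print(load_adapter_class("collections", "OrderedDict"))
```

With the config from step 3, the equivalent call would be `load_adapter_class("my_app.rag_adapter", "MyRagAdapter")`, run from the directory that contains my_app/.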
Verdicts
- PASS: All hard assertions hold; judge scores (if configured) are at or above the pass threshold (defaults: faithfulness 0.85, relevance 0.85, citation support 0.95).
- WEAK: Hard assertions hold but a judge score falls in [weak, pass) (defaults: 0.7–0.85 for faithfulness/relevance, 0.8–0.95 for citation support).
- FAIL: A hard assertion failed, or a judge score fell below the weak threshold.
- ERROR: The evaluator crashed or the judge returned unparseable output.
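The threshold bands above amount to a small mapping from scores to verdicts. The sketch below uses the documented default thresholds; the function names and the dictionary layout are assumptions for illustration, not ragverdict's internals.

```python
# (weak, pass) thresholds per metric, using the defaults listed above.
DEFAULT_THRESHOLDS = {
    "faithfulness": (0.70, 0.85),
    "relevance": (0.70, 0.85),
    "citation_support": (0.80, 0.95),
}

SEVERITY = {"PASS": 0, "WEAK": 1, "FAIL": 2}


def verdict_for_score(score: float, weak: float, passing: float) -> str:
    """Map one judge score onto PASS / WEAK / FAIL using the [weak, pass) band."""
    if score >= passing:
        return "PASS"
    if score >= weak:
        return "WEAK"
    return "FAIL"


def overall_verdict(scores: dict, hard_assertions_ok: bool = True,
                    thresholds: dict = DEFAULT_THRESHOLDS) -> str:
    """A failed hard assertion is always FAIL; otherwise the worst score wins."""
    if not hard_assertions_ok:
        return "FAIL"
    verdicts = [verdict_for_score(s, *thresholds[m]) for m, s in scores.items()]
    # With no judge scores at all, passing hard assertions means PASS.
    return max(verdicts, key=SEVERITY.__getitem__, default="PASS")


print(overall_verdict({"faithfulness": 0.90, "relevance": 0.75}))
```

ERROR is deliberately left out: it describes a crash or unparseable judge output, not a score.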
Tune thresholds via the thresholds: section of config.yaml. Exit codes:
| Code | Meaning |
|---|---|
| 0 | All tests PASS or WEAK |
| 1 | At least one FAIL or ERROR |
| 2 | Config error / adapter load failure / unknown evaluator |
| 3 | All tests ERROR (typically: judge unavailable) |
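Read as a function of the per-test verdicts, the exit-code table could look like the sketch below. The precedence (checking the all-ERROR case before the FAIL-or-ERROR case) is an assumption inferred from the table, and code 2 is omitted because it is raised before any test runs.

```python
def exit_code(verdicts: list) -> int:
    """Map a run's per-test verdicts to a process exit code (codes 0, 1, 3)."""
    # Assumed precedence: a run where every test errored is reported as 3.
    if verdicts and all(v == "ERROR" for v in verdicts):
        return 3
    if any(v in ("FAIL", "ERROR") for v in verdicts):
        return 1
    return 0


print(exit_code(["PASS", "WEAK", "PASS"]))
```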
Reports
After each run, two files land in ./report/ (override with --out-dir):
- report.json: Machine-readable: full per-test verdicts, metrics, judge artifacts, per-citation audit detail. Stable shape; see docs/json-report-schema.md.
- report.md: Human-readable summary table.
FAQ
When should I use ragverdict vs RAGAs / DeepEval / TruLens?
They're complementary, not competing. The metric-centric tools (RAGAs, ARES, TruLens, Phoenix, DeepEval) score response quality dimensions like faithfulness and relevance — useful for tracking quality over time. ragverdict tests agent behavior — did the tools fire, do the citations resolve to real documents, did the agent push back on a false premise, does it survive a 10K-character prompt. A mature RAG team uses both: RAGAs-style scoring for quality tracking + ragverdict for behavioral regression in CI.
Does it work without an API key?
Yes. Pass --no-judge (or leave ANTHROPIC_API_KEY unset and the runner degrades
automatically). Hard assertions still run — tool_coverage, citation-vs-corpus
dangling checks, must_mention / must_refuse / must_not_cite,
long-input/multi-turn/empty-input edge cases. The contradiction edge case falls back
to a narrow regex heuristic (_PUSHBACK_HINTS) with a clear caveat in the FAIL detail
when it can't confidently grade.
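A regex fallback of this kind might look roughly like the following. The patterns here are invented for illustration; the real contents of _PUSHBACK_HINTS are not shown in this README and may differ.

```python
import re

# Hypothetical hint patterns; not the actual _PUSHBACK_HINTS list.
PUSHBACK_HINTS = [
    r"\bthat(?:'s| is) (?:not|incorrect|inaccurate)\b",
    r"\bactually\b",
    r"\bno record\b",
    r"\bthe premise\b",
    r"\bI (?:can't|cannot|couldn't) (?:find|confirm|verify)\b",
]
_PUSHBACK_RE = re.compile("|".join(PUSHBACK_HINTS), re.IGNORECASE)


def looks_like_pushback(answer: str) -> bool:
    """Heuristic only: True when the answer contains a phrase that signals disagreement."""
    return _PUSHBACK_RE.search(answer) is not None


print(looks_like_pushback("Actually, Q1 2025 revenue was $5.2M, not $9M."))
print(looks_like_pushback("Sure, here is the summary you asked for."))
```

A heuristic like this misses pushback phrased in other ways, which is presumably why the FAIL detail carries a caveat when no judge is available.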
Can I write my own evaluator?
Yes. Subclass Evaluator, set a class-level name, decorate with @register, and
implement run(adapter, spec, *, judge, thresholds) -> TestResult. Then import your
module before ragverdict run or add it to the package's autoload. The bundled
evaluators (src/ragverdict/evaluators/) are reference implementations.
Can I use it with a RAG system written in another language?
Yes — use the HttpAdapter. Set adapter.type: http + an endpoint URL in your
config. The runner POSTs {prompt, conversation} and expects a JSON response matching
the RagResponse shape. Your Rust / Go / Node / TypeScript / etc. service just needs
to speak that protocol.
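The service side of that protocol is small. The sketch below uses Python's standard library only to keep this README's examples in one language; the same request and response shape would apply to a Rust, Go, or Node service. The response field names are assumed from the RagResponse constructor in the adapter example, and `answer`, `RagHandler`, and `serve` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def answer(prompt, conversation=None) -> dict:
    """Placeholder pipeline returning a dict shaped like RagResponse (assumed fields)."""
    return {
        "text": f"Echo: {prompt}",
        "citations": [],
        "tool_calls": [],
        "retrieved_context": [],
    }


class RagHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body: {"prompt": ..., "conversation": ...}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = answer(payload.get("prompt", ""), payload.get("conversation"))
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocks forever; call this from your own entry point."""
    HTTPServer(("127.0.0.1", port), RagHandler).serve_forever()
```

Calling `serve()` starts the endpoint on localhost; the URL you then put in the config depends on how HttpAdapter expects the endpoint to be specified.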
What's the difference between WEAK and FAIL?
FAIL = a hard assertion failed (a required substring was missing, a citation didn't
resolve, an edge case crashed) or a judge score fell below the weak threshold. WEAK =
all hard assertions held but a judge score fell into the configurable weak band
(defaults: faithfulness or relevance in [0.7, 0.85), citation support in [0.8, 0.95)).
WEAK is "watch this," FAIL is "fix this." Both PASS and WEAK give
exit code 0; FAIL gives exit code 1.
Why four-state verdicts instead of floating-point scores?
So they map cleanly to CI exit codes and a 5-second scan of the terminal table. Raw
judge scores still live in report.json for users who want them — but the headline
output is a verdict, not a number you have to threshold yourself. The pitch is "pytest
for RAG, not metrics for RAG."
Can I use a model other than Claude for the judge?
The judge is configurable via judge.model in config.yaml (defaults to
claude-sonnet-4-6). Any current Anthropic model works out of the box. Other
providers require swapping LLMJudge for a sibling implementation — the runner
accepts any object that satisfies the judge interface.
How do I integrate this into GitHub Actions?
- name: RAG behavioral audit
  run: |
    pip install git+https://github.com/Shauryagulati/ragverdict.git  # PyPI release pending
    ragverdict run config.yaml --no-judge
The exit code propagates to CI naturally: PASS/WEAK is exit 0, any FAIL or ERROR is exit 1,
config errors are exit 2, and all-ERROR (typically: judge unreachable) is exit 3. For
live-judge CI runs, set ANTHROPIC_API_KEY as a repo secret and drop the
--no-judge flag.
Does prompt caching actually fire?
The wiring is correct on every judge rubric (cache_control={"type": "ephemeral"}),
but Sonnet 4.6's minimum cacheable prefix is 2048 tokens and current rubrics are
400-600 tokens. Caching activates as rubrics grow (more examples) or on models with
smaller minimums. This limitation is documented in LLMJudge's module docstring rather
than left as a silently shipped feature that doesn't fire yet.
Roadmap
v0.2 shipped the edge-case battery. Next up:
- Write-tool safety evaluator (preview-only verification, version chain checks)
- auth_negative kind for the edge_cases evaluator (requires adapter ABC extension)
- Native OpenAI/LangChain adapters
- Concurrent test execution
- Hosted dashboard with regression tracking across runs
License
MIT — see LICENSE.