pytest for RAG agents — behavioral audits with PASS/FAIL/WEAK verdicts

These details have not been verified by PyPI

Project links

Project description

ragverdict

pytest for RAG agents. Behavioral audits of any RAG system — tool coverage, retrieval quality, citation verification, hallucination guardrails — with PASS / FAIL / WEAK verdicts, not floating-point metric averages.

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test               ┃ Evaluator      ┃ Verdict ┃ Latency ┃ Detail                          ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ tool_coverage_all  │ tool_coverage  │ PASS    │     8ms │ 2/2 tools fired cleanly         │
│ direct_retrieval   │ rag_quality    │ PASS    │ 12662ms │ 3/3 cases passed                │
│ hallucination_g…   │ rag_quality    │ PASS    │  3985ms │ 2/2 cases passed                │
│ citation_audit     │ citation_audit │ PASS    │  5535ms │ mean support_score=1.00         │
│ edge_cases_battery │ edge_cases     │ PASS    │  2380ms │ 4/4 cases passed                │
└────────────────────┴────────────────┴─────────┴─────────┴─────────────────────────────────┘

Why ragverdict

Existing RAG evaluation tools score metrics. RAGAs, DeepEval, TruLens, and Arize Phoenix all answer "how faithful was the response on average" via LLM-as-judge — they tell you the mean of a fleet of scores. They do not answer does the agent actually work end-to-end.

Tool	Does	Doesn't
RAGAs	LLM-as-judge metric scores (faithfulness, context P/R)	No tool-call testing, no citation-vs-corpus verification, no assertions
DeepEval	pytest-style assertions on the RAGAs metric family	Same metric-centric model
TruLens	RAG Triad + OpenTelemetry tracing	Observability-centric
Phoenix	Tracing platform that wraps the above	Heavy infra, not a CLI

The gap ragverdict fills: behavioral audits of RAG agents — assertions about whether the system behaves correctly, with PASS/FAIL/WEAK verdicts that map cleanly to CI.

What it checks

tool_coverage — Fires every tool the agent exposes and confirms it returns without error. Reports per-tool pass/fail + latency. None of the four competitors do this.
rag_quality — Hard assertions (must_mention, must_not_cite, must_refuse, expects_citations) plus LLM-as-judge faithfulness + relevance scoring for WEAK / FAIL verdicts when hard checks pass.
citation_audit — Verifies every [src:ID] citation resolves to a real document in the agent's corpus, then asks the judge whether the cited claim is actually supported by the source. Dangling citations are a hard FAIL.
edge_cases (v0.2) — Input-boundary failure modes: long_input (10K-char prompts, timeout-bounded), multi_turn (conversation-context recall), contradiction (false premises must be pushed back on, judge-graded with --no-judge heuristic fallback), and empty_input (clean rejection). None of the four competitors do this either.

Quickstart

1. Install

pip install -e ".[dev]"  # from a clone; PyPI release pending
export ANTHROPIC_API_KEY=sk-ant-…  # required for the LLM judge

2. Run the bundled demo

ragverdict run examples/demo_rag/config.yaml

This runs five tests against a tiny reference RAG agent (DemoAdapter) over a fictional "Acme Corp" corpus, exercising all four evaluators.

To run without burning API tokens:

ragverdict run examples/demo_rag/config.yaml --no-judge

(Hard assertions still run; WEAK verdicts and citation support scoring are skipped.)

3. Write a config for your own RAG system

config.yaml:

adapter:
  type: python
  module: my_app.rag_adapter
  class: MyRagAdapter

judge:
  provider: anthropic
  model: claude-sonnet-4-6

tests:
  - name: tool_coverage_all
    evaluator: tool_coverage

  - name: golden_path
    evaluator: rag_quality
    cases:
      - query: "What was Q1 2025 revenue?"
        must_mention: ["$5.2M"]
        expects_citations: true

  - name: out_of_corpus
    evaluator: rag_quality
    cases:
      - query: "Predict 2030 revenue."
        must_refuse: true
        must_not_cite: true

  - name: citations
    evaluator: citation_audit
    sample_queries:
      - "Summarize Q1 2025 risks."
      - "Who is the CTO?"

4. Write your adapter

Subclass RagAdapter and implement query():

from ragverdict import RagAdapter, RagResponse, Citation, ToolCall, ToolSpec, SourceDoc

class MyRagAdapter(RagAdapter):
    def query(self, prompt, *, conversation=None) -> RagResponse:
        # Call your real RAG pipeline:
        text, retrieved, citations, tool_calls = my_pipeline.run(prompt)
        return RagResponse(
            text=text,
            citations=[Citation(id=c.id, source_id=c.source, span=c.span) for c in citations],
            tool_calls=[ToolCall(name=t.name, args=t.args, latency_ms=t.ms) for t in tool_calls],
            retrieved_context=retrieved,
        )

    def available_tools(self) -> list[ToolSpec]:
        return [ToolSpec(name="search_kb", description="Knowledge-base lookup")]

    def corpus(self):
        for doc in my_pipeline.iter_docs():
            yield SourceDoc(source_id=doc.id, content=doc.text, title=doc.title)

The runner inserts the current working directory into sys.path before resolving your module: import, so a project-local my_app/ package just works.

See examples/demo_rag/adapter.py for a complete reference adapter and examples/README.md for a walkthrough.

Verdicts

PASS — All hard assertions hold; judge scores (if configured) are at or above the pass threshold (defaults: faithfulness 0.85, relevance 0.85, citation support 0.95).
WEAK — Hard assertions hold but a judge score falls in [weak, pass) (defaults: 0.7–0.85 for faithfulness/relevance, 0.8–0.95 for citation support).
FAIL — A hard assertion failed, or a judge score fell below the weak threshold.
ERROR — The evaluator crashed or the judge returned unparseable output.

Tune thresholds via the thresholds: section of config.yaml. Exit codes:

Code	Meaning
0	All tests PASS or WEAK
1	At least one FAIL or ERROR
2	Config error / adapter load failure / unknown evaluator
3	All tests ERROR (typically: judge unavailable)

Reports

After each run, two files land in ./report/ (override with --out-dir):

report.json — Machine-readable: full per-test verdicts, metrics, judge artifacts, per-citation audit detail. Stable shape — see docs/json-report-schema.md.
report.md — Human-readable summary table.

FAQ

When should I use ragverdict vs RAGAs / DeepEval / TruLens?

They're complementary, not competing. The metric-centric tools (RAGAs, ARES, TruLens, Phoenix, DeepEval) score response quality dimensions like faithfulness and relevance — useful for tracking quality over time. ragverdict tests agent behavior — did the tools fire, do the citations resolve to real documents, did the agent push back on a false premise, does it survive a 10K-character prompt. A mature RAG team uses both: RAGAs-style scoring for quality tracking + ragverdict for behavioral regression in CI.

Does it work without an API key?

Yes. Pass --no-judge (or set no ANTHROPIC_API_KEY and the runner degrades automatically). Hard assertions still run — tool_coverage, citation-vs-corpus dangling checks, must_mention / must_refuse / must_not_cite, long-input/multi-turn/empty-input edge cases. The contradiction edge case falls back to a narrow regex heuristic (_PUSHBACK_HINTS) with a clear caveat in the FAIL detail when it can't confidently grade.

Can I write my own evaluator?

Yes. Subclass Evaluator, set a class-level name, decorate with @register, and implement run(adapter, spec, *, judge, thresholds) -> TestResult. Then import your module before ragverdict run or add it to the package's autoload. The bundled evaluators (src/ragverdict/evaluators/) are reference implementations.

Can I use it with a RAG system written in another language?

Yes — use the HttpAdapter. Set adapter.type: http + an endpoint URL in your config. The runner POSTs {prompt, conversation} and expects a JSON response matching the RagResponse shape. Your Rust / Go / Node / TypeScript / etc. service just needs to speak that protocol.

What's the difference between `WEAK` and `FAIL`?

FAIL = a hard assertion failed (a required substring was missing, a citation didn't resolve, an edge case crashed). WEAK = all hard assertions held but a judge score fell into the configurable weak band (default: faithfulness or relevance in [0.7, 0.85)). WEAK is "watch this," FAIL is "fix this." Both PASS and WEAK give exit code 0; FAIL gives exit code 1.

Why four-state verdicts instead of floating-point scores?

So they map cleanly to CI exit codes and a 5-second scan of the terminal table. Raw judge scores still live in report.json for users who want them — but the headline output is a verdict, not a number you have to threshold yourself. The pitch is "pytest for RAG, not metrics for RAG."

Can I use a model other than Claude for the judge?

The judge is configurable via judge.model in config.yaml (defaults to claude-sonnet-4-6). Any current Anthropic model works out of the box. Other providers require swapping LLMJudge for a sibling implementation — the runner accepts any object that satisfies the judge interface.

How do I integrate this into GitHub Actions?

- name: RAG behavioral audit
  run: |
    pip install git+https://github.com/Shauryagulati/ragverdict.git  # PyPI release pending
    ragverdict run config.yaml --no-judge

CI exit code propagates naturally — PASS/WEAK is exit 0, any FAIL is exit 1, config errors are exit 2, all-ERROR (typically: judge unreachable) is exit 3. For live-judge CI runs, set ANTHROPIC_API_KEY as a repo secret and drop the --no-judge flag.

Does prompt caching actually fire?

The wiring is correct on every judge rubric (cache_control={"type": "ephemeral"}), but Sonnet 4.6's minimum cacheable prefix is 2048 tokens and current rubrics are 400-600 tokens. Caching activates as rubrics grow (more examples) or on models with smaller minimums. Documented honestly in LLMJudge's module docstring rather than silently shipping a feature that doesn't fire yet.

Roadmap

v0.2 shipped the edge-case battery. Next up:

Write-tool safety evaluator (preview-only verification, version chain checks)
auth_negative kind for the edge_cases evaluator (requires adapter ABC extension)
Native OpenAI / LangChain adapters
Concurrent test execution
Hosted dashboard with regression tracking across runs

License

MIT — see LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.2.1

Jun 26, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ragverdict-0.2.1.tar.gz (30.3 kB view details)

Uploaded Jun 26, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ragverdict-0.2.1-py3-none-any.whl (34.3 kB view details)

Uploaded Jun 26, 2026 Python 3

File details

Details for the file ragverdict-0.2.1.tar.gz.

File metadata

Download URL: ragverdict-0.2.1.tar.gz
Upload date: Jun 26, 2026
Size: 30.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for ragverdict-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`116313a1acd4aa3e168de655d9edc682cc583f5b49d36fb631ce6d95cf461808`
MD5	`269e80dad5ea5955bae6cf9a60c8eec3`
BLAKE2b-256	`9bea00617176d05f569af14e495fbb5198619c1ae34077fc7d0c483f123a7bf5`

See more details on using hashes here.

File details

Details for the file ragverdict-0.2.1-py3-none-any.whl.

File metadata

Download URL: ragverdict-0.2.1-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 34.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for ragverdict-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`30c5c20551ca03168165578f75ec31f7651a1a47f3bcc87adcb59bc33fa5af29`
MD5	`dce8434489f22666730c890450b57e4c`
BLAKE2b-256	`e1fc0a1b2831f6bdbefc28a57bf2fa3b5ac407d6cd3883c464e8ba55c5a2ef64`

See more details on using hashes here.

ragverdict 0.2.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

ragverdict

Why ragverdict

What it checks

Quickstart

1. Install

2. Run the bundled demo

3. Write a config for your own RAG system

4. Write your adapter

Verdicts

Reports

FAQ

When should I use ragverdict vs RAGAs / DeepEval / TruLens?

Does it work without an API key?

Can I write my own evaluator?

Can I use it with a RAG system written in another language?

What's the difference between WEAK and FAIL?

Why four-state verdicts instead of floating-point scores?

Can I use a model other than Claude for the judge?

How do I integrate this into GitHub Actions?

Does prompt caching actually fire?

Roadmap

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

What's the difference between `WEAK` and `FAIL`?