Verdict
Evaluation infrastructure for AI agents.
Install
pip install verdict-eval
Quickstart
# Run an evaluation against a built-in adapter
verdict eval --target simple_rag --num-per-category 5
# Compare two adapter versions
verdict diff --target-a simple_rag --target-b path/to/v2.py:MyAdapter --num 10
# Analyze flakiness across historical runs
verdict flakiness --target my-system --reports-dir ./reports
CLI reference
verdict eval
Run a full evaluation against a target adapter.
| Flag | Default | Description |
|---|---|---|
| --target | required | Adapter spec: simple_rag or path/to/file.py:ClassName |
| --num-per-category | 5 | Prompts per test category |
| --categories | all | Specific categories (repeat for multiple) |
| --output-dir | ./reports | Report output directory |
| --run-id | auto | Custom run identifier |
| --model | settings default | Override LLM model for all agents |
| --bootstrap-iterations | 1000 | Bootstrap CI iterations (0 to disable) |
| --max-cost-usd | none | Fail (exit 2) if total cost exceeds this amount |
| --max-total-latency-seconds | none | Fail (exit 2) if total latency exceeds this |
| --fail-on-pass-rate-below | none | Fail (exit 2) if pass rate < threshold |
| --fail-on-ci-low-below | none | Fail (exit 2) if CI lower bound < threshold |
| --cache-mode | off | off / record / replay / update |
| --cache-dir | .verdict_cache | Directory for cached responses |
| --adaptive | off | Run adaptive follow-up probes based on initial responses |
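To make the bootstrap and threshold flags more concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a pass rate, followed by a gate that returns exit code 2 when a threshold is missed. This is an illustration under assumptions, not Verdict's actual implementation: the function names, the 95% interval, and the percentile method are all hypothetical choices made for this example.

```python
import random
from typing import Optional, Sequence, Tuple

EXIT_OK = 0
EXIT_THRESHOLD_FAILED = 2  # exit code the flag table lists for threshold failures


def bootstrap_ci(
    outcomes: Sequence[bool],
    iterations: int = 1000,
    alpha: float = 0.05,
    seed: Optional[int] = None,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the pass rate (assumed method)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the pass rate of each resample.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(iterations))
    low = rates[int((alpha / 2) * (iterations - 1))]
    high = rates[int((1 - alpha / 2) * (iterations - 1))]
    return low, high


def gate(
    outcomes: Sequence[bool],
    min_pass_rate: Optional[float] = None,
    min_ci_low: Optional[float] = None,
    iterations: int = 1000,
    seed: Optional[int] = None,
) -> int:
    """Return an exit code in the spirit of the --fail-on-* flags."""
    pass_rate = sum(outcomes) / len(outcomes)
    if min_pass_rate is not None and pass_rate < min_pass_rate:
        return EXIT_THRESHOLD_FAILED
    # With zero iterations the interval is disabled, so the CI check is skipped.
    if min_ci_low is not None and iterations > 0:
        low, _ = bootstrap_ci(outcomes, iterations, seed=seed)
        if low < min_ci_low:
            return EXIT_THRESHOLD_FAILED
    return EXIT_OK
```

Gating on the interval's lower bound rather than the raw pass rate is stricter for small suites, since a handful of prompts produces a wide interval.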
verdict diff
Compare two adapter versions against the same generated test suite.
verdict diff \
--target-a simple_rag \
--target-b path/to/v2.py:V2Adapter \
--num 10
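Because both adapters see the same generated suite, their results can be compared prompt by prompt. The sketch below shows one way to summarize such a paired comparison. It assumes each run is reduced to a mapping from a prompt ID to a pass/fail outcome; that representation, and the names `compare_runs` and `DiffSummary`, are hypothetical and not taken from Verdict's report format.

```python
from typing import Dict, List, NamedTuple


class DiffSummary(NamedTuple):
    regressions: List[str]   # passed under A, failed under B
    improvements: List[str]  # failed under A, passed under B
    unchanged: int           # same outcome under both


def compare_runs(a: Dict[str, bool], b: Dict[str, bool]) -> DiffSummary:
    """Pair outcomes by prompt ID and count how each prompt changed."""
    # Only prompts present in both runs can be compared fairly.
    shared = sorted(a.keys() & b.keys())
    regressions = [p for p in shared if a[p] and not b[p]]
    improvements = [p for p in shared if not a[p] and b[p]]
    unchanged = len(shared) - len(regressions) - len(improvements)
    return DiffSummary(regressions, improvements, unchanged)
```

Listing the regressed prompt IDs, rather than only the change in pass rate, makes it easier to see whether a new version traded one failure for another.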
verdict flakiness
Analyze judge and target consistency across historical evaluation runs.
verdict flakiness --target my-system --min-runs 5 --reports-dir ./reports
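One simple way to quantify consistency across runs is the share of verdicts that disagree with the majority verdict for each prompt. The sketch below uses that definition as an assumption; the README does not state which metric Verdict computes, and `flakiness_by_prompt` and its input shape are hypothetical. The `min_runs` parameter mirrors the role of the --min-runs flag by skipping prompts with too little history.

```python
from collections import Counter
from typing import Dict, List


def flakiness_by_prompt(
    history: Dict[str, List[bool]], min_runs: int = 5
) -> Dict[str, float]:
    """Fraction of runs that disagree with the majority verdict, per prompt."""
    scores: Dict[str, float] = {}
    for prompt_id, verdicts in history.items():
        if len(verdicts) < min_runs:
            continue  # not enough history to say anything useful
        majority_count = Counter(verdicts).most_common(1)[0][1]
        scores[prompt_id] = 1 - majority_count / len(verdicts)
    return scores
```

A score of 0.0 means every run agreed; with binary verdicts the score cannot exceed 0.5.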
Adaptive mode
When --adaptive is enabled, Verdict runs a second pass of follow-up probes selected based on each initial response. Pattern selection is entirely rule-based: no LLM is used to generate new attacks. All probes are composed from the curated OWASP LLM Top 10 pattern library in verdict/evals/attack_patterns/patterns.json.
This design ensures Verdict remains a defensive evaluation tool. See CONTRIBUTING.md for the security boundary policy.
verdict eval --target simple_rag --adaptive
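The rule-based selection described above can be pictured as a fixed lookup from an observed response signal to follow-up pattern IDs drawn from a curated library. The sketch below is purely illustrative: the signal names, the keyword rules, and the pattern IDs are hypothetical, and the real schema of patterns.json is not shown in this README.

```python
from typing import Dict, List, Set

# Hypothetical mapping from a response signal to follow-up pattern IDs.
FOLLOW_UP_RULES: Dict[str, List[str]] = {
    "refused": [],  # a clean refusal needs no further probing
    "leaked_context": ["context-leak-recheck"],
    "complied": ["boundary-recheck"],
}


def classify(response: str) -> str:
    """Assign a coarse signal to a response using fixed keyword rules."""
    text = response.lower()
    if any(marker in text for marker in ("i can't", "i cannot", "i won't")):
        return "refused"
    if "system prompt" in text:
        return "leaked_context"
    return "complied"


def select_follow_ups(response: str, library: Set[str]) -> List[str]:
    """Pick follow-up pattern IDs, restricted to IDs present in the library."""
    signal = classify(response)
    # Only IDs that exist in the curated library are ever returned,
    # so nothing outside the library can be introduced at run time.
    return [pid for pid in FOLLOW_UP_RULES[signal] if pid in library]
```

Since every output is filtered against the library, the set of possible probes is fixed ahead of time and can be reviewed in full.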
Writing a custom adapter
# my_adapter.py
from verdict.adapters.base import TargetAdapter
from verdict.models.schemas import ExecutionResult, TestPrompt


class MyAdapter(TargetAdapter):
    name = "my-system"
    version = "1.0.0"

    async def execute(self, prompt: TestPrompt) -> ExecutionResult:
        # call_my_system is a placeholder for your own client code
        response = call_my_system(prompt.prompt)
        return self.make_result(prompt, response=response)
verdict eval --target my_adapter.py:MyAdapter
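A spec of the form path/to/file.py:ClassName can be resolved with the standard library alone. The sketch below shows one plausible approach using importlib; it is an assumption about how such a spec might be handled, not Verdict's actual loader, and `load_class_from_spec` is a hypothetical name.

```python
import importlib.util
from pathlib import Path


def load_class_from_spec(spec: str) -> type:
    """Resolve 'path/to/file.py:ClassName' to the class object it names."""
    # Split on the last colon so Windows drive letters stay in the path.
    path_str, sep, class_name = spec.rpartition(":")
    if not sep or not path_str or not class_name:
        raise ValueError(f"expected 'path/to/file.py:ClassName', got {spec!r}")

    path = Path(path_str)
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load a module from {path}")

    # Execute the file as a module, then look the class up by name.
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, class_name)
```

Loading by file path executes the file, so an adapter module should avoid side effects at import time.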
Test categories
| Category | What it evaluates |
|---|---|
| correctness | Factual accuracy and reasoning quality |
| safety | Refusal of harmful, dangerous, or unethical requests |
| injection | Robustness against prompt injection (OWASP LLM01, LLM07) |
| edge_case | Graceful handling of malformed and ambiguous inputs |
| compliance | Privacy and data handling (OWASP LLM02) |
Judge calibration
The Judge is validated against 22 hand-labeled examples covering all five test categories. Results are produced by running the live judge against known ground truth — no labels were derived from judge output.
| Metric | Target | Baseline |
|---|---|---|
| Pass/fail agreement (non-borderline) | ≥ 80% | TBD |
| Critical failure detection | 5 / 5 | TBD |
| Score accuracy (±1) | ≥ 70% | TBD |
Run calibration locally (requires ANTHROPIC_API_KEY):
pytest tests/qa/test_judge_calibration.py -v -m llm
See docs/judge_calibration.md for full methodology.
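As a rough illustration of how two of the metrics in the table could be computed from hand-labeled examples, consider the sketch below. The record layout and function names are hypothetical, and the exact definitions live in docs/judge_calibration.md; this only assumes that each example carries a human label, a judge label, a numeric score from each, and a borderline flag.

```python
from typing import NamedTuple, Sequence


class LabeledExample(NamedTuple):
    human_pass: bool
    judge_pass: bool
    human_score: int
    judge_score: int
    borderline: bool = False


def pass_fail_agreement(examples: Sequence[LabeledExample]) -> float:
    """Share of non-borderline examples where judge and human verdicts match."""
    scored = [e for e in examples if not e.borderline]
    if not scored:
        raise ValueError("no non-borderline examples to score")
    return sum(e.human_pass == e.judge_pass for e in scored) / len(scored)


def score_accuracy(examples: Sequence[LabeledExample], tolerance: int = 1) -> float:
    """Share of examples where the judge score is within +/- tolerance."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = sum(abs(e.human_score - e.judge_score) <= tolerance for e in examples)
    return hits / len(examples)
```

Excluding borderline examples from the agreement metric keeps disagreements on genuinely ambiguous cases from counting against the judge.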
License
MIT — see LICENSE.