Verdict

Evaluation infrastructure for AI agents.

Install

pip install verdict-eval

Quickstart

# Run an evaluation against a built-in adapter
verdict eval --target simple_rag --num-per-category 5

# Compare two adapter versions
verdict diff --target-a simple_rag --target-b path/to/v2.py:MyAdapter --num 10

# Analyze flakiness across historical runs
verdict flakiness --target my-system --reports-dir ./reports

CLI reference

verdict eval

Run a full evaluation against a target adapter.

| Flag | Default | Description |
| --- | --- | --- |
| --target | required | Adapter spec: simple_rag or path/to/file.py:ClassName |
| --num-per-category | 5 | Prompts per test category |
| --categories | all | Specific categories (repeat the flag for multiple) |
| --output-dir | ./reports | Report output directory |
| --run-id | auto | Custom run identifier |
| --model | settings default | Override LLM model for all agents |
| --bootstrap-iterations | 1000 | Bootstrap CI iterations (0 to disable) |
| --max-cost-usd | | Fail (exit 2) if total cost exceeds this amount |
| --max-total-latency-seconds | | Fail (exit 2) if total latency exceeds this |
| --fail-on-pass-rate-below | | Fail (exit 2) if pass rate < threshold |
| --fail-on-ci-low-below | | Fail (exit 2) if CI lower bound < threshold |
| --cache-mode | off | off / record / replay / update |
| --cache-dir | .verdict_cache | Directory for cached responses |
| --adaptive | off | Run adaptive follow-up probes based on initial responses |
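
To make the --bootstrap-iterations and --fail-on-ci-low-below flags more concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a pass rate. It is an illustration only: the function name, the 95% confidence level, and the percentile method are assumptions, and Verdict's own implementation may differ.

```python
import random


def bootstrap_pass_rate_ci(outcomes, iterations=1000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for a pass rate.

    outcomes: list of booleans, one per evaluated prompt (True = pass).
    Returns (point_estimate, ci_low, ci_high).
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    rng = random.Random(seed)
    n = len(outcomes)
    point = sum(outcomes) / n
    if iterations <= 0:
        # Assumed to mirror "0 to disable": no interval is computed.
        return point, None, None
    # Resample with replacement and record the pass rate of each resample.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(iterations)
    )
    alpha = (1.0 - confidence) / 2.0
    ci_low = rates[int(alpha * (iterations - 1))]
    ci_high = rates[int((1.0 - alpha) * (iterations - 1))]
    return point, ci_low, ci_high


# Illustrative input: 18 passes and 2 failures.
outcomes = [True] * 18 + [False] * 2
point, ci_low, ci_high = bootstrap_pass_rate_ci(outcomes)
```

A gate on the lower bound is stricter than a gate on the point estimate: with few prompts the interval is wide, so a run can pass a pass-rate threshold and still fail a threshold on the CI lower bound.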

verdict diff

Compare two adapter versions against the same generated test suite.

verdict diff \
  --target-a simple_rag \
  --target-b path/to/v2.py:V2Adapter \
  --num 10

verdict flakiness

Analyze judge and target consistency across historical evaluation runs.

verdict flakiness --target my-system --min-runs 5 --reports-dir ./reports
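
As a rough illustration of what a flakiness analysis can compute, the sketch below flags prompts whose pass/fail outcome is not unanimous across runs. The input shape (one dict per run, mapping a prompt ID to a boolean), the function name, and the way min_runs is applied are all assumptions made for this example; Verdict's report format and metric are not described here.

```python
from collections import defaultdict


def flaky_prompts(runs, min_runs=5):
    """Return {prompt_id: pass_fraction} for prompts with mixed outcomes.

    runs: list of dicts mapping prompt_id -> bool (True = pass), one per run.
    Prompts observed in fewer than min_runs runs are skipped.
    """
    history = defaultdict(list)
    for run in runs:
        for prompt_id, passed in run.items():
            history[prompt_id].append(passed)

    flaky = {}
    for prompt_id, results in history.items():
        if len(results) < min_runs:
            continue  # not enough history to judge consistency
        passes = sum(results)
        if 0 < passes < len(results):
            # Mixed outcomes: neither always passing nor always failing.
            flaky[prompt_id] = passes / len(results)
    return flaky


# Illustrative history: p1 is stable, p2 is mixed, p3 has too few runs.
runs = [
    {"p1": True, "p2": True},
    {"p1": True, "p2": False},
    {"p1": True, "p2": True},
    {"p1": True, "p2": True},
    {"p1": True, "p2": False, "p3": False},
]
result = flaky_prompts(runs, min_runs=5)
```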

Adaptive mode

When --adaptive is enabled, Verdict runs a second pass of follow-up probes selected based on each initial response. Pattern selection is entirely rule-based — no LLM is used to generate new attacks. All probes are composed from the curated OWASP LLM Top 10 pattern library in verdict/evals/attack_patterns/patterns.json.

This design ensures Verdict remains a defensive evaluation tool. See CONTRIBUTING.md for the security boundary policy.

verdict eval --target simple_rag --adaptive
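
The rule-based selection described above can be pictured roughly as follows. This is a sketch under stated assumptions: the library entries, the trigger field, and the keyword rules are hypothetical, and the real schema of patterns.json is not shown in this document. The point is only that follow-ups are looked up in a fixed library by deterministic rules, with nothing generated.

```python
# Hypothetical stand-in for a curated pattern library (IDs and fields invented).
PATTERN_LIBRARY = [
    {"id": "inj-001", "category": "injection", "trigger": "partial_compliance"},
    {"id": "inj-002", "category": "injection", "trigger": "refusal"},
    {"id": "cmp-001", "category": "compliance", "trigger": "partial_compliance"},
]

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def classify_response(response_text):
    """Map a response to a coarse signal using fixed keyword rules."""
    text = response_text.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refusal"
    return "partial_compliance"


def select_follow_ups(category, response_text, library=PATTERN_LIBRARY):
    """Pick follow-up pattern IDs by lookup only; no text is generated."""
    signal = classify_response(response_text)
    return [
        pattern["id"]
        for pattern in library
        if pattern["category"] == category and pattern["trigger"] == signal
    ]


refused = select_follow_ups("injection", "I cannot help with that.")
complied = select_follow_ups("injection", "Sure, here is a summary.")
```

Because every probe comes from the fixed library, the set of possible follow-ups can be reviewed in advance.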

Writing a custom adapter

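# my_adapter.py
from verdict.adapters.base import TargetAdapter
from verdict.models.schemas import ExecutionResult, TestPrompt


class MyAdapter(TargetAdapter):
    # Adapter name and version.
    name = "my-system"
    version = "1.0.0"

    async def execute(self, prompt: TestPrompt) -> ExecutionResult:
        # call_my_system is a placeholder: replace it with the call into your own system.
        response = call_my_system(prompt.prompt)
        # make_result builds the ExecutionResult from the prompt and the response.
        return self.make_result(prompt, response=response)

Then point --target at the adapter file and class:

verdict eval --target my_adapter.py:MyAdapter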
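
Since execute is declared async, a blocking client call inside it would hold up the event loop while it waits. One common option, sketched below with a stand-in call_my_system and a hypothetical helper name, is to offload the call with asyncio.to_thread (Python 3.9+). The sketch deliberately avoids importing verdict and makes no claim about how Verdict schedules adapters.

```python
import asyncio
import time


def call_my_system(text: str) -> str:
    """Stand-in for a blocking client call (hypothetical)."""
    time.sleep(0.05)
    return f"echo: {text}"


async def call_without_blocking(text: str) -> str:
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so other coroutines can make progress while it waits.
    return await asyncio.to_thread(call_my_system, text)


async def main() -> list[str]:
    # Three calls overlap instead of running back to back.
    return await asyncio.gather(
        *(call_without_blocking(p) for p in ("a", "b", "c"))
    )


results = asyncio.run(main())
```

Inside an adapter, the same pattern would replace the direct call_my_system(prompt.prompt) line with an awaited call.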

Test categories

| Category | What it evaluates |
| --- | --- |
| correctness | Factual accuracy and reasoning quality |
| safety | Refusal of harmful, dangerous, or unethical requests |
| injection | Robustness against prompt injection (OWASP LLM01, LLM07) |
| edge_case | Graceful handling of malformed and ambiguous inputs |
| compliance | Privacy and data handling (OWASP LLM02) |

Judge calibration

The Judge is validated against 22 hand-labeled examples covering all five test categories. Results are produced by running the live judge against known ground truth — no labels were derived from judge output.

| Metric | Target | Baseline |
| --- | --- | --- |
| Pass/fail agreement (non-borderline) | ≥ 80% | TBD |
| Critical failure detection | 5 / 5 | TBD |
| Score accuracy (±1) | ≥ 70% | TBD |
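
For readers who want to see how the three metrics relate to labeled examples, here is a minimal sketch. The record fields (expected_pass, judged_pass, expected_score, judged_score, borderline, critical) are assumptions made for illustration, and the four records are made-up inputs, not calibration results; docs/judge_calibration.md remains the reference for the actual methodology.

```python
def calibration_metrics(examples):
    """Compute the three calibration metrics from labeled examples."""
    # Pass/fail agreement is measured on non-borderline examples only.
    non_borderline = [e for e in examples if not e["borderline"]]
    agree = sum(e["judged_pass"] == e["expected_pass"] for e in non_borderline)

    # A critical failure counts as detected when the judge marks it as a fail.
    critical = [e for e in examples if e["critical"]]
    detected = sum(not e["judged_pass"] for e in critical)

    # Score accuracy: judged score within one point of the hand label.
    within_one = sum(
        abs(e["judged_score"] - e["expected_score"]) <= 1 for e in examples
    )
    return {
        "pass_fail_agreement": agree / len(non_borderline) if non_borderline else None,
        "critical_detected": (detected, len(critical)),
        "score_accuracy": within_one / len(examples) if examples else None,
    }


# Made-up records for illustration only.
examples = [
    {"expected_pass": True, "judged_pass": True, "expected_score": 5,
     "judged_score": 5, "borderline": False, "critical": False},
    {"expected_pass": False, "judged_pass": False, "expected_score": 1,
     "judged_score": 2, "borderline": False, "critical": True},
    {"expected_pass": True, "judged_pass": False, "expected_score": 4,
     "judged_score": 2, "borderline": False, "critical": False},
    {"expected_pass": False, "judged_pass": True, "expected_score": 3,
     "judged_score": 3, "borderline": True, "critical": False},
]
metrics = calibration_metrics(examples)
```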

Run calibration locally (requires ANTHROPIC_API_KEY):

pytest tests/qa/test_judge_calibration.py -v -m llm

See docs/judge_calibration.md for full methodology.

License

MIT — see LICENSE.
