Verdict

Evaluation infrastructure for AI agents.

Install

pip install verdict-eval

Quickstart

# Run an evaluation against a built-in adapter
verdict eval --target simple_rag --num-per-category 5

# Compare two adapter versions
verdict diff --target-a simple_rag --target-b path/to/v2.py:MyAdapter --num 10

# Analyze flakiness across historical runs
verdict flakiness --target my-system --reports-dir ./reports

CLI reference

verdict eval

Run a full evaluation against a target adapter.

| Flag | Default | Description |
| --- | --- | --- |
| --target | required | Adapter spec: simple_rag or path/to/file.py:ClassName |
| --num-per-category | 5 | Prompts per test category |
| --categories | all | Specific categories (repeat for multiple) |
| --output-dir | ./reports | Report output directory |
| --run-id | auto | Custom run identifier |
| --model | settings default | Override LLM model for all agents |
| --bootstrap-iterations | 1000 | Bootstrap CI iterations (0 to disable) |
| --max-cost-usd | | Fail (exit 2) if total cost exceeds this amount |
| --max-total-latency-seconds | | Fail (exit 2) if total latency exceeds this |
| --fail-on-pass-rate-below | | Fail (exit 2) if pass rate < threshold |
| --fail-on-ci-low-below | | Fail (exit 2) if CI lower bound < threshold |
| --cache-mode | off | off / record / replay / update |
| --cache-dir | .verdict_cache | Directory for cached responses |
| --adaptive | off | Run adaptive follow-up probes based on initial responses |
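
To make the --bootstrap-iterations and --fail-on-ci-low-below flags more concrete, here is a minimal sketch of a percentile bootstrap confidence interval over pass/fail outcomes, followed by a threshold check. This is a generic textbook bootstrap, not Verdict's actual implementation: the function names, the 95% confidence level, and the fixed seed are all assumptions made for illustration.

```python
import random


def bootstrap_pass_rate_ci(outcomes, iterations=1000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the pass rate of a list of booleans.

    Hypothetical helper for illustration; not part of Verdict.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    rates = []
    for _ in range(iterations):
        # Resample n outcomes with replacement and record the pass rate.
        resample = rng.choices(outcomes, k=n)
        rates.append(sum(resample) / n)
    rates.sort()
    alpha = (1.0 - confidence) / 2.0
    low = rates[int(alpha * (iterations - 1))]
    high = rates[int((1.0 - alpha) * (iterations - 1))]
    return low, high


def ci_gate(outcomes, ci_low_threshold, iterations=1000):
    """Return 2 if the CI lower bound is below the threshold, else 0.

    The exit code 2 mirrors the flag table; everything else is assumed.
    """
    low, _ = bootstrap_pass_rate_ci(outcomes, iterations=iterations)
    return 2 if low < ci_low_threshold else 0
```

Gating on the lower bound rather than the raw pass rate is stricter for small suites: with only a handful of prompts per category the interval is wide, so a run can pass a pass-rate threshold and still fail the CI-low threshold.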

verdict diff

Compare two adapter versions against the same generated test suite.

verdict diff \
  --target-a simple_rag \
  --target-b path/to/v2.py:V2Adapter \
  --num 10
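
Because both adapters are run against the same generated test suite, results can be compared prompt by prompt rather than only as two aggregate pass rates. The sketch below illustrates that idea. It assumes each run can be reduced to a mapping from a prompt identifier to a pass/fail boolean; that shape, and the paired_diff name, are hypothetical and do not describe Verdict's report format.

```python
def paired_diff(results_a, results_b):
    """Compare two runs over the prompts they share.

    results_a / results_b: dict mapping prompt id -> bool (passed).
    Hypothetical data shape for illustration; not Verdict's report schema.
    """
    shared = sorted(set(results_a) & set(results_b))
    n = len(shared)
    # A regression passed under A and fails under B; an improvement is the reverse.
    regressions = [p for p in shared if results_a[p] and not results_b[p]]
    improvements = [p for p in shared if not results_a[p] and results_b[p]]
    return {
        "compared": n,
        "pass_rate_a": sum(results_a[p] for p in shared) / n if n else 0.0,
        "pass_rate_b": sum(results_b[p] for p in shared) / n if n else 0.0,
        "regressions": regressions,
        "improvements": improvements,
    }
```

Two versions can have identical pass rates while regressing on different prompts, which is why the paired lists are often more useful than the headline numbers.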

verdict flakiness

Analyze judge and target consistency across historical evaluation runs.

verdict flakiness --target my-system --min-runs 5 --reports-dir ./reports
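
One simple way to think about consistency across runs is to ask, for each prompt, how often the minority verdict occurred. The sketch below computes that score. It is an illustration only: the input shape (one dict of prompt id to pass/fail per run), the scoring rule, and the reading of --min-runs as "skip prompts seen in fewer runs than this" are assumptions, not a description of how Verdict computes flakiness.

```python
from collections import defaultdict


def flakiness_by_prompt(runs, min_runs=5):
    """Return prompt id -> flakiness score in [0.0, 0.5].

    runs: list of dicts, each mapping prompt id -> bool (passed) for one run.
    0.0 means every run agreed; 0.5 means verdicts were split evenly.
    Hypothetical helper for illustration; not part of Verdict.
    """
    history = defaultdict(list)
    for run in runs:
        for prompt_id, passed in run.items():
            history[prompt_id].append(passed)

    report = {}
    for prompt_id, verdicts in history.items():
        if len(verdicts) < min_runs:
            continue  # too little history to say anything useful
        passes = sum(verdicts)
        minority = min(passes, len(verdicts) - passes)
        report[prompt_id] = minority / len(verdicts)
    return report
```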

Adaptive mode

When --adaptive is enabled, Verdict runs a second pass of follow-up probes selected based on each initial response. Pattern selection is entirely rule-based — no LLM is used to generate new attacks. All probes are composed from the curated OWASP LLM Top 10 pattern library in verdict/evals/attack_patterns/patterns.json.

This design ensures Verdict remains a defensive evaluation tool. See CONTRIBUTING.md for the security boundary policy.

verdict eval --target simple_rag --adaptive
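
As a rough picture of what rule-based selection means, the sketch below maps features of an initial response to identifiers of follow-up patterns, and only ever returns identifiers that already exist in a fixed library. The rules, the pattern identifiers, and the function name are all hypothetical; the real schema of patterns.json and Verdict's actual selection rules are not shown in this README.

```python
import re

# Hypothetical rules: (rule name, regex over the response, follow-up pattern ids).
RULES = [
    ("refused", re.compile(r"\b(can't|cannot|won't|unable to)\b", re.I), ["rephrase_probe"]),
    ("mentions_instructions", re.compile(r"system prompt|my instructions", re.I), ["disclosure_probe"]),
]
# Hypothetical fallback when no rule fires.
DEFAULT_IDS = ["baseline_probe"]


def select_followups(response_text, library_ids):
    """Pick follow-up pattern ids for one response using fixed rules only.

    library_ids: set of ids available in the curated library. Nothing is
    generated here; ids outside the library are never returned.
    """
    selected = []
    for _name, pattern, ids in RULES:
        if pattern.search(response_text):
            for pattern_id in ids:
                if pattern_id in library_ids and pattern_id not in selected:
                    selected.append(pattern_id)
    return selected or [i for i in DEFAULT_IDS if i in library_ids]
```

Because selection is a deterministic lookup, the same response always yields the same follow-ups, which keeps adaptive runs reproducible and auditable.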

Writing a custom adapter

# my_adapter.py
from verdict.adapters.base import TargetAdapter
from verdict.models.schemas import ExecutionResult, TestPrompt

class MyAdapter(TargetAdapter):
    name = "my-system"
    version = "1.0.0"

    async def execute(self, prompt: TestPrompt) -> ExecutionResult:
        # call_my_system is a placeholder for your own client call; it is not part of Verdict.
        response = call_my_system(prompt.prompt)
        return self.make_result(prompt, response=response)

Run it by passing the file path and class name to --target:

verdict eval --target my_adapter.py:MyAdapter
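
Because execute is declared async, a blocking client call inside it would hold up the event loop while it waits. If your system only offers a synchronous client, one common approach is to offload the call to a worker thread with asyncio.to_thread (Python 3.9+). The sketch below shows the pattern on its own, without importing Verdict; call_my_system here is a stand-in that sleeps briefly, and whether Verdict runs prompts concurrently is not stated in this README.

```python
import asyncio
import time


def call_my_system(text: str) -> str:
    """Stand-in for a blocking client call (hypothetical)."""
    time.sleep(0.05)  # simulate network latency
    return f"echo: {text}"


async def call_without_blocking(text: str) -> str:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(call_my_system, text)


async def main():
    # Several calls can now overlap instead of running strictly one after another.
    return await asyncio.gather(*(call_without_blocking(t) for t in ["a", "b", "c"]))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Inside an adapter, the same idea would mean awaiting asyncio.to_thread(call_my_system, prompt.prompt) in execute instead of calling call_my_system directly.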

Test categories

| Category | What it evaluates |
| --- | --- |
| correctness | Factual accuracy and reasoning quality |
| safety | Refusal of harmful, dangerous, or unethical requests |
| injection | Robustness against prompt injection (OWASP LLM01, LLM07) |
| edge_case | Graceful handling of malformed and ambiguous inputs |
| compliance | Privacy and data handling (OWASP LLM02) |

Judge calibration

The Judge is validated against 22 hand-labeled examples covering all five test categories. Results are produced by running the live judge against known ground truth — no labels were derived from judge output.

| Metric | Target | Baseline |
| --- | --- | --- |
| Pass/fail agreement (non-borderline) | ≥ 80% | 100% (18/18) |
| Critical failure detection | 5 / 5 | 5 / 5 |
| Score accuracy (±1) | ≥ 70% | 100% (10/10) |

Measured on claude-sonnet-4-6, 2026-05-22.
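
For readers who want to see how metrics of this kind are typically computed, here is a small sketch of the two ratio metrics in the table: pass/fail agreement over non-borderline examples, and score accuracy within a tolerance of 1. The LabeledExample structure and function names are hypothetical, and the sketch does not reproduce the project's calibration set or test code; see docs/judge_calibration.md for the real methodology.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabeledExample:
    """One hand-labeled example paired with the judge's output (hypothetical shape)."""
    expected_pass: bool
    judged_pass: bool
    borderline: bool = False
    expected_score: Optional[int] = None
    judged_score: Optional[int] = None


def pass_fail_agreement(examples):
    """Return (agreements, total) over non-borderline examples."""
    scored = [e for e in examples if not e.borderline]
    agree = sum(e.expected_pass == e.judged_pass for e in scored)
    return agree, len(scored)


def score_accuracy(examples, tolerance=1):
    """Return (hits, total) over examples that carry both scores."""
    scored = [
        e for e in examples
        if e.expected_score is not None and e.judged_score is not None
    ]
    hits = sum(abs(e.expected_score - e.judged_score) <= tolerance for e in scored)
    return hits, len(scored)
```

Returning counts rather than percentages keeps the denominators visible, which matters here because the two metrics are computed over different subsets of the labeled examples.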

Run calibration locally (requires ANTHROPIC_API_KEY):

pytest tests/qa/test_judge_calibration.py -v -m llm

See docs/judge_calibration.md for full methodology.

License

MIT — see LICENSE.
