Verdict

Evaluation infrastructure for AI agents.

Demo

A recording is coming soon. Until then, run ./scripts/demo.sh locally after pip install verdict-eval.

Install

pip install verdict-eval

Quickstart

# Run an evaluation against a built-in adapter
verdict eval --target simple_rag --num-per-category 5

# Compare two adapter versions
verdict diff --target-a simple_rag --target-b path/to/v2.py:MyAdapter --num 10

# Analyze flakiness across historical runs
verdict flakiness --target my-system --reports-dir ./reports

CLI reference

verdict eval

Run a full evaluation against a target adapter.

| Flag | Default | Description |
| --- | --- | --- |
| --target | required | Adapter spec: simple_rag or path/to/file.py:ClassName |
| --num-per-category | 5 | Prompts per test category |
| --categories | all | Specific categories (repeat for multiple) |
| --output-dir | ./reports | Report output directory |
| --run-id | auto | Custom run identifier |
| --model | settings default | Override LLM model for all agents |
| --bootstrap-iterations | 1000 | Bootstrap CI iterations (0 to disable) |
| --max-cost-usd |  | Fail (exit 2) if total cost exceeds this amount |
| --max-total-latency-seconds |  | Fail (exit 2) if total latency exceeds this |
| --fail-on-pass-rate-below |  | Fail (exit 2) if pass rate < threshold |
| --fail-on-ci-low-below |  | Fail (exit 2) if CI lower bound < threshold |
| --cache-mode | off | off / record / replay / update |
| --cache-dir | .verdict_cache | Directory for cached responses |
| --adaptive | off | Run adaptive follow-up probes based on initial responses |
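
The --bootstrap-iterations and --fail-on-ci-low-below flags imply a confidence interval on the pass rate obtained by resampling. The sketch below shows one common way to compute such an interval, a percentile bootstrap, and how a threshold on its lower bound could translate into exit code 2. It is an illustration under assumptions: the function names, the 95% level, the percentile method, and the use of 0 for success are not taken from Verdict's implementation.

```python
import random


def bootstrap_pass_rate_ci(outcomes, iterations=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes.

    Hypothetical helper for illustration; not part of the verdict package.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the pass rate of each resample.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(iterations)
    )
    alpha = (1.0 - level) / 2.0
    low = rates[int(alpha * (iterations - 1))]
    high = rates[int((1.0 - alpha) * (iterations - 1))]
    return low, high


def gate_exit_code(outcomes, min_pass_rate=None, min_ci_low=None, iterations=1000):
    """Return 2 if a threshold is violated, else 0 (assumed success code)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    pass_rate = sum(outcomes) / len(outcomes)
    if min_pass_rate is not None and pass_rate < min_pass_rate:
        return 2
    # iterations == 0 disables the CI check, like --bootstrap-iterations 0.
    if min_ci_low is not None and iterations > 0:
        ci_low, _ = bootstrap_pass_rate_ci(outcomes, iterations)
        if ci_low < min_ci_low:
            return 2
    return 0


if __name__ == "__main__":
    results = [1] * 22 + [0] * 3  # toy data: 22 passes, 3 failures
    print(bootstrap_pass_rate_ci(results))
    print(gate_exit_code(results, min_pass_rate=0.8, min_ci_low=0.5))
```

A gate on the lower bound is stricter than a gate on the raw pass rate, because a small test suite produces a wide interval even when most prompts pass.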

verdict diff

Compare two adapter versions against the same generated test suite.

verdict diff \
  --target-a simple_rag \
  --target-b path/to/v2.py:V2Adapter \
  --num 10
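
Because both adapters run against the same generated test suite, their results can be compared prompt by prompt. The following is a minimal sketch of such a paired comparison, assuming each run reduces to a mapping from a prompt ID to a pass/fail boolean. The data shape and the paired_diff name are hypothetical and do not describe the report that verdict diff writes.

```python
def paired_diff(results_a, results_b):
    """Compare two runs over the same prompts.

    results_a / results_b: dict mapping prompt id -> bool (True = pass).
    Only prompts present in both runs are compared.
    """
    shared = sorted(results_a.keys() & results_b.keys())
    # A regression passed under A but fails under B; a fix is the reverse.
    regressions = [p for p in shared if results_a[p] and not results_b[p]]
    fixes = [p for p in shared if not results_a[p] and results_b[p]]
    n = len(shared)
    return {
        "compared": n,
        "regressions": regressions,
        "fixes": fixes,
        "pass_rate_a": sum(results_a[p] for p in shared) / n if n else None,
        "pass_rate_b": sum(results_b[p] for p in shared) / n if n else None,
    }
```

Listing regressions and fixes separately matters because two versions can have the same overall pass rate while failing on different prompts.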

verdict flakiness

Analyze judge and target consistency across historical evaluation runs.

verdict flakiness --target my-system --min-runs 5 --reports-dir ./reports
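
One way to reason about consistency across runs is to ask, for each prompt, how often its verdict disagrees with the majority outcome. The sketch below computes that rate and skips prompts with too little history, loosely analogous to --min-runs. The metric, the input shape, and the function name are assumptions for illustration; this document does not specify how verdict flakiness scores consistency, and the sketch does not separate judge variance from target variance.

```python
from collections import defaultdict


def flakiness_by_prompt(runs, min_runs=5):
    """runs: list of dicts, each mapping prompt id -> bool (True = pass).

    Returns {prompt id: fraction of runs disagreeing with the majority verdict}.
    0.0 means perfectly stable; 0.5 is the maximum (an even split).
    """
    history = defaultdict(list)
    for run in runs:
        for prompt_id, passed in run.items():
            history[prompt_id].append(bool(passed))

    scores = {}
    for prompt_id, verdicts in history.items():
        if len(verdicts) < min_runs:
            continue  # not enough history to judge stability
        passes = sum(verdicts)
        minority = min(passes, len(verdicts) - passes)
        scores[prompt_id] = minority / len(verdicts)
    return scores
```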

Adaptive mode

When --adaptive is enabled, Verdict runs a second pass of follow-up probes, choosing each probe according to the initial response it follows. Pattern selection is entirely rule-based: no LLM is used to generate new attacks. All probes are composed from the curated OWASP LLM Top 10 pattern library in verdict/evals/attack_patterns/patterns.json.

This design ensures Verdict remains a defensive evaluation tool. See CONTRIBUTING.md for the security boundary policy.

verdict eval --target simple_rag --adaptive
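
To make "rule-based selection from a fixed library" concrete, here is a small sketch. Each rule is a plain predicate over the initial response text, and a rule can only select a pattern ID that already exists in the library, so nothing new is generated. The pattern IDs, the rules, and the library shape are all hypothetical; the real schema of patterns.json is not shown in this document.

```python
import re

# Hypothetical pattern ids mapped to OWASP categories; not Verdict's real schema.
PATTERN_LIBRARY = {
    "rephrase_after_refusal": "LLM01",
    "probe_system_prompt": "LLM07",
    "probe_personal_data": "LLM02",
}

# Each rule: (name, predicate on the initial response text, follow-up pattern id).
RULES = [
    (
        "refused",
        lambda r: re.search(r"\b(can't|cannot|won't)\b", r, re.I) is not None,
        "rephrase_after_refusal",
    ),
    (
        "mentions_instructions",
        lambda r: "system prompt" in r.lower(),
        "probe_system_prompt",
    ),
    (
        "contains_email",
        lambda r: re.search(r"[\w.+-]+@[\w-]+\.\w+", r) is not None,
        "probe_personal_data",
    ),
]


def select_followups(response_text, library=PATTERN_LIBRARY, rules=RULES):
    """Return follow-up pattern ids whose rule fires, restricted to the library."""
    selected = []
    for _name, predicate, pattern_id in rules:
        if pattern_id not in library or pattern_id in selected:
            continue  # never select anything outside the curated library
        if predicate(response_text):
            selected.append(pattern_id)
    return selected
```

Keeping the rules as data makes the selection auditable: every follow-up probe can be traced to a named rule and a library entry.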

Writing a custom adapter

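# my_adapter.py
from verdict.adapters.base import TargetAdapter
from verdict.models.schemas import ExecutionResult, TestPrompt


def call_my_system(text: str) -> str:
    """Placeholder: replace with a real call into the system under test."""
    raise NotImplementedError


class MyAdapter(TargetAdapter):
    name = "my-system"
    version = "1.0.0"

    async def execute(self, prompt: TestPrompt) -> ExecutionResult:
        # Send the prompt text to the target and wrap the reply in a result.
        response = call_my_system(prompt.prompt)
        return self.make_result(prompt, response=response)

Run the evaluation against the adapter:

verdict eval --target my_adapter.py:MyAdapter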
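
The execute method is a coroutine, so a blocking client call inside it would hold up the event loop for the duration of the call. If the system under test only offers a synchronous client, one option is to push the call onto a worker thread with asyncio.to_thread (Python 3.9+). The sketch below is self-contained and does not import verdict; call_my_system and call_my_system_async are hypothetical stand-ins, and whether Verdict actually runs adapters concurrently is not stated here, so treat this as optional.

```python
import asyncio
import time


def call_my_system(text: str) -> str:
    """Stand-in for a blocking client call (hypothetical)."""
    time.sleep(0.05)  # simulate network latency
    return f"echo: {text}"


async def call_my_system_async(text: str) -> str:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(call_my_system, text)


async def main() -> list[str]:
    # The three calls overlap instead of running back to back.
    return await asyncio.gather(
        *(call_my_system_async(t) for t in ("a", "b", "c"))
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Inside an adapter, the equivalent change would be to await call_my_system_async(prompt.prompt) in execute instead of calling call_my_system directly.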

Test categories

| Category | What it evaluates |
| --- | --- |
| correctness | Factual accuracy and reasoning quality |
| safety | Refusal of harmful, dangerous, or unethical requests |
| injection | Robustness against prompt injection (OWASP LLM01, LLM07) |
| edge_case | Graceful handling of malformed and ambiguous inputs |
| compliance | Privacy and data handling (OWASP LLM02) |

Judge calibration

The Judge is validated against 22 hand-labeled examples covering all five test categories. Results are produced by running the live judge against known ground truth; no labels were derived from judge output.

| Metric | Target | Baseline |
| --- | --- | --- |
| Pass/fail agreement (non-borderline) | ≥ 80% | TBD |
| Critical failure detection | 5 / 5 | TBD |
| Score accuracy (±1) | ≥ 70% | TBD |
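
The three metrics above can be computed from pairs of hand labels and judge outputs. The sketch below shows one plausible reading of each: agreement is measured only on non-borderline examples, a critical failure counts as detected when the judge marks the example as failing, and score accuracy is the share of examples where the two scores differ by at most 1. The Example fields and these definitions are assumptions for illustration; docs/judge_calibration.md is the authority on the actual methodology.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One hand-labeled calibration example (hypothetical field names)."""
    human_pass: bool
    judge_pass: bool
    human_score: int
    judge_score: int
    borderline: bool = False  # excluded from pass/fail agreement
    critical: bool = False    # hand-labeled critical failure


def _rate(hits, total):
    """Fraction of hits, or None when there is nothing to measure."""
    return hits / total if total else None


def calibration_metrics(examples):
    clear = [e for e in examples if not e.borderline]
    critical = [e for e in examples if e.critical]
    return {
        "pass_fail_agreement": _rate(
            sum(e.human_pass == e.judge_pass for e in clear), len(clear)
        ),
        # Assumed definition: the judge "detects" a critical failure by failing it.
        "critical_detected": sum(not e.judge_pass for e in critical),
        "critical_total": len(critical),
        "score_within_1": _rate(
            sum(abs(e.human_score - e.judge_score) <= 1 for e in examples),
            len(examples),
        ),
    }
```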

Run calibration locally (requires ANTHROPIC_API_KEY):

pytest tests/qa/test_judge_calibration.py -v -m llm

See docs/judge_calibration.md for full methodology.

License

MIT — see LICENSE.
