Verdict

Evaluation infrastructure for AI agents.

Demo

Recording coming soon. Until then, run ./scripts/demo.sh locally after pip install verdict-eval.

Install

pip install verdict-eval

Quickstart

# Run an evaluation against a built-in adapter
verdict eval --target simple_rag --num-per-category 5

# Compare two adapter versions
verdict diff --target-a simple_rag --target-b path/to/v2.py:MyAdapter --num 10

# Analyze flakiness across historical runs
verdict flakiness --target my-system --reports-dir ./reports

CLI reference

verdict eval

Run a full evaluation against a target adapter.

Flag                          Default           Description
--target                      required          Adapter spec: simple_rag or path/to/file.py:ClassName
--num-per-category            5                 Prompts per test category
--categories                  all               Specific categories (repeat for multiple)
--output-dir                  ./reports         Report output directory
--run-id                      auto              Custom run identifier
--model                       settings default  Override LLM model for all agents
--bootstrap-iterations        1000              Bootstrap CI iterations (0 to disable)
--max-cost-usd                (none)            Fail (exit 2) if total cost exceeds this amount
--max-total-latency-seconds   (none)            Fail (exit 2) if total latency exceeds this
--fail-on-pass-rate-below     (none)            Fail (exit 2) if pass rate < threshold
--fail-on-ci-low-below        (none)            Fail (exit 2) if CI lower bound < threshold
--cache-mode                  off               off / record / replay / update
--cache-dir                   .verdict_cache    Directory for cached responses
--adaptive                    off               Run adaptive follow-up probes based on initial responses
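
The threshold flags above let an evaluation act as a CI gate: a violated threshold makes the command exit with code 2. As a rough illustration of what --fail-on-pass-rate-below and --fail-on-ci-low-below check, the following Python sketch computes a pass rate and a percentile bootstrap confidence interval over per-prompt pass/fail outcomes. It is not Verdict's implementation. The function names, the 95% confidence level, the fixed seed, and the choice of the percentile method are assumptions made for the example; only the exit code 2 and the "0 iterations disables the CI" behavior come from the table.

```python
import random


def bootstrap_ci(outcomes, iterations=1000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes (assumed method)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the pass rate of each resample.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(iterations))
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * iterations)
    hi_idx = min(iterations - 1, int((1.0 - alpha) * iterations))
    return means[lo_idx], means[hi_idx]


def gate_exit_code(outcomes, pass_rate_floor=None, ci_low_floor=None, iterations=1000):
    """Return 2 if a threshold is violated, else 0 (hypothetical helper)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    pass_rate = sum(outcomes) / len(outcomes)
    if pass_rate_floor is not None and pass_rate < pass_rate_floor:
        return 2
    # iterations == 0 mirrors "0 to disable": the CI check is skipped.
    if ci_low_floor is not None and iterations > 0:
        ci_low, _ = bootstrap_ci(outcomes, iterations=iterations)
        if ci_low < ci_low_floor:
            return 2
    return 0


if __name__ == "__main__":
    outcomes = [1] * 18 + [0] * 2  # 18 passes, 2 failures
    print(bootstrap_ci(outcomes))
    print(gate_exit_code(outcomes, pass_rate_floor=0.8, ci_low_floor=0.5))
```

Gating on the lower bound of the interval rather than on the raw pass rate is stricter for small suites, because a few prompts give a wide interval even when the observed pass rate looks high.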

verdict diff

Compare two adapter versions against the same generated test suite.

verdict diff \
  --target-a simple_rag \
  --target-b path/to/v2.py:V2Adapter \
  --num 10
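
Because both adapters run against the same generated test suite, their results can be compared prompt by prompt. The sketch below shows one way to summarize such a paired comparison in Python. The input shape (a mapping from prompt ID to a pass/fail boolean) and the function name are assumptions for illustration; the actual report format that verdict diff produces is not described here.

```python
def diff_outcomes(results_a, results_b):
    """Compare two {prompt_id: passed} mappings over the prompts they share."""
    shared = sorted(results_a.keys() & results_b.keys())
    # A regression passed under A but fails under B; an improvement is the reverse.
    regressions = [p for p in shared if results_a[p] and not results_b[p]]
    improvements = [p for p in shared if not results_a[p] and results_b[p]]
    unchanged = len(shared) - len(regressions) - len(improvements)
    return {
        "regressions": regressions,
        "improvements": improvements,
        "unchanged": unchanged,
    }


if __name__ == "__main__":
    a = {"p1": True, "p2": True, "p3": False}
    b = {"p1": True, "p2": False, "p3": True}
    print(diff_outcomes(a, b))
```

Pairing by prompt keeps the comparison focused on prompts whose outcome changed, which is more informative than comparing two aggregate pass rates.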

verdict flakiness

Analyze judge and target consistency across historical evaluation runs.

verdict flakiness --target my-system --min-runs 5 --reports-dir ./reports
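
To make the idea of consistency across runs concrete, here is a minimal Python sketch that scores each prompt by how often its outcome disagrees with its own majority outcome. This is an assumed metric chosen for illustration, not Verdict's algorithm, and the in-memory input (one {prompt_id: passed} mapping per run) is hypothetical. The min_runs parameter mirrors the --min-runs flag by skipping prompts with too little history.

```python
from collections import defaultdict


def flakiness_by_prompt(runs, min_runs=5):
    """Return {prompt_id: minority-outcome fraction} for prompts with enough runs."""
    history = defaultdict(list)
    for run in runs:
        for prompt_id, passed in run.items():
            history[prompt_id].append(bool(passed))

    scores = {}
    for prompt_id, results in history.items():
        if len(results) < min_runs:
            continue  # not enough history to judge consistency
        pass_fraction = sum(results) / len(results)
        # 0.0 means fully consistent; 0.5 means a coin flip.
        scores[prompt_id] = min(pass_fraction, 1.0 - pass_fraction)
    return scores


if __name__ == "__main__":
    runs = [{"a": True, "b": True}] * 4 + [{"a": True, "b": False}]
    print(flakiness_by_prompt(runs))
```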

Adaptive mode

When --adaptive is enabled, Verdict runs a second pass of follow-up probes, selected based on each initial response. Pattern selection is entirely rule-based: no LLM is used to generate new attacks. All probes are composed from the curated OWASP LLM Top 10 pattern library in verdict/evals/attack_patterns/patterns.json.

This design ensures Verdict remains a defensive evaluation tool. See CONTRIBUTING.md for the security boundary policy.

verdict eval --target simple_rag --adaptive
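
The rule-based selection described above can be pictured as a fixed table of rules, each mapping a trait of the initial response to entries in a curated library. The Python sketch below shows that shape only. The rules, the trigger expressions, the pattern IDs, and the library layout are all hypothetical; the real schema of patterns.json and the rules Verdict applies are not documented here. The point of the sketch is that selection never creates new probes: it can only return IDs that already exist in the library.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    trigger: re.Pattern
    pattern_ids: tuple


# Hypothetical rules; the IDs stand in for entries in a curated library.
RULES = (
    Rule("refused", re.compile(r"\b(can't|cannot|won't|unable to)\b", re.I),
         ("LLM01-rephrase",)),
    Rule("mentions_instructions", re.compile(r"system prompt", re.I),
         ("LLM07-extract",)),
)


def select_followups(response, library):
    """Return library pattern IDs whose rule matches the response text."""
    selected = []
    for rule in RULES:
        if rule.trigger.search(response):
            for pattern_id in rule.pattern_ids:
                # Only IDs present in the curated library can be selected.
                if pattern_id in library and pattern_id not in selected:
                    selected.append(pattern_id)
    return selected


if __name__ == "__main__":
    library = {"LLM01-rephrase": {}, "LLM07-extract": {}}
    print(select_followups("I cannot help with that.", library))
```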

Writing a custom adapter

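# my_adapter.py
from verdict.adapters.base import TargetAdapter
from verdict.models.schemas import ExecutionResult, TestPrompt


def call_my_system(text: str) -> str:
    # Placeholder: replace with a call into your own system.
    raise NotImplementedError


class MyAdapter(TargetAdapter):
    name = "my-system"
    version = "1.0.0"

    async def execute(self, prompt: TestPrompt) -> ExecutionResult:
        # Send the prompt text to the system under test.
        response = call_my_system(prompt.prompt)
        # Wrap the raw response in an ExecutionResult.
        return self.make_result(prompt, response=response)

Run it by passing the file path and class name as the target:

verdict eval --target my_adapter.py:MyAdapter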
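
Since execute is a coroutine, a blocking client call inside it holds up the event loop while it waits. If your system is only reachable through a synchronous client, one option is to hand the call to a worker thread with asyncio.to_thread (Python 3.9+). The sketch below shows that pattern on its own, without importing Verdict. Here call_my_system is a stand-in for your client and call_without_blocking is a hypothetical helper name; whether Verdict executes prompts concurrently is not stated here, so treat any speedup as an assumption.

```python
import asyncio
import time


def call_my_system(text: str) -> str:
    """Stand-in for a blocking client call; replace with your own."""
    time.sleep(0.01)  # simulate network latency
    return f"echo: {text}"


async def call_without_blocking(text: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(call_my_system, text)


async def main():
    # Two calls overlap instead of running back to back.
    replies = await asyncio.gather(
        call_without_blocking("a"),
        call_without_blocking("b"),
    )
    return list(replies)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Inside an adapter, the body of execute would then await call_without_blocking(prompt.prompt) in place of the direct call.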

Test categories

Category      What it evaluates
correctness   Factual accuracy and reasoning quality
safety        Refusal of harmful, dangerous, or unethical requests
injection     Robustness against prompt injection (OWASP LLM01, LLM07)
edge_case     Graceful handling of malformed and ambiguous inputs
compliance    Privacy and data handling (OWASP LLM02)

Judge calibration

The Judge is validated against 22 hand-labeled examples covering all five test categories. Results are produced by running the live judge against known ground truth; no labels were derived from judge output.

Metric                                  Target    Baseline
Pass/fail agreement (non-borderline)    ≥ 80%     TBD
Critical failure detection              5 / 5     TBD
Score accuracy (±1)                     ≥ 70%     TBD

Run calibration locally (requires ANTHROPIC_API_KEY):

pytest tests/qa/test_judge_calibration.py -v -m llm

See docs/judge_calibration.md for full methodology.
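
As a rough guide to how the three metrics in the table could be computed from labeled examples, here is a minimal Python sketch. The field names (human_pass, judge_pass, human_score, judge_score, borderline, critical) are hypothetical, and computing score accuracy over all examples rather than a subset is an assumption; docs/judge_calibration.md remains the authoritative description.

```python
def calibration_metrics(examples):
    """Summarize judge agreement with hand labels (illustrative field names)."""
    # Pass/fail agreement is measured on non-borderline examples only.
    decisive = [e for e in examples if not e["borderline"]]
    agree = sum(e["human_pass"] == e["judge_pass"] for e in decisive)

    # Score accuracy counts judge scores within 1 point of the human score.
    within_one = sum(abs(e["human_score"] - e["judge_score"]) <= 1 for e in examples)

    # A critical failure is detected when the judge fails the example.
    critical = [e for e in examples if e["critical"]]
    caught = sum(not e["judge_pass"] for e in critical)

    return {
        "pass_fail_agreement": agree / len(decisive) if decisive else None,
        "score_accuracy": within_one / len(examples) if examples else None,
        "critical_detected": (caught, len(critical)),
    }


if __name__ == "__main__":
    sample = [
        {"human_pass": True, "judge_pass": True, "human_score": 5,
         "judge_score": 4, "borderline": False, "critical": False},
        {"human_pass": False, "judge_pass": False, "human_score": 1,
         "judge_score": 1, "borderline": False, "critical": True},
    ]
    print(calibration_metrics(sample))
```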

License

MIT — see LICENSE.


