Evaluation infrastructure for AI agents.

Maintainers

dnicolau2027

Verdict

Evaluation infrastructure for AI agents.

Install

pip install verdict-eval

Quickstart

# Run an evaluation against a built-in adapter
verdict eval --target simple_rag --num-per-category 5

# Compare two adapter versions
verdict diff --target-a simple_rag --target-b path/to/v2.py:MyAdapter --num 10

# Analyze flakiness across historical runs
verdict flakiness --target my-system --reports-dir ./reports
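When these commands run from a script or CI job, it can help to assemble the arguments programmatically and interpret the exit status. The following is a minimal sketch, not part of Verdict itself. It uses only flags documented in the `verdict eval` reference; the helper names are hypothetical, and treating exit code 0 as success and any code other than 0 or 2 as an unrelated error is an assumption. The expected format of threshold values (fraction versus percentage) is not stated in this README, so the `0.8` in the usage line is illustrative only.

```python
import subprocess
from typing import List, Optional


def build_eval_command(
    target: str,
    num_per_category: int = 5,
    fail_on_pass_rate_below: Optional[float] = None,
    max_cost_usd: Optional[float] = None,
) -> List[str]:
    """Assemble a `verdict eval` argument list from documented flags only."""
    cmd = ["verdict", "eval", "--target", target,
           "--num-per-category", str(num_per_category)]
    if fail_on_pass_rate_below is not None:
        cmd += ["--fail-on-pass-rate-below", str(fail_on_pass_rate_below)]
    if max_cost_usd is not None:
        cmd += ["--max-cost-usd", str(max_cost_usd)]
    return cmd


def describe_exit(code: int) -> str:
    """Map an exit code to a label; only exit 2 is documented in the README."""
    if code == 0:
        return "ok"  # assumption: conventional success code
    if code == 2:
        return "gate-failed"  # documented: a threshold flag tripped
    return "error"  # assumption: anything else is an unrelated failure


def run_gate(cmd: List[str]) -> str:
    """Run the command and label the outcome; needs the verdict CLI on PATH."""
    completed = subprocess.run(cmd, check=False)
    return describe_exit(completed.returncode)
```

For example, `run_gate(build_eval_command("simple_rag", fail_on_pass_rate_below=0.8))` would return `"gate-failed"` when the CLI exits with status 2.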

CLI reference

`verdict eval`

Run a full evaluation against a target adapter.

| Flag | Default | Description |
| --- | --- | --- |
| `--target` | required | Adapter spec: `simple_rag` or `path/to/file.py:ClassName` |
| `--num-per-category` | 5 | Prompts per test category |
| `--categories` | all | Specific categories (repeat the flag for multiple) |
| `--output-dir` | `./reports` | Report output directory |
| `--run-id` | auto | Custom run identifier |
| `--model` | settings default | Override the LLM model for all agents |
| `--bootstrap-iterations` | 1000 | Bootstrap CI iterations (0 to disable) |
| `--max-cost-usd` | none | Fail (exit 2) if total cost exceeds this amount |
| `--max-total-latency-seconds` | none | Fail (exit 2) if total latency exceeds this many seconds |
| `--fail-on-pass-rate-below` | none | Fail (exit 2) if pass rate < threshold |
| `--fail-on-ci-low-below` | none | Fail (exit 2) if CI lower bound < threshold |
| `--cache-mode` | `off` | `off` / `record` / `replay` / `update` |
| `--cache-dir` | `.verdict_cache` | Directory for cached responses |
| `--adaptive` | off | Run adaptive follow-up probes based on initial responses |
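The `--bootstrap-iterations` and `--fail-on-ci-low-below` flags refer to a bootstrap confidence interval on the pass rate. This README does not say which bootstrap variant Verdict uses, so the sketch below shows a generic percentile bootstrap for illustration only. The function name, the 95% confidence level, and the fixed seed are assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_pass_rate_ci(
    outcomes: List[bool],
    iterations: int = 1000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of pass/fail outcomes."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement, recording the pass rate of each resample.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(iterations)
    )
    # Read the interval bounds off the sorted resampled pass rates.
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * (iterations - 1))
    hi_idx = int((1.0 - alpha) * (iterations - 1))
    return rates[lo_idx], rates[hi_idx]
```

Under this reading, a gate such as `--fail-on-ci-low-below` would compare the first element of the returned pair against the threshold, which is stricter than gating on the observed pass rate alone when the sample is small.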

`verdict diff`

Compare two adapter versions against the same generated test suite.

verdict diff \
  --target-a simple_rag \
  --target-b path/to/v2.py:V2Adapter \
  --num 10
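Because both adapters see the same generated test suite, their results can be compared prompt by prompt. The sketch below illustrates that paired comparison under an assumed data shape (a mapping from prompt id to pass/fail); it is not Verdict's report format, and `diff_results` is a hypothetical name.

```python
from typing import Dict, List


def diff_results(a: Dict[str, bool], b: Dict[str, bool]) -> Dict[str, List[str]]:
    """Classify each shared prompt id by how its outcome changed from A to B."""
    shared = sorted(a.keys() & b.keys())
    return {
        "regressions": [p for p in shared if a[p] and not b[p]],
        "improvements": [p for p in shared if not a[p] and b[p]],
        "unchanged": [p for p in shared if a[p] == b[p]],
    }
```

A paired view like this separates regressions from improvements, which an aggregate pass-rate difference can hide when the two cancel out.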

`verdict flakiness`

Analyze judge and target consistency across historical evaluation runs.

verdict flakiness --target my-system --min-runs 5 --reports-dir ./reports
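One simple way to think about flakiness is the share of runs in which a prompt produced its less common outcome. The sketch below assumes prompts keep stable ids across runs and that each run records a pass/fail per prompt; both are assumptions, as is the scoring rule itself. The `min_runs` default of 5 mirrors the example command above, not a documented default.

```python
from typing import Dict, List


def flaky_prompts(history: Dict[str, List[bool]], min_runs: int = 5) -> Dict[str, float]:
    """Return the minority-outcome share for prompts with mixed results."""
    scores: Dict[str, float] = {}
    for prompt_id, outcomes in history.items():
        if len(outcomes) < min_runs:
            continue  # too few runs to say anything about consistency
        passes = sum(outcomes)
        minority = min(passes, len(outcomes) - passes)
        if minority > 0:
            scores[prompt_id] = minority / len(outcomes)
    return scores
```

A prompt that always passes or always fails scores nothing here; only prompts whose outcome flips across runs are reported.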

Adaptive mode

When `--adaptive` is enabled, Verdict runs a second pass of follow-up probes chosen according to each initial response. Pattern selection is entirely rule-based; no LLM is used to generate new attacks. All probes are composed from the curated OWASP LLM Top 10 pattern library in `verdict/evals/attack_patterns/patterns.json`.

This design ensures Verdict remains a defensive evaluation tool. See CONTRIBUTING.md for the security boundary policy.

verdict eval --target simple_rag --adaptive
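To make "rule-based" concrete, the sketch below picks follow-up pattern ids from a fixed library using a deterministic rule on the initial response. Everything in it is hypothetical: the library entries, the `trigger` field, the refusal markers, and the function names. The real schema of `patterns.json` and Verdict's actual selection rules are not shown in this README.

```python
from typing import Dict, List

# Hypothetical stand-in for a curated pattern library (ids only, no payloads).
PATTERNS: List[Dict[str, str]] = [
    {"id": "inj-roleplay", "category": "injection", "trigger": "partial_compliance"},
    {"id": "inj-encoding", "category": "injection", "trigger": "refusal"},
    {"id": "priv-followup", "category": "compliance", "trigger": "partial_compliance"},
]

# Hypothetical markers used to recognise a refusal in the initial response.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def classify_response(text: str) -> str:
    """Label a response with a fixed keyword rule; no model is involved."""
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "refusal"
    return "partial_compliance"


def select_followups(category: str, response: str) -> List[str]:
    """Return ids of library patterns whose category and trigger both match."""
    signal = classify_response(response)
    return [
        p["id"]
        for p in PATTERNS
        if p["category"] == category and p["trigger"] == signal
    ]
```

The point of a design like this is that the set of possible probes is closed: selection can only return entries that already exist in the curated library.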

Writing a custom adapter

# my_adapter.py
from verdict.adapters.base import TargetAdapter
from verdict.models.schemas import ExecutionResult, TestPrompt

class MyAdapter(TargetAdapter):
    name = "my-system"
    version = "1.0.0"

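    async def execute(self, prompt: TestPrompt) -> ExecutionResult:
        # call_my_system is a placeholder for your own client code;
        # await it here if your client exposes a coroutine.
        response = call_my_system(prompt.prompt)
        return self.make_result(prompt, response=response)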

verdict eval --target my_adapter.py:MyAdapter
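Since `execute` is a coroutine, a blocking client call inside it would hold up the event loop for its full duration. One common remedy is to push the call onto a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch below shows the pattern on its own, without Verdict imports; `call_my_system` is a hypothetical blocking stub, and whether Verdict actually runs adapters concurrently is not stated in this README.

```python
import asyncio
import time
from typing import List


def call_my_system(text: str) -> str:
    """Hypothetical blocking client; stands in for an HTTP or SDK call."""
    time.sleep(0.01)
    return f"echo: {text}"


async def execute_without_blocking(text: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(call_my_system, text)


async def run_many(prompts: List[str]) -> List[str]:
    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(execute_without_blocking(p) for p in prompts))
```

Inside an adapter, the same idea would amount to awaiting `asyncio.to_thread(call_my_system, prompt.prompt)` before building the result.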

Test categories

| Category | What it evaluates |
| --- | --- |
| `correctness` | Factual accuracy and reasoning quality |
| `safety` | Refusal of harmful, dangerous, or unethical requests |
| `injection` | Robustness against prompt injection (OWASP LLM01, LLM07) |
| `edge_case` | Graceful handling of malformed and ambiguous inputs |
| `compliance` | Privacy and data handling (OWASP LLM02) |

Judge calibration

The Judge is validated against 22 hand-labeled examples covering all five test categories. Results are produced by running the live judge against known ground truth; no labels were derived from judge output.

| Metric | Target | Baseline |
| --- | --- | --- |
| Pass/fail agreement (non-borderline) | ≥ 80% | 100% (18/18) |
| Critical failure detection | 5 / 5 | 5 / 5 |
| Score accuracy (±1) | ≥ 70% | 100% (10/10) |

Measured on claude-sonnet-4-6, 2026-05-22.
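The two ratio metrics in the table can be computed along the following lines. This is a sketch under assumed record fields (`borderline`, `judge_pass`, `label_pass`, `judge_score`, `label_score`), not the code in `tests/qa/test_judge_calibration.py`. It also assumes that agreement is measured only on non-borderline examples and score accuracy only on examples that carry a labeled score, which would be one explanation for denominators smaller than 22.

```python
from typing import Any, Dict, List


def calibration_metrics(examples: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute pass/fail agreement and within-one score accuracy."""
    # Agreement is measured only where the hand label is unambiguous.
    non_borderline = [e for e in examples if not e["borderline"]]
    agree = sum(e["judge_pass"] == e["label_pass"] for e in non_borderline)

    # Score accuracy is measured only where a numeric label exists.
    scored = [e for e in examples if e.get("label_score") is not None]
    close = sum(abs(e["judge_score"] - e["label_score"]) <= 1 for e in scored)

    return {
        "pass_fail_agreement": agree / len(non_borderline) if non_borderline else float("nan"),
        "score_accuracy": close / len(scored) if scored else float("nan"),
    }
```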

Run calibration locally (requires ANTHROPIC_API_KEY):

pytest tests/qa/test_judge_calibration.py -v -m llm

See docs/judge_calibration.md for full methodology.

License

MIT — see LICENSE.

Release history

- 0.1.3 (this version): May 22, 2026
- 0.1.2: May 22, 2026
- 0.1.1: May 22, 2026
- 0.1.0: May 22, 2026

Download files

Source Distribution

verdict_eval-0.1.3.tar.gz (60.0 kB)

Uploaded May 22, 2026 Source

Built Distribution

verdict_eval-0.1.3-py3-none-any.whl (77.3 kB)

Uploaded May 22, 2026 Python 3

File details

Details for the file verdict_eval-0.1.3.tar.gz.

File metadata

Download URL: verdict_eval-0.1.3.tar.gz
Upload date: May 22, 2026
Size: 60.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for verdict_eval-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`c5e03fc68c47da7cf3cb07ea07f8834590cee2e1dd62fa39994c7333431b4b5b`
MD5	`015cde32b4f008fc3bfe048cf8afde5a`
BLAKE2b-256	`51d6df68b7693eecd8182b1bb64bf5c3f7eaba44d873e38fac9322e5fad17e70`

Provenance

The following attestation bundles were made for verdict_eval-0.1.3.tar.gz:

Publisher: publish.yml on dannicolau7/verdict

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: verdict_eval-0.1.3.tar.gz
- Subject digest: c5e03fc68c47da7cf3cb07ea07f8834590cee2e1dd62fa39994c7333431b4b5b
- Sigstore transparency entry: 1607978790
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: dannicolau7/verdict@1dfd71fa9c0c1fca44bc83848b6903bed3a699bb
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/dannicolau7
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@1dfd71fa9c0c1fca44bc83848b6903bed3a699bb
- Trigger Event: release

File details

Details for the file verdict_eval-0.1.3-py3-none-any.whl.

File metadata

Download URL: verdict_eval-0.1.3-py3-none-any.whl
Upload date: May 22, 2026
Size: 77.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for verdict_eval-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`440be6015930021ea0bbcd6e3d72bde899de2a3aa11c66d325e22b608bf0d558`
MD5	`2ba0758d04e3d755fe945fb0f5054588`
BLAKE2b-256	`c1710a4199ea2df7f823dde7c56cd50213987bb4bbf21564dc3ef89baf69a03e`

Provenance

The following attestation bundles were made for verdict_eval-0.1.3-py3-none-any.whl:

Publisher: publish.yml on dannicolau7/verdict

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: verdict_eval-0.1.3-py3-none-any.whl
- Subject digest: 440be6015930021ea0bbcd6e3d72bde899de2a3aa11c66d325e22b608bf0d558
- Sigstore transparency entry: 1607978978
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: dannicolau7/verdict@1dfd71fa9c0c1fca44bc83848b6903bed3a699bb
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/dannicolau7
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@1dfd71fa9c0c1fca44bc83848b6903bed3a699bb
- Trigger Event: release
