Agent Eval Contract

Pydantic contracts and JSON Schemas for portable agent evaluation records.

Use this package when you are experimenting with agents, harnesses, CI checks, or benchmark runners and need a stable record shape for tasks, runs, scores, failures, and normalized external results. It does not run evaluations, call model providers, store dashboards, or orchestrate agents. It gives those tools a shared contract.

Install

pip install agent-eval-contract

For local development from this repo:

uv sync --dev

Validate A Record

from agent_eval_contract import validate_eval_run

run = validate_eval_run(
    {
        "run_id": "run-login-flow-001",
        "task_id": "task-login-flow-001",
        "harness": "pytest",
        "model": "gpt-5",
        "mode": "autonomous",
        "context_profile": "repo_only",
        "final_status": "success",
        "checks": ["pytest tests/test_auth_redirect.py -q"],
    }
)

print(run.model_dump(mode="json"))

Validation returns typed Pydantic model instances. Invalid records raise pydantic.ValidationError with structured field errors.
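As a rough illustration of what "structured field errors" looks like in practice, the sketch below catches a pydantic.ValidationError and flattens it into (field, error type) pairs. It assumes Pydantic v2. `RunSketch` and `collect_errors` are hypothetical stand-ins written for this example, not the package's own model or API. The same try/except pattern should apply around validate_eval_run.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class RunSketch(BaseModel):
    """Hypothetical stand-in with three of the fields shown above (not the real model)."""

    run_id: str
    task_id: str
    final_status: Literal["success", "partial", "failed", "abandoned", "error"]


def collect_errors(record: dict) -> list[tuple[str, str]]:
    """Return (field path, error type) pairs, or an empty list if the record is valid."""
    try:
        RunSketch.model_validate(record)
    except ValidationError as exc:
        # exc.errors() yields one dict per problem, with "loc", "msg", and "type" keys.
        return [
            (".".join(str(part) for part in err["loc"]), err["type"])
            for err in exc.errors()
        ]
    return []


# task_id is missing and final_status is outside the allowed vocabulary.
problems = collect_errors({"run_id": "run-login-flow-001", "final_status": "passed"})
for field, kind in problems:
    print(f"{field}: {kind}")
```

Reporting the field path alongside the error type keeps CI output short while still pointing at the offending key.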

CLI

agent-eval-contract validate --kind run --file examples/eval_run.json
agent-eval-contract schemas --output-dir /tmp/agent-eval-contract-schemas
agent-eval-contract fixtures --output-dir /tmp/agent-eval-contract-fixtures
agent-eval-contract normalize --harness terminal-bench --file examples/terminal_bench_result.json --task-id task-login-flow-001 --model gpt-5
agent-eval-contract normalize --harness swe-bench --file examples/swe_bench_result.json

The legacy agent-eval-contract-fixtures command still writes fixture bundles and will remain available for one more release.
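A harness that produces records in Python can write them to disk and hand the file to the validate subcommand. The sketch below only prepares the file and the argument list. It assumes the CLI accepts a single JSON object per file, as the examples/eval_run.json path suggests. `write_record` and `validate_command` are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path

# Same record shape as the Python validation example.
record = {
    "run_id": "run-login-flow-001",
    "task_id": "task-login-flow-001",
    "harness": "pytest",
    "model": "gpt-5",
    "mode": "autonomous",
    "context_profile": "repo_only",
    "final_status": "success",
    "checks": ["pytest tests/test_auth_redirect.py -q"],
}


def write_record(record: dict, directory: Path) -> Path:
    """Serialize one record to eval_run.json inside the given directory."""
    path = directory / "eval_run.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


def validate_command(path: Path) -> list[str]:
    """Build the argument list for the validate subcommand shown above."""
    return ["agent-eval-contract", "validate", "--kind", "run", "--file", str(path)]


with tempfile.TemporaryDirectory() as tmp:
    path = write_record(record, Path(tmp))
    command = validate_command(path)
    round_trip = json.loads(path.read_text(encoding="utf-8"))
    print(" ".join(command))
```

With the package installed, the list can be passed to subprocess.run(command) inside the with block, before the temporary directory is removed.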

What It Provides

  • Pydantic models for eval tasks, runs, scores, failures, external results, normalized runs, and fixture manifests
  • runtime validators that return typed model instances
  • JSON Schema export for all public models
  • bundled sample records and markdown templates
  • normalization helpers for Terminal-Bench and SWE-bench results
  • a small CLI for validation, schema export, fixture generation, and normalization
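To make the idea of normalization concrete, here is a minimal sketch of an adapter that maps an external harness result onto the run vocabulary. It is not the package's implementation. The input keys `id` and `resolved` are hypothetical and do not describe the real Terminal-Bench or SWE-bench output formats. The `metadata` key on the output is also an assumption.

```python
from typing import Any

# Hypothetical keys the adapter understands; everything else is passed through.
KNOWN_KEYS = {"id", "resolved"}


def normalize_result(
    external: dict[str, Any], *, harness: str, task_id: str, model: str
) -> dict[str, Any]:
    """Map a hypothetical external result onto the generic run vocabulary."""
    resolved = external.get("resolved")
    if resolved is True:
        final_status = "success"
    elif resolved is False:
        final_status = "failed"
    else:
        # Missing or malformed outcome: report an error rather than guess.
        final_status = "error"
    return {
        "run_id": f"{harness}-{external['id']}",
        "task_id": task_id,
        "harness": harness,
        "model": model,
        "mode": "benchmark",
        "final_status": final_status,
        # Harness-specific fields stay out of the core vocabulary.
        "metadata": {k: v for k, v in external.items() if k not in KNOWN_KEYS},
    }


normalized = normalize_result(
    {"id": "0042", "resolved": True, "duration_s": 12.5},
    harness="swe-bench",
    task_id="task-login-flow-001",
    model="gpt-5",
)
print(normalized["final_status"], normalized["metadata"])
```

Treating an unknown outcome as error keeps a malformed external result from being counted as a genuine failure of the agent.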

Contract Vocabulary

The public core uses generic vocabulary only. Project-specific concepts should live in metadata or a separate adapter package.

  • context_profile: repo_only, provided_context, clean_room, tool_augmented, full_workspace
  • source: manual, ci, benchmark, production_trace, synthetic
  • mode: interactive, autonomous, shadow, replay, benchmark
  • final_status: success, partial, failed, abandoned, error

See docs/contract.md, docs/field-reference.md, and docs/adapters.md for the model contract and adapter guidance.

Development

uv run ruff check agent_eval_contract tests
uv run ruff format --check agent_eval_contract tests
uv run basedpyright agent_eval_contract tests
uv run pytest -q
uv build --out-dir /tmp/agent-eval-contract-dist
