Maintainers

satyaborg

Project description

aevals

Agent evals framework that gives Claude Code the tools to instrument your codebase, capture traces, write evals, and catch LLM regressions.

pip install aevals
aevals init        # detects your agent, generates config
aevals run         # runs scenarios, reports pass/fail

Why

Most teams building agents know they should eval. They don't. The problem isn't motivation — it's that nobody knows where to start.

aevals closes that gap. Point it at your codebase and it figures out the rest — which SDKs you use, where your entrypoint is, what tools your agent has. You go from nothing to a working eval suite without writing boilerplate.

Install

pip install aevals

# Add instrumentation for your provider
pip install aevals[openai]       # OpenAI
pip install aevals[anthropic]    # Anthropic
pip install aevals[google]       # Google GenAI
pip install aevals[bedrock]      # AWS Bedrock
pip install aevals[mistral]      # Mistral
pip install aevals[cohere]       # Cohere

Quick start

1. Initialize — scans your project, detects SDKs and entrypoints, generates aevals.yaml:

aevals init

2. Define scenarios:

# aevals.yaml
config_version: 1
entry: src.agent:main            # module:callable

judge:
  model: openai/gpt-5.4           # any litellm model

scenarios:
  - name: simple-booking
    input: "Book a flight from SFO to JFK for next Tuesday"
    rubric:
      - "Agent calls search_flights before book_flight"
      - "Agent confirms with user before booking"
      - "Final output includes a confirmation number"
    constraints:
      max_steps: 5
      max_duration_ms: 10000

3. Run:

aevals run

── simple-booking ──────────────────────────────────────
  3 spans | 4.2s | 1,840 tokens

  Constraints:
    ✓ steps: 3 <= 5
    ✓ duration: 4200ms <= 10000ms

  Rubric: (judge: openai/gpt-5.4)
    ✓ Agent calls search_flights before book_flight
    ✓ Agent confirms with user before booking
    ✓ Final output includes a confirmation number

── Summary ─────────────────────────────────────────────
  1 scenario, 1 passed, 0 failed
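
The `entry` value in the config above uses the `module:callable` notation. As a rough illustration of what that notation means, the sketch below resolves such a string with the standard library. `resolve_entry` is a hypothetical helper name, and aevals' own loader is not shown in this README and may work differently.

```python
import importlib
from typing import Callable


def resolve_entry(entry: str) -> Callable:
    """Resolve a 'module:callable' string to the callable it names.

    Hypothetical helper for illustration only; not part of aevals.
    """
    module_name, sep, attr_path = entry.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:callable', got {entry!r}")
    target = importlib.import_module(module_name)
    # Support dotted attribute paths such as 'Agent.run'.
    for part in attr_path.split("."):
        target = getattr(target, part)
    if not callable(target):
        raise TypeError(f"{entry!r} does not name a callable")
    return target


# Example with a standard-library target instead of a real agent module.
dumps = resolve_entry("json:dumps")
print(dumps({"ok": True}))
```

With the config above, the equivalent call would be `resolve_entry("src.agent:main")`, assuming `src/agent.py` is importable from the project root.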

How it works

Each scenario spawns your agent in an isolated subprocess. OpenLLMetry auto-instruments your SDK and captures every LLM call as OpenTelemetry spans. The spans are parsed into a trajectory, then scored on two tracks:

Constraints — deterministic, zero LLM cost:

Constraint	Checks
`max_duration_ms`	Wall-clock time is at or under the limit
`max_steps`	Number of LLM calls is at or under the limit
`tool_sequence`	Required tools are called in order (subsequence match)
`no_repeat_calls`	No tool is called N or more times with identical arguments
`output_contains`	Final output includes a given substring
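
To make the table concrete, here is a minimal sketch of how checks like these could be written. It is not aevals' implementation: the `ToolCall` type and every function name are hypothetical, and treating limits as inclusive is an assumption based on the `3 <= 5` line in the sample output.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation in a trajectory."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def within_limit(value: float, limit: float) -> bool:
    """max_steps / max_duration_ms: pass when value is at or under the limit."""
    return value <= limit


def tool_sequence_ok(calls: list[ToolCall], required: list[str]) -> bool:
    """tool_sequence: required tools appear in order; other calls may interleave."""
    names = iter(call.name for call in calls)
    # `name in names` consumes the iterator, which enforces the ordering.
    return all(name in names for name in required)


def no_repeat_calls_ok(calls: list[ToolCall], n: int) -> bool:
    """no_repeat_calls: fail once any (tool, arguments) pair occurs n or more times."""
    counts: dict[tuple[str, str], int] = {}
    for call in calls:
        key = (call.name, json.dumps(call.arguments, sort_keys=True, default=str))
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= n:
            return False
    return True


def output_contains_ok(final_output: str, needle: str) -> bool:
    """output_contains: the final output includes the substring."""
    return needle in final_output
```

All of these run on the captured trajectory alone, which is why this track needs no judge model or API key.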

Rubric — natural-language assertions scored pass/fail by a judge model against the full trajectory (every LLM call, tool invocation, intermediate step). Uses litellm, so any model it supports works as a judge. No judge configured? Rubrics stay pending and don't fail the run.

A scenario passes when all constraints pass AND all rubric items pass.
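
The pass rule can be sketched in a few lines. The `Verdict` enum and `scenario_passes` function are hypothetical names, and the handling of pending rubric items (never counted as failures) is an assumption drawn from the statement that unjudged rubrics don't fail the run.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    PENDING = "pending"  # rubric item not judged, e.g. no judge configured


def scenario_passes(constraints: list[bool], rubric: list[Verdict]) -> bool:
    """Pass when every constraint holds and no rubric item has failed.

    Assumption: PENDING items do not count as failures.
    """
    return all(constraints) and all(v is not Verdict.FAIL for v in rubric)


# The simple-booking example: two constraints, three rubric items, all passing.
print(scenario_passes([True, True], [Verdict.PASS] * 3))
```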

CI

# .github/workflows/eval.yml
- name: Run evals
  run: aevals run --json
  # Exit codes: 0 = all pass, 1 = any fail, 2 = no traces

Constraints need no API keys. Add judge keys as secrets for rubric evaluation; if omitted, rubrics stay pending and don't block the pipeline.
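
If a pipeline needs to react differently to each exit code, a small wrapper script is one option. The sketch below relies only on the exit codes listed above. The function names are hypothetical, and the structure of the `--json` output is not documented here, so the script passes it through without parsing it.

```python
import subprocess

# Exit-code meanings as documented in the CI section above.
EXIT_MEANINGS = {
    0: "all scenarios passed",
    1: "at least one scenario failed",
    2: "no traces were captured",
}


def run_evals(command: list[str]) -> tuple[int, str]:
    """Run the eval command and return (exit code, captured stdout)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.stdout


def describe(code: int) -> str:
    """Translate an exit code into a human-readable status line."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


# Typical use in CI (requires aevals on PATH):
#   code, output = run_evals(["aevals", "run", "--json"])
#   print(describe(code))
#   raise SystemExit(code)
```

Separating exit code 2 from exit code 1 lets a pipeline tell "the agent produced no traces" (likely an instrumentation or entrypoint problem) apart from a genuine eval failure.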

Claude Code

aevals ships as an MCP server. aevals init writes the config to .claude/mcp.json automatically.

aevals mcp-serve
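
For readers who want to register the server by hand, the snippet below prints a config entry in the common `mcpServers` layout used by MCP clients. This is an assumed shape, not a copy of what `aevals init` generates: the server key `"aevals"` is a hypothetical choice and the real file may contain different or additional fields.

```python
import json

# Assumed layout; the file written by `aevals init` may differ.
mcp_config = {
    "mcpServers": {
        "aevals": {  # hypothetical server name
            "command": "aevals",
            "args": ["mcp-serve"],
        }
    }
}

print(json.dumps(mcp_config, indent=2))
```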

OTel compatibility

Traces are standard OpenTelemetry. Pipe them to Langfuse, Phoenix, Jaeger, or any OTel backend.

Development

pip install -e ".[dev]"
pytest
ruff check src/ tests/

License

MIT

Release history

0.1.1

Mar 15, 2026

0.1.0

Mar 15, 2026

Download files

Source Distribution

aevals-0.1.1.tar.gz (75.9 kB)

Uploaded Mar 15, 2026 Source

Built Distribution

aevals-0.1.1-py3-none-any.whl (26.3 kB)

Uploaded Mar 15, 2026 Python 3

File details

Details for the file aevals-0.1.1.tar.gz.

File metadata

Download URL: aevals-0.1.1.tar.gz
Upload date: Mar 15, 2026
Size: 75.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for aevals-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`29c6843219300c03914ecd3664def3f6298b55fb010bdb3cac7db8cfd036e822`
MD5	`b8e1972764d840dfcc0ff2888c2cdf62`
BLAKE2b-256	`c7353b7aa0ae343e749a2afc54dcd4933f385db1d15c6797833a1bf64e0bbf84`
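
A downloaded file can be checked against the SHA256 digest above with the standard library. This is a minimal sketch: the function names are hypothetical, and the path in the usage comment assumes the sdist was saved to the current directory.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest for aevals-0.1.1.tar.gz, copied from the table above.
EXPECTED_SHA256 = "29c6843219300c03914ecd3664def3f6298b55fb010bdb3cac7db8cfd036e822"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with the expected value."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


# Usage, assuming the file is in the current directory:
#   matches(Path("aevals-0.1.1.tar.gz"), EXPECTED_SHA256)
```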

Provenance

The following attestation bundles were made for aevals-0.1.1.tar.gz:

Publisher: release.yml on satyaborg/aevals

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: aevals-0.1.1.tar.gz
- Subject digest: 29c6843219300c03914ecd3664def3f6298b55fb010bdb3cac7db8cfd036e822
- Sigstore transparency entry: 1107645999
- Sigstore integration time: Mar 15, 2026
Source repository:
- Permalink: satyaborg/aevals@f1f099c5b445614a17870c498d21fa305186102c
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/satyaborg
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f1f099c5b445614a17870c498d21fa305186102c
- Trigger Event: push

File details

Details for the file aevals-0.1.1-py3-none-any.whl.

File metadata

Download URL: aevals-0.1.1-py3-none-any.whl
Upload date: Mar 15, 2026
Size: 26.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for aevals-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`990c49fde272856ebbac10e8324788cc43b583d9e04c9b4f514b7908dc025c71`
MD5	`c1fa5387fea688801007d6db21f4ef67`
BLAKE2b-256	`08de628779eabd21bb9e13860063b7c0114200e89c5430e0e35602b8365236de`

Provenance

The following attestation bundles were made for aevals-0.1.1-py3-none-any.whl:

Publisher: release.yml on satyaborg/aevals

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: aevals-0.1.1-py3-none-any.whl
- Subject digest: 990c49fde272856ebbac10e8324788cc43b583d9e04c9b4f514b7908dc025c71
- Sigstore transparency entry: 1107646004
- Sigstore integration time: Mar 15, 2026
Source repository:
- Permalink: satyaborg/aevals@f1f099c5b445614a17870c498d21fa305186102c
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/satyaborg
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f1f099c5b445614a17870c498d21fa305186102c
- Trigger Event: push
