Agent evals framework that gives Claude Code the tools to instrument your codebase, capture traces, write evals, and catch LLM regressions.
aevals
pip install aevals
aevals init # detects your agent, generates config
aevals run # runs scenarios, reports pass/fail
Why
Most teams building agents know they should eval. They don't. The problem isn't motivation — it's that nobody knows where to start.
aevals closes that gap. Point it at your codebase and it figures out the rest — which SDKs you use, where your entrypoint is, what tools your agent has. You go from nothing to a working eval suite without writing boilerplate.
Install
pip install aevals
# Add instrumentation for your provider
pip install "aevals[openai]" # OpenAI
pip install "aevals[anthropic]" # Anthropic
pip install "aevals[google]" # Google GenAI
pip install "aevals[bedrock]" # AWS Bedrock
pip install "aevals[mistral]" # Mistral
pip install "aevals[cohere]" # Cohere
Quick start
1. Initialize — scans your project, detects SDKs and entrypoints, generates aevals.yaml:
aevals init
2. Define scenarios:
# aevals.yaml
config_version: 1
entry: src.agent:main # module:callable
judge:
  model: openai/gpt-5.4 # any litellm model
scenarios:
  - name: simple-booking
    input: "Book a flight from SFO to JFK for next Tuesday"
    rubric:
      - "Agent calls search_flights before book_flight"
      - "Agent confirms with user before booking"
      - "Final output includes a confirmation number"
    constraints:
      max_steps: 5
      max_duration_ms: 10000
3. Run:
aevals run
── simple-booking ──────────────────────────────────────
3 spans | 4.2s | 1,840 tokens
Constraints:
✓ steps: 3 <= 5
✓ duration: 4200ms <= 10000ms
Rubric: (judge: openai/gpt-5.4)
✓ Agent calls search_flights before book_flight
✓ Agent confirms with user before booking
✓ Final output includes a confirmation number
── Summary ─────────────────────────────────────────────
1 scenario, 1 passed, 0 failed
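The entry field in aevals.yaml uses the common module:callable notation. As a rough illustration of how such a string can be resolved to a Python callable, here is a minimal sketch using only the standard library. The helper name resolve_entry is hypothetical and this is not necessarily how aevals loads your agent; it only shows what the notation means.

```python
import importlib
from typing import Callable


def resolve_entry(entry: str) -> Callable:
    """Resolve a "module:callable" string to the callable it names.

    Hypothetical helper for illustration; not part of the aevals API.
    """
    module_path, sep, attr = entry.partition(":")
    if not sep or not module_path or not attr:
        raise ValueError(f"expected 'module:callable', got {entry!r}")
    module = importlib.import_module(module_path)
    target = getattr(module, attr)
    if not callable(target):
        raise TypeError(f"{entry!r} does not name a callable")
    return target


# Demonstrated on a standard-library function so the snippet runs anywhere.
dumps = resolve_entry("json:dumps")
print(dumps({"ok": True}))  # prints {"ok": true}
```

With entry: src.agent:main, the same idea would import the module src.agent and look up main on it, so the module has to be importable from the directory where you run aevals.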
How it works
Each scenario spawns your agent in an isolated subprocess. OpenLLMetry auto-instruments your SDK and captures every LLM call as OpenTelemetry spans. The spans are parsed into a trajectory, then scored on two tracks:
Constraints — deterministic, zero LLM cost:
| Constraint | Checks |
|---|---|
| max_duration_ms | Wall-clock time under limit |
| max_steps | Number of LLM calls under limit |
| tool_sequence | Required tools called in order (subsequence match) |
| no_repeat_calls | No tool called N+ times with identical arguments |
| output_contains | Final output includes a substring |
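To make the table concrete, here is a minimal sketch of how three of these checks could be implemented over a list of tool calls. The ToolCall type, the function names, and the check_seats tool are assumptions made for illustration; they are not the aevals internals.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation in a trajectory."""
    name: str
    arguments: dict = field(default_factory=dict)


def check_max_steps(llm_calls: int, limit: int) -> bool:
    """Pass when the number of LLM calls does not exceed the limit."""
    return llm_calls <= limit


def check_tool_sequence(calls: list, required: list) -> bool:
    """Subsequence match: required tools appear in order; other calls may interleave."""
    names = iter(call.name for call in calls)
    # `tool in names` advances the iterator, so each match must come after the last one.
    return all(tool in names for tool in required)


def check_no_repeat_calls(calls: list, n: int) -> bool:
    """Fail when any (tool, arguments) pair occurs n or more times."""
    counts = Counter(
        (call.name, json.dumps(call.arguments, sort_keys=True)) for call in calls
    )
    return all(count < n for count in counts.values())


calls = [
    ToolCall("search_flights", {"origin": "SFO", "dest": "JFK"}),
    ToolCall("check_seats", {"flight_id": "F1"}),  # hypothetical extra tool
    ToolCall("book_flight", {"flight_id": "F1"}),
]
print(check_tool_sequence(calls, ["search_flights", "book_flight"]))  # True
print(check_tool_sequence(calls, ["book_flight", "search_flights"]))  # False
print(check_no_repeat_calls(calls, 2))  # True
print(check_max_steps(3, 5))  # True
```

None of these checks calls a model, which is why constraints run without API keys and add no LLM cost.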
Rubric — natural-language assertions scored pass/fail by a judge model against the full trajectory (every LLM call, tool invocation, intermediate step). Uses litellm, so any model it supports works as a judge. No judge configured? Rubrics stay pending and don't fail the run.
A scenario passes when all constraints pass AND all rubric items pass.
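The pass rule can be sketched as a small function. This version assumes that a pending rubric item (no judge configured) is treated as not failing, which is one reading of the behavior described above; the Status enum and scenario_passes are hypothetical names, not the aevals API.

```python
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    PENDING = "pending"  # rubric item not scored because no judge is configured


def scenario_passes(constraints: dict, rubric: dict) -> bool:
    """Combine both tracks into one verdict.

    constraints maps a constraint name to True/False.
    rubric maps a rubric item to a Status.
    Assumption: pending items do not fail the scenario.
    """
    constraints_ok = all(constraints.values())
    rubric_ok = all(status is not Status.FAIL for status in rubric.values())
    return constraints_ok and rubric_ok


print(scenario_passes({"max_steps": True}, {"confirms with user": Status.PASS}))     # True
print(scenario_passes({"max_steps": True}, {"confirms with user": Status.PENDING}))  # True
print(scenario_passes({"max_steps": True}, {"confirms with user": Status.FAIL}))     # False
print(scenario_passes({"max_steps": False}, {"confirms with user": Status.PASS}))    # False
```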
CI
# .github/workflows/eval.yml
- name: Run evals
  run: aevals run --json
# Exit codes: 0 = all pass, 1 = any fail, 2 = no traces
Constraints need no API keys. Add judge keys as secrets for rubric evaluation; if omitted, rubrics stay pending and don't block the pipeline.
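If you wrap the CLI in your own script, for example to post a status message, the exit codes listed above are enough to branch on. The following is a minimal sketch; describe_exit and run_evals are hypothetical helper names, and the only CLI invocation used is the aevals run --json command shown above.

```python
import subprocess
import sys

# Meanings taken from the exit codes documented for `aevals run`.
EXIT_MEANINGS = {
    0: "all scenarios passed",
    1: "at least one scenario failed",
    2: "no traces were captured",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_evals() -> int:
    """Run `aevals run --json`, report the outcome on stderr, return the exit code."""
    result = subprocess.run(
        ["aevals", "run", "--json"], capture_output=True, text=True
    )
    print(describe_exit(result.returncode), file=sys.stderr)
    return result.returncode


print(describe_exit(2))  # prints: no traces were captured
```

A CI step could then call sys.exit(run_evals()) so the job status mirrors the eval result.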
Claude Code
aevals ships as an MCP server. aevals init writes the config to .claude/mcp.json automatically.
aevals mcp-serve
OTel compatibility
Traces are standard OpenTelemetry. Pipe them to Langfuse, Phoenix, Jaeger, or any OTel backend.
Development
pip install -e ".[dev]"
pytest
ruff check src/ tests/
License
MIT