AI-operated agent eval. LLM-as-judge with BYO keys. OpenTelemetry traces. Works in CI.

aevals

Requires Python 3.12+. License: MIT.

Agent eval framework. It scans your code, captures OpenTelemetry traces, and scores each run against deterministic constraints and custom LLM-judged rubrics.

pip install aevals
aevals init        # detects your agent, generates config
aevals run         # runs scenarios, reports pass/fail

Why

Most teams building agents know they should run evals, but few do. The problem isn't motivation; it's that nobody knows where to start.

aevals closes that gap. Point it at your codebase and it figures out the rest: which SDKs you use, where your entrypoint is, and what tools your agent has. You go from nothing to a working eval suite without writing boilerplate.

Install

pip install aevals

# Add instrumentation for your provider
# Quote the extras so shells such as zsh don't treat the brackets as a glob
pip install "aevals[openai]"       # OpenAI
pip install "aevals[anthropic]"    # Anthropic
pip install "aevals[google]"       # Google GenAI
pip install "aevals[bedrock]"      # AWS Bedrock
pip install "aevals[mistral]"      # Mistral
pip install "aevals[cohere]"       # Cohere

Quick start

1. Initialize — scans your project, detects SDKs and entrypoints, generates aevals.yaml:

aevals init

2. Define scenarios:

# aevals.yaml
config_version: 1
entry: src.agent:main            # module:callable

judge:
  model: openai/gpt-5.4           # any litellm model

scenarios:
  - name: simple-booking
    input: "Book a flight from SFO to JFK for next Tuesday"
    rubric:
      - "Agent calls search_flights before book_flight"
      - "Agent confirms with user before booking"
      - "Final output includes a confirmation number"
    constraints:
      max_steps: 5
      max_duration_ms: 10000

3. Run:

aevals run
── simple-booking ──────────────────────────────────────
  3 spans | 4.2s | 1,840 tokens

  Constraints:
    ✓ steps: 3 <= 5
    ✓ duration: 4200ms <= 10000ms

  Rubric: (judge: openai/gpt-5.4)
    ✓ Agent calls search_flights before book_flight
    ✓ Agent confirms with user before booking
    ✓ Final output includes a confirmation number

── Summary ─────────────────────────────────────────────
  1 scenario, 1 passed, 0 failed
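
The entry field in aevals.yaml uses the common module:callable notation. As a rough illustration of what that notation means, the sketch below resolves such a string with importlib. The helper name resolve_entry is hypothetical, and this is not necessarily how aevals itself loads your agent.

```python
import importlib
from typing import Callable


def resolve_entry(entry: str) -> Callable:
    """Resolve a 'module:callable' string to the callable it names.

    Hypothetical helper for illustration; not part of the aevals API.
    """
    module_path, sep, attr = entry.partition(":")
    if not sep or not module_path or not attr:
        raise ValueError(f"expected 'module:callable', got {entry!r}")
    module = importlib.import_module(module_path)
    target = getattr(module, attr)
    if not callable(target):
        raise TypeError(f"{entry!r} does not name a callable")
    return target


# Demonstrated with a standard-library target so the snippet runs anywhere.
dumps = resolve_entry("json:dumps")
print(dumps({"ok": True}))
```

With the config above, the equivalent call would be resolve_entry("src.agent:main"), which requires src/agent.py to be importable from the directory where the command runs.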

How it works

Each scenario spawns your agent in an isolated subprocess. OpenLLMetry auto-instruments your SDK and captures every LLM call as OpenTelemetry spans. The spans are parsed into a trajectory, then scored on two tracks:

Constraints — deterministic, zero LLM cost:

Constraint        Checks
max_duration_ms   Wall-clock time under limit
max_steps         Number of LLM calls under limit
tool_sequence     Required tools called in order (subsequence match)
no_repeat_calls   No tool called N+ times with identical arguments
output_contains   Final output includes a substring
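
To make two of these checks concrete, here is a minimal sketch of a subsequence match and a repeated-call check in plain Python. The function names, the (tool, arguments) tuple shape, and the sample calls are assumptions made for illustration; the actual aevals implementation may differ.

```python
import json
from collections import Counter


def tool_sequence_ok(called: list[str], required: list[str]) -> bool:
    """True if `required` appears in `called` in order (gaps allowed)."""
    remaining = iter(called)
    # `name in remaining` advances the iterator, so order is enforced.
    return all(name in remaining for name in required)


def no_repeat_calls_ok(calls: list[tuple[str, dict]], limit: int) -> bool:
    """True if no (tool, arguments) pair occurs `limit` or more times."""
    counts = Counter(
        # Serialize with sorted keys so argument order does not matter.
        (name, json.dumps(args, sort_keys=True))
        for name, args in calls
    )
    return all(count < limit for count in counts.values())


# Made-up tool calls, loosely following the booking scenario.
calls = [
    ("search_flights", {"from": "SFO", "to": "JFK"}),
    ("get_seat_map", {"flight": "UA100"}),
    ("book_flight", {"flight": "UA100"}),
]
names = [name for name, _ in calls]

print(tool_sequence_ok(names, ["search_flights", "book_flight"]))  # True
print(tool_sequence_ok(names, ["book_flight", "search_flights"]))  # False
print(no_repeat_calls_ok(calls, limit=2))                          # True
```

A subsequence match tolerates unrelated calls between the required ones, which is why get_seat_map does not break the first check.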

Rubric — natural-language assertions scored pass/fail by a judge model against the full trajectory (every LLM call, tool invocation, intermediate step). Uses litellm, so any model it supports works as a judge. No judge configured? Rubrics stay pending and don't fail the run.

A scenario passes when all constraints pass AND all rubric items pass.
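
The pass rule can be sketched as a small function. The Status enum and scenario_passes are hypothetical names, and treating a pending rubric item as non-failing is an assumption drawn from the note that unjudged rubrics don't fail the run.

```python
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    PENDING = "pending"  # rubric item not scored because no judge is configured


def scenario_passes(constraints: list[bool], rubric: list[Status]) -> bool:
    """Pass when every constraint holds and no rubric item failed.

    Assumption: PENDING items are skipped rather than counted as failures.
    """
    return all(constraints) and all(item is not Status.FAIL for item in rubric)


print(scenario_passes([True, True], [Status.PASS, Status.PASS]))  # True
print(scenario_passes([True, False], [Status.PASS]))              # False
print(scenario_passes([True], [Status.PENDING]))                  # True
```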

CI

# .github/workflows/eval.yml
- name: Run evals
  run: aevals run --json
  # Exit codes: 0 = all pass, 1 = any fail, 2 = no traces

Constraints need no API keys. Add judge keys as secrets for rubric evaluation; if omitted, rubrics stay pending and don't block the pipeline.
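
If a pipeline needs a readable message alongside the exit code, a thin Python wrapper is one option. The sketch below relies only on the command and the exit codes listed in the workflow comment above; the names EXIT_MEANINGS, describe_exit_code, and main are hypothetical, and the JSON output is passed through unparsed because its schema is not described here.

```python
import subprocess
import sys

# Exit codes as listed in the workflow comment.
EXIT_MEANINGS = {
    0: "all scenarios passed",
    1: "at least one scenario failed",
    2: "no traces were captured",
}


def describe_exit_code(code: int) -> str:
    """Map an aevals exit code to a short human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def main() -> int:
    """Run the eval suite, echo its output, and return its exit code."""
    proc = subprocess.run(
        ["aevals", "run", "--json"],
        capture_output=True,
        text=True,
    )
    # Forward the report as-is so later steps can consume it.
    sys.stdout.write(proc.stdout)
    print(f"aevals: {describe_exit_code(proc.returncode)}", file=sys.stderr)
    return proc.returncode
```

Saved as a script that ends with raise SystemExit(main()), this keeps the original exit code, so the CI step still fails when a scenario fails.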

Claude Code

aevals ships as an MCP server. aevals init writes the config to .claude/mcp.json automatically.

aevals mcp-serve

OTel compatibility

Traces are standard OpenTelemetry. Pipe them to Langfuse, Phoenix, Jaeger, or any OTel backend.

Development

pip install -e ".[dev]"
pytest
ruff check src/ tests/

License

MIT
