CLI benchmark suite for evaluating AI agents across providers.

Project description

Agent Builder Evals

Agent Builder Evals is a CLI benchmark suite for evaluating AI agents, not just base models. It compares provider/model/architecture runs across task completion, tool-call accuracy, citation quality, latency, cost, and failure rate.

The MVP is CLI-only. Each run is saved as a timestamped JSON file in results/, and the agent-evals show and agent-evals compare commands serve as the presentation layer. A future dashboard can read the same JSON contract directly.
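As a rough illustration of how a script or dashboard might consume those files, the sketch below loads the metrics from two result files and computes per-metric deltas. The JSON contract is not documented here, so the top-level "metrics" key and the metric names used in the usage note are assumptions made purely for illustration.

```python
import json
from pathlib import Path


def load_metrics(path):
    """Load the (hypothetical) top-level "metrics" object from a result file."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f).get("metrics", {})


def metric_deltas(a, b):
    """Return {metric: b - a} for numeric metrics present in both runs."""
    deltas = {}
    for name in sorted(a.keys() & b.keys()):
        if isinstance(a[name], (int, float)) and isinstance(b[name], (int, float)):
            deltas[name] = b[name] - a[name]
    return deltas
```

With two files such as results/run_a.json and results/run_b.json, calling metric_deltas(load_metrics(path_a), load_metrics(path_b)) would yield one signed difference per shared numeric metric; non-numeric fields and metrics missing from either run are skipped.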

Setup

uv sync --extra dev
cp .env.example .env

Set keys as needed:

OPENAI_API_KEY=
ANTHROPIC_API_KEY=
YOU_API_KEY=
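Before a live run it can help to check which keys are actually set. The following is a minimal sketch, assuming the values from .env have already been exported into the process environment; the helper name and the provider labels are hypothetical and not part of the package.

```python
import os

# Variable names come from the list above; the labels on the left are illustrative.
KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "you.com": "YOU_API_KEY",
}


def configured_providers(env=None):
    """Return the labels whose key is present and non-blank in the environment."""
    env = os.environ if env is None else env
    return [label for label, var in KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    print("Configured:", ", ".join(configured_providers()) or "none")
```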

Commands

uv run agent-evals list-tasks
uv run agent-evals run --provider anthropic --model claude-opus-4-8 --tasks company_research_001 --out results/
uv run agent-evals run --provider openai --model gpt-5.1 --tasks company_research_001 --out results/
uv run agent-evals show results/run_*.json
uv run agent-evals compare results/run_a.json results/run_b.json
uv run agent-evals replay results/run_a.json

Fairness Model

Both providers receive the same tool specs and execute the same local tool code:

web_search
fetch_url
code_exec
support_lookup

The provider adapters only translate the shared ToolSpec into each SDK's expected shape.
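To make the adapter idea concrete, here is a minimal sketch of translating one shared spec into two provider shapes. The ToolSpec fields shown are an assumption (the package's real class may differ); the output shapes follow the publicly documented tool formats for OpenAI Chat Completions function tools and the Anthropic Messages API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical provider-neutral tool description."""
    name: str
    description: str
    parameters: dict[str, Any] = field(default_factory=dict)  # JSON Schema object


def to_openai(spec: ToolSpec) -> dict[str, Any]:
    """Wrap the spec as an OpenAI Chat Completions function tool."""
    return {
        "type": "function",
        "function": {
            "name": spec.name,
            "description": spec.description,
            "parameters": spec.parameters,
        },
    }


def to_anthropic(spec: ToolSpec) -> dict[str, Any]:
    """Express the spec as an entry for the Anthropic Messages `tools` list."""
    return {
        "name": spec.name,
        "description": spec.description,
        "input_schema": spec.parameters,
    }


# Example spec; the description and schema are made up for illustration.
web_search = ToolSpec(
    name="web_search",
    description="Search the web and return result snippets.",
    parameters={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
)
```

Because both translations read from the same ToolSpec instance, the name, description, and JSON Schema each model sees are identical; only the envelope differs.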

Tests

make test

The default tests are mocked and do not require API keys. Live smoke runs require provider API keys and a You.com API key (YOU_API_KEY).
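For readers unfamiliar with the pattern, a mocked test might look roughly like the sketch below. The run_task function and the client's complete method are hypothetical stand-ins, not the package's actual API; only unittest.mock from the standard library is used.

```python
from unittest.mock import MagicMock


def run_task(client, prompt):
    """Hypothetical runner: asks the client for a completion and returns its text."""
    response = client.complete(prompt)
    return response["text"]


def test_run_task_uses_client_without_network():
    # MagicMock stands in for a real provider client, so no API key is needed.
    client = MagicMock()
    client.complete.return_value = {"text": "ok"}

    assert run_task(client, "hello") == "ok"
    client.complete.assert_called_once_with("hello")
```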

Project details

Release history

This version

0.1.0

May 29, 2026

Download files

Source Distribution

agent_builder_evals-0.1.0.tar.gz (90.3 kB)

Uploaded May 29, 2026 Source

Built Distribution

agent_builder_evals-0.1.0-py3-none-any.whl (43.6 kB)

Uploaded May 29, 2026 Python 3

File details

Details for the file agent_builder_evals-0.1.0.tar.gz.

File metadata

Download URL: agent_builder_evals-0.1.0.tar.gz
Upload date: May 29, 2026
Size: 90.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.15 {"installer":{"name":"uv","version":"0.11.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for agent_builder_evals-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9bdce75b82097a65c65ac45dfd11adffb172d4fcec30d56ae0d57972945864c2`
MD5	`8f5e3feb3c70bf296d5696714a37b4b7`
BLAKE2b-256	`835d50bd81c205302a5e00f15a417a311bcafd2c24448a423e752933f27db22e`

File details

Details for the file agent_builder_evals-0.1.0-py3-none-any.whl.

File metadata

Download URL: agent_builder_evals-0.1.0-py3-none-any.whl
Upload date: May 29, 2026
Size: 43.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.15 {"installer":{"name":"uv","version":"0.11.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for agent_builder_evals-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`111691d234ac77acc358aa68f4d196aee226d960bce9e8bf3ebe7e1dca7569f9`
MD5	`90811106fbae61cbe1a3146365138696`
BLAKE2b-256	`7eee14b8bf4a7c9d992da4b7596301ee23abaf55e1a62415d1896c8091424db2`

agent-builder-evals 0.1.0
