CLI benchmark suite for evaluating AI agents across providers.

Agent Builder Evals

Agent Builder Evals is a CLI benchmark suite for evaluating AI agents, not just base models. It compares provider/model/architecture runs across task completion, tool-call accuracy, citation quality, latency, cost, and failure rate.

The MVP is CLI-only. Each run writes a timestamped JSON file to results/; the agent-evals show and agent-evals compare commands are the presentation layer. A future dashboard can read the same JSON contract directly.
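
As a rough illustration of how a downstream tool could consume that JSON contract, the sketch below loads result files and computes per-metric deltas between two runs. The field names used here (`metrics`, `task_completion`, `latency_s`) are assumptions for illustration only; the actual schema is defined by the package and is not documented in this README.

```python
import json
from pathlib import Path


def load_run(path):
    """Read one result file; assumes the file holds a single JSON object."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def metric_deltas(run_a, run_b, key="metrics"):
    """Return {metric: b - a} for numeric metrics present in both runs.

    The "metrics" key and its flat name -> number layout are hypothetical.
    """
    a, b = run_a.get(key, {}), run_b.get(key, {})
    shared = sorted(set(a) & set(b))
    return {
        name: b[name] - a[name]
        for name in shared
        if isinstance(a[name], (int, float)) and isinstance(b[name], (int, float))
    }


if __name__ == "__main__":
    # Hypothetical payloads standing in for two files under results/.
    run_a = {"metrics": {"task_completion": 0.5, "latency_s": 12.0}}
    run_b = {"metrics": {"task_completion": 0.75, "latency_s": 9.0}}
    print(metric_deltas(run_a, run_b))
```

For real comparisons, agent-evals compare remains the supported path; a helper like this would only make sense for a custom report or dashboard.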

Setup

uv sync --extra dev
cp .env.example .env

Set keys as needed:

OPENAI_API_KEY=
ANTHROPIC_API_KEY=
YOU_API_KEY=
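
Before a live run it can help to check which of these variables are actually set. The following is a minimal sketch, not part of the package: the `PROVIDER_KEYS` mapping and the `missing_keys` helper are hypothetical, and the assumption that each provider needs only its own key plus `YOU_API_KEY` is inferred from this README.

```python
import os

# Assumed mapping from the --provider value to the key it needs;
# not confirmed by the package itself.
PROVIDER_KEYS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def missing_keys(provider, env=None):
    """Return the names of variables that are unset or empty for a live run."""
    env = os.environ if env is None else env
    needed = [PROVIDER_KEYS[provider], "YOU_API_KEY"]
    return [name for name in needed if not env.get(name)]


if __name__ == "__main__":
    for provider in PROVIDER_KEYS:
        print(provider, "missing:", missing_keys(provider))
```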

Commands

uv run agent-evals list-tasks
uv run agent-evals run --provider anthropic --model claude-opus-4-8 --tasks company_research_001 --out results/
uv run agent-evals run --provider openai --model gpt-5.1 --tasks company_research_001 --out results/
uv run agent-evals show results/run_*.json
uv run agent-evals compare results/run_a.json results/run_b.json
uv run agent-evals replay results/run_a.json

Fairness Model

Both providers receive the same tool specs and execute the same local tool code:

  • web_search
  • fetch_url
  • code_exec
  • support_lookup

The provider adapters only translate the shared ToolSpec into each SDK's expected shape.
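
To make the adapter idea concrete, here is a minimal sketch of translating one shared spec into two provider shapes. The `ToolSpec` dataclass below is a hypothetical stand-in, not the package's own class. The output dictionaries follow the function-tool format of the OpenAI Chat Completions API and the tool format of the Anthropic Messages API; the real adapters may target different endpoints or carry extra fields.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical provider-neutral tool description."""
    name: str
    description: str
    parameters: dict[str, Any] = field(default_factory=dict)  # JSON Schema


def to_openai(spec):
    """Shape used for function tools in OpenAI Chat Completions."""
    return {
        "type": "function",
        "function": {
            "name": spec.name,
            "description": spec.description,
            "parameters": spec.parameters,
        },
    }


def to_anthropic(spec):
    """Shape used in the tools list of the Anthropic Messages API."""
    return {
        "name": spec.name,
        "description": spec.description,
        "input_schema": spec.parameters,
    }


# Illustrative spec; the description and schema are assumptions.
web_search = ToolSpec(
    name="web_search",
    description="Search the web and return result snippets.",
    parameters={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
)
```

Because both functions read from the same `ToolSpec`, the name, description, and JSON Schema seen by each model stay identical, and only the envelope differs.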

Tests

make test

The default tests are mocked and do not require API keys. Live smoke runs require the relevant provider key and a You.com API key (YOU_API_KEY).
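
A mocked test in this style could look like the sketch below. It uses only the standard library's `unittest.mock`; `run_task` and the client's `complete` method are hypothetical names for illustration and do not reflect the package's internal API.

```python
from unittest.mock import MagicMock


def run_task(client, prompt):
    """Hypothetical runner: call a provider client and record the outcome."""
    try:
        reply = client.complete(prompt)
        return {"ok": True, "output": reply}
    except Exception as exc:  # any provider error counts as a failed run
        return {"ok": False, "error": str(exc)}


def test_run_task_success():
    client = MagicMock()
    client.complete.return_value = "stub reply"
    result = run_task(client, "research prompt")
    assert result == {"ok": True, "output": "stub reply"}
    client.complete.assert_called_once_with("research prompt")


def test_run_task_failure():
    client = MagicMock()
    client.complete.side_effect = TimeoutError("provider timed out")
    result = run_task(client, "research prompt")
    assert result == {"ok": False, "error": "provider timed out"}
```

Since the client is a `MagicMock`, no network call is made and no key is read, which is what lets tests of this kind run in CI without secrets.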
