benchrail

CLI for benchmarking agent setups.

benchrail is a simple CLI for running the same tasks across different agent setups and comparing the results.

An agent setup can include:

  • a different agent (Codex, Claude Code)
  • a different model
  • a different skill
  • a different tool
  • a different AGENTS.md
  • a different prompt or context-engineering strategy
  • a different execution environment

Use benchrail to measure whether a change actually makes the agent better on the tasks you care about.

Quick Start

Install

Install from PyPI with uv:

uv tool install benchrail

Install from PyPI with pip:

pip install benchrail

Run

Run in Docker mode:

benchrail run \
  --dataset examples/multi-swe-bench-universal-smoke \
  --mode docker

Run the same dataset locally:

benchrail run \
  --dataset examples/multi-swe-bench-universal-smoke \
  --mode local

If you use Docker mode and want to reuse your local AI agent login instead of passing an API key into the container, add:

--auth-session

First-Look Mental Model

The workflow is intentionally simple:

  1. Create or choose a dataset directory
  2. Validate it with benchrail validate
  3. Run it against one or more agents with benchrail run
  4. Inspect per-task JSON results, logs, and the aggregated run summary

At runtime, the tool builds the cartesian product of:

  • dataset instances
  • manifest agents

That becomes the task queue for the run.
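
The queue construction can be sketched with Python's itertools.product. This is a minimal illustration, not benchrail's internal code: the build_task_queue helper and the instance and agent ids are hypothetical.

```python
from itertools import product


def build_task_queue(instance_ids, agent_ids):
    """Pair every agent with every instance (hypothetical helper, not benchrail's API)."""
    return [
        {"agent_id": agent_id, "instance_id": instance_id}
        for agent_id, instance_id in product(agent_ids, instance_ids)
    ]


# Hypothetical ids, for illustration only.
queue = build_task_queue(
    instance_ids=["instance-a", "instance-b", "instance-c"],
    agent_ids=["codex-default", "claude-code-default"],
)
print(len(queue))  # 3 instances x 2 agents = 6 tasks
```

Adding one agent to the manifest therefore adds one task per instance, so run time grows with both dimensions.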

Core Concepts

Dataset

A dataset is a directory containing:

  • a manifest.json file describing which agents to run
  • an optional config.json and environment/ directory containing instance defaults
  • one subdirectory per benchmark instance

Instance

Each instance contains a config.json plus optional environment scripts and patches.

Agent

An agent entry in manifest.json maps an agent id to an adapter and optional CLI arguments such as model selection.

Dataset Layout

Expected dataset shape:

<dataset>/
  manifest.json
  config.json
  environment/
    Dockerfile
    setup.sh
  <instance_id>/
    config.json
    environment/
      Dockerfile
      setup.sh
      run-gold-tests.sh
      any_check.sh
    patches/
      test.patch

Dataset config.json fields are inherited by each instance. Nested Docker environment values, hooks, and named check commands are merged, while explicit instance values override dataset defaults.
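
One way to picture this inheritance is a recursive dictionary merge. The sketch below is an assumption about the general behavior, not benchrail's actual implementation: merge_config is a hypothetical helper, the "env" and "checks" keys are made up for illustration, and any special-case precedence rules are not modeled.

```python
def merge_config(dataset_cfg, instance_cfg):
    """Merge nested dicts; instance values win on conflict (hypothetical helper)."""
    merged = dict(dataset_cfg)
    for key, value in instance_cfg.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested objects: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists, and mismatched types: the instance value replaces the default.
            merged[key] = value
    return merged


# Hypothetical field names, for illustration only.
dataset_cfg = {
    "docker": {"dockerfile": "environment/Dockerfile", "env": {"LANG": "C.UTF-8"}},
    "checks": {"gold": "bash environment/run-gold-tests.sh"},
}
instance_cfg = {
    "docker": {"env": {"TZ": "UTC"}},
    "checks": {"gold": "bash environment/any_check.sh"},
}

merged = merge_config(dataset_cfg, instance_cfg)
print(merged["docker"]["env"])   # {'LANG': 'C.UTF-8', 'TZ': 'UTC'}
print(merged["checks"]["gold"])  # bash environment/any_check.sh
```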

Dataset environment/ files are copied first, then instance environment/ files are copied on top. For an inherited Dockerfile such as "dockerfile": "environment/Dockerfile", the instance path is used when it exists; otherwise benchrail falls back to the dataset path. An explicit instance docker.image overrides an inherited Dockerfile, and an explicit instance docker.dockerfile overrides an inherited image.
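
The Dockerfile fallback can be sketched as a two-step path lookup. This is a minimal illustration under the layout described above, not benchrail's code: resolve_dockerfile and the inst-1 / inst-2 directory names are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_dockerfile(dataset_dir, instance_id, relative_path):
    """Prefer the instance's copy of the file, else fall back to the dataset's (hypothetical helper)."""
    instance_candidate = Path(dataset_dir) / instance_id / relative_path
    if instance_candidate.is_file():
        return instance_candidate
    return Path(dataset_dir) / relative_path


# Build a throwaway dataset: inst-1 ships its own Dockerfile, inst-2 does not.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp)
    (dataset / "environment").mkdir()
    (dataset / "environment" / "Dockerfile").write_text("FROM scratch\n")
    (dataset / "inst-1" / "environment").mkdir(parents=True)
    (dataset / "inst-1" / "environment" / "Dockerfile").write_text("FROM scratch\n")
    (dataset / "inst-2").mkdir()

    overridden = resolve_dockerfile(dataset, "inst-1", "environment/Dockerfile").relative_to(dataset)
    inherited = resolve_dockerfile(dataset, "inst-2", "environment/Dockerfile").relative_to(dataset)

print(overridden.as_posix())  # inst-1/environment/Dockerfile
print(inherited.as_posix())   # environment/Dockerfile
```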

Example included in this repository:

  • examples/multi-swe-bench-universal-smoke

Validate it before your first run:

benchrail validate \
  --dataset examples/multi-swe-bench-universal-smoke

Example manifest.json

{
  "agents": [
    {
      "id": "codex-gpt-5.4-mini-medium",
      "agent": "codex",
      "version": "latest",
      "command": "--model gpt-5.4-mini --config model_reasoning_effort=\"medium\" --disable fast_mode"
    }
  ]
}

Current built-in agent types:

  • codex
  • claude-code
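
Before running benchrail validate, a manifest can be sanity-checked against the two built-in agent types with a few lines of Python. This is a hypothetical pre-check and does not reproduce what benchrail validate does; find_unknown_agents and the typo-example entry are invented for illustration.

```python
# Agent types listed as built in above.
BUILT_IN_AGENT_TYPES = {"codex", "claude-code"}


def find_unknown_agents(manifest):
    """Return ids of entries whose "agent" is not a built-in type (hypothetical helper)."""
    return [
        entry.get("id", "<missing id>")
        for entry in manifest.get("agents", [])
        if entry.get("agent") not in BUILT_IN_AGENT_TYPES
    ]


# The second entry is a deliberate typo (underscore instead of hyphen).
manifest = {
    "agents": [
        {"id": "codex-gpt-5.4-mini-medium", "agent": "codex", "version": "latest"},
        {"id": "typo-example", "agent": "claude_code", "version": "latest"},
    ]
}

print(find_unknown_agents(manifest))  # ['typo-example']
```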

Execution Modes

local

Use local mode when the host machine already has the right toolchains and agent CLI access.

Pros:

  • Faster iteration
  • No container setup
  • Easier local debugging

Tradeoffs:

  • Depends on host environment consistency
  • Harder to make fully reproducible across machines

docker

Use Docker mode when you want a more reproducible execution environment or need the provided universal image flow.

Pros:

  • Better environment isolation
  • Better fit for multi-language benchmark runs
  • Easier to standardize across machines and CI

Tradeoffs:

  • Requires Docker
  • Adds image and container overhead

Output Artifacts

By default, result artifacts are written under the run workspace. If --output is provided, result JSON and CSV summaries are written there instead.

Aggregated run artifacts:

<output-or-workspace>/<run_id>/
  result.json
  result.csv

Per-task artifacts:

<output-or-workspace>/<run_id>/<agent_id>/<instance_id>/
  result.json
  agent.patch
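
Given that layout, per-task results can be gathered with a glob over the run directory. The sketch below relies only on the directory structure shown above; collect_task_results is a hypothetical helper, and the file contents written in the demo are placeholders because the result.json schema is not described here.

```python
import json
import tempfile
from pathlib import Path


def collect_task_results(run_dir):
    """Map (agent_id, instance_id) to the parsed per-task result.json (hypothetical helper)."""
    results = {}
    # Two directory levels below the run directory: <agent_id>/<instance_id>/result.json.
    # The aggregated result.json at the top level does not match this pattern.
    for path in sorted(Path(run_dir).glob("*/*/result.json")):
        agent_id, instance_id = path.parts[-3], path.parts[-2]
        results[(agent_id, instance_id)] = json.loads(path.read_text())
    return results


# Demo against a throwaway directory with placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run-1"
    task_dir = run_dir / "agent-a" / "instance-1"
    task_dir.mkdir(parents=True)
    (run_dir / "result.json").write_text("{}")
    (task_dir / "result.json").write_text('{"placeholder": true}')
    found = collect_task_results(run_dir)

print(list(found))  # [('agent-a', 'instance-1')]
```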

Per-task logs:

<logs-root>/<run_id>/<agent_id>/<instance_id>/
  runner.log
  logs/
    agent.stdout
    agent.stderr
    check_<name>.stdout
    check_<name>.stderr
    ...

The aggregated run summary includes:

  • passed / failed / total tasks
  • total duration
  • token counts, when available
  • cost in USD and credits, when available
  • per-check pass/fail counts

Development

Run unit tests:

make unit

Manual release prep:

make bump BUMP=patch
git commit -am "Release $(make print-release-tag)"
git push origin main
make tag-release
git push origin "$(make print-release-tag)"

After the tag is pushed, run the GitHub release workflow with that tag.

Run lint and type checks:

make lint

Format the codebase:

make format

Equivalent direct commands:

uv run pytest tests/unit/ -v
uv run ruff check benchrail/ tests/
uv run mypy benchrail tests
uv run ruff format benchrail/ tests/
uv run ruff check --fix benchrail/ tests/

License

The source code in this repository is licensed under the MIT License. See LICENSES/LICENSE.

This repository also contains third-party derived materials:

  • docker/universal/ was adapted in part from https://github.com/openai/codex-universal (MIT)
  • examples/multi-swe-bench-universal-smoke/ is derived from SWE-bench_Lite and SWE-bench_Multilingual

See LICENSES/THIRD_PARTY.md for attribution and redistribution caveats for dataset-derived content.
