benchrail
CLI for benchmarking agent setups.
benchrail is a simple CLI for running the same tasks across different agent
setups and comparing the results.
An agent setup can include:
- a different agent (Codex, Claude Code)
- a different model
- a different skill
- a different tool
- a different AGENTS.md
- a different prompt or context-engineering strategy
- a different execution environment
Use benchrail to measure whether a change actually makes the agent better on
the tasks you care about.
Quick Start
Install
Install from PyPI with uv:
uv tool install benchrail
Install from PyPI with pip:
pip install benchrail
Run
Run in Docker mode:
benchrail run \
  --dataset examples/multi-swe-bench-universal-smoke \
  --mode docker
Run the same dataset locally:
benchrail run \
  --dataset examples/multi-swe-bench-universal-smoke \
  --mode local
If you use Docker mode and want to reuse your local AI agent login instead of passing an API key into the container, add:
--auth-session
First-Look Mental Model
The workflow is intentionally simple:
- Create or choose a dataset directory
- Validate it with benchrail validate
- Run it against one or more agents with benchrail run
- Inspect per-task JSON results, logs, and the aggregated run summary
At runtime, the tool builds the cartesian product of:
- dataset instances
- manifest agents
That becomes the task queue for the run.
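The pairing described above can be sketched in a few lines of Python. This is an illustration of the idea, not benchrail's internal code: the function name build_task_queue, the dictionary keys, and the sample ids are all assumptions.

```python
from itertools import product

# Hypothetical ids; in a real run they would come from the dataset
# subdirectories and from the "agents" list in manifest.json.
instance_ids = ["instance-a", "instance-b", "instance-c"]
agent_ids = ["codex-gpt-5.4-mini-medium", "claude-code-default"]


def build_task_queue(instances, agents):
    """Pair every agent with every instance (cartesian product)."""
    return [
        {"agent_id": agent, "instance_id": instance}
        for agent, instance in product(agents, instances)
    ]


task_queue = build_task_queue(instance_ids, agent_ids)
print(len(task_queue))  # 2 agents x 3 instances -> 6
```

Adding one more agent to the manifest therefore adds one task per instance, which is worth keeping in mind when estimating run time and cost.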
Core Concepts
Dataset
A dataset is a directory containing:
- a manifest.json file describing which agents to run
- an optional config.json and environment/ directory containing instance defaults
- one subdirectory per benchmark instance
Instance
Each instance contains a config.json plus optional environment scripts and patches.
Agent
An agent entry in manifest.json maps an agent id to an adapter and optional CLI
arguments such as model selection.
Dataset Layout
Expected dataset shape:
<dataset>/
  manifest.json
  config.json
  environment/
    Dockerfile
    setup.sh
  <instance_id>/
    config.json
    environment/
      Dockerfile
      setup.sh
      run-gold-tests.sh
      any_check.sh
    patches/
      test.patch
Dataset config.json fields are inherited by each instance. Nested Docker environment
values, hooks, and named check commands are merged, while explicit instance values override
dataset defaults.
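One way to picture this inheritance is a recursive dictionary merge in which instance values win. The sketch below is a guess at the behaviour, not benchrail's implementation: merge_config is a hypothetical helper, and the "env" and "checks" keys are illustrative names that may not match the real config schema.

```python
from copy import deepcopy


def merge_config(dataset_cfg, instance_cfg):
    """Recursively merge dicts; instance values override dataset defaults."""
    merged = deepcopy(dataset_cfg)
    for key, value in instance_cfg.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mappings are merged key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists from the instance replace the default.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical field names, for illustration only.
dataset_cfg = {
    "docker": {"dockerfile": "environment/Dockerfile", "env": {"LANG": "C.UTF-8"}},
    "checks": {"gold": "bash run-gold-tests.sh"},
}
instance_cfg = {
    "docker": {"env": {"NODE_ENV": "test"}},
    "checks": {"any": "bash any_check.sh"},
}

merged = merge_config(dataset_cfg, instance_cfg)
print(merged["docker"]["env"])
```

Here the instance keeps the inherited Dockerfile and LANG value, and adds its own environment variable and check.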
Dataset environment/ files are copied first, then instance environment/ files are copied
on top. For an inherited Dockerfile such as "dockerfile": "environment/Dockerfile", the
instance path is used when it exists; otherwise benchrail falls back to the dataset path.
An explicit instance docker.image overrides an inherited Dockerfile, and an explicit
instance docker.dockerfile overrides an inherited image.
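The Dockerfile and image rules above can be read as a small decision procedure. The following sketch encodes that reading under stated assumptions: resolve_docker_source is a hypothetical function, the config is modelled as plain dicts with "image" and "dockerfile" keys, and the order used when one level sets both keys is a choice made here, since the text does not specify it.

```python
import tempfile
from pathlib import Path


def resolve_docker_source(dataset_docker, instance_docker, dataset_dir, instance_dir):
    """Return ("image", name) or ("dockerfile", path) for one instance."""
    # Explicit instance values win over anything inherited.
    # Checking "image" first is an arbitrary choice made for this sketch.
    if "image" in instance_docker:
        return ("image", instance_docker["image"])
    if "dockerfile" in instance_docker:
        return ("dockerfile", Path(instance_dir) / instance_docker["dockerfile"])
    # Inherited Dockerfile: prefer the instance copy, else the dataset copy.
    if "dockerfile" in dataset_docker:
        relative = dataset_docker["dockerfile"]
        candidate = Path(instance_dir) / relative
        if candidate.exists():
            return ("dockerfile", candidate)
        return ("dockerfile", Path(dataset_dir) / relative)
    if "image" in dataset_docker:
        return ("image", dataset_docker["image"])
    return None


with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp)
    instance_dir = dataset_dir / "instance-a"
    (dataset_dir / "environment").mkdir()
    (dataset_dir / "environment" / "Dockerfile").write_text("FROM scratch\n")
    instance_dir.mkdir()

    inherited = {"dockerfile": "environment/Dockerfile"}
    # The instance has no Dockerfile of its own, so the dataset one is used.
    fallback = resolve_docker_source(inherited, {}, dataset_dir, instance_dir)
    # An explicit instance image replaces the inherited Dockerfile.
    override = resolve_docker_source(
        inherited, {"image": "example:latest"}, dataset_dir, instance_dir
    )

print(fallback[0], override)
```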
Example included in this repository:
examples/multi-swe-bench-universal-smoke
Validate it before your first run:
benchrail validate \
  --dataset examples/multi-swe-bench-universal-smoke
Example manifest.json
{
"agents": [
{
"id": "codex-gpt-5.4-mini-medium",
"agent": "codex",
"version": "latest",
"command": "--model gpt-5.4-mini --config model_reasoning_effort=\"medium\" --disable fast_mode"
}
]
}
Current built-in agent types:
- codex
- claude-code
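To inspect a manifest outside of benchrail, a short Python script is enough. The sketch below parses the example manifest and applies two sanity checks. It does not replace benchrail validate: load_agents is a hypothetical helper, and the duplicate-id check is an assumption about what a sensible manifest looks like, not a documented rule.

```python
import json

# Built-in agent types listed in this README.
KNOWN_AGENT_TYPES = {"codex", "claude-code"}

# The example manifest from above, kept as raw text so the JSON escapes survive.
MANIFEST_TEXT = r'''
{
  "agents": [
    {
      "id": "codex-gpt-5.4-mini-medium",
      "agent": "codex",
      "version": "latest",
      "command": "--model gpt-5.4-mini --config model_reasoning_effort=\"medium\" --disable fast_mode"
    }
  ]
}
'''


def load_agents(manifest_text):
    """Parse manifest JSON and run basic checks on each agent entry."""
    agents = json.loads(manifest_text).get("agents", [])
    seen_ids = set()
    for entry in agents:
        if entry["id"] in seen_ids:
            # Assumption: ids should be unique because they name output folders.
            raise ValueError(f"duplicate agent id: {entry['id']}")
        if entry["agent"] not in KNOWN_AGENT_TYPES:
            raise ValueError(f"unknown agent type: {entry['agent']}")
        seen_ids.add(entry["id"])
    return agents


agents = load_agents(MANIFEST_TEXT)
for entry in agents:
    print(entry["id"], "->", entry["agent"], entry["command"])
```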
Execution Modes
local
Use local mode when the host machine already has the right toolchains and agent CLI access.
Pros:
- Faster iteration
- No container setup
- Easier local debugging
Tradeoffs:
- Depends on host environment consistency
- Harder to make fully reproducible across machines
docker
Use Docker mode when you want a more reproducible execution environment or need the provided universal image flow.
Pros:
- Better environment isolation
- Better fit for multi-language benchmark runs
- Easier to standardize across machines and CI
Tradeoffs:
- Requires Docker
- Adds image and container overhead
Output Artifacts
By default, result artifacts are written under the run workspace. If --output is
provided, result JSON and CSV summaries are written there instead.
Aggregated run artifacts:
<output-or-workspace>/<run_id>/
  result.json
  result.csv
Per-task artifacts:
<output-or-workspace>/<run_id>/<agent_id>/<instance_id>/
  result.json
  agent.patch
Per-task logs:
<logs-root>/<run_id>/<agent_id>/<instance_id>/
  runner.log
  logs/
    agent.stdout
    agent.stderr
    check_<name>.stdout
    check_<name>.stderr
    ...
The aggregated run summary includes:
- passed / failed / total tasks
- total duration
- token counts, when available
- cost in USD and credits, when available
- per-check pass/fail counts
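For custom reporting, the per-task files can also be tallied directly. The sketch below walks the per-task layout described above and counts outcomes per agent. The directory layout comes from this README, but the result.json schema is not documented here, so the boolean "passed" key is an assumption, and tally_run is a hypothetical helper.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def tally_run(run_dir):
    """Count passed/failed tasks per agent from per-task result.json files."""
    counts = {}
    # Layout from the README: <run_dir>/<agent_id>/<instance_id>/result.json
    for result_path in sorted(Path(run_dir).glob("*/*/result.json")):
        agent_id = result_path.parent.parent.name
        data = json.loads(result_path.read_text())
        # ASSUMPTION: a boolean "passed" key; the real schema may differ.
        outcome = "passed" if data.get("passed") else "failed"
        counts.setdefault(agent_id, Counter())[outcome] += 1
    return counts


# Build a tiny fake run directory so the sketch is runnable on its own.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run-001"
    for instance_id, passed in [("instance-a", True), ("instance-b", False)]:
        task_dir = run_dir / "codex-gpt-5.4-mini-medium" / instance_id
        task_dir.mkdir(parents=True)
        (task_dir / "result.json").write_text(json.dumps({"passed": passed}))
    summary = tally_run(run_dir)

for agent_id, counter in summary.items():
    print(agent_id, dict(counter))
```

The glob pattern only matches files two levels below the run directory, so the aggregated result.json at the top of the run is skipped.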
Development
Run unit tests:
make unit
Manual release prep:
make bump BUMP=patch
git commit -am "Release $(make print-release-tag)"
git push origin main
make tag-release
git push origin "$(make print-release-tag)"
After the tag is pushed, run the GitHub release workflow with that tag.
Run lint and type checks:
make lint
Format the codebase:
make format
Equivalent direct commands:
uv run pytest tests/unit/ -v
uv run ruff check benchrail/ tests/
uv run mypy benchrail tests
uv run ruff format benchrail/ tests/
uv run ruff check --fix benchrail/ tests/
License
The source code in this repository is licensed under the MIT License. See LICENSES/LICENSE.
This repository also contains third-party derived materials:
- docker/universal/ was adapted in part from https://github.com/openai/codex-universal (MIT)
- examples/multi-swe-bench-universal-smoke/ is derived from SWE-bench_Lite and SWE-bench_Multilingual
See LICENSES/THIRD_PARTY.md for attribution and redistribution caveats for dataset-derived content.