CLI for benchmarking agent setups

benchrail

benchrail is a simple CLI for running the same tasks across different agent setups and comparing the results.

An agent setup can include:

a different agent (Codex, Claude Code)
a different model
a different skill
a different tool
a different AGENTS.md
a different prompt or context-engineering strategy
a different execution environment

Use benchrail to measure whether a change actually makes the agent better on the tasks you care about.

Quick Start

Install

Install from PyPI with uv:

uv tool install benchrail

Install from PyPI with pip:

pip install benchrail

Run

Run in Docker mode:

benchrail run \
  --dataset examples/multi-swe-bench-universal-smoke \
  --mode docker

Run the same dataset locally:

benchrail run \
  --dataset examples/multi-swe-bench-universal-smoke \
  --mode local

If you use Docker mode and want to reuse your local AI agent login instead of passing an API key into the container, add:

--auth-session

First-Look Mental Model

The workflow is intentionally simple:

Create or choose a dataset directory
Validate it with benchrail validate
Run it against one or more agents with benchrail run
Inspect per-task JSON results, logs, and the aggregated run summary

At runtime, benchrail builds the Cartesian product of:

dataset instances
manifest agents

That becomes the task queue for the run.
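
As a rough illustration of that pairing, the sketch below builds every agent/instance combination with `itertools.product`. The ids and the `build_task_queue` helper are hypothetical, and the ordering shown is an assumption, not necessarily the order benchrail uses internally.

```python
from itertools import product

# Hypothetical ids: instances would come from dataset subdirectories,
# agents from the entries in manifest.json.
instance_ids = ["instance-a", "instance-b"]
agent_ids = ["codex-gpt-5.4-mini-medium", "claude-code-default"]


def build_task_queue(agent_ids, instance_ids):
    """Return one (agent_id, instance_id) pair per combination."""
    return list(product(agent_ids, instance_ids))


queue = build_task_queue(agent_ids, instance_ids)
for agent_id, instance_id in queue:
    print(f"{agent_id} x {instance_id}")
```

Two agents and two instances give four tasks, so adding one agent to the manifest adds one task per instance.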

Core Concepts

Dataset

A dataset is a directory containing:

a manifest.json file describing which agents to run
an optional config.json and environment/ directory containing instance defaults
one subdirectory per benchmark instance

Instance

Each instance contains a config.json plus optional environment scripts and patches.

Agent

An agent entry in manifest.json maps an agent id to an adapter and optional CLI arguments such as model selection.

Dataset Layout

Expected dataset shape:

<dataset>/
  manifest.json
  config.json
  environment/
    Dockerfile
    setup.sh
  <instance_id>/
    config.json
    environment/
      Dockerfile
      setup.sh
      run-gold-tests.sh
      any_check.sh
    patches/
      test.patch

Dataset config.json fields are inherited by each instance. Nested Docker environment values, hooks, and named check commands are merged, while explicit instance values override dataset defaults.
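
The inheritance rule can be pictured as a recursive dictionary merge. The following is a minimal sketch under that assumption; `merge_config` and the field names in the sample configs are hypothetical and are not taken from benchrail's actual schema.

```python
from copy import deepcopy


def merge_config(dataset_cfg: dict, instance_cfg: dict) -> dict:
    """Recursively merge instance_cfg over dataset_cfg without mutating either."""
    merged = deepcopy(dataset_cfg)
    for key, value in instance_cfg.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mappings are merged key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists from the instance replace the dataset default.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical field names, for illustration only.
dataset_cfg = {
    "docker": {"dockerfile": "environment/Dockerfile", "env": {"LANG": "C.UTF-8"}},
    "checks": {"gold": "environment/run-gold-tests.sh"},
}
instance_cfg = {
    "docker": {"env": {"TZ": "UTC"}},
    "checks": {"any": "environment/any_check.sh"},
}

merged = merge_config(dataset_cfg, instance_cfg)
print(merged["docker"]["env"])  # {'LANG': 'C.UTF-8', 'TZ': 'UTC'}
print(sorted(merged["checks"]))  # ['any', 'gold']
```

This sketch only covers the generic merge. Special cases, such as an instance image replacing an inherited Dockerfile, would need extra handling on top of it.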

Dataset environment/ files are copied first, then instance environment/ files are copied on top. For an inherited Dockerfile such as "dockerfile": "environment/Dockerfile", the instance path is used when it exists; otherwise benchrail falls back to the dataset path. An explicit instance docker.image overrides an inherited Dockerfile, and an explicit instance docker.dockerfile overrides an inherited image.
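
The Dockerfile fallback described above can be sketched as a two-step path lookup. `resolve_dockerfile` is a hypothetical helper written for this illustration, and the temporary directory tree stands in for a real dataset.

```python
import tempfile
from pathlib import Path


def resolve_dockerfile(dataset_dir: Path, instance_dir: Path, relative: str) -> Path:
    """Prefer the instance-level file; otherwise fall back to the dataset-level one."""
    candidate = instance_dir / relative
    if candidate.is_file():
        return candidate
    return dataset_dir / relative


with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp)
    (dataset_dir / "environment").mkdir()
    (dataset_dir / "environment" / "Dockerfile").write_text("FROM scratch\n")

    # instance-a ships its own Dockerfile; instance-b does not.
    with_own = dataset_dir / "instance-a"
    (with_own / "environment").mkdir(parents=True)
    (with_own / "environment" / "Dockerfile").write_text("FROM scratch\n")
    without_own = dataset_dir / "instance-b"
    without_own.mkdir()

    own = resolve_dockerfile(dataset_dir, with_own, "environment/Dockerfile")
    inherited = resolve_dockerfile(dataset_dir, without_own, "environment/Dockerfile")
    own_rel = own.relative_to(dataset_dir).as_posix()
    inherited_rel = inherited.relative_to(dataset_dir).as_posix()

print(own_rel)        # instance-a/environment/Dockerfile
print(inherited_rel)  # environment/Dockerfile
```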

Example included in this repository:

examples/multi-swe-bench-universal-smoke

Validate it before your first run:

benchrail validate \
  --dataset examples/multi-swe-bench-universal-smoke

Example `manifest.json`

{
  "agents": [
    {
      "id": "codex-gpt-5.4-mini-medium",
      "agent": "codex",
      "version": "latest",
      "command": "--model gpt-5.4-mini --config model_reasoning_effort=\"medium\" --disable fast_mode"
    }
  ]
}

Current built-in agent types:

codex
claude-code
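
For a quick sanity check outside of `benchrail validate`, a manifest can be inspected with a few lines of Python. This is a sketch only: `check_manifest` is a hypothetical helper, the unique-id rule is an assumption, and the checks the real validator performs may differ.

```python
import json

# Built-in agent types as listed above.
BUILT_IN_AGENTS = {"codex", "claude-code"}

MANIFEST_TEXT = r"""
{
  "agents": [
    {
      "id": "codex-gpt-5.4-mini-medium",
      "agent": "codex",
      "version": "latest",
      "command": "--model gpt-5.4-mini --config model_reasoning_effort=\"medium\" --disable fast_mode"
    }
  ]
}
"""


def check_manifest(text: str) -> list:
    """Return a list of human-readable problems found in a manifest."""
    manifest = json.loads(text)
    problems = []
    seen_ids = set()
    for entry in manifest.get("agents", []):
        agent_id = entry.get("id")
        if agent_id in seen_ids:
            # Assumption: ids must be unique because they name output directories.
            problems.append(f"duplicate agent id: {agent_id}")
        seen_ids.add(agent_id)
        if entry.get("agent") not in BUILT_IN_AGENTS:
            problems.append(f"unknown agent type for {agent_id}: {entry.get('agent')}")
    return problems


problems = check_manifest(MANIFEST_TEXT)
print(problems)  # []
```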

Execution Modes

`local`

Use local mode when the host machine already has the right toolchains and agent CLI access.

Pros:

Faster iteration
No container setup
Easier local debugging

Tradeoffs:

Depends on host environment consistency
Harder to make fully reproducible across machines

`docker`

Use Docker mode when you want a more reproducible execution environment or need the provided universal image flow.

Pros:

Better environment isolation
Better fit for multi-language benchmark runs
Easier to standardize across machines and CI

Tradeoffs:

Requires Docker
Adds image and container overhead

Output Artifacts

By default, result artifacts are written under the run workspace. If --output is provided, result JSON and CSV summaries are written there instead.

Aggregated run artifacts:

<output-or-workspace>/<run_id>/
  result.json
  result.csv

Per-task artifacts:

<output-or-workspace>/<run_id>/<agent_id>/<instance_id>/
  result.json
  agent.patch
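
Given that layout, per-task results can be gathered with a two-level glob under the run directory. The sketch below assumes only the directory structure shown above; `collect_task_results`, the run id, and the placeholder payloads are hypothetical, and nothing is assumed about the fields inside result.json.

```python
import json
import tempfile
from pathlib import Path


def collect_task_results(run_dir: Path) -> dict:
    """Map (agent_id, instance_id) to the parsed per-task result.json."""
    results = {}
    for path in sorted(run_dir.glob("*/*/result.json")):
        instance_id = path.parent.name
        agent_id = path.parent.parent.name
        results[(agent_id, instance_id)] = json.loads(path.read_text())
    return results


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run-001"  # hypothetical run id
    for agent_id, instance_id in [("codex", "inst-a"), ("codex", "inst-b")]:
        task_dir = run_dir / agent_id / instance_id
        task_dir.mkdir(parents=True)
        # Placeholder payload; the real result.json fields are not documented here.
        (task_dir / "result.json").write_text(json.dumps({"placeholder": True}))
    # The aggregated file at the run root is not matched by the two-level glob.
    (run_dir / "result.json").write_text("{}")
    results = collect_task_results(run_dir)

print(sorted(results))  # [('codex', 'inst-a'), ('codex', 'inst-b')]
```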

Per-task logs:

<logs-root>/<run_id>/<agent_id>/<instance_id>/
  runner.log
  logs/
    agent.stdout
    agent.stderr
    check_<name>.stdout
    check_<name>.stderr
    ...

The aggregated run summary includes:

passed / failed / total tasks
total duration
token counts, when available
cost in USD and credits, when available
per-check pass/fail counts

Development

Run unit tests:

make unit

Manual release prep:

make bump BUMP=patch
git commit -am "Release $(make print-release-tag)"
git push origin main
make tag-release
git push origin "$(make print-release-tag)"

After the tag is pushed, run the GitHub release workflow with that tag.

Run lint and type checks:

make lint

Format the codebase:

make format

Equivalent direct commands:

uv run pytest tests/unit/ -v
uv run ruff check benchrail/ tests/
uv run mypy benchrail tests
uv run ruff format benchrail/ tests/
uv run ruff check --fix benchrail/ tests/

License

The source code in this repository is licensed under the MIT License. See LICENSES/LICENSE.

This repository also contains third-party derived materials:

docker/universal/ was adapted in part from https://github.com/openai/codex-universal (MIT)
examples/multi-swe-bench-universal-smoke/ is derived from SWE-bench_Lite and SWE-bench_Multilingual

See LICENSES/THIRD_PARTY.md for attribution and redistribution caveats for dataset-derived content.

Project details

Release history

0.2.3 (Jun 16, 2026)
0.2.2 (Jun 12, 2026, this version)

Download files

Source Distribution

benchrail-0.2.2.tar.gz (145.0 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

benchrail-0.2.2-py3-none-any.whl (42.9 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file benchrail-0.2.2.tar.gz.

File metadata

Download URL: benchrail-0.2.2.tar.gz
Upload date: Jun 12, 2026
Size: 145.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for benchrail-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`e3a23192fb9295f79f80f0d64b511ddcfe120a2f3f15773b6cf504c1290556bf`
MD5	`f9b1c9642db8a7a34308d07005db1914`
BLAKE2b-256	`a142549e6832af41534dab789b7a31f0fba3054bd1d24f2bf0450166e44f3e4c`

Provenance

The following attestation bundles were made for benchrail-0.2.2.tar.gz:

Publisher: publish.yml on tripcher/benchrail

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: benchrail-0.2.2.tar.gz
- Subject digest: e3a23192fb9295f79f80f0d64b511ddcfe120a2f3f15773b6cf504c1290556bf
- Sigstore transparency entry: 1803559356
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: tripcher/benchrail@5513ae8fe0140b666dea9c7c1826505045cb2247
- Branch / Tag: refs/heads/main
- Owner: https://github.com/tripcher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5513ae8fe0140b666dea9c7c1826505045cb2247
- Trigger Event: workflow_dispatch

File details

Details for the file benchrail-0.2.2-py3-none-any.whl.

File metadata

Download URL: benchrail-0.2.2-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 42.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for benchrail-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`220a3e0d3d0c2ad22a5977aa9e43499e7a62617b78c500ee5c189729f39b3e8a`
MD5	`efdd17667e646a05120e14f1e76d7081`
BLAKE2b-256	`700c6d9f111c686d3914bed643300820b94892a4a2166129ade7de3fc77b3960`

Provenance

The following attestation bundles were made for benchrail-0.2.2-py3-none-any.whl:

Publisher: publish.yml on tripcher/benchrail

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: benchrail-0.2.2-py3-none-any.whl
- Subject digest: 220a3e0d3d0c2ad22a5977aa9e43499e7a62617b78c500ee5c189729f39b3e8a
- Sigstore transparency entry: 1803559366
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: tripcher/benchrail@5513ae8fe0140b666dea9c7c1826505045cb2247
- Branch / Tag: refs/heads/main
- Owner: https://github.com/tripcher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5513ae8fe0140b666dea9c7c1826505045cb2247
- Trigger Event: workflow_dispatch
