
TraceCore (Agent Bench CLI)



A lightweight benchmark for action-oriented agents inspired by the OpenClaw style—planner loops, tool APIs, partial observability—but open to any implementation that satisfies the harness.

TraceCore evaluates whether an agent can operate—not just reason. No LLM judges. No vibes. No giant simulators.

Brand note: TraceCore is the product name; the CLI/package and commands remain agent-bench for backward compatibility.

Core definition: see docs/core.md for the Deterministic Episode Runtime primitive and invariant contracts.

If your agent can survive this benchmark, it can probably survive production.

Installation

Published package (recommended)

pip install tracecore

Or with uv:

uv pip install tracecore

This installs the agent-bench CLI and all runtime dependencies. The CLI is immediately available once your environment's script directory (Scripts on Windows, bin on macOS/Linux) is on PATH.

Developer / contributor install

Clone the repo and install in editable mode to keep tasks and CLI entries in sync with your working tree (required for the web UI and the registry-powered loader):

git clone https://github.com/justindobbs/Tracecore.git
cd Tracecore
python -m venv .venv && .venv\Scripts\activate  # or source .venv/bin/activate on macOS/Linux
pip install -e .[dev]

pip install -e keeps the package in sync with your working tree so new tasks + CLI entries are immediately available.

Windows PATH tip

The editable install drops agent-bench.exe into %APPDATA%\Python\Python310\Scripts (or whichever minor version you're using). Add that folder to Path via System Properties → Environment Variables so agent-bench works from any terminal. After updating Path, open a new shell.

Prefer a one-step install? pipx install tracecore drops its own shim into %USERPROFILE%\.local\bin and handles PATH automatically.

Already using uv? Run uv tool install tracecore to create the CLI shim in %USERPROFILE%\.local\bin. uv's bootstrap already wires that directory into PATH, so no manual environment edits are required.

Prefer a shorter command name? Create a shell alias so tracecore forwards to agent-bench:

  • PowerShell (add to $PROFILE): Set-Alias tracecore agent-bench
  • Command Prompt: doskey tracecore=agent-bench $*
  • Bash/Zsh: alias tracecore='agent-bench'

The alias simply invokes the same CLI, so all subcommands and flags continue to work.

Quick start

Fastest path — run a known-good agent+task pairing by name:

agent-bench run pairing log_stream_monitor
agent-bench run pairing log_stream_monitor --seed 7

See all available pairings:

agent-bench run pairing --list

Smoke-test every pairing in one shot (useful after a harness change):

agent-bench run pairing --all
agent-bench run pairing --all --seed 7 --timeout 120   # 120 s wall-clock limit per run

Or navigate into the agents/ directory — if only one pairing matches a file there, it auto-selects:

cd agents
agent-bench run pairing          # auto-detects if unambiguous

Run any agent+task+seed explicitly, with an optional wall-clock timeout:

agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42 --timeout 60

Need an end-to-end TraceCore + Pydantic AI example? See docs/pydantic_poc.md for the deterministic dice game agent/task combo.

Want a standalone proof-of-concept that walks through the full execution loop? See examples/simple_agent_demo/ — a self-contained demo with a CLI that lists tasks, lists agents, and runs any pairing with verbose trace output:

cd examples/simple_agent_demo
python demo.py --task dice_game --agent dice_game_agent
python demo.py --list-tasks
python demo.py --list-agents

Prefer a guided setup? Launch the colorful wizard and let it walk you through agent/task/seed selection (it saves the answers and then calls the same run command under the hood):

agent-bench interactive
# add --dry-run to preview the command without executing
# add --save-session to remember your choices for next time
# add --plugins to include plugin tasks in discovery
# add --no-color if your terminal doesn't support ANSI colors

The wizard includes:

  • Suggested pairings: See agent-task combinations with proven success (if baseline data exists)
  • Agent validation: Checks that selected agents implement the required interface
  • Task budgets: Shows steps and tool_calls limits for each task
  • Progress indicators: Guides you through "Step 1/3", "Step 2/3", "Step 3/3"
  • Fuzzy search: Type partial names to filter agents/tasks
  • Inline help: Press ? during any prompt for context-sensitive tips
  • Session persistence: Use --save-session to remember your last selections
  • Dry-run mode: Preview the exact command before execution with --dry-run

Prefer the UI?

agent-bench dashboard --reload
# then open http://localhost:8000

Point the form at agents/toy_agent.py + filesystem_hidden_config@1 for a deterministic smoke test, or switch to agents/rate_limit_agent.py for the API scenarios. The Pairings tab in the dashboard provides one-click launch for every known-good pairing.

Inspect recent runs

Print a compact table of recent runs without opening the dashboard:

agent-bench runs summary
agent-bench runs summary --task log_stream_monitor@1 --limit 10
agent-bench runs summary --failure-type budget_exhausted

For raw JSON output use agent-bench runs list (same filters).

Run tests

python -m pytest

Want a single command that runs task validation + pytest and can apply a couple guarded, mechanical fixes? See docs/maintainer.md:

agent-bench maintain

Write a new agent

Scaffold a stub with the correct reset / observe / act interface in one command:

agent-bench new-agent my_agent
# creates agents/my_agent_agent.py with inline docstrings and budget-guard boilerplate

Kebab-case names are normalised automatically (my-agent → MyAgentAgent). Use --output-dir to write elsewhere, --force to overwrite an existing file.

Then wire it to a task and run:

agent-bench run --agent agents/my_agent_agent.py --task filesystem_hidden_config@1 --seed 0

See docs/agents.md for the full interface contract and docs/task_harness.md for the action schema.
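
The scaffold's reset / observe / act shape can be sketched as a plain Python class. This is an illustrative skeleton only, assuming dict-shaped observations and actions; docs/agents.md defines the real payload schema:

```python
class EchoAgent:
    """Illustrative agent skeleton. Method names come from the scaffold;
    the observation/action payload shapes here are assumptions."""

    def reset(self, seed):
        # Called once per episode; the seed keeps behavior reproducible.
        self.seen = []

    def observe(self, observation):
        # Record whatever the harness reveals this step.
        self.seen.append(observation)

    def act(self):
        # Return the next action; a real agent plans under its budget here.
        return {"tool": "noop", "args": {}}
```

A real agent would replace the `act` body with its planner loop while keeping the same three entry points.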

Troubleshooting

Need help diagnosing install, CLI, or validator issues? See docs/troubleshooting.md for a consolidated guide that covers PATH fixes, common failure types, and dashboard hiccups.

Note: Task budgets are configured in each task's task.toml manifest and can be inspected via agent-bench tasks validate --registry. There is no --budget CLI override flag; budgets are enforced from the task definition.

Tutorials

Framing the idea

Terminal Bench works because it:

  • Evaluates agents via real tasks, not synthetic prompts
  • Uses a simple, opinionated interface (a terminal)
  • Is cheap to run, easy to extend, and hard to game

An operations-focused benchmark should do the same, but centered on:

  • Action-oriented agents with tool APIs
  • Environment interaction and partial observability
  • Longish horizons with state, retries, and recovery

In practice, this covers:

  • OpenClaw-native agents
  • Custom planner loops wired into REST or filesystem tools
  • Orchestration agents (e.g., TaskWeaver, AutoGPT-style) that can wrap the simple reset/observe/act interface

Think of it less as "benchmarking a model" and more as benchmarking an agent loop end-to-end.

What makes these agents distinct?

(Adjust these if your mental model differs.)

  • Planner / policy loop instead of single-shot prompting
  • Tool or action interfaces instead of raw chat completions
  • Optional memory, world models, or reusable skills
  • Strong emphasis on doing, not just responding

So the benchmark should not test raw language quality or one-shot reasoning. It should test:

  • Decision-making under constraints
  • Tool sequencing and dependency management
  • Recovery from errors and partial failures
  • State tracking over time and across steps

Why this exists

Most benchmarks answer questions like:

  • Can the model reason?
  • Can it write the right patch?
  • Can it roleplay an agent?

TraceCore answers a different question: Can this agent run unattended and get the job done without breaking things?

We test:

  • Tool sequencing
  • Error recovery
  • State tracking
  • Long-horizon behavior
  • Boring, reliable decision-making

Design principles

  1. Minimal environment, maximal signal
    • Keep worlds tiny, deterministic, and inspectable: toy filesystems, fake APIs, log streams, local services.
    • No giant simulators or cloud dependencies—everything should run in seconds on a laptop.
  2. Agent-in-the-loop evaluation
    • Benchmark the entire perception → reasoning → action loop, not a single prompt.
    • Each task specifies initial state, tool interface, validator, and explicit budgets (steps + tool calls).
  3. Binary outcomes first
    • Success or failure is the headline metric; secondary stats (steps, tool calls, errors) give color.
    • Deterministic tasks + frozen versions make regressions obvious and stop overfitting.
  4. Hard to game, easy to extend
    • Sandboxed execution, limited affordances, and published hashes keep agents honest.
    • Tasks are small Python packages so contributors can add new scenarios without ceremony.

Task categories (operations-native)

1. Tool choreography tasks

Goal: stress sequencing, dependency management, and retries.

  • Example: rate_limited_api@1 — retrieve an ACCESS_TOKEN from a mock API that enforces a deterministic rate limit and transient failures.
  • Signals: correct tool ordering, retry logic, state retention, graceful degradation.

2. Partial observability & discovery

Goal: reward cautious exploration instead of brute force.

  • Example: “Traverse a directory tree with undocumented schema. Find the real config key without trashing the filesystem.”
  • Signals: hypothesis updates, selective reads, remembering seen paths, avoiding repeated mistakes.

3. Long-horizon maintenance

Goal: ensure persistence, monitoring, and acting at the right moment.

  • Example: “A service degrades over time. Watch logs, detect the symptom, and apply the correct fix only when needed.”
  • Signals: patience, trigger detection, not overreacting, applying steady-state playbooks.

4. Adversarial-but-fair environments

Goal: test robustness when the world is a little hostile.

  • Example: flaky tools, malformed API responses, conflicting telemetry that needs disambiguation.
  • Signals: error recovery, fallback strategies, keeping track of provenance before acting.

Scoring without overengineering

  • Binary success/failure is the scoreboard.
  • Secondary metrics: steps taken, tool calls, wall-clock time, error count.
  • No LLM judges, no vibes, no composite scores you can’t reason about.
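
These metrics are simple enough to recompute yourself. A hypothetical helper (not part of the CLI) that aggregates run-artifact dicts shaped like the CLI's JSON output:

```python
def summarize(runs):
    """Aggregate the binary headline metric plus secondary stats.

    `runs` is a list of dicts shaped like the CLI's JSON output
    (success, steps_used, tool_calls_used). Hypothetical helper,
    not part of the agent-bench CLI.
    """
    n = len(runs)
    if n == 0:
        return {"success_rate": 0.0, "avg_steps": 0.0, "avg_tool_calls": 0.0}
    return {
        "success_rate": sum(1 for r in runs if r["success"]) / n,
        "avg_steps": sum(r["steps_used"] for r in runs) / n,
        "avg_tool_calls": sum(r["tool_calls_used"] for r in runs) / n,
    }
```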

Interface sketch

Agents run exactly like they would in production: provide an agent, pick a task, respect the budget.

agent-bench run \
  --agent agents/toy_agent.py \
  --task filesystem_hidden_config@1 \
  --seed 42

Each task ships with a harness, fake environment, and validator. Agents only see what they’re allowed to see.

Why this matters (and what’s missing today)

Most agent benchmarks collapse back into single-prompt exams. They rarely measure recovery, operational competence, or whether the agent can survive unattended. TraceCore surfaces engineering-quality differences and rewards boring-but-correct behavior.

Potential pitfalls & guardrails

  • Overfitting to the harness → Keep suites varied, publish fixtures, encourage new contributions.
  • Agents cheating via inspection → Sandbox aggressively, freeze binaries, limit visibility.
  • Benchmark drift → Freeze task versions, publish hashes/seeded assets, require changelog entries.

What’s in v0

Task suites:

  • Filesystem & State
  • Tool Choreography
  • Long-Horizon & Monitoring
  • Adversarial-but-Fair
  • Operations & Triage

Shipping tasks:

  • filesystem_hidden_config@1 (filesystem suite): explore a hidden directory tree to find the one true API_KEY.
  • rate_limited_api@1 (api suite): classify API errors, respect retry_after, and persist the returned ACCESS_TOKEN.
  • log_alert_triage@1 (operations suite): triage deterministic logs and extract the final ALERT_CODE.
  • config_drift_remediation@1 (operations suite): compare desired vs. live configs and output the remediation patch line.
  • incident_recovery_chain@1 (operations suite): follow a recovery handoff chain to recover RECOVERY_TOKEN.
  • log_stream_monitor@1 (operations suite): poll a paginated log stream, ignore noise, and emit STREAM_CODE when a CRITICAL entry is detected.

Each task:

  • Defines an initial environment
  • Exposes a constrained action interface
  • Has a single, deterministic success condition
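
"Single, deterministic success condition" usually means a pure function over the episode's final state. A hypothetical validator sketch (field names invented here; the real contract lives in each task's validate.py and docs/contract_spec.md):

```python
def validate(final_state):
    """Deterministic pass/fail over the final environment state.

    Hypothetical shape: field names are assumptions, not the real
    contract from docs/contract_spec.md.
    """
    # Success iff the agent emitted the one true API_KEY it was asked to find.
    expected = final_state.get("expected_api_key")
    return expected is not None and final_state.get("emitted_api_key") == expected
```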

How it works

You provide any agent that implements the documented interface. We provide a task harness. The agent runs until:

  • It succeeds
  • It fails
  • It runs out of budget

No human in the loop. No retries.
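
Those three terminal outcomes imply a control loop like the following. This is illustrative only; the real harness also enforces tool-call budgets, sandboxing, and trace recording:

```python
def run_episode(agent, env, validate, max_steps):
    """Drive one episode to success, failure, or budget exhaustion.

    Illustrative control flow; `env.observe`/`env.apply` are assumed
    names, not the real harness API.
    """
    agent.reset(seed=0)
    for _ in range(max_steps):
        agent.observe(env.observe())
        outcome = env.apply(agent.act())
        if outcome == "done":
            return "success" if validate(env.state) else "failure"
    return "budget_exhausted"
```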

Example

agent-bench run \
  --agent agents/toy_agent.py \
  --task filesystem_hidden_config@1 \
  --seed 42

# Replay a prior run_id (defaults to recorded agent/task/seed, but you can override):
agent-bench run --replay <run_id> --seed 42

Configuration via agent-bench.toml

Rather not repeat --agent, --task, and --seed every time? Drop a config file in the repo root (or pass --config path/to/file). Set AGENT_BENCH_CONFIG=agent-bench.toml in CI (and any automation) so the same defaults apply everywhere.

[defaults]
agent = "agents/toy_agent.py"
task = "filesystem_hidden_config@1"
seed = 42

[agent."agents/rate_limit_agent.py"]
task = "rate_limited_api@1"
seed = 11

The CLI resolves flags first, then per-agent overrides, then the [defaults] block. Any command accepts --config to point at another file; otherwise agent-bench.toml (or agent_bench.toml) is used when present or when AGENT_BENCH_CONFIG is set.
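
That precedence (flags, then per-agent overrides, then [defaults]) amounts to layered dict merging. A hypothetical sketch of the resolution, operating on the parsed TOML structure shown above:

```python
def resolve_run_config(cli_flags, config, agent_path):
    """Layer CLI flags over per-agent overrides over [defaults].

    Hypothetical sketch of the documented precedence; `config` mirrors
    the parsed agent-bench.toml structure, and flags set to None count
    as "not passed".
    """
    merged = dict(config.get("defaults", {}))               # lowest precedence
    merged.update(config.get("agent", {}).get(agent_path, {}))  # per-agent block
    merged.update({k: v for k, v in cli_flags.items() if v is not None})  # flags win
    return merged
```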

If agent-bench isn’t on your PATH yet, call it via Python:

python -m agent_bench.cli --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42

Every CLI run writes a JSON artifact under .agent_bench/runs/<run_id>.json. Inspect them directly, or list them via:

agent-bench runs list --limit 5

Want to zero in on a specific outcome? Use the structured failure taxonomy filter:

agent-bench runs list --failure-type timeout --limit 5
agent-bench runs list --failure-type success --limit 5  # only successful runs

The same buckets surface in the Web UI’s Recent Runs list, where each entry is labeled Success or Failure — <type> so you can spot budget exhaustion vs. invalid actions at a glance.

Need a quick aggregate of how an agent performs on a task? Use the baseline helper:

agent-bench baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1

It emits success rate, average steps/tool calls, and links back to the latest trace for that agent/task pair. Add --export to persist a frozen snapshot for the web UI:

agent-bench baseline --export        # writes .agent_bench/baselines/baseline-<ts>.json
agent-bench baseline --export latest # custom filename in the baselines folder

Compare two specific runs (paths or run_ids) to see exactly where traces diverge:

agent-bench baseline --compare .agent_bench/runs/run_a.json .agent_bench/runs/run_b.json
# or mix path + run_id
agent-bench baseline --compare abcd1234 efgh5678

The diff output highlights whether the agent/task/success states match and lists per-step differences. Use --format text for a quick human summary; exit codes are 0 (identical), 1 (different), 2 (incompatible task/agent). For CI usage, see docs/ci_workflow.md. This repo also ships a chain-agent-baseline workflow wired to agents/chain_agent.py + rate_limited_chain@1.
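
The exit-code semantics (0 identical, 1 different, 2 incompatible) can be mimicked in a few lines over run-artifact dicts. A hypothetical sketch, not the CLI's actual implementation:

```python
def compare_runs(a, b):
    """Return 0 (identical), 1 (different), 2 (incompatible task).

    Mirrors the documented exit codes; hypothetical sketch over
    run-artifact dicts shaped like the CLI's JSON output.
    """
    if (a.get("task_id"), a.get("version")) != (b.get("task_id"), b.get("version")):
        return 2  # comparing runs of different tasks is meaningless
    keys = ("success", "steps_used", "tool_calls_used")
    return 0 if all(a.get(k) == b.get(k) for k in keys) else 1
```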

The Baselines tab in the UI only shows a "Latest published" card after you export at least once.

Minimal Web UI (Optional)

Prefer sliders and buttons over the CLI? Spin up the lightweight FastAPI form:

pip install tracecore
agent-bench dashboard --host 127.0.0.1 --port 8000 --reload

--reload is for local development only. It enables uvicorn's auto-reload on file changes and should not be used in shared or production environments. Omit the flag for stable serving.

Tip: create a virtual environment first (e.g., python -m venv .venv && .venv\Scripts\activate on Windows) so the FastAPI deps stay isolated. See the official FastAPI installation guide for more platform-specific options: https://fastapi.tiangolo.com/#installation

Then visit http://localhost:8000 to:

  • Pick any agent module under agents/
  • Choose a task (filesystem_hidden_config@1, rate_limited_api@1, etc.) and seed
  • Launch runs, inspect structured JSON results (seed included), and drill into traces
  • Replay a prior run by pasting its run_id and optionally overriding the seed/agent/task

The UI intentionally ships with no Node/Vite stack—just FastAPI + Jinja—so you can layer more elaborate frontends later without losing the minimal flow.

Output:

{
  "task_id": "filesystem_hidden_config",
  "version": 1,
  "seed": 42,
  "success": true,
  "failure_reason": null,
  "failure_type": null,
  "steps_used": 37,
  "tool_calls_used": 12
}

Diagnostics workflow

  1. Run & persist — both the CLI and the web UI call the same harness and automatically persist artifacts under .agent_bench/runs/ with metadata (run_id, trace_id, timestamps, harness version, trace entries).
  2. Inspect traces — load http://localhost:8000/?trace_id=<run_id> to jump straight to the trace viewer, or fetch raw JSON via /api/traces/<run_id>.
  3. Compare outcomes — use agent-bench baseline ... or the UI baseline table to spot regressions (success rate, average steps/tool calls) before publishing results.
  4. Freeze specs — once a run set looks good, tag the task versions + harness revision so those run IDs remain reproducible proof of behavior.
  5. Manual verification — before freezing or sharing results, run through docs/manual_verification.md to replay the CLI + UI flows end-to-end.

To inspect a specific run artifact directly, use:

agent-bench runs list --limit 5
# then load the JSON artifact from .agent_bench/runs/<run_id>.json
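
Since each artifact is plain JSON, scripting over a run directory takes only a few lines. Field names follow the sample output shown earlier; anything beyond those fields is an assumption about the artifact schema:

```python
import json
from pathlib import Path

def load_runs(runs_dir=".agent_bench/runs"):
    """Load every run artifact (one JSON file per run), sorted by filename.

    Assumes the documented one-JSON-file-per-run layout; field contents
    follow the sample output shown in this README.
    """
    runs = []
    for path in sorted(Path(runs_dir).glob("*.json")):
        with path.open() as fh:
            runs.append(json.load(fh))
    return runs
```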

Release process

Ready to cut a release? See docs/release_process.md for the standard checklist (changelog, version stamping, test gate, SPEC_FREEZE alignment, trust evidence bundle, and tagging steps). Historical release notes are also archived there.

What we measure

Per task:

  • Success / failure
  • Steps taken
  • Tool calls
  • Error count

Across a suite:

  • Success rate
  • Aggregate efficiency metrics

See SPEC_FREEZE.md for the frozen v0.1.0 task list (including the new rate_limited_chain@1 pain task) and the rules for bumping versions.

We deliberately avoid:

  • LLM-based judges
  • Natural language grading
  • Weighted composite scores

Reference agent

TraceCore ships with a minimal reference agent. It is:

  • Conservative
  • State-driven
  • Explicit about errors
  • Boring on purpose

If your agent can’t outperform the reference agent, that’s a signal.

Reference implementations:

  • agents/toy_agent.py — solves filesystem discovery tasks.
  • agents/rate_limit_agent.py — handles classic rate-limit retry flows (rate_limited_api@1).
  • agents/chain_agent.py — completes the chained handshake + rate-limit pain task (rate_limited_chain@1).
  • agents/ops_triage_agent.py — handles operations triage tasks (log_alert_triage@1, config_drift_remediation@1, incident_recovery_chain@1).
  • agents/cheater_agent.py — intentionally malicious “cheater sim” that tries to read hidden state; the sandbox should block it with a sandbox_violation so you can prove the harness defenses work.

Adding a task

Tasks are small and self-contained, but every bundled scenario now flows through a manifest so registry + docs stay aligned.

Bundled manifest

  • tasks/registry.json enumerates every built-in task (filesystem_hidden_config@1, rate_limited_api@1, rate_limited_chain@1, deterministic_rate_service@1, log_alert_triage@1, config_drift_remediation@1, incident_recovery_chain@1).
  • Update the list above whenever you add new operations tasks.
  • When you add or bump a task version, update this manifest, SPEC_FREEZE, and the docs table in docs/tasks.md.

Plugin workflow

  • External packages can expose tasks without living in this repo via the agent_bench.tasks entry-point group.

  • See docs/task_plugin_template.md for a ready-to-copy layout, entry-point snippet, and register() helper contract.

  • The loader automatically merges bundled manifest entries and plugin descriptors, so agent-bench run --task my_plugin_task@1 works once the package is installed.

  • Validate task manifests/registry entries with agent-bench tasks validate before publishing plugins or bumping versions.

Task requirements

  • Environment setup (setup.py)
  • Available actions/tools (actions.py)
  • Validator (validate.py)
  • Budget defaults + metadata (task.toml)
  • Contract fields defined in docs/contract_spec.md
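
A minimal budget + metadata manifest might look like the following. This is a hypothetical sketch: the table and field names here are assumptions, and docs/contract_spec.md defines the actual contract fields.

```toml
# Hypothetical task.toml sketch; consult docs/contract_spec.md
# for the real field names and required contract fields.
[task]
id = "my_plugin_task"
version = 1
suite = "operations"

[budgets]
steps = 50
tool_calls = 20
```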

If your task:

  • Requires internet access
  • Needs a GPU
  • Takes minutes to run

It probably doesn’t belong here.

Non-goals

TraceCore does not aim to:

  • Benchmark raw language quality
  • Measure creativity
  • Replace SWE-bench or Terminal Bench
  • Simulate the real world

It tests operational competence, nothing more.

Status

This project is early and opinionated. Expect:

  • Breaking changes
  • Small task suites
  • Strong opinions

If you disagree, open an issue—or better, a PR.

One-line summary: Terminal Bench, but for agents that actually have to do things.
