TraceCore (Agent Bench CLI)
TraceCore overview
A lightweight benchmark for action-oriented agents inspired by the OpenClaw style—planner loops, tool APIs, partial observability—but open to any implementation that satisfies the harness.
TraceCore evaluates whether an agent can operate—not just reason. No LLM judges. No vibes. No giant simulators.
Brand note: TraceCore is the product name; the CLI/package and commands remain agent-bench for backward compatibility.
Core definition: see docs/core.md for the Deterministic Episode Runtime primitive and invariant contracts.
If your agent can survive this benchmark, it can probably survive production.
Installation
Published package (recommended)
pip install tracecore
Or with uv:
uv pip install tracecore
This installs the agent-bench CLI and all runtime dependencies. The CLI is immediately available once your environment's Scripts directory is on PATH.
Developer / contributor install
Clone the repo and install in editable mode to keep tasks and CLI entries in sync with your working tree (required for the web UI and the registry-powered loader):
git clone https://github.com/justindobbs/Tracecore.git
cd Tracecore
python -m venv .venv && .venv\Scripts\activate # or source .venv/bin/activate on macOS/Linux
pip install -e .[dev]
pip install -e keeps the package in sync with your working tree so new tasks + CLI entries are immediately available.
Windows PATH tip
The editable install drops agent-bench.exe into %APPDATA%\Python\Python310\Scripts (or whichever minor version you're using). Add that folder to Path via System Properties → Environment Variables so agent-bench works from any terminal. After updating Path, open a new shell.
Prefer a one-step install?
pipx install tracecore drops its own shim into %USERPROFILE%\.local\bin and handles PATH automatically.
Already using uv? Run uv tool install tracecore to create the CLI shim in %USERPROFILE%\.local\bin. uv's bootstrap already wires that directory into PATH, so no manual environment edits are required.
Prefer a shorter command name? Create a shell alias so tracecore forwards to agent-bench:
- PowerShell (add to $PROFILE): Set-Alias tracecore agent-bench
- Command Prompt: doskey tracecore=agent-bench $*
- Bash/Zsh: alias tracecore='agent-bench'
The alias simply invokes the same CLI, so all subcommands and flags continue to work.
Quick start
Fastest path — run a known-good agent+task pairing by name:
agent-bench run pairing log_stream_monitor
agent-bench run pairing log_stream_monitor --seed 7
See all available pairings:
agent-bench run pairing --list
Smoke-test every pairing in one shot (useful after a harness change):
agent-bench run pairing --all
agent-bench run pairing --all --seed 7 --timeout 120 # 120 s wall-clock limit per run
Or navigate into the agents/ directory — if only one pairing matches a file there, it auto-selects:
cd agents
agent-bench run pairing # auto-detects if unambiguous
Run any agent+task+seed explicitly, with an optional wall-clock timeout:
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42 --timeout 60
Need an end-to-end TraceCore + Pydantic AI example? See docs/pydantic_poc.md for the deterministic dice game agent/task combo.
Want a standalone proof-of-concept that walks through the full execution loop? See examples/simple_agent_demo/ — a self-contained demo with a CLI that lists tasks, lists agents, and runs any pairing with verbose trace output:
cd examples/simple_agent_demo
python demo.py --task dice_game --agent dice_game_agent
python demo.py --list-tasks
python demo.py --list-agents
Prefer a guided setup? Launch the colorful wizard and let it walk you through agent/task/seed selection (it saves the answers and then calls the same run command under the hood):
agent-bench interactive
# add --dry-run to preview the command without executing
# add --save-session to remember your choices for next time
# add --plugins to include plugin tasks in discovery
# add --no-color if your terminal doesn't support ANSI colors
The wizard includes:
- Suggested pairings: See agent-task combinations with proven success (if baseline data exists)
- Agent validation: Checks that selected agents implement the required interface
- Task budgets: Shows steps and tool_calls limits for each task
- Progress indicators: Guides you through "Step 1/3", "Step 2/3", "Step 3/3"
- Fuzzy search: Type partial names to filter agents/tasks
- Inline help: Press ? during any prompt for context-sensitive tips
- Session persistence: Use --save-session to remember your last selections
- Dry-run mode: Preview the exact command before execution with --dry-run
Prefer the UI?
agent-bench dashboard --reload
# then open http://localhost:8000
Point the form at agents/toy_agent.py + filesystem_hidden_config@1 for a deterministic smoke test, or switch to agents/rate_limit_agent.py for the API scenarios. The Pairings tab in the dashboard provides one-click launch for every known-good pairing.
Inspect recent runs
Print a compact table of recent runs without opening the dashboard:
agent-bench runs summary
agent-bench runs summary --task log_stream_monitor@1 --limit 10
agent-bench runs summary --failure-type budget_exhausted
For raw JSON output use agent-bench runs list (same filters).
Run tests
python -m pytest
Want a single command that runs task validation + pytest and can apply a couple guarded, mechanical fixes? See docs/maintainer.md:
agent-bench maintain
Write a new agent
Scaffold a stub with the correct reset / observe / act interface in one command:
agent-bench new-agent my_agent
# creates agents/my_agent_agent.py with inline docstrings and budget-guard boilerplate
Kebab-case names are normalised automatically (my-agent → MyAgentAgent). Use --output-dir to write elsewhere, --force to overwrite an existing file.
Then wire it to a task and run:
agent-bench run --agent agents/my_agent_agent.py --task filesystem_hidden_config@1 --seed 0
See docs/agents.md for the full interface contract and docs/task_harness.md for the action schema.
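The contract above can be pictured as a minimal skeleton. This is an illustrative sketch, not the authoritative interface — the method names follow the reset / observe / act wording, but the real signatures and argument shapes are defined in docs/agents.md:

```python
# Hypothetical sketch of the reset/observe/act contract; exact signatures
# live in docs/agents.md, so treat every name here as an assumption.

class MinimalAgent:
    """A conservative agent skeleton: reset per episode, act on each observation."""

    def reset(self, task_metadata: dict) -> None:
        # Clear per-episode state; the harness calls this once before the run.
        self.seen = []

    def observe(self, observation: dict) -> None:
        # Record what the environment just returned. Under partial
        # observability, only remember what you were actually shown.
        self.seen.append(observation)

    def act(self) -> dict:
        # Return the next action as a structured dict the harness can validate.
        if not self.seen:
            return {"tool": "list_dir", "args": {"path": "."}}
        return {"tool": "noop", "args": {}}
```

The scaffolded stub from agent-bench new-agent follows the same three-method shape, with budget-guard boilerplate added.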
Troubleshooting
Need help diagnosing install, CLI, or validator issues? See docs/troubleshooting.md for a consolidated guide that covers PATH fixes, common failure types, and dashboard hiccups.
Note: Task budgets are configured in each task's task.toml manifest and can be inspected via agent-bench tasks validate --registry. There is no --budget CLI override flag; budgets are enforced from the task definition.
Tutorials
- OpenClaw users: see OPENCLAW_QUICKSTART.md for a 5-minute first run (no OpenClaw install required), or the full tutorials/openclaw_quickstart.md for adapter patterns, budget mapping, and troubleshooting.
Framing the idea
Terminal Bench works because it:
- Evaluates agents via real tasks, not synthetic prompts
- Uses a simple, opinionated interface (a terminal)
- Is cheap to run, easy to extend, and hard to game
An operations-focused benchmark should do the same, but centered on:
- Action-oriented agents with tool APIs
- Environment interaction and partial observability
- Longish horizons with state, retries, and recovery
In practice, this covers:
- OpenClaw-native agents
- Custom planner loops wired into REST or filesystem tools
- Orchestration agents (e.g., TaskWeaver, AutoGPT-style) that can wrap the simple reset/observe/act interface
Think of it less as "benchmarking a model" and more as benchmarking an agent loop end-to-end.
What makes these agents distinct?
(Adjust these if your mental model differs.)
- Planner / policy loop instead of single-shot prompting
- Tool or action interfaces instead of raw chat completions
- Optional memory, world models, or reusable skills
- Strong emphasis on doing, not just responding
So the benchmark should not test raw language quality or one-shot reasoning. It should test:
- Decision-making under constraints
- Tool sequencing and dependency management
- Recovery from errors and partial failures
- State tracking over time and across steps
Why this exists
Most benchmarks answer questions like:
- Can the model reason?
- Can it write the right patch?
- Can it roleplay an agent?
TraceCore answers a different question: Can this agent run unattended and get the job done without breaking things?
We test:
- Tool sequencing
- Error recovery
- State tracking
- Long-horizon behavior
- Boring, reliable decision-making
Design principles
- Minimal environment, maximal signal
- Keep worlds tiny, deterministic, and inspectable: toy filesystems, fake APIs, log streams, local services.
- No giant simulators or cloud dependencies—everything should run in seconds on a laptop.
- Agent-in-the-loop evaluation
- Benchmark the entire perception → reasoning → action loop, not a single prompt.
- Each task specifies initial state, tool interface, validator, and explicit budgets (steps + tool calls).
- Binary outcomes first
- Success or failure is the headline metric; secondary stats (steps, tool calls, errors) give color.
- Deterministic tasks + frozen versions make regressions obvious and stop overfitting.
- Hard to game, easy to extend
- Sandboxed execution, limited affordances, and published hashes keep agents honest.
- Tasks are small Python packages so contributors can add new scenarios without ceremony.
Task categories (operations-native)
1. Tool choreography tasks
Goal: stress sequencing, dependency management, and retries.
- Example: rate_limited_api@1 — retrieve an ACCESS_TOKEN from a mock API that enforces a deterministic rate limit and transient failures.
- Signals: correct tool ordering, retry logic, state retention, graceful degradation.
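The retry behavior this category rewards can be sketched as a generic loop. This is a hedged illustration, not the harness API — `call` is a hypothetical tool wrapper returning a response dict, and the status/retry_after field names are assumptions:

```python
import time

def call_with_retry(call, max_attempts=5):
    """Classify the error, honor any retry_after hint, and give up after a
    bounded number of attempts — the shape of logic rate_limited_api@1 tests."""
    for attempt in range(max_attempts):
        resp = call()
        if resp.get("status") == 200:
            return resp["body"]           # success: caller persists the token
        if resp.get("status") == 429:     # rate limited: wait as instructed
            time.sleep(resp.get("retry_after", 1))
            continue
        if resp.get("status", 0) >= 500:  # transient server error: try again
            continue
        break                             # non-retryable error: stop early
    return None
```

Graceful degradation here means returning a clear failure (None) instead of hammering the endpoint past its budget.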
2. Partial observability & discovery
Goal: reward cautious exploration instead of brute force.
- Example: “Traverse a directory tree with undocumented schema. Find the real config key without trashing the filesystem.”
- Signals: hypothesis updates, selective reads, remembering seen paths, avoiding repeated mistakes.
3. Long-horizon maintenance
Goal: ensure persistence, monitoring, and acting at the right moment.
- Example: “A service degrades over time. Watch logs, detect the symptom, and apply the correct fix only when needed.”
- Signals: patience, trigger detection, not overreacting, applying steady-state playbooks.
4. Adversarial-but-fair environments
Goal: test robustness when the world is a little hostile.
- Example: flaky tools, malformed API responses, conflicting telemetry that needs disambiguation.
- Signals: error recovery, fallback strategies, keeping track of provenance before acting.
Scoring without overengineering
- Binary success/failure is the scoreboard.
- Secondary metrics: steps taken, tool calls, wall-clock time, error count.
- No LLM judges, no vibes, no composite scores you can’t reason about.
Interface sketch
Agents run exactly like they would in production: provide an agent, pick a task, respect the budget.
agent-bench run \
--agent agents/toy_agent.py \
--task filesystem_hidden_config@1 \
--seed 42
Each task ships with a harness, fake environment, and validator. Agents only see what they’re allowed to see.
Why this matters (and what’s missing today)
Most agent benchmarks collapse back into single-prompt exams. They rarely measure recovery, operational competence, or whether the agent can survive unattended. TraceCore surfaces engineering-quality differences and rewards boring-but-correct behavior.
Potential pitfalls & guardrails
- Overfitting to the harness → Keep suites varied, publish fixtures, encourage new contributions.
- Agents cheating via inspection → Sandbox aggressively, freeze binaries, limit visibility.
- Benchmark drift → Freeze task versions, publish hashes/seeded assets, require changelog entries.
What’s in v0
Task suites:
- Filesystem & State
- Tool Choreography
- Long-Horizon & Monitoring
- Adversarial-but-Fair
- Operations & Triage
Shipping tasks:
- filesystem_hidden_config@1 (filesystem suite): explore a hidden directory tree to find the one true API_KEY.
- rate_limited_api@1 (api suite): classify API errors, respect retry_after, and persist the returned ACCESS_TOKEN.
- log_alert_triage@1 (operations suite): triage deterministic logs and extract the final ALERT_CODE.
- config_drift_remediation@1 (operations suite): compare desired vs. live configs and output the remediation patch line.
- incident_recovery_chain@1 (operations suite): follow a recovery handoff chain to recover RECOVERY_TOKEN.
- log_stream_monitor@1 (operations suite): poll a paginated log stream, ignore noise, and emit STREAM_CODE when a CRITICAL entry is detected.
Each task:
- Defines an initial environment
- Exposes a constrained action interface
- Has a single, deterministic success condition
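The deterministic success condition can be pictured as a pure function over the final environment state — no judge in the loop, no sampling. The field names below are illustrative, not the real validator contract:

```python
def validate(final_state: dict, expected: dict) -> bool:
    """Sketch of a task validator: success is a pure, deterministic function
    of the final environment state. `submitted_answer` and `answer` are
    hypothetical field names for illustration only."""
    return final_state.get("submitted_answer") == expected["answer"]
```

Because the check is deterministic, the same agent + task + seed triple always yields the same verdict, which is what makes frozen versions useful for regression detection.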
How it works
You provide any agent that implements the documented interface. We provide a task harness. The agent runs until:
- It succeeds
- It fails
- It runs out of budget
No human in the loop. No retries.
Example
agent-bench run \
--agent agents/toy_agent.py \
--task filesystem_hidden_config@1 \
--seed 42
# Replay a prior run_id (defaults to recorded agent/task/seed, but you can override):
agent-bench run --replay <run_id> --seed 42
Configuration via agent-bench.toml
Rather not repeat --agent, --task, and --seed every time? Drop a config file in the repo root (or pass --config path/to/file). Set AGENT_BENCH_CONFIG=agent-bench.toml in CI (and any automation) so the same defaults apply everywhere.
[defaults]
agent = "agents/toy_agent.py"
task = "filesystem_hidden_config@1"
seed = 42
[agent."agents/rate_limit_agent.py"]
task = "rate_limited_api@1"
seed = 11
The CLI resolves flags first, then per-agent overrides, then the [defaults] block. Any command accepts --config to point at another file; otherwise agent-bench.toml (or agent_bench.toml) is used when present or when AGENT_BENCH_CONFIG is set.
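That precedence order amounts to a layered dictionary merge. A minimal sketch of the rule (not the actual loader implementation):

```python
def resolve_run_config(cli_flags: dict, per_agent: dict, defaults: dict) -> dict:
    """Merge run settings in the precedence order described above:
    explicit CLI flags win, then the [agent."..."] override block,
    then the [defaults] block. None values are treated as 'not set'."""
    merged = dict(defaults)
    merged.update({k: v for k, v in per_agent.items() if v is not None})
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged
```

With the example config above, running agents/rate_limit_agent.py without flags would resolve to seed 11 from its override block, while an explicit --seed on the command line wins over both.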
If agent-bench isn’t on your PATH yet, call it via Python:
python -m agent_bench.cli --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
Every CLI run writes a JSON artifact under .agent_bench/runs/<run_id>.json. Inspect them directly, or list them via:
agent-bench runs list --limit 5
Want to zero in on a specific outcome? Use the structured failure taxonomy filter:
agent-bench runs list --failure-type timeout --limit 5
agent-bench runs list --failure-type success --limit 5 # only successful runs
The same buckets surface in the Web UI's Recent Runs list, where each entry is labeled Success or Failure — <type> so you can spot budget exhaustion vs. invalid actions at a glance.
Need a quick aggregate of how an agent performs on a task? Use the baseline helper:
agent-bench baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1
It emits success rate, average steps/tool calls, and links back to the latest trace for that agent/task pair. Add --export to persist a frozen snapshot for the web UI:
agent-bench baseline --export # writes .agent_bench/baselines/baseline-<ts>.json
agent-bench baseline --export latest # custom filename in the baselines folder
Compare two specific runs (paths or run_ids) to see exactly where traces diverge:
agent-bench baseline --compare .agent_bench/runs/run_a.json .agent_bench/runs/run_b.json
# or mix path + run_id
agent-bench baseline --compare abcd1234 efgh5678
The diff output highlights whether the agent/task/success states match and lists per-step differences.
Use --format text for a quick human summary; exit codes are 0 (identical), 1 (different), 2 (incompatible task/agent).
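The exit-code convention can be sketched as a small pure function. The field names (task_id, agent, trace) are assumptions about the run artifact, and this is not the real diff implementation:

```python
def compare_runs(run_a: dict, run_b: dict) -> int:
    """Mirror the documented exit codes: 2 if the runs target a different
    task/agent pair (incompatible), 1 if compatible traces diverge,
    0 if they are identical."""
    if (run_a.get("task_id"), run_a.get("agent")) != (run_b.get("task_id"), run_b.get("agent")):
        return 2
    return 0 if run_a.get("trace") == run_b.get("trace") else 1
```

Treating the result as a shell exit code lets CI gate on `0` without parsing any diff output.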
For CI usage, see docs/ci_workflow.md.
This repo also ships a chain-agent-baseline workflow wired to agents/chain_agent.py + rate_limited_chain@1.
The Baselines tab in the UI only shows a "Latest published" card after you export at least once.
Minimal Web UI (Optional)
Prefer sliders and buttons over the CLI? Spin up the lightweight FastAPI form:
pip install tracecore
agent-bench dashboard --host 127.0.0.1 --port 8000 --reload
--reload is for local development only. It enables uvicorn's auto-reload on file changes and should not be used in shared or production environments. Omit the flag for stable serving.
Tip: create a virtual environment first (e.g., python -m venv .venv && .venv\Scripts\activate on Windows) so the FastAPI deps stay isolated. See the official FastAPI installation guide for more platform-specific options: https://fastapi.tiangolo.com/#installation
Then visit http://localhost:8000 to:
- Pick any agent module under agents/
- Choose a task (filesystem_hidden_config@1, rate_limited_api@1, etc.) and seed
- Launch runs, inspect structured JSON results (seed included), and drill into traces
- Replay a prior run by pasting its run_id and optionally overriding the seed/agent/task
The UI intentionally ships with no Node/Vite stack—just FastAPI + Jinja—so you can layer more elaborate frontends later without losing the minimal flow.
Output:
{
"task_id": "filesystem_hidden_config",
"version": 1,
"seed": 42,
"success": true,
"failure_reason": null,
"failure_type": null,
"steps_used": 37,
"tool_calls_used": 12
}
Diagnostics workflow
- Run & persist — both the CLI and the web UI call the same harness and automatically persist artifacts under .agent_bench/runs/ with metadata (run_id, trace_id, timestamps, harness version, trace entries).
- Inspect traces — load http://localhost:8000/?trace_id=<run_id> to jump straight to the trace viewer, or fetch raw JSON via /api/traces/<run_id>.
- Compare outcomes — use agent-bench baseline ... or the UI baseline table to spot regressions (success rate, average steps/tool calls) before publishing results.
- Freeze specs — once a run set looks good, tag the task versions + harness revision so those run IDs remain reproducible proof of behavior.
- Manual verification — before freezing or sharing results, run through docs/manual_verification.md to replay the CLI + UI flows end-to-end.
To inspect a specific run artifact directly, use:
agent-bench runs list --limit 5
# then load the JSON artifact from .agent_bench/runs/<run_id>.json
Release process
Ready to cut a release? See docs/release_process.md for the standard checklist (changelog, version stamping, test gate, SPEC_FREEZE alignment, trust evidence bundle, and tagging steps). Historical release notes are also archived there.
What we measure
Per task:
- Success / failure
- Steps taken
- Tool calls
- Error count
Across a suite:
- Success rate
- Aggregate efficiency metrics
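Suite-level numbers fall directly out of the per-run JSON artifacts (shaped like the output shown earlier). A minimal sketch, assuming the steps_used / tool_calls_used fields from that artifact:

```python
def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run artifacts into suite-level metrics: success rate
    plus average steps and tool calls. Returns zeros for an empty suite."""
    n = len(runs)
    if n == 0:
        return {"success_rate": 0.0, "avg_steps": 0.0, "avg_tool_calls": 0.0}
    return {
        "success_rate": sum(1 for r in runs if r.get("success")) / n,
        "avg_steps": sum(r.get("steps_used", 0) for r in runs) / n,
        "avg_tool_calls": sum(r.get("tool_calls_used", 0) for r in runs) / n,
    }
```

This is the kind of aggregate agent-bench baseline reports; because there are no weighted composites, each number remains directly interpretable.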
See SPEC_FREEZE.md for the frozen v0.1.0 task list (including the new rate_limited_chain@1 pain task) and the rules for bumping versions.
We deliberately avoid:
- LLM-based judges
- Natural language grading
- Weighted composite scores
Reference agent
TraceCore ships with a minimal reference agent. It is:
- Conservative
- State-driven
- Explicit about errors
- Boring on purpose
If your agent can’t outperform the reference agent, that’s a signal.
Reference implementations:
- agents/toy_agent.py — solves filesystem discovery tasks.
- agents/rate_limit_agent.py — handles classic rate-limit retry flows (rate_limited_api@1).
- agents/chain_agent.py — completes the chained handshake + rate-limit pain task (rate_limited_chain@1).
- agents/ops_triage_agent.py — handles operations triage tasks (log_alert_triage@1, config_drift_remediation@1, incident_recovery_chain@1).
- agents/cheater_agent.py — intentionally malicious "cheater sim" that tries to read hidden state; the sandbox should block it with a sandbox_violation so you can prove the harness defenses work.
Adding a task
Tasks are small and self-contained, but every bundled scenario now flows through a manifest so registry + docs stay aligned.
Bundled manifest
- tasks/registry.json enumerates every built-in task (filesystem_hidden_config@1, rate_limited_api@1, rate_limited_chain@1, deterministic_rate_service@1, log_alert_triage@1, config_drift_remediation@1, incident_recovery_chain@1).
- Update that list whenever you add new operations tasks.
- When you add or bump a task version, update this manifest, SPEC_FREEZE, and the docs table in docs/tasks.md.
Plugin workflow
- External packages can expose tasks without living in this repo via the agent_bench.tasks entry-point group.
- See docs/task_plugin_template.md for a ready-to-copy layout, entry-point snippet, and register() helper contract.
- The loader automatically merges bundled manifest entries and plugin descriptors, so agent-bench run --task my_plugin_task@1 works once the package is installed.
- Validate task manifests/registry entries with agent-bench tasks validate before publishing plugins or bumping versions.
Task requirements
- Environment setup (setup.py)
- Available actions/tools (actions.py)
- Validator (validate.py)
- Budget defaults + metadata (task.toml)
- Contract fields defined in docs/contract_spec.md
If your task:
- Requires internet access
- Needs a GPU
- Takes minutes to run
It probably doesn’t belong here.
Non-goals
TraceCore does not aim to:
- Benchmark raw language quality
- Measure creativity
- Replace SWE-bench or Terminal Bench
- Simulate the real world
It tests operational competence, nothing more.
Status
This project is early and opinionated. Expect:
- Breaking changes
- Small task suites
- Strong opinions
If you disagree, open an issue—or better, a PR.
One-line summary: Terminal Bench, but for agents that actually have to do things.