A lightweight benchmark for action-oriented agents.
TraceCore (Agent Bench CLI)
TraceCore is a lightweight benchmark for action-oriented agents. It is inspired by the OpenClaw style (planner loops, tool APIs, partial observability) but open to any implementation that satisfies the harness.
TraceCore evaluates whether an agent can operate, not just reason. No LLM judges. No vibes. No giant simulators.
Brand note: TraceCore is the product name; the CLI/package and commands remain `agent-bench` for backward compatibility.
Core definition: see docs/core.md for the Deterministic Episode Runtime primitive and invariant contracts.
If your agent can survive this benchmark, it can probably survive production.
Quick links
- Google Colab Example — hosted copy ready to run without cloning the repo
- Deterministic Episode Runtime spec (`docs/core.md`)
- Task registry & spec freeze
- Release process & historical notes
- Troubleshooting
- Manual verification checklist
Install TraceCore
| Use case | Command | Notes |
|---|---|---|
| Stable CLI (recommended) | `pip install tracecore` | Adds `agent-bench` to your PATH. |
| uv users | `uv pip install tracecore` | Same artifact, faster resolver. |
| pipx / uv tool | `pipx install tracecore` or `uv tool install tracecore` | Creates an isolated shim in `%USERPROFILE%\.local\bin`. |
| Development | `git clone https://github.com/justindobbs/Tracecore && cd Tracecore && python -m venv .venv && .venv\Scripts\activate && pip install -e .[dev]` | Keeps CLI + tasks live-edited. |
Windows-specific install guidance (PATH, ExecutionPolicy, uv tool shims) lives in docs/troubleshooting.md#windows.
Alias the CLI if you prefer tracecore:
- PowerShell (add to your profile): `Set-Alias tracecore agent-bench`
- cmd: `doskey tracecore=agent-bench $*`
- Bash/Zsh: `alias tracecore='agent-bench'`
Feature highlights
| Capability | Why it matters |
|---|---|
| Deterministic Episode Runtime | Every task freezes its environment, action schema, budgets, and validator, so a run_id is reproducible proof of behavior. See docs/core.md. |
| Sandboxed tasks | Task manifests declare filesystem roots + network hosts, enforced by GuardedEnv and surfaced in IO audits. |
| Binary scoring + telemetry | Success/failure is the headline; secondary metrics (steps, tool calls, IO audits, validator payloads) keep regressions obvious. |
| Minimal stack | Python-only harness + FastAPI dashboard. No Node build tooling, no external services. Runs in seconds on a laptop. |
| CLI & Web UI parity | agent-bench commands, dashboard, and APIs all call the same runner, so automation matches what maintainers see. |
| Extensible registry | Built-in tasks live beside plugin tasks discovered via the agent_bench.tasks entry point group. |
TraceCore evaluates planner loops, not single prompts: tool sequencing, retry logic, state tracking, and boring-but-correct behavior under budgets.
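To make the reproducibility claim concrete, here is a minimal sketch of how a seeded episode can yield a stable trace fingerprint. It is not TraceCore's runtime: `run_episode` and `trace_digest` are hypothetical names, the observation/action shapes are invented for illustration, and the real contracts live in docs/core.md.

```python
import hashlib
import json
import random
from typing import Dict, List


def run_episode(seed: int, steps: int = 5) -> List[Dict[str, object]]:
    """Hypothetical episode: every source of randomness flows from one seed."""
    rng = random.Random(seed)  # isolated RNG, never the global one
    trace = []
    for step in range(steps):
        # Stand-in for "environment emits an observation, agent picks an action".
        observation = rng.randint(0, 9)
        action = "read" if observation % 2 == 0 else "wait"
        trace.append({"step": step, "observation": observation, "action": action})
    return trace


def trace_digest(trace: List[Dict[str, object]]) -> str:
    """Hash canonical JSON (sorted keys, fixed separators) so equal traces hash equally."""
    payload = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = trace_digest(run_episode(seed=7))
    second = trace_digest(run_episode(seed=7))
    print(first == second)  # True: same seed, same trace, same digest
```

The point of the sketch is that determinism only holds if nothing in the episode reads the wall clock, the global RNG, or unordered collections before serialization.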
Quick start commands
```bash
# Run a known-good pairing
agent-bench run pairing log_stream_monitor
agent-bench run pairing log_stream_monitor --seed 7
```
See all available pairings:
```bash
agent-bench run pairing --list
agent-bench run pairing --all --timeout 120
# Run explicit agent + task
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
# Launch the interactive wizard
agent-bench interactive --dry-run --save-session
# Launch the dashboard
agent-bench dashboard
# or
agent-bench dashboard --reload
# Summaries & baselines
agent-bench runs summary --task log_stream_monitor@1 --limit 10
agent-bench baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1 --export latest
# Scaffold a new agent
agent-bench new-agent my_agent
# Maintainer helper (pytest + task validation)
agent-bench maintain
```
Need a turnkey example? See examples/simple_agent_demo for a self-contained CLI, or docs/pydantic_poc.md for the deterministic dice-game walkthrough.
Task suites & signals
Frozen tasks live in SPEC_FREEZE.md. Current operations-focused suites:
| Task | Suite | Goal | Signals |
|---|---|---|---|
| `filesystem_hidden_config@1` | Filesystem | Discover the one true config key without wrecking the tree. | Selective exploration, state recall. |
| `rate_limited_api@1` | API | Navigate a deterministic rate limit + transient errors to fetch `ACCESS_TOKEN`. | Retry pacing, error classification. |
| `rate_limited_chain@1` | API pain task | Multi-stage handshake + rate limit. | Sequencing, dependency tracking. |
| `deterministic_rate_service@1` | API | Deterministic payload parsing + rate-limits. | Budget management, payload validation. |
| `log_alert_triage@1` | Operations | Triage noisy logs to recover `ALERT_CODE`. | Signal detection, tool economy. |
| `config_drift_remediation@1` | Operations | Compare desired vs. live config and emit the remediation patch. | Diffing discipline, precise edits. |
| `incident_recovery_chain@1` | Operations | Follow a hand-off chain to recover `RECOVERY_TOKEN`. | Long-horizon reasoning, state carry-over. |
| `log_stream_monitor@1` | Operations | Poll paginated logs, ignore noise, emit `STREAM_CODE`. | Patience, trigger detection. |
| `runbook_verifier@1` | Operations | Verify runbook phase execution order and emit `RUNBOOK_CHECKSUM`. | Ordering discipline, multi-artifact stitching. |
| `sandboxed_code_auditor@1` | Operations | Audit sandbox source + logs to emit `ISSUE_ID\|AUDIT_CODE`. | Scoped reads, multi-source extraction. |
Every task ships with a harness (setup.py, actions.py, validate.py, task.toml), published hashes, and budgets. Success is binary; steps/tool calls/IO audits provide color.
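Signals such as "retry pacing" and "error classification" are easier to picture with a toy. The sketch below uses a fake, deterministic API that rejects its first few calls. All names (`FakeApi`, `RateLimited`, `FatalError`, `fetch_with_budget`) are hypothetical and do not reflect the action schema of `rate_limited_api@1`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class RateLimited(Exception):
    """Retryable: the caller should wait `retry_after` ticks before retrying."""

    def __init__(self, retry_after: int) -> None:
        super().__init__(f"rate limited, retry after {retry_after}")
        self.retry_after = retry_after


class FatalError(Exception):
    """Non-retryable: retrying would only burn budget."""


@dataclass
class FakeApi:
    """Deterministic stand-in: rejects the first `limited_calls` calls."""

    limited_calls: int = 2
    calls: int = 0

    def fetch_token(self) -> str:
        self.calls += 1
        if self.calls <= self.limited_calls:
            raise RateLimited(retry_after=1)
        return "ACCESS_TOKEN-demo"


def fetch_with_budget(api: FakeApi, max_calls: int) -> Tuple[Optional[str], int]:
    """Return (token or None, ticks waited) without exceeding max_calls."""
    waited = 0
    for _ in range(max_calls):
        try:
            return api.fetch_token(), waited
        except RateLimited as err:
            waited += err.retry_after  # pace the retry instead of hammering
        except FatalError:
            break  # classified as permanent: stop spending budget
    return None, waited
```

With `FakeApi(limited_calls=2)` and `max_calls=5`, the call succeeds on the third attempt after waiting two ticks; with a budget of two calls it gives up and returns `None`, which a binary validator would score as failure.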
Architecture & artifacts
Agent script ──▶ Runner (GuardedEnv, budgets, validator)
│
├─► IO audit + action trace (JSON)
├─► Baseline exports (.agent_bench/baselines)
└─► FastAPI dashboard + REST APIs
- CLI (`agent-bench`): runs agents, validates tasks, exports baselines, maintains the repo.
- Runner: enforces budgets, sandbox allowlists, structured failure taxonomy.
- Artifacts: `.agent_bench/runs/<run_id>.json` (ground truth) + optional `baseline-<ts>.json` for UI compare views.
- APIs: `/api/pairings`, `/api/traces/{run_id}?include_io=true`, `/api/ledger` are typed via Pydantic models.
- Dashboard: Jinja templates plus FastAPI endpoints; no Node build. Upload a run_id to replay, compare baselines, or visualize IO audits.
Baseline diffs (agent-bench baseline --compare run_a run_b) highlight where traces diverge. For CI workflows, see docs/ci_workflow.md.
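As an illustration of what "where traces diverge" can mean, the sketch below finds the first step at which two action traces differ. It is an assumption about the idea only: the output format of `agent-bench baseline --compare` is not described here, and `first_divergence` and the action names are hypothetical.

```python
from itertools import zip_longest
from typing import Any, List, Optional


def first_divergence(trace_a: List[Any], trace_b: List[Any]) -> Optional[int]:
    """Return the index of the first differing step, or None if the traces match."""
    missing = object()  # sentinel so a shorter trace counts as a divergence
    pairs = zip_longest(trace_a, trace_b, fillvalue=missing)
    for index, (step_a, step_b) in enumerate(pairs):
        if step_a != step_b:
            return index
    return None


if __name__ == "__main__":
    run_a = [{"action": "list_dir"}, {"action": "read_file"}, {"action": "submit"}]
    run_b = [
        {"action": "list_dir"},
        {"action": "read_file"},
        {"action": "read_file"},
        {"action": "submit"},
    ]
    print(first_divergence(run_a, run_b))  # 2
```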
Web dashboard snapshot
- Launch runs via forms or quick-pick pairings.
- Drill into traces, budget usage, validator payloads, IO audit summaries.
- Filter baselines and recent runs; download artifacts directly.
- Enable `--reload` only during local dev (uvicorn auto-reload). For long-lived servers, omit the flag.
All dashboard actions have CLI equivalents so you can automate the same flows.
Build or extend TraceCore
Write agents
- Scaffold via `agent-bench new-agent my_agent` (columnar docstrings, budget guards baked in).
- Interface contract lives in `docs/agents.md` and `docs/task_harness.md`.
- Reference agents: `toy_agent.py`, `rate_limit_agent.py`, `chain_agent.py`, `ops_triage_agent.py`, `cheater_agent.py` (sandbox violation test).
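For a sense of what a sandbox violation test guards against, here is a rough sketch of an allowlist check for filesystem roots and network hosts. It is not GuardedEnv's implementation; `is_path_allowed` and `is_host_allowed` are hypothetical helpers, and the matching rules (resolved-path containment, exact host match) are assumptions.

```python
from pathlib import Path
from typing import Iterable


def is_path_allowed(candidate: str, allowed_roots: Iterable[str]) -> bool:
    """True if `candidate` resolves to a location inside one of the allowed roots."""
    resolved = Path(candidate).resolve()  # collapses ".." and follows symlinks
    for root in allowed_roots:
        root_path = Path(root).resolve()
        if resolved == root_path or root_path in resolved.parents:
            return True
    return False


def is_host_allowed(host: str, allowed_hosts: Iterable[str]) -> bool:
    """Exact, case-insensitive match against the declared host allowlist."""
    return host.lower() in {allowed.lower() for allowed in allowed_hosts}
```

Resolving the path before comparing is the important design choice: a naive string-prefix check would accept `sandbox/../secrets.txt`, whereas the resolved form falls outside the `sandbox` root and is rejected.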
Add tasks
- Built-in tasks register through `tasks/registry.json`; update it plus `docs/tasks.md` and `SPEC_FREEZE.md` when bumping versions.
- Plugin pathway: publish a package exposing `agent_bench.tasks` entry points. Template lives in `docs/task_plugin_template.md`.
- Every task must include setup/actions/validator files, budgets in `task.toml`, and pass `agent-bench tasks validate --registry`.
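The plugin pathway relies on Python's standard entry point mechanism. The sketch below lists whatever is registered under the `agent_bench.tasks` group using `importlib.metadata` from the standard library. How TraceCore itself loads and validates these entries is not shown here, and `discover_task_plugins` is a hypothetical helper.

```python
from importlib.metadata import entry_points
from typing import Dict


def discover_task_plugins(group: str = "agent_bench.tasks") -> Dict[str, str]:
    """Map entry point names to their 'module:attr' targets for the given group."""
    try:
        found = entry_points(group=group)  # keyword selection, Python 3.10+
    except TypeError:
        found = entry_points().get(group, [])  # older dict-style return value
    return {entry.name: entry.value for entry in found}


if __name__ == "__main__":
    for name, target in sorted(discover_task_plugins().items()):
        print(f"{name} -> {target}")
```

On an environment with no plugin packages installed the mapping is simply empty. A plugin package would declare its entries under the same group name in its packaging metadata; the expected shape of each entry is covered by `docs/task_plugin_template.md`.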
Troubleshooting & maintainer workflows
- Install/CLI issues: `docs/troubleshooting.md` covers PATH fixes, validator errors, dashboard hiccups.
- Task validation: `agent-bench tasks validate --registry` ensures manifests + registry stay in lockstep.
- Maintainer helper: `agent-bench maintain` runs pytest + task validation and applies mechanical fixes.
- Manual verification: run through `docs/manual_verification.md` before freezing specs or publishing changelogs.
Task budgets are defined per task.toml and cannot be overridden at runtime—agents must respect the published constraints.
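One way to picture budgets that cannot be overridden at runtime is an immutable limits object plus a counter that fails fast. This is a sketch under stated assumptions: the field names `max_steps` and `max_tool_calls` are illustrative rather than the `task.toml` schema, and `BudgetGuard` is not the runner's actual class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Immutable limits; assigning to a field after creation raises an error."""

    max_steps: int
    max_tool_calls: int


class BudgetExceeded(RuntimeError):
    """Raised when an agent tries to spend more than the task allows."""


class BudgetGuard:
    """Counts usage against a frozen Budget and refuses to go past it."""

    def __init__(self, budget: Budget) -> None:
        self.budget = budget
        self.steps = 0
        self.tool_calls = 0

    def charge_step(self) -> None:
        if self.steps >= self.budget.max_steps:
            raise BudgetExceeded("step budget exhausted")
        self.steps += 1

    def charge_tool_call(self) -> None:
        if self.tool_calls >= self.budget.max_tool_calls:
            raise BudgetExceeded("tool-call budget exhausted")
        self.tool_calls += 1
```

An agent loop would call `charge_step()` at the top of each iteration and `charge_tool_call()` before each tool invocation, treating `BudgetExceeded` as a terminal failure rather than something to retry.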
Releases & roadmap
- Version metadata lives in `pyproject.toml` and `agent_bench/webui/app.py` (FastAPI banner).
- Changelog is maintained in `CHANGELOG.md`; tags follow `vX.Y.Z`.
- Release checklist: `docs/release_process.md` covers changelog promotion, behavior verification, SPEC_FREEZE update, trust evidence bundle, tagging, publish.
- Plan/shipping updates are captured in `docs/project_positioning.md` and the issue tracker.
TraceCore is intentionally opinionated and evolving. Expect additive task suites, sandbox refinements, and runner upgrades—documented via CHANGELOG + SPEC_FREEZE.
License & acknowledgments
TraceCore (Agent Bench CLI) is MIT Licensed. If you ship improvements (new tasks, agents, dashboard tweaks) open a PR or publish them as plugins. If you disagree with the assumptions, that’s fine: the benchmark is small enough to fork, but contributions that improve determinism, coverage, or ergonomics are always welcome.
One-line summary: Terminal Bench energy, but for agents that actually have to do things.
File details
Details for the file `tracecore-0.9.6.tar.gz`.
File metadata
- Download URL: tracecore-0.9.6.tar.gz
- Size: 141.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `ab91c3dc02c8db109ee182ef87aa8460bf4ec12fa775cf6b50e5477b495c3075` |
| MD5 | `4d3aa54715f705728e17bedb7ddc575d` |
| BLAKE2b-256 | `df2c522a6874d64509ba8212d07fc3be109db32a263f361acde0fb6d296c100f` |
File details
Details for the file `tracecore-0.9.6-py3-none-any.whl`.
File metadata
- Download URL: tracecore-0.9.6-py3-none-any.whl
- Size: 150.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `4a20ec17626915986c3f99ca412c2425ac3c1bfb94db7f23048dced3ad2baa23` |
| MD5 | `77628114386a592c652ec3c74ba76d39` |
| BLAKE2b-256 | `7412e65b95a5ad9cbd939489e005b752aedf0e1d5acfa310dd63257a6ac10be2` |
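The SHA256 digests listed above can be checked against a downloaded file with the standard library alone. This is a generic sketch rather than part of the `agent-bench` CLI; `sha256_of_file` and `matches_published_hash` are hypothetical helper names.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 16) -> str:
    """Stream the file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the local digest with a published hex digest, ignoring case and padding."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("tracecore-0.9.6.tar.gz", "<digest copied from the table>")` should return `True` for an intact download of that file.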