A lightweight benchmark for action-oriented agents.

TraceCore (Agent Bench CLI)

TraceCore is a lightweight benchmark for action-oriented agents, inspired by the OpenClaw style (planner loops, tool APIs, partial observability) but open to any implementation that satisfies the harness.

TraceCore evaluates whether an agent can operate, not just reason. No LLM judges. No vibes. No giant simulators.

Brand note: TraceCore is the product name; the CLI/package and commands remain agent-bench for backward compatibility.

Core definition: see docs/core.md for the Deterministic Episode Runtime primitive and invariant contracts.

If your agent can survive this benchmark, it can probably survive production.

Quick links

Colab quickstart notebook — run TraceCore end-to-end in the browser (install, sample agent, live trace output)
Deterministic Episode Runtime spec (docs/core.md)
Task registry & spec freeze
Release process & historical notes
Troubleshooting
Manual verification checklist

Install TraceCore

Use case	Command	Notes
Stable CLI (recommended)	`pip install tracecore`	Adds `agent-bench` to your PATH.
uv users	`uv pip install tracecore`	Same artifact, faster resolver.
pipx / uv tool	`pipx install tracecore` or `uv tool install tracecore`	Creates isolated shim in `%USERPROFILE%\.local\bin`.
Development	`git clone https://github.com/justindobbs/Tracecore && cd Tracecore && python -m venv .venv && .venv\Scripts\activate && pip install -e .[dev]`	Keeps CLI + tasks live-edited. The activation step shown is for Windows; on macOS/Linux use `source .venv/bin/activate` instead.

Windows-specific install guidance (PATH, ExecutionPolicy, uv tool shims) lives in docs/troubleshooting.md#windows.

Alias the CLI if you prefer tracecore:

Set-Alias tracecore agent-bench      # PowerShell profile
doskey tracecore=agent-bench $*      # cmd
alias tracecore='agent-bench'        # Bash/Zsh

Feature highlights

Capability	Why it matters
Deterministic Episode Runtime	Every task freezes its environment, action schema, budgets, and validator, so a `run_id` is reproducible proof of behavior. See `docs/core.md`.
Sandboxed tasks	Task manifests declare filesystem roots + network hosts, enforced by GuardedEnv and surfaced in IO audits.
Binary scoring + telemetry	Success/failure is the headline; secondary metrics (steps, tool calls, IO audits, validator payloads) keep regressions obvious.
Minimal stack	Python-only harness + FastAPI dashboard. No Node build tooling, no external services. Runs in seconds on a laptop.
CLI & Web UI parity	`agent-bench` commands, dashboard, and APIs all call the same runner, so automation matches what maintainers see.
Extensible registry	Built-in tasks live beside plugin tasks discovered via the `agent_bench.tasks` entry point group.
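To make the sandboxing row concrete, here is a minimal sketch of an allowlist check for filesystem roots and network hosts. It is an illustration only and not GuardedEnv's implementation; the function names `is_path_allowed` and `is_host_allowed` are hypothetical.

```python
from pathlib import Path


def is_path_allowed(candidate, allowed_roots):
    """Return True if `candidate` resolves to a location inside an allowed root.

    Hypothetical helper for illustration; not part of the agent-bench API.
    Resolving first collapses `..` segments and symlinks, so a path cannot
    escape a root by climbing out of it.
    """
    resolved = Path(candidate).resolve()
    for root in allowed_roots:
        root_resolved = Path(root).resolve()
        if resolved == root_resolved or root_resolved in resolved.parents:
            return True
    return False


def is_host_allowed(host, allowed_hosts):
    """Return True if `host` exactly matches an allowlisted host (case-insensitive)."""
    return host.lower() in {h.lower() for h in allowed_hosts}
```

An exact-match host check is the strictest choice; whether the real manifests support wildcards or ports is not stated here, so the sketch does not assume either.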

TraceCore evaluates planner loops, not single prompts: tool sequencing, retry logic, state tracking, and boring-but-correct behavior under budgets.
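As a rough picture of what "boring-but-correct behavior under budgets" can look like, the sketch below runs a tiny retry loop against a deterministic toy service. Everything in it (`FakeAPI`, `run_agent`, the result dictionaries) is hypothetical and does not reflect the actual agent interface, which is documented in docs/agents.md.

```python
from dataclasses import dataclass


@dataclass
class FakeAPI:
    """Deterministic toy service: calls succeed only once `ready_at` ticks have passed."""

    ready_at: int = 3

    def fetch(self, tick):
        if tick < self.ready_at:
            return {"ok": False, "error": "rate_limited", "retry_after": self.ready_at - tick}
        return {"ok": True, "token": "toy-token"}


def run_agent(api, max_steps=10):
    """Call the API until it succeeds or the step budget is exhausted."""
    tick = 0   # simulated clock, advanced only by waiting
    steps = 0  # budgeted tool calls
    while steps < max_steps:
        steps += 1
        result = api.fetch(tick)
        if result["ok"]:
            return {"success": True, "steps": steps, "token": result["token"]}
        if result["error"] == "rate_limited":
            # Classify the error and wait exactly as long as the service asks.
            tick += result["retry_after"]
        else:
            # Unknown errors are treated as fatal rather than retried blindly.
            return {"success": False, "steps": steps, "token": None}
    return {"success": False, "steps": steps, "token": None}
```

Because the toy service has no randomness, the same inputs always produce the same step count, which mirrors the idea of a binary outcome with step counts as secondary telemetry.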

Quick start commands

```bash
# Run a known-good pairing
agent-bench run pairing log_stream_monitor
agent-bench run pairing log_stream_monitor --seed 7
```

See all available pairings:

```bash
agent-bench run pairing --list
agent-bench run pairing --all --timeout 120

# Run explicit agent + task
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42

# Launch the interactive wizard
agent-bench interactive --dry-run --save-session

# Launch the dashboard (add --reload for auto-reload during local development)
agent-bench dashboard
agent-bench dashboard --reload

# Summaries & baselines
agent-bench runs summary --task log_stream_monitor@1 --limit 10
agent-bench baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1 --export latest

# Scaffold a new agent
agent-bench new-agent my_agent

# Maintainer helper (pytest + task validation)
agent-bench maintain
```

Need a turnkey example? See examples/simple_agent_demo for a self-contained CLI, or docs/pydantic_poc.md for the deterministic dice-game walkthrough.

Task suites & signals

Frozen tasks live in SPEC_FREEZE.md. Current operations-focused suites:

Task	Suite	Goal	Signals
`filesystem_hidden_config@1`	Filesystem	Discover the one true config key without wrecking the tree.	Selective exploration, state recall.
`rate_limited_api@1`	API	Navigate a deterministic rate limit + transient errors to fetch `ACCESS_TOKEN`.	Retry pacing, error classification.
`rate_limited_chain@1`	API pain task	Multi-stage handshake + rate limit.	Sequencing, dependency tracking.
`deterministic_rate_service@1`	API	Deterministic payload parsing + rate-limits.	Budget management, payload validation.
`log_alert_triage@1`	Operations	Triage noisy logs to recover `ALERT_CODE`.	Signal detection, tool economy.
`config_drift_remediation@1`	Operations	Compare desired vs. live config and emit the remediation patch.	Diffing discipline, precise edits.
`incident_recovery_chain@1`	Operations	Follow a hand-off chain to recover `RECOVERY_TOKEN`.	Long-horizon reasoning, state carry-over.
`log_stream_monitor@1`	Operations	Poll paginated logs, ignore noise, emit `STREAM_CODE`.	Patience, trigger detection.

Every task ships with a harness (setup.py, actions.py, validate.py, task.toml), published hashes, and budgets. Success is binary; steps/tool calls/IO audits provide color.
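The four harness files named above lend themselves to a quick presence check before running the real validator. The sketch below is a hypothetical pre-flight helper, not a replacement for `agent-bench tasks validate --registry`; it only confirms the files exist and says nothing about their contents.

```python
from pathlib import Path

# File names taken from the harness description above.
HARNESS_FILES = ("setup.py", "actions.py", "validate.py", "task.toml")


def missing_harness_files(task_dir):
    """Return the harness file names that are absent from `task_dir`.

    Hypothetical helper for illustration; an empty list means all four
    files are present.
    """
    root = Path(task_dir)
    return [name for name in HARNESS_FILES if not (root / name).is_file()]
```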

Architecture & artifacts

Agent script  ──▶  Runner (GuardedEnv, budgets, validator)
                      │
                      ├─► IO audit + action trace (JSON)
                      ├─► Baseline exports (.agent_bench/baselines)
                      └─► FastAPI dashboard + REST APIs

CLI (agent-bench) — runs agents, validates tasks, exports baselines, maintains the repo.
Runner — enforces budgets, sandbox allowlists, structured failure taxonomy.
Artifacts — .agent_bench/runs/<run_id>.json (ground truth) + optional baseline-<ts>.json for UI compare views.
APIs — /api/pairings, /api/traces/{run_id}?include_io=true, /api/ledger are typed via Pydantic models.
Dashboard — Jinja templates plus FastAPI endpoints; no Node build. Upload a run_id to replay, compare baselines, or visualize IO audits.
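For scripted access to the trace endpoint listed above, a small URL builder keeps the `run_id` safely encoded. This is a sketch: the base URL is an assumption (the host and port of the dashboard are not stated here), and `trace_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode, urljoin


def trace_url(base_url, run_id, include_io=True):
    """Build the /api/traces/{run_id} URL, optionally with include_io=true.

    `base_url` is an assumption supplied by the caller; substitute whatever
    address your dashboard is actually served from.
    """
    path = "/api/traces/" + quote(run_id, safe="")
    url = urljoin(base_url, path)
    if include_io:
        url += "?" + urlencode({"include_io": "true"})
    return url
```

The resulting URL can be passed to any HTTP client; the response shape is defined by the project's Pydantic models and is not reproduced here.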

Baseline diffs (agent-bench baseline --compare run_a run_b) highlight where traces diverge. For CI workflows, see docs/ci_workflow.md.
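The idea of finding where two traces diverge can be sketched in a few lines. This is not the algorithm behind `agent-bench baseline --compare`; it assumes, for illustration only, that each trace is a plain list of comparable step records.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so a shorter trace counts as diverging at its end


def first_divergence(trace_a, trace_b):
    """Return the index of the first step where the traces differ, or None if identical.

    Hypothetical helper; the step record format here is an assumption.
    """
    pairs = zip_longest(trace_a, trace_b, fillvalue=_MISSING)
    for index, (step_a, step_b) in enumerate(pairs):
        if step_a != step_b:
            return index
    return None
```

Two runs that agree on a shared prefix but differ in length diverge at the length of the shorter trace, which is usually the most useful place to start reading.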

Web dashboard snapshot

Launch runs via forms or quick-pick pairings.
Drill into traces, budget usage, validator payloads, IO audit summaries.
Filter baselines and recent runs; download artifacts directly.
Enable --reload only during local dev (uvicorn auto-reload). For long-lived servers, omit the flag.

All dashboard actions have CLI equivalents so you can automate the same flows.

Build or extend TraceCore

Write agents

Scaffold via agent-bench new-agent my_agent (columnar docstrings, budget guards baked in).
Interface contract lives in docs/agents.md and docs/task_harness.md.
Reference agents: toy_agent.py, rate_limit_agent.py, chain_agent.py, ops_triage_agent.py, cheater_agent.py (sandbox violation test).

Add tasks

Built-in tasks register through tasks/registry.json; update it plus docs/tasks.md and SPEC_FREEZE.md when bumping versions.
Plugin pathway: publish a package exposing agent_bench.tasks entry points. Template lives in docs/task_plugin_template.md.
Every task must include setup/actions/validator files, budgets in task.toml, and pass agent-bench tasks validate --registry.

Troubleshooting & maintainer workflows

Install/CLI issues — docs/troubleshooting.md covers PATH fixes, validator errors, dashboard hiccups.
Task validation — agent-bench tasks validate --registry ensures manifests + registry stay in lockstep.
Maintainer helper — agent-bench maintain runs pytest + task validation and applies mechanical fixes.
Manual verification — Run through docs/manual_verification.md before freezing specs or publishing changelogs.

Task budgets are defined per task.toml and cannot be overridden at runtime—agents must respect the published constraints.

Releases & roadmap

Version metadata lives in pyproject.toml and agent_bench/webui/app.py (FastAPI banner).
Changelog is maintained in CHANGELOG.md; tags follow vX.Y.Z.
Release checklist: docs/release_process.md — changelog promotion, behavior verification, SPEC_FREEZE update, trust evidence bundle, tagging, publish.
Plan/shipping updates are captured in docs/project_positioning.md and the issue tracker.
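The `vX.Y.Z` tag convention is easy to check mechanically, for example in a release script. The helper below is a hypothetical sketch that accepts only three numeric components; whether the project ever uses pre-release suffixes is not stated, so they are rejected here.

```python
import re

_TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag):
    """Return (major, minor, patch) for a vX.Y.Z tag, or None if it does not match."""
    match = _TAG_PATTERN.fullmatch(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())
```

Parsing into integer tuples also gives correct ordering, so `sorted(tags, key=parse_release_tag)` places `v0.9.8` before `v1.0.0` for any list of valid tags.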

TraceCore is intentionally opinionated and evolving. Expect additive task suites, sandbox refinements, and runner upgrades—documented via CHANGELOG + SPEC_FREEZE.

License & acknowledgments

TraceCore (Agent Bench CLI) is MIT licensed. If you ship improvements (new tasks, agents, dashboard tweaks), open a PR or publish them as plugins. If you disagree with the assumptions, that’s fine: the benchmark is small enough to fork, but contributions that improve determinism, coverage, or ergonomics are always welcome.

One-line summary: Terminal Bench energy, but for agents that actually have to do things.

Release history

Version	Released
1.1.3	Mar 16, 2026
1.1.2	Mar 6, 2026
1.1.1	Mar 5, 2026
1.1.0	Mar 4, 2026
1.0.1	Mar 3, 2026
1.0.0	Feb 28, 2026
0.9.8	Feb 27, 2026
0.9.7	Feb 27, 2026
0.9.6	Feb 26, 2026
0.9.5	Feb 26, 2026
0.9.4	Feb 26, 2026
0.9.3 (this version)	Feb 25, 2026
0.9.2	Feb 25, 2026
0.9.1	Feb 24, 2026
0.9.0	Feb 24, 2026

Download files

File	Type	Size	Uploaded	Tags
`tracecore-0.9.3.tar.gz`	Source distribution	135.2 kB	Feb 25, 2026	Source
`tracecore-0.9.3-py3-none-any.whl`	Built distribution	141.2 kB	Feb 25, 2026	Python 3

Both files were uploaded via uv/0.10.4 (`uv publish`). Uploaded using Trusted Publishing? No.

Hashes for tracecore-0.9.3.tar.gz
Algorithm	Hash digest
SHA256	`d93e55baaf7df6c86736d596601f01a51b7050041c525e12f96a622f46d1e1de`
MD5	`a47d82692d733a766eee710dc6e4fc9f`
BLAKE2b-256	`7e80cbf4fb00e65679cc8517c845683ebaa6089ad76cb18caf10f6a4aec930a8`

Hashes for tracecore-0.9.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`89d19d2c78b325966b396de9fd3898b299e708664aae61c4699acda1038efcdf`
MD5	`d85d55bb13724cfd0d67ed11003395d5`
BLAKE2b-256	`3f0163f5bad613b0d0b2bdf03057497d8d6662b9fb601fb8b0eebe413b12ce3d`
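A downloaded file can be checked against the published SHA256 digests with the standard library. The sketch below hard-codes the source distribution digest listed above; the helper names and the local file path are assumptions.

```python
import hashlib

# SHA256 digest published above for tracecore-0.9.3.tar.gz.
EXPECTED_SDIST_SHA256 = "d93e55baaf7df6c86736d596601f01a51b7050041c525e12f96a622f46d1e1de"


def sha256_of_file(path, chunk_size=1 << 16):
    """Hash a file in fixed-size chunks so large downloads never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SDIST_SHA256):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_published_hash("tracecore-0.9.3.tar.gz")` would be the check to run after downloading the sdist into the current directory; a `False` result means the file should not be installed.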
