Skip to main content

Time-travel debugger for AI agents: bit-exact record/replay, fork any step, causal blame with confidence intervals.

Project description

tracefork

CI License: MIT Python 3.12+ Code style: ruff

A time-travel debugger for AI agents that doesn't just replay a failed run — it proves the replay is bit-for-bit real, lets you fork any step, and measures which step caused the failure, with confidence intervals.

tracefork report — timeline, exchange detail, and causal blame panel

The three-panel report: a run's timeline (left) with causal-blame badges, the request/response for the selected exchange (center), and the blame ranking with 95% confidence intervals (right). Generated offline, for $0, by examples/demo_report.py.


The idea

Every agent-observability tool shows you a trace and asks you to eyeball it. tracefork treats an agent run like a recording you can rewind, branch, and reason about causally:

  • Record every model call into a content-addressed tape at the HTTP seam of the Anthropic SDK, capturing the sources of nondeterminism (clock, ids) the agent reads.
  • Replay the tape bit-exact for $0 — every replayed request's body is sha256-checked against the tape, so it's proven identical, not asserted. (The matched surface is the request body; request headers such as anthropic-beta are out of scope — see Determinism boundary.) No network, no key.
  • Fork any step: swap in a different model response and let the same agent run forward from there. The unchanged prefix replays for free; only the new tail costs anything.
  • Blame: resample those forks across every step and rank each by its flip-rate — how often perturbing it changes the run's outcome — with Wilson score confidence intervals so a small sample can't masquerade as certainty.
  • Validate the instrument itself: inject faults with known root causes and confirm the blame engine fingers the right step. The engine is genuinely causal — it ranks whichever step actually flips the outcome #1, not a fixed slot — and across five injection mechanisms it hits 1.00 top-1 precision offline against a flat negative control (which is now enforced, not just printed). See Validation scope for exactly what that number does and doesn't claim.

That last pillar is the point: a debugger that claims to find root causes has to be held to ground truth. tracefork validate is that proof, and it runs in under a second with no API key.

Quickstart (offline, $0, no API key)

Python 3.12 via uv. Everything below is offline and makes no network calls.

uv sync --extra dev

# 1. The full offline test suite (65 tests).
uv run pytest -q

# 2. The instrument validates itself against injected, known-root-cause faults.
uv run tracefork validate

# 3. Generate the demo report shown above, then open it in any browser.
uv run python examples/demo_report.py
open examples/demo_report.html      # macOS; or just open the file

# 4. The original Spike 0 receipt: record → persist → replay → prove bit-exact.
uv run python -m tracefork_spike

tracefork validate prints:

  [PASS] corrupted_tool_output               top-1: 1.00
  [PASS] misleading_retrieval                top-1: 1.00
  [PASS] wrong_system_prompt                 top-1: 1.00
  [PASS] dropped_message                     top-1: 1.00
  [PASS] poisoned_argument                   top-1: 1.00

  overall top-1 precision: 1.00
  negative control max flip: 0.00 (threshold 0.30)

The CLI

uv run tracefork --help
Command What it does
replay <tape> --agent pkg.mod:fn Replay a tape and print the bit-exact verification receipt.
verify <tape> --agent pkg.mod:fn Verify replay; exit non-zero on drift (CI gate).
fork <run_id> --step N --response f --agent pkg.mod:fn Fork a run at step N with a mutated response; record the counterfactual branch.
blame <run_id> --agent pkg.mod:fn [--k 10] [--budget 5.0] Rank every step by causal flip-rate with 95% CIs (re-runs the agent; budget-capped).
report <run_id> | --tape <tape> -o out.html Render the self-contained three-panel HTML report.
serve [--store store.db] [--port 7777] Serve the live web UI (same-origin, 127.0.0.1).
validate [--k 3] [--n-runs 5] [--check] Run the fault-injection suite; --check gates against the committed report.

Replay, verify, fork, and the offline demos need no key. blame against a real run re-runs the agent's counterfactual tails against the live API, which is why it's budget-capped — the offline, $0 proof that blame works is tracefork validate.

How it works

The spine is a record/replay seam at the Anthropic SDK's httpx boundary plus a nondeterminism-virtualization seam the agent reads time and ids through. Bit-exactness is the contract between them.

  • transport.pyTraceforkTransport (sync) / AsyncTraceforkTransport (async). Record mode tees request+response bytes into the tape (buffering streaming SSE and plain JSON identically via .read()/.aread()); replay mode serves recorded bytes and sha256-asserts every request body matches the tape. A replay transport has no inner transport, so an unrecorded request is a hard error, never a silent network call.
  • tape.py — content-addressed (sha256) blobs + an ordered event log, persistable to SQLite, with a hash-chain digest() fingerprint.
  • nondet.pyNondetSource is the only way the agent gets time/ids; RecordingNondet logs real draws, ReplayNondet serves them back, DriftingNondet is the negative control. find_divergence() unwraps the DivergenceError the SDK buries inside an APIConnectionError so a real divergence isn't mistaken for a network blip.
  • fork.pyForkTransport runs three phases: prefix-replay (served from the parent tape for $0, request asserted to match — the agent must be deterministic up to the fork point), mutation-injection (same request, swapped response), and tail-record (the counterfactual continuation recorded fresh). A Branch carries prefix_replayed/tail_recorded counters that quantify the savings.
  • blame.py — forks each step k times, re-runs the agent, grades the outcome via an Oracle, and counts flips vs. the parent outcome. wilson_ci() gives the interval; BudgetGovernor estimates fork count and dollar cost before any spend.
  • faults.py / validate.py — five fault classes, each producing valid Anthropic JSON with a marker embedded inside a content field. A synthetic agent echoes each response into its next request, so an injected fault propagates through a fork to a fault-aware tail and flips the outcome — letting the blame engine be scored against ground truth entirely offline.
  • report.py / server.py / web/report.html — a single, dependency-free HTML file (vanilla JS, no npm) rendered statically by report or served live by serve.

Determinism boundary (v1, honest scope)

Bit-exact replay holds within a declared boundary: single-process, clock + id nondeterminism, captured through NondetSource. An agent that reads datetime.now() / uuid / random directly, or runs its loop across threads/subprocesses, steps outside that boundary — and the verifier will detect the resulting drift rather than paper over it. Forking and blame assume the agent rebuilds its prefix deterministically (the same property replay proves). See SPIKE0.md for how the boundary was de-risked.

Validation scope

What tracefork validate proves, stated precisely: the blame engine is genuinely causal — inject an outcome-flipping fault at any step and the engine ranks that step first (verified by also injecting at a non-root step), so the 1.00 is not a tautology or a fixed-slot artifact. The five "fault classes" carry two real injection mechanisms (a corrupted tool argument and a replaced text message) via a marker that survives the SDK's JSON round-trip, and the negative control — a no-op perturbation that must not flip the outcome — is enforced with a hard threshold (the run fails if it ever exceeds 0.30).

What it does not yet claim: discrimination among several competing plausible causes on a long run. The fixture is a short tape where one step gets a flip-capable perturbation and the rest get an inert one — a clean positive-vs-control, but an easy one. A longer tape with a decoy step that changes the transcript without changing the outcome is the next iteration; until then, read 1.00 as "the instrument reliably finds the planted cause," not "it resolves ambiguous multi-cause blame."

Layout

src/tracefork/      transport, tape, nondet, recorder, fork, store,
                    blame, faults, validate, report, server, wire, synthetic, cli
src/tracefork_spike/  the original bit-exact record/replay spike
web/report.html     the single-file three-panel UI
examples/           runnable demo that produces the report above
tests/              65 offline tests ($0, no key)
experiments/        committed reference report for `validate --check`

Testing

uv run pytest -q                                   # all 65 offline tests
uv run pytest tests/test_faults.py -q              # the self-validation chain
uv run tracefork validate --check                  # regression-gate vs committed report

Contributing

Contributions are welcome — see CONTRIBUTING.md for dev setup, the invariants a PR must respect, and commit/PR conventions. The whole dev loop (tests, validate, lint, type-check) is offline and $0, so you can run the full gate with no API key. Please also read the Code of Conduct.

Security

See SECURITY.md for how to report a vulnerability. In short: tapes are JSON + base64 (never pickle, so loading one can't execute code), and tracefork serve binds to 127.0.0.1 only.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tracefork-0.1.0.tar.gz (177.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tracefork-0.1.0-py3-none-any.whl (49.9 kB view details)

Uploaded Python 3

File details

Details for the file tracefork-0.1.0.tar.gz.

File metadata

  • Download URL: tracefork-0.1.0.tar.gz
  • Upload date:
  • Size: 177.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracefork-0.1.0.tar.gz
Algorithm Hash digest
SHA256 14a1259da617afe10f7e08c9dbd1fcb8705718ea276b4e817eda53bfb800d8e6
MD5 e4efa9c5ea5a59129424253a7bdd0f78
BLAKE2b-256 d1c3a833415d851545c0dcf33573c1028f71b1f337a08878b298653219c1a22d

See more details on using hashes here.

Provenance

The following attestation bundles were made for tracefork-0.1.0.tar.gz:

Publisher: release.yml on pratik916/tracefork

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tracefork-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tracefork-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 49.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracefork-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e1828cca9d82cc39dc208207e9aec70930c549adbc3fd0c8e69ccdd6692198b4
MD5 fc7e6190e055669d53524af607383170
BLAKE2b-256 f4b7c3f9b1e96bb47c7b1b5e0dbb79e70b1d2b4c8129f5debf298b3e84501a9f

See more details on using hashes here.

Provenance

The following attestation bundles were made for tracefork-0.1.0-py3-none-any.whl:

Publisher: release.yml on pratik916/tracefork

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page