Time-travel debugger for AI agents: bit-exact record/replay, fork any step, causal blame with confidence intervals.

tracefork

A time-travel debugger for AI agents that doesn't just replay a failed run — it proves the replay is bit-for-bit real, lets you fork any step, and measures which step caused the failure, with confidence intervals.

Figure: tracefork report (timeline, exchange detail, and causal blame panel)

The three-panel report: a run's timeline (left) with causal-blame badges, the request/response for the selected exchange (center), and the blame ranking with 95% confidence intervals (right). Generated offline, for $0, by examples/demo_report.py.

The idea

Most agent-observability tools show you a trace and ask you to eyeball it. tracefork treats an agent run like a recording you can rewind, branch, and reason about causally:

Record every model call into a content-addressed tape at the HTTP seam of the Anthropic SDK, capturing the sources of nondeterminism (clock, ids) the agent reads.
Replay the tape bit-exact for $0 — every replayed request's body is sha256-checked against the tape, so it's proven identical, not asserted. (The matched surface is the request body; request headers such as anthropic-beta are out of scope — see Determinism boundary.) No network, no key.
Fork any step: swap in a different model response and let the same agent run forward from there. The unchanged prefix replays for free; only the new tail costs anything.
Blame: resample those forks across every step and rank each by its flip-rate — how often perturbing it changes the run's outcome — with Wilson score confidence intervals so a small sample can't masquerade as certainty.
Validate the instrument itself: inject faults with known root causes and confirm the blame engine identifies the right step. The engine is genuinely causal: it ranks first whichever step actually flips the outcome, not a fixed slot. Across five fault classes (which exercise two distinct injection mechanisms) it reaches 1.00 top-1 precision offline, against a negative control whose flip-rate stays flat and is enforced by a hard threshold rather than merely printed. See Validation scope for exactly what that number does and doesn't claim.

That last pillar is the point: a debugger that claims to find root causes has to be held to ground truth. tracefork validate is that proof, and it runs in under a second with no API key.
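To make the confidence-interval claim concrete, here is a minimal sketch of a Wilson score interval for a flip-rate. It is the textbook formula, not tracefork's own `wilson_ci()` (whose signature is not shown here); the function name `wilson_interval` and its parameters are assumptions for illustration.

```python
import math


def wilson_interval(flips: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence. The name and signature are
    illustrative, not tracefork's API.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= flips <= trials:
        raise ValueError("flips must be between 0 and trials")

    p = flips / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom

    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


small = wilson_interval(3, 3)    # every one of 3 forks flipped the outcome
large = wilson_interval(30, 30)  # every one of 30 forks flipped the outcome
print(small, large)
```

With 3 flips out of 3 the lower bound is only about 0.44, while 30 out of 30 raises it to about 0.89. The observed flip-rate is 1.00 in both cases, which is why the interval matters more than the point estimate.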

Quickstart (offline, $0, no API key)

Python 3.12 via uv. Everything below is offline and makes no network calls.

uv sync --extra dev

# 1. The full offline test suite (65 tests).
uv run pytest -q

# 2. The instrument validates itself against injected, known-root-cause faults.
uv run tracefork validate

# 3. Generate the demo report shown above, then open it in any browser.
uv run python examples/demo_report.py
open examples/demo_report.html      # macOS; or just open the file

# 4. The original Spike 0 receipt: record → persist → replay → prove bit-exact.
uv run python -m tracefork_spike

tracefork validate prints:

  [PASS] corrupted_tool_output               top-1: 1.00
  [PASS] misleading_retrieval                top-1: 1.00
  [PASS] wrong_system_prompt                 top-1: 1.00
  [PASS] dropped_message                     top-1: 1.00
  [PASS] poisoned_argument                   top-1: 1.00

  overall top-1 precision: 1.00
  negative control max flip: 0.00 (threshold 0.30)

The CLI

uv run tracefork --help

Command	What it does
`replay <tape> --agent pkg.mod:fn`	Replay a tape and print the bit-exact verification receipt.
`verify <tape> --agent pkg.mod:fn`	Verify replay; exit non-zero on drift (CI gate).
`fork <run_id> --step N --response f --agent pkg.mod:fn`	Fork a run at step N with a mutated response; record the counterfactual branch.
`blame <run_id> --agent pkg.mod:fn [--k 10] [--budget 5.0]`	Rank every step by causal flip-rate with 95% CIs (re-runs the agent; budget-capped).
`report <run_id> | --tape <tape> -o out.html`	Render the self-contained three-panel HTML report.
`serve [--store store.db] [--port 7777]`	Serve the live web UI (same-origin, 127.0.0.1).
`validate [--k 3] [--n-runs 5] [--check]`	Run the fault-injection suite; `--check` gates against the committed report.

Replay, verify, fork, and the offline demos need no key. blame against a real run re-runs the agent's counterfactual tails against the live API, which is why it's budget-capped — the offline, $0 proof that blame works is tracefork validate.
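The budget cap can be pictured as a pre-flight estimate made before any fork is run. The sketch below is an assumption about the shape of such a check, not tracefork's `BudgetGovernor`: it assumes every step is forked `k` times and that each fork's tail has a flat estimated cost, and the names `BlamePlan`, `plan_blame`, and `est_cost_per_fork_usd` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlamePlan:
    forks: int
    estimated_cost_usd: float
    within_budget: bool


def plan_blame(n_steps: int, k: int, est_cost_per_fork_usd: float, budget_usd: float) -> BlamePlan:
    """Estimate fork count and cost up front (hypothetical flat-cost model)."""
    if n_steps < 0 or k < 0:
        raise ValueError("n_steps and k must be non-negative")
    forks = n_steps * k  # every step is resampled k times
    cost = forks * est_cost_per_fork_usd
    return BlamePlan(forks=forks, estimated_cost_usd=cost, within_budget=cost <= budget_usd)


# A 6-step run with k=10 and a budget of 5.0; the per-fork cost is a made-up figure.
plan = plan_blame(n_steps=6, k=10, est_cost_per_fork_usd=0.05, budget_usd=5.0)
print(plan)
```

A real estimate would likely vary per step, since a fork near the start of a run has a longer tail to re-record than a fork near the end.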

How it works

The spine is a record/replay seam at the Anthropic SDK's httpx boundary plus a nondeterminism-virtualization seam the agent reads time and ids through. Bit-exactness is the contract between them.

transport.py — TraceforkTransport (sync) / AsyncTraceforkTransport (async). Record mode tees request+response bytes into the tape (buffering streaming SSE and plain JSON identically via .read()/.aread()); replay mode serves recorded bytes and sha256-asserts every request body matches the tape. A replay transport has no inner transport, so an unrecorded request is a hard error, never a silent network call.
tape.py — content-addressed (sha256) blobs + an ordered event log, persistable to SQLite, with a hash-chain digest() fingerprint.
nondet.py — NondetSource is the only way the agent gets time/ids; RecordingNondet logs real draws, ReplayNondet serves them back, DriftingNondet is the negative control. find_divergence() unwraps the DivergenceError the SDK buries inside an APIConnectionError so a real divergence isn't mistaken for a network blip.
fork.py — ForkTransport runs three phases: prefix-replay (served from the parent tape for $0, request asserted to match — the agent must be deterministic up to the fork point), mutation-injection (same request, swapped response), and tail-record (the counterfactual continuation recorded fresh). A Branch carries prefix_replayed/tail_recorded counters that quantify the savings.
blame.py — forks each step k times, re-runs the agent, grades the outcome via an Oracle, and counts flips vs. the parent outcome. wilson_ci() gives the interval; BudgetGovernor estimates fork count and dollar cost before any spend.
faults.py / validate.py — five fault classes, each producing valid Anthropic JSON with a marker embedded inside a content field. A synthetic agent echoes each response into its next request, so an injected fault propagates through a fork to a fault-aware tail and flips the outcome — letting the blame engine be scored against ground truth entirely offline.
report.py / server.py / web/report.html — a single, dependency-free HTML file (vanilla JS, no npm) rendered statically by report or served live by serve.
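The tape and replay contract described above can be sketched with the standard library alone. This is a toy under stated assumptions, not the code in tape.py or transport.py: `MiniTape`, `MiniReplayer`, and `ReplayMismatch` are hypothetical names, the tape lives in memory rather than SQLite, and the chaining scheme is one plausible way to build a hash-chain fingerprint.

```python
import hashlib


class ReplayMismatch(Exception):
    """Raised when a replayed request is not the one on the tape."""


class MiniTape:
    """Toy content-addressed tape: blobs keyed by sha256 plus an ordered event log."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}
        self.events: list[tuple[str, str]] = []  # (request_hash, response_hash)

    def _put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data  # identical bytes are stored once
        return key

    def record(self, request: bytes, response: bytes) -> None:
        self.events.append((self._put(request), self._put(response)))

    def digest(self) -> str:
        """Hash-chain fingerprint: each link covers the previous link and one event."""
        chain = b""
        for req_hash, resp_hash in self.events:
            chain = hashlib.sha256(chain + req_hash.encode() + resp_hash.encode()).digest()
        return chain.hex()


class MiniReplayer:
    """Serves recorded responses and refuses any request that drifts from the tape."""

    def __init__(self, tape: MiniTape) -> None:
        self.tape = tape
        self.cursor = 0

    def send(self, request: bytes) -> bytes:
        if self.cursor >= len(self.tape.events):
            # No fallback to the network: an unrecorded request is a hard error.
            raise ReplayMismatch("unrecorded request: tape exhausted")
        req_hash, resp_hash = self.tape.events[self.cursor]
        if hashlib.sha256(request).hexdigest() != req_hash:
            raise ReplayMismatch(f"request body drifted at step {self.cursor}")
        self.cursor += 1
        return self.tape.blobs[resp_hash]


tape = MiniTape()
tape.record(b'{"q": 1}', b'{"a": 1}')
replayer = MiniReplayer(tape)
print(replayer.send(b'{"q": 1}'), tape.digest())
```

Because the fingerprint chains over the ordered event log, reordering or altering any exchange changes the digest, which is what makes a single digest comparison usable as a regression check.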

Determinism boundary (v1, honest scope)

Bit-exact replay holds within a declared boundary: single-process, clock + id nondeterminism, captured through NondetSource. An agent that reads datetime.now() / uuid / random directly, or runs its loop across threads/subprocesses, steps outside that boundary — and the verifier will detect the resulting drift rather than paper over it. Forking and blame assume the agent rebuilds its prefix deterministically (the same property replay proves). See SPIKE0.md for how the boundary was de-risked.
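A small sketch of what staying inside the boundary looks like: the agent draws time and ids through a source object, so a replay can hand back the recorded values and rebuild byte-identical requests. The classes below are hypothetical stand-ins (`RecordingSource`, `ReplaySource`), not tracefork's `NondetSource`, `RecordingNondet`, or `ReplayNondet`, whose interfaces are not shown here.

```python
import json
import time
import uuid


class RecordingSource:
    """Draws real values and logs them in order."""

    def __init__(self) -> None:
        self.log: list[tuple[str, object]] = []

    def now(self) -> float:
        value = time.time()
        self.log.append(("now", value))
        return value

    def new_id(self) -> str:
        value = uuid.uuid4().hex
        self.log.append(("id", value))
        return value


class ReplaySource:
    """Serves logged values back in the same order; any deviation is an error."""

    def __init__(self, log: list[tuple[str, object]]) -> None:
        self._log = list(log)
        self._pos = 0

    def _next(self, kind: str) -> object:
        if self._pos >= len(self._log):
            raise RuntimeError(f"unrecorded draw of kind {kind!r}")
        recorded_kind, value = self._log[self._pos]
        if recorded_kind != kind:
            raise RuntimeError(f"expected {recorded_kind!r} draw, got {kind!r}")
        self._pos += 1
        return value

    def now(self) -> float:
        return self._next("now")

    def new_id(self) -> str:
        return self._next("id")


def build_request(source) -> str:
    """Agent code: never calls time.time() or uuid4() directly."""
    return json.dumps({"id": source.new_id(), "ts": source.now()})


recorder = RecordingSource()
original = build_request(recorder)
replayed = build_request(ReplaySource(recorder.log))
print(original == replayed)
```

If `build_request` called `time.time()` directly, the replayed body would differ from the recorded one, and a sha256 check on the request body would report the drift.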

Validation scope

What tracefork validate proves, stated precisely: the blame engine is genuinely causal — inject an outcome-flipping fault at any step and the engine ranks that step first (verified by also injecting at a non-root step), so the 1.00 is not a tautology or a fixed-slot artifact. The five "fault classes" carry two real injection mechanisms (a corrupted tool argument and a replaced text message) via a marker that survives the SDK's JSON round-trip, and the negative control — a no-op perturbation that must not flip the outcome — is enforced with a hard threshold (the run fails if it ever exceeds 0.30).

What it does not yet claim: discrimination among several competing plausible causes on a long run. The fixture is a short tape where one step gets a flip-capable perturbation and the rest get an inert one — a clean positive-vs-control, but an easy one. A longer tape with a decoy step that changes the transcript without changing the outcome is the next iteration; until then, read 1.00 as "the instrument reliably finds the planted cause," not "it resolves ambiguous multi-cause blame."
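The shape of the fixture described above can be shown with a toy flip-rate calculation. Everything here is a simplification and an assumption: the "run" is a list of strings, the oracle is a marker check, each fork reuses the parent's tail verbatim instead of re-running an agent, and the names `run_outcome`, `flip_rates`, and `rank_steps` are hypothetical rather than taken from blame.py.

```python
def run_outcome(responses: list[str]) -> bool:
    """Toy oracle: the run succeeds unless some step carries the fault marker."""
    return not any(r.startswith("FAULT") for r in responses)


def flip_rates(parent: list[str], k: int) -> dict[int, float]:
    """Fork every step k times and measure how often the outcome changes."""
    parent_outcome = run_outcome(parent)
    rates: dict[int, float] = {}
    for step in range(len(parent)):
        flips = 0
        for j in range(k):
            # Prefix is kept as recorded, the chosen step gets a clean resample,
            # and (toy simplification) the tail is reused rather than re-run.
            branch = parent[:step] + [f"resample-{j}"] + parent[step + 1:]
            if run_outcome(branch) != parent_outcome:
                flips += 1
        rates[step] = flips / k
    return rates


def rank_steps(rates: dict[int, float]) -> list[int]:
    """Highest flip-rate first; ties keep step order because sorted() is stable."""
    return sorted(rates, key=lambda step: rates[step], reverse=True)


parent_run = ["plan", "search", "FAULT: corrupted tool output", "answer"]
rates = flip_rates(parent_run, k=3)
print(rank_steps(rates), rates)
```

Only the planted step flips the outcome, so it ranks first with a flip-rate of 1.0 and every other step scores 0.0. That is the clean positive-versus-control case; it says nothing yet about a run in which several steps could each plausibly flip the outcome.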

Layout

src/tracefork/      transport, tape, nondet, recorder, fork, store,
                    blame, faults, validate, report, server, wire, synthetic, cli
src/tracefork_spike/  the original bit-exact record/replay spike
web/report.html     the single-file three-panel UI
examples/           runnable demo that produces the report above
tests/              65 offline tests ($0, no key)
experiments/        committed reference report for `validate --check`

Testing

uv run pytest -q                                   # all 65 offline tests
uv run pytest tests/test_faults.py -q              # the self-validation chain
uv run tracefork validate --check                  # regression-gate vs committed report

Contributing

Contributions are welcome — see CONTRIBUTING.md for dev setup, the invariants a PR must respect, and commit/PR conventions. The whole dev loop (tests, validate, lint, type-check) is offline and $0, so you can run the full gate with no API key. Please also read the Code of Conduct.

Security

See SECURITY.md for how to report a vulnerability. In short: tapes are JSON + base64 (never pickle, so loading one can't execute code), and tracefork serve binds to 127.0.0.1 only.

License

MIT — see LICENSE.

Maintainers

pratiksoni

Release history

0.2.1: Jul 2, 2026
0.2.0: Jul 2, 2026
0.1.0: Jul 1, 2026 (this version)

Download files

Source Distribution

tracefork-0.1.0.tar.gz (177.3 kB)

Uploaded Jul 1, 2026 Source

Built Distribution

tracefork-0.1.0-py3-none-any.whl (49.9 kB)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file tracefork-0.1.0.tar.gz.

File metadata

Download URL: tracefork-0.1.0.tar.gz
Upload date: Jul 1, 2026
Size: 177.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracefork-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`14a1259da617afe10f7e08c9dbd1fcb8705718ea276b4e817eda53bfb800d8e6`
MD5	`e4efa9c5ea5a59129424253a7bdd0f78`
BLAKE2b-256	`d1c3a833415d851545c0dcf33573c1028f71b1f337a08878b298653219c1a22d`

Provenance

The following attestation bundles were made for tracefork-0.1.0.tar.gz:

Publisher: release.yml on pratik916/tracefork

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tracefork-0.1.0.tar.gz
- Subject digest: 14a1259da617afe10f7e08c9dbd1fcb8705718ea276b4e817eda53bfb800d8e6
- Sigstore transparency entry: 2041841941
- Sigstore integration time: Jul 1, 2026
Source repository:
- Permalink: pratik916/tracefork@734a0a1a7e1f3a95f7daf83c5ab21f272767c7ff
- Branch / Tag: refs/heads/main
- Owner: https://github.com/pratik916
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@734a0a1a7e1f3a95f7daf83c5ab21f272767c7ff
- Trigger Event: workflow_dispatch

File details

Details for the file tracefork-0.1.0-py3-none-any.whl.

File metadata

Download URL: tracefork-0.1.0-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 49.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracefork-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e1828cca9d82cc39dc208207e9aec70930c549adbc3fd0c8e69ccdd6692198b4`
MD5	`fc7e6190e055669d53524af607383170`
BLAKE2b-256	`f4b7c3f9b1e96bb47c7b1b5e0dbb79e70b1d2b4c8129f5debf298b3e84501a9f`

Provenance

The following attestation bundles were made for tracefork-0.1.0-py3-none-any.whl:

Publisher: release.yml on pratik916/tracefork

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tracefork-0.1.0-py3-none-any.whl
- Subject digest: e1828cca9d82cc39dc208207e9aec70930c549adbc3fd0c8e69ccdd6692198b4
- Sigstore transparency entry: 2041842822
- Sigstore integration time: Jul 1, 2026
Source repository:
- Permalink: pratik916/tracefork@734a0a1a7e1f3a95f7daf83c5ab21f272767c7ff
- Branch / Tag: refs/heads/main
- Owner: https://github.com/pratik916
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@734a0a1a7e1f3a95f7daf83c5ab21f272767c7ff
- Trigger Event: workflow_dispatch
