Time-travel debugger for AI agents: bit-exact record/replay, fork any step, causal blame with confidence intervals.
tracefork
A time-travel debugger for AI agents: record a run to a content-addressed tape, replay it bit-for-bit for $0 — proven by hash, not asserted — fork any step, and, validated on controlled fixtures, measure which step is causally responsible for a failure, with confidence intervals.
The three-panel report: a run's timeline (left) with causal-blame badges, the
request/response for the selected exchange (center), and the blame ranking with 95%
confidence intervals (right). Generated offline, for $0, by
examples/demo_report.py.
The idea
Every agent-observability tool shows you a trace and asks you to eyeball it. tracefork treats an agent run like a recording you can rewind, branch, and reason about causally:
- Record every model call into a content-addressed tape at the HTTP seam of the Anthropic SDK, capturing the sources of nondeterminism (clock, ids) the agent reads.
- Replay the tape bit-exact for $0: every replayed request's body is sha256-checked against the tape, so it's proven identical, not asserted. (The matched surface is the request body; request headers such as anthropic-beta are out of scope; see Determinism boundary.) No network, no key.
- Fork any step: swap in a different model response and let the same agent run forward from there. The unchanged prefix replays for free; only the new tail costs anything.
- Blame: resample those forks across every step and rank each by its flip-rate — how often perturbing it changes the run's outcome — with Wilson score confidence intervals so a small sample can't masquerade as certainty.
- Validate the instrument itself: inject faults with known root causes and confirm the blame engine fingers the right step. On a short positive-vs-inert control the engine is genuinely causal — it ranks whichever step actually flips the outcome #1, not a fixed slot — hitting 1.00 top-1 precision across five injection mechanisms, offline, against a flat negative control that's enforced, not just printed.
- Discriminate among competing causes: a second, longer fixture plants several causally-distinct faults on one tape at once (a root cause, a downstream echo that must not be blamed as the root, and a two-part necessary-not-sufficient conjunction) and measures whether the coalition/temporal-Shapley engine tells them apart (tracefork bench). 8 of 9 cases resolve exactly as planted; the one that doesn't is reported as a named, honest limitation, not hidden. This is internal, labeled evidence. The published ~14.2% log-based step-attribution figure it's cited alongside (Who&When, ICML 2025) is context only; tracefork has not been run against that benchmark's real data. See Validation scope for exactly what each number does and doesn't claim.
Those last two pillars are the point: a debugger that claims to find root causes has to be
held to ground truth. tracefork validate and tracefork bench are that proof, and both
run in a few seconds with no API key.
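The blame idea above can be sketched in a few lines. This is a toy under stated assumptions, not tracefork's blame.py: `toy_agent`, `flip_rates`, and `wilson_interval` are hypothetical names, the "agent" is a pure function over a list of responses, and the perturbation cycles through two fixed candidates instead of resampling a model. Only the Wilson score formula itself is standard.

```python
import math


def wilson_interval(flips: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = flips / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def toy_agent(responses: list[str]) -> bool:
    """Hypothetical agent: the run succeeds unless step 2 carries a fault."""
    return responses[2] != "FAULT"


def flip_rates(parent: list[str], k: int) -> list[tuple[int, int, int]]:
    """Fork every step k times and count how often the outcome flips."""
    candidates = ["ok-variant", "FAULT"]  # stand-in for resampled responses
    parent_outcome = toy_agent(parent)
    rows = []
    for step in range(len(parent)):
        flips = 0
        for i in range(k):
            forked = list(parent)
            forked[step] = candidates[i % len(candidates)]  # perturb one step
            flips += toy_agent(forked) != parent_outcome
        rows.append((step, flips, k))
    return rows


if __name__ == "__main__":
    ranking = sorted(flip_rates(["ok"] * 4, k=10), key=lambda r: r[1], reverse=True)
    for step, flips, k in ranking:
        low, high = wilson_interval(flips, k)
        print(f"step {step}: flip-rate {flips / k:.2f}  CI [{low:.2f}, {high:.2f}]")
```

Note that a step with zero flips out of ten still gets a non-trivial upper bound, which is the point of reporting the interval and not only the point estimate.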
Quickstart (offline, $0, no API key)
Python 3.12 via uv. Everything below is offline and makes no network calls.
uv sync --extra dev
# 1. The full offline test suite (672 tests) -- including a cross-module,
# whole-pipeline suite (tests/test_e2e.py) and an all-CLI-commands smoke
# test (tests/test_cli_smoke.py), not just per-module unit tests.
uv run pytest -q
# 2. The instrument validates itself against injected, known-root-cause faults.
uv run tracefork validate
# 2b. ...and against several SIMULTANEOUS, competing faults on a longer tape.
uv run tracefork bench
# 3. Generate the demo report shown above, then open it in any browser.
uv run python examples/demo_report.py
open examples/demo_report.html # macOS; or just open the file
# 4. The original Spike 0 receipt: record → persist → replay → prove bit-exact.
uv run python -m tracefork_spike
Run the full E2E receipt — sync, lint, format, type-check, tests+coverage, the self-validation and replay-fixture-corpus regression gates, the competing-fault benchmark, and a package build/twine check, as one script with a single PASS/FAIL verdict:
bash scripts/e2e.sh
tracefork validate prints:
[PASS] corrupted_tool_output top-1: 1.00
[PASS] misleading_retrieval top-1: 1.00
[PASS] wrong_system_prompt top-1: 1.00
[PASS] dropped_message top-1: 1.00
[PASS] poisoned_argument top-1: 1.00
overall top-1 precision: 1.00
negative control max flip: 0.00 (threshold 0.30)
Optional extras
The core install (pip install tracefork) is offline/$0 and pulls in no framework or
provider SDKs. Everything else — providers, Bedrock, MCP, observability, and each
framework adapter — is opt-in via an extra. A curated bundle of the internally-
consistent, stable-wire family is available in one shot:
pip install 'tracefork[all]'
| Extra | Installs | Note |
|---|---|---|
| all | providers + bedrock + mcp + observability | Convenience bundle. Deliberately excludes the framework stacks (frameworks, openai-agents, crewai, autogen, adk) so one framework's future version cap can't block installing everything else. |
| providers | openai, google-genai SDKs | Record/replay against OpenAI or Gemini directly; not the same as openai-agents below. |
| bedrock | boto3 | AWS Bedrock record/replay. |
| mcp | mcp | Model Context Protocol client record/replay. |
| observability | structlog, opentelemetry-* | Self-instrumentation logging/tracing of tracefork itself. |
| frameworks | langchain-core, langchain-openai, langchain-anthropic, langgraph | LangChain/LangGraph adapter. |
| openai-agents | openai-agents (the OpenAI Agents SDK) | Not the plain openai SDK; that's providers. |
| crewai | crewai | CrewAI adapter (routes through LiteLLM). |
| autogen | autogen-core, autogen-ext | AutoGen adapter. |
| adk | google-adk (Google Agent Development Kit) | An agent framework, not a provider SDK. |
| dev | pytest, ruff, mypy, ... | Local development/test tooling. |
Every bracketed install command in this README is single-quoted: zsh (macOS's default
shell) treats an unquoted [...] as a glob pattern and fails with "no matches found".
The CLI
uv run tracefork --help
| Command | What it does |
|---|---|
| `replay <tape> --agent pkg.mod:fn` | Replay a tape and print the bit-exact verification receipt. |
| `replay --check <fixtures dir>` | Replay-as-regression gate: assert every fixture in a committed tape corpus replays bit-exact and its digest() matches. |
| `verify <tape> --agent pkg.mod:fn` | Verify replay; exit non-zero on drift (CI gate). |
| `fork <run_id> --step N --response f --agent pkg.mod:fn` | Fork a run at step N with a mutated response; record the counterfactual branch. |
| `blame <run_id> --agent pkg.mod:fn [--k 10] [--budget 5.0]` | Rank every step by causal flip-rate with 95% CIs (re-runs the agent; budget-capped). |
| `report <run_id> \| --tape <tape> -o out.html` | Render the self-contained three-panel HTML report. |
| `serve [--store store.db] [--port 7777]` | Serve the live web UI (same-origin, 127.0.0.1). |
| `validate [--k 3] [--n-runs 5] [--check]` | Run the fault-injection suite; --check gates against the committed report. |
| `bench [--k 3] [--m-samples 2]` | Long-tape competing-fault benchmark: does the coalition/temporal-Shapley engine discriminate several simultaneously planted causes, not just detect one? See Validation scope. |
| `export <run_id> --otel\|--openinference -o out.json` | Export a tape (+ optional --blame-report) as an OTel GenAI trace or an OpenInference dataset. |
| `ingest <trace.json> --otel\|--openinference -o out.tape.sqlite` | Build a tape's step structure from an externally-produced trace (blame-by-re-execution only, not bit-exact replayable). |
| `proxy record\|replay --tape <tape> [--upstream url] [--port 8899]` | Localhost base-URL record/replay proxy for non-Python clients (curl, Node, Go, ...); see Localhost record/replay proxy. |
Replay, verify, fork, and the offline demos need no key. blame against a real run
re-runs the agent's counterfactual tails against the live API, which is why it's
budget-capped — the offline, $0 proof that blame works is tracefork validate.
How it works
The spine is a record/replay seam at the Anthropic SDK's httpx boundary plus a nondeterminism-virtualization seam the agent reads time and ids through. Bit-exactness is the contract between them.
- transport.py: TraceforkTransport (sync) / AsyncTraceforkTransport (async). Record mode tees request+response bytes into the tape (buffering streaming SSE and plain JSON identically via .read()/.aread()); replay mode serves recorded bytes and sha256-asserts every request body matches the tape. A replay transport has no inner transport, so an unrecorded request is a hard error, never a silent network call.
- tape.py: content-addressed (sha256) blobs + an ordered event log, persistable to SQLite, with a hash-chain digest() fingerprint.
- nondet.py: NondetSource is the only way the agent gets time/ids/random draws; RecordingNondet logs real draws, ReplayNondet serves them back, DriftingNondet is the negative control. find_divergence() unwraps the DivergenceError the SDK buries inside an APIConnectionError so a real divergence isn't mistaken for a network blip.
- boundary_guard.py: opt-in (default off) BoundaryGuard. It hard-errors at record time on thread/subprocess spawn or direct random/clock reads that bypass NondetSource, instead of letting the tape fail replay later, mysteriously.
- fork.py: ForkTransport runs three phases: prefix-replay (served from the parent tape for $0, request asserted to match; the agent must be deterministic up to the fork point), mutation-injection (same request, swapped response), and tail-record (the counterfactual continuation recorded fresh). A Branch carries prefix_replayed/tail_recorded counters that quantify the savings.
- blame.py: forks each step k times, re-runs the agent, grades the outcome via an Oracle, and counts flips vs. the parent outcome. wilson_ci() gives the interval; BudgetGovernor estimates fork count and dollar cost before any spend.
- faults.py / validate.py: five fault classes, each producing valid Anthropic JSON with a marker embedded inside a content field. A synthetic agent echoes each response into its next request, so an injected fault propagates through a fork to a fault-aware tail and flips the outcome, letting the blame engine be scored against ground truth entirely offline.
- report.py / server.py / web/report.html: a single, dependency-free HTML file (vanilla JS, no npm) rendered statically by report or served live by serve.
- interop.py: gen_ai.* / OpenInference export (export) and ingest (ingest); see OTel / OpenInference interop for the precise, blame-only-not-bit-exact scope of the ingest direction.
- observability.py: opt-in (observability extra) structlog JSON logging and OTel self-instrumentation of record/replay/fork/blame; a no-op until explicitly enabled, so installing the extra alone changes nothing.
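To make the tape/transport contract concrete, here is a minimal sketch of a content-addressed tape with a hash-chain digest and a replayer that refuses any request whose body hash drifts. It is an illustration only: `ToyTape`, `ToyReplayer`, and `ReplayDivergence` are hypothetical names, the chain seed is arbitrary, and none of this mirrors the real tape.py or transport.py layout.

```python
import hashlib


class ReplayDivergence(Exception):
    """Raised when a replayed request does not match the recorded one."""


class ToyTape:
    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}          # sha256 hex -> raw bytes
        self.events: list[tuple[str, str]] = []    # ordered (request, response) hashes

    def _put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data                     # identical bytes are stored once
        return key

    def record(self, request: bytes, response: bytes) -> None:
        self.events.append((self._put(request), self._put(response)))

    def digest(self) -> str:
        """Hash chain over the event log: order and content both matter."""
        chain = hashlib.sha256(b"toy-tape-v0").digest()
        for req_hash, resp_hash in self.events:
            link = chain + bytes.fromhex(req_hash) + bytes.fromhex(resp_hash)
            chain = hashlib.sha256(link).digest()
        return chain.hex()


class ToyReplayer:
    """Serves recorded responses; there is no fallback to a real network."""

    def __init__(self, tape: ToyTape) -> None:
        self.tape = tape
        self.cursor = 0

    def send(self, request: bytes) -> bytes:
        if self.cursor >= len(self.tape.events):
            raise ReplayDivergence("unrecorded request")
        req_hash, resp_hash = self.tape.events[self.cursor]
        if hashlib.sha256(request).hexdigest() != req_hash:
            raise ReplayDivergence(f"request body drifted at step {self.cursor}")
        self.cursor += 1
        return self.tape.blobs[resp_hash]
```

Because the replayer has nothing to forward to, a request past the end of the tape or with a different body can only fail loudly.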
Determinism boundary (v1, honest scope)
Bit-exact replay holds within a declared boundary: single-process (sync or asyncio),
clock + id + random nondeterminism, captured through NondetSource. An agent that reads
datetime.now() / uuid / random directly, or runs its loop across threads/subprocesses,
steps outside that boundary — and the verifier will detect the resulting drift rather than
paper over it. Forking and blame assume the agent rebuilds its prefix deterministically (the
same property replay proves). See SPIKE0.md for how the boundary was de-risked.
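The boundary is easier to see with a toy version of the nondeterminism seam. In this sketch, which assumes nothing about the real NondetSource API, `RecordingSource` and `ReplaySource` are hypothetical stand-ins: one logs real clock and id draws, the other serves them back in order, so a request built through the seam is byte-identical on replay.

```python
import json
import time
import uuid


class RecordingSource:
    """Hands out real values and logs every draw in order."""

    def __init__(self) -> None:
        self.log: list[tuple[str, object]] = []

    def now(self) -> float:
        value = time.time()
        self.log.append(("now", value))
        return value

    def new_id(self) -> str:
        value = uuid.uuid4().hex
        self.log.append(("id", value))
        return value


class ReplaySource:
    """Serves the logged draws back; any mismatch in kind or count is an error."""

    def __init__(self, log: list[tuple[str, object]]) -> None:
        self._log = list(log)
        self._cursor = 0

    def _next(self, kind: str) -> object:
        if self._cursor >= len(self._log):
            raise RuntimeError("agent drew more values than were recorded")
        recorded_kind, value = self._log[self._cursor]
        if recorded_kind != kind:
            raise RuntimeError(f"expected a {recorded_kind!r} draw, got {kind!r}")
        self._cursor += 1
        return value

    def now(self) -> float:
        return self._next("now")

    def new_id(self) -> str:
        return self._next("id")


def build_request(source) -> str:
    """An agent that reads time and ids only through the seam."""
    body = {"request_id": source.new_id(), "sent_at": source.now()}
    return json.dumps(body, sort_keys=True)
```

An agent that called `time.time()` or `uuid.uuid4()` directly inside `build_request` would produce a different body on the second run, which is exactly the drift the verifier is meant to catch.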
Concurrency-graph determinism (asyncio). asyncio is deterministic except for the
order in which concurrent in-flight requests (an asyncio.gather/TaskGroup fan-out)
resolve — and that order is driven by the very I/O tracefork already records. So the async
transport records the completion order (and logs each fully-overlapping fan-out batch to the
tape) and, on replay, correlates each request to its recorded exchange by fingerprint and
releases responses in the recorded completion order — a fan-out agent replays bit-exact,
not just a single-call-at-a-time one. A strictly-sequential async run (one await at a
time — the common case) is byte-identical to before, and the sync path is untouched. The
recorded order can also be replayed under a seeded reordering (chaos_release_order) to
surface completion-order-dependent ("race"/ordering) bugs. Nested fan-out where a request is
sent only after an earlier one in the same batch completed is replayed faithfully in the
recorded order but is not reordered by chaos (it isn't a physically-reorderable batch).
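A small asyncio sketch of the completion-order idea, under the assumption that every request in the batch is already in flight: responses are held behind per-request gates and released one at a time in the recorded order. `replay_fanout` and `seeded_reorder` are hypothetical names for illustration; they are not the async transport's API and not `chaos_release_order` itself.

```python
import asyncio
import random


async def replay_fanout(names, recorded_order, responses):
    """Release held responses in the recorded completion order."""
    gates = {name: asyncio.Event() for name in names}
    finished = {name: asyncio.Event() for name in names}
    observed = []

    async def call(name):
        await gates[name].wait()          # the "response" is held until released
        observed.append(name)
        finished[name].set()
        return responses[name]

    async def release():
        for name in recorded_order:
            gates[name].set()
            await finished[name].wait()   # one completion at a time, in order

    results = await asyncio.gather(*(call(n) for n in names), release())
    return observed, results[:-1]         # drop release()'s None result


def seeded_reorder(recorded_order, seed):
    """Deterministic shuffle of a batch, for probing order-dependent bugs."""
    shuffled = list(recorded_order)
    random.Random(seed).shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    names = ["a", "b", "c"]
    responses = {n: f"resp-{n}" for n in names}
    print(asyncio.run(replay_fanout(names, ["c", "a", "b"], responses)))
    print(asyncio.run(replay_fanout(names, seeded_reorder(["c", "a", "b"], 7), responses)))
```

The results list stays in call order, as `asyncio.gather` guarantees; only the order in which the tasks observe completion changes.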
An opt-in BoundaryGuard (default off; Recorder(..., boundary_guard=True) or
TraceforkConfig(boundary_guard=True)) turns some of these violations — thread/
subprocess spawn, direct random.random()/time.monotonic()/time.sleep() — into a loud
error at record time instead of a drift only discovered on replay. It deliberately can't
intercept everything: datetime.datetime.now() is a classmethod on an immutable C type
(can't be monkeypatched without breaking the SDK's pydantic schema builder — see
recorder.py), and time.time() is called unconditionally by httpx's cookie-jar
machinery on every response, so guarding it would false-positive on every exchange. See
boundary_guard.py's module docstring for the full, precise scope.
Redaction (opt-in)
Recording real traffic can put secrets and PII on a tape. Redaction is entirely opt-in —
Recorder/AsyncRecorder behave byte-for-byte as before unless you pass a redactor:
from tracefork import Recorder, safe_defaults, with_content_redaction
# Metadata only: auth headers + known secret env values (ANTHROPIC_API_KEY, ...).
# Fully bit-exact-replayable — redaction runs inside the matcher seam, so record
# and replay hash the identical redacted form.
with Recorder(client, redactor=safe_defaults()) as rec:
...
# Opt in further: also scrub message CONTENT (prompts/completions), mirroring
# OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT. This marks the tape
# `content_redacted = True` — forensic-only, NOT guaranteed bit-exact
# replayable, because the agent sees redacted text on replay instead of the
# real completion, and a redacted request can no longer prove a genuine
# prompt change didn't happen.
with Recorder(client, redactor=with_content_redaction()) as rec:
...
Redactors never affect what the live agent sees during recording — only what lands on the
tape. See redact.py for the full pipeline (ordered RedactorFn callbacks, regex_redactor,
secret_value_redactor, the header-redaction rules).
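Why metadata redaction can stay replayable is worth a short illustration: if the same redaction runs before hashing at both record and replay time, two requests that differ only in the secret produce the same fingerprint. The sketch below assumes a hypothetical `make_secret_redactor` helper and a plain sha256 fingerprint; it is not the redact.py pipeline.

```python
import hashlib


def make_secret_redactor(secrets: list[str]):
    """Return a function that replaces each known secret value with a marker."""

    def redact(text: str) -> str:
        for secret in secrets:
            if secret:  # never "redact" the empty string
                text = text.replace(secret, "[REDACTED]")
        return text

    return redact


def fingerprint(body: str, redact) -> str:
    """Hash the redacted form, so record and replay compare like with like."""
    return hashlib.sha256(redact(body).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical secret values; in practice these would come from the environment.
    redact = make_secret_redactor(["key-from-monday", "key-from-tuesday"])
    recorded = '{"prompt": "hello", "debug_key": "key-from-monday"}'
    replayed = '{"prompt": "hello", "debug_key": "key-from-tuesday"}'
    print(fingerprint(recorded, redact) == fingerprint(replayed, redact))
```

Content redaction is a different case: once the prompt text itself is replaced, a real prompt change can hash the same as no change, which is why such a tape is treated as forensic-only.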
OTel / OpenInference interop (opt-in)
tracefork's provider-neutral seam (NormalizedResponse in providers/base.py) already
names fields after the OpenTelemetry GenAI semantic conventions (model,
input_tokens/output_tokens, finish_reason). interop.py builds on that to move data
in and out of the two conventions every observability stack actually speaks — OTel GenAI
(gen_ai.* span attributes) and OpenInference (llm.*/openinference.*, used by Arize
Phoenix and friends) — the pinned semconv release is GENAI_SEMCONV_VERSION in
constants.py.
# Export a recorded run (+ optional blame_<run_id>.json) as plain JSON —
# gen_ai.* spans or an OpenInference-style dataset. No opentelemetry-sdk
# install needed to produce or consume either; they're just dicts.
uv run tracefork export <run_id> --otel -o trace.json
uv run tracefork export <run_id> --openinference --blame-report blame_<run_id>.json -o dataset.json
# Ingest a trace exported by ANY system that speaks these attributes —
# not just tracefork's own export — into a tape's STEP STRUCTURE.
uv run tracefork ingest trace.json --otel -o ingested.tape.sqlite
Ingest is blame-by-re-execution, NOT $0 bit-exact replay — read this before reaching
for it. Bit-exact replay depends on the exact request/response bytes tracefork itself
recorded, plus every NondetSource draw that produced them; an externally-produced trace
carries neither — span attributes don't include the original prompt, so an ingested
exchange's request is a synthesized placeholder ({"model": ..., "messages": []}). An
ingested tape's boundary is set to OTEL_INGESTED_BOUNDARY precisely so it's never
mistaken for a recorded one: feeding it to replay/fork against a real agent correctly
diverges on the very first step (proven, not just asserted, in tests/test_interop.py).
What it's for instead: recovering step count, per-step model, token usage, and — if the
source attached tracefork's own tracefork.blame.* attributes — flip-rate/CI, for
inspection or to drive a live re-execution blame strategy.
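The asymmetry between export and ingest shows up even in a toy mapping. The sketch below uses attribute names from the OpenTelemetry GenAI semantic conventions, but the step dictionary shape and both function names are assumptions for illustration, not interop.py's actual schema. Note how the ingested request can only be the synthesized placeholder described above.

```python
def to_genai_attributes(step: dict) -> dict:
    """Map one normalized step onto gen_ai.* span attributes."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": step["model"],
        "gen_ai.usage.input_tokens": step["input_tokens"],
        "gen_ai.usage.output_tokens": step["output_tokens"],
        "gen_ai.response.finish_reasons": [step["finish_reason"]],
    }


def from_genai_attributes(attrs: dict) -> dict:
    """Recover step structure only; the original prompt is not in the span."""
    reasons = attrs.get("gen_ai.response.finish_reasons") or [None]
    model = attrs.get("gen_ai.request.model")
    return {
        "model": model,
        "input_tokens": attrs.get("gen_ai.usage.input_tokens"),
        "output_tokens": attrs.get("gen_ai.usage.output_tokens"),
        "finish_reason": reasons[0],
        # Placeholder request: enough for inspection, useless for bit-exact replay.
        "request": {"model": model, "messages": []},
    }


if __name__ == "__main__":
    step = {
        "model": "example-model",  # hypothetical model name
        "input_tokens": 120,
        "output_tokens": 45,
        "finish_reason": "end_turn",
    }
    print(from_genai_attributes(to_genai_attributes(step)))
```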
Two more pieces are opt-in via the observability extra (pip install 'tracefork[observability]'; the core stays offline/$0 and dependency-free without it):
a structlog JSON logging pipeline (observability.configure_structlog_json() +
get_logger(), falling back to stdlib logging when structlog isn't installed), and
OTel self-instrumentation of record/replay/fork/blame — off by default, and
double opt-in even when installed (enable_otel_instrumentation() or
TRACEFORK_OTEL_ENABLED=1), so merely installing the extra changes nothing.
Framework adapters (opt-in)
Most agents in the wild are built on a framework, not raw SDK calls — and a
framework's own tracing/callbacks are observer-only, so they can annotate a run
but can't give you bit-exact, $0 replay. tracefork's adapters keep the byte seam
exactly where it already is (the httpx transport) and use the framework layer only
for structure: bind() routes the framework's underlying LLM client through
the existing TraceforkTransport + NondetSource, while callbacks feed a
neutral step-DAG (Step/StepDAG) that overlays the tape.
from tracefork import LangChainAdapter, make_callback_handler, make_tape_backed_checkpointer
adapter = LangChainAdapter()
# Replay a recorded tape through a LangChain chat model — bit-exact, $0, no key.
# (ChatOpenAI via root_client.copy(http_client=…); ChatAnthropic — which has no
# http_client field — via a fresh anthropic client seeded before first use.)
result = adapter.bind(chat_model, tape, mode="replay")
handler = make_callback_handler(adapter.dag) # step structure via BaseCallbackHandler
The marquee is tape-backed LangGraph time-travel: pair a replay-bound chat model
with make_tape_backed_checkpointer(tape) and LangGraph's own checkpoint time-travel
resumes graph state while the model replays its I/O from the tape — bit-exact and $0.
langchain-* / langgraph are optional (pip install 'tracefork[frameworks]');
every framework import is guarded, so import tracefork and the whole offline test
suite run with none of them installed. The framework-facing thin wrappers are
exercised against the real library when present and skipped cleanly otherwise, so
the pinned adapter version ranges (which churn) are validated separately from the
offline core.
Four more adapters ship the same way, each its own optional extra and each targeting the framework's actual model-call chokepoint:
- OpenAI Agents SDK (pip install 'tracefork[openai-agents]'): bind() injects into an Agents SDK model wrapper's underlying openai client (defensive attribute search, since the SDK doesn't document the stored attribute name); bind_default_client() wraps the SDK's own documented agents.set_default_openai_client() for a process-wide injection with no attribute guessing. Step visibility is a real TracingProcessor (make_tracing_processor(), installable via agents.set_trace_processors()).
- CrewAI (pip install 'tracefork[crewai]'): CrewAI routes every model call through LiteLLM, so bind() targets LiteLLM's own documented custom-client surface (litellm.client_session / litellm.aclient_session) rather than CrewAI itself. Step visibility is a crewai_event_bus listener (make_event_listener()) over crew/agent/task/tool/LLM-call boundary events.
- AutoGen (pip install 'tracefork[autogen]', autogen-core / autogen-ext): bind() injects into an AutoGen model client's underlying openai client (same defensive attribute search). Step visibility is a message-level InterventionHandler (make_intervention_handler()), pass-through only, so it stays an annotation layer, never a second capture path.
- Google ADK (pip install 'tracefork[adk]', Agent Development Kit): ADK's model calls go through the google-genai SDK, so bind() walks a short list of candidate attribute paths (the target itself, a genai.Client, an ADK Gemini model wrapper, or an LlmAgent whose .model already holds one) to find the google.genai BaseApiClient and swap in tracefork's httpx clients, using the same Gemini generateContent wire format providers/gemini.py already parses. Step visibility is a real BasePlugin (make_plugin(), installable via Runner(..., plugins=[plugin])) over agent/model/tool before/after boundaries, registered once for the whole run rather than threaded through every LlmAgent.
Each adapter's real-framework wrapper is import-guarded and validated against a
synthetic stand-in mimicking the framework's interface (never a live call) in the
offline test suite; the thin real subclasses are only reachable — and only
smoke-tested — when the framework is actually installed (pytest.importorskip).
AWS Bedrock (opt-in)
Bedrock is the outlier provider: boto3/botocore never touch httpx, so the transport
seam above can't see them, and Bedrock signs every request (SigV4) and streams over AWS's
own binary application/vnd.amazon.eventstream framing, not SSE. bedrock_transport.py is
a second, parallel seam for exactly this: it hooks botocore's own before-send
short-circuit (the mechanism botocore itself uses to skip a real network call), tees
request+response bytes into the same Tape/tape.py used everywhere else, and reuses
matcher.py's existing bedrock_matcher() preset to canonicalize away SigV4 signing
material (Authorization, X-Amz-Date, X-Amz-Security-Token) — a replay whose only
difference is a fresh signature/timestamp is not a false divergence; a real body or model
change still is.
from tracefork.bedrock_transport import BedrockTransport, default_sender
from tracefork.tape import Tape
tape = Tape()
transport = BedrockTransport("record", tape, sender=default_sender())
transport.register(bedrock_runtime_client.meta.events)
# ... call bedrock_runtime_client.invoke_model(...) normally; tees into `tape`.
boto3/botocore are optional (pip install 'tracefork[bedrock]'): the seam is
duck-typed against whatever prepared-request/event-emitter object it's handed
(.method/.url/.headers/.body, .register()/.emit()), so it needs zero
botocore import of its own, and the offline test suite exercises it entirely through a
synthetic botocore-shaped fake (synthetic.py's FakeAWSPreparedRequest/
FakeEventEmitter/ScriptedBedrockSender). providers/bedrock.py normalizes the
InvokeModel response — the Anthropic Messages shape verbatim — plus a best-effort read of
the Converse API shape; eventstream.py is a standalone, dependency-free
encoder/decoder of the AWS event-stream binary framing, proven by its own round-trip test.
Scope: non-streaming InvokeModel record/replay + SigV4 canonicalization + the
eventstream codec are proven end-to-end. Full streaming response replay through
botocore's own event-stream parsing machinery is not exercised — the codec round-trips
correctly in isolation, but wiring a replayed InvokeModelWithResponseStream call through a
real bedrock-runtime client's own parser is materially deeper than this seam's proven
contract. See bedrock_transport.py's module docstring for the precise boundary.
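For readers unfamiliar with the event-stream framing mentioned above, here is a minimal round-trip sketch of the publicly documented layout: an 8-byte prelude (total length, headers length), a CRC32 of the prelude, headers, payload, and a trailing CRC32 of everything before it. It handles string-typed headers only (value type 7) and is a simplified illustration, not eventstream.py; `encode_message` and `decode_message` are hypothetical names.

```python
import struct
import zlib

STRING_TYPE = 7  # header value type for UTF-8 strings


def encode_message(headers: dict[str, str], payload: bytes) -> bytes:
    header_bytes = b""
    for name, value in headers.items():
        n, v = name.encode("utf-8"), value.encode("utf-8")
        header_bytes += struct.pack("!B", len(n)) + n
        header_bytes += struct.pack("!BH", STRING_TYPE, len(v)) + v
    # 8-byte prelude + 4-byte prelude CRC + headers + payload + 4-byte message CRC
    total_length = 12 + len(header_bytes) + len(payload) + 4
    prelude = struct.pack("!II", total_length, len(header_bytes))
    body = prelude + struct.pack("!I", zlib.crc32(prelude)) + header_bytes + payload
    return body + struct.pack("!I", zlib.crc32(body))


def decode_message(data: bytes) -> tuple[dict[str, str], bytes]:
    total_length, headers_length = struct.unpack_from("!II", data, 0)
    (prelude_crc,) = struct.unpack_from("!I", data, 8)
    if total_length != len(data):
        raise ValueError("length mismatch")
    if zlib.crc32(data[:8]) != prelude_crc:
        raise ValueError("prelude CRC mismatch")
    (message_crc,) = struct.unpack_from("!I", data, total_length - 4)
    if zlib.crc32(data[:-4]) != message_crc:
        raise ValueError("message CRC mismatch")
    headers: dict[str, str] = {}
    pos, end = 12, 12 + headers_length
    while pos < end:
        name_len = data[pos]
        name = data[pos + 1 : pos + 1 + name_len].decode("utf-8")
        pos += 1 + name_len
        if data[pos] != STRING_TYPE:
            raise ValueError("this sketch only decodes string headers")
        (value_len,) = struct.unpack_from("!H", data, pos + 1)
        pos += 3
        headers[name] = data[pos : pos + value_len].decode("utf-8")
        pos += value_len
    return headers, data[end : total_length - 4]
```

A round trip in isolation proves the codec, not the full streaming path, which matches the scope stated above.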
Localhost record/replay proxy for non-Python clients (opt-in)
Every seam above patches something Python-side (httpx's transport, botocore's
before-send hook). None of that helps if the agent is curl, a Node/Go service, or
Python code you can't wrap. proxy.py is a localhost base-URL proxy: point the
client's base_url/endpoint at http://127.0.0.1:<port> instead of the provider
directly, and tracefork sits in between.
# record: forwards every request to the real upstream and tees it into a tape
uv run tracefork proxy record --tape run.tape.sqlite --upstream https://api.anthropic.com --port 8899
# point any client at the proxy instead of the provider, e.g.:
curl http://127.0.0.1:8899/v1/messages -H 'x-api-key: ...' -d '{...}'
# replay: serves the recorded bytes back, with NO upstream at all
uv run tracefork proxy replay --tape run.tape.sqlite --port 8899
This is a base-URL proxy, not a transparent TLS MITM. It does not generate a CA or
intercept a client's CONNECT tunnel — that needs the client to trust a custom root
cert, which is out of scope here. It works for anything that can set its own base
URL/endpoint (every major provider SDK, curl, any HTTP client), which is what "non-httpx
/ non-Python clients" means in practice.
Outside the full determinism boundary. Every other seam in this codebase captures
the agent's clock/id/random draws through the in-process NondetSource (nondet.py) so
replay is bit-exact regardless of what the agent reads. A non-Python client on the other
side of a TCP socket has no such seam — tracefork can't see, let alone virtualize,
whatever timestamp/UUID/idempotency-key material the client bakes into its own request.
So bit-exact replay through this proxy depends on the client sending a
canonically-identical request on both runs. If the client rotates something
call-to-call (a fresh idempotency key, a client-side timestamp), point --matcher at one
of the existing matcher.py presets (redacting, gemini, bedrock) — the same seam
transport.py already uses to normalize Gemini's ?key= or Bedrock's SigV4 headers —
so the volatile material is canonicalized away instead of causing a false divergence. An
unrecorded request, or a genuine body/field change the matcher doesn't normalize, is
still a hard error (HTTP 502) — replay must fail loud, not silently drift.
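The canonicalization idea can be sketched as a fingerprint that drops volatile material before hashing. Everything named here is an assumption for illustration: `VOLATILE_HEADERS`, `VOLATILE_BODY_FIELDS`, `canonical_fingerprint`, and `replay_status` are hypothetical, and the real presets in matcher.py define their own rules.

```python
import hashlib
import json

# Hypothetical lists of material a client may rotate call-to-call.
VOLATILE_HEADERS = {"authorization", "x-amz-date", "x-amz-security-token", "idempotency-key"}
VOLATILE_BODY_FIELDS = {"client_timestamp"}


def canonical_fingerprint(headers: dict[str, str], body: bytes) -> str:
    """Hash a request with its volatile headers and body fields removed."""
    kept_headers = {
        name.lower(): value
        for name, value in headers.items()
        if name.lower() not in VOLATILE_HEADERS
    }
    payload = json.loads(body)
    kept_body = {k: v for k, v in payload.items() if k not in VOLATILE_BODY_FIELDS}
    canonical = json.dumps({"headers": kept_headers, "body": kept_body}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_status(recorded: str, headers: dict[str, str], body: bytes) -> int:
    """200 when the canonical request matches the tape, 502 otherwise."""
    return 200 if canonical_fingerprint(headers, body) == recorded else 502


if __name__ == "__main__":
    recorded = canonical_fingerprint(
        {"Idempotency-Key": "run-1", "Content-Type": "application/json"},
        b'{"model": "m", "client_timestamp": 1}',
    )
    print(replay_status(
        recorded,
        {"Idempotency-Key": "run-2", "Content-Type": "application/json"},
        b'{"model": "m", "client_timestamp": 2}',
    ))
```

The trade-off is deliberate: anything on the volatile list can change without being noticed, so the list should stay as short as the client allows.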
Streaming (SSE) responses are teed chunk-by-chunk while forwarding in record mode, not
buffered in full before the client sees the first byte. The tape itself only ever stores
body bytes (like every other tape here), so replay recovers the SSE-vs-JSON distinction
with a small framing heuristic rather than a persisted header — see proxy.py's module
docstring for the exact rule. Storage/hashing reuse tape.py unchanged, and the
replay-time divergence check reuses matcher.py's existing RequestMatcher protocol —
nothing new was invented for either.
Validation scope
Read this section before trusting any accuracy number elsewhere in this README — it says
precisely what each one does and does not prove. The load-bearing, proven claim in this
project is the bit-exact, hash-verified replay substrate (replay --check, verify, the
spike receipt); the causal/blame claims below are validated on controlled, labeled
fixtures, not on real-world traces, and are scoped accordingly.
tracefork validate — is the engine genuinely causal? Yes, on a short control. Inject
an outcome-flipping fault at any step and the engine ranks that step first (verified by
also injecting at a non-root step), so the 1.00 top-1 precision is not a tautology or a
fixed-slot artifact. The five "fault classes" carry two real injection mechanisms (a
corrupted tool argument and a replaced text message) via a marker that survives the SDK's
JSON round-trip, and the negative control — a no-op perturbation that must not flip the
outcome — is enforced with a hard threshold (the run fails if it ever exceeds 0.30). What
it does not claim: discrimination among several competing plausible causes. The fixture
is a short tape where one step gets a flip-capable perturbation and the rest get an inert
one — a clean positive-vs-inert-control, but an easy one.
tracefork bench — does the engine discriminate among competing causes? Mostly, and
the one exception is named, not hidden. A longer, 7-exchange tape
(src/tracefork/competing_faults.py) carries several causally-distinct faults planted at
once, and measures the coalition/temporal-Shapley engine (blame.py's shapley_rank)
against each one's known ground truth:
| planted case | ground truth | engine's reading |
|---|---|---|
| a root cause (necessary and sufficient) | necessary, sufficient | matches |
| a downstream echo of the root — independently "sufficient" under naive single-step flip-rate (ties the root exactly), must not be blamed as root | sufficient, NOT necessary | matches |
| a two-part AND-conjunction — neither half alone is sufficient; both halves are genuinely necessary | necessary, NOT sufficient (both halves) | the later-joining half matches; the earlier half reads necessity=False |
| the same root cause re-run alongside the AND-conjunction (an over-determined run) | root: necessary + sufficient; conjunction halves: correctly NOT necessary, since the root alone already guarantees failure | matches |
| 4 unrelated decoy steps across the three scenarios above | neither necessary nor sufficient | matches |
8 of 9 cases resolve exactly as planted (Wilson 95% CI on that 8/9 at the CLI's
defaults, --k 3 --m-samples 2: roughly [0.56, 0.98] — small-n, read the interval, not
just the point estimate; tracefork bench prints it exactly). The one documented exception:
shapley_rank's necessity check is a temporal-order-restricted Shapley walk with
exactly one valid permutation (an explicit design trade-off — see the function's
docstring), so for a symmetric two-part conjunction it can only detect the marginal
contribution of the later-joining half; the earlier half is genuinely necessary too, but
its own marginal is measured before the conjunction completes, so it reads
necessity=False. tracefork bench reports this itself ([LIMITATION], never silently
passed), and it's pinned by
tests/test_competing_faults.py::test_temporal_order_undercredits_the_earlier_half_of_a_conjunction
— see src/tracefork/competing_faults.py's module docstring for the full mechanism.
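The limitation is easy to reproduce on paper. The toy below is not shapley_rank: it assumes a hypothetical outcome function in which the run fails only when both halves of a conjunction are active, then compares a single temporal-order walk against a full Shapley average over all orderings. All names are illustrative.

```python
from itertools import permutations


def conjunction_fails(active: frozenset) -> bool:
    """Toy outcome: the run fails only when both halves are present."""
    return {"and_a", "and_b"} <= active


def walk_marginals(order, outcome) -> dict[str, int]:
    """Add steps one at a time and credit each with the change it causes."""
    active: set[str] = set()
    marginals = {}
    for step in order:
        before = outcome(frozenset(active))
        active.add(step)
        marginals[step] = int(outcome(frozenset(active))) - int(before)
    return marginals


def full_shapley(players, outcome) -> dict[str, float]:
    """Average the walk marginals over every ordering of the players."""
    orders = list(permutations(players))
    totals = {p: 0.0 for p in players}
    for order in orders:
        for step, marginal in walk_marginals(order, outcome).items():
            totals[step] += marginal
    return {p: totals[p] / len(orders) for p in players}


if __name__ == "__main__":
    temporal_order = ["and_a", "decoy", "and_b"]  # and_a happens first in the run
    print("temporal walk:", walk_marginals(temporal_order, conjunction_fails))
    print("full Shapley: ", full_shapley(temporal_order, conjunction_fails))
```

In the single temporal walk the earlier half is measured before the conjunction can complete, so its marginal is zero and all credit lands on the later half. Averaging over every ordering splits the credit evenly, at the cost of evaluating orderings that could not have happened in the recorded run.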
Where this sits next to the field. Zhang et al., "Who&When: Uncover the Whodunit and
When of LLM Multi-Agent Failures" (ICML 2025), report
that log-based (single-pass, no re-execution) step attribution on their multi-agent
failure benchmark scores only ~14.2% top-1 — roughly the size of the gap tracefork's
fork-and-remeasure approach is aimed at. tracefork has not been run against Who&When's
actual data: no external dataset is downloaded anywhere in this repository, ever — offline
and $0 is non-negotiable (see CLAUDE.md). The 14.2% figure is printed by tracefork bench as context for the scale of the problem, not as a benchmark tracefork claims to
beat. validate's short control and bench's longer, multi-cause fixture are internal,
labeled, synthetic evidence — real signal about the instrument's own behavior, not a
substitute for evaluation on real multi-agent failure traces.
Read the numbers as: "the instrument reliably finds a single planted cause, and — with one named, structural exception — discriminates among several simultaneously planted causes on one longer run." Not: "tracefork resolves ambiguous multi-cause blame on real-world traces," and not: "tracefork scores some percentage on Who&When."
Layout
src/tracefork/ transport, tape, nondet, recorder, matcher, redact, fork, store,
blame, faults, validate, competing_faults, bench, report, server,
wire, synthetic, cli,
interop (OTel GenAI / OpenInference export+ingest),
observability (opt-in structlog + OTel self-instrumentation),
adapters/ (opt-in framework seam: LangChain/LangGraph, OpenAI
Agents SDK, CrewAI, AutoGen, Google ADK),
bedrock_transport (opt-in botocore before-send record/replay seam),
eventstream (standalone AWS event-stream binary framing codec),
proxy (opt-in localhost base-URL record/replay proxy for non-Python clients),
providers/ (anthropic, openai, gemini, bedrock adapters)
src/tracefork_spike/ the original bit-exact record/replay spike
web/report.html the single-file three-panel UI
examples/ runnable demo that produces the report above
tests/ 616 offline tests ($0, no key)
experiments/ committed reference report for `validate --check`
Testing
uv run pytest -q # all 616 offline tests
uv run pytest tests/test_faults.py -q # the self-validation chain
uv run pytest tests/test_competing_faults.py tests/test_bench.py -q # competing-cause discrimination
uv run tracefork validate --check # regression-gate vs committed report
uv run tracefork bench # competing-fault discrimination report
Contributing
Contributions are welcome — see CONTRIBUTING.md for dev setup,
the invariants a PR must respect, and commit/PR conventions. The whole dev loop
(tests, validate, lint, type-check) is offline and $0, so you can run the full gate
with no API key. Please also read the Code of Conduct.
Security
See SECURITY.md for how to report a vulnerability. In short: tapes
are JSON + base64 (never pickle, so loading one can't execute code), and tracefork serve binds to 127.0.0.1 only.
License
MIT — see LICENSE.