Record & replay the claude-agent-sdk wire for deterministic, offline tests.
Project description
Claude Agent Cassette
Record & replay the claude-agent-sdk
wire for deterministic, offline tests — no API key, no subprocess, no mocks.
Why
Apps built on claude-agent-sdk read a stream of typed messages (assistant turns,
tool results, task notifications, control-protocol frames) and drive logic off
them. The nasty bugs live at that stream → your-handler seam: the SDK emits a
slightly different shape than you expected, and your handler quietly does the
wrong thing.
Mocked tests can't catch this — you build the mock, so you only test your understanding of your own mock. A cassette records the real wire once and replays it through the SDK's real parser, so:
- a shape change in the SDK turns your test red instead of shipping to prod;
- tests run with no API cost, no network, no
claudesubprocess; - the replayed frames go through the genuine
message_parser, not a stand-in.
PRODUCTION: real CLI ──raw frames──► SDK parser ──► your code
▲
REPLAY: ReplayTransport ──raw frames──┘ (same parser, same code)
Install
pip install claude-agent-cassette # (or: uv add claude-agent-cassette)
Replay (the common case — offline, no key)
from claude_agent_cassette import replay, load_frames
async def test_my_handler():
async with replay(load_frames("tests/cassettes/happy_path.jsonl")) as client:
kinds = []
async for m in client.receive_messages():
kinds.append(type(m).__name__)
if kinds[-1] == "ResultMessage":
break # stream stays open after the result; break like the real wire
assert "ResultMessage" in kinds
# ...or feed client.receive_messages() into your own dispatcher and
# assert on what it produces.
A frames file is JSONL of raw inbound stream-json frames — the exact dicts the
CLI emits. replay() injects them into a real ClaudeSDKClient and answers the
SDK's initialize control handshake for you. (Vocabulary: a frame is a raw
wire dict; a message is the typed object the SDK parses it into; a tape
is a full duplex recording.)
Record (capture a real session)
record() works with both SDK entry points — the one-shot query()
and the interactive ClaudeSDKClient (it patches both transport-construction
sites the SDK uses):
from claude_agent_cassette import record, save_tape
# one-shot query()
from claude_agent_sdk import query
with record() as tape: # tees the full duplex wire
async for _ in query(prompt="...", options=...):
pass
save_tape(tape, "session.jsonl")
# interactive ClaudeSDKClient
from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient
with record() as tape:
async with ClaudeSDKClient(options=ClaudeAgentOptions()) as client:
await client.query("...")
async for _ in client.receive_messages():
pass
save_tape(tape, "session.jsonl")
record() captures both directions, including the control plane
(control_request/control_response, mcp_message, hook_callback, the
handshake), so one recording can feed both conversation replay and
control-protocol replay. Derive the conversation-only frames with
conversation_frames(tape).
Drift detection (gate SDK bumps)
Re-parse a cassette's message frames through the installed SDK's own
message_parser. A frame that no longer parses — or whose content blocks the
parser silently drops — is flagged. Because it reuses the SDK's own parser, there
is no schema to maintain: the judge is the thing being judged.
claude-agent-cassette drift tests/cassettes/ # *.jsonl files, or dirs of them
drift: 5 cassette(s) vs claude-agent-sdk 0.2.87
ok happy_path.jsonl
DRIFT stop_midtask.jsonl — 1 frame(s):
frame[3] assistant: content_dropped — 1 of 2 content block(s) dropped on parse
ok notification.jsonl
5 checked, 1 drifted (1 frame) — re-record the drifted cassettes.
- Exits non-zero on drift — use it to gate an SDK-bump PR in CI.
- Fails closed: if no cassette files are found it exits non-zero (a mispointed
path can't pass as a false green); pass
--allow-emptyto override. - Two cassette layouts in a directory: flat (top-level
*.jsonl) or nested (<name>/input.jsonl, where each cassette is a dir holding the recording plus sidecars). Nested is auto-detected; onlyinput.jsonlis checked, so siblingexpected.jsonl/meta.jsonare ignored, and a drift row is named by the cassette dir. Use--input-name FILEfor a different recording filename. A dir mixing both layouts is rejected (it can't silently check only half). - Three drift signals:
parse_error(the parser rejected the frame),unrecognized_type(the message type is gone),content_dropped(a content block silently vanished). - Scope: catches parse-level drift (rejected/skipped frames) + dropped content blocks. It does not catch additive field-level drift (a still-parsing frame that gained a field) — see ROADMAP.md.
In Python: parse_drift(frames) / check_drift(tape) → list[DriftFinding].
Control-protocol replay (the duplex wire)
replay_tape(tape, mode=...) replays a full duplex recording, including the control
plane, through a real ClaudeSDKClient. Break at the terminal ResultMessage (the
stream stays open after it, like the real wire):
from claude_agent_cassette import replay_tape, load_tape
async def test_permission_flow():
async with replay_tape(load_tape("session.jsonl"), mode="stub") as client:
async for m in client.receive_messages():
if type(m).__name__ == "ResultMessage":
break
mode="inert"(default) — conversation + Direction-A control replay: theinitialize/mcp_statushandshakes are answered from the recording; inbound Direction-B requests (can_use_tool/hook_callback/mcp_message) are dropped, so your registered callbacks stay inert.mode="stub"— also replay Direction-B: the recorded requests are delivered to the SDK and answered from the tape by stubs that replace yourcan_use_tool/ hooks / SDK MCP servers (formcp_message, a real in-process MCP server is synthesized from the recordedinitialize/tools/list/tools/calltraffic). Deterministic and inert — it certifies the recorded wire, not your policy.mode="verify"— the recorded Direction-B requests are delivered to your realcan_use_tool/hooks/ SDK MCP servers (nothing is replaced), and on exit each live decision is diffed against the recorded one — matched byrequest_id, at the wire. This certifies your policy still produces the recorded decisions: a changed decision or tool result, a callback that now raises (or no longer does), or an unanswered exchange is divergence.- Fail-closed end-to-end. In
"stub"and"verify"modes, any divergence from the tape — a live request with no recorded match, an exhausted or error decision, hook ids the SDK didn't reproduce, a live decision that differs from the recording, or recorded exchanges left unreplayed — raisesCassetteMismatchErrorwhen theasync withexits. (The SDK swallows callback exceptions into error responses, so the divergence is collected and surfaced on exit, not inside the callback.) A Direction-B subtype with no replay support (one a future SDK adds) raises up front — usemode="inert". - Recording a Direction-B tape needs the control decisions preserved.
scrub_tape(tape, replacements)blanks PII values while keeping decisions intact;lint_tape(tape)lints whether a tape is still replayable (run it after scrubbing). Seeexamples/record_permission_session.py.
Examples
examples/ has a runnable, no-key demo:
python examples/replay_cassette.py
# AssistantMessage:
# ResultMessage: Hello! How can I help?
It replays the saved examples/cassettes/hello_world.jsonl
through a real ClaudeSDKClient. (That cassette is a small, illustrative
hand-written sample with realistic wire shapes; real cassettes are recorded —
see above.)
The three recorder scripts each capture one Direction-B subtype as a
decision-preserving, scrubbed fixture (they spend a small API call to re-record;
the committed fixtures in examples/cassettes/ replay
offline):
record_permission_session.py—can_use_tool(allow, allow +updatedInputredirect, deny)record_hooks_session.py—hook_callback(PreToolUse)record_mcp_session.py—mcp_message(in-process MCP calculator; one normal + oneis_errortool result)
API
record() |
CM that wraps the SDK's transport to capture a session's full duplex wire as a tape |
replay(frames, options=None) |
async CM → a connected ClaudeSDKClient replaying raw frames |
replay_tape(tape, options=None, mode=...) |
async CM → replay a full duplex tape incl. the control plane; ReplayMode = "inert" | "stub" | "verify" |
save_tape(tape, path) / load_tape(path) |
tape I/O (JSONL) |
load_frames(path) |
load a frames file for replay() |
inbound_frames(tape) / conversation_frames(tape) |
derive frame views from a tape (all inbound / conversation-only) |
direction_b_exchanges(tape) → {subtype: [ControlExchange]} |
inspect the recorded Direction-B decisions (what was allowed/denied/answered) |
scrub_tape(tape, replacements) |
decision-preserving PII scrub for sharing a recording |
lint_tape(tape) |
lint a tape for Direction-B replayability (run after scrubbing) |
check_drift(tape) / parse_drift(frames) → list[DriftFinding] |
drift findings vs the installed SDK |
ReplayTransport(frames) / .from_tape(tape, keep_subtypes=None) |
the transport under replay/replay_tape, for wiring a client by hand |
RecordingTransport(inner, tape) |
passive MITM tee, both directions |
CassetteMismatchError |
replay diverged from the recording (always fail-closed) |
TapeEntry / Frame |
the tape entry and raw-frame types |
claude-agent-cassette drift <path…> |
CLI drift gate (non-zero on drift / empty) |
How it works (the non-obvious bits)
- Replay rides the public
TransportABC (ClaudeSDKClient(transport=...), stable since SDK 0.0.22). It's solid across versions. - The initialize handshake:
connect()writes acontrol_requestwith a freshrequest_idand blocks until it sees acontrol_responseechoing it. SoReplayTransportreads that id offwrite()and synthesises the response — otherwise replay hangs. - Record patches two sites:
ClaudeSDKClientdoes a call-time import of the transport from its source module, while one-shotquery()uses the name bound in_internal.client. Patching only one silently misses the other.
Compatibility
Replay uses only the public Transport API. Record and drift reach into
claude_agent_sdk._internal (the subprocess transport, control-protocol shape,
and message_parser), so they are version-sensitive — this release targets
claude-agent-sdk 0.2.x. Pin your SDK and re-verify on bumps. (Drift being
version-sensitive is the point: it tells you when a bump broke a cassette.)
Roadmap
See ROADMAP.md. Shipped: conversation replay, recording,
Direction-A control replay (ReplayTransport.from_tape), drift detection,
Direction-B replay for all three subtypes (can_use_tool / hook_callback /
mcp_message, in both mode="stub" and mode="verify"), and a
decision-preserving scrub (scrub_tape). Next up: interrupt lockstep, a pytest
plugin with record-on-miss, and field-level drift.
License
MIT.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file claude_agent_cassette-0.3.0.tar.gz.
File metadata
- Download URL: claude_agent_cassette-0.3.0.tar.gz
- Upload date:
- Size: 144.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b5f4dbc9a440eedf0a8580be6079b128a2b51dce8e9505dae8c77e1d6f9e060c
|
|
| MD5 |
1b6bddd85bb9a851758dab6849f55a60
|
|
| BLAKE2b-256 |
28069a222f0215b516abd2f765a1133df4cfd365c10b0935e5d17b01835ef7b4
|
File details
Details for the file claude_agent_cassette-0.3.0-py3-none-any.whl.
File metadata
- Download URL: claude_agent_cassette-0.3.0-py3-none-any.whl
- Upload date:
- Size: 35.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a0913013c7a4d3a041d9adf3e73bc01c394d6f2e446bffed04c114791391cd30
|
|
| MD5 |
9f6008c3805fcfec5b1c415a97508758
|
|
| BLAKE2b-256 |
84fd6179a463e2d88e114fbec8effbf7ae714ce3209f2245f0c6d28f035f9fbc
|