Framework-agnostic record/replay test harness for LLM agents

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

bigdragonbro

These details have not been verified by PyPI

Project description

llm-tape

Framework-agnostic record/replay test harness for LLM agents.

Record any agent run to a portable .tape.yaml file. Replay it in CI — no API keys, no flakiness, no cost. Assert on behavioral properties: which tools were called, in what order, how many steps, how many tokens.

Works with any framework that uses httpx under the hood: Anthropic SDK, OpenAI SDK, LangChain, LangGraph, CrewAI, raw httpx calls.

Why this exists

Testing LLM agents is a genuinely unsolved problem. The tools developers reach for — unit tests, mocks, LLM-as-judge eval frameworks — all fail in different ways:

Unit tests break the wrong thing. Mocking the LLM means you're testing your mock, not your agent. When the model changes its reasoning or tool-call patterns, mocked tests pass and production breaks.

Eval frameworks test quality, not behavior. Tools like DeepEval and MLflow are great for measuring response quality with LLM-as-judge. But they don't tell you whether your agent called the right tools in the right order, stayed within your token budget, or finished the task at all.

Every framework solves it differently. LangGraph has checkpointing. LangSmith has trace replay. But these only work within the LangChain ecosystem. If you switch frameworks, or use a bare anthropic SDK client, you start from scratch.

The result: most agent teams ship with no behavioral regression tests. When the model or prompt changes, they find out from users.

What llm-tape does differently

llm-tape intercepts at the HTTP transport layer — below any framework, above the network. It records exactly what the LLM was asked and what it said, stores that as a git-committable YAML file, and replays it deterministically. Your agent code runs unchanged; it just gets pre-recorded answers instead of live API calls.

Then you write assertions about structure, not content:

tape.assert_tool_called("read_file", before="write_summary")  # correct order
tape.assert_task_completed()                                   # didn't get stuck
tape.assert_steps_under(5)                                     # didn't loop
tape.assert_no_hallucinated_tools(["read_file", "write_summary"])  # stayed in scope

These tests run in CI with no API key, in milliseconds, and catch regressions before they reach production.

Install

pip install llm-tape

How it works

Your agent code
      │
      ▼
  httpx.Client / httpx.AsyncClient   ← llm-tape patches here
      │
      │  record mode: wraps real transport, saves each call to tape
      │  replay mode: returns recorded responses, no network
      ▼
  LLM API (Anthropic, OpenAI, ...)

Patching at the httpx transport layer means:

Zero changes to your agent code
Works with any SDK or framework that uses httpx
Sensitive headers (x-api-key, authorization) are scrubbed automatically

Full working example

The examples/summarize_agent/ directory contains a complete runnable example: a file-summarizer agent, a recording script, and a test suite.

The agent (`agent.py`)

import anthropic
from pathlib import Path

TOOLS = [
    {
        "name": "read_file",
        "description": "Read the contents of a local text file.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "write_summary",
        "description": "Write the final summary to a file on disk.",
        "input_schema": {
            "type": "object",
            "properties": {
                "content": {"type": "string"},
                "path": {"type": "string"},
            },
            "required": ["content", "path"],
        },
    },
]

def run_agent(task: str) -> str:
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=4096,
            messages=messages, tools=TOOLS,
        )
        # collect assistant content, run tools, feed results back...
        if response.stop_reason == "end_turn":
            return response.content[0].text
        # ... (see examples/summarize_agent/agent.py for the full loop)

Step 1 — Record once (requires API key)

ANTHROPIC_API_KEY=sk-... python examples/summarize_agent/record_tape.py

This runs the real agent, captures every LLM call, and writes tapes/summarize_document.tape.yaml. Commit it:

git add tapes/summarize_document.tape.yaml
git commit -m "chore: record summarize_agent tape"

Step 2 — Test forever after (no API key)

# examples/summarize_agent/test_agent.py

import llm_tape
from agent import run_agent

TAPE = "tapes/summarize_document.tape.yaml"

@pytest.fixture
def result():
    with llm_tape.replay(TAPE) as tape:
        text = run_agent("Please summarize the file notes.txt")
    return text, tape

def test_reads_file_before_writing(result):
    _, tape = result
    tape.assert_tool_called("read_file", before="write_summary")

def test_completes_task(result):
    _, tape = result
    tape.assert_task_completed()

def test_efficient(result):
    _, tape = result
    tape.assert_steps_under(5)

def test_no_hallucinated_tools(result):
    _, tape = result
    tape.assert_no_hallucinated_tools(["read_file", "write_summary"])

pytest examples/summarize_agent/test_agent.py
# runs in <100ms, zero API calls, works in CI

Quick start (any agent)

import llm_tape

# Record
with llm_tape.record("tapes/my_agent.tape.yaml"):
    result = my_agent.run("do the thing")

# Replay + assert
with llm_tape.replay("tapes/my_agent.tape.yaml") as tape:
    result = my_agent.run("do the thing")
    tape.assert_tool_called("search")
    tape.assert_task_completed()
    tape.assert_steps_under(10)

pytest markers

@pytest.mark.tape("tapes/my_agent.tape.yaml")
def test_agent_behavior(tape_replay):
    my_agent.run("do the thing")
    tape_replay.assert_tool_called("search", before="write_file")
    tape_replay.assert_task_completed()

Streaming agents

SSE streaming is fully supported — chunks are captured and replayed exactly as recorded:

# Record a streaming run
with llm_tape.record("tapes/streaming.tape.yaml"):
    with client.messages.stream(model="claude-sonnet-4-6", ...) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)

# Replay — same streaming interface, no API call
with llm_tape.replay("tapes/streaming.tape.yaml") as tape:
    with client.messages.stream(...) as stream:
        full_text = stream.get_final_text()
    tape.assert_task_completed()

How tapes fit into your workflow

A tape is a snapshot of Claude's behavior at a point in time. Once committed to git, every future test run replays that snapshot — no API key, no cost, no flakiness.

Developer (once, needs API key)         CI (every PR, no API key needed)
───────────────────────────────         ─────────────────────────────────
python record_tape.py                   pytest test_agent.py
        │                                       │
        ▼                                       ▼
  real Claude API call              tape replayed locally
        │                                       │
        ▼                                       ▼
tapes/my_agent.tape.yaml ──git commit──▶ assertions checked
                                          (~0.5s, $0.00)

What breaks these tests?

Not random LLM variation — the tape freezes the model's responses. What breaks them is your code changing in a way that alters agent behavior: a refactored tool loop, a changed prompt, a new tool added. That's exactly what you want CI to catch.

When do you re-record?

Intentionally, when you mean to change behavior:

You updated the system prompt
You added or renamed a tool
You want to capture improved model behavior after a model upgrade

Run record_tape.py again, review the tape diff in your PR, and commit. The diff tells reviewers exactly how Claude's behavior changed — which tools it now calls, in what order, how many tokens it used. This makes behavioral changes explicit and reviewable instead of silent.

Tapes in CI (no API key required)

The Anthropic SDK validates that an API key exists before making any call. In replay mode no real call is made, but you still need a non-empty value to pass that check. Set a dummy key in your test fixture:

@pytest.fixture
def result(monkeypatch):
    monkeypatch.setenv("ANTHROPIC_API_KEY", "test-replay-key")
    with llm_tape.replay("tapes/my_agent.tape.yaml") as tape:
        text = run_agent("do the thing")
    return text, tape

In CI, add ANTHROPIC_API_KEY: test-replay-key to your workflow environment — no real credentials needed.

Tape format

Tapes are plain YAML — designed to be committed, reviewed in PRs, and diffed:

version: '1'
metadata:
  recorded_at: '2026-06-24T10:00:00Z'
interactions:
  - id: call_0
    duration_ms: 412.3
    request:
      method: POST
      url: https://api.anthropic.com/v1/messages
      headers:
        x-api-key: '***'          # always scrubbed
        content-type: application/json
      body:
        model: claude-sonnet-4-6
        messages:
          - role: user
            content: Please summarize the file notes.txt
    response:
      status: 200
      body:
        stop_reason: tool_use
        content:
          - type: tool_use
            name: read_file
            input:
              path: notes.txt
        usage:
          input_tokens: 312
          output_tokens: 48

See tapes/summarize_document.tape.yaml for a full 3-call example.

All assertions

Method	What it checks
`assert_tool_called(name)`	Tool was called at least once
`assert_tool_called(name, before=other)`	Tool was called before another
`assert_tool_called(name, after=other)`	Tool was called after another
`assert_tool_not_called(name)`	Tool was never called
`assert_tool_input(name, **fields)`	Tool was called with specific input values
`assert_no_hallucinated_tools(allowed)`	Agent only called tools from this list
`assert_steps_under(n)`	Agent finished in ≤ n LLM calls
`assert_total_tokens_under(n)`	Total tokens (input + output) ≤ n
`assert_task_completed()`	Final stop reason is `end_turn`
`assert_stop_reason(reason)`	Final stop reason matches exactly
`diff(other_tape)`	Returns human-readable list of behavioral differences

Limitations

Sequential replay only. Tapes replay interactions in the order they were recorded. If your agent makes truly parallel LLM calls (concurrent asyncio.gather across multiple clients), the replay order may not match. Parallel tool execution within a single response is fine.
Frameworks that pre-create clients. If a framework constructs its httpx.Client before the record()/replay() block is entered, that client won't be intercepted. Workaround: pass a transport explicitly at construction time, or initialize the framework inside the block.
Non-httpx transports. Libraries that use urllib3, aiohttp, or requests directly aren't intercepted. Most major LLM SDKs use httpx; check your framework's dependencies if unsure.

Contributing

git clone https://github.com/bigdragonbro/llm-tape
cd llm-tape
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest

PRs welcome. Current roadmap:

Parallel call support (non-sequential replay)
Pre-built injection helpers for LangChain / CrewAI early-init
llm-tape diff CLI for comparing two tape files
VS Code extension for tape visualization

License

MIT

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

bigdragonbro

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.0

Jun 25, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_tape-0.1.0.tar.gz (24.4 kB view details)

Uploaded Jun 25, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

llm_tape-0.1.0-py3-none-any.whl (15.8 kB view details)

Uploaded Jun 25, 2026 Python 3

File details

Details for the file llm_tape-0.1.0.tar.gz.

File metadata

Download URL: llm_tape-0.1.0.tar.gz
Upload date: Jun 25, 2026
Size: 24.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_tape-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e52188c808c95d5698c003eabdf2b23a00485768a0d351e7378d54aca641366e`
MD5	`d76e00d480d1ac288ff3af50506ef3c2`
BLAKE2b-256	`a491f386bcdf244a3cc22928c6a6362fdf84ab1e1bbd83476053ac7a78cbe756`

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_tape-0.1.0.tar.gz:

Publisher: publish.yml on bigdragonbro/llm-tape

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_tape-0.1.0.tar.gz
- Subject digest: e52188c808c95d5698c003eabdf2b23a00485768a0d351e7378d54aca641366e
- Sigstore transparency entry: 1949066068
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: bigdragonbro/llm-tape@6cd6e58529011e22cb02f771c4a9d96a70c7c82a
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/bigdragonbro
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6cd6e58529011e22cb02f771c4a9d96a70c7c82a
- Trigger Event: push

File details

Details for the file llm_tape-0.1.0-py3-none-any.whl.

File metadata

Download URL: llm_tape-0.1.0-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 15.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_tape-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a91b2d3fb044eecf5165eeb228bf0c0f63e3f748982fc6a60bc3e264a75598a9`
MD5	`c274c65e83f5ddfefe6f899885c890b3`
BLAKE2b-256	`dc7035705bf253eaea562bc33f6f661867de47af7e6a377e3dcd289ab542b242`

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_tape-0.1.0-py3-none-any.whl:

Publisher: publish.yml on bigdragonbro/llm-tape

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_tape-0.1.0-py3-none-any.whl
- Subject digest: a91b2d3fb044eecf5165eeb228bf0c0f63e3f748982fc6a60bc3e264a75598a9
- Sigstore transparency entry: 1949066241
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: bigdragonbro/llm-tape@6cd6e58529011e22cb02f771c4a9d96a70c7c82a
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/bigdragonbro
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6cd6e58529011e22cb02f771c4a9d96a70c7c82a
- Trigger Event: push

llm-tape 0.1.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

llm-tape

Why this exists

What llm-tape does differently

Install

How it works

Full working example

The agent (agent.py)

Step 1 — Record once (requires API key)

Step 2 — Test forever after (no API key)

Quick start (any agent)

pytest markers

Streaming agents

How tapes fit into your workflow

What breaks these tests?

When do you re-record?

Tapes in CI (no API key required)

Tape format

All assertions

Limitations

Contributing

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

The agent (`agent.py`)