
Deterministic agent test recorder and replayer. Record live runs, replay as mocks. Zero dependencies.


agentcassette



Record a real agent run once, replay it forever as a mock — no network, no cost, fully deterministic. Like VCR/pytest-recording, but purpose-built for LLM agents and with zero dependencies.

import agentcassette
from agentcassette import record, replay

call_model = agentcassette.intercept(call_model, kind="llm")

# Record a real run once:
with record("cassettes/flight_search.json"):
    my_agent.run("Find flights to NYC under $300")

# Replay it in tests — no API calls, no tokens spent, same result every time:
def test_flight_search():
    with replay("cassettes/flight_search.json"):
        result = my_agent.run("Find flights to NYC under $300")
    assert result.success

Why agentcassette?

Testing agents is painful. Live LLM calls are expensive (every test run costs money), non-deterministic (a different answer each time), and slow (seconds per call). So most teams either skip agent testing or maintain a costly, flaky integration suite.

agentcassette records the real calls an agent makes into a plain-JSON cassette, then replays them on demand. Your tests become fast, free, and deterministic — and you can assert on exactly what the agent did.

Unlike VCR-style tools that monkey-patch the HTTP layer, agentcassette uses an explicit, honest seam: you wrap the callables you want captured. That keeps it provider-agnostic (OpenAI, Anthropic, Gemini, a raw requests call, or a local model all work identically) and truly zero-dependency.

Installation

pip install agentcassette

Requires Python 3.9+. No other dependencies, ever.

Quick Start

1. Wrap what you want captured

Wrap your model-call function once (and any tools you want taped). Outside a record/replay block, wrapped callables behave exactly like the original — safe to leave in production code.

import agentcassette

# As a wrapper:
call_model = agentcassette.intercept(call_model, kind="llm")

# Or as a decorator:
@agentcassette.intercept(kind="tool")
def search_web(query: str) -> list[str]:
    ...

2. Record a real run

from agentcassette import record

with record("cassettes/flight_search.json", model="claude-sonnet-4-6"):
    my_agent.run("Find flights to NYC under $300")
# Cassette is written on clean exit.

3. Replay it in your tests

from agentcassette import replay

def test_flight_search():
    with replay("cassettes/flight_search.json"):
        result = my_agent.run("Find flights to NYC under $300")
    assert result.success

During replay, every intercepted call returns its recorded result and the real function is never called.
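
To make the record-then-replay idea concrete, here is a minimal, self-contained sketch. It is not agentcassette's implementation: Tape, TAPE, and intercept_sketch are hypothetical names invented for illustration, and an in-memory list stands in for the JSON cassette.

```python
import functools


class Tape:
    """Hypothetical in-memory stand-in for a cassette."""

    def __init__(self):
        self.steps = []    # recorded (name, result) pairs, in call order
        self.mode = None   # None = pass-through, "record", or "replay"
        self.cursor = 0    # index of the next step to serve during replay


TAPE = Tape()


def intercept_sketch(fn):
    """Wrap fn so its calls can be taped and later served from the tape."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if TAPE.mode == "replay":
            # Serve the recorded result; the real function is never called.
            _name, result = TAPE.steps[TAPE.cursor]
            TAPE.cursor += 1
            return result
        result = fn(*args, **kwargs)
        if TAPE.mode == "record":
            TAPE.steps.append((fn.__name__, result))
        return result

    return wrapper


real_calls = []  # lets us observe whether the real function ran


@intercept_sketch
def call_model(prompt):
    real_calls.append(prompt)
    return {"text": f"echo: {prompt}"}


TAPE.mode = "record"
recorded = call_model("plan the task")

TAPE.mode = "replay"
replayed = call_model("plan the task")
```

After both calls, real_calls holds a single entry: the replayed call was answered from the tape.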

Async agents

intercept detects async def callables and returns an awaitable wrapper, so async agents work the same way — including a mix of async and sync tools in one run:

import asyncio

import agentcassette
from agentcassette import record, replay

acall_model = agentcassette.intercept(acall_model, kind="llm")  # an async def

async def agent(task):
    plan = await acall_model(f"plan: {task}")
    ...

with record("cassettes/run.json"):
    asyncio.run(agent("book a trip"))

with replay("cassettes/run.json"):
    asyncio.run(agent("book a trip"))   # awaited calls served from the cassette
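
One plausible way a wrapper can tell async def callables from regular ones is the standard library's inspect.iscoroutinefunction. The sketch below only illustrates that technique; wrap_sketch and the calls list are hypothetical, and agentcassette's own detection logic may differ.

```python
import asyncio
import functools
import inspect

calls = []  # names of wrapped callables, in the order they were invoked


def wrap_sketch(fn):
    """Return an async wrapper for async def callables, a sync one otherwise."""
    if inspect.iscoroutinefunction(fn):

        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            calls.append(fn.__name__)
            return await fn(*args, **kwargs)

        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        calls.append(fn.__name__)
        return fn(*args, **kwargs)

    return sync_wrapper


@wrap_sketch
async def acall_model(prompt):
    await asyncio.sleep(0)  # stand-in for a network round trip
    return f"plan for {prompt}"


@wrap_sketch
def lookup(city):
    return city.upper()


async def agent(task):
    plan = await acall_model(task)  # async call
    return plan, lookup("nyc")      # sync tool in the same run


result = asyncio.run(agent("book a trip"))
```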

Catching regressions with strict replay

By default, replay serves recorded results best-effort and collects any divergences. With strict=True, a call whose name or arguments differ from the recording raises DivergenceError — turning your cassette into a behavioral contract.

from agentcassette import replay, DivergenceError

with replay("cassettes/flight_search.json", strict=True):
    my_agent.run("Find flights to NYC under $300")   # raises on drift

Best-effort mode exposes what changed without failing:

with replay("cassettes/flight_search.json") as player:
    my_agent.run("Find flights to NYC under $300")

for d in player.divergences:
    print(d["index"], d["expected"], "->", d["actual"])
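
The difference between the two modes can be sketched in a few lines. This is an illustrative assumption about the comparison, not the library's code: check_step and DivergenceSketchError are hypothetical, and only the index/expected/actual keys come from the example above.

```python
class DivergenceSketchError(Exception):
    """Hypothetical stand-in for the library's DivergenceError."""


def check_step(index, recorded, actual, strict, divergences):
    """Compare one replayed call against the step recorded at the same index."""
    same = (
        recorded["name"] == actual["name"]
        and recorded["arguments"] == actual["arguments"]
    )
    if same:
        return
    if strict:
        # Strict mode: fail immediately on the first mismatch.
        raise DivergenceSketchError(f"step {index} diverged from the recording")
    # Best-effort mode: remember the mismatch and keep going.
    divergences.append({"index": index, "expected": recorded, "actual": actual})


recorded = {"name": "call_model",
            "arguments": {"args": ["NYC under $300"], "kwargs": {}}}
drifted = {"name": "call_model",
           "arguments": {"args": ["NYC under $250"], "kwargs": {}}}

found = []
check_step(0, recorded, drifted, strict=False, divergences=found)

try:
    check_step(0, recorded, drifted, strict=True, divergences=[])
    raised = False
except DivergenceSketchError:
    raised = True
```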

Using with pytest

agentcassette ships an optional pytest plugin (auto-registered — no config). Request the cassette fixture: it records on the first run, then replays on every run after. No cassette to manage by hand.

import agentcassette

call_model = agentcassette.intercept(call_model, kind="llm")

def test_flight_search(cassette):
    result = my_agent.run("Find flights to NYC under $300")
    assert result.success

Cassettes default to <test dir>/cassettes/<test name>.json.

Record modes — via --record-mode:

Mode            Behavior
once (default)  Replay if a cassette exists, otherwise record it
none            Replay only; fail if the cassette is missing (use in CI to forbid accidental recording)
all             Always re-record, overwriting the cassette

pytest                     # record missing cassettes, replay the rest
pytest --record-mode=all   # re-record everything (e.g. after an intended change)
pytest --record-mode=none  # CI: fail if any cassette is missing
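
The three modes reduce to a small decision on the mode and on whether the cassette file exists. The function below is a sketch of that decision only; resolve_mode is a hypothetical helper, and the exception the real plugin raises for a missing cassette is an assumption here.

```python
def resolve_mode(record_mode, cassette_exists):
    """Return "record" or "replay" for a test, following the modes listed above."""
    if record_mode == "all":
        return "record"  # always re-record, overwriting any cassette
    if record_mode == "none":
        if not cassette_exists:
            # Assumed error type; the plugin's actual failure may look different.
            raise FileNotFoundError("cassette missing and recording is disabled")
        return "replay"
    if record_mode == "once":
        return "replay" if cassette_exists else "record"
    raise ValueError(f"unknown record mode: {record_mode!r}")


first_run = resolve_mode("once", cassette_exists=False)
later_run = resolve_mode("once", cassette_exists=True)
```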

Per-test overrides with the cassette marker:

import pytest

@pytest.mark.cassette(record_mode="all", strict=True,
                      redact=["api_key"], path="tapes/search.json")
def test_search(cassette):
    ...

With strict=True, a replayed call that diverges from the recording fails the test — turning the cassette into a regression guard. The fixture yields the active Recorder (recording) or Player (replaying) for inspection.

The plugin needs pytest, which you can install with pip install "agentcassette[pytest]" (it is also included in the [dev] extra). Importing agentcassette itself never imports pytest, so the library stays zero-dependency.

Inspecting cassettes

from agentcassette import Cassette

c = Cassette.load("cassettes/flight_search.json")
c.num_steps            # number of intercepted calls
c.total_input_tokens   # summed across steps
c.total_output_tokens
c.total_tokens
c.duration_ms          # wall time of the original run

c.redact("api_key")    # scrub secrets before committing to git
c.save("cassettes/flight_search.json")

Token counts use exact usage blocks when the recorded response carries one (OpenAI usage.prompt_tokens, Anthropic usage.input_tokens, …), falling back to a deterministic ~4-chars-per-token heuristic otherwise.
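
A rough sketch of that fallback logic for output tokens follows. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that non-string results are measured by their JSON text, and that the heuristic rounds down.

```python
import json


def heuristic_tokens(value):
    """Approximate token count at ~4 characters per token (rounding down)."""
    text = value if isinstance(value, str) else json.dumps(value, sort_keys=True)
    return len(text) // 4


def output_tokens(result):
    """Prefer an exact usage block; otherwise fall back to the heuristic."""
    usage = result.get("usage") if isinstance(result, dict) else None
    if isinstance(usage, dict):
        # OpenAI-style and Anthropic-style field names for generated tokens.
        for key in ("completion_tokens", "output_tokens"):
            if key in usage:
                return usage[key]
    return heuristic_tokens(result)


exact = output_tokens(
    {"text": "...", "usage": {"input_tokens": 420, "output_tokens": 88}}
)
approx = output_tokens("a" * 40)  # no usage block, so 40 chars // 4
```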

Redacting secrets

Scrub sensitive keys either when recording or after loading:

# At record time:
with record("cassettes/run.json", redact=["api_key", "authorization"]):
    my_agent.run(task)

# Or later:
Cassette.load("cassettes/run.json").redact("api_key").save("cassettes/run.json")
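
Scrubbing a key at any depth amounts to a recursive walk over dicts and lists. The sketch below shows that walk on a step-shaped dict; redact_key is a hypothetical helper, not the library's method, and unlike Cassette.redact it returns a scrubbed copy rather than operating on a cassette object.

```python
def redact_key(value, key, replacement="****"):
    """Return a copy of value with every entry stored under key replaced."""
    if isinstance(value, dict):
        return {
            k: replacement if k == key else redact_key(v, key, replacement)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact_key(item, key, replacement) for item in value]
    return value  # primitives are left untouched


step = {
    "arguments": {
        "args": [{"headers": {"api_key": "sk-123"}}],
        "kwargs": {"api_key": "sk-123", "q": "flights"},
    }
}
clean = redact_key(step, "api_key")
```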

Diffing runs

Compare two cassettes to see how an agent's behavior drifted between versions:

from agentcassette import diff_cassettes

delta = diff_cassettes("cassettes/v1.json", "cassettes/v2.json")
delta.new_calls          # call names in v2 but not v1
delta.dropped_calls      # call names in v1 but not v2
delta.changed_calls      # same-position steps whose args/results changed
delta.token_delta        # total token change (v2 - v1)
delta.identical          # True if nothing changed
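
To show what these fields mean, here is a sketch that computes them by hand from two cassette-shaped dicts. diff_sketch and step are hypothetical helpers; in particular, treating changed_calls as a list of step indices where the name matches but arguments or results differ is an assumption about the library's semantics.

```python
def step(name, args, result, inp=0, out=0):
    """Build a minimal step dict in the cassette's step shape."""
    return {"name": name, "arguments": {"args": args, "kwargs": {}},
            "result": result, "input_tokens": inp, "output_tokens": out}


def diff_sketch(a, b):
    """Compare two cassette dicts and summarize how the run changed."""
    names_a = {s["name"] for s in a["steps"]}
    names_b = {s["name"] for s in b["steps"]}
    changed = [
        i for i, (sa, sb) in enumerate(zip(a["steps"], b["steps"]))
        if sa["name"] == sb["name"]
        and (sa["arguments"], sa["result"]) != (sb["arguments"], sb["result"])
    ]

    def tokens(cassette):
        return sum(s["input_tokens"] + s["output_tokens"]
                   for s in cassette["steps"])

    return {
        "new_calls": sorted(names_b - names_a),
        "dropped_calls": sorted(names_a - names_b),
        "changed_calls": changed,
        "token_delta": tokens(b) - tokens(a),
    }


v1 = {"steps": [step("call_model", ["plan"], {"text": "search first"}, 10, 5),
                step("search_web", ["nyc flights"], ["UA 101"])]}
v2 = {"steps": [step("call_model", ["plan"], {"text": "book directly"}, 10, 7),
                step("book_flight", ["UA 101"], {"ok": True})]}

delta = diff_sketch(v1, v2)
```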

Cassette format

Cassettes are plain, human-readable JSON — diffable and safe to commit:

{
  "version": 1,
  "recorded_at": "2026-06-30T12:00:00Z",
  "model": "claude-sonnet-4-6",
  "duration_ms": 1832.4,
  "steps": [
    {
      "index": 0,
      "type": "llm",
      "name": "call_model",
      "arguments": {"args": ["plan the task"], "kwargs": {}},
      "result": {"text": "...", "usage": {"input_tokens": 420, "output_tokens": 88}},
      "input_tokens": 420,
      "output_tokens": 88,
      "duration_ms": 512.0
    }
  ]
}

Every intercepted call becomes one step, in the exact order it happened.
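
Because the format is plain JSON, a cassette can also be inspected with nothing but the standard library. The sketch below parses the example above from a string (a real cassette would be read from its file) and pulls out the call sequence and token total.

```python
import json

raw = """
{
  "version": 1,
  "recorded_at": "2026-06-30T12:00:00Z",
  "model": "claude-sonnet-4-6",
  "duration_ms": 1832.4,
  "steps": [
    {
      "index": 0,
      "type": "llm",
      "name": "call_model",
      "arguments": {"args": ["plan the task"], "kwargs": {}},
      "result": {"text": "...", "usage": {"input_tokens": 420, "output_tokens": 88}},
      "input_tokens": 420,
      "output_tokens": 88,
      "duration_ms": 512.0
    }
  ]
}
"""

cassette = json.loads(raw)

# The ordered list of calls the agent made.
calls = [(s["index"], s["type"], s["name"]) for s in cassette["steps"]]

# Token total summed across steps.
total_tokens = sum(s["input_tokens"] + s["output_tokens"]
                   for s in cassette["steps"])
```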

API Reference

intercept(fn=None, *, name=None, kind="call")

Marks a callable as recordable/replayable. Usable as intercept(fn), intercept(fn, kind="llm"), or as a decorator. Works on both regular functions and async def coroutine functions (async callables get an awaitable wrapper). kind is a free-form label stored on each step (e.g. "llm", "tool"). Outside a session, the wrapped callable is a transparent pass-through.

record(path, *, model=None, redact=None)

Context manager. Records every intercepted call made inside the block to path, written on clean exit only. redact is a list of key names to scrub before saving. Yields the Recorder.

replay(path, *, strict=False)

Context manager. Serves recorded results for intercepted calls without running the real functions. strict=True raises DivergenceError on any mismatch. Yields the Player (with .divergences, .remaining, .cursor).

Cassette

Member                                                      Description
Cassette.load(path)                                         Load from disk (raises CassetteNotFound)
.save(path)                                                 Write pretty-printed JSON, creating parent dirs
.num_steps                                                  Number of recorded steps
.total_input_tokens / .total_output_tokens / .total_tokens  Token totals
.duration_ms                                                Wall time of the recorded run
.redact(key, replacement="****")                            Scrub every value under key, at any depth

diff_cassettes(a, b) -> CassetteDiff

Compare two cassettes (paths or Cassette objects). Returns a CassetteDiff with new_calls, dropped_calls, changed_calls, token_delta, input_token_delta, output_token_delta, step_delta, and identical.

Exceptions

All inherit from AgentCassetteError:

Exception         Raised when
CassetteNotFound  Replaying a path that doesn't exist
ReplayExhausted   The agent makes more calls than the cassette recorded
DivergenceError   A strict replay sees a call that differs from the recording

Notes & limitations

  • Replayed results are JSON. Recorded values round-trip through JSON, so on replay you get plain dicts/lists/primitives, not the original SDK objects. For typical LLM responses (dicts) this is exactly what you want.
  • Ordering matters. Calls replay in the order they were recorded. agentcassette matches sequentially, which is deterministic and mirrors how an agent actually executes. Truly concurrent calls (e.g. asyncio.gather) are recorded in completion order; if that order isn't stable across runs, replay matching is best-effort — record such sections sequentially if you need strict determinism.
  • Sync and async. Both def and async def callables are supported. record/replay sessions are thread-local and cover the event loop running on that thread; if your agent fans out across OS threads, open a separate record/replay block in each thread.
  • Streaming responses (token iterators) are not specially handled yet — wrap at a boundary where the response is already materialized.
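
The JSON round-trip point in the first note is easy to see with the standard library alone. In this sketch, Completion is a hypothetical SDK-style response object, and converting it with dataclasses.asdict is an assumption made for illustration; how agentcassette serializes non-dict results is not specified here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Completion:
    """Hypothetical SDK-style response object."""
    text: str
    stop_reasons: tuple


original = Completion(text="Found 3 flights", stop_reasons=("end_turn",))

# What a record/replay cycle through JSON does to the value.
replayed = json.loads(json.dumps(asdict(original)))

is_plain_dict = isinstance(replayed, dict)                  # no longer a Completion
tuple_became_list = isinstance(replayed["stop_reasons"], list)
```

Code that runs under replay should therefore read results with dict access (replayed["text"]) rather than attribute access (original.text).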

Contributing

See CONTRIBUTING.md.

License

MIT — see LICENSE.


Part of the aenealabs AI agent toolkit.
