Deterministic agent test recorder and replayer. Record live runs, replay as mocks. Zero dependencies.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

lavrut

These details have not been verified by PyPI

Project description

agentcassette

Deterministic agent test recorder and replayer.

Record a real agent run once, replay it forever as a mock — no network, no cost, fully deterministic. Like VCR/pytest-recording, but purpose-built for LLM agents and with zero dependencies.

import agentcassette
from agentcassette import record, replay

call_model = agentcassette.intercept(call_model, kind="llm")

# Record a real run once:
with record("cassettes/flight_search.json"):
    my_agent.run("Find flights to NYC under $300")

# Replay it in tests — no API calls, no tokens spent, same result every time:
def test_flight_search():
    with replay("cassettes/flight_search.json"):
        result = my_agent.run("Find flights to NYC under $300")
    assert result.success

Why agentcassette?

Testing agents is painful. Live LLM calls are expensive (every test run costs money), non-deterministic (a different answer each time), and slow (seconds per call). So most teams either skip agent testing or maintain a costly, flaky integration suite.

agentcassette records the real calls an agent makes into a plain-JSON cassette, then replays them on demand. Your tests become fast, free, and deterministic — and you can assert on exactly what the agent did.

Unlike VCR-style tools that monkey-patch the HTTP layer, agentcassette uses an explicit seam: you wrap the callables you want captured. That keeps it provider-agnostic (OpenAI, Anthropic, Gemini, a raw requests call, or a local model all work identically) and free of any third-party dependency.

Installation

pip install agentcassette

Requires Python 3.9+. No other dependencies, ever.

Quick Start

1. Wrap what you want captured

Wrap your model-call function once (and any tools you want taped). Outside a record/replay block, wrapped callables behave exactly like the original — safe to leave in production code.

import agentcassette

# As a wrapper:
call_model = agentcassette.intercept(call_model, kind="llm")

# Or as a decorator:
@agentcassette.intercept(kind="tool")
def search_web(query: str) -> list[str]:
    ...

2. Record a real run

from agentcassette import record

with record("cassettes/flight_search.json", model="claude-sonnet-4-6"):
    my_agent.run("Find flights to NYC under $300")
# Cassette is written on clean exit.

3. Replay it in your tests

from agentcassette import replay

def test_flight_search():
    with replay("cassettes/flight_search.json"):
        result = my_agent.run("Find flights to NYC under $300")
    assert result.success

During replay, every intercepted call returns its recorded result and the real function is never called.

Async agents

intercept detects async def callables and returns an awaitable wrapper, so async agents work the same way — including a mix of async and sync tools in one run:

import asyncio
import agentcassette
from agentcassette import record, replay

acall_model = agentcassette.intercept(acall_model, kind="llm")  # an async def

async def agent(task):
    plan = await acall_model(f"plan: {task}")
    ...

with record("cassettes/run.json"):
    asyncio.run(agent("book a trip"))

with replay("cassettes/run.json"):
    asyncio.run(agent("book a trip"))   # awaited calls served from the cassette
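A wrapper can pick sync or async behavior by inspecting the callable it is given, which is presumably how a single `intercept` handles both. The sketch below uses the standard library's `inspect.iscoroutinefunction`; `wrap_either` and the `log` list are hypothetical and only illustrate the dispatch, not agentcassette's internals.

```python
import asyncio
import functools
import inspect

log = []  # hypothetical capture target standing in for a recorder


def wrap_either(fn):
    """Return an awaitable wrapper for async def functions, a plain one otherwise."""
    if inspect.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            result = await fn(*args, **kwargs)  # await the real coroutine
            log.append((fn.__name__, result))
            return result
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        log.append((fn.__name__, result))
        return result
    return sync_wrapper


@wrap_either
async def acall_model(prompt):
    await asyncio.sleep(0)  # pretend network I/O
    return f"plan for {prompt}"


@wrap_either
def lookup(city):  # a sync tool used in the same async run
    return f"flights to {city}"


async def agent(task):
    plan = await acall_model(task)
    return plan, lookup("NYC")


result = asyncio.run(agent("book a trip"))
```

Because the async wrapper is itself an `async def`, callers keep awaiting it exactly as before, and sync tools mixed into the same run keep their plain call style.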

Catching regressions with strict replay

By default, replay serves recorded results best-effort and collects any divergences. With strict=True, a call whose name or arguments differ from the recording raises DivergenceError — turning your cassette into a behavioral contract.

from agentcassette import replay, DivergenceError

try:
    with replay("cassettes/flight_search.json", strict=True):
        my_agent.run("Find flights to NYC under $300")   # raises on drift
except DivergenceError as err:
    print(f"agent behavior drifted: {err}")

Best-effort mode exposes what changed without failing:

with replay("cassettes/flight_search.json") as player:
    my_agent.run("Find flights to NYC under $300")

for d in player.divergences:
    print(d["index"], d["expected"], "->", d["actual"])
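The difference between the two modes comes down to what happens when a live call does not match the next recorded step. A minimal sketch, assuming a step is compared on its name and arguments only (as in the strict-mode description above); `DriftError` and `check_call` are hypothetical names, and the divergence dict reuses the `index`/`expected`/`actual` keys from the loop above without claiming the real contents of those fields.

```python
class DriftError(Exception):
    """Hypothetical stand-in for a strict-mode divergence error."""


def check_call(step, name, arguments, *, strict, divergences):
    """Compare a live call with the recorded step; raise or collect on mismatch."""
    expected = {"name": step["name"], "arguments": step["arguments"]}
    actual = {"name": name, "arguments": arguments}
    if expected == actual:
        return step["result"]
    if strict:
        raise DriftError(f"step {step['index']}: expected {expected}, got {actual}")
    divergences.append({"index": step["index"], "expected": expected, "actual": actual})
    return step["result"]  # best-effort: still serve the recorded result


recorded = {
    "index": 0,
    "name": "search_web",
    "arguments": {"args": ["NYC"], "kwargs": {}},
    "result": ["flight A"],
}
```

Best-effort mode keeps the run going and records what drifted; strict mode stops at the first mismatch, which is what makes a cassette usable as a regression guard.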

Using with pytest

agentcassette ships an optional pytest plugin that registers itself automatically, with no configuration. Request the cassette fixture: it records on the first run, then replays on every run after, so you never create or manage cassette files by hand.

import agentcassette

call_model = agentcassette.intercept(call_model, kind="llm")

def test_flight_search(cassette):
    result = my_agent.run("Find flights to NYC under $300")
    assert result.success

Cassettes default to <test dir>/cassettes/<test name>.json.

Record modes, selected with --record-mode:

Mode	Behavior
`once` (default)	Replay if a cassette exists, otherwise record it
`none`	Replay only; fail if the cassette is missing (use in CI to forbid accidental recording)
`all`	Always re-record, overwriting the cassette

pytest                     # record missing cassettes, replay the rest
pytest --record-mode=all   # re-record everything (e.g. after an intended change)
pytest --record-mode=none  # CI: fail if any cassette is missing
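The three modes reduce to a small decision on whether the cassette file already exists. A sketch of that table as plain functions, assuming the fixture makes this choice once per test; `default_cassette_path` and `resolve_mode` are hypothetical names, and raising `FileNotFoundError` for `none` stands in for whatever failure the plugin actually reports.

```python
from pathlib import Path


def default_cassette_path(test_file: str, test_name: str) -> Path:
    """<test dir>/cassettes/<test name>.json, as described above."""
    return Path(test_file).parent / "cassettes" / f"{test_name}.json"


def resolve_mode(record_mode: str, cassette: Path) -> str:
    """Return "record" or "replay" for one test, following the record-mode table."""
    exists = cassette.exists()
    if record_mode == "all":
        return "record"  # always re-record, overwriting any existing cassette
    if record_mode == "once":
        return "replay" if exists else "record"
    if record_mode == "none":
        if not exists:  # CI guard: never record implicitly
            raise FileNotFoundError(f"missing cassette: {cassette}")
        return "replay"
    raise ValueError(f"unknown record mode: {record_mode!r}")
```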

Per-test overrides with the cassette marker:

import pytest

@pytest.mark.cassette(record_mode="all", strict=True,
                      redact=["api_key"], path="tapes/search.json")
def test_search(cassette):
    ...

With strict=True, a replayed call that diverges from the recording fails the test — turning the cassette into a regression guard. The fixture yields the active Recorder (recording) or Player (replaying) for inspection.

The plugin needs pytest: install it with pip install "agentcassette[pytest]" (pytest is also included in the [dev] extra). Importing agentcassette itself never imports pytest, so the library stays zero-dependency.

Inspecting cassettes

from agentcassette import Cassette

c = Cassette.load("cassettes/flight_search.json")
c.num_steps            # number of intercepted calls
c.total_input_tokens   # summed across steps
c.total_output_tokens
c.total_tokens
c.duration_ms          # wall time of the original run

c.redact("api_key")    # scrub secrets before committing to git
c.save("cassettes/flight_search.json")

Token counts use exact usage blocks when the recorded response carries one (OpenAI usage.prompt_tokens, Anthropic usage.input_tokens, …), falling back to a deterministic ~4-chars-per-token heuristic otherwise.
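The fallback order described above can be sketched as follows. It checks an Anthropic-style `usage.input_tokens`/`output_tokens` block, then an OpenAI-style `usage.prompt_tokens`/`completion_tokens` block, and otherwise estimates from text length. `count_tokens` and `estimate` are hypothetical; rounding up, and estimating from the JSON text of the arguments and result, are assumptions, since the description above only says "~4 chars per token".

```python
import json
import math


def estimate(text: str) -> int:
    """Deterministic ~4-characters-per-token estimate (rounding up is an assumption)."""
    return math.ceil(len(text) / 4)


def count_tokens(arguments, result):
    """Return (input_tokens, output_tokens) for one recorded step."""
    usage = result.get("usage") if isinstance(result, dict) else None
    if isinstance(usage, dict):
        if "input_tokens" in usage:  # Anthropic-style usage block
            return usage["input_tokens"], usage.get("output_tokens", 0)
        if "prompt_tokens" in usage:  # OpenAI-style usage block
            return usage["prompt_tokens"], usage.get("completion_tokens", 0)
    # No usage block: estimate from the serialized arguments and result.
    return estimate(json.dumps(arguments)), estimate(json.dumps(result))
```

The estimate depends only on the recorded text, so replaying the same cassette always yields the same totals.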

Redacting secrets

Scrub sensitive keys either when recording or after loading:

# At record time:
with record("cassettes/run.json", redact=["api_key", "authorization"]):
    my_agent.run(task)

# Or later:
Cassette.load("cassettes/run.json").redact("api_key").save("cassettes/run.json")
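"At any depth" means the scrub has to walk nested dicts and lists, not just the top level. A minimal sketch of that recursion, assuming keys are matched by exact name and each matching value is replaced wholesale with the `replacement` string (`"****"` by default, per the `Cassette.redact` signature in the API Reference). `redact_value` is a hypothetical helper that returns a new structure rather than mutating in place.

```python
from typing import Any


def redact_value(node: Any, key: str, replacement: str = "****") -> Any:
    """Return a copy of node with every value stored under `key` replaced."""
    if isinstance(node, dict):
        return {
            k: replacement if k == key else redact_value(v, key, replacement)
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [redact_value(item, key, replacement) for item in node]
    return node  # strings, numbers, booleans, and None are left untouched
```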

Diffing runs

Compare two cassettes to see how an agent's behavior drifted between versions:

from agentcassette import diff_cassettes

delta = diff_cassettes("cassettes/v1.json", "cassettes/v2.json")
delta.new_calls          # call names in v2 but not v1
delta.dropped_calls      # call names in v1 but not v2
delta.changed_calls      # same-position steps whose args/results changed
delta.token_delta        # total token change (v2 - v1)
delta.identical          # True if nothing changed
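A rough sketch of how such a comparison can be computed from two step lists: set differences on call names for new and dropped calls, and a position-by-position comparison for changed steps. `diff_steps` is hypothetical and works on plain step dicts in the cassette format; the real `diff_cassettes` returns a `CassetteDiff` object and may match steps and report changes differently (this sketch reports changed positions as indices and ignores timings).

```python
def _key(step):
    """What counts as 'the same call' here: name, arguments, and result."""
    return step["name"], step["arguments"], step["result"]


def _tokens(steps):
    return sum(s.get("input_tokens", 0) + s.get("output_tokens", 0) for s in steps)


def diff_steps(old, new):
    """Compare two lists of step dicts (cassette format)."""
    old_names = {s["name"] for s in old}
    new_names = {s["name"] for s in new}
    changed = [
        i
        for i, (a, b) in enumerate(zip(old, new))
        if a["name"] == b["name"] and _key(a) != _key(b)  # same call, new args/result
    ]
    return {
        "new_calls": sorted(new_names - old_names),
        "dropped_calls": sorted(old_names - new_names),
        "changed_calls": changed,
        "token_delta": _tokens(new) - _tokens(old),  # v2 - v1
        "identical": [_key(s) for s in old] == [_key(s) for s in new],
    }
```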

Cassette format

Cassettes are plain, human-readable JSON — diffable and safe to commit:

{
  "version": 1,
  "recorded_at": "2026-06-30T12:00:00Z",
  "model": "claude-sonnet-4-6",
  "duration_ms": 1832.4,
  "steps": [
    {
      "index": 0,
      "type": "llm",
      "name": "call_model",
      "arguments": {"args": ["plan the task"], "kwargs": {}},
      "result": {"text": "...", "usage": {"input_tokens": 420, "output_tokens": 88}},
      "input_tokens": 420,
      "output_tokens": 88,
      "duration_ms": 512.0
    }
  ]
}

Every intercepted call becomes one step, in the exact order it happened.
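Because a cassette is plain JSON, summary numbers such as the step count and token totals are simple folds over `steps`. A sketch using only the standard library, assuming the field names shown in the example above; the cassette below is trimmed and its second tool step is invented for illustration, and `summarize` is hypothetical, not how the `Cassette` class computes its properties.

```python
import json

# Trimmed cassette in the format above; the second step is made-up sample data.
CASSETTE_JSON = """
{
  "version": 1,
  "duration_ms": 1832.4,
  "steps": [
    {"index": 0, "type": "llm", "name": "call_model",
     "input_tokens": 420, "output_tokens": 88},
    {"index": 1, "type": "tool", "name": "search_web",
     "input_tokens": 12, "output_tokens": 40}
  ]
}
"""


def summarize(data):
    """Mirror num_steps and the token totals for a loaded cassette dict."""
    steps = data["steps"]
    # One step per call, in call order, so indices should run 0, 1, 2, ...
    if [s["index"] for s in steps] != list(range(len(steps))):
        raise ValueError("steps are not in recorded order")
    inp = sum(s.get("input_tokens", 0) for s in steps)
    out = sum(s.get("output_tokens", 0) for s in steps)
    return {
        "num_steps": len(steps),
        "total_input_tokens": inp,
        "total_output_tokens": out,
        "total_tokens": inp + out,
    }


summary = summarize(json.loads(CASSETTE_JSON))
```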

API Reference

`intercept(fn=None, *, name=None, kind="call")`

Marks a callable as recordable/replayable. Usable as intercept(fn), intercept(fn, kind="llm"), or as a decorator. Works on both regular functions and async def coroutine functions (async callables get an awaitable wrapper). kind is a free-form label stored on each step (e.g. "llm", "tool"). Outside a session, the wrapped callable is a transparent pass-through.

`record(path, *, model=None, redact=None)`

Context manager. Records every intercepted call made inside the block and writes the cassette to path, but only if the block exits without raising. redact is a list of key names to scrub before saving. Yields the Recorder.
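"Only on clean exit" falls out naturally from a generator-based context manager that saves after the `yield` instead of in a `finally` block: if the body raises, the save never runs, so a crashed run cannot overwrite a good cassette. A sketch with a hypothetical `record_sketch`; the real `record` also applies `redact` and may store more metadata than the `model` field shown here.

```python
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def record_sketch(path, *, model=None):
    """Collect steps in the block; write JSON only if the block did not raise."""
    steps = []
    yield steps
    # Reached only on a clean exit: an exception in the block propagates
    # out of the yield above and skips the write below.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"version": 1, "model": model, "steps": steps}, fh, indent=2)


tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "cassettes", "good.json")
bad = os.path.join(tmp, "cassettes", "bad.json")

with record_sketch(good, model="demo") as steps:
    steps.append({"index": 0, "name": "call_model"})

try:
    with record_sketch(bad) as steps:
        raise RuntimeError("agent crashed mid-run")
except RuntimeError:
    pass  # the exception still reaches the caller; nothing was written
```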

`replay(path, *, strict=False)`

Context manager. Serves recorded results for intercepted calls without running the real functions. strict=True raises DivergenceError on any mismatch. Yields the Player (with .divergences, .remaining, .cursor).

`Cassette`

Member	Description
`Cassette.load(path)`	Load from disk (raises `CassetteNotFound`)
`.save(path)`	Write pretty-printed JSON, creating parent dirs
`.num_steps`	Number of recorded steps
`.total_input_tokens` / `.total_output_tokens` / `.total_tokens`	Token totals
`.duration_ms`	Wall time of the recorded run
`.redact(key, replacement="****")`	Scrub every value under `key`, at any depth

`diff_cassettes(a, b) -> CassetteDiff`

Compare two cassettes (paths or Cassette objects). Returns a CassetteDiff with new_calls, dropped_calls, changed_calls, token_delta, input_token_delta, output_token_delta, step_delta, and identical.

Exceptions

All inherit from AgentCassetteError:

Exception	Raised when
`CassetteNotFound`	Replaying a path that doesn't exist
`ReplayExhausted`	The agent makes more calls than the cassette recorded
`DivergenceError`	A strict replay sees a call that differs from the recording
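Sequential matching together with `ReplayExhausted` suggests a replay loop built around a cursor into the recorded steps. A minimal sketch, with a hypothetical `TapePlayer` class and an `Exhausted` exception standing in for `ReplayExhausted`; the `cursor` and `remaining` names echo the `Player` attributes listed under `replay()`, but their exact semantics here (an index and a count) are assumptions.

```python
class Exhausted(Exception):
    """Stand-in for ReplayExhausted: more calls than recorded steps."""


class TapePlayer:
    """Serve recorded results strictly in call order."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.cursor = 0  # index of the next step to serve

    @property
    def remaining(self):
        return len(self.steps) - self.cursor

    def next_result(self, name):
        if self.cursor >= len(self.steps):
            raise Exhausted(f"call #{self.cursor} ({name}) has no recorded step")
        step = self.steps[self.cursor]
        self.cursor += 1  # advance even before any divergence check
        return step["result"]


player = TapePlayer([
    {"name": "plan", "result": "step 1"},
    {"name": "search_web", "result": ["flight A"]},
])
```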

Notes & limitations

Replayed results are JSON. Recorded values round-trip through JSON, so on replay you get plain dicts/lists/primitives, not the original SDK objects. For typical LLM responses (dicts) this is exactly what you want.
Ordering matters. Calls replay in the order they were recorded. agentcassette matches sequentially, which is deterministic and mirrors how an agent actually executes. Truly concurrent calls (e.g. asyncio.gather) are recorded in completion order; if that order isn't stable across runs, replay matching is best-effort — record such sections sequentially if you need strict determinism.
Sync and async. Both def and async def callables are supported. record/replay sessions are thread-local and cover the event loop running on that thread; if your agent fans out across OS threads, wrap each thread's work in its own record/replay block.
Streaming responses (token iterators) are not specially handled yet — wrap at a boundary where the response is already materialized.
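The JSON round-trip in the first note is easy to see with the standard library alone: tuples come back as lists, non-string dict keys come back as strings, and the plain `json` module rejects objects it cannot serialize. This illustrates `json` behavior in general, not a specific agentcassette code path (how the library handles SDK objects is not covered here); `UsageObj` is a hypothetical stand-in for an SDK response object.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class UsageObj:  # hypothetical stand-in for an SDK response object
    input_tokens: int
    output_tokens: int


def round_trip(value):
    """Serialize and parse back, as a recorded value would be on replay."""
    return json.loads(json.dumps(value))


usage = UsageObj(420, 88)
# Converting to a dict first makes the object JSON-friendly.
replayed = round_trip({"usage": asdict(usage), "choices": ("a", "b"), 1: "one"})
```

On replay, code that expected `usage.input_tokens` must read `replayed["usage"]["input_tokens"]` instead, which is why wrapping at a boundary that already returns dicts is the smoother choice.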

Contributing

See CONTRIBUTING.md.

License

MIT — see LICENSE.

Part of the aenealabs AI agent toolkit.


Release history

0.2.0 (this version): Jul 1, 2026

0.1.0: Jul 1, 2026

Download files


Source Distribution

agentcassette-0.2.0.tar.gz (27.0 kB)

Uploaded Jul 1, 2026 Source

Built Distribution


agentcassette-0.2.0-py3-none-any.whl (18.2 kB)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file agentcassette-0.2.0.tar.gz.

File metadata

Download URL: agentcassette-0.2.0.tar.gz
Upload date: Jul 1, 2026
Size: 27.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for agentcassette-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`fbc95626752c6b4e341ae5a1e63febe35b517b3be67508944d84c3bd5109478e`
MD5	`187814e96d63b138f86f517121100fb1`
BLAKE2b-256	`64217b795ecfd71255310d3d9ab06b4ab25474b9006b62c80b832611b65560e9`


File details

Details for the file agentcassette-0.2.0-py3-none-any.whl.

File metadata

Download URL: agentcassette-0.2.0-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 18.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for agentcassette-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d4c8146b6feed02619de27f37666df5be1b5d0e47b33a2d599df8fb18b8067b9`
MD5	`f0d7e889e08335c9d089e415276670fe`
BLAKE2b-256	`512648491ce09c3fa17432568c676087f1f96ce1994635590adf565341e8584a`

