
Record and replay LLM agent traces for deterministic regression testing: framework-agnostic, pytest-native

This project has been archived by its maintainers; no new releases are expected.

๐Ÿ” agent-replay

Record and replay LLM agent traces for deterministic regression testing.


agent-replay brings the VCR.py pattern to LLM agents, but at the SDK level rather than the HTTP level. It intercepts openai.chat.completions.create, anthropic.messages.create, tool calls, and agent decisions, recording the full execution trace. On replay, it injects the recorded responses: zero API calls, millisecond execution, fully deterministic.

Why not just use VCR.py?

VCR.py records HTTP traffic. agent-replay records agent behavior:

| | VCR.py | agent-replay |
|---|---|---|
| Records at | HTTP layer | SDK layer |
| Understands | Request/response pairs | LLM calls, tool invocations, agent decisions |
| Trajectory tracking | ❌ | ✅ "agent called search, then read_file, then responded" |
| Regression diff | Binary (match/no-match) | Semantic ("model changed", "new tool used", "extra LLM call") |
| Framework-agnostic | ✅ | ✅ (OpenAI, Anthropic, LiteLLM, LangChain, CrewAI) |
| Cost tracking | ❌ | ✅ per-call tokens and USD |
| Async + Streaming | Varies | ✅ Native support |

Quick Start

pip install agent-replay
# With optional provider support:
pip install agent-replay[openai]           # OpenAI
pip install agent-replay[anthropic]        # Anthropic
pip install agent-replay[langchain]        # LangChain/LangGraph
pip install agent-replay[langgraph]        # LangGraph (includes langchain-core)
pip install agent-replay[crewai]           # CrewAI
pip install agent-replay[all]              # Everything

Record an agent run

from agent_replay import Recorder

with Recorder(save_to="cassettes/test_math.yaml") as rec:
    # Your agent code here; all LLM calls are automatically captured
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(response.choices[0].message.content)

# Trace saved to cassettes/test_math.yaml
print(f"Recorded {rec.trace.total_llm_calls} LLM calls")

Async support (v0.2)

from agent_replay import Recorder

async with Recorder(save_to="cassettes/async_test.yaml") as rec:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )

Streaming support (v0.2)

Streaming calls are automatically captured and assembled into complete responses in the cassette. On replay, responses are split back into realistic chunks:

with Recorder(save_to="cassettes/stream_test.yaml") as rec:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Tell me a story"}],
        stream=True,
    )
    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

Replay deterministically

from agent_replay import Replayer

with Replayer("cassettes/test_math.yaml"):
    # Same code, but LLM calls return recorded responses
    # Zero API calls, zero cost, millisecond execution
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    assert response.choices[0].message.content  # deterministic!

Regression testing with pytest

import pytest
from agent_replay import Recorder, Replayer, assert_trace_unchanged, load_cassette

# First run: record
def test_agent_record():
    with Recorder(save_to="cassettes/test_agent.yaml"):
        agent.run("Summarize the quarterly report")

# Subsequent runs: replay and check for regressions
def test_agent_regression():
    old_trace = load_cassette("cassettes/test_agent.yaml")
    with Recorder() as rec:
        agent.run("Summarize the quarterly report")
    assert_trace_unchanged(old_trace, rec.trace)

Using the pytest plugin (auto record/replay)

# Uses the cassette fixture: automatically records on first run,
# replays on subsequent runs
def test_agent(cassette):
    agent.run("Summarize the quarterly report")

# First run: record cassettes
pytest --record

# Subsequent runs: replay from cassettes
pytest

# Re-record all cassettes
pytest --record-mode=all

Budget assertions (v0.2)

Guard against cost overruns, token bloat, and infinite loops:

import pytest
from agent_replay import Recorder
from agent_replay.assertions import (
    assert_cost_under,
    assert_tokens_under,
    assert_max_llm_calls,
    assert_no_loops,
)

def test_agent_budget():
    with Recorder() as rec:
        agent.run("Summarize the report")

    assert_cost_under(rec.trace, max_usd=0.50)
    assert_tokens_under(rec.trace, max_tokens=10_000)
    assert_max_llm_calls(rec.trace, max_calls=5)
    assert_no_loops(rec.trace, max_consecutive_same_tool=3)

Or use the pytest marker:

@pytest.mark.budget(max_usd=0.50, max_tokens=10_000, max_llm_calls=5)
def test_agent(cassette):
    agent.run("Summarize the report")

What Gets Recorded

Every LLM call captures:

  • Provider (openai, anthropic, litellm, langchain, crewai)
  • Model (gpt-4o, claude-4-sonnet, etc.)
  • Messages (full prompt including system message)
  • Response (full completion response)
  • Tool calls (function name, arguments, tool_call_id)
  • Tokens (input/output counts)
  • Cost (USD per call)
  • Timing (milliseconds per call)

Plus agent-level events:

  • Tool invocations (name, input, output)
  • Agent decisions (delegation, routing, planning)
  • Errors (exceptions with type and message)
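Taken together, the fields above can be pictured as a single recorded event. Here is a minimal sketch of what one LLM-call entry might contain; the field names are illustrative assumptions based on the list above, not agent-replay's actual cassette schema:

```python
# Illustrative shape of one recorded LLM call. Field names are
# assumptions drawn from the list above, not the real schema.
llm_event = {
    "provider": "openai",
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"},
    ],
    "response": {"role": "assistant", "content": "The answer is 4."},
    "tool_calls": [],                      # name, arguments, tool_call_id
    "tokens": {"input": 100, "output": 25},
    "cost_usd": 0.0005,
    "duration_ms": 350,
}

# A replayer only needs provider + model + messages to match a call,
# and the stored response to answer it without hitting the API.
assert llm_event["tokens"]["input"] + llm_event["tokens"]["output"] == 125
```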

Supported Providers

| Provider | Auto-intercepted | Sync | Async | Streaming | Package |
|---|---|---|---|---|---|
| OpenAI | ✅ | ✅ | ✅ | ✅ | openai |
| Anthropic | ✅ | ✅ | ✅ | ✅ | anthropic |
| LiteLLM | ✅ | ✅ | ✅ | ✅ | litellm |
| LangChain | ✅ | ✅ | ✅ | – | langchain-core |
| LangGraph | ✅ | ✅ | ✅ | ✅ | langgraph |
| CrewAI | ✅ | ✅ | ✅ | – | crewai |
| Any (manual) | Via rec.record_tool_call() | ✅ | ✅ | – | – |
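The last row covers frameworks with no interceptor: you record tool calls by hand via rec.record_tool_call(). As a toy model of what that amounts to, the stand-in class below appends one tool event per call; the class and its keyword names are assumptions for illustration, not agent-replay's API:

```python
# Toy stand-in illustrating manual tool-call recording.
# Not the real agent-replay Recorder; argument names are assumed.
class ToyRecorder:
    def __init__(self):
        self.events = []

    def record_tool_call(self, name, input, output):
        # Each manual call becomes one tool event in the trace,
        # mirroring the name/input/output fields recorded elsewhere.
        self.events.append(
            {"type": "tool_call", "name": name, "input": input, "output": output}
        )

rec = ToyRecorder()
result = {"temp_c": 21}                       # pretend tool output
rec.record_tool_call("get_weather", {"city": "Oslo"}, result)
assert rec.events[0]["name"] == "get_weather"
```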

Framework Integrations (v0.3)

agent-replay now captures graph-level and framework-level events.

LangGraph

Record Pregel graph execution with node and stream-level events:

from agent_replay import Recorder
from langgraph.graph import StateGraph

with Recorder(save_to="cassettes/graph.yaml", intercept_langgraph=True):
    # Captured: graph_start, graph_stream_start/end, graph_end
    result = graph.invoke({"messages": [...]})

LangChain

Intercept BaseChatModel calls in agents, chains, and LCEL:

with Recorder(save_to="cassettes/agent.yaml", intercept_langchain=True):
    result = agent.invoke({"input": "your question"})

Anthropic Tool Use

Full support for tool_use blocks with automatic tool result injection.
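For reference, the pair being recorded uses the standard Anthropic Messages API shapes shown below; the "automatic tool result injection" pairs a tool_result back to its tool_use block by tool_use_id. The id value here is illustrative:

```python
# Standard Anthropic Messages API shapes for one tool round-trip.
# The tool_result is matched to the tool_use block via tool_use_id.
tool_use = {
    "type": "tool_use",
    "id": "toolu_abc123",                 # illustrative id
    "name": "get_weather",
    "input": {"city": "Oslo"},
}

tool_result_message = {
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": tool_use["id"],
        "content": "21 degrees C",
    }],
}

assert tool_result_message["content"][0]["tool_use_id"] == tool_use["id"]
```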

Normalization (v0.2)

Compare responses across providers with normalized, provider-agnostic representations:

from agent_replay.normalize import normalize_response, normalize_for_comparison

# Both produce the same shape for diffing
openai_clean = normalize_for_comparison(openai_resp, "openai")
anthropic_clean = normalize_for_comparison(anthropic_resp, "anthropic")
# Strips volatile fields (IDs, token counts) to focus on semantics

CLI

# Inspect a cassette
replay inspect cassettes/test_math.yaml

# Compare two cassettes
replay diff cassettes/old.yaml cassettes/new.yaml

# Export to JSON
replay export cassettes/test.yaml --format json -o trace.json

# Interactive time-travel debugger (v0.2)
replay debug cassettes/test.yaml
replay debug cassettes/v1.yaml --compare cassettes/v2.yaml
replay debug cassettes/test.yaml --tools-only

# Generate HTML report (v0.2)
replay report cassettes/test.yaml -o report.html

# List all cassettes with stats (v0.2)
replay ls cassettes/

# Aggregate stats across cassettes (v0.2)
replay stats cassettes/

# Delete stale cassettes (v0.2)
replay prune cassettes/ --older-than 30d --dry-run

# Validate cassette integrity (v0.2)
replay validate cassettes/test.yaml

Time-Travel Debugger (v0.2)

Step forward and backward through a recorded trace, inspecting prompts, responses, and tool I/O at each step:

$ replay debug cassettes/test_math.yaml

╭─ agent-replay debugger ──────────────────────╮
│ Trace ID: abc123                             │
│ Events: 6  (LLM: 2, Tools: 1)                │
│ Tokens: 450  Cost: $0.0015  Duration: 1200ms │
╰──────────────────────────────────────────────╯
  n/Enter = next · p = prev · q = quit · g <N> = go to event N

LLM Response  #2/6  provider=openai  model=gpt-4o  350ms
╭─ Response ─────────────────────────────╮
│ The answer is 4.                       │
╰────────────────────────────────────────╯
  Tokens: in=100, out=25, $0.0005

[step] _

HTML Report (v0.2)

Generate a shareable single-file HTML report with dark theme, stats grid, trajectory visualization, and expandable events:

from agent_replay.reporters.html import generate_html_report
from agent_replay import load_cassette

trace = load_cassette("cassettes/test.yaml")
generate_html_report(trace, "report.html")

GitHub Action (v0.2)

Use the included composite action in your CI pipeline:

- uses: ./.github/actions/agent-replay
  with:
    cassette-dir: cassettes
    record-mode: replay
    pytest-args: tests/test_agent.py -v
    post-diff-comment: true

Trace Diffing

The diff engine compares two traces semantically:

from agent_replay import diff_traces, load_cassette

old = load_cassette("cassettes/v1.yaml")
new = load_cassette("cassettes/v2.yaml")
diff = diff_traces(old, new)

print(diff.summary())
# Trace comparison:
#   ⚠ TRAJECTORY CHANGED (agent took a different path)
#     Old: llm_call:gpt-4o → tool:search → llm_call:gpt-4o
#     New: llm_call:gpt-4o → tool:browse → tool:search → llm_call:gpt-4o
#   Tool calls: 1 more
#   New tools used: browse

Roadmap

  • v0.1: Record/replay for OpenAI, Anthropic, LiteLLM. Pytest plugin. Diff engine. CLI.
  • v0.2: Async + streaming support, normalized diffing, LangChain/CrewAI interceptors, budget assertions, time-travel debugger, HTML reports, GitHub Action, expanded CLI.
  • v0.3 (current): LangGraph Pregel interceptor, Anthropic tool_use support, example templates, framework integration tests, 58% code coverage (144 tests).
  • v0.4: VS Code extension with trace visualization, live replay dashboard, web UI for cassette inspection.

License

MIT
