Record and replay LLM agent traces for deterministic regression testing โ framework-agnostic, pytest-native
This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
Project description
๐ agent-replay
Record and replay LLM agent traces for deterministic regression testing.
agent-replay brings the VCR.py pattern to LLM agents โ but at the SDK level, not the HTTP level. It intercepts openai.chat.completions.create, anthropic.messages.create, tool calls, and agent decisions, recording the full execution trace. On replay, it injects recorded responses โ zero API calls, millisecond execution, fully deterministic.
Why not just use VCR.py?
VCR.py records HTTP traffic. agent-replay records agent behavior:
| VCR.py / Cagent | agent-replay | |
|---|---|---|
| Records at | HTTP layer | SDK layer |
| Understands | Request/response pairs | LLM calls, tool invocations, agent decisions |
| Trajectory tracking | โ | โ "agent called search, then read_file, then responded" |
| Regression diff | Binary (match/no-match) | Semantic ("model changed", "new tool used", "extra LLM call") |
| Framework-agnostic | โ | โ (OpenAI, Anthropic, LiteLLM, LangChain, CrewAI) |
| Cost tracking | โ | โ per-call tokens and USD |
| Async + Streaming | Varies | โ Native support |
Quick Start
pip install agent-replay
# With optional provider support:
pip install agent-replay[openai] # OpenAI
pip install agent-replay[anthropic] # Anthropic
pip install agent-replay[langchain] # LangChain/LangGraph
pip install agent-replay[langgraph] # LangGraph (includes langchain-core)
pip install agent-replay[crewai] # CrewAI
pip install agent-replay[all] # Everything
Record an agent run
from agent_replay import Recorder
with Recorder(save_to="cassettes/test_math.yaml") as rec:
# Your agent code here โ all LLM calls are automatically captured
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)
# Trace saved to cassettes/test_math.yaml
print(f"Recorded {rec.trace.total_llm_calls} LLM calls")
Async support (v0.2)
from agent_replay import Recorder
async with Recorder(save_to="cassettes/async_test.yaml") as rec:
response = await client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
)
Streaming support (v0.2)
Streaming calls are automatically captured and assembled into complete responses in the cassette. On replay, responses are split back into realistic chunks:
with Recorder(save_to="cassettes/stream_test.yaml") as rec:
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True,
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="")
Replay deterministically
from agent_replay import Replayer
with Replayer("cassettes/test_math.yaml"):
# Same code โ but LLM calls return recorded responses
# Zero API calls, zero cost, millisecond execution
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is 2+2?"}],
)
assert response.choices[0].message.content # deterministic!
Regression testing with pytest
import pytest
from agent_replay import Recorder, Replayer, assert_trace_unchanged, load_cassette
# First run: record
def test_agent_record():
with Recorder(save_to="cassettes/test_agent.yaml"):
agent.run("Summarize the quarterly report")
# Subsequent runs: replay and check for regressions
def test_agent_regression():
old_trace = load_cassette("cassettes/test_agent.yaml")
with Recorder() as rec:
agent.run("Summarize the quarterly report")
assert_trace_unchanged(old_trace, rec.trace)
Using the pytest plugin (auto record/replay)
# Uses the cassette fixture โ automatically records on first run,
# replays on subsequent runs
def test_agent(cassette):
agent.run("Summarize the quarterly report")
# First run: record cassettes
pytest --record
# Subsequent runs: replay from cassettes
pytest
# Re-record all cassettes
pytest --record-mode=all
Budget assertions (v0.2)
Guard against cost overruns, token bloat, and infinite loops:
import pytest
from agent_replay import Recorder
from agent_replay.assertions import (
assert_cost_under,
assert_tokens_under,
assert_max_llm_calls,
assert_no_loops,
)
def test_agent_budget():
with Recorder() as rec:
agent.run("Summarize the report")
assert_cost_under(rec.trace, max_usd=0.50)
assert_tokens_under(rec.trace, max_tokens=10_000)
assert_max_llm_calls(rec.trace, max_calls=5)
assert_no_loops(rec.trace, max_consecutive_same_tool=3)
Or use the pytest marker:
@pytest.mark.budget(max_usd=0.50, max_tokens=10_000, max_llm_calls=5)
def test_agent(cassette):
agent.run("Summarize the report")
What Gets Recorded
Every LLM call captures:
- Provider (openai, anthropic, litellm, langchain, crewai)
- Model (gpt-4o, claude-4-sonnet, etc.)
- Messages (full prompt including system message)
- Response (full completion response)
- Tool calls (function name, arguments, tool_call_id)
- Tokens (input/output counts)
- Cost (USD per call)
- Timing (milliseconds per call)
Plus agent-level events:
- Tool invocations (name, input, output)
- Agent decisions (delegation, routing, planning)
- Errors (exceptions with type and message)
Supported Providers
| Provider | Auto-intercepted | Sync | Async | Streaming | Package |
|---|---|---|---|---|---|
| OpenAI | โ | โ | โ | โ | openai |
| Anthropic | โ | โ | โ | โ | anthropic |
| LiteLLM | โ | โ | โ | โ | litellm |
| LangChain | โ | โ | โ | โ | langchain-core |
| LangGraph | โ | โ | โ | โ | langgraph |
| CrewAI | โ | โ | โ | โ | crewai |
| Any (manual) | Via rec.record_tool_call() |
โ | โ | โ | โ |
Framework Integrations (v0.3)
agent-replay now captures graph-level and framework-level events.
LangGraph
Record Pregel graph execution with node and stream-level events:
from agent_replay import Recorder
from langgraph.graph import StateGraph
with Recorder(save_to="cassettes/graph.yaml", intercept_langgraph=True):
# Captured: graph_start, graph_stream_start/end, graph_end
result = graph.invoke({"messages": [...]})
LangChain
Intercept BaseChatModel calls in agents, chains, and LCEL:
with Recorder(save_to="cassettes/agent.yaml", intercept_langchain=True):
result = agent.invoke({"input": "your question"})
Anthropic Tool Use
Full support for tool_use blocks with automatic tool result injection.
Normalization (v0.2)
Compare responses across providers with normalized, provider-agnostic representations:
from agent_replay.normalize import normalize_response, normalize_for_comparison
# Both produce the same shape for diffing
openai_clean = normalize_for_comparison(openai_resp, "openai")
anthropic_clean = normalize_for_comparison(anthropic_resp, "anthropic")
# Strips volatile fields (IDs, token counts) โ focuses on semantics
CLI
# Inspect a cassette
replay inspect cassettes/test_math.yaml
# Compare two cassettes
replay diff cassettes/old.yaml cassettes/new.yaml
# Export to JSON
replay export cassettes/test.yaml --format json -o trace.json
# Interactive time-travel debugger (v0.2)
replay debug cassettes/test.yaml
replay debug cassettes/v1.yaml --compare cassettes/v2.yaml
replay debug cassettes/test.yaml --tools-only
# Generate HTML report (v0.2)
replay report cassettes/test.yaml -o report.html
# List all cassettes with stats (v0.2)
replay ls cassettes/
# Aggregate stats across cassettes (v0.2)
replay stats cassettes/
# Delete stale cassettes (v0.2)
replay prune cassettes/ --older-than 30d --dry-run
# Validate cassette integrity (v0.2)
replay validate cassettes/test.yaml
Time-Travel Debugger (v0.2)
Step forward and backward through a recorded trace, inspecting prompts, responses, and tool I/O at each step:
$ replay debug cassettes/test_math.yaml
โญโ ๐ agent-replay debugger โโโโโโโโโโโโโโโโโโโโโโโฎ
โ Trace ID: abc123 โ
โ Events: 6 (LLM: 2, Tools: 1) โ
โ Tokens: 450 Cost: $0.0015 Duration: 1200ms โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
n/Enter = next ยท p = prev ยท q = quit ยท g <N> = go to event N
LLM Response #2/6 provider=openai model=gpt-4o 350ms
โญโ Response โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ The answer is 4. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
Tokens: in=100, out=25, $0.0005
[step] _
HTML Report (v0.2)
Generate a shareable single-file HTML report with dark theme, stats grid, trajectory visualization, and expandable events:
from agent_replay.reporters.html import generate_html_report
from agent_replay import load_cassette
trace = load_cassette("cassettes/test.yaml")
generate_html_report(trace, "report.html")
GitHub Action (v0.2)
Use the included composite action in your CI pipeline:
- uses: ./.github/actions/agent-replay
with:
cassette-dir: cassettes
record-mode: replay
pytest-args: tests/test_agent.py -v
post-diff-comment: true
Trace Diffing
The diff engine compares two traces semantically:
from agent_replay import diff_traces, load_cassette
old = load_cassette("cassettes/v1.yaml")
new = load_cassette("cassettes/v2.yaml")
diff = diff_traces(old, new)
print(diff.summary())
# Trace comparison:
# โ TRAJECTORY CHANGED (agent took a different path)
# Old: llm_call:gpt-4o โ tool:search โ llm_call:gpt-4o
# New: llm_call:gpt-4o โ tool:browse โ tool:search โ llm_call:gpt-4o
# Tool calls: 1 more
# New tools used: browse
Roadmap
- v0.1: Record/replay for OpenAI, Anthropic, LiteLLM. Pytest plugin. Diff engine. CLI.
- v0.2: Async + streaming support, normalized diffing, LangChain/CrewAI interceptors, budget assertions, time-travel debugger, HTML reports, GitHub Action, expanded CLI.
- v0.3 (current): LangGraph Pregel interceptor, Anthropic tool_use support, example templates, framework integration tests, 58% code coverage (144 tests).
- v0.4: VS Code extension with trace visualization, live replay dashboard, web UI for cassette inspection.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file replay_agent-0.4.0.tar.gz.
File metadata
- Download URL: replay_agent-0.4.0.tar.gz
- Upload date:
- Size: 96.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2c66d90b1242135203faf326d33ef8fe5fde58e9e8d35ecc6089eb3085d3bc14
|
|
| MD5 |
10ee2188a146ac3bcb97a98b718f10b6
|
|
| BLAKE2b-256 |
f162494f62a2eb8cda29f3f3c2fdd335e05e8b34d235bdf25135c55cd1e5833e
|
File details
Details for the file replay_agent-0.4.0-py3-none-any.whl.
File metadata
- Download URL: replay_agent-0.4.0-py3-none-any.whl
- Upload date:
- Size: 54.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e8d0dd93244bae2a1b1406b5240cc15e630a77b71ed25724ba16d853379dd4d7
|
|
| MD5 |
f2ea1794ef4d2628218cc0227ef9754e
|
|
| BLAKE2b-256 |
2f1af3a5d6a5da64ad57053b0d094f9ec567ecb9decd5045fb2e0cd0662f1928
|