The pytest for AI agents. Record, replay, assert, and diff agent behavior.
Mimic
Mimic is an open-source library that lets you record an AI agent's behavior, replay it deterministically, assert properties about it, and diff runs across versions. It's the missing testing layer for the agent era.
from openai import OpenAI

from mimic import Mimic, assert_that, replay
from mimic.integrations.openai import tracked_completion

mimic = Mimic()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

@mimic.record("customer-support-agent", model="gpt-4o")
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    tracked_completion(resp)  # auto-captures tokens + cost
    return resp.choices[0].message.content
Verified performance
| Scenario | Record mode | Replay mode | Savings |
|---|---|---|---|
| 5-test multi-step agent suite | 360 ms | 50 ms | 7× faster |
| 1000 CI runs of the same suite | ~$2 in LLM cost | $0 | 100% |
Run mimic benchmark --runs 1000 on your own recordings to see your numbers.
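The arithmetic behind the table can be checked in a few lines. This is only a sketch using the figures quoted above; the per-run cost is derived from the approximate $2 total, so treat it as a rough estimate rather than a measured value.

```python
# Figures copied from the table above.
record_ms = 360
replay_ms = 50
speedup = record_ms / replay_ms  # ratio of record-mode to replay-mode wall time

runs = 1000
record_cost_usd = 2.0  # approximate total LLM cost for 1000 recorded runs
cost_per_run_usd = record_cost_usd / runs  # what each replayed run avoids spending

print(f"{speedup:.1f}x faster, ~${cost_per_run_usd:.3f} saved per CI run")
```

The exact ratio is 7.2, which the table rounds down to 7×.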
Why Mimic?
Every team building AI agents hits the same wall:
- "I changed a prompt. Did I break anything?" You don't know.
- "I switched from GPT-4 to Claude. Is it 2x more expensive?" You don't know.
- "Did this agent ever call delete_file in production?" You don't know.
- "Why did the agent fail on Tuesday at 3pm?" You don't know.
Mimic turns those unknowns into testable, replayable, diffable artifacts. Think Sentry recordings + pytest assertions + git blame, purpose-built for LLM agents.
Features
- ✅ Record any callable — sync or async, LLM calls, tool use, multi-step agents
- ✅ Replay runs offline with zero API cost, byte-for-byte deterministic
- ✅ Assert behavioral properties: cost, latency, tool usage, output content
- ✅ Diff two runs to see exactly what changed
- ✅ Auto-track LLM costs for OpenAI, Anthropic, Gemini (zero-config)
- ✅ Multi-step agents with per-step recording, cost, and metadata
- ✅ Privacy mode (capture_args=False, capture_return=False)
- ✅ Storage-agnostic: filesystem by default, pluggable for S3/Postgres
- ✅ Zero LLM vendor lock-in: works with any model
- ✅ Beautiful CLI: mimic run / list / show / diff / report / benchmark
- ✅ CI-ready: GitHub Actions template + pre-commit hook included
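To make the "diff two runs" idea concrete, here is a minimal sketch of a field-by-field comparison. It does not use Mimic: the `diff_runs` helper and the run fields (`model`, `cost_usd`, `tools`, `output`) are hypothetical stand-ins, and the real diff output may look quite different.

```python
from typing import Any


def diff_runs(old: dict[str, Any], new: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {field: (old_value, new_value)} for every field that differs."""
    changed: dict[str, tuple[Any, Any]] = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changed[key] = (before, after)
    return changed


# Two hypothetical recorded runs of the same agent, before and after a model swap.
run_v1 = {"model": "gpt-4o", "cost_usd": 0.012, "tools": ["search"], "output": "Paris"}
run_v2 = {"model": "gpt-4o-mini", "cost_usd": 0.002, "tools": ["search"], "output": "Paris"}

changes = diff_runs(run_v1, run_v2)
for field_name, (before, after) in changes.items():
    print(f"{field_name}: {before!r} -> {after!r}")
```

Only the fields that changed are reported, so an unchanged output with a lower cost is easy to spot.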
Install
pip install mimic-ai
Or with optional integrations:
pip install "mimic-ai[openai]"
pip install "mimic-ai[anthropic]"
Quick start
mkdir my-agent && cd my-agent
mimic init
This creates a project skeleton:
my-agent/
├── mimic.yaml # Project config
├── tests/
│ └── test_agent.py # Your recorded tests
└── .mimic/ # Recorded runs (gitignored by default)
Edit tests/test_agent.py:
from mimic import Mimic, assert_that, replay

mimic = Mimic()

@mimic.record("my-agent", model="gpt-4o")
def answer(question: str) -> str:
    # ... your LLM call here ...
    return "..."

def test_agent():
    answer("hello")
    recorded = replay("my-agent")
    assert_that(recorded).finished_without_errors()
    assert_that(recorded).cost_less_than(usd=0.05)
    assert_that(recorded).did_not_call_tool("delete_database")
Run it:
mimic run tests/ # records + runs (costs $$)
MIMIC_MODE=replay mimic run tests/ # replays only (free, deterministic)
Multi-step agents
For ReAct, multi-agent, or any agent with multiple LLM/tool calls, record each step:
# llm and web_search are placeholders for your own async clients.
@mimic.record("research-agent")
async def research(question: str) -> str:
    # Step 1: plan
    with mimic.step("plan", model="gpt-4o-mini") as s:
        resp = await llm.complete(model="gpt-4o-mini", messages=[...])
        tracked_completion(resp)
        s.metadata["plan_steps"] = 3
    # Step 2: search
    with mimic.step("search") as s:
        results = await web_search(question)
        s.metadata["result_count"] = len(results)
    # Step 3: synthesize
    with mimic.step("synthesize", model="gpt-4o") as s:
        resp = await llm.complete(model="gpt-4o", messages=[...])
        tracked_completion(resp)
        summary = resp.choices[0].message.content  # assumes an OpenAI-style response
    return summary
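For intuition about what a per-step recorder has to track, the following is a rough, standard-library-only sketch of a `step` context manager that captures a name, an optional model, free-form metadata, and wall-clock duration. `Step` and `StepRecorder` are hypothetical names invented for illustration; this is not Mimic's implementation.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional


@dataclass
class Step:
    name: str
    model: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)
    duration_ms: float = 0.0


class StepRecorder:
    """Collects one Step per `with recorder.step(...)` block, in execution order."""

    def __init__(self) -> None:
        self.steps: list[Step] = []

    @contextmanager
    def step(self, name: str, model: Optional[str] = None) -> Iterator[Step]:
        current = Step(name=name, model=model)
        start = time.perf_counter()
        try:
            yield current
        finally:
            # Record the step even if the body raised, so failures stay visible.
            current.duration_ms = (time.perf_counter() - start) * 1000
            self.steps.append(current)


recorder = StepRecorder()
with recorder.step("plan", model="gpt-4o-mini") as s:
    s.metadata["plan_steps"] = 3
with recorder.step("search") as s:
    s.metadata["result_count"] = 5

print([(st.name, st.model, st.metadata) for st in recorder.steps])
```

Appending in the `finally` clause is a deliberate choice in this sketch: a step that raises is still recorded, which is what makes a failed run inspectable afterwards.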
Assertions
The full chain (all return self for fluent chaining):
assert_that(run).finished_without_errors()
assert_that(run).had_error() # inverse
assert_that(run).cost_less_than(usd=0.05)
assert_that(run).completed_under(ms=2000)
assert_that(run).output_contains("substring")
assert_that(run).output_matches(r"regex")
assert_that(run).output_equals(value)
assert_that(run).called_tool("search")
assert_that(run).did_not_call_tool("delete_database")
assert_that(run).called_tools(["search", "synthesize"])
assert_that(run).had_exactly(3)
assert_that(run).had_at_least(2)
assert_that(run).used_model("gpt-4o")
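Because every assertion returns self, checks can also be chained in one expression. The sketch below shows how such a fluent interface can be built; `Run`, `RunAssertions`, and `check` are hypothetical names that mirror a few of the methods above, not Mimic's own classes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    output: str
    cost_usd: float
    tools_called: list[str] = field(default_factory=list)
    error: Optional[str] = None


class RunAssertions:
    """Each check raises AssertionError on failure and returns self on success."""

    def __init__(self, run: Run) -> None:
        self.run = run

    def finished_without_errors(self) -> "RunAssertions":
        assert self.run.error is None, f"run failed: {self.run.error}"
        return self

    def cost_less_than(self, usd: float) -> "RunAssertions":
        assert self.run.cost_usd < usd, f"cost {self.run.cost_usd} is not below {usd}"
        return self

    def did_not_call_tool(self, name: str) -> "RunAssertions":
        assert name not in self.run.tools_called, f"tool {name!r} was called"
        return self


def check(run: Run) -> RunAssertions:
    return RunAssertions(run)


run = Run(output="Paris", cost_usd=0.01, tools_called=["search"])
chained = (
    check(run)
    .finished_without_errors()
    .cost_less_than(usd=0.05)
    .did_not_call_tool("delete_database")
)
```

The first failing check raises, so the chain stops at the earliest broken property.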
How it works
Mimic sits outside your agent code, watching the inputs and outputs of any function you decorate. The first time the function runs, Mimic records the full execution into a content-addressable store. Subsequent test runs use the stored record instead of calling the LLM, making them fast, free, and deterministic.
For multi-step agents, Mimic records each step separately, so you can replay just the broken step without re-running the whole agent.
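The record-then-replay behavior described above can be sketched with the standard library alone. This is one plausible reading, not Mimic's actual code: the `record_or_replay` decorator, the choice to key each record by a SHA-256 hash of the function name and arguments, and the `{"result": ...}` file layout are all assumptions made for illustration.

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def record_or_replay(store: Path) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Run the function once per distinct input, then serve the stored result."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Derive a stable key from the call; arguments must be JSON-serializable.
            payload = json.dumps([fn.__name__, args, kwargs], sort_keys=True)
            key = hashlib.sha256(payload.encode()).hexdigest()
            path = store / f"{key}.json"
            if path.exists():  # replay: no call to the wrapped function
                return json.loads(path.read_text())["result"]
            result = fn(*args, **kwargs)  # record: run for real and persist
            path.write_text(json.dumps({"result": result}))
            return result

        return wrapper

    return decorator


calls: list[str] = []  # tracks how often the "expensive" function really runs
store = Path(tempfile.mkdtemp())


@record_or_replay(store)
def answer(question: str) -> str:
    calls.append(question)
    return f"echo: {question}"


first = answer("hello")   # recorded
second = answer("hello")  # replayed from disk
```

The second call never reaches the function body, which is why replayed tests are fast and free in this model.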
CI integration
Drop the included .github/workflows/test.yml into your repo. It runs your test suite in replay mode (no LLM cost) and validates that no cost was incurred.
Manual re-recording is a separate job, triggered on workflow_dispatch or a schedule.
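One way to guarantee that a replay job never spends money is to fail loudly if a live call is attempted while MIMIC_MODE=replay is set. The guard below is a hypothetical sketch of that idea; whether the bundled workflow enforces this in the same way is an assumption, and `guard_live_call` and `LiveCallInReplayMode` are invented names.

```python
import os
from typing import Mapping, Optional


class LiveCallInReplayMode(RuntimeError):
    """Raised when code tries to reach a paid API during a replay-only run."""


def guard_live_call(env: Optional[Mapping[str, str]] = None) -> None:
    """Call this just before any real LLM request."""
    env = os.environ if env is None else env
    if env.get("MIMIC_MODE") == "replay":
        raise LiveCallInReplayMode(
            "a live LLM call was attempted in replay mode; re-record this test first"
        )


# In record mode (variable unset) the guard is a no-op.
guard_live_call({})

# In replay mode it raises, which would fail the CI job.
try:
    guard_live_call({"MIMIC_MODE": "replay"})
    blocked = False
except LiveCallInReplayMode:
    blocked = True
```

Passing the environment in as a mapping keeps the guard easy to test without mutating the real process environment.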
The recording format
Mimic recordings are plain JSON conforming to a documented schema — see RECORDING_FORMAT.md. The format is vendor-neutral: you can build readers, web UIs, or analysis tools without depending on the Mimic library.
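Since recordings are plain JSON, a reader needs nothing beyond the standard library. The sketch below summarizes a directory of recordings; the field names `name` and `cost_usd` are assumptions made for illustration, so check RECORDING_FORMAT.md for the real schema before relying on them.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def summarize(recordings_dir: Path) -> dict[str, Any]:
    """Count recordings and total their cost, reading each file as plain JSON."""
    names: list[str] = []
    total_cost = 0.0
    for path in sorted(recordings_dir.glob("*.json")):
        rec = json.loads(path.read_text())
        names.append(rec["name"])               # hypothetical field
        total_cost += rec.get("cost_usd", 0.0)  # hypothetical field
    return {"runs": len(names), "names": names, "total_cost_usd": round(total_cost, 6)}


# Write two made-up recordings so the example is self-contained.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "a.json").write_text(json.dumps({"name": "my-agent", "cost_usd": 0.012}))
(demo_dir / "b.json").write_text(json.dumps({"name": "research-agent", "cost_usd": 0.03}))

summary = summarize(demo_dir)
print(summary)
```

The same loop could feed a dashboard or a cost report without importing Mimic at all.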
The $100M thesis
Mimic sits at the intersection of three exploding markets:
- AI agent development: 10M+ developers will build agents by 2027.
- AI observability: already a $2B+ market, led by hosted platforms (LangSmith, Helicone, Langfuse).
- AI safety & compliance: every enterprise deploying agents needs guardrails, audit trails, and replay.
The land-and-expand model is proven (Sentry, Supabase, GitLab, Vercel, PostHog): open source core → community growth → enterprise tier with self-hosted, SSO, audit logs, and SOC2.
See BUSINESS_PLAN.md for the full strategy.
Roadmap
- v0.1 — Record/replay/assert core
- v0.2 — Async + multi-step + OpenAI/Anthropic cost tracking
- v0.3 — Web UI for browsing recorded runs
- v0.4 — TypeScript SDK
- v0.5 — Auto-generated regression tests from production traces
- v0.6 — Multi-agent parent/child traces
- v1.0 — Enterprise self-hosted edition
Contributing
We love contributions. See CONTRIBUTING.md.
License
MIT — see LICENSE.