Mimic

The pytest for AI agents. Record, replay, assert, and diff agent behavior.

PyPI License: MIT Python 3.9+

Mimic is an open-source library that lets you record an AI agent's behavior, replay it deterministically, assert properties about it, and diff runs across versions. It's the missing testing layer for the agent era.

from openai import OpenAI

from mimic import Mimic, assert_that, replay
from mimic.integrations.openai import tracked_completion

mimic = Mimic()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

@mimic.record("customer-support-agent", model="gpt-4o")
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    tracked_completion(resp)  # auto-captures tokens + cost
    return resp.choices[0].message.content

Verified performance

| Scenario | Record mode | Replay mode | Savings |
|---|---|---|---|
| 5-test multi-step agent suite | 360 ms | 50 ms | 7× faster |
| 1000 CI runs of the same suite | ~$2 in LLM cost | $0 | 100% |

Run mimic benchmark --runs 1000 on your own recordings to see your numbers.
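
The arithmetic behind the table can be checked in a few lines. This is only a worked calculation using the figures quoted above (the "~$2" is approximate, so the per-run cost is too); it does not call Mimic.

```python
# Figures copied from the table above.
record_ms, replay_ms = 360, 50
runs, record_cost_usd = 1000, 2.00  # "~$2" in the table, so treat as approximate

speedup = record_ms / replay_ms        # how many times faster replay is
cost_per_run = record_cost_usd / runs  # LLM spend avoided per replayed run

print(f"speedup: {speedup:.1f}x, cost per recorded run: ${cost_per_run:.3f}")
```

The table rounds the 7.2× speedup down to 7×.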

Why Mimic?

Every team building AI agents hits the same wall:

  • "I changed a prompt. Did I break anything?" — You don't know.
  • "I switched from GPT-4 to Claude. Is it 2x more expensive?" — You don't know.
  • "Did this agent ever call delete_file in production?" — You don't know.
  • "Why did the agent fail on Tuesday at 3pm?" — You don't know.

Mimic turns those unknowns into testable, replayable, diffable artifacts. Think Sentry recordings + pytest assertions + git blame, purpose-built for LLM agents.

Features

  • Record any callable — sync or async, LLM calls, tool use, multi-step agents
  • Replay runs offline with zero API cost, byte-for-byte deterministic
  • Assert behavioral properties: cost, latency, tool usage, output content
  • Diff two runs to see exactly what changed
  • Auto-track LLM costs for OpenAI, Anthropic, Gemini (zero-config)
  • Multi-step agents with per-step recording, cost, and metadata
  • Privacy mode (capture_args=False, capture_return=False)
  • Storage-agnostic — filesystem by default, pluggable for S3/Postgres
  • Zero LLM vendor lock-in — works with any model
  • Beautiful CLI: mimic run / list / show / diff / report / benchmark
  • CI-ready — GitHub Actions template + pre-commit hook included
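
To give a feel for what diffing two runs means, here is a minimal sketch that compares two recordings serialized as JSON, using only the standard library. It is not how `mimic diff` is implemented, and the field names (`output`, `cost_usd`, `tools`) are assumptions made for illustration.

```python
import difflib
import json


def diff_runs(old: dict, new: dict) -> list:
    """Return unified-diff lines between two runs serialized as sorted JSON."""
    def dump(run: dict) -> list:
        return json.dumps(run, indent=2, sort_keys=True).splitlines()

    return list(difflib.unified_diff(dump(old), dump(new), "old", "new", lineterm=""))


# Hypothetical recordings; the real schema may differ.
old_run = {"output": "Hello!", "cost_usd": 0.012, "tools": ["search"]}
new_run = {"output": "Hello there!", "cost_usd": 0.015, "tools": ["search"]}

changes = diff_runs(old_run, new_run)
print("\n".join(changes))
```

Sorting the keys before diffing keeps the comparison stable when two runs serialize the same fields in a different order.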

Install

pip install mimic-recording

Or with optional integrations:

pip install "mimic-recording[openai]"
pip install "mimic-recording[anthropic]"

Quick start

mkdir my-agent && cd my-agent
mimic init

This creates a project skeleton:

my-agent/
├── mimic.yaml           # Project config
├── tests/
│   └── test_agent.py    # Your recorded tests
└── .mimic/              # Recorded runs (gitignored by default)

Edit tests/test_agent.py:

from mimic import Mimic, assert_that, replay

mimic = Mimic()

@mimic.record("my-agent", model="gpt-4o")
def answer(question: str) -> str:
    # ... your LLM call here ...
    return "..."

def test_agent():
    answer("hello")
    recorded = replay("my-agent")
    assert_that(recorded).finished_without_errors()
    assert_that(recorded).cost_less_than(usd=0.05)
    assert_that(recorded).did_not_call_tool("delete_database")

Run it:

mimic run tests/                 # records + runs (costs $$)
MIMIC_MODE=replay mimic run tests/  # replays only (free, deterministic)

Multi-step agents

For ReAct, multi-agent, or any agent with multiple LLM/tool calls, record each step:

# `llm` and `web_search` stand in for your own async LLM client and search tool.
@mimic.record("research-agent")
async def research(question: str) -> str:
    # Step 1: plan
    with mimic.step("plan", model="gpt-4o-mini") as s:
        resp = await llm.complete(model="gpt-4o-mini", messages=[...])
        tracked_completion(resp)
        s.metadata["plan_steps"] = 3

    # Step 2: search
    with mimic.step("search") as s:
        results = await web_search(question)
        s.metadata["result_count"] = len(results)

    # Step 3: synthesize
    with mimic.step("synthesize", model="gpt-4o") as s:
        resp = await llm.complete(model="gpt-4o", messages=[...])
        tracked_completion(resp)
        # Assumes an OpenAI-style response object, as in the first example.
        summary = resp.choices[0].message.content

    return summary
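
For intuition about what per-step recording involves, the following is a minimal sketch of a step context manager that captures a name, model, metadata, duration, and any error. `StepRecorder` and `Step` are hypothetical names invented for this sketch; it is not Mimic's implementation.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    name: str
    model: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    duration_ms: float = 0.0
    error: Optional[str] = None


class StepRecorder:
    """Toy per-step recorder (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.steps = []

    @contextmanager
    def step(self, name: str, model: Optional[str] = None):
        s = Step(name=name, model=model)
        start = time.perf_counter()
        try:
            yield s  # the caller fills in s.metadata inside the with-block
        except Exception as exc:
            s.error = repr(exc)  # keep the failure on the step, then re-raise
            raise
        finally:
            s.duration_ms = (time.perf_counter() - start) * 1000
            self.steps.append(s)


recorder = StepRecorder()
with recorder.step("plan", model="gpt-4o-mini") as s:
    s.metadata["plan_steps"] = 3
with recorder.step("search") as s:
    s.metadata["result_count"] = 5

print([(st.name, st.model, st.metadata) for st in recorder.steps])
```

Because each step is stored as its own record, a failing step can be inspected on its own without looking at the rest of the run.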

Assertions

Every assertion returns self, so calls can be chained. The full list:

assert_that(run).finished_without_errors()
assert_that(run).had_error()                   # inverse
assert_that(run).cost_less_than(usd=0.05)
assert_that(run).completed_under(ms=2000)
assert_that(run).output_contains("substring")
assert_that(run).output_matches(r"regex")
assert_that(run).output_equals(value)
assert_that(run).called_tool("search")
assert_that(run).did_not_call_tool("delete_database")
assert_that(run).called_tools(["search", "synthesize"])
assert_that(run).had_exactly(3)
assert_that(run).had_at_least(2)
assert_that(run).used_model("gpt-4o")
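
The chaining works because each method returns the object it was called on. Here is a minimal sketch of that pattern, assuming a run is a plain dict with hypothetical `error`, `cost_usd`, and `tools` fields. `ToyAssertions` is an illustrative stand-in, not Mimic's `assert_that`.

```python
class ToyAssertions:
    """Toy fluent assertion object (hypothetical, for illustration only)."""

    def __init__(self, run: dict) -> None:
        self.run = run

    def _check(self, ok: bool, message: str) -> "ToyAssertions":
        if not ok:
            raise AssertionError(message)
        return self  # returning self is what makes chaining possible

    def finished_without_errors(self) -> "ToyAssertions":
        return self._check(self.run.get("error") is None, "run ended with an error")

    def cost_less_than(self, usd: float) -> "ToyAssertions":
        cost = self.run["cost_usd"]
        return self._check(cost < usd, f"cost {cost} is not below {usd}")

    def did_not_call_tool(self, name: str) -> "ToyAssertions":
        return self._check(name not in self.run["tools"], f"tool {name!r} was called")


# Hypothetical run record; the field names are assumptions for this sketch.
run = {"error": None, "cost_usd": 0.01, "tools": ["search"]}

checked = (
    ToyAssertions(run)
    .finished_without_errors()
    .cost_less_than(usd=0.05)
    .did_not_call_tool("delete_database")
)
```

The first failing check raises `AssertionError`, so a chain stops at the first violated property, just as a sequence of separate statements would.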

How it works

Mimic sits outside your agent code, watching the inputs and outputs of any function you decorate. The first time the function runs, Mimic records the full execution into a content-addressable store. Subsequent test runs use the stored record instead of calling the LLM, making them fast, free, and deterministic.

For multi-step agents, Mimic records each step separately, so you can replay just the broken step without re-running the whole agent.
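
The record-then-replay idea can be sketched in a few dozen lines. The toy below stores each recording under the SHA-256 of its own bytes (the content-addressable part) and keeps an index from a hash of the call's name and inputs to that digest. `ToyRecorder` and its fields are assumptions made for illustration; Mimic's actual storage layout and mode handling may differ.

```python
import functools
import hashlib
import json


class ToyRecorder:
    """Toy record/replay decorator (hypothetical, not Mimic's implementation)."""

    def __init__(self, mode: str = "record") -> None:
        self.mode = mode
        self.blobs = {}  # sha256 of recording bytes -> recording bytes
        self.index = {}  # hash of (name, inputs) -> blob digest

    @staticmethod
    def _digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def record(self, name: str):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                call = json.dumps([name, args, kwargs], sort_keys=True).encode()
                call_key = self._digest(call)
                if self.mode == "replay":
                    if call_key not in self.index:
                        raise LookupError(f"no recording for {name!r} with these inputs")
                    blob = self.blobs[self.index[call_key]]
                    return json.loads(blob)["output"]
                output = fn(*args, **kwargs)  # the real (possibly costly) call
                blob = json.dumps({"name": name, "output": output}, sort_keys=True).encode()
                self.index[call_key] = self._digest(blob)
                self.blobs[self.index[call_key]] = blob
                return output
            return wrapper
        return decorator


calls = []
rec = ToyRecorder()


@rec.record("my-agent")
def answer(question: str) -> str:
    calls.append(question)
    return f"echo: {question}"


first = answer("hello")   # record mode: runs the function and stores the result
rec.mode = "replay"
second = answer("hello")  # replay mode: served from the store, function not called
print(first == second, len(calls))
```

This sketch only handles JSON-serializable inputs and outputs; a real recorder also has to deal with arbitrary objects, async functions, and errors.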

CI integration

Drop the included .github/workflows/test.yml into your repo. It runs your test suite in replay mode (no LLM cost) and validates that no cost was incurred.

Manual re-recording is a separate job, triggered on workflow_dispatch or a schedule.
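
One cheap safeguard for the replay-only job is to fail fast when the environment is not in replay mode. The sketch below relies only on the `MIMIC_MODE` variable shown in the quick start. `require_replay_mode` is a hypothetical helper, not part of Mimic or of the bundled workflow.

```python
import os
from typing import Mapping


def require_replay_mode(environ: Mapping[str, str] = os.environ) -> None:
    """Raise if the suite is about to run outside replay mode."""
    mode = environ.get("MIMIC_MODE", "")
    if mode != "replay":
        raise RuntimeError(
            f"MIMIC_MODE={mode!r}: refusing to run; CI should replay, not record"
        )


require_replay_mode({"MIMIC_MODE": "replay"})  # passes silently
```

Calling it at the top of a test session, for example from a `conftest.py`, would stop a misconfigured job before any LLM call is made.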

The recording format

Mimic recordings are plain JSON conforming to a documented schema — see RECORDING_FORMAT.md. The format is vendor-neutral: you can build readers, web UIs, or analysis tools without depending on the Mimic library.
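
Since recordings are plain JSON, a reader needs nothing beyond the standard library. The sketch below walks a directory and lists the top-level keys of each JSON file, so it makes no assumption about the schema itself. That recordings are stored as `*.json` files under a directory such as `.mimic/` is an assumption; consult RECORDING_FORMAT.md for the actual layout and field names.

```python
import json
import tempfile
from pathlib import Path


def summarize_recordings(root: Path) -> dict:
    """Map each *.json file under root to the sorted top-level keys it contains."""
    summary = {}
    for path in sorted(root.rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        keys = sorted(data) if isinstance(data, dict) else []
        summary[path.relative_to(root).as_posix()] = keys
    return summary


# Demo on a throwaway directory with one made-up recording.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "runs").mkdir()
    (root / "runs" / "example.json").write_text(
        json.dumps({"name": "my-agent", "output": "hi"}), encoding="utf-8"
    )
    summary = summarize_recordings(root)

print(summary)
```

Pointing `summarize_recordings` at a real recordings directory is a quick way to see which fields are present before writing a more specific tool.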

The $100M thesis

Mimic sits at the intersection of three exploding markets:

  1. AI agent development — 10M+ developers will build agents by 2027.
  2. AI observability — already a $2B+ market, dominated by closed vendors (LangSmith, Helicone, Langfuse).
  3. AI safety & compliance — every enterprise deploying agents needs guardrails, audit trails, and replay.

The land-and-expand model is proven (Sentry, Supabase, GitLab, Vercel, PostHog): open source core → community growth → enterprise tier with self-hosted, SSO, audit logs, and SOC2.

See BUSINESS_PLAN.md for the full strategy.

Roadmap

  • v0.1 — Record/replay/assert core
  • v0.2 — Async + multi-step + OpenAI/Anthropic cost tracking
  • v0.3 — Web UI for browsing recorded runs
  • v0.4 — TypeScript SDK
  • v0.5 — Auto-generated regression tests from production traces
  • v0.6 — Multi-agent parent/child traces
  • v1.0 — Enterprise self-hosted edition

Contributing

We love contributions. See CONTRIBUTING.md.

License

MIT — see LICENSE.
