Snapshot testing for AI agents — catch behavior regressions before they ship.
Your AI agent is one model upgrade away from silently breaking. You bump the model, tweak a prompt, or change a tool — and the agent starts behaving differently. You find out from a customer.
Brooder is the safety net. Wrap your agent once, and Brooder records its real runs as golden baselines. Every time you change the model, a prompt, or a tool, it re-runs and shows you a behavioral diff — what changed, what broke — and fails your CI if it regressed.
No eval datasets to hand-write. One command. It's jest --updateSnapshot, but for agents.
pip install brooder
Status: early alpha, built in public. Apache-2.0.
60-second demo (no API keys needed)
The included example agent simulates a model upgrade with an env var, so you can see Brooder catch a real regression completely offline.
git clone https://github.com/agentbrooder/brooder && cd brooder
pip install -e .
# The signature move: what breaks if I migrate from one model to another?
brooder migrate --from gpt-4o --to gpt-5-new examples/regressing_agent.py
Output (abridged):
──────────────────────── Model Migration Report ────────────────────────
1 of 3 cases change behavior when migrating gpt-4o → gpt-5-new.
support-agent · e1ded4070eee · REGRESSED · stability 40
path diverged at step 0: was TOOL create_ticket(order=12345), now dropped
- trajectory[0] {'name': 'create_ticket', 'args': {'order': '12345'}}
~ output
before: I've started your refund.
after: Refunds are not supported.
The "new model" silently stopped creating the refund ticket and flipped its answer. That would have shipped to production unnoticed. Brooder caught it — and exited non-zero, so CI would block it.
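To make the mechanism concrete, here is a minimal sketch of how an example agent could simulate a model upgrade offline. The env var name DEMO_MODEL, the function names, and the canned answers are assumptions for illustration; the bundled examples/regressing_agent.py may be structured differently.

```python
import os

def create_ticket(order: str, trajectory: list) -> str:
    """Hypothetical tool: note the call in the trajectory and return a canned result."""
    trajectory.append({"name": "create_ticket", "args": {"order": order}})
    return "ticket-created"

def support_agent(question: str) -> dict:
    """Fake agent whose behavior depends on the DEMO_MODEL env var (assumed name)."""
    model = os.environ.get("DEMO_MODEL", "gpt-4o")
    trajectory: list = []
    if model == "gpt-4o":
        create_ticket("12345", trajectory)
        output = "I've started your refund."
    else:
        # The "upgraded" model skips the tool call and changes its answer.
        output = "Refunds are not supported."
    return {"trajectory": trajectory, "output": output}
```

Because nothing here calls a real model, the same input produces two different trajectories depending only on the env var, which is the kind of change the report above flags.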
The normal workflow
brooder record examples/regressing_agent.py # capture golden baselines from real runs
brooder run examples/regressing_agent.py # re-run after a change, diff vs baseline
brooder diff # see exactly what changed
brooder approve # accept the new behavior as the baseline
brooder run exits non-zero when behavior regressed — drop it into CI and it gates your PRs.
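If your pipeline is driven from Python rather than a shell step, the exit-code contract can be checked with a small wrapper. This is only a sketch: the gate helper is a hypothetical name, and the brooder invocation in the comment is copied from the workflow above.

```python
import subprocess
import sys

def gate(command: list[str]) -> bool:
    """Run a command and return True only when it exits with status 0."""
    completed = subprocess.run(command)
    return completed.returncode == 0

# Assumed usage inside a CI script (adjust the script path to your project):
#   if not gate(["brooder", "run", "examples/regressing_agent.py"]):
#       sys.exit(1)
```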
Instrument your own agent
Add one decorator. Log tool calls with one function. That's the whole SDK.
import brooder

def search_kb(query):
    brooder.tool_call("search_kb", {"query": query}, result="...")
    return "..."

@brooder.record("support-agent")
def agent(question: str) -> str:
    docs = search_kb(question)
    return answer_from(docs)  # answer_from is your own answer-generation code (not shown)

# call it over your real inputs; brooder records/replays automatically
Then run it through the CLI. Baselines are plain JSON committed to your repo, so diffs show up in code review like any other change.
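As an illustration of why plain JSON reviews well, the sketch below serializes a baseline with sorted keys and fixed indentation so that two recordings of the same behavior produce byte-identical files. The record layout and field names are hypothetical; Brooder's actual baseline schema is not documented here.

```python
import json

# Hypothetical baseline record; the field names are illustrative only.
baseline = {
    "agent": "support-agent",
    "input": "Where is my refund for order 12345?",
    "trajectory": [
        {"name": "create_ticket", "args": {"order": "12345"}},
    ],
    "output": "I've started your refund.",
}

def dump_baseline(record: dict) -> str:
    """Serialize with sorted keys and fixed indentation so diffs stay minimal."""
    return json.dumps(record, indent=2, sort_keys=True, ensure_ascii=False) + "\n"

text = dump_baseline(baseline)
```

With a stable serialization like this, a reviewer only sees lines that correspond to real behavior changes, not key reordering noise.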
Auto-capture (no manual tool_call)
Wrap your LLM client and Brooder records the model's tool-call decisions automatically:
import brooder
import openai
client = brooder.instrument(openai.OpenAI())
# now every client.chat.completions.create(...) call is captured while recording
Supported providers: OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, and
Google (Gemini / Vertex). The provider is auto-detected; override it with
brooder.instrument(client, provider="bedrock"). Model names are intentionally not diffed, so
switching models isn't itself a change — only the model's behavior (which tools it calls, with
what arguments) is.
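One way to picture "model names are not diffed" is a normalization pass that strips non-behavioral fields before two runs are compared. The turn layout and the IGNORED_KEYS set below are assumptions made for this sketch, not Brooder's internal representation.

```python
from copy import deepcopy

IGNORED_KEYS = {"model"}  # assumption: fields treated as non-behavioral

def behavioral_view(turns: list[dict]) -> list[dict]:
    """Return a copy of the turns with non-behavioral fields removed."""
    cleaned = []
    for turn in deepcopy(turns):
        cleaned.append({k: v for k, v in turn.items() if k not in IGNORED_KEYS})
    return cleaned

old_run = [{"model": "gpt-4o", "tool": "search_kb", "args": {"query": "refund"}}]
same_behavior = [{"model": "gpt-5-new", "tool": "search_kb", "args": {"query": "refund"}}]
new_behavior = [{"model": "gpt-5-new", "tool": "search_kb", "args": {"query": "returns"}}]
```

Here old_run and same_behavior compare equal after normalization, while new_behavior differs because the tool arguments changed.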
Async works too. @brooder.record and instrument(...) handle async def agents and async
clients — AsyncOpenAI, AsyncAzureOpenAI, AsyncAnthropic, and Google's generate_content_async
— with no extra setup (the recording context follows your awaits and into child tasks):
client = brooder.instrument(openai.AsyncOpenAI())

@brooder.record("support-agent")
async def agent(question: str) -> str:
    await client.chat.completions.create(model="gpt-4o", messages=[...])
    ...
(Async AWS Bedrock via aioboto3 isn't covered yet — the sync boto3 client is.)
Capture from agent frameworks (OpenTelemetry)
Building on an agent framework? If it emits OpenTelemetry GenAI spans — LangGraph, CrewAI,
AutoGen, and anything else on the convention — add one span processor and Brooder ingests the
whole trajectory, no manual tool_call:
from opentelemetry import trace
from brooder.integrations.otel import BrooderSpanProcessor
trace.get_tracer_provider().add_span_processor(BrooderSpanProcessor(agent="support-agent"))
It maps inference spans → turns, execute_tool spans → tool calls, and the agent-root span's
input/output → the case identity and final answer. It also drops straight into the OTel pipelines
you already run (Datadog / Arize / Honeycomb).
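The mapping described above can be sketched over plain dictionaries standing in for finished spans. The attribute keys gen_ai.operation.name and gen_ai.tool.name follow the OpenTelemetry GenAI semantic conventions as I understand them; the span dict layout, the input/output keys, and the function name are simplifications assumed for this example, not BrooderSpanProcessor's real logic.

```python
def spans_to_case(spans: list[dict]) -> dict:
    """Fold a time-ordered list of span dicts into one recorded case."""
    case = {"input": None, "output": None, "turns": 0, "tool_calls": []}
    for span in spans:
        attrs = span.get("attributes", {})
        operation = attrs.get("gen_ai.operation.name")
        if operation == "invoke_agent":
            # Agent-root span: its input identifies the case, its output is the answer.
            case["input"] = span.get("input")
            case["output"] = span.get("output")
        elif operation == "chat":
            case["turns"] += 1
        elif operation == "execute_tool":
            case["tool_calls"].append(attrs.get("gen_ai.tool.name"))
    return case

example_spans = [
    {"attributes": {"gen_ai.operation.name": "invoke_agent"},
     "input": "refund order 12345", "output": "I've started your refund."},
    {"attributes": {"gen_ai.operation.name": "chat"}},
    {"attributes": {"gen_ai.operation.name": "execute_tool",
                    "gen_ai.tool.name": "create_ticket"}},
    {"attributes": {"gen_ai.operation.name": "chat"}},
]
case = spans_to_case(example_spans)
```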
Building directly on the Claude Agent SDK? Register Brooder's hooks and it records the tool trajectory automatically:
import brooder
from claude_agent_sdk import ClaudeSDKClient, ClaudeAgentOptions, ResultMessage
from brooder.integrations import claude_agent

options = ClaudeAgentOptions(hooks=brooder.claude_agent_hooks(agent="support-agent"))
async with ClaudeSDKClient(options=options) as client:
    await client.query(prompt)
    async for msg in client.receive_response():
        if isinstance(msg, ResultMessage):
            claude_agent.record_output(msg.session_id, msg.result)  # optional: capture the answer
UserPromptSubmit opens a run (the prompt is the case identity), PostToolUse becomes a tool step,
and Stop finalizes it.
On the OpenAI Agents SDK? Its tracing is on by default — install Brooder's trace processor once and every run is captured (no OpenAI API key required for capture):
import brooder.integrations.openai_agents as bd_agents
bd_agents.install(agent="support-agent") # then run your agents as usual
It maps generation/response spans → turns, function spans → tool calls, and handoffs and triggered guardrails into the trajectory too — so both tool selection and control-flow regressions get diffed.
Using LangChain or LangGraph? Attach one callback handler — no OpenTelemetry setup required:
import brooder.integrations.langchain as bd_lc
handler = bd_lc.callback_handler(agent="support-agent")
graph.invoke({"messages": [...]}, config={"callbacks": [handler]})
The root chain start opens a run (its input is the case identity), model calls become turns, and tool calls become tool steps — one handler covers both LangChain and LangGraph.
It tests agents (the whole trajectory), not single LLM calls
@brooder.record wraps your entire agent — every step of its plan → act → observe loop.
The baseline is the full trajectory: every tool call across every turn, in order, plus the
final output. So Brooder catches agent-level regressions, not just token changes in one model
response.
# A multi-step agent that silently stops verifying before answering on the newer model:
brooder migrate --from gpt-4o --to gpt-5-new examples/loop_agent.py
# -> REGRESSED: trajectory[1] "verify" removed
That dropped verify step happened inside the loop — the kind of thing an LLM-output eval
would never see.
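A trajectory-level diff of this kind can be approximated with the standard library's difflib, treating each tool call as one sequence element. This is a sketch of the idea, not Brooder's diff engine; the step layout and message format are assumptions modeled on the output shown above.

```python
import difflib
import json

def step_key(step: dict) -> str:
    """Canonical, hashable form of one tool call."""
    return json.dumps(step, sort_keys=True)

def diff_trajectories(baseline: list[dict], current: list[dict]) -> list[str]:
    """Describe removed and added steps between two ordered trajectories."""
    matcher = difflib.SequenceMatcher(
        a=[step_key(s) for s in baseline],
        b=[step_key(s) for s in current],
        autojunk=False,
    )
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            for i in range(i1, i2):
                changes.append(f'trajectory[{i}] "{baseline[i]["name"]}" removed')
        if tag in ("insert", "replace"):
            for j in range(j1, j2):
                changes.append(f'trajectory[{j}] "{current[j]["name"]}" added')
    return changes

before = [{"name": "search", "args": {}}, {"name": "verify", "args": {}}, {"name": "answer", "args": {}}]
after = [{"name": "search", "args": {}}, {"name": "answer", "args": {}}]
```

Comparing whole sequences rather than positions is what lets the dropped middle step be reported as a removal instead of a mismatch at every later index.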
Why not just use observability / eval tools?
| Tool type | Examples | What it does | The gap Brooder fills |
|---|---|---|---|
| Observability | Langfuse, Laminar, Phoenix | Trace/monitor after it runs | Doesn't gate before you ship |
| Eval frameworks | DeepEval, Braintrust, Ragas | Score against hand-written datasets | Requires eval authoring nobody maintains |
| Brooder | — | Record real runs → behavioral diff on every change → CI gate | Zero eval-writing, catches model-migration regressions |
Gate your PRs (GitHub Action)
Drop Brooder into CI and it re-runs your agent on every pull request, comments the behavioral diff,
and fails the check when behavior regresses. Copy examples/github-action.yml
to .github/workflows/brooder.yml:
permissions:
  contents: read
  pull-requests: write # so it can comment the diff
jobs:
  agent-snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: agentbrooder/brooder@v1
        with:
          script: tests/agent_snapshot.py
The comment is upserted (updated in place, not spammed) and looks like the --format markdown
output below.
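"Upserted" here means the action edits its previous comment instead of posting a new one on every push. A common way to do that is to tag the comment with a hidden marker and search for it; the in-memory sketch below shows that pattern. The marker string, the comment layout, and the function name are hypothetical, and the real action may identify its comment differently.

```python
MARKER = "<!-- brooder-report -->"  # hypothetical hidden marker

def upsert_comment(comments: list[dict], body: str) -> list[dict]:
    """Update the existing marked comment in place, or append a new one."""
    marked_body = f"{MARKER}\n{body}"
    for comment in comments:
        if comment["body"].startswith(MARKER):
            comment["body"] = marked_body
            return comments
    comments.append({"body": marked_body})
    return comments

thread = [{"body": "LGTM"}]
upsert_comment(thread, "1 of 3 cases regressed")
upsert_comment(thread, "0 of 3 cases regressed")
```

After both calls the thread still holds two comments: the human one and a single report that reflects the latest run.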
Machine-readable output (--json / OTLP)
run, ci, and diff take --format table|json|markdown (--json is a shortcut). Exit codes are
unchanged, so you can gate and parse:
brooder run agent.py --json | jq '.summary'
# { "total": 3, "passed": 2, "regressed": 1, "flaky": 0, "regressions": 1, "mean_stability": 80 }
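The same summary can be consumed from Python when jq is not available. The sketch below parses a payload shaped like the sample above (the summary object nested under a top-level "summary" key, as the jq filter implies) and applies a blocking rule; the min_stability threshold and the should_block helper are assumptions for illustration.

```python
import json

# Sample payload built from the summary shown above.
raw = (
    '{"summary": {"total": 3, "passed": 2, "regressed": 1, '
    '"flaky": 0, "regressions": 1, "mean_stability": 80}}'
)

def should_block(report_json: str, min_stability: int = 90) -> bool:
    """Block when any case regressed or mean stability falls below the threshold."""
    summary = json.loads(report_json)["summary"]
    return summary["regressed"] > 0 or summary["mean_stability"] < min_stability

blocked = should_block(raw)
```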
For dashboards, point Brooder at any OTLP endpoint and each run emits a snapshot of gauges
(brooder.cases.*, brooder.stability.mean) — one exporter that reaches Datadog, Grafana,
Honeycomb, and CloudWatch:
pip install 'brooder[otel]'
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/metrics # or metrics.otlp_endpoint in brooder.yaml
brooder ci agent.py
What it checks
- Structural diff: the sequence of tool calls, their arguments, and the final output.
- Semantic diff: a pluggable judge (judge: exact | llm) so equivalent wording isn't a regression.
- Flakiness: brooder run --runs 3 runs each case N times (here, 3) and flags non-determinism (FLAKY).
Each case gets a verdict — PASS / REGRESSED / NEW / FLAKY — and a stability score.
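The checks above can be tied together in a small sketch that assigns a verdict from a baseline output, several repeated runs, and a pluggable judge. The precedence of the checks, the function names, and the lenient judge (a stand-in for an LLM-based one) are assumptions; the stability score is left out because its formula is not described here.

```python
from typing import Callable, Optional

Judge = Callable[[str, str], bool]

def exact_judge(a: str, b: str) -> bool:
    """Outputs must match character for character."""
    return a == b

def lenient_judge(a: str, b: str) -> bool:
    """Stand-in for a semantic judge: ignores case and surrounding whitespace."""
    return a.strip().lower() == b.strip().lower()

def verdict(baseline: Optional[str], runs: list[str], judge: Judge = exact_judge) -> str:
    """Classify one case as NEW, FLAKY, PASS, or REGRESSED."""
    if not runs:
        raise ValueError("at least one run is required")
    if baseline is None:
        return "NEW"  # nothing recorded yet to compare against
    first = runs[0]
    if any(not judge(first, other) for other in runs[1:]):
        return "FLAKY"  # repeated runs disagree with each other
    return "PASS" if judge(baseline, first) else "REGRESSED"
```

Swapping the judge changes only what counts as "the same output"; the verdict logic itself stays untouched, which is the point of making the judge pluggable.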
Roadmap
See ROADMAP.md for what's shipped and what's planned.
Contributing
See CONTRIBUTING.md. Issues and PRs welcome — this is being built in public.
License
Apache-2.0.