

Brooder — snapshot testing for AI agents


Snapshot testing for AI agents. Catch behavior regressions before they ship.

Your AI agent is one model upgrade away from silently breaking. You bump the model, tweak a prompt, or change a tool — and the agent starts behaving differently. You find out from a customer.

Brooder is the safety net. Wrap your agent once, and Brooder records its real runs as golden baselines. Every time you change the model, a prompt, or a tool, it re-runs and shows you a behavioral diff — what changed, what broke — and fails your CI if it regressed.

No eval datasets to hand-write. One command. It's jest --updateSnapshot, but for agents.

pip install brooder

[Figure: brooder migrate catching a dropped tool call and a flipped answer]

Status: early alpha, built in public. Apache-2.0.


60-second demo (no API keys needed)

The included example agent simulates a model upgrade with an env var, so you can see Brooder catch a real regression completely offline.

git clone https://github.com/agentbrooder/brooder && cd brooder
pip install -e .

# The signature move: what breaks if I migrate from one model to another?
brooder migrate --from gpt-4o --to gpt-5-new examples/regressing_agent.py

Output (abridged):

──────────────────────── Model Migration Report ────────────────────────
 1 of 3 cases change behavior when migrating gpt-4o → gpt-5-new.

 support-agent · e1ded4070eee · REGRESSED · stability 40
   path diverged at step 0: was TOOL create_ticket(order=12345), now dropped
   - trajectory[0]  {'name': 'create_ticket', 'args': {'order': '12345'}}
   ~ output
       before: I've started your refund.
       after:  Refunds are not supported.

The "new model" silently stopped creating the refund ticket and flipped its answer. That would have shipped to production unnoticed. Brooder caught it — and exited non-zero, so CI would block it.


The normal workflow

brooder record examples/regressing_agent.py     # capture golden baselines from real runs
brooder run    examples/regressing_agent.py     # re-run after a change, diff vs baseline
brooder diff                                    # see exactly what changed
brooder approve                                 # accept the new behavior as the baseline

brooder run exits non-zero when behavior regressed — drop it into CI and it gates your PRs.
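As a rough illustration of what "gating on the exit code" means, here is a minimal Python sketch of a wrapper that a CI script could use. The `gate` helper is hypothetical and not part of Brooder, and two stand-in commands replace the real `brooder run` invocation so the sketch runs without Brooder installed.

```python
import subprocess
import sys
from typing import Sequence


def gate(command: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# In CI the command would be something like
# ["brooder", "run", "tests/agent_snapshot.py"].
# Stand-in commands are used here (an assumption for the demo only).
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print("passing command allowed:", gate(passing))
print("failing command allowed:", gate(failing))
```

Most CI systems already fail a job when a step exits non-zero, so a wrapper like this is only needed when you want custom handling around the result.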


Instrument your own agent

Add one decorator. Log tool calls with one function. That's the whole SDK.

import brooder

def search_kb(query):
    brooder.tool_call("search_kb", {"query": query}, result="...")
    return "..."

@brooder.record("support-agent")
def agent(question: str) -> str:
    docs = search_kb(question)
    return answer_from(docs)

# call it over your real inputs; brooder records/replays automatically

Then run it through the CLI. Baselines are plain JSON committed to your repo, so diffs show up in code review like any other change.
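To see why plain JSON baselines review well, consider a sketch of deterministic serialization. The field names (`agent`, `trajectory`, `output`) are assumptions made for illustration; this README does not show Brooder's actual on-disk layout. The point is only that sorted keys and fixed indentation keep a diff limited to the lines whose behavior changed.

```python
import json


def dump_baseline(case: dict) -> str:
    """Serialize a case with sorted keys so the text is stable across runs."""
    return json.dumps(case, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


# Hypothetical baseline record, modeled on the demo output above.
baseline = {
    "agent": "support-agent",
    "trajectory": [{"name": "create_ticket", "args": {"order": "12345"}}],
    "output": "I've started your refund.",
}

# Same data with keys inserted in a different order.
reordered = {
    "output": "I've started your refund.",
    "trajectory": [{"args": {"order": "12345"}, "name": "create_ticket"}],
    "agent": "support-agent",
}

print(dump_baseline(baseline))
```

Because key order no longer affects the text, two recordings of identical behavior produce byte-identical files and therefore an empty diff.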


Auto-capture (no manual tool_call)

Wrap your LLM client and Brooder records the model's tool-call decisions automatically:

import brooder
import openai

client = brooder.instrument(openai.OpenAI())
# now every client.chat.completions.create(...) call is captured while recording

Supported providers: OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, and Google (Gemini / Vertex). The provider is auto-detected; override it with brooder.instrument(client, provider="bedrock"). Model names are intentionally not diffed, so switching models isn't itself a change — only the model's behavior (which tools it calls, with what arguments) is.
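The idea of ignoring the model name while still comparing behavior can be sketched in a few lines. This is an illustration under assumed record shapes (`model`, `tool_calls`), not Brooder's implementation.

```python
def comparable(turn: dict) -> dict:
    """Drop the model name so only the behavior is compared."""
    return {key: value for key, value in turn.items() if key != "model"}


# Hypothetical recorded turns.
old_turn = {
    "model": "gpt-4o",
    "tool_calls": [{"name": "create_ticket", "args": {"order": "12345"}}],
}
same_behavior = {
    "model": "gpt-5-new",
    "tool_calls": [{"name": "create_ticket", "args": {"order": "12345"}}],
}
dropped_call = {"model": "gpt-5-new", "tool_calls": []}

print(comparable(old_turn) == comparable(same_behavior))
print(comparable(old_turn) == comparable(dropped_call))
```

The first comparison treats a pure model swap as no change; the second still flags the dropped tool call.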

Async works too. @brooder.record and instrument(...) handle async def agents and async clients (AsyncOpenAI, AsyncAzureOpenAI, AsyncAnthropic, and Google's generate_content_async) with no extra setup: the recording context follows your awaits and propagates into child tasks.

import brooder
import openai

client = brooder.instrument(openai.AsyncOpenAI())

@brooder.record("support-agent")
async def agent(question: str) -> str:
    await client.chat.completions.create(model="gpt-4o", messages=[...])
    ...

(Async AWS Bedrock via aioboto3 isn't covered yet — the sync boto3 client is.)
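A context that "follows your awaits and into child tasks" is the behavior Python's standard `contextvars` module provides: `asyncio` copies the current context whenever a task is created. The sketch below shows that mechanism with a hypothetical `current_run` variable and `log_step` helper. Whether Brooder is built this way is an assumption, not something this README states.

```python
import asyncio
import contextvars

# Hypothetical stand-in for a recorder's "current run" slot.
current_run = contextvars.ContextVar("current_run", default=None)


def log_step(name: str) -> None:
    """Append a step to the active run, if there is one."""
    run = current_run.get()
    if run is not None:
        run.append(name)


async def child() -> None:
    await asyncio.sleep(0)
    log_step("child_tool")


async def agent() -> list:
    steps: list = []
    current_run.set(steps)
    log_step("parent_tool")
    # create_task copies the current context, so the child sees the same run.
    await asyncio.create_task(child())
    return steps


recorded = asyncio.run(agent())
print(recorded)  # ['parent_tool', 'child_tool']
```

No recorder object is passed to `child`; the context variable carries it across the task boundary.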

Capture from agent frameworks (OpenTelemetry)

Building on an agent framework? If it emits OpenTelemetry GenAI spans — LangGraph, CrewAI, AutoGen, and anything else on the convention — add one span processor and Brooder ingests the whole trajectory, no manual tool_call:

from opentelemetry import trace
from brooder.integrations.otel import BrooderSpanProcessor

trace.get_tracer_provider().add_span_processor(BrooderSpanProcessor(agent="support-agent"))

It maps inference spans → turns, execute_tool spans → tool calls, and the agent-root span's input/output → the case identity and final answer. It also drops straight into the OTel pipelines you already run (Datadog / Arize / Honeycomb).
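The span-to-trajectory mapping can be pictured with plain dictionaries standing in for spans. The attribute keys follow the OpenTelemetry GenAI semantic conventions as I understand them (`gen_ai.operation.name`, `gen_ai.tool.name`, `gen_ai.request.model`); the `start` field and the output shape are assumptions for this sketch, and real spans would arrive through a span processor rather than a list.

```python
def spans_to_trajectory(spans: list) -> list:
    """Turn finished spans into an ordered list of turns and tool calls."""
    steps = []
    for span in sorted(spans, key=lambda s: s["start"]):
        attrs = span["attributes"]
        operation = attrs.get("gen_ai.operation.name")
        if operation == "execute_tool":
            steps.append({"kind": "tool", "name": attrs.get("gen_ai.tool.name")})
        elif operation == "chat":
            steps.append({"kind": "turn", "model": attrs.get("gen_ai.request.model")})
        # Other spans (HTTP calls, database queries, ...) are ignored.
    return steps


# Hypothetical spans, deliberately out of order.
spans = [
    {"start": 2, "attributes": {"gen_ai.operation.name": "execute_tool",
                                "gen_ai.tool.name": "create_ticket"}},
    {"start": 1, "attributes": {"gen_ai.operation.name": "chat",
                                "gen_ai.request.model": "gpt-4o"}},
    {"start": 3, "attributes": {"http.request.method": "GET"}},
]

print(spans_to_trajectory(spans))
```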

Building directly on the Claude Agent SDK? Register Brooder's hooks and it records the tool trajectory automatically:

import asyncio

import brooder
from claude_agent_sdk import ClaudeSDKClient, ClaudeAgentOptions, ResultMessage
from brooder.integrations import claude_agent

options = ClaudeAgentOptions(hooks=brooder.claude_agent_hooks(agent="support-agent"))

async def main(prompt: str) -> None:
    # `async with` is only valid inside a coroutine, so the run lives in main()
    async with ClaudeSDKClient(options=options) as client:
        await client.query(prompt)
        async for msg in client.receive_response():
            if isinstance(msg, ResultMessage):
                claude_agent.record_output(msg.session_id, msg.result)  # optional: capture the answer

asyncio.run(main("Where is my refund for order 12345?"))  # example prompt

UserPromptSubmit opens a run (the prompt is the case identity), PostToolUse becomes a tool step, and Stop finalizes it.
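The three-event lifecycle can be modeled as a small state machine. The `HookRecorder` class and the payload keys (`prompt`, `tool_name`, `tool_input`) are assumptions for illustration; the hooks Brooder actually registers are not shown in this README.

```python
class HookRecorder:
    """Toy recorder driven by the three hook events named above."""

    def __init__(self) -> None:
        self.open_runs: dict = {}
        self.finished: list = []

    def handle(self, event: str, session_id: str, payload: dict) -> None:
        if event == "UserPromptSubmit":
            # The prompt identifies the case, so it opens a new run.
            self.open_runs[session_id] = {"case": payload["prompt"], "trajectory": []}
        elif event == "PostToolUse":
            step = {"name": payload["tool_name"], "args": payload["tool_input"]}
            self.open_runs[session_id]["trajectory"].append(step)
        elif event == "Stop":
            # Finalize: move the run from "open" to "finished".
            self.finished.append(self.open_runs.pop(session_id))


recorder = HookRecorder()
recorder.handle("UserPromptSubmit", "s1", {"prompt": "Refund order 12345"})
recorder.handle("PostToolUse", "s1",
                {"tool_name": "create_ticket", "tool_input": {"order": "12345"}})
recorder.handle("Stop", "s1", {})
print(recorder.finished)
```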

On the OpenAI Agents SDK? Its tracing is on by default — install Brooder's trace processor once and every run is captured (no OpenAI API key required for capture):

import brooder.integrations.openai_agents as bd_agents

bd_agents.install(agent="support-agent")   # then run your agents as usual

It maps generation/response spans → turns and function spans → tool calls, and it records handoffs and triggered guardrails in the trajectory too, so both tool-selection and control-flow regressions get diffed.

Using LangChain or LangGraph? Attach one callback handler — no OpenTelemetry setup required:

import brooder.integrations.langchain as bd_lc

handler = bd_lc.callback_handler(agent="support-agent")
graph.invoke({"messages": [...]}, config={"callbacks": [handler]})

The root chain start opens a run (its input is the case identity), model calls become turns, and tool calls become tool steps — one handler covers both LangChain and LangGraph.

It tests agents (the whole trajectory), not single LLM calls

@brooder.record wraps your entire agent — every step of its plan → act → observe loop. The baseline is the full trajectory: every tool call across every turn, in order, plus the final output. So Brooder catches agent-level regressions, not just token changes in one model response.

# A multi-step agent that silently stops verifying before answering on the newer model:
brooder migrate --from gpt-4o --to gpt-5-new examples/loop_agent.py
# -> REGRESSED: trajectory[1] "verify" removed

That dropped verify step happened inside the loop — the kind of thing an LLM-output eval would never see.
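A trajectory-level comparison of this kind can be sketched as "find the first step where the two runs disagree". The function below is an illustration, not Brooder's diff algorithm, and the step names other than `verify` are made up.

```python
from typing import Optional


def first_divergence(baseline: list, current: list) -> Optional[int]:
    """Return the index of the first differing step, or None if identical."""
    for index, (old, new) in enumerate(zip(baseline, current)):
        if old != new:
            return index
    if len(baseline) != len(current):
        # One run is a strict prefix of the other.
        return min(len(baseline), len(current))
    return None


# Hypothetical multi-step trajectories.
baseline = [{"name": "search"}, {"name": "verify"}, {"name": "answer"}]
current = [{"name": "search"}, {"name": "answer"}]

print(first_divergence(baseline, current))  # 1
```

Comparing only the final answers would miss this case whenever the answer happens to stay the same; comparing the ordered steps does not.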

Why not just use observability / eval tools?

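| Tool type       | Examples                   | What it does                                                 | The gap Brooder fills                                  |
|-----------------|----------------------------|--------------------------------------------------------------|--------------------------------------------------------|
| Observability   | Langfuse, Laminar, Phoenix | Trace/monitor after it runs                                  | Doesn't gate before you ship                           |
| Eval frameworks | DeepEval, Braintrust, Ragas | Score against hand-written datasets                          | Requires eval authoring nobody maintains               |
| Brooder         |                            | Record real runs → behavioral diff on every change → CI gate | Zero eval-writing, catches model-migration regressions |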

Gate your PRs (GitHub Action)

Drop Brooder into CI and it re-runs your agent on every pull request, comments the behavioral diff, and fails the check when behavior regresses. Copy examples/github-action.yml to .github/workflows/brooder.yml:

on: pull_request              # run on every pull request

permissions:
  contents: read
  pull-requests: write        # so it can comment the diff

jobs:
  agent-snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: agentbrooder/brooder@v1
        with:
          script: tests/agent_snapshot.py

The comment is upserted (updated in place, not spammed) and looks like the --format markdown output below.
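"Upserted" here means update-if-present, otherwise create. One common way to do that is to tag the comment with a hidden marker and look for it on later runs. The sketch below works on an in-memory list of comments. The marker string and the `upsert_comment` helper are hypothetical, and how the Brooder action finds its own comment is not documented in this README.

```python
MARKER = "<!-- brooder-report -->"  # hypothetical hidden marker


def upsert_comment(comments: list, body: str) -> list:
    """Replace the tagged comment if it exists, otherwise append a new one."""
    tagged = f"{MARKER}\n{body}"
    for comment in comments:
        if MARKER in comment["body"]:
            comment["body"] = tagged  # update in place
            return comments
    comments.append({"body": tagged})  # first run on this pull request
    return comments


thread = [{"body": "LGTM once CI is green."}]
upsert_comment(thread, "1 of 3 cases regressed")
upsert_comment(thread, "0 of 3 cases regressed")
print(len(thread))  # 2
```

However many times the check runs, the thread holds one report comment, and it always shows the latest result.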

Machine-readable output (--json / OTLP)

run, ci, and diff accept --format table|json|markdown (--json is shorthand for --format json). Exit codes do not depend on the output format, so you can gate on the exit code and parse the output in the same step:

brooder run agent.py --json | jq '.summary'
# { "total": 3, "passed": 2, "regressed": 1, "flaky": 0, "regressions": 1, "mean_stability": 80 }
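The same summary can be consumed from Python when a script needs a custom policy. This sketch parses the sample shown above, wrapped in the top-level `summary` key that the jq filter selects. The `should_block` helper and its `min_mean_stability` threshold are hypothetical, not Brooder options.

```python
import json

# Sample values copied from the summary above.
raw = (
    '{"summary": {"total": 3, "passed": 2, "regressed": 1, '
    '"flaky": 0, "regressions": 1, "mean_stability": 80}}'
)


def should_block(summary: dict, min_mean_stability: int = 0) -> bool:
    """Block when any case regressed or mean stability falls below a threshold."""
    return summary["regressed"] > 0 or summary["mean_stability"] < min_mean_stability


summary = json.loads(raw)["summary"]
print(should_block(summary))  # True
```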

For dashboards, point Brooder at any OTLP endpoint and each run emits a snapshot of gauges (brooder.cases.*, brooder.stability.mean) — one exporter that reaches Datadog, Grafana, Honeycomb, and CloudWatch:

pip install 'brooder[otel]'
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/metrics   # or metrics.otlp_endpoint in brooder.yaml
brooder ci agent.py

What it checks

  • Structural diff: the sequence of tool calls, their arguments, and the final output.
  • Semantic diff: a pluggable judge (judge: exact | llm) so equivalent wording isn't a regression.
  • Flakiness: brooder run --runs N runs each case N times (for example, --runs 3) and flags non-determinism (FLAKY).

Each case gets a verdict — PASS / REGRESSED / NEW / FLAKY — and a stability score.
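The way the three checks could combine into one verdict can be sketched as follows. This is an assumed decision order (NEW, then FLAKY, then PASS or REGRESSED), with a whitespace-and-case-insensitive judge standing in for a semantic one. It is not Brooder's actual logic, and the stability score is left out.

```python
from typing import Callable, Optional


def exact_judge(expected: str, actual: str) -> bool:
    return expected == actual


def loose_judge(expected: str, actual: str) -> bool:
    """Stand-in for a semantic judge: ignore case and extra whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.casefold().split())
    return normalize(expected) == normalize(actual)


def verdict(baseline: Optional[str], runs: list,
            judge: Callable[[str, str], bool] = exact_judge) -> str:
    if baseline is None:
        return "NEW"  # nothing recorded yet for this case
    first = runs[0]
    if not all(judge(first, other) for other in runs[1:]):
        return "FLAKY"  # repeated runs disagree with each other
    return "PASS" if judge(baseline, first) else "REGRESSED"


print(verdict("I've started your refund.", ["Refunds are not supported."] * 3))
```

Swapping `exact_judge` for a looser judge changes which wording differences count as regressions without touching the rest of the flow.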


Roadmap

See ROADMAP.md for what's shipped and what's planned.

Contributing

See CONTRIBUTING.md. Issues and PRs welcome — this is being built in public.

License

Apache-2.0.
