brooder

Snapshot testing for AI agents — catch behavior regressions before they ship.

Maintainers

iron8kid

Project description

Brooder — snapshot testing for AI agents

Your AI agent is one model upgrade away from silently breaking. You bump the model, tweak a prompt, or change a tool — and the agent starts behaving differently. You find out from a customer.

Brooder is the safety net. Wrap your agent once, and Brooder records its real runs as golden baselines. Every time you change the model, a prompt, or a tool, it re-runs and shows you a behavioral diff — what changed, what broke — and fails your CI if it regressed.

No eval datasets to hand-write. One command. It's jest --updateSnapshot, but for agents.

pip install brooder

[Screenshot: brooder migrate catching a dropped tool call and a flipped answer]

Status: early alpha, built in public. Apache-2.0.

60-second demo (no API keys needed)

The included example agent simulates a model upgrade with an env var, so you can see Brooder catch a real regression completely offline.

git clone https://github.com/agentbrooder/brooder && cd brooder
pip install -e .

# The signature move: what breaks if I migrate from one model to another?
brooder migrate --from gpt-4o --to gpt-5-new examples/regressing_agent.py

Output (abridged):

──────────────────────── Model Migration Report ────────────────────────
 1 of 3 cases change behavior when migrating gpt-4o → gpt-5-new.

 support-agent · e1ded4070eee · REGRESSED · stability 40
   path diverged at step 0: was TOOL create_ticket(order=12345), now dropped
   - trajectory[0]  {'name': 'create_ticket', 'args': {'order': '12345'}}
   ~ output
       before: I've started your refund.
       after:  Refunds are not supported.

The "new model" silently stopped creating the refund ticket and flipped its answer. That would have shipped to production unnoticed. Brooder caught it — and exited non-zero, so CI would block it.

The normal workflow

brooder record examples/regressing_agent.py     # capture golden baselines from real runs
brooder run    examples/regressing_agent.py     # re-run after a change, diff vs baseline
brooder diff                                    # see exactly what changed
brooder approve                                 # accept the new behavior as the baseline

brooder run exits non-zero when behavior regressed — drop it into CI and it gates your PRs.
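If your pipeline is driven from Python rather than shell, the exit-code contract above can be wrapped in a few lines. This is a minimal sketch: the helper name `gate` is hypothetical, and stand-in commands are used so it runs without Brooder installed. The only thing it assumes about the CLI is what the text states, namely that a non-zero exit means a regression.

```python
import subprocess
import sys
from typing import Sequence


def gate(cmd: Sequence[str]) -> bool:
    """Run a command; True means exit status 0 (no regression reported)."""
    return subprocess.run(list(cmd)).returncode == 0


# Stand-in commands so the sketch runs anywhere. In CI you would pass
# something like ["brooder", "run", "examples/regressing_agent.py"].
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(gate(passing))  # True
print(gate(failing))  # False
```

Most CI systems already fail a step on a non-zero exit, so a wrapper like this is only useful when you want to add your own logic around the result.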

Instrument your own agent

Add one decorator. Log tool calls with one function. That's the whole SDK.

import brooder

def search_kb(query):
    brooder.tool_call("search_kb", {"query": query}, result="...")
    return "..."

@brooder.record("support-agent")
def agent(question: str) -> str:
    docs = search_kb(question)
    return answer_from(docs)

# call it over your real inputs; brooder records/replays automatically

Then run it through the CLI. Baselines are plain JSON committed to your repo, so diffs show up in code review like any other change.
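To see why plain-JSON baselines review well, here is a small sketch of deterministic serialization. The field names mirror the demo output above (`trajectory`, `name`, `args`, `output`); the actual on-disk schema is not documented here, so treat the record layout and the helper `to_baseline_json` as assumptions.

```python
import json

# Assumed record shape, modeled on the migration report shown earlier.
baseline = {
    "agent": "support-agent",
    "trajectory": [{"name": "create_ticket", "args": {"order": "12345"}}],
    "output": "I've started your refund.",
}


def to_baseline_json(record: dict) -> str:
    """Serialize with sorted keys and fixed indentation so re-recording an
    unchanged run yields a byte-identical file (and an empty git diff)."""
    return json.dumps(record, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


text = to_baseline_json(baseline)

# Key order in the input dict does not affect the serialized form.
reordered = dict(reversed(list(baseline.items())))
assert to_baseline_json(reordered) == text

# The file round-trips back to the same record.
assert json.loads(text) == baseline
```

Stable key order and indentation matter more than the exact schema: without them, a re-record can produce a noisy diff even when behavior is unchanged.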

Auto-capture (no manual `tool_call`)

Wrap your LLM client and Brooder records the model's tool-call decisions automatically:

import brooder
import openai

client = brooder.instrument(openai.OpenAI())
# now every client.chat.completions.create(...) call is captured while recording

Supported providers: OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, and Google (Gemini / Vertex). The provider is auto-detected; override it with brooder.instrument(client, provider="bedrock"). Model names are intentionally not diffed, so switching models isn't itself a change — only the model's behavior (which tools it calls, with what arguments) is.
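The idea that model names are not diffed can be illustrated by normalizing records before comparing them. This is a sketch of the concept only; `IGNORED_KEYS`, `normalize`, and the record shape are hypothetical and not taken from Brooder's code.

```python
from typing import Any

IGNORED_KEYS = {"model"}  # hypothetical: fields excluded from the diff


def normalize(value: Any) -> Any:
    """Recursively drop ignored keys so only behavior is compared."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in IGNORED_KEYS}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


call = {"name": "search_kb", "args": {"query": "refund"}}
before = {"model": "gpt-4o", "tool_calls": [call]}
same_behavior = {"model": "gpt-5-new", "tool_calls": [call]}
new_behavior = {"model": "gpt-5-new", "tool_calls": []}

print(normalize(before) == normalize(same_behavior))  # True
print(normalize(before) == normalize(new_behavior))   # False
```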

Async works too. @brooder.record and instrument(...) handle async def agents and async clients — AsyncOpenAI, AsyncAzureOpenAI, AsyncAnthropic, and Google's generate_content_async — with no extra setup (the recording context follows your awaits and into child tasks):

import brooder
import openai

client = brooder.instrument(openai.AsyncOpenAI())

@brooder.record("support-agent")
async def agent(question: str) -> str:
    await client.chat.completions.create(model="gpt-4o", messages=[...])
    ...

(Async AWS Bedrock via aioboto3 isn't covered yet — the sync boto3 client is.)

Capture from agent frameworks (OpenTelemetry)

Building on an agent framework? If it emits OpenTelemetry GenAI spans — LangGraph, CrewAI, AutoGen, and anything else on the convention — add one span processor and Brooder ingests the whole trajectory, no manual tool_call:

from opentelemetry import trace
from brooder.integrations.otel import BrooderSpanProcessor

trace.get_tracer_provider().add_span_processor(BrooderSpanProcessor(agent="support-agent"))

It maps inference spans → turns, execute_tool spans → tool calls, and the agent-root span's input/output → the case identity and final answer. It also drops straight into the OTel pipelines you already run (Datadog / Arize / Honeycomb).
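The span mapping described above can be sketched on plain dictionaries, without an OpenTelemetry dependency. The attribute names `gen_ai.operation.name` and `gen_ai.tool.name` come from the OpenTelemetry GenAI semantic conventions; the span dict layout, the `start` field, and the function `spans_to_trajectory` are assumptions for illustration, not Brooder's implementation.

```python
from typing import Any, Dict, List

# Operation names treated as model inference in this sketch (an assumption).
INFERENCE_OPS = {"chat", "text_completion", "generate_content"}


def spans_to_trajectory(spans: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold span dicts into a turn count and an ordered tool-call list."""
    record: Dict[str, Any] = {"turns": 0, "tool_calls": []}
    for span in sorted(spans, key=lambda s: s["start"]):
        attrs = span["attributes"]
        op = attrs.get("gen_ai.operation.name")
        if op == "execute_tool":
            record["tool_calls"].append(attrs.get("gen_ai.tool.name"))
        elif op in INFERENCE_OPS:
            record["turns"] += 1
    return record


spans = [
    {"start": 3, "attributes": {"gen_ai.operation.name": "chat"}},
    {"start": 1, "attributes": {"gen_ai.operation.name": "chat"}},
    {"start": 2, "attributes": {"gen_ai.operation.name": "execute_tool",
                                "gen_ai.tool.name": "search_kb"}},
]
print(spans_to_trajectory(spans))  # {'turns': 2, 'tool_calls': ['search_kb']}
```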

Building directly on the Claude Agent SDK? Register Brooder's hooks and it records the tool trajectory automatically:

import asyncio

import brooder
from claude_agent_sdk import ClaudeSDKClient, ClaudeAgentOptions, ResultMessage
from brooder.integrations import claude_agent

options = ClaudeAgentOptions(hooks=brooder.claude_agent_hooks(agent="support-agent"))

async def main(prompt: str) -> None:
    # `async with` is only valid inside a coroutine, so the session lives in main()
    async with ClaudeSDKClient(options=options) as client:
        await client.query(prompt)
        async for msg in client.receive_response():
            if isinstance(msg, ResultMessage):
                claude_agent.record_output(msg.session_id, msg.result)  # optional: capture the answer

asyncio.run(main("your prompt here"))

UserPromptSubmit opens a run (the prompt is the case identity), PostToolUse becomes a tool step, and Stop finalizes it.

On the OpenAI Agents SDK? Its tracing is on by default — install Brooder's trace processor once and every run is captured (no OpenAI API key required for capture):

import brooder.integrations.openai_agents as bd_agents

bd_agents.install(agent="support-agent")   # then run your agents as usual

It maps generation/response spans → turns and function spans → tool calls, and it records handoffs and triggered guardrails in the trajectory too, so both tool-selection and control-flow regressions get diffed.

Using LangChain or LangGraph? Attach one callback handler — no OpenTelemetry setup required:

import brooder.integrations.langchain as bd_lc

handler = bd_lc.callback_handler(agent="support-agent")
graph.invoke({"messages": [...]}, config={"callbacks": [handler]})

The root chain start opens a run (its input is the case identity), model calls become turns, and tool calls become tool steps — one handler covers both LangChain and LangGraph.

It tests agents (the whole trajectory), not single LLM calls

@brooder.record wraps your entire agent — every step of its plan → act → observe loop. The baseline is the full trajectory: every tool call across every turn, in order, plus the final output. So Brooder catches agent-level regressions, not just token changes in one model response.

# A multi-step agent that silently stops verifying before answering on the newer model:
brooder migrate --from gpt-4o --to gpt-5-new examples/loop_agent.py
# -> REGRESSED: trajectory[1] "verify" removed

That dropped verify step happened inside the loop — the kind of thing an LLM-output eval would never see.
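A trajectory-level check of this kind can be approximated with a sequence diff over tool names. The sketch below uses the standard library's `difflib` and reports steps present in the baseline but missing from the current run. The step format and the helper `removed_steps` are hypothetical, and this is a stand-in technique rather than Brooder's actual differ.

```python
import difflib
from typing import Any, Dict, List, Tuple

Step = Dict[str, Any]


def removed_steps(baseline: List[Step], current: List[Step]) -> List[Tuple[int, str]]:
    """Return (baseline index, tool name) for steps dropped from the run."""
    old = [step["name"] for step in baseline]
    new = [step["name"] for step in current]
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    removed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            removed.extend((i, old[i]) for i in range(i1, i2))
    return removed


baseline = [{"name": "search_kb"}, {"name": "verify"}, {"name": "answer"}]
current = [{"name": "search_kb"}, {"name": "answer"}]

print(removed_steps(baseline, current))   # [(1, 'verify')]
print(removed_steps(baseline, baseline))  # []
```

Aligning the two sequences, rather than comparing position by position, is what lets the report say that `verify` was removed instead of that every later step changed.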

Why not just use observability / eval tools?

Tool type	Examples	What it does	The gap Brooder fills
Observability	Langfuse, Laminar, Phoenix	Trace/monitor after it runs	Doesn't gate before you ship
Eval frameworks	DeepEval, Braintrust, Ragas	Score against hand-written datasets	Requires eval authoring nobody maintains
Brooder	—	Record real runs → behavioral diff on every change → CI gate	Zero eval-writing, catches model-migration regressions

Gate your PRs (GitHub Action)

Drop Brooder into CI and it re-runs your agent on every pull request, comments the behavioral diff, and fails the check when behavior regresses. Copy examples/github-action.yml to .github/workflows/brooder.yml:

on: pull_request              # re-run the agent on every pull request

permissions:
  contents: read
  pull-requests: write        # so it can comment the diff

jobs:
  agent-snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: agentbrooder/brooder@v1
        with:
          script: tests/agent_snapshot.py

The comment is upserted (updated in place rather than re-posted on every push) and uses the same layout as the --format markdown output described in the next section.
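One common way to implement an upserted PR comment is to tag the comment body with a hidden marker and look for it on later runs. The sketch below shows only the decision logic on plain dictionaries. `MARKER` and `plan_upsert` are hypothetical, and the source does not say whether Brooder's action works this way.

```python
from typing import Dict, List, Optional, Tuple

MARKER = "<!-- brooder-report -->"  # hypothetical marker, not Brooder's actual one


def plan_upsert(comments: List[Dict], report: str) -> Tuple[str, Optional[int], str]:
    """Decide whether to edit an existing report comment or create a new one."""
    body = f"{MARKER}\n{report}"
    for comment in comments:
        if MARKER in comment["body"]:
            return ("update", comment["id"], body)
    return ("create", None, body)


existing = [
    {"id": 101, "body": "LGTM"},
    {"id": 102, "body": f"{MARKER}\n1 of 3 cases regressed"},
]
print(plan_upsert(existing, "0 of 3 cases regressed")[:2])      # ('update', 102)
print(plan_upsert(existing[:1], "0 of 3 cases regressed")[:2])  # ('create', None)
```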

Machine-readable output (`--json` / OTLP)

run, ci, and diff take --format table|json|markdown (--json is a shortcut). Exit codes are unchanged, so you can gate and parse:

brooder run agent.py --json | jq '.summary'
# { "total": 3, "passed": 2, "regressed": 1, "flaky": 0, "regressions": 1, "mean_stability": 80 }
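If you would rather consume the summary from Python than from `jq`, a sketch along these lines works. The JSON literal is the sample summary shown above; the `should_block` helper and its `min_stability` threshold are hypothetical additions, not Brooder options.

```python
import json

# Sample summary copied from the jq output above.
raw = ('{ "total": 3, "passed": 2, "regressed": 1, "flaky": 0, '
       '"regressions": 1, "mean_stability": 80 }')
summary = json.loads(raw)


def should_block(summary: dict, min_stability: int = 0) -> bool:
    """Block when any case regressed or mean stability is below a floor."""
    return summary["regressed"] > 0 or summary["mean_stability"] < min_stability


print(f'{summary["regressed"]} of {summary["total"]} cases regressed')  # 1 of 3 cases regressed
print(should_block(summary))  # True
```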

For dashboards, point Brooder at any OTLP endpoint and each run emits a snapshot of gauges (brooder.cases.*, brooder.stability.mean) — one exporter that reaches Datadog, Grafana, Honeycomb, and CloudWatch:

pip install 'brooder[otel]'
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/metrics   # or metrics.otlp_endpoint in brooder.yaml
brooder ci agent.py

What it checks

Structural diff: the sequence of tool calls, their arguments, and the final output.
Semantic diff: a pluggable judge (judge: exact | llm) so equivalent wording isn't a regression.
Flakiness: brooder run --runs N runs each case N times (for example, --runs 3) and flags non-determinism (FLAKY).

Each case gets a verdict — PASS / REGRESSED / NEW / FLAKY — and a stability score.
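How the four verdicts could fall out of repeated runs and a pluggable judge is sketched below. The rules here are an assumption made for illustration: runs that disagree with each other are FLAKY, otherwise the first run is compared to the baseline. `exact_judge`, `normalized_judge` (a cheap stand-in for a semantic judge), and `verdict` are hypothetical, and the stability score is left out because its formula is not documented here.

```python
from typing import Callable, List, Optional

Judge = Callable[[str, str], bool]


def exact_judge(expected: str, actual: str) -> bool:
    return expected == actual


def normalized_judge(expected: str, actual: str) -> bool:
    """Ignore case and whitespace differences."""
    def norm(s: str) -> str:
        return " ".join(s.split()).casefold()
    return norm(expected) == norm(actual)


def verdict(baseline: Optional[str], runs: List[str], judge: Judge = exact_judge) -> str:
    if not runs:
        raise ValueError("at least one run is required")
    if baseline is None:
        return "NEW"        # nothing recorded yet for this case
    if not all(judge(runs[0], r) for r in runs[1:]):
        return "FLAKY"      # the runs disagree with each other
    return "PASS" if judge(baseline, runs[0]) else "REGRESSED"


golden = "I've started your refund."
print(verdict(None, [golden]))                                       # NEW
print(verdict(golden, [golden] * 3))                                 # PASS
print(verdict(golden, ["Refunds are not supported."] * 3))           # REGRESSED
print(verdict(golden, [golden, "Refunds are not supported."]))       # FLAKY
print(verdict(golden, ["i've  started your refund."], normalized_judge))  # PASS
```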

Roadmap

See ROADMAP.md for what's shipped and what's planned.

Contributing

See CONTRIBUTING.md. Issues and PRs welcome — this is being built in public.

License

Apache-2.0.

Release history

This version

0.1.0

Jul 2, 2026

Download files

Source Distribution

brooder-0.1.0.tar.gz (90.5 kB)

Uploaded Jul 2, 2026 Source

Built Distribution

brooder-0.1.0-py3-none-any.whl (56.0 kB)

Uploaded Jul 2, 2026 Python 3

File details

Details for the file brooder-0.1.0.tar.gz.

File metadata

Download URL: brooder-0.1.0.tar.gz
Upload date: Jul 2, 2026
Size: 90.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for brooder-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`af601973c424f65df19c0a0031001a7c3ba68dc4092c0f76d360fbb1541bff68`
MD5	`3ebd4c61e3b61ae04288b88d76ed32ed`
BLAKE2b-256	`9d96081d80bbb363c95bba15113ce705eb728a385565a36ce94709918adfc11e`

Provenance

The following attestation bundles were made for brooder-0.1.0.tar.gz:

Publisher: release.yml on agentbrooder/brooder

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: brooder-0.1.0.tar.gz
- Subject digest: af601973c424f65df19c0a0031001a7c3ba68dc4092c0f76d360fbb1541bff68
- Sigstore transparency entry: 2047479999
- Sigstore integration time: Jul 2, 2026
Source repository:
- Permalink: agentbrooder/brooder@1ec17e58ebb4a12da2a295dc25c88684cb7c6bfd
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/agentbrooder
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1ec17e58ebb4a12da2a295dc25c88684cb7c6bfd
- Trigger Event: push

File details

Details for the file brooder-0.1.0-py3-none-any.whl.

File metadata

Download URL: brooder-0.1.0-py3-none-any.whl
Upload date: Jul 2, 2026
Size: 56.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for brooder-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`595e7629a421eb772b287a37706f149da315f7aebbc0a81c5556821242c8f70e`
MD5	`c64566f1b787c190492a932ad3371a14`
BLAKE2b-256	`9d130b4d443ef8bc92c9e1f907051af526455cc92532719bc88c60c49cc2a1c1`

Provenance

The following attestation bundles were made for brooder-0.1.0-py3-none-any.whl:

Publisher: release.yml on agentbrooder/brooder

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: brooder-0.1.0-py3-none-any.whl
- Subject digest: 595e7629a421eb772b287a37706f149da315f7aebbc0a81c5556821242c8f70e
- Sigstore transparency entry: 2047480006
- Sigstore integration time: Jul 2, 2026
Source repository:
- Permalink: agentbrooder/brooder@1ec17e58ebb4a12da2a295dc25c88684cb7c6bfd
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/agentbrooder
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1ec17e58ebb4a12da2a295dc25c88684cb7c6bfd
- Trigger Event: push
