Stress-test agents. Capture production. Replay incidents on demand.
Tool Pouch
Tool Pouch is the reliability layer for AI agents. It catches silent failures
pre-deploy with pouch scan, captures every production request with
pouch.wrap_anthropic, and replays any captured trace under chaos so you
can answer "would this incident reproduce?" in one command — before
you ship a fix.
pip install tool-pouch
import tool_pouch as pouch
from anthropic import Anthropic
# Wrap once. Every messages.create from here on is captured.
client = pouch.wrap_anthropic(Anthropic())
# Pre-deploy
pouch init && pouch scan --quick
# In production, after the wrap()
pouch traces --since 1h --failed # what's blowing up?
pouch trace --request-id req-abc # one specific request
pouch replay <trace_id> --repeat 100 # would it reproduce?
Installed as pip install tool-pouch, imported as import tool_pouch as pouch, and run as pouch (the long form tool-pouch also works).
What Tool Pouch is for
Three problems, one toolkit:
| Layer | Command | What it answers |
|---|---|---|
| Pre-deploy | pouch scan | "What does my agent do when its tools break?" |
| Production | pouch.wrap_anthropic | "What did my agent actually receive and emit?" |
| Incident response | pouch replay | "Would this 3am incident reproduce?" |
You can adopt any one independently. They share the same data model (local SQLite by default; pluggable destinations for production), so captured traces become testable scenarios with no extra plumbing.
Install
pip install tool-pouch
For OpenAI or Ollama support:
pip install "tool-pouch[openai]" # OpenAI or any OpenAI-compatible endpoint (quotes keep zsh from globbing the brackets)
pip install "tool-pouch[ollama]" # Local Ollama
LLM provider
Tool Pouch uses an LLM to classify failures (hallucinated vs handled,
silent_wrong, etc.) and suggest fixes. One API key is enough —
pouch init autodetects which one you have, and pouch scan mirrors
the agent provider to the judge by default.
export OPENAI_API_KEY=... # → provider = openai, judge = openai
# or
export ANTHROPIC_API_KEY=... # → provider = anthropic, judge = anthropic
Override the judge for a single run:
pouch scan --judge ollama # local, fully offline
pouch run my_agent.py --judge openai
If the judge can't reach the LLM (no network, model down), Tool Pouch still runs — crashes, timeouts, and loops are detected without it. Only the nuanced "did this hallucinate?" classification needs the judge.
Supported: Anthropic, OpenAI, Ollama (local, fully offline).
Pick your path
Four integration paths, each taking roughly five to ten minutes. Pick whichever matches your existing setup.
| Use this if... | Path | Jump to |
|---|---|---|
| Tools are plain .py functions you control | A. Decorator | @pouch.tool + pouch scan |
| You already use Anthropic / OpenAI tool calling | B. Adapter | test_anthropic / test_openai |
| LangGraph, MCP, or your own loop | C. Custom orchestration | agent_fn + pouch run |
| You want production capture + replay | D. wrap() | wrap_anthropic / wrap_openai |
To see what success looks like on any of these paths, see "What the output looks like" below.
Path A — Decorator (the simplest)
~5 min. Use this when tools are plain Python functions in your own files.
pouch init # autodetects tools/, provider, model
pouch scan --quick # ~15s; runs the highest-signal scenarios first
Tag the functions you want tested:
# tools/web.py
import requests
from tool_pouch import tool

@tool
def search(q: str) -> dict:
    """Search the web for q."""
    return search_api(q)  # search_api: placeholder for your own search client

@tool
def fetch(url: str) -> dict:
    """Fetch the URL and return content."""
    return requests.get(url).json()
pouch init finds your tools folder, picks the right provider based on
your API key, and writes .tool_pouch.toml. The judge defaults to the same
provider as the agent — one API key is enough.
--quick mode runs one input across the four highest-signal failure
scenarios, designed for fix → re-run → verify cycles. Drop the flag for
the full battery (12 scenarios × N inputs).
Path B — Anthropic / OpenAI adapter
~5 min. Use this when you already have a working agent on Anthropic or OpenAI tool calling.
Schemas are derived from each function's signature and docstring — no separate spec file, no rewrite.
import tool_pouch as pouch
from anthropic import Anthropic

def search(q: str) -> dict:
    """Search the web for q."""
    return {"results": [...]}

pouch.test_anthropic(
    client=Anthropic(),
    model="claude-opus-4-7",
    tools=[search],
    test_inputs=["best pizza in NYC"],
)
OpenAI is identical:
from openai import OpenAI

pouch.test_openai(
    client=OpenAI(),
    model="gpt-4o",
    tools=[search],
    test_inputs=["best pizza in NYC"],
)
The adapter drives the model loop, dispatches tool calls through Tool Pouch's
failure-injection proxy, and returns a list of run_ids — same coverage
as Path A, none of the boilerplate.
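To make "schemas are derived from each function's signature and docstring" concrete, here is a minimal standard-library sketch of that idea. It is not Tool Pouch's own introspection code: the helper name sketch_tool_schema, the type mapping, and the output layout are all assumptions chosen for illustration.

```python
import inspect

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {
    str: "string",
    int: "integer",
    float: "number",
    bool: "boolean",
    dict: "object",
    list: "array",
}


def sketch_tool_schema(fn):
    """Build a JSON-Schema-style tool description from a plain function."""
    properties = {}
    required = []
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        # Parameters without a default are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def search(q: str, limit: int = 5) -> dict:
    """Search the web for q."""
    return {"results": []}


schema = sketch_tool_schema(search)
```

Here schema["input_schema"]["required"] contains only q, because limit has a default. A real implementation would also need to handle Optional, nested types, and string annotations.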
Path C — Custom orchestration
~10 min. Use this when you're not on OpenAI / Anthropic directly — LangGraph, Pydantic-AI, MCP, or your own loop.
Define four exports in a Python file:
# my_agent.py
async def agent_fn(user_input, tool_caller):
    # Use tool_caller(name, args) to call your tools.
    result = await tool_caller("search", {"q": user_input})
    return {"output": "...", "tool_calls": [...]}

def real_tool_fn(name, args):
    if name == "search":
        return search_api(args["q"])
    ...

tools = ["search", "fetch"] # tools to inject failures into
test_inputs = ["best pizza in NYC"] # what to ask your agent
Run it:
pouch run my_agent.py
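Before handing the file to pouch run, it can help to exercise the agent_fn contract locally with a hand-rolled tool_caller. The sketch below assumes only what the snippet above shows: tool_caller is awaitable and takes (name, args). The make_tool_caller helper, the injected TimeoutError, and the shape of each tool_calls entry are hypothetical and do not represent Tool Pouch's failure-injection proxy.

```python
import asyncio


def real_tool_fn(name, args):
    # Stand-in for real tools; returns canned data for this sketch.
    if name == "search":
        return {"results": [f"result for {args['q']}"]}
    raise KeyError(name)


def make_tool_caller(failing=()):
    """Return an async tool_caller that times out for tools listed in `failing`."""
    async def tool_caller(name, args):
        if name in failing:
            raise TimeoutError(f"injected timeout in {name}")
        return real_tool_fn(name, args)
    return tool_caller


async def agent_fn(user_input, tool_caller):
    calls = []
    try:
        result = await tool_caller("search", {"q": user_input})
        calls.append({"name": "search", "ok": True})
        output = f"Found {len(result['results'])} result(s)."
    except TimeoutError:
        # Handle the failure explicitly instead of inventing an answer.
        calls.append({"name": "search", "ok": False})
        output = "Search is unavailable right now."
    return {"output": output, "tool_calls": calls}


healthy = asyncio.run(agent_fn("best pizza in NYC", make_tool_caller()))
degraded = asyncio.run(
    agent_fn("best pizza in NYC", make_tool_caller(failing={"search"}))
)
```

An agent that reports the outage, as this one does, is the kind of behaviour a "handled" verdict describes; one that answers confidently without tool data would be the hallucination case.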
Path D — Production wrap + replay
~5 min. Use this when you want every production request captured and any of them replayable on demand.
One line wraps your client:
import tool_pouch as pouch
from anthropic import Anthropic
client = pouch.wrap_anthropic(Anthropic(), agent_name="support_bot")
# That's it. Use client.messages.create exactly as before.
OpenAI is identical:
client = pouch.wrap_openai(OpenAI(), agent_name="support_bot")
Async clients work too (AsyncAnthropic, AsyncOpenAI). Streaming is
fully supported — chunks pass through unchanged, and the trace is
committed when the stream exhausts.
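The streaming behaviour described above (chunks pass through unchanged, trace committed on exhaustion) can be pictured as a generator wrapper. This is a sketch of the pattern only; capture_stream and on_complete are hypothetical names, not part of the Tool Pouch API.

```python
def capture_stream(chunks, on_complete):
    """Yield chunks unchanged; call on_complete(collected) once the stream is exhausted."""
    collected = []
    for chunk in chunks:
        collected.append(chunk)
        yield chunk
    # Reached only after the consumer has read the final chunk.
    on_complete(collected)


committed = []
stream = capture_stream(iter(["Hel", "lo", "!"]), committed.append)
seen = list(stream)
```

In this sketch a consumer that abandons the stream early never triggers on_complete. How Tool Pouch treats partially consumed streams is not covered here.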
Querying captured traces
pouch traces # everything captured
pouch traces --since 1h --failed # last hour, failures only
pouch traces --request-id req-abc # by your request_id
pouch trace <trace_id> # full detail of one capture
request_id flows through to traces — pass a string or a callable that
extracts it from the request kwargs:
client = pouch.wrap_anthropic(
    Anthropic(),
    request_id=lambda **kw: kw.get("metadata", {}).get("user_id", "anon"),
)
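The "string or callable" contract can be reasoned about with a small resolver like the one below. It is an illustrative sketch: resolve_request_id is a hypothetical helper, and the assumption that the callable receives the request kwargs comes from the lambda signature shown above.

```python
def resolve_request_id(request_id, **request_kwargs):
    """Return request_id as-is if it is a string; otherwise call it with the request kwargs."""
    if callable(request_id):
        return request_id(**request_kwargs)
    return request_id


def extract_user_id(**kw):
    # `or {}` also covers an explicit metadata=None.
    return (kw.get("metadata") or {}).get("user_id", "anon")


with_user = resolve_request_id(extract_user_id, model="m", metadata={"user_id": "u-42"})
without_metadata = resolve_request_id(extract_user_id, model="m")
null_metadata = resolve_request_id(extract_user_id, model="m", metadata=None)
fixed = resolve_request_id("req-abc", model="m")
```

Note that kw.get("metadata", {}) returns None when the caller passes metadata=None explicitly, so the extractor here uses (kw.get("metadata") or {}) to stay safe in that case.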
Replaying
# Walk through what actually happened — no API calls.
pouch replay <trace_id> --frozen
# Re-call your model; stub tools with captured outputs.
pouch replay <trace_id> --frozen-tools
# Default: chaos. Real model, real tools, injected scenarios.
pouch replay <trace_id>
# 100 chaos replays → "would this incident reproduce?"
pouch replay <trace_id> --repeat 100
For chaos / frozen-tools modes, Tool Pouch needs your agent_fn and (for
chaos) your real_tool_fn. Set agent in .tool_pouch.toml or pass
--agent-file my_agent.py (same shape as Path C).
--repeat N aggregates verdicts as percentages per (tool, scenario)
cell — useful for surfacing flaky failure rates.
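As a rough picture of what "verdicts as percentages per (tool, scenario) cell" means, the sketch below aggregates a list of replay results with the standard library. The row format and the aggregate helper are assumptions for illustration; the real report layout may differ.

```python
from collections import Counter, defaultdict


def aggregate(results):
    """Turn (tool, scenario, verdict) rows into verdict percentages per (tool, scenario) cell."""
    cells = defaultdict(Counter)
    for tool, scenario, verdict in results:
        cells[(tool, scenario)][verdict] += 1
    return {
        cell: {verdict: 100 * n / sum(counts.values()) for verdict, n in counts.items()}
        for cell, counts in cells.items()
    }


# Hypothetical replay results: four runs of one cell, two of another.
rows = [
    ("search", "timeout", "handled"),
    ("search", "timeout", "handled"),
    ("search", "timeout", "handled"),
    ("search", "timeout", "hallucinated"),
    ("fetch", "malformed_json", "crashed"),
    ("fetch", "malformed_json", "crashed"),
]
summary = aggregate(rows)
```

A cell that is mostly "handled" with an occasional "hallucinated" is exactly the flaky pattern that a single replay would be likely to miss.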
PII redaction
The default redactor scrubs emails, phones, SSNs, credit cards, IPs, and common API keys at capture time:
client = pouch.wrap_anthropic(Anthropic()) # built-in redaction enabled
Extend the regex pack:
client = pouch.wrap_anthropic(
    Anthropic(),
    redact=pouch.redact.builtin(extra_patterns=[
        r"acct_\d{6}",
        r"customer_token=[A-Za-z0-9]+",
    ]),
)
Disable redaction explicitly (if you're handling PII upstream):
client = pouch.wrap_anthropic(Anthropic(), redact=None)
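To check that a custom pattern matches what you expect before shipping it, you can try it against sample text with plain re. The sketch below is a stand-in, not the built-in redactor: the two "built-in" patterns are simplified, and make_redactor and the [REDACTED] placeholder are hypothetical.

```python
import re

# Two simplified stand-ins for built-in patterns, for illustration only.
BUILTIN = [
    r"[\w.+-]+@[\w-]+\.[\w.-]+",  # email address
    r"\b\d{3}-\d{2}-\d{4}\b",     # US SSN
]


def make_redactor(extra_patterns=()):
    """Compile built-in plus extra patterns into a single scrub function."""
    combined = re.compile("|".join(f"(?:{p})" for p in [*BUILTIN, *extra_patterns]))
    return lambda text: combined.sub("[REDACTED]", text)


redact = make_redactor(
    extra_patterns=[r"acct_\d{6}", r"customer_token=[A-Za-z0-9]+"]
)
clean = redact("Mail jane@example.com about acct_123456, customer_token=abc123XYZ")
ssn_clean = redact("ssn 123-45-6789")
```

Wrapping each pattern in a non-capturing group keeps alternations inside one pattern from leaking into its neighbours when they are joined.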
Destinations
Three destinations ship in OSS. Combine them — capture once, pipe anywhere:
client = pouch.wrap_anthropic(
    Anthropic(),
    destinations=[
        pouch.LocalStore(), # SQLite, dev/staging
        pouch.JSONLogger(), # NDJSON to stderr
        pouch.HTTPSink(url="https://your.api/traces"),
    ],
)
| Destination | Use it for |
|---|---|
| LocalStore | Dev / staging. SQLite at ~/.tool_pouch/tool_pouch.db. |
| JSONLogger | Production. Pipe stderr into Datadog, Honeycomb, Loki, CloudWatch. |
| HTTPSink | In-house observability backends. Batched POST. |
A future CloudStore will become a fourth destination after Tool Pouch Cloud
ships. The wrap API stays unchanged.
Disabling capture
Set TOOL_POUCH_DISABLE_WRAP=1 and every wrap_anthropic / wrap_openai
call becomes a no-op passthrough. Useful in CI and unit tests.
What the wrap costs you
Sub-millisecond p99 enqueue overhead on the request thread. Serialization, redaction, truncation, and destination IO all run on a background writer thread. Multi-process safe (pre-fork models like gunicorn / uvicorn workers). Per-trace size limits prevent runaway payloads. Fail-open at every destination — a misbehaving sink logs to stderr and never propagates.
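The general pattern behind these guarantees (enqueue on the request thread, do the work on a background thread, fail open per destination) looks roughly like the sketch below. It illustrates the pattern only and is not Tool Pouch's writer: the BackgroundWriter class and its methods are hypothetical, and fork safety, truncation, and batching are left out.

```python
import queue
import sys
import threading


class BackgroundWriter:
    """Enqueue traces on the caller's thread; deliver them on a worker thread."""

    def __init__(self, destinations):
        self._destinations = destinations
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, trace):
        # The only work done on the request thread: a non-blocking enqueue.
        self._queue.put_nowait(trace)

    def _run(self):
        while True:
            trace = self._queue.get()
            for dest in self._destinations:
                try:
                    dest(trace)
                except Exception as exc:
                    # Fail open: report the sink error and keep going.
                    print(f"destination failed: {exc!r}", file=sys.stderr)
            self._queue.task_done()

    def flush(self):
        # Block until every submitted trace has been processed.
        self._queue.join()


def broken_sink(trace):
    raise ConnectionError("sink unreachable")


stored = []
writer = BackgroundWriter([broken_sink, stored.append])
writer.submit({"request_id": "req-abc"})
writer.flush()
```

The broken sink logs one line to stderr, and the second destination still receives the trace, which is the fail-open behaviour in miniature.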
What the output looks like
============================================================
Agent Test Report (run abc12345)
============================================================
Total scenarios: 24
Failures: 14 (58%)
Breakdown:
❌ crashed: 8
❌ hallucinated: 4
❌ looped: 2
✓ handled: 10
For full trace of any failure: pouch show abc12345 --filter <type>
Exit code is 0 when all scenarios pass and 1 when any fail — works
in CI out of the box.
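The arithmetic behind the summary and the exit code can be sketched as follows. The set of failing verdicts and the summarize helper are assumptions drawn from the sample report above, not the library's own definitions.

```python
from collections import Counter

# Verdicts counted as failures in the sample report (assumed, not exhaustive).
FAILURE_VERDICTS = {"crashed", "hallucinated", "looped"}


def summarize(verdicts):
    """Return (total, failures, percent_failed, exit_code) for a list of verdicts."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    failures = sum(n for verdict, n in counts.items() if verdict in FAILURE_VERDICTS)
    percent = round(100 * failures / total) if total else 0
    return total, failures, percent, (1 if failures else 0)


# The verdict counts from the sample report above.
verdicts = ["crashed"] * 8 + ["hallucinated"] * 4 + ["looped"] * 2 + ["handled"] * 10
report = summarize(verdicts)
all_green = summarize(["handled"] * 3)
```

With the sample counts this gives 14 failures out of 24 (58%) and a non-zero exit code, matching the report.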
Drilling in & re-running
pouch show abc12345 --filter hallucinated # full trace of one type
pouch scan --scenarios timeout,malformed_json # re-run a slice
pouch run my_agent.py --tools search # one tool only
pouch runs --failed # history, failures only
Project config (.tool_pouch.toml)
[tool-pouch]
# For `pouch scan`
tools = "./my_app/tools/"
provider = "openai"
model = "gpt-4o"
test_inputs = ["best pizza in NYC"] # optional — autogenerated otherwise
# For `pouch run` and `pouch replay`
agent = "./my_agent.py"
# Common
parallel = 8
# scenarios = ["timeout", "malformed_json"] # optional filter
Fix bugs in your AI editor
After any run, get a markdown prompt designed for Cursor, Claude Code, Cline, Windsurf, or Aider:
pouch fix-prompt | pbcopy # latest run → clipboard
The format groups failures by source (control flow, prompt, integration) so your AI editor proposes clustered fixes instead of one-line patches.
Architecture
User-facing surface:
- tool.py: @pouch.tool decorator + module-level registry
- discover.py: walks a path, returns every @pouch.tool callable
- init.py: pouch init, autodetects tools/provider/model
- autogen.py: generates test prompts from tool docstrings
- adapters/: drop-in helpers for OpenAI and Anthropic tool calling
- _introspect.py: Python callables → JSON tool schemas
- fix_prompt.py: renders a past run as markdown for AI coding tools
Wrap / replay:
- wrap/proxy.py: wrap_anthropic / wrap_openai client interception
- wrap/writer.py: background writer thread, fail-open, fork-safe
- wrap/destinations.py: LocalStore, JSONLogger, HTTPSink
- wrap/limits.py: per-trace + per-tool-result size truncation
- redact.py: PII redaction pack (extensible)
- replay.py: build_replay_inputs(trace, mode=...) + verdict aggregation
- nudges.py: one-time CLI nudges (cloud upgrade hooks)
Engine:
- proxy.py: wraps tool calls during stress testing
- runner.py: runs (tool × scenario) in parallel; judge fan-out
- scenarios/static.py: built-in failures
- judges/llm_judge.py: classifies completed runs
- config.py: judge provider resolution
- store.py: versioned SQLite (WAL mode, multi-process safe)
- migrations/: versioned schema migrations
- report.py: summary + detailed trace view
Status & roadmap
0.1 ships pre-deploy stress testing, production capture, and replay.
Tool Pouch Cloud is the next layer: push captured traces from any
environment, search by request_id, replay across your team, retain
for compliance. Until it ships, the OSS path is already
production-ready via JSONLogger and HTTPSink.
Get notified at launch: toolpouch.dev.
License
Apache License 2.0. See LICENSE for the full text and NOTICE for required attribution.