Tool Pouch

Stress-test agents. Capture production. Replay incidents on demand.

Tool Pouch is the reliability layer for AI agents. It catches silent failures pre-deploy with pouch scan, captures every production request with pouch.wrap_anthropic, and replays any captured trace under chaos so you can answer "would this incident reproduce?" in one command — before you ship a fix.

pip install tool-pouch

import tool_pouch as pouch
from anthropic import Anthropic

# Wrap once. Every messages.create from here on is captured.
client = pouch.wrap_anthropic(Anthropic())

# Pre-deploy
pouch init && pouch scan --quick

# In production, after the wrap()
pouch traces --since 1h --failed       # what's blowing up?
pouch trace --request-id req-abc       # one specific request
pouch replay <trace_id> --repeat 100   # would it reproduce?

Install it with pip install tool-pouch, import it with import tool_pouch as pouch, and run the CLI as pouch (the long form tool-pouch also works).


What Tool Pouch is for

Three problems, one toolkit:

| Layer | Command | What it answers |
|---|---|---|
| Pre-deploy | pouch scan | "What does my agent do when its tools break?" |
| Production | pouch.wrap_anthropic | "What did my agent actually receive and emit?" |
| Incident response | pouch replay | "Would this 3am incident reproduce?" |

You can adopt any one independently. They share the same data model (local SQLite by default; pluggable destinations for production), so captured traces become testable scenarios with no extra plumbing.


Install

pip install tool-pouch

For OpenAI or Ollama support:

pip install "tool-pouch[openai]"   # OpenAI or any OpenAI-compatible endpoint
pip install "tool-pouch[ollama]"   # Local Ollama (quotes keep zsh from globbing the brackets)

LLM provider

Tool Pouch uses an LLM to classify failures (hallucinated vs handled, silent_wrong, etc.) and suggest fixes. One API key is enough: pouch init autodetects which one you have, and pouch scan uses the agent's provider as the judge by default.

export OPENAI_API_KEY=...        # → provider = openai, judge = openai
# or
export ANTHROPIC_API_KEY=...     # → provider = anthropic, judge = anthropic

Override the judge for a single run:

pouch scan --judge ollama        # local, fully offline
pouch run my_agent.py --judge openai

If the judge can't reach the LLM (no network, model down), Tool Pouch still runs — crashes, timeouts, and loops are detected without it. Only the nuanced "did this hallucinate?" classification needs the judge.
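The judge-free fallback described above can be pictured as a purely structural check. The sketch below is not Tool Pouch's implementation; the shape of the run dictionary, the classify_without_judge name, and the max_repeats threshold are all assumptions made for illustration.

```python
from collections import Counter

def classify_without_judge(run, max_repeats=3):
    """Return a verdict from structural signals alone, or None if a judge is needed."""
    if run.get("exception") is not None:
        return "crashed"
    if run.get("timed_out"):
        return "timeout"
    # Count identical (tool name, arguments) calls to spot a loop.
    calls = Counter(
        (c["name"], repr(sorted(c["args"].items())))
        for c in run.get("tool_calls", [])
    )
    if calls and max(calls.values()) > max_repeats:
        return "looped"
    return None  # nothing structural found; only an LLM judge can say more

crashed = classify_without_judge({"exception": "KeyError('q')"})
looped = classify_without_judge(
    {"tool_calls": [{"name": "search", "args": {"q": "pizza"}}] * 4}
)
unknown = classify_without_judge(
    {"tool_calls": [{"name": "search", "args": {"q": "pizza"}}]}
)
```

A run that returns None here is the kind of case where the "did this hallucinate?" question still needs the judge.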

Supported: Anthropic, OpenAI, Ollama (local, fully offline).


Pick your path

Four integration paths, each taking roughly five to ten minutes. Pick whichever matches your existing setup.

| Use this if... | Path | Jump to |
|---|---|---|
| Tools are plain .py functions you control | A. Decorator | @pouch.tool + pouch scan |
| You already use Anthropic / OpenAI tool calling | B. Adapter | test_anthropic / test_openai |
| LangGraph, MCP, or your own loop | C. Custom orchestration | agent_fn + pouch run |
| You want production capture + replay | D. wrap() | wrap_anthropic / wrap_openai |

Whichever path you pick, a successful run produces the report shown under "What the output looks like".


Path A — Decorator (the simplest)

~5 min. Use this when tools are plain Python functions in your own files.

pouch init             # autodetects tools/, provider, model
pouch scan --quick     # ~15s; runs the highest-signal scenarios first

Tag the functions you want tested:

# tools/web.py
import requests

from tool_pouch import tool

@tool
def search(q: str) -> dict:
    """Search the web for q."""
    return search_api(q)  # search_api: placeholder for your own search client

@tool
def fetch(url: str) -> dict:
    """Fetch the URL and return content."""
    return requests.get(url, timeout=10).json()

pouch init finds your tools folder, picks the right provider based on your API key, and writes .tool_pouch.toml. The judge defaults to the same provider as the agent — one API key is enough.

--quick mode runs one input across the four highest-signal failure scenarios, designed for fix → re-run → verify cycles. Drop the flag for the full battery (12 scenarios × N inputs).


Path B — Anthropic / OpenAI adapter

~5 min. Use this when you already have a working agent on Anthropic or OpenAI tool calling.

Schemas are derived from each function's signature and docstring — no separate spec file, no rewrite.

import tool_pouch as pouch
from anthropic import Anthropic

def search(q: str) -> dict:
    """Search the web for q."""
    return {"results": [...]}

pouch.test_anthropic(
    client=Anthropic(),
    model="claude-opus-4-7",
    tools=[search],
    test_inputs=["best pizza in NYC"],
)

OpenAI is identical:

from openai import OpenAI

pouch.test_openai(
    client=OpenAI(),
    model="gpt-4o",
    tools=[search],
    test_inputs=["best pizza in NYC"],
)

The adapter drives the model loop, dispatches tool calls through Tool Pouch's failure-injection proxy, and returns a list of run_ids — same coverage as Path A, none of the boilerplate.
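To make "schemas are derived from each function's signature and docstring" concrete, here is a minimal sketch of that kind of introspection using only the standard library. It is an illustration, not the code in _introspect.py: the schema_from_function name and the type mapping are assumptions, and the input_schema key follows the Anthropic tool format.

```python
import inspect
from typing import get_type_hints

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number",
               bool: "boolean", dict: "object", list: "array"}

def schema_from_function(fn):
    """Build a JSON-Schema-style tool description from a plain function."""
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated parameters fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {"type": "object",
                         "properties": properties,
                         "required": required},
    }

def search(q: str, limit: int = 5) -> dict:
    """Search the web for q."""
    return {"results": []}

schema = schema_from_function(search)
```

Parameters with defaults are left out of required, which is why a well-annotated signature plus a docstring is enough to describe a tool.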


Path C — Custom orchestration

~10 min. Use this when you're not on OpenAI / Anthropic directly — LangGraph, Pydantic-AI, MCP, or your own loop.

Define four exports in a Python file:

# my_agent.py

async def agent_fn(user_input, tool_caller):
    # Use tool_caller(name, args) to call your tools.
    result = await tool_caller("search", {"q": user_input})
    return {"output": "...", "tool_calls": [...]}

def real_tool_fn(name, args):
    if name == "search":
        return search_api(args["q"])
    ...

tools = ["search", "fetch"]            # tools to inject failures into
test_inputs = ["best pizza in NYC"]    # what to ask your agent

Run it:

pouch run my_agent.py
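The contract above works because every tool call goes through tool_caller, which gives the harness one place to inject failures. The following is a rough sketch of that idea, not Tool Pouch's proxy: make_tool_caller and its parameters are hypothetical, and only the scenario names timeout and malformed_json come from this README.

```python
import asyncio

def real_tool_fn(name, args):
    # Stand-in tool; a real agent would call its search backend here.
    if name == "search":
        return {"results": [f"result for {args['q']}"]}
    raise KeyError(name)

def make_tool_caller(real_tool_fn, failing_tool=None, scenario=None):
    """Return an async tool_caller that injects one scenario into one tool."""
    async def tool_caller(name, args):
        if name == failing_tool and scenario == "timeout":
            raise TimeoutError(f"{name} timed out (injected)")
        if name == failing_tool and scenario == "malformed_json":
            return '{"results": [unterminated'
        return real_tool_fn(name, args)
    return tool_caller

async def agent_fn(user_input, tool_caller):
    calls = []
    try:
        result = await tool_caller("search", {"q": user_input})
        calls.append({"name": "search", "ok": True})
        output = f"Found: {result}"
    except TimeoutError:
        calls.append({"name": "search", "ok": False})
        output = "Search is unavailable right now."
    return {"output": output, "tool_calls": calls}

healthy = asyncio.run(
    agent_fn("best pizza in NYC", make_tool_caller(real_tool_fn)))
broken = asyncio.run(
    agent_fn("best pizza in NYC",
             make_tool_caller(real_tool_fn, "search", "timeout")))
```

An agent that handles the injected timeout, as this one does, would count as handled; one that let the exception escape would count as crashed.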

Path D — Production wrap + replay

~5 min. Use this when you want every production request captured and any of them replayable on demand.

One line wraps your client:

import tool_pouch as pouch
from anthropic import Anthropic

client = pouch.wrap_anthropic(Anthropic(), agent_name="support_bot")
# That's it. Use client.messages.create exactly as before.

OpenAI is identical:

client = pouch.wrap_openai(OpenAI(), agent_name="support_bot")

Async clients work too (AsyncAnthropic, AsyncOpenAI). Streaming is fully supported — chunks pass through unchanged, and the trace is committed when the stream exhausts.
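The streaming behaviour described above (chunks pass through unchanged, trace committed on exhaustion) matches a common generator pattern. A minimal sketch, assuming a hypothetical capture_stream helper and a commit callback; the real wrapper's internals may differ.

```python
def capture_stream(chunks, commit):
    """Yield chunks unchanged; call commit(collected) once the stream ends."""
    collected = []
    for chunk in chunks:
        collected.append(chunk)
        yield chunk
    commit(collected)  # reached only after the consumer drains the stream

committed = []
stream = capture_stream(iter(["Hel", "lo", "!"]), committed.append)

first = next(stream)
committed_before_end = list(committed)  # nothing committed mid-stream
rest = list(stream)
```

One consequence of this design is that a consumer that abandons the stream early never triggers the commit in this sketch; a production wrapper would need to decide how to handle partial streams.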

Querying captured traces

pouch traces                            # everything captured
pouch traces --since 1h --failed        # last hour, failures only
pouch traces --request-id req-abc       # by your request_id
pouch trace <trace_id>                  # full detail of one capture

request_id flows through to traces — pass a string or a callable that extracts it from the request kwargs:

client = pouch.wrap_anthropic(
    Anthropic(),
    request_id=lambda **kw: kw.get("metadata", {}).get("user_id", "anon"),
)
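Accepting "a string or a callable" usually comes down to one branch at capture time. The sketch below shows that branch under the assumption that the callable receives the request kwargs, as in the lambda above; resolve_request_id is a hypothetical name, not a Tool Pouch function.

```python
def resolve_request_id(request_id, **request_kwargs):
    """Return request_id as-is, or call it with the request kwargs."""
    if callable(request_id):
        return request_id(**request_kwargs)
    return request_id

def extract(**kw):
    # Same logic as the lambda in the example above.
    return kw.get("metadata", {}).get("user_id", "anon")

fixed = resolve_request_id("req-abc", model="some-model")
from_metadata = resolve_request_id(extract, metadata={"user_id": "u42"})
fallback = resolve_request_id(extract, model="some-model")
```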

Replaying

# Walk through what actually happened — no API calls.
pouch replay <trace_id> --frozen

# Re-call your model; stub tools with captured outputs.
pouch replay <trace_id> --frozen-tools

# Default: chaos. Real model, real tools, injected scenarios.
pouch replay <trace_id>

# 100 chaos replays → "would this incident reproduce?"
pouch replay <trace_id> --repeat 100

For chaos / frozen-tools modes, Tool Pouch needs your agent_fn and (for chaos) your real_tool_fn. Set agent in .tool_pouch.toml or pass --agent-file my_agent.py (same shape as Path C).

--repeat N aggregates verdicts as percentages per (tool, scenario) cell — useful for surfacing flaky failure rates.
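Aggregating verdicts per (tool, scenario) cell is a small counting exercise. This sketch assumes each replay yields (tool, scenario, verdict) tuples, which is an illustrative shape rather than Tool Pouch's actual data model.

```python
from collections import Counter, defaultdict

def aggregate(verdicts):
    """Turn (tool, scenario, verdict) tuples into percentages per cell."""
    cells = defaultdict(Counter)
    for tool, scenario, verdict in verdicts:
        cells[(tool, scenario)][verdict] += 1
    return {
        cell: {v: round(100 * n / sum(counts.values()), 1)
               for v, n in counts.items()}
        for cell, counts in cells.items()
    }

runs = ([("search", "timeout", "handled")] * 7
        + [("search", "timeout", "crashed")] * 3
        + [("fetch", "malformed_json", "hallucinated")] * 2)
rates = aggregate(runs)
```

Here the search/timeout cell comes out as 70% handled and 30% crashed, the kind of flaky rate a single replay would hide.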

PII redaction

The default redactor scrubs emails, phones, SSNs, credit cards, IPs, and common API keys at capture time:

client = pouch.wrap_anthropic(Anthropic())   # built-in redaction enabled

Extend the regex pack:

client = pouch.wrap_anthropic(
    Anthropic(),
    redact=pouch.redact.builtin(extra_patterns=[
        r"acct_\d{6}",
        r"customer_token=[A-Za-z0-9]+",
    ]),
)

Disable redaction explicitly (if you're handling PII upstream):

client = pouch.wrap_anthropic(Anthropic(), redact=None)
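For intuition about what a regex pack with extra_patterns does, here is a self-contained sketch. The two built-in patterns are deliberately simplistic stand-ins, not the patterns Tool Pouch ships, and make_redactor is a hypothetical helper.

```python
import re

# Illustrative patterns only; a production redactor needs more careful rules.
BUILTIN = [
    r"[\w.+-]+@[\w-]+\.[\w.-]+",   # email addresses
    r"\b\d{3}-\d{2}-\d{4}\b",      # US SSN format
]

def make_redactor(extra_patterns=()):
    """Combine built-in and extra patterns into one substitution pass."""
    combined = re.compile(
        "|".join(f"(?:{p})" for p in [*BUILTIN, *extra_patterns]))
    return lambda text: combined.sub("[REDACTED]", text)

redact = make_redactor(extra_patterns=[r"acct_\d{6}"])
clean = redact("Mail jo@example.com about acct_123456, SSN 123-45-6789.")
```

Running redaction at capture time, as the README describes, means the raw values never reach any destination.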

Destinations

Three destinations ship in OSS. Combine them — capture once, pipe anywhere:

client = pouch.wrap_anthropic(
    Anthropic(),
    destinations=[
        pouch.LocalStore(),                       # SQLite, dev/staging
        pouch.JSONLogger(),                       # NDJSON to stderr
        pouch.HTTPSink(url="https://your.api/traces"),
    ],
)

| Destination | Use it for |
|---|---|
| LocalStore | Dev / staging. SQLite at ~/.tool_pouch/tool_pouch.db. |
| JSONLogger | Production. Pipe stderr into Datadog, Honeycomb, Loki, CloudWatch. |
| HTTPSink | In-house observability backends. Batched POST. |

A future CloudStore will become a fourth destination after Tool Pouch Cloud ships. The wrap API stays unchanged.

Disabling capture

Set TOOL_POUCH_DISABLE_WRAP=1 and every wrap_anthropic / wrap_openai call becomes a no-op passthrough. Useful in CI and unit tests.

What the wrap costs you

Sub-millisecond p99 enqueue overhead on the request thread. Serialization, redaction, truncation, and destination IO all run on a background writer thread. Multi-process safe (pre-fork models like gunicorn / uvicorn workers). Per-trace size limits prevent runaway payloads. Fail-open at every destination — a misbehaving sink logs to stderr and never propagates.
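The cost profile above follows from a standard pattern: the request thread only enqueues, and a background thread does the slow work and swallows destination errors. A simplified sketch of that pattern, assuming destinations are plain callables; BackgroundWriter here is illustrative and omits the fork safety and size limits the real writer is described as having.

```python
import queue
import sys
import threading

class BackgroundWriter:
    """Hand traces to a worker thread so callers only pay for a queue put."""

    def __init__(self, destinations, maxsize=1000):
        self._queue = queue.Queue(maxsize=maxsize)
        self._destinations = destinations
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def enqueue(self, trace):
        try:
            self._queue.put_nowait(trace)  # never block the request thread
        except queue.Full:
            print("trace dropped: queue full", file=sys.stderr)

    def _drain(self):
        while True:
            trace = self._queue.get()
            if trace is None:  # sentinel: shut down
                break
            for dest in self._destinations:
                try:
                    dest(trace)
                except Exception as exc:  # fail open: log and keep going
                    print(f"destination failed: {exc}", file=sys.stderr)

    def close(self):
        self._queue.put(None)
        self._thread.join()

stored = []

def broken_sink(trace):
    raise ConnectionError("sink unreachable")

writer = BackgroundWriter([broken_sink, stored.append])
writer.enqueue({"request_id": "req-abc"})
writer.close()
```

The broken sink logs to stderr while the second destination still receives the trace, which is the fail-open behaviour in miniature.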


What the output looks like

============================================================
Agent Test Report (run abc12345)
============================================================
Total scenarios: 24
Failures: 14 (58%)

Breakdown:
  ❌ crashed: 8
  ❌ hallucinated: 4
  ❌ looped: 2
  ✓ handled: 10

For full trace of any failure: pouch show abc12345 --filter <type>

Exit code is 0 when all scenarios pass and 1 when any fail — works in CI out of the box.
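The figures in the sample report and the exit-code rule can be checked with a few lines of arithmetic. A minimal sketch, assuming a breakdown dictionary keyed by verdict; the summarize helper is hypothetical and not part of Tool Pouch.

```python
def summarize(breakdown):
    """Derive totals, failure percentage, and a CI exit code from verdict counts."""
    total = sum(breakdown.values())
    failures = total - breakdown.get("handled", 0)
    return {
        "total": total,
        "failures": failures,
        "percent": round(100 * failures / total),
        "exit_code": 1 if failures else 0,
    }

# Counts taken from the sample report above.
summary = summarize({"crashed": 8, "hallucinated": 4, "looped": 2, "handled": 10})
all_green = summarize({"handled": 24})
```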


Drilling in & re-running

pouch show abc12345 --filter hallucinated      # full trace of one type
pouch scan --scenarios timeout,malformed_json  # re-run a slice
pouch run my_agent.py --tools search           # one tool only
pouch runs --failed                            # history, failures only

Project config (.tool_pouch.toml)

[tool-pouch]
# For `pouch scan`
tools = "./my_app/tools/"
provider = "openai"
model = "gpt-4o"
test_inputs = ["best pizza in NYC"]   # optional — autogenerated otherwise

# For `pouch run` and `pouch replay`
agent = "./my_agent.py"

# Common
parallel = 8
# scenarios = ["timeout", "malformed_json"]    # optional filter

Fix bugs in your AI editor

After any run, get a markdown prompt designed for Cursor, Claude Code, Cline, Windsurf, or Aider:

pouch fix-prompt | pbcopy             # latest run → clipboard

The format groups failures by source (control flow, prompt, integration) so your AI editor proposes clustered fixes instead of one-line patches.
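Grouping by source before rendering is what lets an editor see related failures together. The sketch below illustrates the idea only; the failure dictionary keys, the heading text, and render_fix_prompt are assumptions, and the real fix-prompt output format may differ.

```python
from collections import defaultdict

def render_fix_prompt(failures):
    """Render failures as markdown, one section per failure source."""
    groups = defaultdict(list)
    for failure in failures:
        groups[failure["source"]].append(failure)
    lines = ["# Agent failures to fix", ""]
    for source in sorted(groups):
        lines.append(f"## {source} ({len(groups[source])})")
        for f in groups[source]:
            lines.append(f"- {f['tool']} / {f['scenario']}: {f['verdict']}")
        lines.append("")
    return "\n".join(lines)

prompt = render_fix_prompt([
    {"source": "prompt", "tool": "search",
     "scenario": "malformed_json", "verdict": "hallucinated"},
    {"source": "integration", "tool": "fetch",
     "scenario": "timeout", "verdict": "crashed"},
    {"source": "integration", "tool": "search",
     "scenario": "timeout", "verdict": "crashed"},
])
```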


Architecture

User-facing surface:

  • tool.py — @pouch.tool decorator + module-level registry
  • discover.py — walks a path, returns every @pouch.tool callable
  • init.py — pouch init, autodetects tools/provider/model
  • autogen.py — generates test prompts from tool docstrings
  • adapters/ — drop-in helpers for OpenAI and Anthropic tool calling
  • _introspect.py — Python callables → JSON tool schemas
  • fix_prompt.py — renders a past run as markdown for AI coding tools

Wrap / replay:

  • wrap/proxy.py — wrap_anthropic / wrap_openai client interception
  • wrap/writer.py — background writer thread, fail-open, fork-safe
  • wrap/destinations.py — LocalStore, JSONLogger, HTTPSink
  • wrap/limits.py — per-trace + per-tool-result size truncation
  • redact.py — PII redaction pack (extensible)
  • replay.py — build_replay_inputs(trace, mode=...) + verdict aggregation
  • nudges.py — one-time CLI nudges (cloud upgrade hooks)

Engine:

  • proxy.py — wraps tool calls during stress testing
  • runner.py — runs (tool × scenario) in parallel; judge fan-out
  • scenarios/static.py — built-in failures
  • judges/llm_judge.py — classifies completed runs
  • config.py — judge provider resolution
  • store.py — versioned SQLite (WAL mode, multi-process safe)
  • migrations/ — versioned schema migrations
  • report.py — summary + detailed trace view

Status & roadmap

0.1 ships pre-deploy stress testing, production capture, and replay. Tool Pouch Cloud is the next layer: push captured traces from any environment, search by request_id, replay across your team, retain for compliance. Until it ships, the OSS path is already production-ready via JSONLogger and HTTPSink.

Get notified at launch: toolpouch.dev.


License

Apache License 2.0. See LICENSE for the full text and NOTICE for required attribution.
