🛡️ ToolGuard
Reliability testing for AI agent tool chains.
Catch cascading failures before production. Make agent tool calling as dependable as unit tests made software.
🔧 What ToolGuard Actually Solves
Right now, developers hesitate to deploy AI agents because the agents are fundamentally unstable: they crash.
An AI agent has two layers:
- Layer 1: Intelligence (evals, reasoning, accurate answers)
- Layer 2: Execution (tool calls, chaining, JSON payloads, APIs)
ToolGuard does not test Layer 1. We do not care if your AI is "smart" or makes good decisions. That is what eval frameworks are for.
ToolGuard stress-tests Layer 2. We solve the problem of agents crashing at 3 AM because the LLM hallucinated a JSON key, passed a string instead of an int, or an external API timed out.
"We don't make AI smarter. We make AI systems not break."
🚀 Zero Config: Try It in 60 Seconds
pip install py-toolguard
toolguard run my_agent.py
That's it. ToolGuard auto-discovers your tools, fuzzes them with hallucination attacks (nulls, type mismatches, missing fields), and prints a reliability report. Zero config needed.
🔍 Auto-discovered 3 tools from my_agent.py
   • fetch_price (2 params)
   • calculate_position (3 params)
   • generate_alert (2 params)

🧪 Running 42 fuzz tests...

╔══════════════════════════════════════════════════════════════╗
║ Reliability Score: my_agent                                  ║
╠══════════════════════════════════════════════════════════════╣
║ Score:          64.3%                                        ║
║ Risk Level:     HIGH                                         ║
║ Deploy:         🚫 BLOCK                                     ║
╠══════════════════════════════════════════════════════════════╣
║ ⚠️  Top Risk: Null values propagating through chain          ║
║ ⚠️  Bottleneck Tools:                                        ║
║     └ fetch_price (50% success)                              ║
║     └ generate_alert (42% success)                           ║
╚══════════════════════════════════════════════════════════════╝

💡 fetch_price: Add null check for 'ticker' (the LLM hallucinated None)
💡 generate_alert: Field 'severity' expects int, got str from upstream tool
Or with Python:
from toolguard import create_tool, test_chain, score_chain

@create_tool(schema="auto")
def parse_csv(raw_csv: str) -> dict:
    lines = raw_csv.strip().split("\n")
    headers = lines[0].split(",")
    records = [dict(zip(headers, line.split(","))) for line in lines[1:]]
    return {"headers": headers, "records": records, "row_count": len(records)}

report = test_chain(
    [parse_csv],
    base_input={"raw_csv": "name,age\nAlice,30\nBob,35"},
    test_cases=["happy_path", "null_handling", "malformed_data", "type_mismatch", "missing_fields"],
)
score = score_chain(report)
print(score.summary())
🤔 How ToolGuard is Different
Most testing tools (LangSmith, Promptfoo) test your agent by sending prompts to a live LLM, which is slow, expensive, and non-deterministic.
ToolGuard does NOT use an LLM to run its tests.
When you decorate a function with @create_tool(schema="auto"), ToolGuard reads your Python type hints and automatically generates a Pydantic schema. It then uses that schema to know exactly which fields to break, which types to swap, and which values to null โ no manual configuration needed.
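Conceptually, the auto-schema step looks something like this. This is a minimal sketch of the idea, not ToolGuard's actual code (the real implementation lives in core/schema.py); `schema_from_hints` is a hypothetical helper:

# Sketch only: derive a Pydantic model from a function's type hints.
import typing
from pydantic import create_model

def schema_from_hints(fn):
    hints = typing.get_type_hints(fn)
    hints.pop("return", None)
    # Every parameter becomes a required field of the matching type.
    fields = {name: (tp, ...) for name, tp in hints.items()}
    return create_model(f"{fn.__name__}_Input", **fields)

InputModel = schema_from_hints(parse_csv)  # validates {"raw_csv": "..."} payloads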
It acts like a deterministic fuzzer for AI tool execution, programmatically injecting the exact kinds of bad data an LLM would accidentally generate in production (a conceptual sketch follows the list):

- Missing dictionary keys
- Null values propagating down the chain
- `str` instead of `int`
- Massive 10 MB payloads to stress your server
- Extra/unexpected fields in JSON
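Here is what that idea looks like in code. Illustrative only, not ToolGuard's internals; `fuzz_cases` is a hypothetical helper:

# Sketch only: type hints drive deterministic fuzz-case generation.
import typing

def fuzz_cases(fn, base_input: dict) -> list[dict]:
    hints = typing.get_type_hints(fn)
    hints.pop("return", None)
    cases = []
    for field in hints:
        cases.append({**base_input, field: None})                          # null injection
        cases.append({k: v for k, v in base_input.items() if k != field})  # missing field
        cases.append({**base_input, field: 12345})                         # type mismatch
    cases.append({**base_input, "surprise": "extra"})                      # unexpected field
    return cases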
ToolGuard doesn't test whether your AI is smart. It tests whether your Python code is bulletproof enough to survive when your AI does something stupid, running in about a second and costing $0 in API fees.
Features
🛡️ Layer-2 Security Firewall (V3.0)
ToolGuard includes an execution-layer security framework that protects production servers from common LLM exploits.
- Human-in-the-Loop Risk Tiers: Mark destructive tools with `@create_tool(risk_tier=2)`. ToolGuard intercepts these calls and streams a terminal approval prompt before execution, without blocking `asyncio` event loops, and degrades gracefully in headless daemon environments.
- Recursive Prompt Injection Fuzzing: The `test_chain` fuzzer automatically injects `[SYSTEM OVERRIDE]`-style payloads into your pipelines. A recursive depth-first parser scans custom object serialization, byte arrays, and `.casefold()` string mutations to reduce blind spots.
- Golden Traces (DAG Instrumentation): With two lines of code (`with TraceTracker() as trace:`), ToolGuard uses Python `contextvars` to build a chronologically ordered directed acyclic graph of every tool orchestrated by LangChain, CrewAI, Swarm, or AutoGen.
- Non-Deterministic Verification: Punishing an AI for self-correcting is an anti-pattern. Use `trace.assert_sequence(["auth", "refund"])` to enforce mandatory compliance checkpoints while leaving the LLM free to choose supplementary tools (a combined sketch follows this list).
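Putting those pieces together, a hedged sketch: the import path of `TraceTracker` and the `run_my_agent` orchestration call are assumptions, not verified API; check the package docs for exact signatures.

from toolguard import create_tool, TraceTracker  # TraceTracker import path assumed

@create_tool(schema="auto", risk_tier=2)  # destructive: requires human approval
def issue_refund(order_id: str) -> dict:
    ...

with TraceTracker() as trace:
    run_my_agent()  # hypothetical: your LangChain/CrewAI/etc. agent run goes here

# Require auth before refund; any other tools may run in between.
trace.assert_sequence(["auth", "refund"])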
🔒 Schema Validation
Automatic Pydantic input/output validation from type hints. No manual schemas needed.
@create_tool(schema="auto")
def fetch_price(ticker: str) -> dict:
...
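A guarded tool then rejects LLM-style mistakes before your function body runs. The exact exception class below is an assumption (ToolGuard ships its own exception hierarchy in core/errors.py), so this example catches broadly:

try:
    fetch_price(ticker=None)  # None instead of str: a classic hallucination
except Exception as exc:      # assumption: a ToolGuard/Pydantic validation error
    print(f"Blocked before execution: {exc}")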
🔗 Chain Testing
Test multi-tool chains against 8 edge-case categories: null handling, type mismatches, missing fields, malformed data, large payloads, and more.
report = test_chain(
    [fetch_price, calculate_position, generate_alert],
    base_input={"ticker": "AAPL"},
    test_cases=["happy_path", "null_handling", "type_mismatch"],
)
⚡ Async Support
Works transparently with both `def` and `async def` tools. No special flags needed.
import httpx

@create_tool(schema="auto")
async def fetch_from_api(url: str) -> dict:
    async with httpx.AsyncClient() as client:
        resp = await client.get(url)
        return resp.json()

# Same API: ToolGuard handles the async automatically
report = test_chain([fetch_from_api, process_data], assert_reliability=0.95)
📦 Immersive Live Dashboard
When testing locally, you don't have to stare at plain print logs. Pass --dashboard and ToolGuard launches a high-contrast, dark-mode terminal UI (built on Textual).
toolguard run my_agent.py --dashboard
It streams live, concurrent fuzzing results as they happen, calculates metrics in real time, and tracks exactly which functions crash under payload injection, all in a dedicated hacker-style "Mission Control" interface.
📊 Reliability Scoring
Quantified trust with risk levels and deployment gates.
import sys

score = score_chain(report)
if score.deploy_recommendation.value == "BLOCK":
    sys.exit(1)  # CI/CD gate
⏪ Local Crash Replay
When a remote tool crashes in production or tests, ToolGuard automatically dumps the structured JSON payload. You can instantly replay the exact crashing state locally to view the stack trace.
toolguard run my_agent.py --dump-failures
toolguard replay .toolguard/failures/fail_1774068587_0.json
🎯 Edge-Case Test Coverage
ToolGuard gives you pytest-style coverage metrics. Instead of arbitrary line coverage, it calculates exactly what percentage of the 8 known LLM hallucination categories (nulls, missing fields, type mismatches, etc.) your tests covered, and lists what is untested.
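The metric itself is simple set arithmetic. A sketch, with the caveat that three of the eight category names below are illustrative guesses rather than documented names:

# Sketch only: percentage of hallucination categories exercised by a test run.
KNOWN_CATEGORIES = {
    "happy_path", "null_handling", "type_mismatch", "missing_fields",
    "malformed_data", "large_payload", "extra_fields", "empty_input",
}  # 8 categories; the last three names are guesses

def category_coverage(tested: set[str]) -> float:
    return 100 * len(tested & KNOWN_CATEGORIES) / len(KNOWN_CATEGORIES)

print(category_coverage({"happy_path", "null_handling"}))  # 25.0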
⚡ The Minimal API
For quick Jupyter Notebook checks and demos, use the one-line Python wrapper.
from toolguard import quick_check
quick_check(my_agent_function, test_cases=["happy_path", "null_handling"])
🔄 Retry & Circuit Breaker
Production-grade resilience patterns built-in.
from toolguard import with_retry, RetryPolicy, CircuitBreaker, with_circuit_breaker
@with_retry(RetryPolicy(max_retries=3, backoff_base=0.5))
def call_api(data: dict) -> dict: ...
breaker = CircuitBreaker(failure_threshold=5, reset_timeout=60)
@with_circuit_breaker(breaker)
def call_flaky_service(data: dict) -> dict: ...
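Since both wrappers are ordinary decorators, stacking them should compose, though that is a hedged sketch rather than a documented guarantee: retries absorb transient blips while the breaker trips on sustained outages.

@with_retry(RetryPolicy(max_retries=3, backoff_base=0.5))
@with_circuit_breaker(breaker)
def call_payment_gateway(data: dict) -> dict:
    ...  # hypothetical downstream call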
🖥️ CLI
toolguard run my_agent.py # Zero-config auto-test
toolguard run my_agent.py --dashboard                    # Live immersive TUI control center
toolguard test --chain my_chain.yaml # YAML-based chain test
toolguard test --chain my_chain.yaml --html out.html # HTML report
toolguard test --chain my_chain.yaml --junit-xml out.xml # JUnit XML for CI
toolguard badge # Generate reliability badge
toolguard check --tools my_tools.py # Check compatibility
toolguard observe --tools my_tools.py # View tool stats
toolguard init --name my_project # Scaffold project
🔌 Native Framework Integrations
ToolGuard works with your existing tools. No rewrites needed: just wrap and fuzz.
# 🦜🔗 LangChain
from langchain_core.tools import tool
from toolguard import test_chain
from toolguard.integrations.langchain import guard_langchain_tool
@tool
def search(query: str) -> str:
    """Search the web."""
    return f"Results for {query}"
guarded = guard_langchain_tool(search)
report = test_chain([guarded], base_input={"query": "hello"})
# CrewAI
from crewai.tools import BaseTool
from toolguard.integrations.crewai import guard_crewai_tool
guarded = guard_crewai_tool(my_crew_tool)
# 🦙 LlamaIndex
from llama_index.core.tools import FunctionTool
from toolguard.integrations.llamaindex import guard_llamaindex_tool
llama_tool = FunctionTool.from_defaults(fn=my_function)
guarded = guard_llamaindex_tool(llama_tool)
# 🤖 Microsoft AutoGen
from autogen_core.tools import FunctionTool
from toolguard.integrations.autogen import guard_autogen_tool
autogen_tool = FunctionTool(my_function, name="my_tool", description="...")
guarded = guard_autogen_tool(autogen_tool)
# 🐝 OpenAI Swarm
from swarm import Agent
from toolguard.integrations.swarm import guard_swarm_agent
agent = Agent(name="My Agent", functions=[func_a, func_b])
guarded_tools = guard_swarm_agent(agent) # Returns list of GuardedTools
# ⚡ FastAPI
from toolguard.integrations.fastapi import as_fastapi_tool
guarded = as_fastapi_tool(my_endpoint_function)
# OpenAI Function Calling
from toolguard.integrations.openai_func import from_openai_function
openai_schema = {"type": "function", "function": {"name": "my_func", "parameters": {}}}
guarded = from_openai_function(openai_schema, my_python_backend_function)
All 7 integrations are tested with real pip-installed libraries, not mocks or duck types.
🧹 100% Authentic Testing
ToolGuard's integration suite runs exclusively against the actual PyPI releases of LangChain, AutoGen, Swarm, FastAPI, and CrewAI. There is no faked compatibility: every adapter is exercised against the live libraries, and all mock-based integration tests were removed.
🏗️ CI/CD Integration
GitHub Action
Add it to any repo; it auto-comments on PRs with reliability scores:
# .github/workflows/toolguard.yml
name: ToolGuard Reliability Check
on: [pull_request]
jobs:
  reliability:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: Harshit-J004/toolguard@main
        with:
          script_path: src/agent.py
          github_token: ${{ secrets.GITHUB_TOKEN }}
          reliability_threshold: "0.95"
PR Comment Example:
🚨 ToolGuard Reliability Check (BLOCKED)

Chain: my_agent
Reliability Score: 64.3% (Threshold: 95%)
Warning: This PR introduces agent fragility. 3 tools will crash if the LLM hallucinates null.
JUnit XML (Jenkins / GitLab CI)
toolguard test --chain config.yaml --junit-xml results.xml
Generates standard <testsuites> XML that Jenkins, GitLab CI, and CircleCI parse natively.
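For example, a GitLab CI job could publish that report so failures show up in the merge-request test tab. This config is illustrative; adjust the image and paths to your project:

# .gitlab-ci.yml (illustrative)
toolguard:
  image: python:3.12
  script:
    - pip install py-toolguard
    - toolguard test --chain config.yaml --junit-xml results.xml
  artifacts:
    when: always
    reports:
      junit: results.xml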
Reliability Badges
toolguard badge
Generates shields.io badge markdown for your README.
📡 Observability & Production Alerts
1. Zero-Latency Hallucination Alerts
Catch "LLM drift" in production. When an LLM hallucinates a bad JSON payload, ToolGuard instantly fires a background alert to your team without slowing down the agent:
import toolguard

toolguard.configure_alerts(
    slack_webhook_url="https://hooks.slack.com/...",
    discord_webhook_url="https://discord.com/api/webhooks/...",
    datadog_api_key="my-api-key",
    generic_webhook_url="https://my-dashboard.com/api/ingest",
)
Built with background thread pools so network requests never block the LLM runtime.
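The underlying pattern is generic fire-and-forget dispatch. A minimal sketch of the idea, not ToolGuard's code (the real dispatcher is alerts/manager.py):

# Sketch only: webhook POSTs run on worker threads so the caller never blocks.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def _post(url: str, payload: dict) -> None:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5):
        pass  # response body ignored; we only care that delivery succeeded

def fire_alert(url: str, payload: dict) -> None:
    _pool.submit(_post, url, payload)  # returns immediately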
2. OpenTelemetry Tracing
Tracing works out of the box with Jaeger, Zipkin, Datadog, and more.
from toolguard.core.tracer import init_tracing, trace_tool
init_tracing(service_name="my-agent")
@trace_tool
def my_tool(data: dict) -> dict: ...
Architecture
toolguard/
├── core/
│   ├── validator.py       # @create_tool decorator + GuardedTool (sync + async)
│   ├── chain.py           # Chain testing engine (8 test types, async-aware)
│   ├── schema.py          # Auto Pydantic model generation
│   ├── scoring.py         # Reliability scoring + deploy gates
│   ├── report.py          # Failure analysis + suggestions
│   ├── errors.py          # Exception hierarchy + correlation IDs
│   ├── retry.py           # RetryPolicy + CircuitBreaker
│   ├── tracer.py          # OpenTelemetry integration
│   └── compatibility.py   # Schema conflict detection
├── alerts/
│   ├── manager.py         # Abstract ThreadPool dispatcher
│   ├── slack.py           # Block Kit formatting
│   ├── discord.py         # Embed formatting
│   └── datadog.py         # HTTP Metrics + Events sink
├── cli/
│   └── commands/          # run, test, check, observe, badge, init
├── reporters/
│   ├── console.py         # Rich terminal output
│   ├── html.py            # Standalone HTML reports
│   ├── junit.py           # JUnit XML for Jenkins/GitLab CI
│   └── github.py          # GitHub PR auto-commenter
├── integrations/
│   ├── langchain.py       # LangChain adapter
│   ├── crewai.py          # CrewAI adapter
│   ├── llamaindex.py      # LlamaIndex adapter
│   ├── autogen.py         # Microsoft AutoGen adapter
│   ├── swarm.py           # OpenAI Swarm adapter
│   ├── fastapi.py         # FastAPI middleware
│   └── openai_func.py     # OpenAI function calling export
├── tests/                 # 50 tests (sync + async + integration)
├── integration_tests/     # Real-library integration tests
├── fuzz_targets/          # Integration fuzz scripts (LangChain, CrewAI, AutoGen, etc.)
└── examples/
    ├── test_alerts.py     # Phase 4 webhook crash simulation
    ├── weather_chain/     # Working 3-tool example
    └── demo_failing_chain/  # Intentionally buggy (aha moment)
Why ToolGuard?
| | Without ToolGuard | With ToolGuard |
|---|---|---|
| Failure detection | Stack trace at 3 AM | Caught before deploy |
| Root cause | "TypeError in line 47" | "Tool A returned null for 'price'" |
| Fix guidance | None | "Add default value OR validate response" |
| Confidence | "It works on my machine" | "92% reliability, LOW risk" |
| CI/CD | Manual testing | toolguard run in your pipeline |
| Cost | $0.10/test (LLM calls) | $0 (deterministic fuzzing) |
| Speed | 30s (API roundtrips) | <1s (local execution) |
Tech Stack
| Component | Technology | Why |
|---|---|---|
| Core Language | Python 3.11 - 3.13 | Agent ecosystem standard |
| Schema Validation | Pydantic v2 | 3.5× faster than JSON Schema |
| Async | Native asyncio | Enterprise-grade concurrency |
| Testing | pytest (50 tests) | CI/CD native |
| Observability | OpenTelemetry | Vendor-neutral |
| CLI | Click + Rich | Beautiful terminal UX |
| CI/CD | GitHub Actions + JUnit | First-class pipeline support |
| Distribution | PyPI | pip install py-toolguard |
License
MIT: use it, fork it, ship it.