
๐Ÿ›ก๏ธ ToolGuard

Reliability testing for AI agent tool chains.

Catch cascading failures before production. Make agent tool calling as dependable as unit-tested software.



🧠 What ToolGuard Actually Solves

Right now, many developers hold back from deploying AI agents because agents are fundamentally unstable: they crash.

There are two layers to AI:

  1. Layer 1: Intelligence (evals, reasoning, accurate answers)
  2. Layer 2: Execution (tool calls, chaining, JSON payloads, APIs)

ToolGuard does not test Layer 1. We do not care if your AI is "smart" or makes good decisions. That is what eval frameworks are for.

ToolGuard systematically stress-tests Layer 2. We solve the problem of agents crashing at 3 AM because the LLM hallucinated a JSON key, passed a string instead of an int, or an external API timed out.

"We don't make AI smarter. We make AI systems not break."

The Solution

Test your agent's tools against edge cases before you deploy them. ToolGuard acts like unit tests for AI execution.

from toolguard import create_tool, test_chain, score_chain

@create_tool(schema="auto")
def parse_csv(raw_csv: str) -> dict:
    lines = raw_csv.strip().split("\n")
    headers = lines[0].split(",")
    records = [dict(zip(headers, line.split(","))) for line in lines[1:]]
    return {"headers": headers, "records": records, "row_count": len(records)}

@create_tool(schema="auto")
def compute_statistics(headers: list, records: list, row_count: int) -> dict:
    # Real computation: mean, median, std dev
    ...

@create_tool(schema="auto")
def generate_report(total_rows: int, stats: dict) -> dict:
    # Real report generation
    ...

# One line. Full visibility.
report = test_chain(
    [parse_csv, compute_statistics, generate_report],
    base_input={"raw_csv": "name,age,salary\nAlice,30,75000\nBob,35,92000"},
    test_cases=["happy_path", "null_handling", "malformed_data"],
)

score = score_chain(report)
print(score.summary())

Real Output (not mocked):

╔═══════════════════════════════════════════════════════════════════╗
║  Reliability Score: parse_csv → compute_statistics → generate_report
╠═══════════════════════════════════════════════════════════════════╣
║  Score:       50.0%                                               ║
║  Risk Level: 🟠 HIGH                                              ║
║  Deploy:     🚫 BLOCK                                             ║
║  Confidence:  45.1%                                               ║
╠═══════════════════════════════════════════════════════════════════╣
║  ⚠️  Top Risk: Schema validation failures                         ║
╠═══════════════════════════════════════════════════════════════════╣
║  Failure Distribution:                                            ║
║    schema_violation   █████████████░░░░░░░   4 (67%)              ║
║    type_mismatch      ██████░░░░░░░░░░░░░░   2 (33%)              ║
╠═══════════════════════════════════════════════════════════════════╣
║  ⚠️  Bottleneck Tools:                                            ║
║    → parse_csv       (50% success)                                ║
╚═══════════════════════════════════════════════════════════════════╝

💡 Suggestion:
Agent hallucinated payload. Schema mismatch:
  - Field 'age': Input should be a valid integer (Got: 'thirty' | Type: str)
  - Field 'salary': Field required (Got: <unknown> | Type: None)

---

## Quick Start

```bash
pip install toolguard
```

```python
from toolguard import create_tool, test_chain

@create_tool(schema="auto")
def my_tool(query: str) -> dict:
    return {"result": query.upper()}

report = test_chain(
    [my_tool],
    base_input={"query": "hello"},
    test_cases=["happy_path", "null_handling", "malformed_data"],
    assert_reliability=0.80,
)
```

Or scaffold a full project:

```bash
toolguard init --name my_agent
```

Time to value: < 3 minutes.


Features

๐Ÿ” Schema Validation

Automatic Pydantic input/output validation from type hints. No manual schemas needed.

@create_tool(schema="auto")
def fetch_price(ticker: str) -> dict:
    ...
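The core idea behind auto schemas can be sketched with only the standard library: read the function's annotations and turn them into a validation spec. This is an illustration of the mechanism, not ToolGuard's actual implementation (which generates Pydantic models):

```python
from typing import get_type_hints

def sketch_auto_schema(func):
    """Illustrative only: map a tool's annotations to an input/output spec."""
    hints = get_type_hints(func)
    output_type = hints.pop("return", None)  # remaining hints are the inputs
    return {"input": hints, "output": output_type}

# Redefining the example tool so this sketch is self-contained.
def fetch_price(ticker: str) -> dict:
    ...

schema = sketch_auto_schema(fetch_price)
# schema["input"] maps each argument name to its annotated type
```

From a mapping like this, a library can build real validators that reject a string where an int was annotated, which is the failure mode the report above flags.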

🔗 Chain Testing

Test multi-tool chains against 8 edge-case categories: null handling, type mismatches, missing fields, malformed data, large payloads, and more.

report = test_chain(
    [fetch_price, calculate_position, generate_alert],
    base_input={"ticker": "AAPL"},
    test_cases=["happy_path", "null_handling", "type_mismatch"],
)

⚡ Async Support

Works with both def and async def tools transparently. No special flags needed.

@create_tool(schema="auto")
async def fetch_from_api(url: str) -> dict:
    async with httpx.AsyncClient() as client:
        resp = await client.get(url)
        return resp.json()

# Same API: ToolGuard handles the async automatically
report = test_chain([fetch_from_api, process_data], assert_reliability=0.95)
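The general pattern behind transparent sync/async support can be sketched in plain asyncio (a conceptual illustration, not ToolGuard's internals): detect coroutine functions and route every tool through one awaitable code path.

```python
import asyncio
import inspect

async def call_tool(tool, **kwargs):
    # Route sync and async tools through one awaitable path.
    if inspect.iscoroutinefunction(tool):
        return await tool(**kwargs)
    # Run blocking tools in a worker thread so they don't stall the event loop.
    return await asyncio.to_thread(tool, **kwargs)

def shout(text: str) -> str:          # plain def
    return text.upper()

async def echo(text: str) -> str:     # async def
    return text

async def main():
    return await asyncio.gather(call_tool(shout, text="hi"),
                                call_tool(echo, text="ok"))

results = asyncio.run(main())  # ['HI', 'ok']
```

With a dispatcher like this, the calling code never needs a special flag to distinguish the two kinds of tool.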

📊 Reliability Scoring

Quantified trust with risk levels and deployment gates.

import sys

score = score_chain(report)
if score.deploy_recommendation.value == "BLOCK":
    sys.exit(1)  # CI/CD gate

🔄 Retry & Circuit Breaker

Production-grade resilience patterns, built in.

from toolguard import with_retry, RetryPolicy, CircuitBreaker, with_circuit_breaker

@with_retry(RetryPolicy(max_retries=3, backoff_base=0.5))
def call_api(data: dict) -> dict: ...

breaker = CircuitBreaker(failure_threshold=5, reset_timeout=60)

@with_circuit_breaker(breaker)
def call_flaky_service(data: dict) -> dict: ...
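For context, the circuit-breaker pattern itself fits in a few lines. The sketch below is a conceptual illustration under simplified assumptions, not ToolGuard's implementation: after a streak of consecutive failures the breaker "opens" and fails fast until a cooldown elapses.

```python
import time

class SketchBreaker:
    """Illustrative circuit breaker: open after N consecutive failures."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure streak
        return result
```

The point of failing fast is that a flaky downstream service stops eating retries and timeouts from every call in the chain while it recovers.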

🖥️ CLI

toolguard test --chain my_chain.yaml           # Run chain tests
toolguard test --chain my_chain.yaml --html report.html  # HTML report
toolguard check --tools my_tools.py            # Check compatibility
toolguard observe --tools my_tools.py          # View tool stats
toolguard init --name my_project               # Scaffold project

🔌 Native Framework Integrations

If you are already using LangChain or CrewAI, you do not need to rewrite your tools to use ToolGuard.

ToolGuard provides native adapters that instantly convert your existing framework tools into GuardedTools so you can stress-test them immediately.

# 🦜🔗 LangChain
from toolguard.integrations.langchain import guard_langchain_tool
from my_app import my_langchain_tool

guarded_tool = guard_langchain_tool(my_langchain_tool)
report = test_chain([guarded_tool], ...)

# ⚙️ CrewAI
from toolguard.integrations.crewai import guard_crewai_tool
from my_app import my_crew_tool

guarded_tool = guard_crewai_tool(my_crew_tool)
report = test_chain([guarded_tool], ...)

# 🤖 OpenAI Function Calling
from toolguard.integrations.openai_func import to_openai_function
from my_app import my_python_tool

# Instantly export any ToolGuard tool to the strict OpenAI JSON schema format
openai_schema = to_openai_function(my_python_tool)
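For reference, the target format can be hand-rolled from type hints. This is a rough sketch of the kind of JSON structure an exporter like `to_openai_function` has to produce (the helper name and the type mapping here are illustrative, not ToolGuard code):

```python
from typing import get_type_hints

# Illustrative subset of the Python-type -> JSON Schema type mapping.
JSON_TYPES = {str: "string", int: "integer", float: "number",
              bool: "boolean", list: "array", dict: "object"}

def sketch_openai_function(func) -> dict:
    """Build an OpenAI function-calling style schema from annotations."""
    hints = get_type_hints(func)
    hints.pop("return", None)
    properties = {name: {"type": JSON_TYPES.get(tp, "object")}
                  for name, tp in hints.items()}
    return {
        "name": func.__name__,
        "description": (func.__doc__ or "").strip(),
        "parameters": {"type": "object",
                       "properties": properties,
                       "required": list(properties)},
    }

def fetch_price(ticker: str) -> dict:
    """Fetch the latest price for a ticker."""
    ...

spec = sketch_openai_function(fetch_price)
```

The `name` / `description` / `parameters` shape is what OpenAI's function-calling API expects; an automated exporter saves you from keeping this JSON in sync with your signatures by hand.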

📡 Observability

OpenTelemetry tracing out of the box: works with Jaeger, Zipkin, Datadog, and more.

from toolguard.core.tracer import init_tracing, trace_tool

init_tracing(service_name="my-agent")

@trace_tool
def my_tool(data: dict) -> dict: ...

Architecture

toolguard/
├── core/
│   ├── validator.py      # @create_tool decorator + GuardedTool (sync + async)
│   ├── chain.py          # Chain testing engine (8 test types, async-aware)
│   ├── schema.py         # Auto Pydantic model generation
│   ├── scoring.py        # Reliability scoring + deploy gates
│   ├── report.py         # Failure analysis + suggestions
│   ├── errors.py         # Exception hierarchy + correlation IDs
│   ├── retry.py          # RetryPolicy + CircuitBreaker
│   ├── tracer.py         # OpenTelemetry integration
│   └── compatibility.py  # Schema conflict detection
├── cli/
│   └── commands/         # init, test, check, observe
├── reporters/
│   ├── console.py        # Rich terminal output
│   └── html.py           # Standalone HTML reports
├── integrations/
│   ├── langchain.py      # LangChain adapter
│   ├── crewai.py         # CrewAI adapter
│   └── openai_func.py    # OpenAI function calling
├── tests/                # 43 tests (sync + async + storage)
└── examples/
    ├── weather_chain/              # Working 3-tool example
    ├── demo_failing_chain/         # Intentionally buggy (aha moment)
    └── real_world_validation/      # Real CSV pipeline validation

Why ToolGuard?

| | Without ToolGuard | With ToolGuard |
|---|---|---|
| Failure detection | Stack trace at 3 AM | Caught before deploy |
| Root cause | "TypeError in line 47" | "Tool A returned null for 'price'" |
| Fix guidance | None | "Add default value OR validate response" |
| Confidence | "It works on my machine" | "92% reliability, LOW risk" |
| CI/CD | Manual testing | `toolguard test` in your pipeline |

Tech Stack

| Component | Technology | Why |
|---|---|---|
| Core Language | Python 3.11 - 3.13 | Agent ecosystem standard |
| Schema Validation | Pydantic v2 | 3.5× faster than JSON Schema |
| Async | Native asyncio | Enterprise-grade concurrency |
| Testing | pytest (43 tests) | CI/CD native |
| Observability | OpenTelemetry | Vendor-neutral |
| CLI | Click + Rich | Beautiful terminal UX |
| Distribution | PyPI | `pip install toolguard` |

License

MIT: use it, fork it, ship it.


Built to make AI agents actually work in production.
