FailGuard
Detect silent failures, drift, and stuck states in AI agents.
The Problem
AI agents fail silently. They don't crash - they just slowly degrade:
- Latency drift: Response times creep up until timeouts
- Stuck states: Same output repeated endlessly
- Cycles: A→B→A→B patterns that never progress
Traditional error handling doesn't catch these. Your agent looks "fine" while burning tokens and failing users.
The Solution
from failguard import failguard
# `llm` stands in for whatever LLM client your agent uses
@failguard(max_latency_drift=2.0, max_identical_outputs=3)
def agent_step(query: str) -> str:
    return llm.complete(query)
# Raises FailGuardError if:
# - Latency exceeds 2x baseline
# - Same output repeated 3+ times
# - Cycle pattern detected (A→B→A→B)
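To make the "exceeds 2x baseline" rule concrete, here is a minimal sketch of one way a drift ratio could be computed. It is not FailGuard's actual implementation: the `LatencyDriftTracker` class, the `warmup` parameter, and the choice of the mean of the first few samples as the baseline are all assumptions made for illustration.

```python
from statistics import mean
from typing import List


class LatencyDriftTracker:
    """Hypothetical helper: compares each latency against a warm-up baseline."""

    def __init__(self, max_drift: float = 2.0, warmup: int = 3) -> None:
        self.max_drift = max_drift
        self.warmup = warmup  # assumption: the first `warmup` samples define the baseline
        self.samples: List[float] = []

    def record(self, latency_ms: float) -> float:
        """Store a sample and return current/baseline (1.0 while warming up)."""
        self.samples.append(latency_ms)
        if len(self.samples) <= self.warmup:
            return 1.0
        baseline = mean(self.samples[: self.warmup])
        if baseline <= 0:
            return 1.0  # avoid dividing by zero on degenerate input
        return latency_ms / baseline

    def has_drift(self, latency_ms: float) -> bool:
        """True when the newest sample exceeds max_drift times the baseline."""
        return self.record(latency_ms) > self.max_drift


tracker = LatencyDriftTracker(max_drift=2.0, warmup=3)
warmup_flags = [tracker.has_drift(ms) for ms in (100, 100, 100)]
mild = tracker.has_drift(150)    # 1.5x baseline
severe = tracker.has_drift(250)  # 2.5x baseline
```

With a 100 ms baseline, the 150 ms call stays under the threshold while the 250 ms call crosses it.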
Installation
pip install failguard
Features
- Zero dependencies - Only Python stdlib
- Latency drift detection - Catches gradual slowdowns
- Stuck detection - Identifies repeated identical outputs
- Cycle detection - Finds A→B→A→B patterns (complements LoopGuard)
- Thread-safe - Safe for concurrent use
- Flexible API - Decorator or inline Monitor class
Usage
Decorator API
from failguard import failguard, FailGuardError
@failguard(
    max_latency_drift=3.0,    # Alert if latency > 3x baseline
    max_identical_outputs=5,  # Alert after 5 identical outputs
    stuck_window=60,          # Within 60 seconds
    detect_cycles=True,       # Detect A→B→A→B patterns
)
def agent_step(query: str) -> str:
    return llm.complete(query)
try:
    result = agent_step("What is 2+2?")
except FailGuardError as e:
    print(f"Failure detected: {e.failure_type}")
    print(f"Metrics: {e.metrics}")
Custom Failure Handler
import logging
from failguard import failguard
logger = logging.getLogger(__name__)
def my_handler(status):
    logger.warning(f"Agent failing: {status.failure_types}")
    return "fallback response"
@failguard(max_identical_outputs=3, on_failure=my_handler)
def agent_step(query: str) -> str:
    return llm.complete(query)
# Returns "fallback response" instead of raising
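The handler contract described above (the handler's return value replaces the exception) can be illustrated with a stand-alone decorator. This is a minimal sketch, not FailGuard's code: `guard_repeats` and `RepeatError` are hypothetical names, the handler here receives a plain repeat count rather than a status object, and only the repeated-output check is implemented.

```python
import functools
from typing import Any, Callable, Optional


class RepeatError(RuntimeError):
    """Hypothetical stand-in for FailGuardError."""


def guard_repeats(max_identical: int = 3,
                  on_failure: Optional[Callable[[int], Any]] = None):
    """Raise, or call on_failure, once the same output repeats max_identical times."""
    def decorator(func):
        last = None  # most recent output seen
        count = 0    # length of the current run of identical outputs

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal last, count
            result = func(*args, **kwargs)
            count = count + 1 if (count and result == last) else 1
            last = result
            if count >= max_identical:
                if on_failure is not None:
                    return on_failure(count)  # handler's return value replaces the result
                raise RepeatError(f"same output repeated {count} times")
            return result
        return wrapper
    return decorator


@guard_repeats(max_identical=3, on_failure=lambda n: "fallback response")
def echo(query: str) -> str:
    return "same answer"


outputs = [echo("q") for _ in range(3)]
```

The third call trips the threshold, so the caller receives the fallback value; without `on_failure`, the same call would raise instead.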
Inline Monitor
from failguard import Monitor
monitor = Monitor(max_identical_outputs=3)
for step in workflow:
    result = agent.run(step)
    status = monitor.check(result, step_name=step)
    if status.is_stuck:
        print(f"Agent stuck: {status.identical_count} repeats")
        break
    if status.has_cycle:
        print(f"Cycle detected: {status.cycle_pattern}")
        break
    if status.has_latency_drift:
        print(f"Slowdown: {status.latency_drift_ratio}x baseline")
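For intuition about what an `is_stuck` check might involve, the sketch below counts the trailing run of identical outputs inside a time window. It is an illustration under assumptions, not the library's logic: `StuckDetector`, its `window_s` parameter, and the injectable `now` timestamp are hypothetical.

```python
import time
from collections import deque
from typing import Any, Deque, Optional, Tuple


class StuckDetector:
    """Hypothetical sketch: counts consecutive identical outputs within a time window."""

    def __init__(self, max_identical: int = 3, window_s: float = 60.0) -> None:
        self.max_identical = max_identical
        self.window_s = window_s
        self._history: Deque[Tuple[float, Any]] = deque()

    def check(self, output: Any, now: Optional[float] = None) -> int:
        """Record an output and return the length of the trailing identical run."""
        now = time.monotonic() if now is None else now
        self._history.append((now, output))
        # Drop entries that have fallen out of the time window.
        while self._history and now - self._history[0][0] > self.window_s:
            self._history.popleft()
        run = 0
        for _, value in reversed(self._history):
            if value != output:
                break
            run += 1
        return run

    def is_stuck(self, output: Any, now: Optional[float] = None) -> bool:
        return self.check(output, now) >= self.max_identical


detector = StuckDetector(max_identical=3, window_s=60.0)
flags = [detector.is_stuck("a", now=t) for t in (0.0, 1.0, 2.0)]
after_change = detector.is_stuck("b", now=3.0)
```

Passing `now` explicitly keeps the example deterministic; in real use the monotonic clock default would apply.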
With LoopGuard (Full Reliability Suite)
from loopguard import loopguard
from failguard import failguard
@loopguard(max_repeats=5) # Catch A→A→A (same args)
@failguard(detect_cycles=True) # Catch A→B→A→B (different outputs)
def agent_action(query):
    return llm.complete(query)
API Reference
@failguard(**options)
Decorator for detecting failures.
| Option | Default | Description |
|---|---|---|
| max_latency_drift | 3.0 | Alert if latency > N × baseline |
| max_identical_outputs | 5 | Alert after N identical outputs |
| stuck_window | 60.0 | Time window (seconds) for stuck detection |
| detect_cycles | True | Detect repeating patterns |
| cycle_min_length | 2 | Minimum cycle length |
| cycle_max_length | 5 | Maximum cycle length |
| on_failure | None | Callback: (FailureStatus) -> Any |
| raise_on_failure | True | Raise FailGuardError on failure |
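The `cycle_min_length` and `cycle_max_length` options suggest a search over candidate pattern lengths. A minimal sketch of such a search follows; it is an assumption about how a detector of this kind could work, not FailGuard's algorithm. The `find_cycle` function and its `repeats` parameter are hypothetical.

```python
from typing import List, Optional, Sequence


def find_cycle(history: Sequence[str], min_length: int = 2,
               max_length: int = 5, repeats: int = 2) -> Optional[List[str]]:
    """Return the shortest pattern that ends the history repeated `repeats` times."""
    for length in range(min_length, max_length + 1):
        needed = length * repeats
        if len(history) < needed:
            break  # longer patterns need even more history
        tail = list(history[-needed:])
        pattern = tail[:length]
        if len(set(pattern)) < 2:
            continue  # all-identical outputs are the "stuck" case, not a cycle
        if all(tail[i] == pattern[i % length] for i in range(needed)):
            return pattern
    return None


two_step = find_cycle(["A", "B", "A", "B"])
three_step = find_cycle(["x", "A", "B", "C", "A", "B", "C"])
no_cycle = find_cycle(["A", "B", "C", "D"])
stuck_only = find_cycle(["A", "A", "A", "A"])
```

Excluding all-identical patterns keeps this check distinct from stuck detection, so one failure is not reported twice.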
Attached methods:
- func.reset() - Clear all state
- func.get_status() - Get current status
Monitor(**options)
Inline monitor with same options as decorator.
monitor = Monitor()
status = monitor.check(value, step_name="step1", latency_ms=150)
monitor.reset()
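The `latency_ms` argument in the call above indicates that latency can be supplied by the caller. Whether `Monitor` measures latency on its own when the argument is omitted is not stated here, so the sketch below shows one way to measure it yourself. `timed_call` is a hypothetical helper, not part of FailGuard.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


result, latency_ms = timed_call(str.upper, "hello")
# The measured value could then be passed along, for example:
# status = monitor.check(result, step_name="step1", latency_ms=latency_ms)
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.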
FailureStatus
Status object returned by checks.
| Field | Type | Description |
|---|---|---|
| has_failure | bool | Any failure detected |
| failure_types | list | List of FailureType values |
| has_latency_drift | bool | Latency exceeded threshold |
| latency_drift_ratio | float | Current/baseline ratio |
| is_stuck | bool | Identical outputs exceeded threshold |
| identical_count | int | Number of identical outputs |
| has_cycle | bool | Cycle pattern detected |
| cycle_pattern | list | The repeating pattern |
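As an illustration of consuming these fields, the sketch below turns a status object into a one-line log message. `summarize_status` is a hypothetical helper, and the example uses a `SimpleNamespace` stand-in with the same attribute names rather than a real FailureStatus instance.

```python
from types import SimpleNamespace


def summarize_status(status) -> str:
    """Build a one-line message from the documented status fields."""
    if not status.has_failure:
        return "ok"
    parts = []
    if status.has_latency_drift:
        parts.append(f"latency {status.latency_drift_ratio:.1f}x baseline")
    if status.is_stuck:
        parts.append(f"{status.identical_count} identical outputs")
    if status.has_cycle:
        parts.append("cycle " + " -> ".join(map(str, status.cycle_pattern)))
    return "; ".join(parts)


# Stand-in object with the documented attribute names (not a real FailureStatus).
example = SimpleNamespace(
    has_failure=True, has_latency_drift=False, latency_drift_ratio=1.0,
    is_stuck=True, identical_count=4, has_cycle=True, cycle_pattern=["A", "B"],
)
summary = summarize_status(example)
```

Because the helper relies only on attribute names, it should work with any object exposing the fields listed in the table.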
FailGuardError
Exception raised on failure.
try:
    agent_step("What is 2+2?")
except FailGuardError as e:
    e.failure_type  # "stuck", "cycle", "latency_drift"
    e.message       # Human-readable description
    e.metrics       # Dict with relevant metrics
FailureType
Constants for failure types:
- FailureType.LATENCY_DRIFT
- FailureType.STUCK
- FailureType.CYCLE
Part of the Guard Suite
FailGuard is part of a reliability suite for AI agents:
- LoopGuard - Prevent infinite loops
- EvalGuard - Validate outputs
- FailGuard - Detect silent failures (this package)
License
MIT