FailGuard

Detect silent failures, drift, and stuck states in AI agents.

The Problem

AI agents fail silently. They don't crash - they just slowly degrade:

  • Latency drift: Response times creep up until requests time out
  • Stuck states: Same output repeated endlessly
  • Cycles: A→B→A→B patterns that never progress

Traditional error handling doesn't catch these. Your agent looks "fine" while burning tokens and failing users.

The Solution

from failguard import failguard

@failguard(max_latency_drift=2.0, max_identical_outputs=3)
def agent_step(query: str) -> str:
    return llm.complete(query)  # `llm` stands in for your own LLM client

# Raises FailGuardError if:
# - Latency exceeds 2x baseline
# - Same output repeated 3+ times
# - Cycle pattern detected (A→B→A→B)

Installation

pip install failguard

Features

  • Zero dependencies - Only Python stdlib
  • Latency drift detection - Catches gradual slowdowns
  • Stuck detection - Identifies repeated identical outputs
  • Cycle detection - Finds A→B→A→B patterns (complements LoopGuard)
  • Thread-safe - Safe for concurrent use
  • Flexible API - Decorator or inline Monitor class
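
To make the stuck-detection feature concrete, here is a minimal sketch of how "N identical outputs within a time window" could be tracked using only the standard library. This is not FailGuard's actual implementation: the `StuckDetector` class, its `clock` parameter, and the use of `==` to compare outputs are all assumptions made for illustration, and the sketch omits the locking a thread-safe version would need.

```python
from collections import deque
import time


class StuckDetector:
    """Hypothetical sketch: flags N identical outputs inside a time window."""

    def __init__(self, max_identical_outputs=5, stuck_window=60.0,
                 clock=time.monotonic):
        self.max_identical = max_identical_outputs
        self.window = stuck_window
        self.clock = clock        # injectable so the sketch is easy to test
        self._last = None
        self._times = deque()     # timestamps of the current run of repeats

    def check(self, output):
        """Record one output; return True if the agent looks stuck."""
        now = self.clock()
        if output != self._last:
            # A different output breaks the run of repeats.
            self._last = output
            self._times.clear()
        self._times.append(now)
        # Forget repeats that have fallen out of the time window.
        while self._times and now - self._times[0] > self.window:
            self._times.popleft()
        return len(self._times) >= self.max_identical
```

Comparing raw outputs with `==` only catches exact repeats; near-duplicates would need normalization or hashing first, which is outside the scope of this sketch.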

Usage

Decorator API

from failguard import failguard, FailGuardError

@failguard(
    max_latency_drift=3.0,      # Alert if latency > 3x baseline
    max_identical_outputs=5,    # Alert after 5 identical outputs
    stuck_window=60,            # Within 60 seconds
    detect_cycles=True,         # Detect A→B→A→B patterns
)
def agent_step(query: str) -> str:
    return llm.complete(query)

try:
    result = agent_step("What is 2+2?")
except FailGuardError as e:
    print(f"Failure detected: {e.failure_type}")
    print(f"Metrics: {e.metrics}")

Custom Failure Handler

import logging

from failguard import failguard

logger = logging.getLogger(__name__)

def my_handler(status):
    logger.warning(f"Agent failing: {status.failure_types}")
    return "fallback response"

@failguard(max_identical_outputs=3, on_failure=my_handler)
def agent_step(query: str) -> str:
    return llm.complete(query)

# Returns "fallback response" instead of raising
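
A handler can also choose its fallback based on which failure was reported. The sketch below assumes that `status.failure_types` contains the strings "stuck", "cycle", or "latency_drift" (the values listed for `e.failure_type` in the API reference); that equivalence is an assumption, so check the real `FailureType` values before relying on it. `choose_fallback` and the `SimpleNamespace` stand-in are hypothetical names used only for illustration.

```python
from types import SimpleNamespace


def choose_fallback(status):
    """Pick a fallback message from the reported failure types.

    Assumes status.failure_types holds "stuck", "cycle", or
    "latency_drift"; this is not confirmed by the FailGuard docs.
    """
    kinds = set(status.failure_types)
    if "cycle" in kinds or "stuck" in kinds:
        return "I'm not making progress on this; handing off to a human."
    if "latency_drift" in kinds:
        return "The service is slow right now; please retry shortly."
    return "fallback response"


# Stand-in for a FailureStatus object, for illustration only.
fake_status = SimpleNamespace(failure_types=["stuck"])
print(choose_fallback(fake_status))
```

A function like this could be passed as `on_failure=choose_fallback` in place of `my_handler`.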

Inline Monitor

from failguard import Monitor

monitor = Monitor(max_identical_outputs=3)

for step in workflow:
    result = agent.run(step)
    status = monitor.check(result, step_name=step)

    if status.is_stuck:
        print(f"Agent stuck: {status.identical_count} repeats")
        break
    if status.has_cycle:
        print(f"Cycle detected: {status.cycle_pattern}")
        break
    if status.has_latency_drift:
        print(f"Slowdown: {status.latency_drift_ratio}x baseline")

With LoopGuard (Full Reliability Suite)

from loopguard import loopguard
from failguard import failguard

@loopguard(max_repeats=5)        # Catch A→A→A (same args)
@failguard(detect_cycles=True)   # Catch A→B→A→B (different outputs)
def agent_action(query):
    return llm.complete(query)
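
When stacking the two decorators, order matters: Python applies decorators bottom-up, so the one closest to the function wraps it first and the top one runs first on each call. The sketch below demonstrates this with a hypothetical `tracer` stand-in decorator; it does not use LoopGuard or FailGuard themselves.

```python
import functools

calls = []


def tracer(name):
    """Stand-in decorator that records when its wrapper runs."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append(f"{name}:before")
            result = func(*args, **kwargs)
            calls.append(f"{name}:after")
            return result
        return wrapper
    return decorate


@tracer("outer")   # plays the role of @loopguard
@tracer("inner")   # plays the role of @failguard
def agent_action(query):
    calls.append("call")
    return query.upper()


agent_action("hi")
print(calls)
# ['outer:before', 'inner:before', 'call', 'inner:after', 'outer:after']
```

In the example above this means the `@loopguard` wrapper sees each call before the `@failguard` wrapper does, and sees its result (or exception) after.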

API Reference

@failguard(**options)

Decorator for detecting failures.

Option                  Default  Description
max_latency_drift       3.0      Alert if latency > N × baseline
max_identical_outputs   5        Alert after N identical outputs
stuck_window            60.0     Time window (seconds) for stuck detection
detect_cycles           True     Detect repeating patterns
cycle_min_length        2        Minimum cycle length
cycle_max_length        5        Maximum cycle length
on_failure              None     Callback: (FailureStatus) -> Any
raise_on_failure        True     Raise FailGuardError on failure
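
To illustrate how `cycle_min_length` and `cycle_max_length` could bound the search, here is a minimal sketch of tail-based cycle detection. It is one plausible approach, not FailGuard's actual algorithm; `find_cycle` and its `min_repeats` parameter are hypothetical.

```python
def find_cycle(history, min_length=2, max_length=5, min_repeats=2):
    """Return the repeating pattern at the tail of history, or None.

    A pattern of length k counts as a cycle when the last
    k * min_repeats items are that pattern repeated back to back.
    """
    for k in range(min_length, max_length + 1):
        needed = k * min_repeats
        if len(history) < needed:
            break  # longer patterns need even more history
        tail = history[-needed:]
        pattern = tail[:k]
        if len(set(pattern)) < 2:
            continue  # all-identical items are a stuck state, not a cycle
        if all(tail[i] == pattern[i % k] for i in range(needed)):
            return pattern
    return None


print(find_cycle(["A", "B", "A", "B"]))            # ['A', 'B']
print(find_cycle(["A", "B", "C", "A", "B", "C"]))  # ['A', 'B', 'C']
print(find_cycle(["A", "B", "C", "D"]))            # None
```

Shorter patterns are tried first, so a history of A→B→A→B→A→B→A→B is reported as the length-2 pattern rather than a length-4 one.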

Attached methods:

  • func.reset() - Clear all state
  • func.get_status() - Get current status
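
Attaching methods to a decorated function works because Python functions are objects that accept attributes. The sketch below shows the general pattern with a hypothetical `counting_guard` decorator that tracks only a call count; it is not FailGuard's code, and the real `get_status()` presumably returns a richer object than this dict.

```python
import functools
import threading


def counting_guard(func):
    """Hypothetical decorator that keeps per-function state and exposes it."""
    state = {"calls": 0}
    lock = threading.Lock()   # protects shared state across threads

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with lock:
            state["calls"] += 1
        return func(*args, **kwargs)

    def reset():
        with lock:
            state["calls"] = 0

    def get_status():
        with lock:
            return dict(state)   # copy so callers cannot mutate internals

    # Attach the helpers to the wrapper as plain attributes.
    wrapper.reset = reset
    wrapper.get_status = get_status
    return wrapper


@counting_guard
def agent_step(query):
    return query[::-1]
```

Because the state lives in the decorator's closure, each decorated function gets its own counters, and `reset()` on one function does not affect another.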

Monitor(**options)

Inline monitor that accepts the same options as the decorator.

monitor = Monitor()
status = monitor.check(value, step_name="step1", latency_ms=150)
monitor.reset()

FailureStatus

Status object returned by checks.

Field                 Type    Description
has_failure           bool    Any failure detected
failure_types         list    List of FailureType values
has_latency_drift     bool    Latency exceeded threshold
latency_drift_ratio   float   Current/baseline ratio
is_stuck              bool    Identical outputs exceeded threshold
identical_count       int     Number of identical outputs
has_cycle             bool    Cycle pattern detected
cycle_pattern         list    The repeating pattern
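
The `latency_drift_ratio` field is described as a current/baseline ratio, but how the baseline is established is not documented here. The sketch below shows one plausible scheme, taking the median of the first few samples as the baseline. `LatencyDrift`, `baseline_samples`, and the neutral ratio of 1.0 during warm-up are all assumptions for illustration, not FailGuard's behavior.

```python
from statistics import median


class LatencyDrift:
    """Hypothetical sketch: ratio of the latest latency to an early baseline."""

    def __init__(self, max_latency_drift=3.0, baseline_samples=5):
        self.max_drift = max_latency_drift
        self.baseline_samples = baseline_samples
        self._warmup = []
        self.baseline = None

    def observe(self, latency_ms):
        """Return (has_latency_drift, latency_drift_ratio) for one sample."""
        if self.baseline is None:
            self._warmup.append(latency_ms)
            if len(self._warmup) < self.baseline_samples:
                return False, 1.0   # no baseline yet: report a neutral ratio
            # Median is less sensitive to one slow warm-up call than the mean.
            self.baseline = median(self._warmup)
        ratio = latency_ms / self.baseline if self.baseline > 0 else 1.0
        return ratio > self.max_drift, ratio


drift = LatencyDrift(max_latency_drift=2.0, baseline_samples=3)
for ms in (100, 110, 90):
    drift.observe(ms)          # warm-up: baseline becomes 100
print(drift.observe(150))      # (False, 1.5)
print(drift.observe(250))      # (True, 2.5)
```

A fixed early baseline is simple but never adapts; a rolling baseline would track legitimate load changes at the cost of possibly masking very slow drift.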

FailGuardError

Exception raised on failure.

try:
    agent_step("What is 2+2?")
except FailGuardError as e:
    e.failure_type   # "stuck", "cycle", or "latency_drift"
    e.message        # Human-readable description
    e.metrics        # Dict with relevant metrics

FailureType

Constants for failure types:

  • FailureType.LATENCY_DRIFT
  • FailureType.STUCK
  • FailureType.CYCLE

Part of the Guard Suite

FailGuard is part of a reliability suite for AI agents:

  • LoopGuard - Prevent infinite loops
  • EvalGuard - Validate outputs
  • FailGuard - Detect silent failures (this package)

License

MIT
