Skip to main content

Catch silent agent failures in CI. Behavior lint for LLM agents, powered by agent-failure-debugger.

Project description

pytest-agent-health

Catch silent agent failures in CI. Behavior lint for LLM agents.

pip install pytest-agent-health
def test_agent_answers_correctly(agent_health):
    raw_log = my_agent.run("What was Q3 revenue?")
    agent_health.check(raw_log, adapter="langchain")
    # → FAIL if execution failed or quality degraded
    # → WARN if minor concerns detected
    # → PASS if healthy

The Problem

Traditional tests can't catch silent agent failures:

def test_agent():
    response = agent.run("Book me a flight to LAX")
    assert response is not None  # PASSES — but agent just said "I can help!"

The agent produced a response, so assertions pass. But it never actually booked anything. pytest-agent-health detects this by analyzing the execution trace — tool usage, grounding, alignment, termination behavior — and fails the test when the agent silently broke.


How It Works

agent log → diagnose() → execution_quality → CI policy → FAIL / WARN / PASS
                (debugger)                     (plugin)

The plugin wraps agent-failure-debugger, which detects 17 failure patterns using deterministic causal analysis (no ML). The CI policy layer converts diagnosis results into test verdicts.


Usage

Single-run check

def test_agent_completes_task(agent_health):
    raw_log = my_agent.run("What was Q3 revenue?")
    agent_health.check(raw_log, adapter="langchain")

If the agent's execution is failed or degraded with risk indicators, the test fails with a diagnostic report:

❌ FAIL: execution_quality=degraded
  FAILED: incorrect_output [retryable]
    → In production, consider create_health_check() for automatic recovery
  WARN [risk]: response.alignment_score = 0.12
    → output alignment with user intent is weak

Multi-run stability

def test_agent_consistency(agent_health):
    logs = [my_agent.run("Q3 revenue?") for _ in range(3)]
    agent_health.compare(logs, adapter="langchain", min_agreement=1.0)
    # → FAIL if root_cause_agreement < 1.0

Regression detection

def test_no_regression(agent_health):
    success_logs = load_baseline_logs()
    current_logs = [my_agent.run(q) for q in test_queries]
    agent_health.diff(success_logs, current_logs, adapter="langchain")
    # → FAIL if new failure patterns appear

CI Policy

The plugin applies a CI-specific policy on top of execution quality:

Default

Condition Verdict Rationale
failed FAIL Task incomplete, error, or silent exit
degraded WARN Output produced but quality concerns detected
healthy PASS No issues

Strict (--strict)

Condition Verdict Rationale
failed FAIL Task incomplete, error, or silent exit
degraded + risk indicators FAIL Weak alignment, no tool data, hallucination signals
degraded + info indicators only WARN Diagnostic limitations, not agent failure
healthy PASS No issues

Indicator classification

Degradation indicators are classified as risk (agent quality problem) or info (diagnostic limitation):

Tag Indicators Meaning
[risk] alignment_score, tool_provided_data, tool_result_diversity, expansion_ratio Output may be wrong or unreliable
[info] observation_coverage, unmodeled_failure, conflicting_signals Diagnosis has limited confidence

Pattern override

Force specific failure patterns to always fail, regardless of execution quality:

pytest --agent-health --agent-health-fail-on=premature_termination,context_truncation_loss

CLI Options

pytest --agent-health                    # Enable the plugin
pytest --agent-health --agent-health-strict  # Strict: degraded+risk = FAIL
pytest --agent-health --agent-health-fail-on=premature_termination

Marker

Override policy per-test:

@pytest.mark.agent_health(strict=True)
def test_critical_flow(agent_health):
    agent_health.check(log, adapter="langchain")

@pytest.mark.agent_health(fail_on=["premature_termination"])
def test_critical_flow(agent_health):
    agent_health.check(log, adapter="langchain")

CI Output

The terminal summary shows all checks at the end of the test session:

============================= agent-health summary =============================
agent-health: 4 checks — 2 FAIL, 2 PASS

Failed checks (2):
  ❌ FAIL: execution_quality=degraded
    WARN [risk]: response.alignment_score = 0.0
      → output alignment with user intent is weak
  ❌ FAIL: execution_quality=degraded
    FAILED: incorrect_output [retryable]
      → In production, consider create_health_check() for automatic recovery
    WARN [risk]: response.alignment_score = 0.12
      → output alignment with user intent is weak

Failure patterns are tagged [retryable] or [structural]. Retryable failures can be automatically recovered in production using create_health_check(). Structural failures require prompt or configuration changes.


Non-determinism

LLM agents are non-deterministic. The same test may produce different results across runs. This is expected behavior, not a plugin bug.

Recommendations for stable CI:

  • Set temperature=0 in your agent configuration
  • Use model-specific seed parameters where available
  • Use agent_health.compare() to explicitly test for consistency
  • Accept that some flakiness is inherent to LLM-based systems

Related

Package Role
agent-failure-debugger Diagnosis engine (this plugin wraps it)
llm-failure-atlas Failure patterns, signals, matcher
create_health_check() Runtime self-healing (complementary to CI)

CI detects. Production heals. Use pytest-agent-health to catch failures in CI, and create_health_check() to automatically recover in production.


License

MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pytest_agent_health-0.1.0.tar.gz (15.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pytest_agent_health-0.1.0-py3-none-any.whl (12.6 kB view details)

Uploaded Python 3

File details

Details for the file pytest_agent_health-0.1.0.tar.gz.

File metadata

  • Download URL: pytest_agent_health-0.1.0.tar.gz
  • Upload date:
  • Size: 15.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for pytest_agent_health-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c108b5df3d7fa73023df4f58fe4dd4a917820fa7566935aad549ff09563b8c8b
MD5 460aa81257e9bfaaa84ab3347cebf0ad
BLAKE2b-256 77d2a2315504f02b8eeb1cd6f0b5f3c83b5ffb72111f935bf4dcddafc85b191e

See more details on using hashes here.

File details

Details for the file pytest_agent_health-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for pytest_agent_health-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c05a3d4330a210f4605d26765183dca7ad2772e4a0bbe655db95f280e4eb3299
MD5 7062d6ceb0c3d6e7f4aacb7a414e3ee7
BLAKE2b-256 1b53e56afb84063876d16fdce64ffe571f74d267acdf74e39f2bb5e1e0ed446a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page