pytest-agent-health

Catch silent agent failures in CI. Behavior lint for LLM agents.

pip install pytest-agent-health

def test_agent_answers_correctly(agent_health):
    raw_log = my_agent.run("What was Q3 revenue?")
    agent_health.check(raw_log, adapter="langchain")
    # → FAIL if execution failed or regression detected
    # → WARN if quality concerns detected
    # → PASS if healthy

The Problem

Traditional tests can't catch silent agent failures:

def test_agent():
    response = agent.run("Book me a flight to LAX")
    assert response is not None  # PASSES — but agent just said "I can help!"

The agent produced a response, so assertions pass. But it never actually booked anything.

Worse: the agent worked yesterday but silently broke today. The same test passes both times because assert response is not None doesn't check behavior — only existence.

pytest-agent-health solves both problems: it detects silent failures by analyzing execution traces, and it catches regressions by comparing against previous CI runs.


How It Works

agent log → diagnose() → execution_quality → CI policy → FAIL / WARN / PASS
                (debugger)                     (plugin)
                                                  ↕
                                           baseline comparison
                                          (regression detection)

The plugin wraps agent-failure-debugger, which detects 17 failure patterns using deterministic causal analysis (no ML). The CI policy layer converts diagnosis results into test verdicts. Baseline comparison detects regressions across CI runs — a capability that single-run diagnosis cannot provide.


Usage

Single-run check

def test_agent_completes_task(agent_health):
    raw_log = my_agent.run("What was Q3 revenue?")
    agent_health.check(raw_log, adapter="langchain")

If the agent's execution status is failed, or is degraded with risk indicators, the test fails with a diagnostic report:

❌ FAIL: execution_quality=degraded
  FAILED: incorrect_output [retryable]
    → In production, consider create_health_check() for automatic recovery
  WARN [risk]: response.alignment_score = 0.12
    → output alignment with user intent is weak

Regression detection (automatic)

When --agent-health-update-baseline is set, each check() saves a diagnosis snapshot. On subsequent runs, the plugin automatically compares against the baseline and fails if regressions are detected:

🔄 REGRESSION DETECTED (compared to baseline):
  New failures: incorrect_output
  Status change: healthy → degraded
  New indicators: response.alignment_score, grounding.tool_provided_data

Regressions are detected when:

  • New failure patterns appear that didn't exist in the baseline
  • Execution status degrades (healthy → degraded, degraded → failed)
  • New risk indicators appear
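The three conditions above can be sketched as a comparison between two diagnosis snapshots. This is a hypothetical illustration only; the field names and snapshot schema here are assumptions, not the plugin's internal baseline format:

```python
# Hypothetical sketch of the three regression conditions; the plugin's
# real baseline schema is internal and may differ.
STATUS_ORDER = {"healthy": 0, "degraded": 1, "failed": 2}

def detect_regressions(baseline: dict, current: dict) -> list[str]:
    regressions = []
    # 1. New failure patterns that didn't exist in the baseline
    new_failures = set(current["failures"]) - set(baseline["failures"])
    if new_failures:
        regressions.append("New failures: " + ", ".join(sorted(new_failures)))
    # 2. Execution status degrades (healthy -> degraded -> failed)
    if STATUS_ORDER[current["status"]] > STATUS_ORDER[baseline["status"]]:
        regressions.append(f"Status change: {baseline['status']} -> {current['status']}")
    # 3. New risk indicators appear
    new_indicators = set(current["indicators"]) - set(baseline["indicators"])
    if new_indicators:
        regressions.append("New indicators: " + ", ".join(sorted(new_indicators)))
    return regressions

baseline = {"status": "healthy", "failures": [], "indicators": []}
current = {"status": "degraded", "failures": ["incorrect_output"],
           "indicators": ["response.alignment_score"]}
for r in detect_regressions(baseline, current):
    print(r)
```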

Workflow:

# First run: establish baselines
pytest --agent-health --agent-health-update-baseline

# Subsequent runs: detect regressions automatically
pytest --agent-health

# After fixing regressions: update baselines
pytest --agent-health --agent-health-update-baseline

Baselines are stored in .agent-health/ (one JSON per test). Commit this directory to git to share baselines across CI runs.

Multi-run stability

def test_agent_consistency(agent_health):
    logs = [my_agent.run("Q3 revenue?") for _ in range(3)]
    agent_health.compare(logs, adapter="langchain", min_agreement=1.0)
    # → FAIL if root_cause_agreement < 1.0

Differential diagnosis

def test_no_new_failures(agent_health):
    success_logs = load_baseline_logs()
    current_logs = [my_agent.run(q) for q in test_queries]
    agent_health.diff(success_logs, current_logs, adapter="langchain")
    # → FAIL if new failure patterns appear

CI Policy

The plugin applies a CI-specific policy on top of execution quality:

Default

Condition             Verdict  Rationale
Regression detected   FAIL     New failures or status degradation vs baseline
failed                FAIL     Task incomplete, error, or silent exit
degraded              WARN     Output produced but quality concerns detected
healthy               PASS     No issues

Strict (--agent-health-strict)

Condition                        Verdict  Rationale
Regression detected              FAIL     New failures or status degradation vs baseline
failed                           FAIL     Task incomplete, error, or silent exit
degraded + risk indicators       FAIL     Weak alignment, no tool data, hallucination signals
degraded + info indicators only  WARN     Diagnostic limitations, not agent failure
healthy                          PASS     No issues
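Taken together, the default and strict policies reduce to a single decision function. The sketch below is a hypothetical illustration of the policy logic, not the plugin's internal code:

```python
# Hypothetical sketch of the default and strict CI policies.
def verdict(execution_quality: str, regression: bool,
            has_risk_indicators: bool, strict: bool = False) -> str:
    if regression or execution_quality == "failed":
        return "FAIL"
    if execution_quality == "degraded":
        # Strict mode treats risk indicators as failures;
        # info-only indicators stay at WARN in both modes.
        if strict and has_risk_indicators:
            return "FAIL"
        return "WARN"
    return "PASS"  # healthy

print(verdict("degraded", regression=False, has_risk_indicators=True))                # default -> WARN
print(verdict("degraded", regression=False, has_risk_indicators=True, strict=True))   # strict  -> FAIL
```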

Indicator classification

Degradation indicators are classified as risk (agent quality problem) or info (diagnostic limitation):

Tag     Indicators                                                                     Meaning
[risk]  alignment_score, tool_provided_data, tool_result_diversity, expansion_ratio   Output may be wrong or unreliable
[info]  observation_coverage, unmodeled_failure, conflicting_signals                  Diagnosis has limited confidence
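The risk/info split can be expressed as two sets. A hypothetical sketch (indicator names come from the table above; the dotted namespacing rule is an assumption based on the sample output, e.g. response.alignment_score):

```python
# Hypothetical sketch of the indicator classification above.
RISK = {"alignment_score", "tool_provided_data",
        "tool_result_diversity", "expansion_ratio"}
INFO = {"observation_coverage", "unmodeled_failure", "conflicting_signals"}

def classify(indicator: str) -> str:
    # Indicators may be namespaced, e.g. "response.alignment_score"
    name = indicator.rsplit(".", 1)[-1]
    if name in RISK:
        return "risk"   # agent quality problem: output may be wrong or unreliable
    if name in INFO:
        return "info"   # diagnostic limitation: diagnosis has limited confidence
    return "unknown"

print(classify("response.alignment_score"))  # -> risk
```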

Pattern override

Force specific failure patterns to always fail, regardless of execution quality:

pytest --agent-health --agent-health-fail-on=premature_termination,context_truncation_loss

CLI Options

pytest --agent-health                          # Enable the plugin
pytest --agent-health --agent-health-strict    # Strict: degraded+risk = FAIL
pytest --agent-health --agent-health-fail-on=premature_termination
pytest --agent-health --agent-health-update-baseline   # Save current as baseline
pytest --agent-health --agent-health-no-baseline       # Skip regression detection
pytest --agent-health --agent-health-baseline-dir=path # Custom baseline directory

Marker

Override policy per-test:

@pytest.mark.agent_health(strict=True)
def test_critical_flow(agent_health):
    agent_health.check(log, adapter="langchain")

@pytest.mark.agent_health(fail_on=["premature_termination"])
def test_must_not_terminate_early(agent_health):
    agent_health.check(log, adapter="langchain")

CI Output

The terminal summary shows all checks at the end of the test session:

============================= agent-health summary =============================
agent-health: 4 checks — 2 FAIL, 2 PASS

Failed checks (2):
  ❌ FAIL: execution_quality=degraded
    WARN [risk]: response.alignment_score = 0.0
      → output alignment with user intent is weak
  ❌ FAIL: execution_quality=degraded
    FAILED: incorrect_output [retryable]
      → In production, consider create_health_check() for automatic recovery
    WARN [risk]: response.alignment_score = 0.12
      → output alignment with user intent is weak

Failure patterns are tagged [retryable] or [structural]. Retryable failures can be automatically recovered in production using create_health_check(). Structural failures require prompt or configuration changes.


Non-determinism

LLM agents are non-deterministic. The same test may produce different results across runs. This is expected behavior, not a plugin bug.

Recommendations for stable CI:

  • Set temperature=0 in your agent configuration
  • Use model-specific seed parameters where available
  • Use agent_health.compare() to explicitly test for consistency
  • Accept that some flakiness is inherent to LLM-based systems
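Outside of agent_health.compare(), a rough sense of run-to-run agreement can be computed over any hashable summary of the agent's outputs. This stdlib-only sketch is a hypothetical illustration, not how the plugin computes root_cause_agreement:

```python
from collections import Counter

def agreement(results: list[str]) -> float:
    # Fraction of runs that agree with the most common result.
    if not results:
        return 0.0
    (_, count), = Counter(results).most_common(1)
    return count / len(results)

# Three runs, two of which agree:
print(round(agreement(["$4.2M", "$4.2M", "$4.1M"]), 2))  # -> 0.67
```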

Related

Package                 Role
agent-failure-debugger  Diagnosis engine (this plugin wraps it)
llm-failure-atlas       Failure patterns, signals, matcher
create_health_check()   Runtime self-healing (complementary to CI)

CI detects. Production heals. Use pytest-agent-health to catch failures in CI, and create_health_check() to automatically recover in production.


License

MIT License.
