Diagnose why your LLM agent failed. Deterministic causal analysis with fix generation.
agent-failure-debugger
Diagnoses agent execution behavior — not just what failed, but why, and whether execution quality is healthy, degraded, or failed. Deterministic causal analysis with fix generation.
pip install agent-failure-debugger
from agent_failure_debugger import diagnose
result = diagnose(raw_log, adapter="langchain")
print(result["summary"]["execution_quality"]["status"]) # healthy / degraded / failed
print(result["explanation"]["context_summary"])
Use the Debugger
Call diagnose() after every agent run. It returns execution quality (healthy, degraded, or failed), root cause analysis when failures are detected, and fix proposals.
result = diagnose(raw_log, adapter="langchain")
status = result["summary"]["execution_quality"]["status"]
# In CI/CD or automated pipelines:
assert status != "failed", f"Agent execution failed: {result['summary']['root_cause']}"
When the agent runs normally, you get healthy with confidence scores and grounding state. When something goes wrong, you get the root cause, causal path, and a fix proposal — without changing how you call the tool.
Three ways to use it:
- Failure diagnosis: an agent broke and you need to know why. diagnose() returns root cause, causal path, explanation, and a fix proposal. This is the core use case.
- Health check: call diagnose() after every run and check execution_quality.status. Healthy runs return healthy; degraded quality (weak grounding, redundant tool results, low alignment) is surfaced before it becomes a failure. Track degraded frequency over time to catch regressions early.
- Run comparison: the same prompt produces different results across runs. compare_runs() measures stability; diff_runs() identifies what structurally separates successful runs from failed ones.
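Tracking degraded frequency over time can be sketched as a small aggregation over stored results. This is a minimal sketch, assuming each stored result has the result["summary"]["execution_quality"]["status"] shape shown above. The helper name status_rates and the sample data are hypothetical and not part of the package.

```python
from collections import Counter


def status_rates(results):
    """Return the share of runs per execution-quality status."""
    counts = Counter(
        r["summary"]["execution_quality"]["status"] for r in results
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: count / total for status, count in counts.items()}


# Hand-built stand-ins for stored diagnose() results (hypothetical data).
history = [
    {"summary": {"execution_quality": {"status": "healthy"}}},
    {"summary": {"execution_quality": {"status": "healthy"}}},
    {"summary": {"execution_quality": {"status": "degraded"}}},
    {"summary": {"execution_quality": {"status": "failed"}}},
]

rates = status_rates(history)
print(rates.get("degraded", 0.0))  # share of degraded runs in this sample
```

A rising degraded share between releases is the regression signal the health-check use case describes; where the history is stored is left to the caller.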
Atlas (the llm-failure-atlas package) detects failures; the debugger explains why they happened and proposes fixes. You can use Atlas alone for detection, but diagnosis requires the debugger.
From a raw log (simplest)
from agent_failure_debugger import diagnose
# Example: LangChain agent trace (no tool data)
raw_log = {
"steps": [
{"type": "llm", "output": "The Q4 revenue was $4.2M, up 31% year-over-year."}
],
"tool_calls": [],
}
result = diagnose(raw_log, adapter="langchain")
print(result["summary"])
# → {'root_cause': '...', 'failure_count': ..., 'gate_mode': '...', ...}
print(result["explanation"]["context_summary"])
# → describes what happened and why
raw_log is a loosely structured dict — its format depends on the source. The adapter normalizes it into the telemetry format Atlas expects. The more structured and complete the log (especially tool calls and outputs), the more accurate the diagnosis. Minimal logs may result in incomplete or degraded analysis.
diagnose() runs the full pipeline in one call: adapt → detect (via Atlas) → diagnose → explain. Atlas is installed automatically as a dependency. Output quality depends on the input log: incomplete telemetry silently degrades detection and diagnosis.
Which adapter to use:
Adapters normalize raw logs from different sources into Atlas's telemetry format.
| Adapter | Use for |
|---|---|
| langchain | LangChain / LangGraph traces |
| langsmith | LangSmith run-tree exports |
| crewai | CrewAI crew execution logs |
| redis_help_demo | Redis workshop Help Center |
If unsure: use "langchain" for agent traces, "redis_help_demo" for the Redis workshop demo. For the JSON format each adapter expects, see Adapter Formats.
Note: crewai and redis_help_demo adapters do not yet produce state or grounding telemetry. Some failure patterns (e.g., agent_tool_call_loop) may not fire through these adapters. See the Atlas adapter verification status for details.
CLI:
# From a raw log (full pipeline)
python -m agent_failure_debugger.diagnose log.json --adapter langchain
# From matcher output (diagnosis only)
python -m agent_failure_debugger.main matcher_output.json
From matcher output (direct)
from agent_failure_debugger.pipeline import run_pipeline
result = run_pipeline(
matcher_output,
use_learning=True,
include_explanation=True,
)
print(result["summary"]["root_cause"])
print(result["explanation"]["interpretation"])
print(result["explanation"]["risk"]["level"])
Use this when you already have matcher output, or when building a custom adapter.
From a live agent (via Atlas watch)
Atlas's watch() wraps a LangGraph agent and runs the debugger pipeline on completion. It is a separate entry point from diagnose() — both produce the same pipeline output but from different starting points: watch() captures telemetry from a live execution, while diagnose() accepts a raw log after the fact.
If you use llm-failure-atlas for detection, watch() runs the debugger automatically:
from llm_failure_atlas.adapters.callback_handler import watch
graph = watch(workflow.compile(), auto_diagnose=True, auto_pipeline=True)
result = graph.invoke({"messages": [...]})
# → detection + debugger pipeline + explanation printed automatically
For a copy-paste example without an API key, see Reproducible Examples below.
Self-healing agent (LangGraph)
Add automatic failure detection and informed retry to any LangGraph agent. When the health check detects a retryable failure, it injects the diagnosis into the conversation — the LLM reads why it failed and adjusts its approach. This is not a blind retry.
from agent_failure_debugger import create_health_check
from langgraph.graph import StateGraph, MessagesState, START, END
health_check, route = create_health_check(max_retries=2)
workflow = StateGraph(MessagesState)
workflow.add_node("agent", agent_node)
workflow.add_node("tools", tool_node)
workflow.add_node("health_check", health_check)
workflow.add_edge(START, "agent")
workflow.add_conditional_edges("agent", should_continue,
{"tools": "tools", "check": "health_check"})
workflow.add_edge("tools", "agent")
workflow.add_conditional_edges("health_check", route,
{"retry": "agent", "end": END})
On retry, the health check appends a message like: "Previous attempt status: failed. The tool may have experienced a transient error that has since resolved. Please call the tool again." — the LLM reads this and retries the tool.
Not all failures benefit from retry. The integration classifies all 17 Atlas patterns as either retryable (transient errors, LLM non-determinism) or structural (bad prompts, config issues). Structural failures are reported immediately without wasting retries. See examples/self_healing/ for a working demo validated across GPT, Claude, and Gemini.
Quick Start
pip install agent-failure-debugger
Healthy run
from agent_failure_debugger import diagnose
raw_log = {
"inputs": {"query": "What was Q3 revenue?"},
"outputs": {"response": "Q3 revenue was $4.2M based on the latest earnings report."},
"steps": [
{"type": "tool", "name": "search_earnings", "inputs": {"quarter": "Q3"},
"outputs": {"revenue": "$4.2M", "source": "10-Q filing"}, "error": None},
{"type": "llm", "outputs": {"text": "Q3 revenue was $4.2M based on the latest earnings report."}}
]
}
result = diagnose(raw_log, adapter="langchain")
print(result["summary"]["execution_quality"]["status"]) # healthy
print(result["summary"]["failure_count"]) # 0
The tool returns a result on every run. When the agent is healthy, you get confirmation — not silence.
Degraded run
from agent_failure_debugger import diagnose
raw_log = {
"inputs": {"query": "Change my flight to tomorrow morning"},
"outputs": {"response": "I've found several hotels near the airport for you."},
"steps": [
{"type": "llm", "outputs": {"text": "Let me check available flights."}},
{"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
"outputs": {"flights": []}, "error": None},
{"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
"outputs": {"flights": []}, "error": None},
{"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
"outputs": {"flights": []}, "error": None},
{"type": "llm", "outputs": {"text": "I've found several hotels near the airport."}}
],
"feedback": {"user_correction": "I asked about flights, not hotels."}
}
result = diagnose(raw_log, adapter="langchain")
print(result["summary"]["root_cause"]) # incorrect_output
print(result["summary"]["execution_quality"]["status"]) # degraded
print(result["explanation"]["context_summary"])
# → "Root cause identified: the system produced output misaligned with
# user intent, requiring correction (confidence: 0.625)."
print(result["explanation"]["risk"]["level"]) # low
print(result["summary"]["fix_count"]) # 1
Same function, same interface. The difference is in the input, not in how you call the tool.
From matcher output (advanced)
If you already have matcher output (e.g., from a custom integration):
from agent_failure_debugger.pipeline import run_pipeline
result = run_pipeline(matcher_output, use_learning=True)
print(result["summary"])
See Quick Start Guide for more usage patterns including watch(), multi-run analysis, and direct telemetry.
Common Mistakes
| Problem | Cause | Fix |
|---|---|---|
| "0 failures detected" | Adapter got insufficient data | Provide complete trace with tool calls |
| Wrong results | Input format doesn't match adapter | See Adapter Formats |
| Pattern doesn't fire | Adapter doesn't produce required fields | Check Adapter Coverage |
⚠ No error is raised for wrong inputs. The system silently returns zero failures if the adapter cannot extract signals.
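Because a sparse log produces a quiet "0 failures" rather than an error, a pre-check on the input can help separate "healthy" from "nothing to analyze". The sketch below is an assumption-laden illustration: it relies on the "steps", "type", and "tool_calls" keys used in the langchain-style examples in this README, and the helper has_tool_evidence is hypothetical, not part of the package. Other adapters may expect different shapes.

```python
def has_tool_evidence(raw_log):
    """True if the log contains at least one tool step or tool call."""
    steps = raw_log.get("steps", [])
    has_tool_step = any(step.get("type") == "tool" for step in steps)
    has_tool_call = bool(raw_log.get("tool_calls"))
    return has_tool_step or has_tool_call


# A log with only LLM text: a zero-failure result here says little.
sparse_log = {
    "steps": [{"type": "llm", "output": "The Q4 revenue was $4.2M."}],
    "tool_calls": [],
}

# A log with a tool step: the adapter has more to work with.
rich_log = {
    "steps": [
        {"type": "tool", "name": "search_earnings",
         "inputs": {"quarter": "Q3"},
         "outputs": {"revenue": "$4.2M"}, "error": None},
    ],
}

for name, log in (("sparse", sparse_log), ("rich", rich_log)):
    note = "ok" if has_tool_evidence(log) else "treat 0 failures with caution"
    print(name, note)
```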
This Tool Cannot
- Verify factual correctness of agent responses
- Detect semantic mismatch (requires embeddings)
- Analyze multi-agent system coordination
See Limitations & FAQ for details.
API Details
Execution quality
Every diagnose() and run_pipeline() result includes execution quality assessment — this is what makes the tool useful on every run, not just when failures occur.
eq = result["summary"]["execution_quality"]
print(eq["status"]) # "healthy" | "degraded" | "failed"
print(eq["termination"]["mode"]) # "normal" | "silent_exit" | "error_exit" | "partial_exit" | "unknown"
print(eq["indicators"]) # list of degradation concerns (empty if healthy)
print(eq["summary"]) # one-line human-readable assessment
- healthy — no significant issues detected
- degraded — output may have been produced but quality indicators are weak (low alignment, weak grounding, redundant tool results, unmodeled failures)
- failed — execution did not produce usable output (silent exit or error)
Degradation indicators include: low alignment score (< 0.5), tools called but no usable data returned, high expansion ratio without uncertainty disclosure (> 3.0), low tool result diversity (< 0.5 across 2+ calls — tools returned identical results), low observation coverage, and unmodeled or conflicting failure signals.
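The numeric thresholds above can be illustrated with a short sketch. It is not the package's implementation: the metric names (alignment_score, expansion_ratio, uncertainty_disclosed, tool_call_count, tool_result_diversity) and the indicator labels are hypothetical, and only the three thresholds stated in the text are covered.

```python
def degradation_indicators(metrics):
    """Apply the documented thresholds to a dict of (hypothetical) metrics."""
    indicators = []
    # Low alignment score (< 0.5).
    if metrics["alignment_score"] < 0.5:
        indicators.append("low_alignment")
    # High expansion ratio (> 3.0) without uncertainty disclosure.
    if metrics["expansion_ratio"] > 3.0 and not metrics["uncertainty_disclosed"]:
        indicators.append("high_expansion_without_disclosure")
    # Low tool result diversity (< 0.5) across 2+ calls.
    if metrics["tool_call_count"] >= 2 and metrics["tool_result_diversity"] < 0.5:
        indicators.append("low_tool_result_diversity")
    return indicators


# Three identical tool results and a misaligned answer (made-up numbers).
degraded = degradation_indicators({
    "alignment_score": 0.4,
    "expansion_ratio": 1.2,
    "uncertainty_disclosed": False,
    "tool_call_count": 3,
    "tool_result_diversity": 0.33,
})

# A single useful tool call and a well-aligned answer.
healthy = degradation_indicators({
    "alignment_score": 0.9,
    "expansion_ratio": 1.0,
    "uncertainty_disclosed": False,
    "tool_call_count": 1,
    "tool_result_diversity": 1.0,
})

print(degraded)
print(healthy)
```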
Execution quality uses existing telemetry and diagnosis results. No new matcher patterns are added.
Multi-run analysis
When the same prompt produces different results across runs, you need to know whether the agent is unstable and what causes the divergence.
from agent_failure_debugger import compare_runs, diff_runs
# Step 1: Is the agent stable across runs?
stability = compare_runs(all_run_results)
print(stability["stability"]["root_cause_agreement"]) # 1.0 = fully stable
print(stability["interpretation"])
# Step 2: What separates success from failure?
diff = diff_runs(success_runs, failure_runs)
print(diff["hypothesis"])
print(diff["failure_set_diff"]["failure_only"]) # patterns only in failures
print(diff["causal_path_diff"]) # where paths diverge
compare_runs() measures stability — whether the same task produces consistent diagnoses across runs. diff_runs() identifies divergence — what structural differences separate successful runs from failed ones. Together they answer "is this agent reliable, and if not, why does it sometimes fail?"
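To build intuition for root_cause_agreement, one plausible reading is "the share of runs that report the most common root cause". The sketch below implements that reading only; it is an assumption for illustration, not how compare_runs() is known to compute the value, and the run data is made up.

```python
from collections import Counter


def root_cause_agreement(results):
    """Share of runs whose root cause equals the most common one."""
    roots = [r["summary"]["root_cause"] for r in results]
    if not roots:
        return 0.0
    most_common_count = Counter(roots).most_common(1)[0][1]
    return most_common_count / len(roots)


# Four hypothetical runs of the same prompt: three agree, one is clean.
runs = [
    {"summary": {"root_cause": "incorrect_output"}},
    {"summary": {"root_cause": "incorrect_output"}},
    {"summary": {"root_cause": "incorrect_output"}},
    {"summary": {"root_cause": None}},
]

agreement = root_cause_agreement(runs)
print(agreement)
```

Under this reading, a value of 1.0 means every run produced the same diagnosis, which matches the "fully stable" comment in the example above.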
For runnable examples with expected output, see examples/multi_run_stability (compare_runs → diff_runs workflow) and examples/termination_divergence (same root cause, different exit modes).
Enhanced explanation
expl = result["explanation"]
print(expl["context_summary"]) # what happened
print(expl["interpretation"]) # why it happened
print(expl["risk"]["level"]) # HIGH / MEDIUM / LOW
print(expl["recommendation"]) # what to do
print(expl["observation"]) # signal coverage info
When observation coverage is low (many signals were not observed), the risk level is automatically raised and the interpretation notes that the diagnosis may be incomplete.
CLI: python -m agent_failure_debugger.explain --enhanced debugger_output.json
Individual steps
from agent_failure_debugger.pipeline import run_diagnosis, run_fix
diag = run_diagnosis(matcher_output)
fix_result = run_fix(diag, use_learning=True, top_k=2)
External evaluation
from agent_failure_debugger.pipeline import run_pipeline
def my_staging_test(bundle):
    # Fixes the pipeline proposes for this run
    fixes = bundle["autofix"]["recommended_fixes"]
    # apply fixes in your staging env
    return {
        "success": True,
        "failure_count": 0,
        "root": None,
        "has_hard_regression": False,
        "notes": "passed staging tests",
    }
result = run_pipeline(
    matcher_output,
    auto_apply=True,
    evaluation_runner=my_staging_test,
)
If evaluation_runner is not provided, the built-in counterfactual simulation is used. If the runner raises an exception, the pipeline falls back to staged_review deterministically.
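The fallback behavior can be pictured as a try/except around the runner. This is a hypothetical sketch of the idea, not the pipeline's code: the wrapper name evaluate_with_fallback and the shape of its return value are assumptions.

```python
def evaluate_with_fallback(runner, bundle):
    """Call the runner; on any exception, fall back to staged_review."""
    try:
        return {"gate_mode": "auto_apply", "evaluation": runner(bundle)}
    except Exception as exc:
        # The same failure always yields the same fallback decision.
        return {"gate_mode": "staged_review", "evaluation": None,
                "error": str(exc)}


def passing_runner(bundle):
    return {"success": True, "failure_count": 0, "root": None,
            "has_hard_regression": False, "notes": "passed staging tests"}


def broken_runner(bundle):
    raise RuntimeError("staging environment unreachable")


ok = evaluate_with_fallback(passing_runner, {"autofix": {}})
fallback = evaluate_with_fallback(broken_runner, {"autofix": {}})
print(ok["gate_mode"], fallback["gate_mode"])
```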
For real-world interpretation examples — including before/after fix effects — see Applied Debugging Examples and Operational Playbook in the Atlas repository.
Input Format
A JSON array of failure results from the matcher. Each entry needs failure_id, diagnosed, and confidence:
[
{
"failure_id": "premature_model_commitment",
"diagnosed": true,
"confidence": 0.7,
"signals": {
"ambiguity_without_clarification": true,
"assumption_persistence_after_correction": true
}
}
]
The pipeline validates input at entry and rejects malformed data with clear error messages.
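If you build matcher output yourself (for example in a custom adapter), a pre-flight check along these lines may catch problems before the pipeline does. It is a sketch under stated assumptions: the function validate_matcher_output is hypothetical, the required keys come from the description above, and the 0 to 1 confidence range is assumed rather than documented.

```python
REQUIRED_KEYS = ("failure_id", "diagnosed", "confidence")


def validate_matcher_output(entries):
    """Raise ValueError if entries do not match the documented shape."""
    if not isinstance(entries, list):
        raise ValueError("matcher output must be a JSON array")
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {i} must be an object")
        missing = [k for k in REQUIRED_KEYS if k not in entry]
        if missing:
            raise ValueError(f"entry {i} is missing: {', '.join(missing)}")
        if not isinstance(entry["diagnosed"], bool):
            raise ValueError(f"entry {i}: diagnosed must be true or false")
        conf = entry["confidence"]
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        # Assumption: confidence is a probability-like value in [0, 1].
        if not is_number or not 0.0 <= conf <= 1.0:
            raise ValueError(f"entry {i}: confidence must be in [0, 1]")
    return entries


valid = [{
    "failure_id": "premature_model_commitment",
    "diagnosed": True,
    "confidence": 0.7,
    "signals": {"ambiguity_without_clarification": True},
}]
validate_matcher_output(valid)

try:
    validate_matcher_output([{"failure_id": "premature_model_commitment"}])
except ValueError as exc:
    print(exc)
```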
Output Format
{
"root_candidates": ["premature_model_commitment"],
"root_ranking": [{"id": "premature_model_commitment", "score": 0.85}],
"failures": [
{"id": "premature_model_commitment", "confidence": 0.7},
{"id": "semantic_cache_intent_bleeding", "confidence": 0.7,
"caused_by": ["premature_model_commitment"]}
],
"causal_paths": [
["premature_model_commitment", "semantic_cache_intent_bleeding", "rag_retrieval_drift"]
]
}
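A short sketch of reading this structure: pick the highest-scoring root and print each causal path as a chain. The data is the example output above; the helper name describe is hypothetical.

```python
diagnosis = {
    "root_candidates": ["premature_model_commitment"],
    "root_ranking": [{"id": "premature_model_commitment", "score": 0.85}],
    "failures": [
        {"id": "premature_model_commitment", "confidence": 0.7},
        {"id": "semantic_cache_intent_bleeding", "confidence": 0.7,
         "caused_by": ["premature_model_commitment"]},
    ],
    "causal_paths": [
        ["premature_model_commitment", "semantic_cache_intent_bleeding",
         "rag_retrieval_drift"],
    ],
}


def describe(diagnosis):
    """Return one line for the top-ranked root and one per causal path."""
    top = max(diagnosis["root_ranking"], key=lambda r: r["score"])
    lines = [f"root: {top['id']} (score {top['score']})"]
    for path in diagnosis["causal_paths"]:
        lines.append(" -> ".join(path))
    return lines


for line in describe(diagnosis):
    print(line)
```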
Auto-Apply Gate
| Score | Mode | Behavior |
|---|---|---|
| >= 0.85 | auto_apply | Apply, evaluate, keep or rollback |
| 0.65-0.85 | staged_review | Write to patches/, await human approval |
| < 0.65 | proposal_only | Present fix proposal only |
Hard blockers (force proposal_only regardless of score):
- safety != "high"
- review_required == true
- fix_type == "workflow_patch"
- Execution plan has conflicts or failed validation
- grounding_gap_not_acknowledged signal active
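The score thresholds and hard blockers above combine into a single decision, sketched below. This is an illustration rather than the code in auto_apply.py: the function name and its parameters (plan_ok, grounding_gap_active) are hypothetical, the fix is modeled as a plain dict with the safety, review_required, and fix_type fields named above, and "prompt_patch" is a made-up fix type.

```python
def gate_mode(score, fix, plan_ok=True, grounding_gap_active=False):
    """Map a score and a fix description to a gate mode."""
    blocked = (
        fix.get("safety") != "high"
        or fix.get("review_required") is True
        or fix.get("fix_type") == "workflow_patch"
        or not plan_ok               # plan has conflicts or failed validation
        or grounding_gap_active      # grounding_gap_not_acknowledged signal
    )
    # Hard blockers force proposal_only regardless of score.
    if blocked or score < 0.65:
        return "proposal_only"
    if score >= 0.85:
        return "auto_apply"
    return "staged_review"


safe_fix = {"safety": "high", "review_required": False,
            "fix_type": "prompt_patch"}  # hypothetical fix type
workflow_fix = {"safety": "high", "review_required": False,
                "fix_type": "workflow_patch"}

print(gate_mode(0.90, safe_fix))      # high score, no blockers
print(gate_mode(0.70, safe_fix))      # mid score, no blockers
print(gate_mode(0.90, workflow_fix))  # high score, hard blocker
```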
Fix Safety
Fixes are generated from predefined templates, not learned behavior. They are deterministic and reproducible, but not guaranteed to be correct — some fixes may introduce regressions in complex workflows.
Safety mechanisms: the confidence gate prevents low-evidence fixes from auto-apply, hard blockers prevent unsafe categories of changes, the evaluation runner validates fixes before acceptance, and rollback is triggered automatically if evaluation fails.
Always review or evaluate fixes before applying in production environments.
Automation Guidance
| Environment | Recommended mode | Notes |
|---|---|---|
| Development | auto_apply | Iterate quickly, evaluate fixes automatically |
| Staging | staged_review | Use evaluation_runner to validate before applying |
| Production | proposal_only | Human approval required, avoid auto_apply |
The debugger is designed for assisted decision-making, not fully autonomous system modification.
Pipeline Steps
matcher_output.json
→ pipeline.py (orchestrator)
├ main.py causal resolution + root ranking
├ abstraction.py top-k path selection (optional)
├ decision_support.py priority scoring + action plan
├ autofix.py fix selection + patch generation
├ auto_apply.py confidence gate + reason_code
├ pipeline_post_apply.py evaluation runner or counterfactual
├ pipeline_summary.py summary + execution quality assessment
├ execution_quality.py healthy/degraded/failed classification
└ explainer.py explanation (context + risk + observation)
File Structure
| File | Role |
|---|---|
| diagnose.py | Single entry point: raw log → full diagnosis |
| pipeline.py | Pipeline orchestrator (from matcher output) |
| pipeline_post_apply.py | Post-apply evaluation (runner + counterfactual) |
| pipeline_summary.py | Summary generation |
| main.py | CLI entry point for diagnosis only (from matcher output) |
| config.py | Paths, weights, thresholds |
| graph_loader.py | Load failure_graph.yaml |
| causal_resolver.py | Normalize, find roots, build paths, rank |
| formatter.py | Path scoring + conflict resolution |
| labels.py | SIGNAL_MAP (34) + FAILURE_MAP (17) |
| explainer.py | Deterministic + optional LLM explanation |
| explain.py | CLI for explanation generation (--enhanced, --deterministic) |
| decision_support.py | Failure to action mapping |
| autofix.py | Fix selection + patch generation |
| fix_templates.py | 17 fix definitions (14 domain + 3 meta) |
| auto_apply.py | Confidence gate + auto-apply |
| execute_fix.py | Dependency ordering + staged apply |
| evaluate_fix.py | Counterfactual simulation |
| policy_loader.py | Read-only learning store access |
| reliability.py | Cross-run stability and differential analysis |
| execution_quality.py | Single-run execution behavior assessment |
| integrations/langgraph.py | LangGraph self-healing health check node |
Examples
| Directory | Demonstrates |
|---|---|
| examples/self_healing/ | create_health_check(): LangGraph self-healing with informed retry across 3 models |
| examples/termination_divergence/ | diff_runs(): same root cause, different termination modes |
| examples/multi_run_stability/ | compare_runs() → diff_runs(): two-step stability and divergence workflow |
Graph Source
The canonical failure_graph.yaml is bundled in the llm-failure-atlas package. The debugger loads the graph automatically via the Atlas package.
from agent_failure_debugger.config import GRAPH_PATH
print(GRAPH_PATH) # shows which graph is loaded
Configuration
| Variable | Default | Description |
|---|---|---|
| LLM_FAILURE_ATLAS_GRAPH_PATH | Bundled in package | Override graph location |
| LLM_FAILURE_ATLAS_PATTERNS_DIR | Bundled in package | Override patterns directory |
| LLM_FAILURE_ATLAS_LEARNING_DIR | Bundled in package | Override learning store |
All scoring weights and gate thresholds are in config.py.
Design Principles
- Deterministic — same matcher output, same root cause, same fix, same gate decision
- Graph is for interpretation only — not used during detection
- Signal names are contracts — no redefinition allowed
- Learning is suggestion-only — structure is never auto-modified
- Fail fast on invalid input — pipeline validates at entry
- Enhanced explanations: include_explanation=True adds context, interpretation, risk, and recommendation
Related Repositories
| Repository | Role |
|---|---|
| llm-failure-atlas | Failure patterns, causal graph, matcher, adapters |
| agent-pld-metrics | Behavioral stability framework (PLD) |
Reproducible Examples
Healthy run (copy-paste-run, no API key needed):
pip install agent-failure-debugger
from agent_failure_debugger import diagnose
raw_log = {
"inputs": {"query": "What was Q3 revenue?"},
"outputs": {"response": "Q3 revenue was $4.2M based on the latest earnings report."},
"steps": [
{"type": "tool", "name": "search_earnings", "inputs": {"quarter": "Q3"},
"outputs": {"revenue": "$4.2M", "source": "10-Q filing"}, "error": None},
{"type": "llm", "outputs": {"text": "Q3 revenue was $4.2M based on the latest earnings report."}}
]
}
result = diagnose(raw_log, adapter="langchain")
print(result["summary"]["execution_quality"]["status"]) # healthy
print(result["summary"]["failure_count"]) # 0
Degraded run (copy-paste-run):
raw_log = {
"inputs": {"query": "Change my flight to tomorrow morning"},
"outputs": {"response": "I've found several hotels near the airport for you."},
"steps": [
{"type": "llm", "outputs": {"text": "Let me check available flights."}},
{"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
"outputs": {"flights": []}, "error": None},
{"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
"outputs": {"flights": []}, "error": None},
{"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
"outputs": {"flights": []}, "error": None},
{"type": "llm", "outputs": {"text": "I've found several hotels near the airport."}}
],
"feedback": {"user_correction": "I asked about flights, not hotels."}
}
result = diagnose(raw_log, adapter="langchain")
print(result["summary"]["root_cause"])
print(result["summary"]["execution_quality"]["status"])
# → root cause + execution quality (degraded)
With a live agent (requires langchain-core and langgraph):
pip install agent-failure-debugger[langchain] langgraph
from langchain_core.language_models import FakeListLLM
from langchain_core.messages import HumanMessage, AIMessage
from langgraph.graph import StateGraph, MessagesState, START, END
from llm_failure_atlas.adapters.callback_handler import watch
llm = FakeListLLM(responses=[
"The revenue was $4.2M in Q3 2024, representing 31% year-over-year "
"growth. The Asia-Pacific segment contributed 45% of total revenue. "
"Operating margins expanded to 19.3% across all regions."
])
def agent(state: MessagesState):
    return {"messages": [AIMessage(content=llm.invoke(state["messages"]))]}
workflow = StateGraph(MessagesState)
workflow.add_node("agent", agent)
workflow.add_edge(START, "agent")
workflow.add_edge("agent", END)
graph = watch(workflow.compile(), auto_diagnose=True)
graph.invoke({"messages": [HumanMessage(content="What was Q3 revenue?")]})
Note: watch() with FakeListLLM demonstrates the callback integration but may not trigger failure patterns — the fake LLM produces no tool calls or user corrections. For failure detection examples, use diagnose() with the raw log above.
Regression test examples:
12 examples in llm-failure-atlas under examples/ (10 agent + 2 non-LLM). Each contains log.json, matcher_output.json, and expected_debugger_output.json.
python -m agent_failure_debugger.main matcher_output.json
Multi-run analysis examples:
2 examples in this repository under examples/. Each contains input fixtures, a runnable script, and expected_output.json:
- termination_divergence: diff_runs() comparing silent exit vs error exit
- multi_run_stability: compare_runs() → diff_runs() two-step workflow
Internals
Root ranking formula:
score = 0.5 * confidence + 0.3 * normalized_downstream + 0.2 * (1 - normalized_depth)
More downstream impact ranks higher, even with lower confidence. This reflects causal priority, not detection confidence alone.
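A worked example of the formula shows how downstream impact can outweigh confidence. The sketch takes the two normalized terms as inputs in the range 0 to 1; how the package normalizes downstream count and depth is not shown here, and the two candidates are made up.

```python
def root_score(confidence, normalized_downstream, normalized_depth):
    """Root ranking score, using the weights from the formula above."""
    return (
        0.5 * confidence
        + 0.3 * normalized_downstream
        + 0.2 * (1 - normalized_depth)
    )


# Candidate A: detected with high confidence, but nothing depends on it.
isolated = root_score(confidence=0.9, normalized_downstream=0.0,
                      normalized_depth=0.0)

# Candidate B: lower confidence, but the most downstream failures.
upstream = root_score(confidence=0.6, normalized_downstream=1.0,
                      normalized_depth=0.0)

print(round(isolated, 2), round(upstream, 2))
print("B ranks above A:", upstream > isolated)
```

Candidate A scores 0.45 + 0.00 + 0.20 and candidate B scores 0.30 + 0.30 + 0.20, so B ranks first despite the lower confidence.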
This tool implements a single control step within the PLD loop: post-incident causal analysis and intervention decision.
License
MIT License. See LICENSE.