Local-first failure diagnosis for AI browser automation, Playwright, crawler, and RPA runs.

Agent Failure Doctor

Chinese documentation

CI | License: MIT | Python 3.10+

Local-first failure diagnosis lifecycle tool for AI browser automation, Playwright, crawler, RPA, and business automation failures.

  • Current milestone: Agent Failure Doctor v3.2 Auto Collector P98 Gate.
  • Previous stable line: Agent Failure Doctor v3.1.0 P98 Master Gate.
  • Previous P95 stable line: Agent Failure Doctor v2.4.1 P95 Alignment & Missing Tracks Pack.

Input: trace.zip / error.log / console.txt / network.json / screenshot metadata / user_description.txt

Output: diagnosis, evidence, next action, repair suggestions, GitHub issue draft, Codex fix prompt.

Quickstart

git clone https://github.com/tobybgy-lsd/web-agent-runtime-bench.git
cd web-agent-runtime-bench
python -m pip install -e .
failure-doctor diagnose .\examples\failed_runs\proxy_network_error --out .\report
failure-doctor plan .\report --out .\fix_plan
failure-doctor collect --project . --preset auto --out .\failure_doctor_auto_report `
  --auto-diagnose --auto-handoff --auto-sanitize
failure-doctor agent-bootstrap --target all --project .

See validation/dashboard.md, docs/P98_LIMITS.md, docs/AGENT_FRONTEND_INVOCATION.md, and docs/safety_boundary.md.

P98 master gate passed with the auto collector pillar included.

Advanced commands include failure-doctor handoff, failure-doctor agent-bootstrap, failure-doctor propose-patch, and failure-doctor batch.

Core commands: collect / diagnose / plan / verify / run / watch / sanitize / adapt / handoff / agent-bootstrap / propose-patch / batch

Classic lifecycle: diagnose / plan / verify / run / sanitize / adapt -> diagnose -> plan -> AI handoff / patch proposal -> verify -> sanitize/share

P98 gate: knowledge base -> coverage matrix -> trace/cross-framework/training/composite/handoff/batch/sanitize/auto-collector -> master gate

Distribution & Feedback

v3.2.0 is the current stable technical baseline. The next phase is distribution and real user feedback, not more synthetic feature expansion.

The package is published on PyPI as agent-failure-doctor. Install it with:

pip install agent-failure-doctor

For non-technical Windows users, double-click scripts/windows/Start-FailureDoctor-Diagnosis.bat or drag a failed project folder onto it.

Advanced v3.2 commands include failure-doctor collect and failure-doctor watch.

Agent frontend invocation:

failure-doctor agent-bootstrap --target all --project .

This writes .failure-doctor/AGENT_ENTRYPOINT.md plus Codex, Cursor, Claude Code, VS Code/Copilot, Antigravity, OpenCode, Qoder, Trae, WorkBuddy, OpenClaw, Hermes, and generic agent workflow instructions.

Agent Failure Doctor uses a deterministic evidence-based diagnostic engine. It does not claim to solve arbitrary failures, but it provides explainable classification, evidence, fix plans, and before/after verification for known automation failure patterns.

Applied scenario demos are local-only mock workflows for commerce automation, live monitoring, content publishing, GUI data bridge, and ERP sync failure diagnosis.

Spiderbuf-inspired challenge demos are local-only mock failure packs inspired by public crawler-training challenge categories; they validate diagnosis and safe next actions without accessing spiderbuf.cn or publishing private solution logic.

Integration commands: failure-doctor collect-playwright / failure-doctor pack-logs / failure-doctor adapt

What You Get

report/
|-- diagnosis.json
|-- diagnosis.md
|-- evidence.json
|-- input_summary.json
|-- issue_draft.md
|-- repair_suggestions.md
|-- codex_fix_prompt.md
`-- failure_doctor_report.zip

Agent Failure Doctor turns sanitized automation failure materials into a report that explains what likely failed, what evidence supports the diagnosis, what evidence is missing, and what to ask Codex or another coding assistant to change next.

One-Minute Start

Auto Capture:

failure-doctor run -- python crawler.py
failure-doctor run -- pytest tests/test_listing.py
failure-doctor run -- playwright test

This writes a local run folder under .failure-doctor/runs/<run_id>/:

.failure-doctor/runs/<run_id>/
|-- command.txt
|-- exit_code.txt
|-- stdout.log
|-- stderr.log
|-- environment.json
|-- detected_artifacts.json
|-- input_summary.json
|-- diagnosis/
|-- fix_plan/
|-- verification_hint.md
`-- shareable_failure_pack.zip

The generated safe_to_share.json defaults to safe_to_share=false; review and sanitize before sending a pack to anyone else.
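As a rough illustration of what an auto-capture wrapper does, the sketch below runs a command and writes four of the files listed above (command.txt, exit_code.txt, stdout.log, stderr.log). It is not the project's implementation: the function name capture_run and the demo run id are hypothetical, and the real failure-doctor run also writes environment, artifact, and diagnosis data that this sketch omits.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def capture_run(command: list[str], run_dir: Path) -> int:
    """Run a command and store its output under run_dir.

    Hypothetical helper: only the file names come from the run folder
    layout shown above; everything else is an assumption.
    """
    run_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(command, capture_output=True, text=True)
    (run_dir / "command.txt").write_text(" ".join(command), encoding="utf-8")
    (run_dir / "exit_code.txt").write_text(str(result.returncode), encoding="utf-8")
    (run_dir / "stdout.log").write_text(result.stdout, encoding="utf-8")
    (run_dir / "stderr.log").write_text(result.stderr, encoding="utf-8")
    return result.returncode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # "demo_run" stands in for <run_id>; the real id format is not documented here.
        run_dir = Path(tmp) / ".failure-doctor" / "runs" / "demo_run"
        code = capture_run([sys.executable, "-c", "raise SystemExit(2)"], run_dir)
        print(code, sorted(p.name for p in run_dir.iterdir()))
```

Keeping the exit code and both output streams in separate files lets a later diagnosis step read them without re-running the command.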

Sanitize & Share Pack:

Sanitize a failed run before sharing it:

failure-doctor sanitize .\.failure-doctor\runs\<run_id> --out .\shareable_failure_pack

This writes redacted logs, redacted network summaries, trace metadata only, a redaction report, a review gate, and shareable_failure_pack.zip.

Raw trace.zip archives are not copied into the sanitized pack.
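To make the idea of redacted logs concrete, here is a minimal sketch of line-level redaction. The patterns and the redact_line helper are assumptions for illustration only; the rules that failure-doctor sanitize actually applies are not documented in this README and are likely broader.

```python
import re

# Hypothetical patterns, not the tool's real rule set.
SECRET_PATTERNS = [
    # Header-style secrets: redact everything after the header name.
    re.compile(r"(?i)((?:authorization|(?:set-)?cookie):\s*)(\S.*)"),
    # key=value or key: value secrets: redact only the value.
    re.compile(r"(?i)((?:api[_-]?key|token|password)\s*[=:]\s*)([^\s&;]+)"),
]


def redact_line(line: str) -> tuple[str, int]:
    """Return the redacted line and the number of replacements made."""
    count = 0
    for pattern in SECRET_PATTERNS:
        line, replaced = pattern.subn(r"\1[REDACTED]", line)
        count += replaced
    return line, count


if __name__ == "__main__":
    for sample in [
        "Authorization: Bearer abc123",
        "GET /items?api_key=SECRET&page=2",
        "page loaded in 120 ms",
    ]:
        print(redact_line(sample))
```

Returning the replacement count alongside the text is one way to feed a redaction report. Pattern-based redaction can miss secrets, so a manual review before sharing remains necessary.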

Put a failed run in a folder:

my_failed_run/
|-- error.log
|-- console.txt
|-- network.json
|-- README.txt
`-- screenshot.png

Then run:

failure-doctor diagnose .\my_failed_run --out .\report

The tool inventories inputs and uses this evidence priority:

trace.zip > log > network.json > user description > screenshot metadata

When evidence is too thin, it should downgrade to insufficient_evidence instead of guessing.
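The priority order above can be sketched as a small lookup. This is an illustration under stated assumptions, not the tool's inventory logic: the mapping from file names to evidence kinds (for example, treating console.txt as a log and README.txt as a user description) is a guess, and the real threshold for insufficient_evidence is not specified here, so the sketch only downgrades when nothing is recognized.

```python
from typing import Optional

# Priority order copied from the README; earlier entries win.
EVIDENCE_PRIORITY = [
    "trace.zip",
    "log",
    "network.json",
    "user description",
    "screenshot metadata",
]


def classify_input(name: str) -> Optional[str]:
    """Map a file name to an evidence kind (hypothetical mapping)."""
    lowered = name.lower()
    if lowered == "trace.zip":
        return "trace.zip"
    if lowered.endswith(".log") or lowered == "console.txt":
        return "log"
    if lowered == "network.json":
        return "network.json"
    if lowered in {"user_description.txt", "readme.txt"}:
        return "user description"
    if lowered.endswith((".png", ".jpg", ".jpeg")):
        return "screenshot metadata"
    return None


def best_evidence(file_names: list[str]) -> str:
    """Return the strongest evidence kind present, or insufficient_evidence."""
    kinds = {classify_input(name) for name in file_names}
    for kind in EVIDENCE_PRIORITY:
        if kind in kinds:
            return kind
    return "insufficient_evidence"


if __name__ == "__main__":
    print(best_evidence(["error.log", "console.txt", "network.json", "README.txt", "screenshot.png"]))
    print(best_evidence([]))
```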

Minimal Demos

Proxy/network failure:

failure-doctor diagnose .\examples\failed_runs\proxy_failed --out .\report_proxy

Strict mode locator conflict:

failure-doctor diagnose .\examples\failed_runs\strict_mode_locator --out .\report_locator

Low-evidence screenshot-only run:

failure-doctor diagnose .\examples\failed_runs\low_evidence_screenshot_only --out .\report_low_evidence

Native Playwright trace fixture:

trace-doctor diagnose .\examples\realistic_playwright_traces\02_login_redirect_302\trace.zip --out .\report_login_trace

Before / After Report

Report structure: conclusion / evidence / why / next action / Codex fix prompt

Before:

page.goto: net::ERR_PROXY_CONNECTION_FAILED while opening https://example.test

After:

Conclusion: network/proxy setup failed before the page loaded.
Evidence: Playwright reported net::ERR_PROXY_CONNECTION_FAILED.
Next action: check proxy settings, DNS, VPN, and CI network configuration.
Codex fix prompt: add trace/log capture and make proxy configuration explicit.
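One way to picture a deterministic, evidence-based rule is a pattern table that maps a known error string to a conclusion and a next action. The sketch below encodes only the proxy example above; the Diagnosis class, the RULES table, and diagnose_line are hypothetical names, and the project's actual engine and rule format may look quite different.

```python
import re
from dataclasses import dataclass


@dataclass
class Diagnosis:
    conclusion: str
    evidence: str
    next_action: str


# Hypothetical rule table with a single entry taken from the example above.
RULES = [
    (
        re.compile(r"net::ERR_PROXY_CONNECTION_FAILED"),
        "network/proxy setup failed before the page loaded",
        "check proxy settings, DNS, VPN, and CI network configuration",
    ),
]


def diagnose_line(line: str) -> Diagnosis:
    """Return the first matching rule, or insufficient_evidence if none match."""
    for pattern, conclusion, next_action in RULES:
        match = pattern.search(line)
        if match:
            return Diagnosis(conclusion, f"log contains {match.group(0)}", next_action)
    return Diagnosis(
        "insufficient_evidence",
        "no known pattern matched",
        "collect trace.zip or fuller logs",
    )


if __name__ == "__main__":
    print(diagnose_line(
        "page.goto: net::ERR_PROXY_CONNECTION_FAILED while opening https://example.test"
    ))
```

Because every conclusion is tied to a matched string, the same input always yields the same output, and the evidence field shows exactly why.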

Verify a Fix

failure-doctor diagnose .\failed_run --out .\report
failure-doctor plan .\report --out .\fix_plan
failure-doctor verify --before .\failed_run --after .\rerun_after_fix --out .\verification_report

verify compares before/after evidence and reports whether the original failure is resolved, unchanged, changed into another failure, or insufficiently evidenced.
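The four outcomes can be expressed as a small decision function. This is a sketch only: the status strings, the NO_FAILURE marker, and the idea of comparing a single failure label per run are assumptions, and the real verify command compares richer evidence than one label.

```python
from typing import Optional

# Hypothetical marker for a rerun that shows no failure.
NO_FAILURE = "no_failure"


def verification_status(before: Optional[str], after: Optional[str]) -> str:
    """Compare failure labels from the original run and the rerun.

    None means the run had too little evidence to assign a label.
    """
    if before is None or after is None:
        return "insufficient_evidence"
    if after == NO_FAILURE:
        return "resolved"
    if after == before:
        return "unchanged"
    return "changed"


if __name__ == "__main__":
    print(verification_status("proxy_network_error", NO_FAILURE))
    print(verification_status("proxy_network_error", "strict_mode_locator"))
```

Checking for missing evidence first keeps the function from reporting a fix as resolved when the rerun simply produced no usable data.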

AI Handoff & Patch Proposal

Turn a report into task packs that Codex, Claude Code, or Cursor can execute:

failure-doctor handoff .\report --target codex --out .\ai_handoff
failure-doctor handoff .\report --target claude_code --out .\ai_handoff
failure-doctor handoff .\report --target cursor --out .\ai_handoff

This writes:

ai_handoff/
|-- ai_handoff.json
|-- ai_handoff.md
|-- codex_task.md
|-- claude_code_task.md
|-- cursor_task.md
|-- affected_files.json
|-- validation_commands.md
|-- forbidden_actions.md
|-- token_budget_report.json
`-- ai_handoff_pack.zip

Generate a dry-run patch proposal without modifying source code:

failure-doctor propose-patch --repo . --report .\report --out .\patch_plan

This writes:

patch_plan/
|-- patch_proposal.md
|-- proposed_changes.json
|-- affected_files.json
|-- validation_commands.md
`-- patch_risk_assessment.json

propose-patch is intentionally proposal-only. It does not edit files, apply patches, run tests, or open pull requests.

v2.5 validation writes validation/ai_handoff_validation.json:

20/20 Codex task files generated
20/20 Claude Code task files generated
20/20 Cursor task files generated
18/20 patch proposals generated
20/20 required sections present
20/20 concise token budget checks pass
0 forbidden outputs

Batch Diagnosis / Fleet Mode

Diagnose many failed runs and get a fleet-level summary:

failure-doctor batch .\runs --out .\batch_report

Input:

runs/
|-- run_001/
|-- run_002/
|-- run_003/
`-- ...

Output:

batch_report/
|-- summary.json
|-- summary.md
|-- failures_by_type.csv
|-- top_root_causes.md
|-- repeated_failures.md
|-- suggested_regression_cases.md
|-- repair_priority.md
`-- reports/

Fleet mode answers which failures repeat, which root causes dominate, which runs should become regression cases, and which fixes deserve priority.
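A fleet-level summary of this kind boils down to counting diagnosed failure types across runs. The sketch below assumes each run has already been reduced to one failure label; the summarize_fleet function, the output keys, and the example labels are hypothetical and do not reflect the actual schema of summary.json or failures_by_type.csv.

```python
from collections import Counter


def summarize_fleet(run_failures: dict[str, str], top_n: int = 3) -> dict:
    """Aggregate per-run failure labels into a fleet summary."""
    counts = Counter(run_failures.values())
    return {
        "failures_by_type": dict(counts),
        # most_common sorts by count, highest first.
        "top_root_causes": [label for label, _ in counts.most_common(top_n)],
        # A failure type seen in more than one run counts as repeated.
        "repeated_failures": sorted(label for label, n in counts.items() if n > 1),
    }


if __name__ == "__main__":
    runs = {
        "run_001": "proxy_network_error",
        "run_002": "strict_mode_locator",
        "run_003": "proxy_network_error",
    }
    print(summarize_fleet(runs))
```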

P98 Controlled Maturity

v3.0 starts the P98 controlled maturity track. This is not an ecosystem score; it does not count stars, external PRs, external issues, PyPI downloads, or long-term community adoption.

Current P98 assets:

Knowledge-base commands:

python -m tools.knowledge_base.validate_patterns
python -m tools.knowledge_base.search_patterns --query selector_drift
python -m tools.validation.run_crawler_failure_coverage_matrix

Applied Scenario Demos

Local-only mock demos show how Agent Failure Doctor can diagnose failures in:

  • hot product collection
  • live commerce monitoring
  • ecommerce listing automation
  • authorized content publishing workflow
  • GUI / RPA data bridge
  • ERP-to-ecommerce sync

Run:

python -m tools.validation.run_applied_scenario_validation

Spiderbuf-Inspired Challenge Demos

examples/spiderbuf_inspired_challenges/ contains local-only mock failure packs inspired by public crawler-training challenge categories:

  • cookie/session required
  • iframe extraction
  • Ajax dynamic loading
  • random CSS selector drift
  • infinite scroll missing items
  • rate limit 429
  • API signature required
  • browser fingerprint risk
  • Selenium detection risk
  • challenge page detected

These cases are diagnosis-only. They do not access spiderbuf.cn, do not include private solutions, and do not include access-control defeat steps.

Run:

python -m tools.validation.run_spiderbuf_inspired_validation

Integrations

Collect Playwright test-results into a failure pack:

failure-doctor collect-playwright .\examples\mock_playwright_test_results --out .\tmp_failure_pack
failure-doctor diagnose .\tmp_failure_pack --out .\tmp_collected_report

Normalize a loose log folder:

failure-doctor pack-logs .\examples\mock_raw_logs --out .\tmp_log_pack
failure-doctor diagnose .\tmp_log_pack --out .\tmp_log_report

Normalize a Selenium, Puppeteer, Cypress, Scrapy, requests, or httpx failure log:

failure-doctor adapt .\examples\cross_framework_fixtures\selenium\no_such_element\raw --framework selenium --out .\tmp_selenium_pack
failure-doctor diagnose .\tmp_selenium_pack --out .\tmp_selenium_report
failure-doctor plan .\tmp_selenium_report --out .\tmp_selenium_fix_plan

Supported adapter frameworks:

selenium | puppeteer | cypress | scrapy | requests | httpx | auto

Playwright remains the deepest native trace backend. Cross-framework adapters normalize local logs and metadata into the same failure lifecycle; they do not run those frameworks or connect to external platforms.

See docs/INTEGRATIONS.md and docs/GITHUB_ACTION_USAGE.md.

Validation Status

Current milestone: Agent Failure Doctor v3.2 Auto Collector P98 Gate.

Previous P95 stable line: Agent Failure Doctor v2.4.1 P95 Alignment & Missing Tracks Pack.

  • 131 source-ledger records with separated real_public_issue, official_doc_pattern, and public_inspired_sanitized labels
  • 50 traceable real public issue records
  • 100 Playwright Trace Doctor P95 fixtures
  • 100/100 Playwright trace reasonable classifications
  • 100/100 Playwright trace exact subtype matches
  • 62 external public reference seeds
  • 20 external public reference held-out records
  • 20/20 external public reference reasonable classifications
  • 20/20 external public reference actionable next actions
  • 12 resolution validation cases
  • 12/12 resolution statuses correct
  • 18 applied scenario validation cases
  • 18/18 applied scenario reasonable classifications
  • 18/18 applied scenario valid fix plans
  • 18/18 applied scenario verification statuses correct
  • Playwright collector, generic log packer, browser-use adapter, and GitHub Actions usage docs
  • v2.0 Auto Capture command wrapper: failure-doctor run -- <command>
  • Sanitize & Share command: failure-doctor sanitize <failed_run> --out <shareable_failure_pack>
  • Cross-framework adapter command: failure-doctor adapt <input> --framework <framework> --out <failure_pack>
  • 100 cross-framework P95 fixtures across Selenium, Puppeteer, Cypress, Scrapy, requests, httpx, browser-use, and generic RPA
  • 100/100 cross-framework P95 reasonable classifications
  • 100/100 cross-framework P95 valid fix plans
  • 0 forbidden outputs in cross-framework P95 validation
  • 40 training challenge P95 local-only validation cases
  • 40/40 training challenge reasonable classifications
  • 40/40 training challenge valid fix plans
  • 40/40 training challenge verification statuses correct
  • 0 forbidden outputs and 0 private solution leaks in training challenge validation
  • 160 composite P95 strict local-only validation cases
  • 160/160 composite primary classifications correct
  • 160/160 composite repair-order checks correct
  • 160/160 composite evidence graphs generated
  • 0 forbidden outputs in composite P95 strict validation
  • P95 Core Triad Gate: pass
  • 3 composite showcase reports under sample_reports/composite_showcase/
  • 10 external held-out public-source records
  • 9/10 external held-out reasonable classifications
  • 10/10 external held-out actionable next actions
  • 0 forbidden outputs in generated reports/prompts
  • GitHub Actions green across Ubuntu, macOS, Windows, plus Windows benchmark/smoke/safety

See docs/VALIDATION_REPORT.md, docs/EXTERNAL_DATA_SOURCES.md, and validation/dashboard.md for validation metrics, limits, and boundaries.

Reproduce Validation

python -m tools.real_trace_generation.generate_real_trace_fixtures `
  --out .\examples\realistic_playwright_traces `
  --count 30 `
  --clean
python -m tools.validation.run_real_trace_validation
python -m tools.validation.run_playwright_trace_p95_validation
python -m tools.validation.run_external_public_reference_validation
python -m tools.validation.run_resolution_validation
python -m tools.validation.run_spiderbuf_inspired_validation
python -m tools.validation.run_training_challenge_validation
python -m tools.validation.run_cross_framework_p95_validation
python -m tools.validation.run_composite_diagnosis_p95_strict_validation
python -m tools.validation.run_p95_core_triage_gate
python scripts\validate_external_heldout.py

Safety Boundary

This project is for local, sanitized failure diagnosis.

It is not:

  • a challenge-solving tool
  • an access-control circumvention tool
  • a credential extractor
  • a real-platform scraper
  • a tool for unauthorized collection

For suspected platform risk cases, the intended output is identification, routing, and compliance-oriented next steps such as reducing request volume, using an official API, confirming authorization, contacting the platform, or stopping unauthorized collection.

Contributing Failure Cases

You do not need to write code. The most useful contribution is a sanitized failure case: log snippets, trace metadata, network summaries, screenshot metadata, and a short description of what happened.

Open an External failure case issue and remove secrets before posting:

  • passwords
  • API keys
  • cookies
  • tokens
  • authorization headers
  • private screenshots
  • private data
  • personal data

Accepted input types include sanitized error.log, trace.zip, console.txt, network.json, screenshot metadata, and user_description.txt.

If you allow it, a sanitized case may be assigned an EXT-YYYY-NNNN id, run once with the current released version before rule changes, and added to the external validation dashboard.

Templates and author-generated examples are not counted as external cases.

See CONTRIBUTING.md, docs/external_validation_protocol.md, docs/REAL_TRACE_CONTRIBUTION_GUIDE.md, and docs/REAL_DATA_SOURCES.md.

Commands

Run all tests:

python -m unittest discover -s tests -p "test_*.py"

Run smoke and safety checks:

scripts\smoke_test.ps1
scripts\local_safety_scan.ps1

Download files

Source Distribution

agent_failure_doctor-3.2.0.tar.gz (213.1 kB)

Uploaded Source

Built Distribution

agent_failure_doctor-3.2.0-py3-none-any.whl (187.7 kB)

Uploaded Python 3

File details

Details for the file agent_failure_doctor-3.2.0.tar.gz.

File metadata

  • Download URL: agent_failure_doctor-3.2.0.tar.gz
  • Size: 213.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_failure_doctor-3.2.0.tar.gz
Algorithm Hash digest
SHA256 f652126e76e95f42af4c404307613e4f821d352fad5f898ec73ea721009790b9
MD5 3b66418752c72ff85bd2d1fa8d67baee
BLAKE2b-256 34ce52c71452458fbc5536283a5443ff6f4f47133a595d5679e0c65c2dc98571

Provenance

The following attestation bundles were made for agent_failure_doctor-3.2.0.tar.gz:

Publisher: publish-pypi.yml on tobybgy-lsd/web-agent-runtime-bench

File details

Details for the file agent_failure_doctor-3.2.0-py3-none-any.whl.

File hashes

Hashes for agent_failure_doctor-3.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7075371b27cc7942267741cd37a7b877e062b1386fd25838dec097786d7854c6
MD5 ef1d1322c65c8b97b4820a9d80516e40
BLAKE2b-256 f297f58e4678fa0c67d93c3b2a8e21e260991007d931797735cb0609bf131acc

Provenance

The following attestation bundles were made for agent_failure_doctor-3.2.0-py3-none-any.whl:

Publisher: publish-pypi.yml on tobybgy-lsd/web-agent-runtime-bench
