Local-first failure diagnosis for AI browser automation, Playwright, crawler, and RPA runs.
Agent Failure Doctor
Local-first failure diagnosis lifecycle tool for AI browser automation, Playwright, crawler, RPA, and business automation failures.
- Current milestone: Agent Failure Doctor v3.2 Auto Collector P98 Gate.
- Previous stable line: Agent Failure Doctor v3.1.0 P98 Master Gate.
- Previous P95 stable line: Agent Failure Doctor v2.4.1 P95 Alignment & Missing Tracks Pack.
Input: trace.zip / error.log / console.txt / network.json / screenshot metadata / user_description.txt
Output: diagnosis, evidence, next action, repair suggestions, GitHub issue draft, Codex fix prompt.
Quickstart
git clone https://github.com/tobybgy-lsd/web-agent-runtime-bench.git
cd web-agent-runtime-bench
python -m pip install -e .
failure-doctor diagnose .\examples\failed_runs\proxy_network_error --out .\report
failure-doctor plan .\report --out .\fix_plan
failure-doctor collect --project . --preset auto --out .\failure_doctor_auto_report `
--auto-diagnose --auto-handoff --auto-sanitize
failure-doctor agent-bootstrap --target all --project .
See validation/dashboard.md, docs/P98_LIMITS.md, docs/AGENT_FRONTEND_INVOCATION.md, and docs/safety_boundary.md.
P98 master gate passed with the auto collector pillar included.
Advanced commands include failure-doctor handoff,
failure-doctor agent-bootstrap, failure-doctor propose-patch, and
failure-doctor batch.
Core commands: collect / diagnose / plan / verify / run /
watch / sanitize / adapt / handoff / agent-bootstrap /
propose-patch / batch
Classic lifecycle: diagnose / plan / verify / run /
sanitize / adapt -> diagnose -> plan -> AI handoff / patch proposal -> verify -> sanitize/share
P98 gate: knowledge base -> coverage matrix -> trace/cross-framework/training/composite/handoff/batch/sanitize/auto-collector -> master gate
Distribution & Feedback
v3.2.0 is the current stable technical baseline. The next phase is distribution and real user feedback, not more synthetic feature expansion.
- PyPI release runbook: docs/PYPI_RELEASE.md
- 2-minute demo script: docs/DEMO_VIDEO_SCRIPT.md
- Technical article draft: docs/TECH_ARTICLE_DRAFT.md
- Real user feedback loop: docs/REAL_USER_FEEDBACK_LOOP.md
After PyPI publication, the target install command is:
pip install agent-failure-doctor
For non-technical Windows users, double-click
scripts/windows/Start-FailureDoctor-Diagnosis.bat or drag a failed project
folder onto it.
Advanced v3.2 commands include failure-doctor collect and failure-doctor watch.
Agent frontend invocation:
failure-doctor agent-bootstrap --target all --project .
This writes .failure-doctor/AGENT_ENTRYPOINT.md plus Codex, Cursor,
Claude Code, VS Code/Copilot, Antigravity, OpenCode, Qoder, Trae, WorkBuddy,
OpenClaw, Hermes, and generic agent workflow instructions.
Agent Failure Doctor uses a deterministic evidence-based diagnostic engine. It does not claim to solve arbitrary failures, but it provides explainable classification, evidence, fix plans, and before/after verification for known automation failure patterns.
Applied scenario demos are local-only mock workflows for commerce automation, live monitoring, content publishing, GUI data bridge, and ERP sync failure diagnosis.
Spiderbuf-inspired challenge demos are local-only mock failure packs inspired by public crawler-training challenge categories; they validate diagnosis and safe next actions without accessing spiderbuf.cn or publishing private solution logic.
Integration commands: failure-doctor collect-playwright / failure-doctor pack-logs / failure-doctor adapt
What You Get
report/
|-- diagnosis.json
|-- diagnosis.md
|-- evidence.json
|-- input_summary.json
|-- issue_draft.md
|-- repair_suggestions.md
|-- codex_fix_prompt.md
`-- failure_doctor_report.zip
Agent Failure Doctor turns sanitized automation failure materials into a report that explains what likely failed, what evidence supports the diagnosis, what evidence is missing, and what to ask Codex or another coding assistant to change next.
One-Minute Start
Auto Capture:
failure-doctor run -- python crawler.py
failure-doctor run -- pytest tests/test_listing.py
failure-doctor run -- playwright test
This writes a local run folder under .failure-doctor/runs/<run_id>/:
.failure-doctor/runs/<run_id>/
|-- command.txt
|-- exit_code.txt
|-- stdout.log
|-- stderr.log
|-- environment.json
|-- detected_artifacts.json
|-- input_summary.json
|-- diagnosis/
|-- fix_plan/
|-- verification_hint.md
`-- shareable_failure_pack.zip
The generated safe_to_share.json defaults to safe_to_share=false; review and sanitize before sending a pack to anyone else.
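As a rough illustration of what an auto-capture wrapper does, the sketch below runs a command and writes the first four files of the run folder listed above. It is a minimal sketch, not the tool's implementation: `capture_run` is a hypothetical helper, and the exact file contents are assumptions.

```python
import shlex
import subprocess
from pathlib import Path


def capture_run(command: list[str], run_dir: Path) -> int:
    """Run a command and save its command line, exit code, and output streams.

    Hypothetical helper: file names mirror the run folder layout, but the
    contents written by the real tool may differ.
    """
    run_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(command, capture_output=True, text=True)
    (run_dir / "command.txt").write_text(shlex.join(command), encoding="utf-8")
    (run_dir / "exit_code.txt").write_text(str(result.returncode), encoding="utf-8")
    (run_dir / "stdout.log").write_text(result.stdout, encoding="utf-8")
    (run_dir / "stderr.log").write_text(result.stderr, encoding="utf-8")
    return result.returncode
```

Keeping the exit code and both streams in separate files makes a failed run inspectable later without rerunning the command.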
Sanitize & Share Pack:
Sanitize a failed run before sharing it:
failure-doctor sanitize .\.failure-doctor\runs\<run_id> --out .\shareable_failure_pack
This writes redacted logs, redacted network summaries, trace metadata only, a
redaction report, a review gate, and shareable_failure_pack.zip.
Raw trace.zip archives are not copied into the sanitized pack.
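To make the idea of log redaction concrete, here is a minimal sketch of pattern-based masking. The patterns and the `redact_line` name are assumptions chosen for illustration; the tool's actual redaction rules are not documented here and are likely broader.

```python
import re

# Illustrative patterns only (assumptions); real redaction rules may differ.
SECRET_PATTERNS = [
    # Mask everything after an Authorization header name.
    re.compile(r"(?i)(authorization:\s*)(\S.*)"),
    # Mask values assigned to common secret-like keys.
    re.compile(r"(?i)((?:api[_-]?key|token|password|cookie)\s*[=:]\s*)(\S+)"),
]


def redact_line(line: str) -> str:
    """Replace secret-looking values with a placeholder, keeping the key name."""
    for pattern in SECRET_PATTERNS:
        line = pattern.sub(lambda m: m.group(1) + "[REDACTED]", line)
    return line
```

A pass like this is only a first filter. The review gate still matters because regexes miss secrets that do not follow a known key name.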
Put a failed run in a folder:
my_failed_run/
|-- error.log
|-- console.txt
|-- network.json
|-- README.txt
`-- screenshot.png
Then run:
failure-doctor diagnose .\my_failed_run --out .\report
The tool inventories inputs and uses this evidence priority:
trace.zip > log > network.json > user description > screenshot metadata
When evidence is too thin, it should downgrade to insufficient_evidence instead of guessing.
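The priority order and the downgrade rule can be sketched as follows. The threshold for "too thin" is an assumption (here: nothing available, or screenshot metadata alone), and `rank_evidence` and `primary_evidence` are hypothetical names.

```python
# Priority order as stated above, strongest first.
EVIDENCE_PRIORITY = [
    "trace.zip",
    "log",
    "network.json",
    "user description",
    "screenshot metadata",
]


def rank_evidence(available: set[str]) -> list[str]:
    """Return the available evidence kinds, strongest first."""
    return [kind for kind in EVIDENCE_PRIORITY if kind in available]


def primary_evidence(available: set[str]) -> str:
    """Pick the strongest evidence, or downgrade when there is too little."""
    ranked = rank_evidence(available)
    # Assumption: no evidence, or screenshot metadata alone, is "too thin".
    if not ranked or ranked == ["screenshot metadata"]:
        return "insufficient_evidence"
    return ranked[0]
```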
Minimal Demos
Proxy/network failure:
failure-doctor diagnose .\examples\failed_runs\proxy_failed --out .\report_proxy
Strict mode locator conflict:
failure-doctor diagnose .\examples\failed_runs\strict_mode_locator --out .\report_locator
Low-evidence screenshot-only run:
failure-doctor diagnose .\examples\failed_runs\low_evidence_screenshot_only --out .\report_low_evidence
Native Playwright trace fixture:
trace-doctor diagnose .\examples\realistic_playwright_traces\02_login_redirect_302\trace.zip --out .\report_login_trace
Before / After Report
Report structure: conclusion / evidence / why / next action / Codex fix prompt
Before:
page.goto: net::ERR_PROXY_CONNECTION_FAILED while opening https://example.test
After:
Conclusion: network/proxy setup failed before the page loaded.
Evidence: Playwright reported net::ERR_PROXY_CONNECTION_FAILED.
Next action: check proxy settings, DNS, VPN, and CI network configuration.
Codex fix prompt: add trace/log capture and make proxy configuration explicit.
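The before/after example above can be read as one rule in a deterministic rule table. The sketch below shows that idea under stated assumptions: the `Diagnosis` dataclass, the `RULES` table, and the fallback wording are hypothetical and do not reflect the tool's internal data model.

```python
import re
from dataclasses import dataclass


@dataclass
class Diagnosis:
    conclusion: str
    evidence: str
    next_action: str


# One illustrative rule, taken from the proxy example; a real table has many.
RULES = [
    (
        re.compile(r"net::ERR_PROXY_CONNECTION_FAILED"),
        "network/proxy setup failed before the page loaded",
        "check proxy settings, DNS, VPN, and CI network configuration",
    ),
]


def diagnose(log_text: str) -> Diagnosis:
    """Return the first matching rule, or an explicit low-evidence result."""
    for pattern, conclusion, next_action in RULES:
        match = pattern.search(log_text)
        if match:
            return Diagnosis(conclusion, f"log contains {match.group(0)}", next_action)
    return Diagnosis(
        "insufficient_evidence",
        "no known pattern matched",
        "collect a trace or fuller logs",
    )
```

Because each conclusion is tied to the matched text, the report can always show why a classification was made.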
Verify a Fix
failure-doctor diagnose .\failed_run --out .\report
failure-doctor plan .\report --out .\fix_plan
failure-doctor verify --before .\failed_run --after .\rerun_after_fix --out .\verification_report
verify compares before/after evidence and reports whether the original failure
is resolved, unchanged, changed into another failure, or insufficiently
evidenced.
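The four outcomes can be expressed as a small comparison over the before and after classifications. This is a sketch under assumptions: the status strings, the `"passed"` marker for a clean rerun, and the use of `None` for unclassifiable evidence are all hypothetical conventions, not the tool's output format.

```python
from typing import Optional


def verification_status(before: Optional[str], after: Optional[str]) -> str:
    """Compare failure labels from the failed run and the rerun.

    Assumed conventions: None means the evidence could not be classified,
    and "passed" means the rerun showed no failure.
    """
    if before is None or after is None:
        return "insufficient_evidence"
    if after == "passed":
        return "resolved"
    if after == before:
        return "unchanged"
    return "changed_to_another_failure"
```

For example, `verification_status("proxy_network_error", "selector_drift")` returns `"changed_to_another_failure"`, which signals that the original fix worked but exposed a different problem.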
AI Handoff & Patch Proposal
Turn a report into task packs that Codex, Claude Code, or Cursor can execute:
failure-doctor handoff .\report --target codex --out .\ai_handoff
failure-doctor handoff .\report --target claude_code --out .\ai_handoff
failure-doctor handoff .\report --target cursor --out .\ai_handoff
This writes:
ai_handoff/
|-- ai_handoff.json
|-- ai_handoff.md
|-- codex_task.md
|-- claude_code_task.md
|-- cursor_task.md
|-- affected_files.json
|-- validation_commands.md
|-- forbidden_actions.md
|-- token_budget_report.json
`-- ai_handoff_pack.zip
Generate a dry-run patch proposal without modifying source code:
failure-doctor propose-patch --repo . --report .\report --out .\patch_plan
This writes:
patch_plan/
|-- patch_proposal.md
|-- proposed_changes.json
|-- affected_files.json
|-- validation_commands.md
`-- patch_risk_assessment.json
propose-patch is intentionally proposal-only. It does not edit files, apply patches, run tests, or open pull requests.
v2.5 validation writes validation/ai_handoff_validation.json:
20/20 Codex task files generated
20/20 Claude Code task files generated
20/20 Cursor task files generated
18/20 patch proposals generated
20/20 required sections present
20/20 concise token budget checks pass
0 forbidden outputs
Batch Diagnosis / Fleet Mode
Diagnose many failed runs and get a fleet-level summary:
failure-doctor batch .\runs --out .\batch_report
Input:
runs/
|-- run_001/
|-- run_002/
|-- run_003/
`-- ...
Output:
batch_report/
|-- summary.json
|-- summary.md
|-- failures_by_type.csv
|-- top_root_causes.md
|-- repeated_failures.md
|-- suggested_regression_cases.md
|-- repair_priority.md
`-- reports/
Fleet mode answers which failures repeat, which root causes dominate, which runs should become regression cases, and which fixes deserve priority.
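A fleet summary of this kind boils down to counting failure types across runs. The sketch below assumes each run has already been reduced to a single failure label; `summarize_fleet` and the label strings are hypothetical, and the real summary files contain more detail.

```python
from collections import Counter


def summarize_fleet(run_failures: dict[str, str]) -> dict:
    """Aggregate per-run failure labels into fleet-level counts.

    run_failures maps a run id (e.g. "run_001") to an assumed failure label.
    """
    counts = Counter(run_failures.values())
    return {
        "failures_by_type": dict(counts),
        # Types seen in more than one run are candidates for regression cases.
        "repeated_failures": sorted(t for t, n in counts.items() if n > 1),
        # Most frequent types first, as a simple proxy for repair priority.
        "repair_priority": [t for t, _ in counts.most_common()],
    }
```

Frequency is only one possible priority signal. Severity or blast radius could be weighted in as well.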
P98 Controlled Maturity
v3.0 starts the P98 controlled maturity track. This is not an ecosystem score; it does not count stars, external PRs, external issues, PyPI downloads, or long-term community adoption.
Current P98 assets:
- docs/P98_CONTROLLED_MATURITY_SCORECARD.md
- knowledge_base/
- docs/CRAWLER_FAILURE_COVERAGE_MATRIX.md
- validation/crawler_failure_coverage_matrix.json
Knowledge-base commands:
python -m tools.knowledge_base.validate_patterns
python -m tools.knowledge_base.search_patterns --query selector_drift
python -m tools.validation.run_crawler_failure_coverage_matrix
Applied Scenario Demos
Local-only mock demos show how Agent Failure Doctor can diagnose failures in:
- hot product collection
- live commerce monitoring
- ecommerce listing automation
- authorized content publishing workflow
- GUI / RPA data bridge
- ERP-to-ecommerce sync
Run:
python -m tools.validation.run_applied_scenario_validation
Spiderbuf-Inspired Challenge Demos
examples/spiderbuf_inspired_challenges/ contains local-only mock failure packs inspired by public crawler-training challenge categories:
- cookie/session required
- iframe extraction
- Ajax dynamic loading
- random CSS selector drift
- infinite scroll missing items
- rate limit 429
- API signature required
- browser fingerprint risk
- Selenium detection risk
- challenge page detected
These cases are diagnosis-only. They do not access spiderbuf.cn, do not include private solutions, and do not include access-control defeat steps.
python -m tools.validation.run_spiderbuf_inspired_validation
Integrations
Collect Playwright test-results into a failure pack:
failure-doctor collect-playwright .\examples\mock_playwright_test_results --out .\tmp_failure_pack
failure-doctor diagnose .\tmp_failure_pack --out .\tmp_collected_report
Normalize a loose log folder:
failure-doctor pack-logs .\examples\mock_raw_logs --out .\tmp_log_pack
failure-doctor diagnose .\tmp_log_pack --out .\tmp_log_report
Normalize a Selenium, Puppeteer, Cypress, Scrapy, requests, or httpx failure log:
failure-doctor adapt .\examples\cross_framework_fixtures\selenium\no_such_element\raw --framework selenium --out .\tmp_selenium_pack
failure-doctor diagnose .\tmp_selenium_pack --out .\tmp_selenium_report
failure-doctor plan .\tmp_selenium_report --out .\tmp_selenium_fix_plan
Supported adapter frameworks:
selenium | puppeteer | cypress | scrapy | requests | httpx | auto
Playwright remains the deepest native trace backend. Cross-framework adapters normalize local logs and metadata into the same failure lifecycle; they do not run those frameworks or connect to external platforms.
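One way an `auto` mode could guess the source framework from a raw log is simple substring matching. This is a sketch only: the hint strings and the `guess_framework` helper are assumptions for illustration, and the adapter's real detection logic is not described in this document.

```python
# Hypothetical hint strings; each is text that commonly appears in that
# framework's error output or stack traces.
FRAMEWORK_HINTS = {
    "selenium": ("selenium.common.exceptions", "NoSuchElementException"),
    "puppeteer": ("puppeteer",),
    "cypress": ("CypressError",),
    "scrapy": ("scrapy.",),
    "requests": ("requests.exceptions",),
    "httpx": ("httpx.",),
}


def guess_framework(log_text: str) -> str:
    """Return the first framework whose hint appears in the log, else 'unknown'."""
    for framework, hints in FRAMEWORK_HINTS.items():
        if any(hint in log_text for hint in hints):
            return framework
    return "unknown"
```

Returning `"unknown"` instead of a best guess keeps the result consistent with the tool's preference for downgrading over guessing.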
See docs/INTEGRATIONS.md and docs/GITHUB_ACTION_USAGE.md.
Validation Status
Current milestone: Agent Failure Doctor v3.2 Auto Collector P98 Gate.
Previous P95 stable line: Agent Failure Doctor v2.4.1 P95 Alignment & Missing Tracks Pack.
- 131 source-ledger records with separated real_public_issue, official_doc_pattern, and public_inspired_sanitized labels
- 50 traceable real public issue records
- 100 Playwright Trace Doctor P95 fixtures
- 100/100 Playwright trace reasonable classifications
- 100/100 Playwright trace exact subtype matches
- 62 external public reference seeds
- 20 external public reference held-out records
- 20/20 external public reference reasonable classifications
- 20/20 external public reference actionable next actions
- 12 resolution validation cases
- 12/12 resolution statuses correct
- 18 applied scenario validation cases
- 18/18 applied scenario reasonable classifications
- 18/18 applied scenario valid fix plans
- 18/18 applied scenario verification statuses correct
- Playwright collector, generic log packer, browser-use adapter, and GitHub Actions usage docs
- v2.0 Auto Capture command wrapper: failure-doctor run -- <command>
- Sanitize & Share command: failure-doctor sanitize <failed_run> --out <shareable_failure_pack>
- Cross-framework adapter command: failure-doctor adapt <input> --framework <framework> --out <failure_pack>
- 100 cross-framework P95 fixtures across Selenium, Puppeteer, Cypress, Scrapy, requests, httpx, browser-use, and generic RPA
- 100/100 cross-framework P95 reasonable classifications
- 100/100 cross-framework P95 valid fix plans
- 0 forbidden outputs in cross-framework P95 validation
- 40 training challenge P95 local-only validation cases
- 40/40 training challenge reasonable classifications
- 40/40 training challenge valid fix plans
- 40/40 training challenge verification statuses correct
- 0 forbidden outputs and 0 private solution leaks in training challenge validation
- 160 composite P95 strict local-only validation cases
- 160/160 composite primary classifications correct
- 160/160 composite repair-order checks correct
- 160/160 composite evidence graphs generated
- 0 forbidden outputs in composite P95 strict validation
- P95 Core Triad Gate: pass
- 3 composite showcase reports under sample_reports/composite_showcase/
- 10 external held-out public-source records
- 9/10 external held-out reasonable classifications
- 10/10 external held-out actionable next actions
- 0 forbidden outputs in generated reports/prompts
- GitHub Actions green across Ubuntu, macOS, Windows, plus Windows benchmark/smoke/safety
See docs/VALIDATION_REPORT.md, docs/EXTERNAL_DATA_SOURCES.md, and validation/dashboard.md for validation metrics, limits, and boundaries.
Reproduce Validation
python -m tools.real_trace_generation.generate_real_trace_fixtures `
--out .\examples\realistic_playwright_traces `
--count 30 `
--clean
python -m tools.validation.run_real_trace_validation
python -m tools.validation.run_playwright_trace_p95_validation
python -m tools.validation.run_external_public_reference_validation
python -m tools.validation.run_resolution_validation
python -m tools.validation.run_spiderbuf_inspired_validation
python -m tools.validation.run_training_challenge_validation
python -m tools.validation.run_cross_framework_p95_validation
python -m tools.validation.run_composite_diagnosis_p95_strict_validation
python -m tools.validation.run_p95_core_triage_gate
python scripts\validate_external_heldout.py
Safety Boundary
This project is for local, sanitized failure diagnosis.
It is not:
- a challenge-solving tool
- an access-control circumvention tool
- a credential extractor
- a real-platform scraper
- a tool for unauthorized collection
For suspected platform risk cases, the intended output is identification, routing, and compliance-oriented next steps such as reducing request volume, using an official API, confirming authorization, contacting the platform, or stopping unauthorized collection.
Contributing Failure Cases
You do not need to write code. The most useful contribution is a sanitized failure case: log snippets, trace metadata, network summaries, screenshot metadata, and a short description of what happened.
Open an External failure case issue and remove secrets before posting:
- passwords
- API keys
- cookies
- tokens
- authorization headers
- private screenshots
- private data
- personal data
Accepted input types include sanitized error.log, trace.zip, console.txt,
network.json, screenshot metadata, and user_description.txt.
If you allow it, a sanitized case may be assigned an EXT-YYYY-NNNN id, run
once with the current released version before rule changes, and added to the
external validation dashboard.
Templates and author-generated examples are not counted as external cases.
See CONTRIBUTING.md, docs/external_validation_protocol.md, docs/REAL_TRACE_CONTRIBUTION_GUIDE.md, and docs/REAL_DATA_SOURCES.md.
Commands
Run all tests:
python -m unittest discover -s tests -p "test_*.py"
Run smoke and safety checks:
scripts\smoke_test.ps1
scripts\local_safety_scan.ps1