Auto-heal broken Playwright selectors using a local or cloud LLM
Self-Healing Test Automation Framework
A Playwright wrapper that uses a local or cloud LLM to automatically fix broken CSS selectors: no flaky CI pipelines, no manual triaging.
The Problem
UI changes break test selectors constantly:
TimeoutError: page.click: Timeout 30000ms exceeded.
waiting for selector "#submit-btn"
The button still exists; it's just [data-testid="login-submit"] now. A human would fix it in 10 seconds. But at 3 AM in CI, it blocks your entire pipeline.
How It Works
Test runs selector → TimeoutError → DOM snapshot captured
↓
DOM compressed (scripts/styles stripped, ~8KB)
↓
Prompt sent to LLM (local Ollama or Anthropic Claude)
↓
LLM returns: { "selector": "#new-id", "confidence": "high" }
↓
New selector validated in Playwright
↓ valid: test continues
↺ invalid: retry with feedback ("that selector didn't match, try another")
↓
Result cached to disk, keyed by (url, selector), reused across runs
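The flow above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the package's actual implementation: `heal_selector`, `ask_llm`, and `is_valid` are hypothetical names, the LLM call and the Playwright validation are passed in as plain callables, the cache is an in-memory dict rather than a file on disk, and `max_attempts` is assumed to count total attempts.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical cache shape: (url, stale selector) -> healed selector.
Cache = Dict[Tuple[str, str], str]


def heal_selector(
    url: str,
    stale: str,
    purpose: str,
    dom: str,
    ask_llm: Callable[[str], str],    # prompt -> suggested selector
    is_valid: Callable[[str], bool],  # stand-in for a Playwright check
    cache: Cache,
    max_attempts: int = 2,
) -> Optional[str]:
    """Return a working selector, or None if every attempt fails."""
    key = (url, stale)
    if key in cache:
        # Healed on an earlier call: skip the LLM entirely.
        return cache[key]

    failed: List[str] = []
    for _ in range(max_attempts):
        prompt = f"Purpose: {purpose}\nStale selector: {stale}\nDOM:\n{dom}"
        if failed:
            # Retry with feedback: name the selectors that did not match.
            prompt += "\nThese selectors did not match, try another: " + ", ".join(failed)
        suggestion = ask_llm(prompt)
        if is_valid(suggestion):
            cache[key] = suggestion
            return suggestion
        failed.append(suggestion)
    return None
```

Passing the LLM and the validator in as callables keeps the loop testable without a browser or a model: a test can supply a fake `ask_llm` that returns a wrong selector first and a right one second.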
Project Structure
pytest-self-healer/
├── src/
│   ├── pytest_self_healer/          # Installable package (pip install pytest-self-healer)
│   │   ├── __init__.py
│   │   ├── plugin.py                # pytest entry point (fixtures + CLI options)
│   │   ├── healing_engine.py        # Core: LLM clients, DOM compression, healing logic
│   │   └── page_wrapper.py          # SelfHealingPage: drop-in Playwright Page replacement
│   ├── evals/
│   │   ├── selector_evalset.json    # Ground-truth dataset for LLM accuracy benchmarking
│   │   ├── run_eval.py              # Standalone eval runner (scores + saves report)
│   │   └── compare_models.py        # Diff two eval reports side by side
│   ├── tests/
│   │   ├── test_healing_examples.py # Integration tests with intentionally stale selectors
│   │   ├── test_evalset.py          # pytest integration for the evalset
│   │   ├── test_accuracy.py         # LLM accuracy benchmarks (3 tiers)
│   │   └── test_unit.py             # Unit tests (no browser/LLM required)
│   └── conftest.py                  # pytest fixtures, CLI options, report hook
├── docker/
│   ├── Dockerfile                   # Test runner image (Playwright + Python)
│   └── docker-compose.yml           # Ollama + test runner, health-checked
├── reports/
│   ├── healing_report_<ts>.json     # Per-run healing reports
│   └── evals/
│       └── eval_<provider>_<ts>.json # Per-run eval reports
├── requirements.txt
├── pytest.ini
└── README.md
Quickstart
Option 1: Unit tests only (no browser or LLM needed)
pip install -r requirements.txt
playwright install chromium
PYTHONPATH=src pytest src/tests/test_unit.py -v
Option 2: Full integration tests (requires Ollama running locally)
# Install and start Ollama
brew install ollama # or: curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5-coder:3b
# Run the tests
PYTHONPATH=src pytest src/tests/ -v \
--ollama-url=http://localhost:11434 \
--ollama-model=qwen2.5-coder:3b
Option 3: Use Anthropic Claude instead of Ollama
export ANTHROPIC_API_KEY=sk-ant-...
PYTHONPATH=src pytest src/tests/ -v \
--llm-provider=anthropic \
--anthropic-model=claude-haiku-4-5-20251001
Option 4: Docker (everything bundled)
docker compose -f docker/docker-compose.yml up --build
# Reports land in ./reports/healing_report_<timestamp>.json
Writing Your Own Healing Tests
Replace page with SelfHealingPage. Add a purpose string to every interaction:
# After: pip install pytest-self-healer
# No import needed: the healing_page fixture is auto-available
async def test_checkout(healing_page):
    await healing_page.goto("https://myapp.com/checkout")

    # Selector is stale, so the LLM will find the real one
    await healing_page.click(
        selector="button#old-checkout-id",
        purpose="checkout submit button in the cart summary",
    )
    await healing_page.fill(
        selector="input.card-num",
        value="4242424242424242",
        purpose="credit card number input in payment form",
    )
Tips for better healing:
- Be specific in purpose: "blue submit button in the login modal" > "button"
- Use data-testid attributes in your app for stable baseline selectors
- The LLM favors data-testid > aria-label > id > semantic CSS
Healing-aware actions: click, dblclick, hover, fill, type, press,
check, uncheck, select_option, focus, tap, set_input_files,
text_content, inner_text, input_value, get_attribute, is_visible,
is_enabled, wait_for_selector, and drag_and_drop (which heals both the
source and the target). Any other Playwright Page method (goto, keyboard,
mouse, wait_for_load_state, ...) is transparently delegated to the underlying
page, so SelfHealingPage is a true drop-in.
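The delegation described above relies on Python's `__getattr__` hook. The following is a minimal sketch of that pattern, not the real SelfHealingPage: `PassthroughPage` and `FakePage` are hypothetical names, and the methods are synchronous so the example runs without a browser or an event loop.

```python
class FakePage:
    """Stand-in for a Playwright Page so the sketch runs without a browser."""

    def goto(self, url: str) -> str:
        return f"navigated to {url}"

    def click(self, selector: str) -> str:
        return f"clicked {selector}"


class PassthroughPage:
    """Hypothetical wrapper: overrides click, delegates everything else."""

    def __init__(self, page):
        self._page = page

    def click(self, selector: str, purpose: str = "") -> str:
        # A real wrapper would catch the timeout here and try to heal.
        return self._page.click(selector)

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails, so
        # methods defined on the wrapper (like click) never reach it.
        return getattr(self._page, name)


page = PassthroughPage(FakePage())
```

Here `page.goto(...)` is not defined on the wrapper, so it falls through to `FakePage.goto`, while `page.click(...)` uses the wrapper's own method. Any method the wrapped object gains later is available through the wrapper without code changes.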
CLI Options
| Flag | Default | Description |
|---|---|---|
| --llm-provider | ollama | One of ollama, anthropic, auto |
| --ollama-url | http://localhost:11434 | Ollama server endpoint |
| --ollama-model | qwen2.5-coder:3b | Model name (also works with llama3, mistral) |
| --anthropic-model | claude-haiku-4-5-20251001 | Any Claude model ID |
| --anthropic-api-key | None | Falls back to ANTHROPIC_API_KEY env var |
| --healing-report-dir | reports | Where to write JSON healing reports |
| --screenshot-dir | reports/screenshots | Where to write BEFORE/AFTER screenshots |
| --healing-max-attempts | 2 | How many times the LLM may retry a heal with feedback |
| --selector-cache-file | reports/selector_cache.json | Persistent selector cache, reused across runs |
| --no-selector-cache | false | Disable the persistent cache for this run |
| --headless | true | Run browser headless |
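To make the interaction between --llm-provider, --anthropic-api-key, and the ANTHROPIC_API_KEY fallback concrete, here is a minimal sketch of how the options might be resolved. `resolve_provider` is a hypothetical helper, not part of the plugin. The `auto` branch follows the behavior this README describes (Claude if a key is available, otherwise Ollama). Raising an error when `anthropic` is requested without a key is an assumption.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_provider(
    provider: str = "ollama",
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (provider, api_key) after applying the fallbacks."""
    env = os.environ if env is None else env
    # An explicit --anthropic-api-key wins; otherwise use the env var.
    key = api_key or env.get("ANTHROPIC_API_KEY")

    if provider == "auto":
        provider = "anthropic" if key else "ollama"
    if provider not in ("ollama", "anthropic"):
        raise ValueError(f"unknown provider: {provider}")
    if provider == "anthropic" and not key:
        # Assumption: fail early rather than at the first heal.
        raise ValueError("anthropic provider needs an API key")
    # The key is only meaningful for the anthropic provider.
    return provider, (key if provider == "anthropic" else None)
```

Accepting `env` as a parameter keeps the function testable without mutating the real process environment.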
Healing Report
After each run, a JSON report is written to reports/:
{
  "total_healings_attempted": 3,
  "successful_healings": 3,
  "failed_healings": 0,
  "attempts": [
    {
      "original_selector": "#user-name",
      "element_purpose": "username input field on login form",
      "suggested_selector": "#username",
      "success": true,
      "timestamp": "2024-01-15T10:23:45.123456",
      "model_response_time_ms": 1840.5,
      "dom_size_chars": 4231,
      "provider": "ollama"
    }
  ]
}
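A report in this shape is easy to post-process. The sketch below is one hedged example: `summarize_report` is a hypothetical helper, it uses only field names that appear in the sample above, and the second attempt in the inline data is invented purely to exercise the failure path.

```python
import json

# Illustrative data: the first attempt mirrors the sample report,
# the second one is made up for this example.
REPORT_JSON = """
{
  "total_healings_attempted": 2,
  "successful_healings": 1,
  "failed_healings": 1,
  "attempts": [
    {"original_selector": "#user-name", "suggested_selector": "#username",
     "success": true, "model_response_time_ms": 1840.5},
    {"original_selector": "#old-cart", "suggested_selector": "#cart",
     "success": false, "model_response_time_ms": 2159.5}
  ]
}
"""


def summarize_report(report: dict) -> dict:
    """Condense a healing report into a few headline numbers."""
    attempts = report["attempts"]
    total = report["total_healings_attempted"]
    times = [a["model_response_time_ms"] for a in attempts]
    return {
        "success_rate": report["successful_healings"] / total if total else 0.0,
        "avg_response_ms": sum(times) / len(times) if times else 0.0,
        # Selectors that still need a human to look at them.
        "failed_selectors": [a["original_selector"] for a in attempts if not a["success"]],
    }


summary = summarize_report(json.loads(REPORT_JSON))
```

The `failed_selectors` list is the part most likely to be useful in CI, since it names exactly which stale selectors the LLM could not repair.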
Evalset: Benchmarking LLM Accuracy
The evalset is a structured ground-truth dataset (src/evals/selector_evalset.json) used to measure how accurately the LLM finds correct selectors. It is independent of the healing tests, so no browser is required.
What's in the evalset
12 cases across 7 categories and 3 difficulty levels:
| Category | Cases | Difficulty |
|---|---|---|
| login | 3 | easy |
| checkout | 2 | medium |
| search | 2 | easy |
| navigation | 1 | easy |
| modal | 2 | medium |
| profile | 1 | hard |
| data-table | 1 | hard |
Each case contains a stale selector, a purpose string, a minimal HTML snippet, and a list of acceptable correct selectors.
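Because each case lists several acceptable selectors, scoring reduces to a membership check. The sketch below shows one way this could work; it is an assumption, not the runner's actual logic (which might, for instance, resolve the suggestion against the HTML snippet instead). `normalize`, `is_correct`, and `accuracy` are hypothetical names, and the `id` and `expected_selectors` keys follow the case format used by the evalset file.

```python
def normalize(selector: str) -> str:
    """Treat single and double quotes, and surrounding whitespace, as equal."""
    return selector.strip().replace('"', "'")


def is_correct(suggested: str, expected_selectors: list) -> bool:
    """A suggestion counts if it matches any acceptable selector."""
    return normalize(suggested) in {normalize(s) for s in expected_selectors}


def accuracy(cases: list, suggestions: dict) -> float:
    """cases: evalset entries; suggestions: case id -> selector from the LLM."""
    if not cases:
        return 0.0
    hits = sum(
        is_correct(suggestions.get(case["id"], ""), case["expected_selectors"])
        for case in cases
    )
    return hits / len(cases)
```

String matching is strict: a selector that is semantically equivalent but written differently from every listed alternative would be scored as wrong, which is one reason to list more than one acceptable selector per case.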
Running the evalset
Standalone runner (fastest, no pytest overhead):
# Against local Ollama
PYTHONPATH=src python src/evals/run_eval.py
# Against Anthropic Claude
PYTHONPATH=src python src/evals/run_eval.py \
--provider anthropic \
--anthropic-model claude-haiku-4-5-20251001
# Filter to a category or difficulty
PYTHONPATH=src python src/evals/run_eval.py --category login
PYTHONPATH=src python src/evals/run_eval.py --difficulty hard
Via pytest (integrates with your existing test flags):
PYTHONPATH=src pytest src/tests/test_evalset.py -v
PYTHONPATH=src pytest src/tests/test_evalset.py -v -k "login"
Comparing two models
Each eval run saves a timestamped report to reports/evals/. Use compare_models.py to diff two runs:
# Run against model A
PYTHONPATH=src python src/evals/run_eval.py --ollama-model qwen2.5-coder:3b
# Run against model B
PYTHONPATH=src python src/evals/run_eval.py --ollama-model llama3
# Compare
python src/evals/compare_models.py \
reports/evals/eval_ollama_20260601_120000.json \
reports/evals/eval_ollama_20260601_120500.json
Output:
Metric A B Delta
-------------------------------------------------------
Accuracy 75.0% 91.7% +16.7%
Avg response (ms) 2340 1820 -520.0
Adding new evalset cases
Open src/evals/selector_evalset.json and append to the cases array. Each case needs:
{
  "id": "unique-slug",
  "category": "login",
  "difficulty": "easy",
  "stale_selector": "#old-btn",
  "purpose": "login submit button",
  "expected_selectors": ["[data-testid='login-btn']", "button[type='submit']"],
  "html": "<minimal HTML snippet containing the target element>"
}
No code changes needed: the runner and pytest integration pick up new cases automatically.
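Since new cases are picked up without any code change, a small sanity check before committing can catch typos early. This is a hedged sketch: `validate_cases` is a hypothetical helper that is not shipped with the package, and the rules it enforces (all seven keys present, one of three difficulty values, unique ids) are inferred from the case format above.

```python
REQUIRED_KEYS = {
    "id", "category", "difficulty", "stale_selector",
    "purpose", "expected_selectors", "html",
}
DIFFICULTIES = {"easy", "medium", "hard"}


def validate_cases(cases: list) -> list:
    """Return human-readable problems; an empty list means the cases look fine."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"case #{index}")
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"{label}: missing {sorted(missing)}")
        if "difficulty" in case and case["difficulty"] not in DIFFICULTIES:
            problems.append(f"{label}: unknown difficulty {case['difficulty']!r}")
        if "expected_selectors" in case and not case["expected_selectors"]:
            problems.append(f"{label}: expected_selectors is empty")
        if "id" in case:
            if case["id"] in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(case["id"])
    return problems
```

Returning a list of messages instead of raising on the first error lets you fix every problem in one pass.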
Architecture Decisions
| Decision | Rationale |
|---|---|
| Local LLM first (Ollama) | No API keys, no data leakage, works offline in CI |
| Anthropic as opt-in cloud backend | Higher accuracy on complex DOMs; useful when RAM is limited |
| auto provider mode | Uses Claude if ANTHROPIC_API_KEY is set, otherwise Ollama, so the same command works locally and in CI |
| DOM compression | Strips scripts/styles, keeps semantic attrs. Fits in small model context (~8KB) |
| Persistent, URL-scoped cache | Keyed by (url, selector) so the same selector on different pages never collides; written to disk so a selector healed once is reused across runs, not just within one |
| Retry with feedback | If a suggestion fails validation, the next prompt names the failed selector and asks for a different one; meaningfully lifts success rate on small models |
| Confidence scores | LLM self-reports certainty; useful for alerting on low confidence heals |
| purpose string | Natural language > brittle heuristics. Tells LLM why you want the element |
| Automatic passthrough | Non-selector Page APIs are delegated via __getattr__, so the wrapper never lags behind Playwright's API |
| Evalset separate from tests | Ground-truth data lives in JSON, not test code; easy to grow and compare across models |
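The DOM compression row is the easiest of these decisions to illustrate in isolation. The following is a minimal sketch using only the standard library, and it is not the package's healing_engine code. `DomCompressor`, `compress_dom`, and the `KEEP_ATTRS` allow-list are all assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list of attributes worth showing to the LLM.
KEEP_ATTRS = {"id", "class", "name", "type", "role", "aria-label",
              "data-testid", "placeholder", "href"}
SKIP_TAGS = {"script", "style"}


class DomCompressor(HTMLParser):
    """Re-emit HTML without scripts, styles, or noisy attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value is not None
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.parts.append(text)


def compress_dom(markup: str, max_chars: int = 8000) -> str:
    """Strip the markup down, then cut it to roughly fit a small context."""
    parser = DomCompressor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)[:max_chars]
```

The final slice is a blunt way to respect the size budget: it can cut through the middle of a tag. A more careful version would drop whole subtrees that are far from the likely target instead.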
Extending
- Swap the LLM: Change --ollama-model=mistral or use --llm-provider=anthropic for Claude
- Tune retries: Raise --healing-max-attempts on small/local models, lower it to 1 to fail fast
- Alert on low confidence: Check attempt["confidence"] == "low" in the report and open a GitHub issue automatically
- Grow the evalset: Add cases to selector_evalset.json to cover your app's specific UI patterns
- CI accuracy gate: Run run_eval.py in CI and fail the build if accuracy drops below a threshold
- Auto-PR on heal: Use a high-confidence heal as the trigger to open a PR updating the stale selector at its source
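As one example of the ideas above, a CI accuracy gate can be a few lines of Python. This is a sketch under loud assumptions: `gate` is a hypothetical helper, and the `accuracy` key (a fraction between 0 and 1) is a guess at the eval report layout, which this README does not show. Check a real file under reports/evals/ before relying on it.

```python
def gate(report: dict, threshold: float = 0.8) -> int:
    """Return a process exit code: 0 if accuracy meets the threshold, else 1."""
    # Hypothetical key: adjust to match the actual eval report.
    accuracy = report["accuracy"]
    if accuracy < threshold:
        print(f"FAIL: accuracy {accuracy:.1%} is below {threshold:.1%}")
        return 1
    print(f"OK: accuracy {accuracy:.1%}")
    return 0
```

In a CI job you would load the newest report with `json.load` and pass the result of `gate(...)` to `sys.exit`, so that a drop in accuracy fails the build.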