Auto-heal broken Playwright selectors using a local or cloud LLM

🛠 Self-Healing Test Automation Framework

A Playwright wrapper that uses a local or cloud LLM to automatically fix broken CSS selectors: no flaky CI pipelines, no manual triaging.


The Problem

UI changes break test selectors constantly:

TimeoutError: page.click: Timeout 30000ms exceeded.
  waiting for selector "#submit-btn"

The button still exists; it's just [data-testid="login-submit"] now. A human would fix it in 10 seconds. But at 3 AM in CI, it blocks your entire pipeline.


How It Works

Test runs selector  →  TimeoutError  →  DOM snapshot captured
        ↓
  DOM compressed (scripts/styles stripped, ~8KB)
        ↓
  Prompt sent to LLM (local Ollama or Anthropic Claude)
        ↓
  LLM returns: { "selector": "#new-id", "confidence": "high" }
        ↓
  New selector validated in Playwright
        ↓                          ↘ invalid? retry with feedback
  Test continues ✅                  ("that selector didn't match; try another")
        ↓
  Result cached to disk, keyed by (url, selector), reused across runs
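
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the package's actual healing_engine.py: compress_dom, heal_selector, ask_llm, and is_valid are hypothetical names, the regex-based stripping is an assumed stand-in for the real DOM compression, and the prompt wording is invented. ask_llm is any callable that takes a prompt and returns the model's raw reply; is_valid is any callable that reports whether a selector matches in the live page.

```python
import json
import re
from typing import Callable, Optional


def compress_dom(html: str, max_chars: int = 8000) -> str:
    """Drop <script>/<style> blocks and comments, collapse whitespace, truncate."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    html = re.sub(r"\s+", " ", html).strip()
    return html[:max_chars]


def heal_selector(
    stale: str,
    purpose: str,
    html: str,
    ask_llm: Callable[[str], str],    # hypothetical: prompt -> raw model reply
    is_valid: Callable[[str], bool],  # hypothetical: does the selector match?
    max_attempts: int = 2,
) -> Optional[str]:
    """Ask for a replacement selector, feeding failures back into the next prompt."""
    dom = compress_dom(html)
    feedback = ""
    for _ in range(max_attempts):
        prompt = (
            f"The selector {stale!r} no longer matches. Purpose: {purpose}.\n"
            f'{feedback}Reply with JSON like {{"selector": "...", "confidence": "high"}}.\n'
            f"DOM:\n{dom}"
        )
        try:
            candidate = json.loads(ask_llm(prompt))["selector"]
        except (ValueError, KeyError, TypeError):
            # Malformed reply: tell the model and use up one attempt.
            feedback = "Your last reply was not valid JSON with a 'selector' key.\n"
            continue
        if is_valid(candidate):
            return candidate
        feedback = f"The selector {candidate!r} didn't match; try another.\n"
    return None  # caller decides whether to re-raise the original TimeoutError
```

Passing the LLM call and the validation step in as callables keeps the loop testable without a browser or a model, which is one plausible way to structure unit tests for this logic.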

Project Structure

pytest-self-healer/
├── src/
│   ├── pytest_self_healer/            # Installable package (pip install pytest-self-healer)
│   │   ├── __init__.py
│   │   ├── plugin.py                  # pytest entry point (fixtures + CLI options)
│   │   ├── healing_engine.py          # Core: LLM clients, DOM compression, healing logic
│   │   └── page_wrapper.py            # SelfHealingPage: drop-in Playwright Page replacement
│   ├── evals/
│   │   ├── selector_evalset.json      # Ground-truth dataset for LLM accuracy benchmarking
│   │   ├── run_eval.py                # Standalone eval runner (scores + saves report)
│   │   └── compare_models.py          # Diff two eval reports side by side
│   ├── tests/
│   │   ├── test_healing_examples.py   # Integration tests with intentionally stale selectors
│   │   ├── test_evalset.py            # pytest integration for the evalset
│   │   ├── test_accuracy.py           # LLM accuracy benchmarks (3 tiers)
│   │   └── test_unit.py               # Unit tests (no browser/LLM required)
│   └── conftest.py                    # pytest fixtures, CLI options, report hook
├── docker/
│   ├── Dockerfile                     # Test runner image (Playwright + Python)
│   └── docker-compose.yml             # Ollama + test runner, health-checked
├── reports/
│   ├── healing_report_<ts>.json       # Per-run healing reports
│   └── evals/
│       └── eval_<provider>_<ts>.json  # Per-run eval reports
├── requirements.txt
├── pytest.ini
└── README.md

Quickstart

Option 1: Unit tests only (no browser or LLM needed)

pip install -r requirements.txt
playwright install chromium
PYTHONPATH=src pytest src/tests/test_unit.py -v

Option 2: Full integration tests (requires Ollama running locally)

# Install and start Ollama
brew install ollama        # or: curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5-coder:3b

# Run the tests
PYTHONPATH=src pytest src/tests/ -v \
  --ollama-url=http://localhost:11434 \
  --ollama-model=qwen2.5-coder:3b

Option 3: Use Anthropic Claude instead of Ollama

export ANTHROPIC_API_KEY=sk-ant-...

PYTHONPATH=src pytest src/tests/ -v \
  --llm-provider=anthropic \
  --anthropic-model=claude-haiku-4-5-20251001

Option 4: Docker (everything bundled)

docker compose -f docker/docker-compose.yml up --build
# Reports land in ./reports/healing_report_<timestamp>.json

Writing Your Own Healing Tests

Use the healing_page fixture (a SelfHealingPage) in place of Playwright's page, and add a purpose string to every interaction:

# After: pip install pytest-self-healer
# No import needed: the healing_page fixture is auto-available

async def test_checkout(healing_page):
    await healing_page.goto("https://myapp.com/checkout")

    # Selector is stale: the LLM will find the real one
    await healing_page.click(
        selector="button#old-checkout-id",
        purpose="checkout submit button in the cart summary",
    )

    await healing_page.fill(
        selector="input.card-num",
        value="4242424242424242",
        purpose="credit card number input in payment form",
    )

Tips for better healing:

  • Be specific in purpose: "blue submit button in the login modal" > "button"
  • Use data-testid attributes in your app for stable baseline selectors
  • The LLM favors data-testid > aria-label > id > semantic CSS

Healing-aware actions: click, dblclick, hover, fill, type, press, check, uncheck, select_option, focus, tap, set_input_files, text_content, inner_text, input_value, get_attribute, is_visible, is_enabled, wait_for_selector, and drag_and_drop (which heals both the source and the target). Any other Playwright Page method (goto, keyboard, mouse, wait_for_load_state, ...) is transparently delegated to the underlying page, so SelfHealingPage is a true drop-in.
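
The delegation described above can be illustrated with Python's __getattr__ hook. The sketch below is a simplified, synchronous stand-in, not the package's real SelfHealingPage: HealingProxy and its heal callable are hypothetical names, and only click is shown as a healing-aware action.

```python
class HealingProxy:
    """Illustrative stand-in for a healing page wrapper (hypothetical class)."""

    def __init__(self, page, heal):
        self._page = page
        self._heal = heal  # hypothetical callable: (selector, purpose) -> working selector

    def click(self, selector, purpose="", **kwargs):
        # Healing-aware action: resolve the selector first, then delegate.
        return self._page.click(self._heal(selector, purpose), **kwargs)

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so click()
        # above is never shadowed and every other attribute reaches the page.
        if name.startswith("_"):
            # Avoid infinite recursion if _page is accessed before __init__ ran.
            raise AttributeError(name)
        return getattr(self._page, name)
```

Because unknown attributes are looked up on the wrapped page at call time, a wrapper built this way does not need a new release each time the underlying Page API gains a method.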


CLI Options

  Flag                    Default                      Description
  ----------------------------------------------------------------------------------------------------------
  --llm-provider          ollama                       ollama | anthropic | auto
  --ollama-url            http://localhost:11434       Ollama server endpoint
  --ollama-model          qwen2.5-coder:3b             Model name (also works with llama3, mistral)
  --anthropic-model       claude-haiku-4-5-20251001    Any Claude model ID
  --anthropic-api-key     None                         Falls back to ANTHROPIC_API_KEY env var
  --healing-report-dir    reports                      Where to write JSON healing reports
  --screenshot-dir        reports/screenshots          Where to write BEFORE/AFTER screenshots
  --healing-max-attempts  2                            How many times the LLM may retry a heal with feedback
  --selector-cache-file   reports/selector_cache.json  Persistent selector cache, reused across runs
  --no-selector-cache     false                        Disable the persistent cache for this run
  --headless              true                         Run browser headless

Healing Report

After each run, a JSON report is written to reports/:

{
  "total_healings_attempted": 3,
  "successful_healings": 3,
  "failed_healings": 0,
  "attempts": [
    {
      "original_selector": "#user-name",
      "element_purpose": "username input field on login form",
      "suggested_selector": "#username",
      "success": true,
      "timestamp": "2024-01-15T10:23:45.123456",
      "model_response_time_ms": 1840.5,
      "dom_size_chars": 4231,
      "provider": "ollama"
    }
  ]
}
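
A report in this shape is easy to post-process. The following sketch, which assumes only the fields shown in the sample above, lists the selectors that failed to heal and the average model response time. summarize_report is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path


def summarize_report(path):
    """Return failed selectors and average LLM latency from a healing report."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    attempts = report.get("attempts", [])
    failed = [a for a in attempts if not a.get("success")]
    times = [
        a["model_response_time_ms"]
        for a in attempts
        if "model_response_time_ms" in a
    ]
    return {
        "attempted": report.get("total_healings_attempted", len(attempts)),
        "failed_selectors": [a["original_selector"] for a in failed],
        "avg_response_ms": sum(times) / len(times) if times else None,
    }
```

The list of failed selectors is a natural input for a CI annotation or a follow-up ticket, since those are the selectors a human still has to fix.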

Evalset: Benchmarking LLM Accuracy

The evalset is a structured ground-truth dataset (src/evals/selector_evalset.json) used to measure how accurately the LLM finds correct selectors. It is independent of the healing tests and needs no browser.

What's in the evalset

12 cases across 7 categories and 3 difficulty levels:

  Category      Cases  Difficulty
  -------------------------------
  login         3      easy
  checkout      2      medium
  search        2      easy
  navigation    1      easy
  modal         2      medium
  profile       1      hard
  data-table    1      hard

Each case contains a stale selector, a purpose string, a minimal HTML snippet, and a list of acceptable correct selectors.
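
One simple way to score a case is to check whether the model's suggestion appears in the list of acceptable selectors. The sketch below assumes exact matching after normalizing quote style; the real runner may score differently (for example by querying the HTML snippet), and is_correct and accuracy are hypothetical helpers. The id and expected_selectors keys follow the case format shown under "Adding new evalset cases".

```python
def _normalize(selector):
    # Treat single and double quotes as equivalent and ignore outer whitespace.
    return selector.strip().replace('"', "'")


def is_correct(suggested, expected_selectors):
    """True if the suggestion matches any acceptable selector."""
    return _normalize(suggested) in {_normalize(e) for e in expected_selectors}


def accuracy(cases, suggestions):
    """cases: list of evalset case dicts; suggestions: {case id: selector}."""
    if not cases:
        return 0.0
    hits = sum(
        is_correct(suggestions.get(case["id"], ""), case["expected_selectors"])
        for case in cases
    )
    return hits / len(cases)
```

Listing several acceptable selectors per case matters under this kind of scoring, because a button can be targeted correctly by its data-testid, its type, or its role.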

Running the evalset

Standalone runner (fastest, no pytest overhead):

# Against local Ollama
PYTHONPATH=src python src/evals/run_eval.py

# Against Anthropic Claude
PYTHONPATH=src python src/evals/run_eval.py \
  --provider anthropic \
  --anthropic-model claude-haiku-4-5-20251001

# Filter to a category or difficulty
PYTHONPATH=src python src/evals/run_eval.py --category login
PYTHONPATH=src python src/evals/run_eval.py --difficulty hard

Via pytest (integrates with your existing test flags):

PYTHONPATH=src pytest src/tests/test_evalset.py -v
PYTHONPATH=src pytest src/tests/test_evalset.py -v -k "login"

Comparing two models

Each eval run saves a timestamped report to reports/evals/. Use compare_models.py to diff two runs:

# Run against model A
PYTHONPATH=src python src/evals/run_eval.py --ollama-model qwen2.5-coder:3b

# Run against model B
PYTHONPATH=src python src/evals/run_eval.py --ollama-model llama3

# Compare
python src/evals/compare_models.py \
  reports/evals/eval_ollama_20260601_120000.json \
  reports/evals/eval_ollama_20260601_120500.json

Output:

  Metric                          A          B     Delta
  -------------------------------------------------------
  Accuracy                    75.0%      91.7%    +16.7%
  Avg response (ms)            2340       1820     -520.0

Adding new evalset cases

Open src/evals/selector_evalset.json and append to the cases array. Each case needs:

{
  "id": "unique-slug",
  "category": "login",
  "difficulty": "easy",
  "stale_selector": "#old-btn",
  "purpose": "login submit button",
  "expected_selectors": ["[data-testid='login-btn']", "button[type='submit']"],
  "html": "<minimal HTML snippet containing the target element>"
}

No code changes needed: the runner and pytest integration pick up new cases automatically.
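
Since cases are plain JSON, a small pre-commit check can catch a missing key or a duplicate id before an eval run. This is a hypothetical helper, not part of the package; the required keys are taken from the example case above.

```python
REQUIRED_KEYS = {
    "id", "category", "difficulty", "stale_selector",
    "purpose", "expected_selectors", "html",
}


def check_cases(cases):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index}: missing {sorted(missing)}")
        case_id = case.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"case {index}: duplicate id {case_id!r}")
            seen_ids.add(case_id)
        if "expected_selectors" in case and not case["expected_selectors"]:
            problems.append(f"case {index}: expected_selectors is empty")
    return problems
```

To use it, load the file with json.load and pass the cases array, for example check_cases(data["cases"]), assuming the top-level object stores the array under a "cases" key as the text above suggests.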


Architecture Decisions

  Decision                            Rationale
  ----------------------------------------------------------------------------------------------------
  Local LLM first (Ollama)            No API keys, no data leakage, works offline in CI
  Anthropic as opt-in cloud backend   Higher accuracy on complex DOMs; useful when RAM is limited
  auto provider mode                  Uses Claude if ANTHROPIC_API_KEY is set, otherwise Ollama, so the same command works locally and in CI
  DOM compression                     Strips scripts/styles, keeps semantic attrs. Fits in small model context (~8KB)
  Persistent, URL-scoped cache        Keyed by (url, selector) so the same selector on different pages never collides; written to disk so a selector healed once is reused across runs, not just within one
  Retry with feedback                 If a suggestion fails validation, the next prompt names the failed selector and asks for a different one, which meaningfully lifts success rate on small models
  Confidence scores                   LLM self-reports certainty; useful for alerting on low-confidence heals
  purpose string                      Natural language > brittle heuristics. Tells the LLM why you want the element
  Automatic passthrough               Non-selector Page APIs are delegated via __getattr__, so the wrapper never lags behind Playwright's API
  Evalset separate from tests         Ground-truth data lives in JSON, not test code, so it is easy to grow and compare across models
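
The persistent, URL-scoped cache can be pictured as a JSON file keyed by the (url, selector) pair. The sketch below is an assumption about one workable layout, not the format the package actually writes to reports/selector_cache.json; SelectorCache and its methods are hypothetical names.

```python
import json
from pathlib import Path


class SelectorCache:
    """Disk-backed map from (url, stale selector) to a healed selector."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self._data = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self._data = {}

    @staticmethod
    def _key(url, selector):
        # JSON object keys must be strings, so encode the pair as a JSON array.
        return json.dumps([url, selector])

    def get(self, url, selector):
        return self._data.get(self._key(url, selector))

    def put(self, url, selector, healed):
        self._data[self._key(url, selector)] = healed
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self._data, indent=2), encoding="utf-8")
```

Including the URL in the key is what prevents a collision when two pages reuse the same stale selector for different elements, and writing on every put means a crash mid-run does not lose earlier heals.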

Extending

  • Swap the LLM: Pass --ollama-model=mistral to try another Ollama model, or --llm-provider=anthropic to use Claude
  • Tune retries: Raise --healing-max-attempts on small/local models, lower it to 1 to fail fast
  • Alert on low confidence: Check attempt["confidence"] == "low" in the report and open a GitHub issue automatically
  • Grow the evalset: Add cases to selector_evalset.json to cover your app's specific UI patterns
  • CI accuracy gate: Run run_eval.py in CI and fail the build if accuracy drops below a threshold
  • Auto-PR on heal: Use a high-confidence heal as the trigger to open a PR updating the stale selector at its source
