
Auto-heal broken Playwright selectors using a local or cloud LLM


🛠 Self-Healing Test Automation Framework

A Playwright wrapper that uses a local or cloud LLM to automatically fix broken CSS selectors: no flaky CI pipelines, no manual triaging.


The Problem

UI changes break test selectors constantly:

TimeoutError: page.click: Timeout 30000ms exceeded.
  waiting for selector "#submit-btn"

The button still exists; it is just [data-testid="login-submit"] now. A human would fix it in 10 seconds, but at 3 AM in CI it blocks your entire pipeline.


How It Works

Test runs selector  →  TimeoutError  →  DOM snapshot captured
        ↓
  DOM compressed (scripts/styles stripped, ~8KB)
        ↓
  Prompt sent to LLM (local Ollama or Anthropic Claude)
        ↓
  LLM returns: { "selector": "#new-id", "confidence": "high" }
        ↓
  New selector validated in Playwright
        ↓
  Test continues ✅  +  Result cached for reuse
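
The DOM compression step above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual code: the name compress_dom, the attribute allow-list, and the 8,000-character cap are assumptions chosen to match the "scripts/styles stripped, ~8KB" description.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allow-list of "semantic" attributes; the real framework's list may differ.
KEEP_ATTRS = {"id", "class", "name", "type", "role", "href",
              "placeholder", "aria-label", "data-testid"}
SKIP_TAGS = {"script", "style", "noscript"}
VOID_TAGS = {"input", "img", "br", "hr", "meta", "link"}


class _Compressor(HTMLParser):
    """Re-emits the document without skipped tags or non-semantic attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value is not None
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.out.append(text)


def compress_dom(markup: str, max_chars: int = 8000) -> str:
    """Strip scripts/styles and noisy attributes, then cap the length."""
    parser = _Compressor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)[:max_chars]
```

Truncating by character count is the crudest possible cap; a real implementation would more likely prune whole subtrees so the prompt never ends mid-tag.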

Project Structure

pytest-self-healer/
├── src/
│   ├── pytest_self_healer/            # Installable package (pip install pytest-self-healer)
│   │   ├── __init__.py
│   │   ├── plugin.py                  # pytest entry point (fixtures + CLI options)
│   │   ├── healing_engine.py          # Core: LLM clients, DOM compression, healing logic
│   │   └── page_wrapper.py            # SelfHealingPage: drop-in Playwright Page replacement
│   ├── evals/
│   │   ├── selector_evalset.json      # Ground-truth dataset for LLM accuracy benchmarking
│   │   ├── run_eval.py                # Standalone eval runner (scores + saves report)
│   │   └── compare_models.py          # Diff two eval reports side by side
│   ├── tests/
│   │   ├── test_healing_examples.py   # Integration tests with intentionally stale selectors
│   │   ├── test_evalset.py            # pytest integration for the evalset
│   │   ├── test_accuracy.py           # LLM accuracy benchmarks (3 tiers)
│   │   └── test_unit.py               # Unit tests (no browser/LLM required)
│   └── conftest.py                    # pytest fixtures, CLI options, report hook
├── docker/
│   ├── Dockerfile                     # Test runner image (Playwright + Python)
│   └── docker-compose.yml             # Ollama + test runner, health-checked
├── reports/
│   ├── healing_report_<ts>.json       # Per-run healing reports
│   └── evals/
│       └── eval_<provider>_<ts>.json  # Per-run eval reports
├── requirements.txt
├── pytest.ini
└── README.md

Quickstart

Option 1: Unit tests only (no browser or LLM needed)

pip install -r requirements.txt
playwright install chromium
PYTHONPATH=src pytest src/tests/test_unit.py -v

Option 2: Full integration tests (requires Ollama running locally)

# Install and start Ollama
brew install ollama        # or: curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5-coder:3b

# Run the tests
PYTHONPATH=src pytest src/tests/ -v \
  --ollama-url=http://localhost:11434 \
  --ollama-model=qwen2.5-coder:3b

Option 3: Use Anthropic Claude instead of Ollama

export ANTHROPIC_API_KEY=sk-ant-...

PYTHONPATH=src pytest src/tests/ -v \
  --llm-provider=anthropic \
  --anthropic-model=claude-haiku-4-5-20251001

Option 4: Docker (everything bundled)

docker compose -f docker/docker-compose.yml up --build
# Reports land in ./reports/healing_report_<timestamp>.json

Writing Your Own Healing Tests

Use the healing_page fixture (a SelfHealingPage) in place of Playwright's page, and add a purpose string to every interaction:

# After: pip install pytest-self-healer
# No import needed: the healing_page fixture is auto-available

async def test_checkout(healing_page):
    await healing_page.goto("https://myapp.com/checkout")

    # Selector is stale; the LLM will find the real one
    await healing_page.click(
        selector="button#old-checkout-id",
        purpose="checkout submit button in the cart summary",
    )

    await healing_page.fill(
        selector="input.card-num",
        value="4242424242424242",
        purpose="credit card number input in payment form",
    )

Tips for better healing:

  • Be specific in purpose: "blue submit button in the login modal" > "button"
  • Use data-testid attributes in your app for stable baseline selectors
  • The LLM favors data-testid > aria-label > id > semantic CSS
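
The preference order in the last tip can also be used to review healed selectors after a run. The helper below is hypothetical (selector_rank is not part of the package) and only classifies a selector string with simple pattern checks, so treat it as a rough sketch.

```python
import re

def selector_rank(selector: str) -> int:
    """Rank a selector by the stated preference order; lower is better.

    0 = data-testid, 1 = aria-label, 2 = id, 3 = any other (semantic) CSS.
    """
    if "data-testid" in selector:
        return 0
    if "aria-label" in selector:
        return 1
    # Drop [attr=value] parts first so a '#' inside an attribute value
    # (for example href="#top") is not mistaken for an id selector.
    without_attrs = re.sub(r"\[[^\]]*\]", "", selector)
    if re.search(r"#[A-Za-z_][\w-]*", without_attrs):
        return 2
    return 3

candidates = [
    "button.primary",
    "#submit",
    "[aria-label='Submit']",
    "[data-testid='login-submit']",
]
# Most stable selector first, according to the preference order above.
ranked = sorted(candidates, key=selector_rank)
```

Sorting healed selectors this way makes it easy to spot heals that fell back to plain CSS and may deserve a data-testid in the app.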

CLI Options

Flag                  Default                      Description
--llm-provider        ollama                       ollama | anthropic | auto
--ollama-url          http://localhost:11434       Ollama server endpoint
--ollama-model        qwen2.5-coder:3b             Model name (also works with llama3, mistral)
--anthropic-model     claude-haiku-4-5-20251001    Any Claude model ID
--anthropic-api-key   None                         Falls back to ANTHROPIC_API_KEY env var
--healing-report-dir  reports                      Where to write JSON healing reports
--screenshot-dir      reports/screenshots          Where to write BEFORE/AFTER screenshots
--headless            true                         Run browser headless

Healing Report

After each run, a JSON report is written to reports/:

{
  "total_healings_attempted": 3,
  "successful_healings": 3,
  "failed_healings": 0,
  "attempts": [
    {
      "original_selector": "#user-name",
      "element_purpose": "username input field on login form",
      "suggested_selector": "#username",
      "success": true,
      "timestamp": "2024-01-15T10:23:45.123456",
      "model_response_time_ms": 1840.5,
      "dom_size_chars": 4231,
      "provider": "ollama"
    }
  ]
}
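
A report in this shape is easy to post-process. The sketch below uses only the field names shown in the example above; summarize_report and its output keys are hypothetical names, not part of the package.

```python
def summarize_report(report: dict) -> dict:
    """Condense a healing report into a few headline numbers."""
    attempts = report.get("attempts", [])
    total = report.get("total_healings_attempted", len(attempts))
    succeeded = report.get(
        "successful_healings",
        sum(1 for attempt in attempts if attempt.get("success")),
    )
    times = [
        attempt["model_response_time_ms"]
        for attempt in attempts
        if "model_response_time_ms" in attempt
    ]
    return {
        # With nothing to heal, report a perfect rate rather than divide by zero.
        "success_rate": succeeded / total if total else 1.0,
        "avg_response_ms": sum(times) / len(times) if times else 0.0,
        "failed_selectors": [
            attempt["original_selector"]
            for attempt in attempts
            if not attempt.get("success")
        ],
    }
```

In practice the dict would come from json.load on a file under reports/; the list of failed selectors is the part worth surfacing in CI output.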

Evalset: Benchmarking LLM Accuracy

The evalset is a structured ground-truth dataset (src/evals/selector_evalset.json) used to measure how accurately the LLM finds correct selectors. It is independent of the healing tests and requires no browser.

What's in the evalset

12 cases across 7 categories and 3 difficulty levels:

Category     Cases  Difficulty
login        3      easy
checkout     2      medium
search       2      easy
navigation   1      easy
modal        2      medium
profile      1      hard
data-table   1      hard

Each case contains a stale selector, a purpose string, a minimal HTML snippet, and a list of acceptable correct selectors.
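
Scoring a case then reduces to checking the model's answer against the acceptable list. The sketch below assumes a plain string comparison after light normalization; the actual runner may score differently (for example by querying the HTML snippet), and normalize and is_correct are hypothetical names.

```python
def normalize(selector: str) -> str:
    """Make cosmetic variants compare equal: quote style and extra spaces."""
    return " ".join(selector.replace('"', "'").split())

def is_correct(suggested: str, case: dict) -> bool:
    """True if the suggestion matches any acceptable selector for the case."""
    accepted = {normalize(s) for s in case["expected_selectors"]}
    return normalize(suggested) in accepted

def accuracy(outcomes: list) -> float:
    """Fraction of True values in a list of per-case results."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

case = {
    "id": "unique-slug",
    "stale_selector": "#old-btn",
    "purpose": "login submit button",
    "expected_selectors": ["[data-testid='login-btn']", "button[type='submit']"],
}
```

Listing several acceptable selectors per case matters here: a model that answers button[type='submit'] instead of the data-testid form is still counted as correct.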

Running the evalset

Standalone runner (fastest, no pytest overhead):

# Against local Ollama
PYTHONPATH=src python src/evals/run_eval.py

# Against Anthropic Claude
PYTHONPATH=src python src/evals/run_eval.py \
  --provider anthropic \
  --anthropic-model claude-haiku-4-5-20251001

# Filter to a category or difficulty
PYTHONPATH=src python src/evals/run_eval.py --category login
PYTHONPATH=src python src/evals/run_eval.py --difficulty hard

Via pytest (integrates with your existing test flags):

PYTHONPATH=src pytest src/tests/test_evalset.py -v
PYTHONPATH=src pytest src/tests/test_evalset.py -v -k "login"

Comparing two models

Each eval run saves a timestamped report to reports/evals/. Use compare_models.py to diff two runs:

# Run against model A
PYTHONPATH=src python src/evals/run_eval.py --ollama-model qwen2.5-coder:3b

# Run against model B
PYTHONPATH=src python src/evals/run_eval.py --ollama-model llama3

# Compare
python src/evals/compare_models.py \
  reports/evals/eval_ollama_20260601_120000.json \
  reports/evals/eval_ollama_20260601_120500.json

Output:

  Metric                          A          B     Delta
  -------------------------------------------------------
  Accuracy                    75.0%      91.7%    +16.7%
  Avg response (ms)            2340       1820     -520.0
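
The arithmetic behind that table is simple enough to sketch. The eval report schema is not shown in this document, so the "accuracy" and "avg_response_ms" keys below are assumptions, and compare_summaries is a hypothetical helper rather than the code in compare_models.py.

```python
def compare_summaries(a: dict, b: dict) -> list:
    """Return (metric, A, B, delta) rows for two eval summaries.

    Assumes each summary holds 'accuracy' as a fraction in [0, 1] and
    'avg_response_ms' as a number; both key names are illustrative.
    """
    accuracy_delta = (b["accuracy"] - a["accuracy"]) * 100  # percentage points
    latency_delta = b["avg_response_ms"] - a["avg_response_ms"]
    return [
        ("Accuracy",
         f"{a['accuracy'] * 100:.1f}%",
         f"{b['accuracy'] * 100:.1f}%",
         f"{accuracy_delta:+.1f}%"),
        ("Avg response (ms)",
         f"{a['avg_response_ms']:.0f}",
         f"{b['avg_response_ms']:.0f}",
         f"{latency_delta:+.1f}"),
    ]

# 9 of 12 cases correct for model A, 11 of 12 for model B.
rows = compare_summaries(
    {"accuracy": 9 / 12, "avg_response_ms": 2340},
    {"accuracy": 11 / 12, "avg_response_ms": 1820},
)
for metric, col_a, col_b, delta in rows:
    print(f"  {metric:<22}{col_a:>10}{col_b:>11}{delta:>10}")
```

With 12 cases, each additional correct answer moves accuracy by about 8.3 percentage points, so small deltas on this evalset should be read with that granularity in mind.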

Adding new evalset cases

Open src/evals/selector_evalset.json and append to the cases array. Each case needs:

{
  "id": "unique-slug",
  "category": "login",
  "difficulty": "easy",
  "stale_selector": "#old-btn",
  "purpose": "login submit button",
  "expected_selectors": ["[data-testid='login-btn']", "button[type='submit']"],
  "html": "<minimal HTML snippet containing the target element>"
}

No code changes are needed: the runner and pytest integration pick up new cases automatically.
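
Because new cases are picked up without code changes, a malformed entry only surfaces at run time. A small pre-flight check can catch that earlier. The sketch below is hypothetical (validate_case and validate_evalset are not part of the package) and checks only the keys listed above plus the three difficulty levels used in the evalset.

```python
REQUIRED_KEYS = {
    "id", "category", "difficulty", "stale_selector",
    "purpose", "expected_selectors", "html",
}
DIFFICULTIES = {"easy", "medium", "hard"}

def validate_case(case: dict) -> list:
    """Return a list of human-readable problems; empty means the case is fine."""
    problems = []
    missing = REQUIRED_KEYS - case.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    if "difficulty" in case and case["difficulty"] not in DIFFICULTIES:
        problems.append(
            "difficulty must be one of: " + ", ".join(sorted(DIFFICULTIES))
        )
    expected = case.get("expected_selectors")
    if "expected_selectors" in case and (
        not isinstance(expected, list) or not expected
    ):
        problems.append("expected_selectors must be a non-empty list")
    return problems

def validate_evalset(cases: list) -> list:
    """Validate every case and flag duplicate ids."""
    problems, seen = [], set()
    for index, case in enumerate(cases):
        label = case.get("id", f"case[{index}]")
        problems.extend(f"{label}: {p}" for p in validate_case(case))
        if "id" in case:
            if case["id"] in seen:
                problems.append(f"{label}: duplicate id")
            seen.add(case["id"])
    return problems
```

Running validate_evalset on the loaded cases array before an eval run keeps a typo in one case from skewing the accuracy numbers.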


Architecture Decisions

Decision                           Rationale
Local LLM first (Ollama)           No API keys, no data leakage, works offline in CI
Anthropic as opt-in cloud backend  Higher accuracy on complex DOMs; useful when RAM is limited
auto provider mode                 Uses Claude if ANTHROPIC_API_KEY is set, otherwise Ollama, so the same command works locally and in CI
DOM compression                    Strips scripts/styles, keeps semantic attrs. Fits in small model context (~8KB)
Selector caching                   Avoids repeated LLM calls for the same broken selector in one run
Confidence scores                  LLM self-reports certainty; useful for alerting on low-confidence heals
purpose string                     Natural language > brittle heuristics. Tells LLM why you want the element
Evalset separate from tests        Ground-truth data lives in JSON, not test code, so it is easy to grow and compare across models
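
The auto provider rule described above fits in a few lines. This is a sketch of the stated behavior only: resolve_provider is a hypothetical name, and the env parameter exists purely so the logic can be exercised without touching the real environment.

```python
import os

def resolve_provider(requested: str = "ollama", api_key=None, env=None) -> str:
    """Pick the LLM backend for a run.

    An explicit 'ollama' or 'anthropic' is returned unchanged. 'auto'
    selects Claude when an API key is available and Ollama otherwise.
    """
    env = os.environ if env is None else env
    if requested != "auto":
        return requested
    key = api_key or env.get("ANTHROPIC_API_KEY")
    return "anthropic" if key else "ollama"
```

Resolving the provider once at session start, rather than per healing attempt, keeps a run from mixing backends if the environment changes mid-run.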

Extending

  • Swap the LLM: Pass --ollama-model=mistral, or use --llm-provider=anthropic for Claude
  • Persist the cache: Serialize engine._cache to reports/selector_cache.json between runs
  • Alert on low confidence: Check attempt["confidence"] == "low" in the report and open a GitHub issue automatically
  • Grow the evalset: Add cases to selector_evalset.json to cover your app's specific UI patterns
  • CI accuracy gate: Run run_eval.py in CI and fail the build if accuracy drops below a threshold

