Auto-heal broken Playwright selectors using a local or cloud LLM

🛠 Self-Healing Test Automation Framework

A Playwright wrapper that uses a local or cloud LLM to automatically fix broken CSS selectors — no flaky CI pipelines, no manual triaging.

The Problem

UI changes break test selectors constantly:

TimeoutError: page.click: Timeout 30000ms exceeded.
  waiting for selector "#submit-btn"

The button still exists — it's just [data-testid="login-submit"] now. A human would fix it in 10 seconds. But at 3 AM in CI, it blocks your entire pipeline.

How It Works

Test runs selector  →  TimeoutError  →  DOM snapshot captured
        ↓
  DOM compressed (scripts/styles stripped, ~8KB)
        ↓
  Prompt sent to LLM (local Ollama or Anthropic Claude)
        ↓
  LLM returns: { "selector": "#new-id", "confidence": "high" }
        ↓
  New selector validated in Playwright
        ↓                          ↘ invalid? retry with feedback
  Test continues ✅                  ("that selector didn't match — try another")
        ↓
  Result cached to disk, keyed by (url, selector) — reused across runs
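
The validate-and-retry part of the flow above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: `heal_selector`, `ask_llm`, and `matches` are hypothetical names, the LLM and the Playwright check are replaced by stand-in callables, and `max_attempts` is assumed to count total LLM calls.

```python
from typing import Callable, Optional


def heal_selector(
    stale_selector: str,
    purpose: str,
    dom: str,
    ask_llm: Callable[[str], dict],   # hypothetical: prompt -> {"selector": ..., "confidence": ...}
    matches: Callable[[str], bool],   # hypothetical: does the selector match in the page?
    max_attempts: int = 2,
) -> Optional[str]:
    """Ask for a replacement selector, feeding failures back into the next prompt."""
    feedback = ""
    for _ in range(max_attempts):
        prompt = (
            f"The selector {stale_selector!r} no longer matches.\n"
            f"Element purpose: {purpose}\n"
            f"{feedback}"
            f"DOM:\n{dom}"
        )
        reply = ask_llm(prompt)
        candidate = reply.get("selector", "")
        if candidate and matches(candidate):
            return candidate
        # Name the failed suggestion so the next attempt tries something else.
        feedback = f"The selector {candidate!r} didn't match; try another.\n"
    return None


# Usage with stand-ins for the LLM and the browser-side validation.
replies = iter([
    {"selector": "#wrong-id", "confidence": "low"},
    {"selector": '[data-testid="login-submit"]', "confidence": "high"},
])
prompts = []


def fake_llm(prompt: str) -> dict:
    prompts.append(prompt)
    return next(replies)


healed = heal_selector(
    "#submit-btn",
    "login submit button",
    '<button data-testid="login-submit">Log in</button>',
    ask_llm=fake_llm,
    matches=lambda sel: sel == '[data-testid="login-submit"]',
)
print(healed)
```

The first suggestion fails validation, so the second prompt carries the failed selector as feedback before the corrected one is accepted.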

Project Structure

pytest-self-healer/
├── src/
│   ├── pytest_self_healer/            # Installable package (pip install pytest-self-healer)
│   │   ├── __init__.py
│   │   ├── plugin.py                  # pytest entry point (fixtures + CLI options)
│   │   ├── healing_engine.py          # Core: LLM clients, DOM compression, healing logic
│   │   └── page_wrapper.py            # SelfHealingPage: drop-in Playwright Page replacement
│   ├── evals/
│   │   ├── selector_evalset.json      # Ground-truth dataset for LLM accuracy benchmarking
│   │   ├── run_eval.py                # Standalone eval runner (scores + saves report)
│   │   └── compare_models.py          # Diff two eval reports side by side
│   ├── tests/
│   │   ├── test_healing_examples.py   # Integration tests with intentionally stale selectors
│   │   ├── test_evalset.py            # pytest integration for the evalset
│   │   ├── test_accuracy.py           # LLM accuracy benchmarks (3 tiers)
│   │   └── test_unit.py               # Unit tests (no browser/LLM required)
│   └── conftest.py                    # pytest fixtures, CLI options, report hook
├── docker/
│   ├── Dockerfile                     # Test runner image (Playwright + Python)
│   └── docker-compose.yml             # Ollama + test runner, health-checked
├── reports/
│   ├── healing_report_<ts>.json       # Per-run healing reports
│   └── evals/
│       └── eval_<provider>_<ts>.json  # Per-run eval reports
├── requirements.txt
├── pytest.ini
└── README.md

Quickstart

Option 1: Unit tests only (no browser or LLM needed)

pip install -r requirements.txt
playwright install chromium
PYTHONPATH=src pytest src/tests/test_unit.py -v

Option 2: Full integration tests (requires Ollama running locally)

# Install and start Ollama
brew install ollama        # or: curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5-coder:3b

# Run the tests
PYTHONPATH=src pytest src/tests/ -v \
  --ollama-url=http://localhost:11434 \
  --ollama-model=qwen2.5-coder:3b

Option 3: Use Anthropic Claude instead of Ollama

export ANTHROPIC_API_KEY=sk-ant-...

PYTHONPATH=src pytest src/tests/ -v \
  --llm-provider=anthropic \
  --anthropic-model=claude-haiku-4-5-20251001

Option 4: Docker (everything bundled)

docker compose -f docker/docker-compose.yml up --build
# Reports land in ./reports/healing_report_<timestamp>.json

Writing Your Own Healing Tests

Replace the page fixture with healing_page (a SelfHealingPage) and add a purpose string to every interaction:

# After: pip install pytest-self-healer
# No import needed — healing_page fixture is auto-available

async def test_checkout(healing_page):
    await healing_page.goto("https://myapp.com/checkout")

    # Selector is stale — LLM will find the real one
    await healing_page.click(
        selector="button#old-checkout-id",
        purpose="checkout submit button in the cart summary",
    )

    await healing_page.fill(
        selector="input.card-num",
        value="4242424242424242",
        purpose="credit card number input in payment form",
    )

Tips for better healing:

Be specific in purpose: "blue submit button in the login modal" > "button"
Use data-testid attributes in your app for stable baseline selectors
The LLM favors data-testid > aria-label > id > semantic CSS

Healing-aware actions: click, dblclick, hover, fill, type, press, check, uncheck, select_option, focus, tap, set_input_files, text_content, inner_text, input_value, get_attribute, is_visible, is_enabled, wait_for_selector, and drag_and_drop (which heals both the source and the target). Any other Playwright Page method (goto, keyboard, mouse, wait_for_load_state, …) is transparently delegated to the underlying page — SelfHealingPage is a true drop-in.
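
The delegation described above can be illustrated with a small sketch. It assumes a synchronous stand-in (`FakePage`) instead of a real Playwright page, and `HealingWrapper` is a hypothetical name; the real SelfHealingPage is async and does actual healing.

```python
class FakePage:
    """Stand-in for a Playwright Page so the sketch runs without a browser."""

    def __init__(self):
        self.url = "about:blank"

    def goto(self, url):
        self.url = url
        return url

    def click(self, selector):
        return f"raw click {selector}"


class HealingWrapper:
    """Hypothetical wrapper: overrides selector actions, delegates the rest."""

    def __init__(self, page):
        self._page = page

    def click(self, selector, purpose=""):
        # Healing-aware path (the healing itself is omitted in this sketch).
        return f"healing click {selector} ({purpose})"

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so anything not defined on the wrapper falls through to the page.
        return getattr(self._page, name)


wrapper = HealingWrapper(FakePage())
print(wrapper.goto("https://example.com"))      # delegated to FakePage.goto
print(wrapper.url)                              # delegated attribute access
print(wrapper.click("#old-id", purpose="submit button"))  # wrapper's own method
```

Because lookup falls through at attribute-access time, methods added to the wrapped page later are available without changing the wrapper.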

CLI Options

| Flag | Default | Description |
| --- | --- | --- |
| `--llm-provider` | `ollama` | `ollama` \| `anthropic` \| `auto` |
| `--ollama-url` | `http://localhost:11434` | Ollama server endpoint |
| `--ollama-model` | `qwen2.5-coder:3b` | Model name (also works with `llama3`, `mistral`) |
| `--anthropic-model` | `claude-haiku-4-5-20251001` | Any Claude model ID |
| `--anthropic-api-key` | `None` | Falls back to `ANTHROPIC_API_KEY` env var |
| `--healing-report-dir` | `reports` | Where to write JSON healing reports |
| `--screenshot-dir` | `reports/screenshots` | Where to write BEFORE/AFTER screenshots |
| `--healing-max-attempts` | `2` | How many times the LLM may retry a heal with feedback |
| `--selector-cache-file` | `reports/selector_cache.json` | Persistent selector cache, reused across runs |
| `--no-selector-cache` | `false` | Disable the persistent cache for this run |
| `--headless` | `true` | Run browser headless |

Healing Report

After each run, a JSON report is written to reports/:

{
  "total_healings_attempted": 3,
  "successful_healings": 3,
  "failed_healings": 0,
  "attempts": [
    {
      "original_selector": "#user-name",
      "element_purpose": "username input field on login form",
      "suggested_selector": "#username",
      "success": true,
      "timestamp": "2024-01-15T10:23:45.123456",
      "model_response_time_ms": 1840.5,
      "dom_size_chars": 4231,
      "provider": "ollama"
    }
  ]
}
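
A report like the one above can be post-processed with a few lines of Python. The sketch below is only an illustration: `summarize` is a hypothetical helper, and it relies solely on fields shown in the sample (the sample is abbreviated here and embedded as a string rather than read from reports/).

```python
import json

# Abbreviated copy of the sample report above.
SAMPLE_REPORT = """
{
  "total_healings_attempted": 3,
  "successful_healings": 3,
  "failed_healings": 0,
  "attempts": [
    {
      "original_selector": "#user-name",
      "suggested_selector": "#username",
      "success": true,
      "model_response_time_ms": 1840.5,
      "provider": "ollama"
    }
  ]
}
"""


def summarize(report: dict) -> dict:
    """Hypothetical helper: success rate, mean latency, and selectors to update."""
    total = report["total_healings_attempted"]
    attempts = report["attempts"]
    times = [a["model_response_time_ms"] for a in attempts]
    return {
        "success_rate": report["successful_healings"] / total if total else 1.0,
        "avg_response_ms": sum(times) / len(times) if times else 0.0,
        # Stale selector -> healed selector, for fixing the tests at their source.
        "to_update": {
            a["original_selector"]: a["suggested_selector"]
            for a in attempts
            if a["success"]
        },
    }


summary = summarize(json.loads(SAMPLE_REPORT))
print(summary)
```

The `to_update` mapping is the part worth acting on: each entry is a selector that should be corrected in the test code so the heal is not needed again.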

Evalset — Benchmarking LLM Accuracy

The evalset is a structured ground-truth dataset (src/evals/selector_evalset.json) used to measure how accurately the LLM finds correct selectors. It is independent of the healing tests — no browser required.

What's in the evalset

12 cases across 7 categories and 3 difficulty levels:

| Category | Cases | Difficulty |
| --- | --- | --- |
| login | 3 | easy |
| checkout | 2 | medium |
| search | 2 | easy |
| navigation | 1 | easy |
| modal | 2 | medium |
| profile | 1 | hard |
| data-table | 1 | hard |

Each case contains a stale selector, a purpose string, a minimal HTML snippet, and a list of acceptable correct selectors.
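
One plausible way to score a case against its list of acceptable selectors is a normalized membership check. This is an assumption for illustration only: the actual runner may score differently, and `normalize` and `is_correct` are hypothetical names.

```python
def normalize(selector: str) -> str:
    """Treat single and double quotes alike and collapse runs of whitespace."""
    return " ".join(selector.replace('"', "'").split())


def is_correct(suggested: str, expected_selectors: list) -> bool:
    """A suggestion counts as correct if it equals any acceptable selector."""
    return normalize(suggested) in {normalize(s) for s in expected_selectors}


expected = ["[data-testid='login-btn']", "button[type='submit']"]
print(is_correct('[data-testid="login-btn"]', expected))  # quote style differs
print(is_correct("#old-btn", expected))                   # the stale selector
```

String comparison is strict: a selector that matches the same element through a different path is counted as wrong unless it is listed as acceptable, which is why each case carries a list rather than a single answer.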

Running the evalset

Standalone runner (fastest, no pytest overhead):

# Against local Ollama
PYTHONPATH=src python src/evals/run_eval.py

# Against Anthropic Claude
PYTHONPATH=src python src/evals/run_eval.py \
  --provider anthropic \
  --anthropic-model claude-haiku-4-5-20251001

# Filter to a category or difficulty
PYTHONPATH=src python src/evals/run_eval.py --category login
PYTHONPATH=src python src/evals/run_eval.py --difficulty hard

Via pytest (integrates with your existing test flags):

PYTHONPATH=src pytest src/tests/test_evalset.py -v
PYTHONPATH=src pytest src/tests/test_evalset.py -v -k "login"

Comparing two models

Each eval run saves a timestamped report to reports/evals/. Use compare_models.py to diff two runs:

# Run against model A
PYTHONPATH=src python src/evals/run_eval.py --ollama-model qwen2.5-coder:3b

# Run against model B
PYTHONPATH=src python src/evals/run_eval.py --ollama-model llama3

# Compare
python src/evals/compare_models.py \
  reports/evals/eval_ollama_20260601_120000.json \
  reports/evals/eval_ollama_20260601_120500.json

Output:

  Metric                          A          B     Delta
  -------------------------------------------------------
  Accuracy                    75.0%      91.7%    +16.7%
  Avg response (ms)            2340       1820     -520.0

Adding new evalset cases

Open src/evals/selector_evalset.json and append to the cases array. Each case needs:

{
  "id": "unique-slug",
  "category": "login",
  "difficulty": "easy",
  "stale_selector": "#old-btn",
  "purpose": "login submit button",
  "expected_selectors": ["[data-testid='login-btn']", "button[type='submit']"],
  "html": "<minimal HTML snippet containing the target element>"
}

No code changes needed — the runner and pytest integration pick up new cases automatically.
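
Before committing new cases, a quick structural check can catch typos early. The sketch below is a hypothetical helper, not part of the package; it uses only the field names from the example above and the three difficulty levels listed earlier.

```python
REQUIRED_KEYS = {
    "id", "category", "difficulty", "stale_selector",
    "purpose", "expected_selectors", "html",
}
DIFFICULTIES = {"easy", "medium", "hard"}


def validate_cases(cases: list) -> list:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"case #{index}")
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"{label}: missing {sorted(missing)}")
        if "difficulty" in case and case["difficulty"] not in DIFFICULTIES:
            problems.append(f"{label}: unknown difficulty {case['difficulty']!r}")
        if "expected_selectors" in case and not case["expected_selectors"]:
            problems.append(f"{label}: expected_selectors is empty")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems


good = {
    "id": "login-submit",
    "category": "login",
    "difficulty": "easy",
    "stale_selector": "#old-btn",
    "purpose": "login submit button",
    "expected_selectors": ["[data-testid='login-btn']"],
    "html": "<button data-testid='login-btn'>Log in</button>",
}
bad = {**good, "id": "broken", "difficulty": "trivial"}
del bad["html"]

for problem in validate_cases([good, bad]):
    print(problem)
```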

Architecture Decisions

| Decision | Rationale |
| --- | --- |
| Local LLM first (Ollama) | No API keys, no data leakage, works offline in CI |
| Anthropic as opt-in cloud backend | Higher accuracy on complex DOMs; useful when RAM is limited |
| `auto` provider mode | Uses Claude if `ANTHROPIC_API_KEY` is set, otherwise Ollama, so the same command works locally and in CI |
| DOM compression | Strips scripts/styles, keeps semantic attrs. Fits in small model context (~8KB) |
| Persistent, URL-scoped cache | Keyed by `(url, selector)` so the same selector on different pages never collides; written to disk so a selector healed once is reused across runs, not just within one |
| Retry with feedback | If a suggestion fails validation, the next prompt names the failed selector and asks for a different one; this meaningfully lifts success rate on small models |
| Confidence scores | LLM self-reports certainty; useful for alerting on `low` confidence heals |
| `purpose` string | Natural language > brittle heuristics. Tells LLM why you want the element |
| Automatic passthrough | Non-selector `Page` APIs are delegated via `__getattr__`, so the wrapper never lags behind Playwright's API |
| Evalset separate from tests | Ground-truth data lives in JSON, not test code, so it is easy to grow and compare across models |
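
The persistent, URL-scoped cache can be pictured with the sketch below. It is an illustration under assumptions: `SelectorCache` is a hypothetical class, and the on-disk format (a JSON object whose keys are the JSON-encoded `[url, selector]` pair) is invented here, not taken from the package.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class SelectorCache:
    """Hypothetical disk-backed cache keyed by (url, selector)."""

    def __init__(self, path):
        self.path = Path(path)
        self._data = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def _key(url: str, selector: str) -> str:
        # JSON object keys must be strings, so the pair is encoded as one.
        return json.dumps([url, selector])

    def get(self, url: str, selector: str) -> Optional[str]:
        return self._data.get(self._key(url, selector))

    def put(self, url: str, selector: str, healed: str) -> None:
        self._data[self._key(url, selector)] = healed
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self._data, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "reports" / "selector_cache.json"

    first_run = SelectorCache(cache_file)
    first_run.put("https://myapp.com/login", "#submit-btn", '[data-testid="login-submit"]')

    # A fresh instance stands in for a later test run reading the same file.
    second_run = SelectorCache(cache_file)
    same_page = second_run.get("https://myapp.com/login", "#submit-btn")
    other_page = second_run.get("https://myapp.com/checkout", "#submit-btn")
    print(same_page, other_page)
```

Including the URL in the key is what keeps `#submit-btn` on the login page from reusing a heal recorded for `#submit-btn` on the checkout page.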

Extending

Swap the LLM: Pass --ollama-model=mistral for a different local model, or --llm-provider=anthropic for Claude
Tune retries: Raise --healing-max-attempts on small/local models, lower it to 1 to fail fast
Alert on low confidence: Check attempt["confidence"] == "low" in the report and open a GitHub issue automatically
Grow the evalset: Add cases to selector_evalset.json to cover your app's specific UI patterns
CI accuracy gate: Run run_eval.py in CI and fail the build if accuracy drops below a threshold
Auto-PR on heal: Use a high-confidence heal as the trigger to open a PR updating the stale selector at its source
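
The CI accuracy gate could look something like the sketch below. The eval report schema is not documented in this README, so the `results` list and its `correct` flag are assumptions made for illustration; adapt the field names to what run_eval.py actually writes. `passes_gate` is a hypothetical helper.

```python
def passes_gate(report: dict, threshold: float) -> bool:
    """Return True if the share of correct cases meets the threshold."""
    results = report["results"]                          # ASSUMED field name
    correct = sum(1 for r in results if r["correct"])    # ASSUMED field name
    accuracy = correct / len(results) if results else 0.0
    return accuracy >= threshold


# Stand-in report: 9 of 12 cases correct, i.e. 75% accuracy.
fake_report = {"results": [{"correct": i < 9} for i in range(12)]}

print(passes_gate(fake_report, threshold=0.70))
print(passes_gate(fake_report, threshold=0.80))
```

In a CI job, a script would load the newest file under reports/evals/ with `json.load` and call `sys.exit(1)` when the gate fails, so that the build stops.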

Release history

0.2.0 (this version): Jun 26, 2026
0.1.0: Jun 23, 2026


pytest-self-healer 0.2.0
