Agentic Crawler Discovery Framework.
azcrawlerpy
A framework for navigating and filling multi-step web forms programmatically. Supports Camoufox anti-detect browser (built on Firefox with C++ level fingerprint spoofing) and standard Chromium via Playwright. Uses JSON instruction files to define form navigation workflows, making it ideal for automated form submission, web scraping, and AI agent-driven web interactions.
Table of Contents
- Installation
- Quick Start
- Logging & Telemetry
- Core Concepts
- Instructions Schema
- Data Points (input_data)
- Element Discovery
- AI Agent Guidance
- Retry and Resilience
- Error Handling and Diagnostics
- Browser Profile Building (Profiler)
- Examples
Installation
uv add azcrawlerpy
Or install from source:
uv pip install -e .
Quick Start
FormCrawler requires a logger. In production, pass an AzureLogger from
azpaddypy so crawl telemetry (spans, correlation
IDs, structured records) flows into Application Insights. For local scripts
and notebooks, any stdlib-shaped logger works — anything that implements
debug/info/warning/error/exception/critical.
```python
import asyncio
from pathlib import Path

from azcrawlerpy import CrawlerBrowserConfig, DebugMode, FormCrawler, HumanizeConfig
from azpaddypy.mgmt.logging import AzureLogger, bootstrap_azure_monitor


async def main():
    # 1. Wire telemetry once per process (idempotent; no-op without a
    #    connection string, so it's safe in local runs).
    bootstrap_azure_monitor(service_name="crawler-demo")
    logger = AzureLogger(__name__)

    # 2. CrawlerBrowserConfig controls runtime browser settings
    #    (proxy, stealth, humanize).
    browser_config = CrawlerBrowserConfig(humanize=HumanizeConfig(enabled=True))

    # 3. FormCrawler requires a logger -- pass AzureLogger for App Insights
    #    integration, or a stdlib `logging.getLogger(__name__)` for local work.
    crawler = FormCrawler(
        headless=True,
        browser_config=browser_config,
        logger=logger,
    )

    instructions = {
        "url": "https://example.com/form",
        "browser_config": {
            "browser_type": "camoufox",
            "viewport_width": 1920,
            "viewport_height": 1080,
        },
        "steps": [
            {
                "name": "step_1",
                "wait_for": "input[name='email']",
                "timeout_ms": 15000,
                "fields": [
                    {
                        "type": "text",
                        "selector": "input[name='email']",
                        "data_key": "email",
                    }
                ],
                "next_action": {
                    "type": "click",
                    "selector": "button[type='submit']",
                },
            }
        ],
        "final_page": {
            "wait_for": ".success-message",
            "timeout_ms": 60000,
        },
    }

    input_data = {"email": "user@example.com"}

    result = await crawler.crawl(
        url=instructions["url"],
        input_data=input_data,
        instructions=instructions,
        output_dir=Path("./output"),
        debug_mode=DebugMode.ALL,
    )

    print(f"Final URL: {result.final_url}")
    print(f"Steps completed: {result.steps_completed}")
    print(f"Screenshot saved: {result.screenshot_path}")
    print(f"Extracted data: {result.extracted_data}")


asyncio.run(main())
```
Logging & Telemetry
FormCrawler accepts any logger that matches the stdlib logging.Logger
shape (structural LoggerLike protocol in
crawling/utils.py): debug, info,
warning, error, exception, critical.
Production: AzureLogger
Pairing the crawler with AzureLogger from azpaddypy gives you, with no
code in the crawler itself:
- Correlation IDs propagated via `contextvars`, so one crawl request is grep-able end-to-end.
- Trace context (`trace_id`, `span_id`) attached to every log record.
- Application Insights export of all crawl log records.
- OpenTelemetry spans on methods you decorate with `@trace_function`.
```python
from azpaddypy.mgmt.logging import AzureLogger, bootstrap_azure_monitor, trace_function

bootstrap_azure_monitor(service_name="crawler-api", service_version="0.5.1")
logger = AzureLogger(__name__)


# Optional: generate a correlation ID per request + emit a parent span
@trace_function(name="crawl_request")
async def handle_crawl(url: str, input_data: dict, instructions: dict):
    crawler = FormCrawler(headless=True, browser_config=None, logger=logger)
    return await crawler.crawl(
        url=url, input_data=input_data, instructions=instructions, output_dir=None,
    )
```
All log records emitted by the crawler will carry the correlation ID and
span context of whatever span is active when crawl() runs.
Local / scripts: stdlib logger
If you don't need App Insights, a standard library logger is enough:
```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")

crawler = FormCrawler(
    headless=True,
    browser_config=None,
    logger=logging.getLogger("my_crawler"),
)
```
Any logger that satisfies the LoggerLike Protocol is accepted — there is no
hard runtime dependency on azpaddypy.
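The protocol surface is small enough to satisfy in a few lines. As a minimal sketch, a hypothetical in-memory logger (assuming the protocol only requires the six methods named above, each taking a message plus printf-style args) is handy for tests and notebooks:

```python
# Hypothetical LoggerLike-compatible logger -- an illustration of the
# protocol's shape, not part of azcrawlerpy.
class ListLogger:
    """Collects log records in memory instead of emitting them."""

    def __init__(self):
        self.records: list[tuple[str, str]] = []

    def _log(self, level: str, msg: str, *args, **kwargs):
        # Mirror stdlib printf-style formatting of the message.
        self.records.append((level, msg % args if args else msg))

    def debug(self, msg, *args, **kwargs):
        self._log("DEBUG", msg, *args, **kwargs)

    def info(self, msg, *args, **kwargs):
        self._log("INFO", msg, *args, **kwargs)

    def warning(self, msg, *args, **kwargs):
        self._log("WARNING", msg, *args, **kwargs)

    def error(self, msg, *args, **kwargs):
        self._log("ERROR", msg, *args, **kwargs)

    def exception(self, msg, *args, **kwargs):
        self._log("ERROR", msg, *args, **kwargs)

    def critical(self, msg, *args, **kwargs):
        self._log("CRITICAL", msg, *args, **kwargs)


logger = ListLogger()
logger.info("navigating to %s", "https://example.com/form")
```

Anything with this shape can be passed as `logger=` to `FormCrawler`, which is what makes the azpaddypy dependency optional.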
Core Concepts
The framework operates on two primary inputs:
- Instructions (instructions.json): Defines the form structure, selectors, navigation flow, and field types
- Data Points (input_data): Contains the actual values to fill into form fields
The crawler processes each step sequentially:
- Wait for the step's `wait_for` selector to become visible
- Fill all fields defined in the step using values from `input_data`
- Execute the `next_action` to navigate to the next step
- Repeat until all steps are complete
- Wait for and capture the final page
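The sequential loop can be sketched as a pure function that turns an instructions dict into the ordered operations a crawl would perform. This is an illustration of the control flow, not the framework's actual implementation (`plan_crawl` is a hypothetical helper):

```python
# Illustrative sketch of the per-step loop: wait, fill, navigate, repeat,
# then wait for the final page. Not the real crawler code.
def plan_crawl(instructions: dict) -> list[str]:
    ops = [f"goto {instructions['url']}"]
    for step in instructions["steps"]:
        ops.append(f"wait_for {step['wait_for']}")        # step is visible
        for field in step.get("fields", []):
            ops.append(f"fill {field['selector']}")        # fill each field
        ops.append(f"{step['next_action']['type']} next")  # navigate onward
    ops.append(f"wait_for {instructions['final_page']['wait_for']}")
    return ops


demo = {
    "url": "https://example.com/form",
    "steps": [
        {
            "name": "step_1",
            "wait_for": "input[name='email']",
            "fields": [{"type": "text", "selector": "input[name='email']", "data_key": "email"}],
            "next_action": {"type": "click", "selector": "button[type='submit']"},
        }
    ],
    "final_page": {"wait_for": ".success-message"},
}
print(plan_crawl(demo))
```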
Instructions Schema
Top-Level Structure
{
"url": "https://example.com/form",
"browser_config": { ... },
"cookie_consent": { ... },
"steps": [ ... ],
"final_page": { ... },
"data_extraction": { ... }
}
| Field | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | Starting URL for the form |
| `browser_config` | object | Yes | Browser engine, viewport, and user agent settings |
| `cookie_consent` | object | No | Cookie banner handling configuration |
| `captcha` | object | No | CAPTCHA handling configuration (e.g., Cloudflare Turnstile) |
| `steps` | array | Yes | Ordered list of form steps |
| `final_page` | object | Yes | Configuration for the result page |
| `data_extraction` | object | No | Configuration for extracting data from final page |
| `profiler` | object | No | Browser profile building configuration (visit sites to accumulate cookies before crawling) |
Browser Configuration
{
"browser_config": {
"viewport_width": 1920,
"viewport_height": 1080,
"user_agent": "Mozilla/5.0 ..."
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| `browser_type` | string | Yes | Browser engine: `camoufox` (anti-detect Firefox) or `chromium` (standard Playwright) |
| `viewport_width` | integer | Yes | Browser viewport width in pixels |
| `viewport_height` | integer | Yes | Browser viewport height in pixels |
| `user_agent` | string | No | Custom user agent string |
| `blocked_url_patterns` | array | No | URL glob patterns to block via `page.route()` (e.g., `**/analytics/**`) |
Cookie Consent Handling
The framework supports two modes for handling cookie consent banners:
Standard Mode (regular DOM elements):
{
"cookie_consent": {
"banner_selector": "dialog:has-text('cookies')",
"accept_selector": "button:has-text('Accept')"
}
}
Shadow DOM Mode (for Usercentrics, OneTrust, etc.):
{
"cookie_consent": {
"banner_selector": "#usercentrics-cmp-ui",
"shadow_host_selector": "#usercentrics-cmp-ui",
"accept_button_texts": ["Accept All", "Alle akzeptieren"]
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| `banner_selector` | string | Yes | CSS selector for the banner container |
| `accept_selector` | string | No | CSS selector for accept button (standard mode) |
| `shadow_host_selector` | string | No | CSS selector for shadow DOM host |
| `accept_button_texts` | array | No | Text patterns to match accept buttons in shadow DOM |
| `banner_settle_delay_ms` | integer | No | Wait time before checking for banner |
| `banner_visible_timeout_ms` | integer | No | Timeout for banner visibility |
| `accept_button_timeout_ms` | integer | No | Timeout for accept button visibility |
| `post_consent_delay_ms` | integer | No | Wait time after handling consent |
| `js_fallback_texts` | array | No | Custom text patterns for JS fallback button matching (overrides defaults) |
The JS fallback matches button text against: klar, akzept, accept, agree, ok, verstanden, zustimm. Set js_fallback_texts to override this list with site-specific patterns.
Step Definitions
Each step represents a form page or section:
{
"name": "personal_info",
"wait_for": "input[name='firstName']",
"timeout_ms": 15000,
"fields": [ ... ],
"next_action": { ... }
}
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Unique identifier for the step |
| `wait_for` | string | Yes | CSS selector to wait for before processing |
| `timeout_ms` | integer | Yes | Timeout in milliseconds for wait condition |
| `fields` | array | Yes | List of field definitions |
| `next_action` | object | Yes | Action to navigate to next step |
| `data_extraction` | array | No | Data to extract BEFORE field handling in this step |
| `post_field_extraction` | array | No | Data to extract AFTER field handling (for modal results, dynamic values) |
| `optional` | boolean | No | Skip step gracefully if `wait_for` selector is not found (see Optional Steps) |
| `restart_crawl` | boolean | No | Trigger a full browser restart from the start URL when `wait_for` times out (see Restart Crawl). Requires `max_restarts`. |
| `max_restarts` | integer | No | Maximum number of full crawl restart attempts when `restart_crawl` is true (must be >= 1). Each step tracks its own restart budget. |
Restart Crawl
Steps marked with "restart_crawl": true raise an internal CrawlRestartError when the wait_for selector times out. The crawler then closes the browser, recreates the context (reusing any profiler storage state so cookie consent does not re-run), and replays the entire crawl from the start URL. Each step tracks its own restart budget against max_restarts; once exhausted, the original timeout error is raised.
{
"name": "results_page",
"wait_for": "[data-cy='quote-result']",
"timeout_ms": 15000,
"restart_crawl": true,
"max_restarts": 2,
"fields": [],
"next_action": { "type": "noop" }
}
restart_crawl takes priority over optional and strict. Use it for steps where the page may render broken or never finalize due to flaky SPA/backend behavior, where a fresh browser context is the most reliable recovery. Prefer retry_config on individual fields/actions for transient interaction failures; reserve restart_crawl for full-page load failures.
Optional Steps
Steps marked with "optional": true are skipped gracefully when the wait_for selector is not found within timeout_ms. The entire step (fields, data extraction, next_action) is skipped with an info log. This is useful for conditional workflow pages that only appear depending on prior input values or dynamic form behavior.
{
"name": "additional_driver_details",
"wait_for": "#additional-driver-form",
"timeout_ms": 5000,
"optional": true,
"fields": [ ... ],
"next_action": { "type": "click", "selector": "#next" }
}
The optional check takes precedence over the strict parameter. When a step is optional and the wait_for selector times out, the step is always skipped regardless of strict mode. Non-optional steps follow the existing behavior: raise CrawlerTimeoutError when strict=True, or log a warning and continue into the step when strict=False.
Tip: Use a shorter timeout_ms (e.g., 3000-5000ms) on optional steps to avoid waiting the full timeout when the step is absent.
Field Types
TEXT
For text inputs, email fields, phone numbers, and similar:
{
"type": "text",
"selector": "input[name='email']",
"data_key": "email"
}
TEXTAREA
For multi-line text areas:
{
"type": "textarea",
"selector": "textarea[name='message']",
"data_key": "message"
}
DROPDOWN / SELECT
For native <select> elements:
{
"type": "dropdown",
"selector": "select[name='country']",
"data_key": "country",
"type_config": {
"select_by": "text"
}
}
| type_config Parameter | Values | Description |
|---|---|---|
| `select_by` | `text`, `value`, `index` | How to match the option |
| `option_visible_timeout_ms` | integer | Timeout in ms for option visibility |
RADIO
For radio button groups:
{
"type": "radio",
"selector": "input[type='radio'][value='${value}']",
"data_key": "payment_method"
}
Pattern A - Value-driven selector: Use ${value} placeholder in selector, data provides the value:
{
"type": "radio",
"selector": "input[type='radio'][value='${value}']",
"data_key": "gender"
}
// data: { "gender": "male" }
Pattern B - Boolean flags: Use explicit selectors with boolean data values:
{
"type": "radio",
"selector": "[role='radio']:has-text('Yes')",
"data_key": "accept_terms",
"force_click": true
}
// data: { "accept_terms": true } // clicks if truthy, skips if null
force_click is supported on radio, click_only, and click_select field types. When set, the click fallback chain tries force click before normal click.
CHECKBOX
For checkbox inputs:
{
"type": "checkbox",
"selector": "input[type='checkbox'][name='newsletter']",
"data_key": "subscribe_newsletter"
}
Data value true checks the box, false unchecks it, and null skips the field entirely (no interaction).
DATE
For date inputs with format conversion:
{
"type": "date",
"selector": "input[name='birthdate']",
"data_key": "birthdate",
"type_config": {
"format": "DD.MM.YYYY"
}
}
Supported formats (mapped to strftime internally):
| Format | Example | Description |
|---|---|---|
| `DD.MM.YYYY` | 15.06.1985 | Day.Month.Year |
| `MM/DD/YYYY` | 06/15/1985 | Month/Day/Year |
| `YYYY-MM-DD` | 1985-06-15 | ISO format |
| `DD/MM/YYYY` | 15/06/1985 | Day/Month/Year |
| `YYYY/MM/DD` | 1985/06/15 | Year/Month/Day |
| `DD-MM-YYYY` | 15-06-1985 | Day-Month-Year |
| `MM-DD-YYYY` | 06-15-1985 | Month-Day-Year |
Data must be provided in ISO format (YYYY-MM-DD) in input_data. The type_config.format specifies the output format for typing into the field. Native <input type="date"> fields are auto-detected and use the value as-is, no type_config needed.
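The ISO-to-display conversion can be sketched with a small token-to-strftime mapping. The mapping below mirrors the format table; it is an illustration, not the framework's actual converter:

```python
from datetime import date

# Assumed token mapping, derived from the format table above.
TOKEN_MAP = {"DD": "%d", "MM": "%m", "YYYY": "%Y"}


def format_date(iso_value: str, fmt: str) -> str:
    """Convert an ISO date string to the instruction's display format."""
    strftime_fmt = fmt
    for token, directive in TOKEN_MAP.items():
        strftime_fmt = strftime_fmt.replace(token, directive)
    return date.fromisoformat(iso_value).strftime(strftime_fmt)


print(format_date("1985-06-15", "DD.MM.YYYY"))  # 15.06.1985
```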
SLIDER
For range inputs:
{
"type": "slider",
"selector": "input[type='range'][name='coverage']",
"data_key": "coverage_amount"
}
FILE
For file upload fields:
{
"type": "file",
"selector": "input[type='file']",
"data_key": "document_path"
}
Data value should be the absolute file path.
COMBOBOX
For autocomplete/typeahead inputs:
{
"type": "combobox",
"selector": "input[aria-label='City']",
"data_key": "city",
"type_config": {
"option_selector": ".autocomplete-option",
"type_delay_ms": 50,
"wait_after_type_ms": 500,
"press_enter": true
}
}
| type_config Parameter | Description |
|---|---|
| `option_selector` | CSS selector for dropdown options (required) |
| `type_delay_ms` | Delay between keystrokes (simulates human typing) |
| `wait_after_type_ms` | Wait time for options to appear |
| `press_enter` | Press Enter after selecting option |
| `clear_before_type` | Clear field before typing |
| `option_visible_timeout_ms` | Timeout in ms for option visibility |
CLICK_SELECT
For custom dropdowns requiring click-then-select:
{
"type": "click_select",
"selector": ".custom-dropdown-trigger",
"data_key": "option_value",
"post_click_delay_ms": 300,
"type_config": {
"option_selector": ".dropdown-item:has-text('${value}')"
}
}
CLICK_ONLY
For elements that only need clicking (no data input):
{
"type": "click_only",
"selector": "button.expand-section"
}
With conditional clicking based on data:
{
"type": "click_only",
"selector": "button:has-text('${value}')",
"data_key": "selected_option"
}
IFRAME_FIELD
For fields inside iframes (alternative to iframe_selector):
{
"type": "iframe_field",
"selector": "input[name='card_number']",
"iframe_selector": "iframe#payment-frame",
"data_key": "card_number"
}
Common Field Parameters
| Parameter | Type | Description |
|---|---|---|
| `data_key` | string | Key in `input_data` to get value from |
| `selector` | string | CSS/Playwright selector for the element |
| `type_config` | object | Type-specific configuration (see field type sections above) |
| `iframe_selector` | string | Selector for parent iframe if field is embedded |
| `field_visible_timeout_ms` | integer | Timeout for field to become visible |
| `post_click_delay_ms` | integer | Wait after clicking the field |
| `skip_verification` | boolean | Skip value verification after filling |
| `force_click` | boolean | Use force click to bypass overlays (click_only, radio, and click_select) |
| `optional` | boolean | Skip field gracefully if element is not found or interaction fails |
| `retry_config` | object | Retry configuration for transient failures (see Retry and Resilience) |
Optional Fields
Fields marked with "optional": true are skipped gracefully when the element is not found on the page or when interaction fails. This is useful for elements that may or may not appear depending on dynamic page behavior, A/B tests, or conditional rendering that cannot be predicted by data alone.
{
"type": "click_only",
"selector": "#promotional-banner button.dismiss",
"data_key": "dismiss_promo",
"optional": true,
"field_visible_timeout_ms": 2000
}
How optional differs from null values:
| Mechanism | Element lookup | Use case |
|---|---|---|
| Value = `null` in `input_data` | No (skipped immediately) | Field exists but should not be filled for this data row |
| `"optional": true` | Yes (waits for visibility) | Field may or may not exist on the page |
When optional is set and the element is not found or interaction fails, the handler logs an info message and continues to the next field. No error is raised regardless of the strict mode.
Tip: Set a low field_visible_timeout_ms (e.g., 1000-2000ms) on optional fields to avoid waiting the full default timeout when the element is absent.
Note: For skipping entire steps (all fields + next_action), use Optional Steps instead.
Action Types
CLICK
Click a button or link:
{
"type": "click",
"selector": "button[type='submit']"
}
With iframe support:
{
"type": "click",
"selector": "button:has-text('Next')",
"iframe_selector": "iframe#form-frame"
}
WAIT
Wait for an element to appear:
{
"type": "wait",
"selector": ".loading-complete"
}
WAIT_HIDDEN
Wait for an element to disappear:
{
"type": "wait_hidden",
"selector": ".loading-spinner"
}
SCROLL
Scroll to an element:
{
"type": "scroll",
"selector": "#section-bottom"
}
DELAY
Wait for a fixed time:
{
"type": "delay",
"delay_ms": 2000
}
CONDITIONAL
Execute actions based on conditions:
{
"type": "conditional",
"condition": {
"type": "selector_visible",
"selector": ".error-message"
},
"actions": [
{
"type": "click",
"selector": "button.dismiss-error"
}
]
}
Condition types:
- `selector_visible`: True if selector is visible on the page
- `selector_hidden`: True if selector is NOT visible on the page
- `data_equals`: True if `input_data[key]` equals `value`
- `data_exists`: True if `input_data[key]` is truthy
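The two data-driven condition types can be evaluated without a live page. As a sketch (a hypothetical helper, not the framework's API; `selector_visible`/`selector_hidden` need a browser and are omitted):

```python
# Illustrative evaluator for the data_* condition types described above.
def evaluate_data_condition(condition: dict, input_data: dict) -> bool:
    kind = condition["type"]
    if kind == "data_equals":
        # True if input_data[key] equals the configured value.
        return input_data.get(condition["key"]) == condition["value"]
    if kind == "data_exists":
        # True if input_data[key] is truthy.
        return bool(input_data.get(condition["key"]))
    raise ValueError(f"unsupported condition type: {kind}")
```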
Common Action Parameters
| Parameter | Type | Description |
|---|---|---|
| `selector` | string | Target element selector |
| `iframe_selector` | string | Selector for parent iframe |
| `delay_ms` | integer | Delay/timeout in ms (for delay, wait, wait_hidden actions) |
| `condition` | object | Condition definition (for conditional actions) |
| `actions` | array | Nested actions to execute if condition is met (for conditional actions) |
| `pre_action_delay_ms` | integer | Wait before executing action |
| `post_action_delay_ms` | integer | Wait after executing action |
| `retry_config` | object | Retry configuration for transient failures (see Retry and Resilience) |
Final Page Configuration
{
"final_page": {
"wait_for": ".result-container, .confirmation",
"timeout_ms": 60000,
"post_wait_delay_ms": 2000,
"screenshot_selector": ".result-panel"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| `wait_for` | string | Conditional | CSS selector to wait for (required if `wait_for_data_key` not set) |
| `wait_for_data_key` | string | Conditional | Data key for exact text match (uses `text="{value}"` selector). Required if `wait_for` not set. |
| `timeout_ms` | integer | Yes | Timeout in milliseconds for waiting |
| `post_wait_delay_ms` | integer | No | Delay in ms after selector found, for SPA content to render (default: 0) |
| `screenshot_selector` | string | No | Element to screenshot (null for full page) |
Only one of wait_for or wait_for_data_key can be set (not both).
Data Extraction Configuration
Extract structured data from the final page using CSS selectors:
{
"data_extraction": {
"fields": {
"tier_prices": {
"selector": ".price-value",
"attribute": null,
"regex": "([0-9]+[.,][0-9]{2})",
"multiple": true,
"iframe_selector": "iframe#form-frame"
},
"selected_price": {
"selector": "#total-amount",
"attribute": "data-value",
"regex": null,
"multiple": false
}
}
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| `selector` | string | Yes | CSS selector to locate element(s) |
| `attribute` | string | No | Element attribute to extract (null for text content) |
| `regex` | string | No | Regex pattern to apply (uses first capture group if present) |
| `multiple` | boolean | Yes | True for list of all matches, False for first match only |
| `iframe_selector` | string | No | CSS selector for iframe if element is inside one |
The data_extraction config also supports nested_fields to combine flat extracted arrays into structured outputs (paired_dict or object_list). See docs/README.md for details.
Extracted data is available in the crawl result:
```python
result = await crawler.crawl(...)
print(result.extracted_data)
# {'tier_prices': ['32,28', '35,26', '50,34'], 'selected_price': '35,26'}
```
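The per-field regex step works like standard Python `re` extraction: when a pattern with a capture group is configured, the first group is taken from each matched element's text. A sketch with illustrative sample texts (the element texts are assumptions for the demo):

```python
import re

# The tier_prices pattern from the configuration example above.
pattern = re.compile(r"([0-9]+[.,][0-9]{2})")

# Hypothetical text content of three .price-value elements.
texts = ["EUR 32,28 / month", "EUR 35,26 / month", "EUR 50,34 / month"]

# With "multiple": true, every match contributes its first capture group.
tier_prices = [m.group(1) for t in texts if (m := pattern.search(t))]
print(tier_prices)  # ['32,28', '35,26', '50,34']
```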
Data Points (input_data)
The input_data dictionary provides values for form fields. Keys must match data_key values in the instructions.
Structure
{
"email": "user@example.com",
"first_name": "John",
"last_name": "Doe",
"birthdate": "1985-06-15",
"country": "Germany",
"accept_terms": true,
"newsletter": false,
"premium_option": null
}
Value Types
| Type | Description | Example |
|---|---|---|
| String | Text values, dropdown selections | "John" |
| Boolean | Checkbox/radio toggle | true, false |
| Null | Skip this field entirely (no DOM interaction) | null |
| Integer/Float | Numeric inputs, sliders | 12000, 99.99 |
Note: Setting a value to null skips the field without any DOM interaction. This is different from "optional": true on the field definition, which attempts the interaction but tolerates failure. See Optional Fields.
Radio Button Patterns
Pattern A - Mutually exclusive options with value selector:
{
"gender": "male"
}
Selector uses ${value} placeholder: input[value='${value}']
Pattern B - Boolean flags for each option:
{
"option_a": true,
"option_b": null,
"option_c": null
}
Only the option with true gets clicked.
Date Handling
Dates in input_data should use ISO format (YYYY-MM-DD):
{
"birthdate": "1985-06-15",
"start_date": "2024-01-01"
}
The framework converts to the format specified in the field definition.
Element Discovery
The ElementDiscovery class scans web pages to identify interactive elements, helping build instructions.json files.
```python
from pathlib import Path

from azcrawlerpy import ElementDiscovery


async def discover_elements():
    discovery = ElementDiscovery(headless=False)
    report = await discovery.discover(
        url="https://example.com/form",
        output_dir=Path("./discovery_output"),
        cookie_consent={
            "banner_selector": "#cookie-banner",
            "accept_selector": "button.accept",
        },
        explore_iframes=True,
        screenshot=True,
    )

    print(f"Found {report.total_elements} elements")

    for text_input in report.text_inputs:
        print(f"Text input: {text_input.selector}")
        print(f"  Suggested type: {text_input.suggested_field_type}")

    for dropdown in report.selects:
        print(f"Dropdown: {dropdown.selector}")
        print(f"  Options: {dropdown.options}")

    for radio_group in report.radio_groups:
        print(f"Radio group: {radio_group.name}")
        for option in radio_group.options:
            print(f"  - {option.label}: {option.selector}")
```
Discovery Report Contents
- `text_inputs`: Text, email, phone, password fields
- `textareas`: Multi-line text areas
- `selects`: Native dropdown elements with options
- `radio_groups`: Grouped radio buttons
- `checkboxes`: Checkbox inputs
- `buttons`: Clickable buttons
- `links`: Anchor elements
- `date_inputs`: Date picker fields
- `file_inputs`: File upload fields
- `sliders`: Range inputs
- `custom_components`: Non-standard interactive elements
- `iframes`: Discovered iframes with their elements
AI Agent Guidance
This section provides instructions for AI agents tasked with creating instructions.json and input_data files.
Workflow for Creating Instructions
- Discovery Phase: Use `ElementDiscovery` to scan each page/step of the form
- Mapping Phase: Map discovered elements to field definitions
- Flow Definition: Define step transitions and actions
- Data Schema: Create the `input_data` structure
Step-by-Step Process
1. Analyze the Form Structure
- Identify how many pages/steps the form has
- Note the URL pattern changes (if any)
- Identify what element appears when each step loads
2. For Each Step, Define:
{
"name": "<descriptive_step_name>",
"wait_for": "<selector_that_confirms_step_loaded>",
"timeout_ms": 15000,
"fields": [...],
"next_action": {...}
}
Naming conventions:
- Use snake_case for step names: `personal_info`, `payment_details`
- Use descriptive data_keys: `first_name`, `email_address`, `accepts_terms`
3. Selector Priority
When choosing selectors, prefer in order:
1. `[data-testid='...']` or `[data-cy='...']` - Most stable
2. `[aria-label='...']` or `[aria-labelledby='...']` - Accessible and stable
3. `input[name='...']` - Form field names
4. `:has-text('...')` - Text content (use for buttons/labels)
5. CSS class selectors - Least stable, avoid if possible
4. Handle Dynamic Content
For AJAX-loaded content:
- Use a `wait` action before interacting
- Add `field_visible_timeout_ms` to field definitions
- Use `post_click_delay_ms` for fields that trigger updates
5. Radio Button Strategy
Option A - When radio values are meaningful:
{
"type": "radio",
"selector": "input[type='radio'][value='${value}']",
"data_key": "payment_type"
}
// data: { "payment_type": "credit_card" }
Option B - When you need individual control:
{
"type": "radio",
"selector": "[role='radio']:has-text('Credit Card')",
"data_key": "payment_credit_card",
"force_click": true
},
{
"type": "radio",
"selector": "[role='radio']:has-text('PayPal')",
"data_key": "payment_paypal",
"force_click": true
}
// data: { "payment_credit_card": true, "payment_paypal": null }
6. Iframe Handling
When elements are inside iframes:
{
"type": "text",
"selector": "input[name='card_number']",
"iframe_selector": "iframe#payment-iframe",
"data_key": "card_number"
}
Creating input_data
1. Analyze Required Fields
From the instructions, extract all unique data_key values:
```python
data_keys = set()
for step in instructions["steps"]:
    for field in step["fields"]:
        if field.get("data_key"):
            data_keys.add(field["data_key"])
```
2. Determine Value Types
| Field Type | Data Type | Example |
|---|---|---|
| text, textarea | string | "John Doe" |
| dropdown | string | "Germany" |
| radio (value-driven) | string | "option_a" |
| radio (boolean) | boolean/null | true or null |
| checkbox | boolean | true / false |
| date | string (ISO) | "1985-06-15" |
| slider | number | 50000 |
| file | string (path) | "/path/to/file.pdf" |
3. Handle Mutually Exclusive Options
For radio groups with boolean flags, only ONE should be true:
{
"employment_fulltime": true,
"employment_parttime": null,
"employment_selfemployed": null,
"employment_unemployed": null
}
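A pre-flight check for this invariant is cheap to add when generating `input_data`. As a sketch (hypothetical `check_exclusive` helper, not part of azcrawlerpy):

```python
# Validate that exactly one flag in a boolean-flag radio group is true.
def check_exclusive(input_data: dict, group_keys: list[str]) -> bool:
    true_count = sum(1 for k in group_keys if input_data.get(k) is True)
    return true_count == 1


employment_keys = [
    "employment_fulltime",
    "employment_parttime",
    "employment_selfemployed",
    "employment_unemployed",
]
data = {
    "employment_fulltime": True,
    "employment_parttime": None,
    "employment_selfemployed": None,
    "employment_unemployed": None,
}
print(check_exclusive(data, employment_keys))  # True
```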
4. Date Format
Always provide dates in ISO format in input_data:
{
"birthdate": "1985-06-15",
"policy_start": "2024-01-01"
}
The instructions specify the output format for the specific form.
Common Patterns
Multi-Step Wizard
{
"steps": [
{
"name": "step_1_personal",
"wait_for": "input[name='firstName']",
"fields": [...],
"next_action": { "type": "click", "selector": "button:has-text('Next')" }
},
{
"name": "step_2_address",
"wait_for": "input[name='street']",
"fields": [...],
"next_action": { "type": "click", "selector": "button:has-text('Next')" }
}
]
}
Form with Loading States
{
"next_action": {
"type": "click",
"selector": "button[type='submit']",
"post_action_delay_ms": 1000
}
}
Conditional Fields
{
"type": "conditional",
"condition": {
"type": "data_equals",
"key": "has_additional_driver",
"value": true
},
"actions": [
{
"type": "click",
"selector": "button:has-text('Add Driver')"
}
]
}
Retry and Resilience
The framework provides built-in retry and fallback mechanisms for handling transient failures during web interactions.
Retry Configuration
Add retry_config to any field, action, step extraction, or data extraction to enable automatic retry with exponential backoff:
{
"type": "text",
"selector": "input[name='email']",
"data_key": "email",
"retry_config": {
"max_attempts": 3,
"base_delay_ms": 500,
"backoff_multiplier": 1.5
}
}
| Parameter | Type | Required | Description |
|---|---|---|---|
| `max_attempts` | integer | Yes | Total attempts (1 = no retry, 2 = one retry, etc.) |
| `base_delay_ms` | integer | Yes | Base delay between retries in milliseconds |
| `backoff_multiplier` | float | No | Multiplier per retry (default: 1.5). Delay = `base_delay_ms * backoff_multiplier^(attempt-1)` |
Retryable errors include Playwright TimeoutError and Error (element not found, intercepted clicks, etc.).
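The delay schedule implied by the backoff formula can be computed directly. A sketch (hypothetical helper, mirroring the formula above):

```python
# delay(attempt) = base_delay_ms * backoff_multiplier ** (attempt - 1)
def retry_delays_ms(max_attempts: int, base_delay_ms: int,
                    backoff_multiplier: float = 1.5) -> list[float]:
    """Delays waited before each retry (attempts 2..max_attempts)."""
    return [
        base_delay_ms * backoff_multiplier ** (attempt - 1)
        for attempt in range(1, max_attempts)
    ]


# The retry_config from the example above: 3 attempts, 500 ms base, 1.5x.
print(retry_delays_ms(3, 500))  # [500.0, 750.0]
```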
Click Fallback Chain
All click interactions (click actions, click_only, radio, click_select fields) use an escalating fallback strategy when a click fails:
1. Normal click (or force click if `force_first=True`)
2. Force click (or normal click if `force_first=True`) -- bypasses actionability checks
3. JS dispatchEvent -- fires a `MouseEvent` via `element.dispatchEvent()`
4. JS element.click() -- calls `element.click()` via `document.querySelector()`
This handles common real-world issues like overlay interception, sticky headers, and elements that pass Playwright's visibility checks but fail to receive pointer events.
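The escalation pattern itself is simple: try each strategy in order and stop at the first success. A sketch with injected callables standing in for the real Playwright/JS paths (the strategy functions here are placeholders, not the framework's implementation):

```python
# Generic escalating fallback: first strategy that succeeds wins.
def click_with_fallbacks(strategies):
    errors = []
    for name, strategy in strategies:
        try:
            strategy()
            return name
        except Exception as exc:  # any click failure escalates to the next path
            errors.append((name, exc))
    raise RuntimeError(f"all click strategies failed: {errors}")


def blocked_by_overlay():
    raise TimeoutError("click intercepted by overlay")


winner = click_with_fallbacks([
    ("normal_click", blocked_by_overlay),
    ("force_click", blocked_by_overlay),
    ("js_dispatch_event", lambda: None),  # succeeds
    ("js_element_click", lambda: None),
])
print(winner)  # js_dispatch_event
```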
Value Verification
text and textarea fields verify the actual input value after filling. If the value written to the field does not match the expected value, a FieldInteractionError is raised. This catches silent data corruption from autofill interference, input masks, or JavaScript reformatting.
Set skip_verification: true on the field to disable this check for fields where the site intentionally transforms the input (e.g., phone number formatting).
ComboBox Degraded Retry
When retry_config is set on a combobox field, retries use degraded parameters to handle slow autocomplete responses:
- Typing delay scales by `1.5^(attempt-1)` (slower typing gives autocomplete more time)
- The field is always cleared before retyping on retry attempts
Error Handling and Diagnostics
The framework provides detailed error information when failures occur.
Exception Types
| Exception | When Raised |
|---|---|
| `FieldNotFoundError` | Selector doesn't match any element (not raised for optional fields) |
| `FieldInteractionError` | Element found but interaction failed, or value mismatch after fill (not raised for optional fields) |
| `CrawlerTimeoutError` | Wait condition not met within timeout |
| `NavigationError` | Navigation action failed |
| `MissingDataError` | Required `data_key` not in `input_data` |
| `InvalidInstructionError` | Malformed instructions JSON |
| `UnsupportedFieldTypeError` | Unknown field type specified |
| `UnsupportedActionTypeError` | Unknown action type specified |
| `IframeNotFoundError` | Specified iframe not found |
| `DataExtractionError` | Data extraction from final page failed |
Debug Mode
Enable debug mode to capture screenshots at various stages:
```python
from azcrawlerpy import DebugMode

result = await crawler.crawl(
    ...,
    debug_mode=DebugMode.ALL,  # Capture all screenshots
)
```
| Mode | Description |
|---|---|
| `NONE` | No debug screenshots |
| `START` | Screenshot at form start |
| `END` | Screenshot at form end |
| `ALL` | Screenshots after every field and action |
AI Diagnostics
When errors occur with debug mode enabled, the framework captures:
- Current page URL and title
- Available `data-cy` and `data-testid` selectors
- Visible buttons and input fields
- Similar selectors (fuzzy matching suggestions)
- Console errors and warnings
- Failed network requests
- XHR/fetch API responses (URL, status, method) for debugging SPA extraction failures
- HTML snippet of the form area
- Error screenshot (saved to disk or captured in-memory)
This information is included in the exception message and saved to error_diagnostics.json (when output_dir is set).
In-Memory Error Diagnostics
When running with output_dir=None, error diagnostics are fully available in-memory. The error screenshot is captured as bytes and accessible in two ways:
- On the exception's diagnostics: `exception.diagnostics.screenshot_bytes`
- As the last entry in the partial result: `exception.partial_result.screenshots[-1]`
from azcrawlerpy import CrawlerError, DebugMode

try:
    result = await crawler.crawl(
        url=url,
        input_data=input_data,
        instructions=instructions,
        output_dir=None,
        debug_mode=DebugMode.ALL,
    )
except CrawlerError as e:
    # Access the error screenshot bytes directly from diagnostics
    if e.diagnostics and e.diagnostics.screenshot_bytes:
        error_screenshot = e.diagnostics.screenshot_bytes
    # Or from the partial result (includes all screenshots taken during the crawl)
    if e.partial_result:
        all_screenshots = e.partial_result.screenshots  # error screenshot is the last item
        extracted_so_far = e.partial_result.extracted_data
        steps_done = e.partial_result.steps_completed
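Since in-memory mode hands you raw PNG bytes, persisting them for offline inspection is a one-liner per screenshot. A small stdlib helper (not part of the library; the name is illustrative):

```python
from pathlib import Path


def dump_screenshots(screenshots: list[bytes], out_dir: str) -> list[Path]:
    # Persist in-memory screenshot bytes (e.g. e.partial_result.screenshots)
    # to numbered PNG files for later inspection.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, png in enumerate(screenshots):
        path = out / f"screenshot_{i:03d}.png"
        path.write_bytes(png)
        paths.append(path)
    return paths
```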
Browser Profile Building (Profiler)
The profiler visits random sites before the main crawl to accumulate cookies and storage, making the browser appear more natural. This is configured via the profiler field in instructions.json.
Profiler Modes
The profiler supports two storage modes:
Disk mode (`storage_path` set): Persists the browser profile to disk.
- Camoufox: saves as a Firefox user data directory (`{storage_path}.camoufox/`)
- Chromium: saves as a JSON file (`{storage_path}.chromium.json`)
In-memory mode (`storage_path` omitted or null): No permanent files written.
- Chromium: returns storage state as a dict, passed directly to `browser.new_context(storage_state=dict)`
- Camoufox: uses a temporary directory (cleaned up on error, passed as `user_data_dir` during crawl)
Profiler Configuration
{
  "profiler": {
    "sites": [
      {
        "url": "https://www.google.de",
        "cookie_consent": {
          "banner_selector": "[role='dialog'], .cookie-banner",
          "accept_selector": "button[id*='accept'], button[id*='agree']",
          "js_fallback_texts": ["akzept", "accept", "zustimm"]
        },
        "browse_delay_ms": 2000
      },
      {
        "url": "https://en.wikipedia.org",
        "browse_delay_ms": 3000
      }
    ],
    "visit_count": 3,
    "ignore_errors": true,
    "storage_path": null,
    "inter_site_delay_ms": 1000
  }
}
ProfilerConfig Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| `sites` | array | Yes | List of ProfilerSiteConfig objects to randomly select from |
| `visit_count` | integer | Yes | Number of sites to randomly visit (must be <= length of `sites`) |
| `ignore_errors` | boolean | Yes | If true, continue profiling when a single site visit fails |
| `storage_path` | string or null | No | Base path for profile storage. When null (default), runs in in-memory mode |
| `inter_site_delay_ms` | integer | No | Delay in ms between visiting each site |
ProfilerSiteConfig Parameters
Each site in the `sites` array supports per-site cookie consent handling:
| Field | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | URL to visit for profile building |
| `cookie_consent` | object | No | Cookie consent config for this specific site (same schema as top-level `cookie_consent`) |
| `browse_delay_ms` | integer | No | Time in ms to linger on the page after cookie consent handling |
ProfilerResult
The profiler returns a ProfilerResult with:
| Field | Type | Description |
|---|---|---|
| `visited_urls` | list[str] | URLs that were visited during profiling |
| `storage_state_path` | Path or None | Path to persisted profile (disk mode), or temp dir path (Camoufox in-memory) |
| `storage_state` | dict or None | In-memory storage state dict (Chromium in-memory mode only) |
The crawler automatically passes the profiler result to the browser context creation, so no manual wiring is needed.
Examples
Insurance Quote Form
instructions.json:
{
  "url": "https://insurance.example.com/quote",
  "browser_config": {
    "viewport_width": 1920,
    "viewport_height": 1080
  },
  "cookie_consent": {
    "banner_selector": "#cookie-banner",
    "accept_selector": "button:has-text('Accept')"
  },
  "steps": [
    {
      "name": "vehicle_info",
      "wait_for": "input[name='hsn']",
      "timeout_ms": 15000,
      "fields": [
        {
          "type": "text",
          "selector": "input[name='hsn']",
          "data_key": "vehicle_hsn"
        },
        {
          "type": "text",
          "selector": "input[name='tsn']",
          "data_key": "vehicle_tsn"
        },
        {
          "type": "date",
          "selector": "input[name='registration_date']",
          "data_key": "first_registration",
          "type_config": {
            "format": "MM.YYYY"
          }
        }
      ],
      "next_action": {
        "type": "click",
        "selector": "button:has-text('Continue')"
      }
    },
    {
      "name": "personal_info",
      "wait_for": "input[name='birthdate']",
      "timeout_ms": 15000,
      "fields": [
        {
          "type": "date",
          "selector": "input[name='birthdate']",
          "data_key": "birthdate",
          "type_config": {
            "format": "DD.MM.YYYY"
          }
        },
        {
          "type": "text",
          "selector": "input[name='zipcode']",
          "data_key": "postal_code"
        }
      ],
      "next_action": {
        "type": "click",
        "selector": "button:has-text('Get Quote')"
      }
    }
  ],
  "final_page": {
    "wait_for": ".quote-result",
    "timeout_ms": 60000,
    "screenshot_selector": ".quote-panel"
  }
}
data_row.json:
{
  "vehicle_hsn": "0603",
  "vehicle_tsn": "AKZ",
  "first_registration": "2020-03-15",
  "birthdate": "1985-06-20",
  "postal_code": "80331"
}
Form with Iframes
{
  "steps": [
    {
      "name": "embedded_form",
      "wait_for": "iframe#form-frame",
      "timeout_ms": 15000,
      "fields": [
        {
          "type": "text",
          "selector": "input[name='email']",
          "iframe_selector": "iframe#form-frame",
          "data_key": "email"
        },
        {
          "type": "dropdown",
          "selector": "select[name='plan']",
          "iframe_selector": "iframe#form-frame",
          "data_key": "selected_plan",
          "type_config": {
            "select_by": "text"
          }
        }
      ],
      "next_action": {
        "type": "click",
        "selector": "button:has-text('Submit')",
        "iframe_selector": "iframe#form-frame"
      }
    }
  ]
}
License
MIT