Agentic Crawler Discovery Framework

azcrawlerpy

A framework for navigating and filling multi-step web forms programmatically. Supports the Camoufox anti-detect browser (built on Firefox with C++-level fingerprint spoofing) and standard Chromium via Playwright. Uses JSON instruction files to define form navigation workflows, making it ideal for automated form submission, web scraping, and AI-agent-driven web interactions.

Installation

uv add azcrawlerpy

Or install from source:

uv pip install -e .

Quick Start

FormCrawler requires a logger. In production, pass an AzureLogger from azpaddypy so crawl telemetry (spans, correlation IDs, structured records) flows into Application Insights. For local scripts and notebooks, any stdlib-shaped logger works — anything that implements debug/info/warning/error/exception/critical.

import asyncio
from pathlib import Path

from azcrawlerpy import CrawlerBrowserConfig, DebugMode, FormCrawler, HumanizeConfig
from azpaddypy.mgmt.logging import AzureLogger, bootstrap_azure_monitor


async def main():
    # 1. Wire telemetry once per process (idempotent; no-op without a
    #    connection string so it's safe in local runs).
    bootstrap_azure_monitor(service_name="crawler-demo")
    logger = AzureLogger(__name__)

    # 2. CrawlerBrowserConfig controls runtime browser settings
    #    (proxy, stealth, humanize).
    browser_config = CrawlerBrowserConfig(humanize=HumanizeConfig(enabled=True))

    # 3. FormCrawler requires a logger -- pass AzureLogger for App Insights
    #    integration, or a stdlib `logging.getLogger(__name__)` for local work.
    crawler = FormCrawler(
        headless=True,
        browser_config=browser_config,
        logger=logger,
    )

    instructions = {
        "url": "https://example.com/form",
        "browser_config": {
            "browser_type": "camoufox",
            "viewport_width": 1920,
            "viewport_height": 1080
        },
        "steps": [
            {
                "name": "step_1",
                "wait_for": "input[name='email']",
                "timeout_ms": 15000,
                "fields": [
                    {
                        "type": "text",
                        "selector": "input[name='email']",
                        "data_key": "email"
                    }
                ],
                "next_action": {
                    "type": "click",
                    "selector": "button[type='submit']"
                }
            }
        ],
        "final_page": {
            "wait_for": ".success-message",
            "timeout_ms": 60000
        }
    }

    input_data = {
        "email": "user@example.com"
    }

    result = await crawler.crawl(
        url=instructions["url"],
        input_data=input_data,
        instructions=instructions,
        output_dir=Path("./output"),
        debug_mode=DebugMode.ALL,
        # Optional wall-clock cap on the entire crawl. Exceeding it raises
        # CrawlerTimeoutError with a partial result attached. Omit or set
        # to None to disable.
        global_timeout_ms=300_000,
    )

    print(f"Final URL: {result.final_url}")
    print(f"Steps completed: {result.steps_completed}")
    print(f"Screenshot saved: {result.screenshot_path}")
    print(f"Extracted data: {result.extracted_data}")

asyncio.run(main())

Logging & Telemetry

FormCrawler accepts any logger that matches the stdlib logging.Logger shape (structural LoggerLike protocol in crawling/utils.py): debug, info, warning, error, exception, critical.
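A minimal sketch of what that shape looks like (illustrative only; the canonical definition is the LoggerLike protocol in crawling/utils.py):

from typing import Any, Protocol

class LoggerLike(Protocol):
    """Structural stand-in for the protocol in crawling/utils.py."""
    def debug(self, msg: str, *args: Any, **kwargs: Any) -> None: ...
    def info(self, msg: str, *args: Any, **kwargs: Any) -> None: ...
    def warning(self, msg: str, *args: Any, **kwargs: Any) -> None: ...
    def error(self, msg: str, *args: Any, **kwargs: Any) -> None: ...
    def exception(self, msg: str, *args: Any, **kwargs: Any) -> None: ...
    def critical(self, msg: str, *args: Any, **kwargs: Any) -> None: ...

# logging.Logger matches this shape structurally, as does AzureLogger.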

Production: AzureLogger

Pairing the crawler with AzureLogger from azpaddypy gives you the following, with no extra code in the crawler itself:

  • Correlation IDs propagated via contextvars so one crawl request is grep-able end-to-end.
  • Trace context (trace_id, span_id) attached to every log record.
  • Application Insights export of all crawl log records.
  • OpenTelemetry spans on methods you decorate with @trace_function.

from azpaddypy.mgmt.logging import AzureLogger, bootstrap_azure_monitor, trace_function

bootstrap_azure_monitor(service_name="crawler-api", service_version="0.5.1")
logger = AzureLogger(__name__)

# Optional: generate a correlation ID per request + emit a parent span
@trace_function(name="crawl_request")
async def handle_crawl(url: str, input_data: dict, instructions: dict):
    crawler = FormCrawler(headless=True, browser_config=None, logger=logger)
    return await crawler.crawl(
        url=url, input_data=input_data, instructions=instructions, output_dir=None,
    )

All log records emitted by the crawler will carry the correlation ID and span context of whatever span is active when crawl() runs.

Local / scripts: stdlib logger

If you don't need App Insights, a standard library logger is enough:

import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
crawler = FormCrawler(
    headless=True,
    browser_config=None,
    logger=logging.getLogger("my_crawler"),
)

Any logger that satisfies the LoggerLike Protocol is accepted — there is no hard runtime dependency on azpaddypy.

Core Concepts

The framework operates on two primary inputs:

  1. Instructions (instructions.json): Defines the form structure, selectors, navigation flow, and field types
  2. Data Points (input_data): Contains the actual values to fill into form fields

The crawler processes each step sequentially (a code sketch follows the list):

  1. Wait for the step's wait_for selector to become visible
  2. Fill all fields defined in the step using values from input_data
  3. Execute the next_action to navigate to the next step
  4. Repeat until all steps are complete
  5. Wait for and capture the final page
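In outline, the loop looks roughly like this (a simplified sketch, not the framework's actual implementation; fill_field and execute_action stand in for the real field and action handlers):

async def process_steps(page, instructions: dict, input_data: dict) -> None:
    for step in instructions["steps"]:
        # 1. Wait until the step has rendered
        await page.wait_for_selector(step["wait_for"], timeout=step["timeout_ms"])
        # 2. Fill each field from input_data (null values are skipped)
        for field in step["fields"]:
            value = input_data.get(field.get("data_key"))
            if value is not None:
                await fill_field(page, field, value)  # hypothetical helper
        # 3. Navigate onward
        await execute_action(page, step["next_action"])  # hypothetical helper
    # 4./5. Wait for and capture the final page
    final = instructions["final_page"]
    await page.wait_for_selector(final["wait_for"], timeout=final["timeout_ms"])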

Instructions Schema

Top-Level Structure

{
  "url": "https://example.com/form",
  "browser_config": { ... },
  "cookie_consent": { ... },
  "steps": [ ... ],
  "final_page": { ... },
  "data_extraction": { ... }
}

| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Starting URL for the form |
| browser_config | object | Yes | Browser engine, viewport, and user agent settings |
| cookie_consent | object | No | Cookie banner handling configuration |
| captcha | object | No | CAPTCHA handling configuration (e.g., Cloudflare Turnstile) |
| steps | array | Yes | Ordered list of form steps |
| final_page | object | Yes | Configuration for the result page |
| data_extraction | object | No | Configuration for extracting data from the final page |
| profiler | object | No | Browser profile building configuration (visit sites to accumulate cookies before crawling) |

Browser Configuration

{
  "browser_config": {
    "browser_type": "camoufox",
    "viewport_width": 1920,
    "viewport_height": 1080,
    "user_agent": "Mozilla/5.0 ..."
  }
}

| Field | Type | Required | Description |
|---|---|---|---|
| browser_type | string | Yes | Browser engine: camoufox (anti-detect Firefox) or chromium (standard Playwright) |
| viewport_width | integer | Yes | Browser viewport width in pixels |
| viewport_height | integer | Yes | Browser viewport height in pixels |
| user_agent | string | No | Custom user agent string |
| blocked_url_patterns | array | No | URL glob patterns to block via page.route() (e.g., **/analytics/**) |

Cookie Consent Handling

The framework supports two modes for handling cookie consent banners:

Standard Mode (regular DOM elements):

{
  "cookie_consent": {
    "banner_selector": "dialog:has-text('cookies')",
    "accept_selector": "button:has-text('Accept')"
  }
}

Shadow DOM Mode (for Usercentrics, OneTrust, etc.):

{
  "cookie_consent": {
    "banner_selector": "#usercentrics-cmp-ui",
    "shadow_host_selector": "#usercentrics-cmp-ui",
    "accept_button_texts": ["Accept All", "Alle akzeptieren"]
  }
}

| Field | Type | Required | Description |
|---|---|---|---|
| banner_selector | string | Yes | CSS selector for the banner container |
| accept_selector | string | No | CSS selector for the accept button (standard mode) |
| shadow_host_selector | string | No | CSS selector for the shadow DOM host |
| accept_button_texts | array | No | Text patterns to match accept buttons in the shadow DOM |
| banner_settle_delay_ms | integer | No | Wait time before checking for the banner |
| banner_visible_timeout_ms | integer | No | Timeout for banner visibility |
| accept_button_timeout_ms | integer | No | Timeout for accept button visibility |
| post_consent_delay_ms | integer | No | Wait time after handling consent |
| js_fallback_texts | array | No | Custom text patterns for JS fallback button matching (overrides defaults) |

The JS fallback matches button text against: klar, akzept, accept, agree, ok, verstanden, zustimm. Set js_fallback_texts to override this list with site-specific patterns.
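For example, narrowing the fallback to site-specific labels (the selectors and texts here are hypothetical):

{
  "cookie_consent": {
    "banner_selector": "#cmp-banner",
    "accept_selector": "#cmp-banner button.accept",
    "js_fallback_texts": ["alle cookies akzeptieren", "einverstanden"]
  }
}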

Step Definitions

Each step represents a form page or section:

{
  "name": "personal_info",
  "wait_for": "input[name='firstName']",
  "timeout_ms": 15000,
  "fields": [ ... ],
  "next_action": { ... }
}

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique identifier for the step |
| wait_for | string | Yes | CSS selector to wait for before processing |
| timeout_ms | integer | Yes | Timeout in milliseconds for the wait condition |
| fields | array | Yes | List of field definitions |
| next_action | object | Yes | Action to navigate to the next step |
| data_extraction | array | No | Data to extract BEFORE field handling in this step |
| post_field_extraction | array | No | Data to extract AFTER field handling (for modal results, dynamic values) |
| optional | boolean | No | Skip the step gracefully if the wait_for selector is not found (see Optional Steps) |
| restart_crawl | boolean | No | Trigger a full browser restart from the start URL when wait_for times out (see Restart Crawl). Requires max_restarts. |
| max_restarts | integer | No | Maximum number of full crawl restart attempts when restart_crawl is true (must be >= 1). Each step tracks its own restart budget. |

Restart Crawl

Steps marked with "restart_crawl": true raise an internal CrawlRestartError when the wait_for selector times out. The crawler then closes the browser, recreates the context (reusing any profiler storage state so cookie consent does not re-run), and replays the entire crawl from the start URL. Each step tracks its own restart budget against max_restarts; once exhausted, the original timeout error is raised.

{
  "name": "results_page",
  "wait_for": "[data-cy='quote-result']",
  "timeout_ms": 15000,
  "restart_crawl": true,
  "max_restarts": 2,
  "fields": [],
  "next_action": { "type": "noop" }
}

restart_crawl takes priority over optional and strict. Use it for steps where the page may render broken or never finalize due to flaky SPA/backend behavior; in those cases a fresh browser context is the most reliable recovery. Prefer retry_config on individual fields/actions for transient interaction failures, and reserve restart_crawl for full-page load failures.

Optional Steps

Steps marked with "optional": true are skipped gracefully when the wait_for selector is not found within timeout_ms. The entire step (fields, data extraction, next_action) is skipped with an info log. This is useful for conditional workflow pages that only appear depending on prior input values or dynamic form behavior.

{
  "name": "additional_driver_details",
  "wait_for": "#additional-driver-form",
  "timeout_ms": 5000,
  "optional": true,
  "fields": [ ... ],
  "next_action": { "type": "click", "selector": "#next" }
}

The optional check takes precedence over the strict parameter. When a step is optional and the wait_for selector times out, the step is always skipped regardless of strict mode. Non-optional steps follow the existing behavior: raise CrawlerTimeoutError when strict=True, or log a warning and continue into the step when strict=False.

Tip: Use a shorter timeout_ms (e.g., 3000-5000ms) on optional steps to avoid waiting the full timeout when the step is absent.

Field Types

TEXT

For text inputs, email fields, phone numbers, and similar:

{
  "type": "text",
  "selector": "input[name='email']",
  "data_key": "email"
}

TEXTAREA

For multi-line text areas:

{
  "type": "textarea",
  "selector": "textarea[name='message']",
  "data_key": "message"
}

DROPDOWN / SELECT

For native <select> elements:

{
  "type": "dropdown",
  "selector": "select[name='country']",
  "data_key": "country",
  "type_config": {
    "select_by": "text"
  }
}

| type_config parameter | Values | Description |
|---|---|---|
| select_by | text, value, index | How to match the option |
| option_visible_timeout_ms | integer | Timeout in ms for option visibility |

RADIO

For radio button groups:

{
  "type": "radio",
  "selector": "input[type='radio'][value='${value}']",
  "data_key": "payment_method"
}

Pattern A - Value-driven selector: Use ${value} placeholder in selector, data provides the value:

{
  "type": "radio",
  "selector": "input[type='radio'][value='${value}']",
  "data_key": "gender"
}
// data: { "gender": "male" }

Pattern B - Boolean flags: Use explicit selectors with boolean data values:

{
  "type": "radio",
  "selector": "[role='radio']:has-text('Yes')",
  "data_key": "accept_terms",
  "force_click": true
}
// data: { "accept_terms": true }  // clicks if truthy, skips if null

force_click is supported on radio, click_only, and click_select field types. When set, the click fallback chain tries force click before normal click.

CHECKBOX

For checkbox inputs:

{
  "type": "checkbox",
  "selector": "input[type='checkbox'][name='newsletter']",
  "data_key": "subscribe_newsletter"
}

Data value true checks the box, false unchecks it, and null skips the field entirely (no interaction).
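For example, with three checkbox fields wired to these keys (key names are hypothetical):

{
  "subscribe_newsletter": true,
  "marketing_emails": false,
  "partner_offers": null
}
// newsletter checked, marketing unchecked, partner offers untouched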

DATE

For date inputs with format conversion:

{
  "type": "date",
  "selector": "input[name='birthdate']",
  "data_key": "birthdate",
  "type_config": {
    "format": "DD.MM.YYYY"
  }
}

Supported formats (mapped to strftime internally):

| Format | Example | Description |
|---|---|---|
| DD.MM.YYYY | 15.06.1985 | Day.Month.Year |
| MM/DD/YYYY | 06/15/1985 | Month/Day/Year |
| YYYY-MM-DD | 1985-06-15 | ISO format |
| DD/MM/YYYY | 15/06/1985 | Day/Month/Year |
| YYYY/MM/DD | 1985/06/15 | Year/Month/Day |
| DD-MM-YYYY | 15-06-1985 | Day-Month-Year |
| MM-DD-YYYY | 06-15-1985 | Month-Day-Year |

Data must be provided in ISO format (YYYY-MM-DD) in input_data. The type_config.format specifies the output format typed into the field. Native <input type="date"> fields are auto-detected and use the ISO value as-is; no type_config is needed.
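Since the formats map to strftime internally, the conversion is equivalent to this round-trip, sketched here for the DD.MM.YYYY mapping:

from datetime import date

iso_value = "1985-06-15"  # as provided in input_data
formatted = date.fromisoformat(iso_value).strftime("%d.%m.%Y")
print(formatted)  # 15.06.1985 -- the string typed into the field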

SLIDER

For range inputs:

{
  "type": "slider",
  "selector": "input[type='range'][name='coverage']",
  "data_key": "coverage_amount"
}

FILE

For file upload fields:

{
  "type": "file",
  "selector": "input[type='file']",
  "data_key": "document_path"
}

Data value should be the absolute file path.

COMBOBOX

For autocomplete/typeahead inputs:

{
  "type": "combobox",
  "selector": "input[aria-label='City']",
  "data_key": "city",
  "type_config": {
    "option_selector": ".autocomplete-option",
    "type_delay_ms": 50,
    "wait_after_type_ms": 500,
    "press_enter": true
  }
}

| type_config parameter | Description |
|---|---|
| option_selector | CSS selector for dropdown options (required) |
| type_delay_ms | Delay between keystrokes (simulates human typing) |
| wait_after_type_ms | Wait time for options to appear |
| press_enter | Press Enter after selecting the option |
| clear_before_type | Clear the field before typing |
| option_visible_timeout_ms | Timeout in ms for option visibility |

CLICK_SELECT

For custom dropdowns requiring click-then-select:

{
  "type": "click_select",
  "selector": ".custom-dropdown-trigger",
  "data_key": "option_value",
  "post_click_delay_ms": 300,
  "type_config": {
    "option_selector": ".dropdown-item:has-text('${value}')"
  }
}

CLICK_ONLY

For elements that only need clicking (no data input):

{
  "type": "click_only",
  "selector": "button.expand-section"
}

With conditional clicking based on data:

{
  "type": "click_only",
  "selector": "button:has-text('${value}')",
  "data_key": "selected_option"
}

IFRAME_FIELD

For fields inside iframes (an alternative to setting iframe_selector on a standard field type):

{
  "type": "iframe_field",
  "selector": "input[name='card_number']",
  "iframe_selector": "iframe#payment-frame",
  "data_key": "card_number"
}

Common Field Parameters

| Parameter | Type | Description |
|---|---|---|
| data_key | string | Key in input_data to get the value from |
| selector | string | CSS/Playwright selector for the element |
| type_config | object | Type-specific configuration (see field type sections above) |
| iframe_selector | string | Selector for the parent iframe if the field is embedded |
| field_visible_timeout_ms | integer | Timeout for the field to become visible |
| post_click_delay_ms | integer | Wait after clicking the field |
| skip_verification | boolean | Skip value verification after filling |
| force_click | boolean | Use force click to bypass overlays (click_only, radio, and click_select) |
| optional | boolean | Skip the field gracefully if the element is not found or interaction fails |
| retry_config | object | Retry configuration for transient failures (see Retry and Resilience) |

Optional Fields

Fields marked with "optional": true are skipped gracefully when the element is not found on the page or when interaction fails. This is useful for elements that may or may not appear depending on dynamic page behavior, A/B tests, or conditional rendering that cannot be predicted by data alone.

{
  "type": "click_only",
  "selector": "#promotional-banner button.dismiss",
  "data_key": "dismiss_promo",
  "optional": true,
  "field_visible_timeout_ms": 2000
}

How optional differs from null values:

| Mechanism | Element lookup | Use case |
|---|---|---|
| Value = null in input_data | No (skipped immediately) | Field exists but should not be filled for this data row |
| "optional": true | Yes (waits for visibility) | Field may or may not exist on the page |

When optional is set and the element is not found or interaction fails, the handler logs an info message and continues to the next field. No error is raised regardless of the strict mode.

Tip: Set a low field_visible_timeout_ms (e.g., 1000-2000ms) on optional fields to avoid waiting the full default timeout when the element is absent.

Note: For skipping entire steps (all fields + next_action), use Optional Steps instead.

Action Types

CLICK

Click a button or link:

{
  "type": "click",
  "selector": "button[type='submit']"
}

With iframe support:

{
  "type": "click",
  "selector": "button:has-text('Next')",
  "iframe_selector": "iframe#form-frame"
}

WAIT

Wait for an element to appear:

{
  "type": "wait",
  "selector": ".loading-complete"
}

WAIT_HIDDEN

Wait for an element to disappear:

{
  "type": "wait_hidden",
  "selector": ".loading-spinner"
}

SCROLL

Scroll to an element:

{
  "type": "scroll",
  "selector": "#section-bottom"
}

DELAY

Wait for a fixed time:

{
  "type": "delay",
  "delay_ms": 2000
}

CONDITIONAL

Execute actions based on conditions:

{
  "type": "conditional",
  "condition": {
    "type": "selector_visible",
    "selector": ".error-message"
  },
  "actions": [
    {
      "type": "click",
      "selector": "button.dismiss-error"
    }
  ]
}

Condition types:

  • selector_visible: True if selector is visible on the page
  • selector_hidden: True if selector is NOT visible on the page
  • data_equals: True if input_data[key] equals value
  • data_exists: True if input_data[key] is truthy

Common Action Parameters

| Parameter | Type | Description |
|---|---|---|
| selector | string | Target element selector |
| iframe_selector | string | Selector for the parent iframe |
| delay_ms | integer | Delay/timeout in ms (for delay, wait, wait_hidden actions) |
| condition | object | Condition definition (for conditional actions) |
| actions | array | Nested actions to execute if the condition is met (for conditional actions) |
| pre_action_delay_ms | integer | Wait before executing the action |
| post_action_delay_ms | integer | Wait after executing the action |
| retry_config | object | Retry configuration for transient failures (see Retry and Resilience) |

Final Page Configuration

{
  "final_page": {
    "wait_for": ".result-container, .confirmation",
    "timeout_ms": 60000,
    "post_wait_delay_ms": 2000,
    "screenshot_selector": ".result-panel"
  }
}

| Field | Type | Required | Description |
|---|---|---|---|
| wait_for | string | Conditional | CSS selector to wait for (required if wait_for_data_key is not set) |
| wait_for_data_key | string | Conditional | Data key for an exact text match (uses a text="{value}" selector). Required if wait_for is not set. |
| timeout_ms | integer | Yes | Timeout in milliseconds for waiting |
| post_wait_delay_ms | integer | No | Delay in ms after the selector is found, for SPA content to render (default: 0) |
| screenshot_selector | string | No | Element to screenshot (null for full page) |

Exactly one of wait_for or wait_for_data_key must be set (never both).
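For example, waiting for a value from input_data to appear as exact text on the result page (the confirmation_code key is hypothetical):

{
  "final_page": {
    "wait_for_data_key": "confirmation_code",
    "timeout_ms": 30000
  }
}
// input_data: { "confirmation_code": "ABC-123" } waits for text="ABC-123"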

Data Extraction Configuration

Extract structured data from the final page using CSS selectors:

{
  "data_extraction": {
    "fields": {
      "tier_prices": {
        "selector": ".price-value",
        "attribute": null,
        "regex": "([0-9]+[.,][0-9]{2})",
        "multiple": true,
        "iframe_selector": "iframe#form-frame"
      },
      "selected_price": {
        "selector": "#total-amount",
        "attribute": "data-value",
        "regex": null,
        "multiple": false
      }
    }
  }
}

| Field | Type | Required | Description |
|---|---|---|---|
| selector | string | Yes | CSS selector to locate element(s) |
| attribute | string | No | Element attribute to extract (null for text content) |
| regex | string | No | Regex pattern to apply (uses the first capture group if present) |
| multiple | boolean | Yes | True for a list of all matches, False for the first match only |
| iframe_selector | string | No | CSS selector for the iframe if the element is inside one |

The data_extraction config also supports nested_fields to combine flat extracted arrays into structured outputs (paired_dict or object_list). See docs/README.md for details.

Extracted data is available in the crawl result:

result = await crawler.crawl(...)
print(result.extracted_data)
# {'tier_prices': ['32,28', '35,26', '50,34'], 'selected_price': '35,26'}

Data Points (input_data)

The input_data dictionary provides values for form fields. Keys must match data_key values in the instructions.

Structure

{
  "email": "user@example.com",
  "first_name": "John",
  "last_name": "Doe",
  "birthdate": "1985-06-15",
  "country": "Germany",
  "accept_terms": true,
  "newsletter": false,
  "premium_option": null
}

Value Types

| Type | Description | Example |
|---|---|---|
| String | Text values, dropdown selections | "John" |
| Boolean | Checkbox/radio toggle | true, false |
| Null | Skip this field entirely (no DOM interaction) | null |
| Integer/Float | Numeric inputs, sliders | 12000, 99.99 |

Note: Setting a value to null skips the field without any DOM interaction. This is different from "optional": true on the field definition, which attempts the interaction but tolerates failure. See Optional Fields.

Radio Button Patterns

Pattern A - Mutually exclusive options with value selector:

{
  "gender": "male"
}

Selector uses ${value} placeholder: input[value='${value}']

Pattern B - Boolean flags for each option:

{
  "option_a": true,
  "option_b": null,
  "option_c": null
}

Only the option with true gets clicked.

Date Handling

Dates in input_data should use ISO format (YYYY-MM-DD):

{
  "birthdate": "1985-06-15",
  "start_date": "2024-01-01"
}

The framework converts to the format specified in the field definition.

Element Discovery

The ElementDiscovery class scans web pages to identify interactive elements, helping build instructions.json files.

import asyncio
from pathlib import Path

from azcrawlerpy import ElementDiscovery

async def discover_elements():
    discovery = ElementDiscovery(headless=False)

    report = await discovery.discover(
        url="https://example.com/form",
        output_dir=Path("./discovery_output"),
        cookie_consent={
            "banner_selector": "#cookie-banner",
            "accept_selector": "button.accept"
        },
        explore_iframes=True,
        screenshot=True,
    )

    print(f"Found {report.total_elements} elements")

    for text_input in report.text_inputs:
        print(f"Text input: {text_input.selector}")
        print(f"  Suggested type: {text_input.suggested_field_type}")

    for dropdown in report.selects:
        print(f"Dropdown: {dropdown.selector}")
        print(f"  Options: {dropdown.options}")

    for radio_group in report.radio_groups:
        print(f"Radio group: {radio_group.name}")
        for option in radio_group.options:
            print(f"  - {option.label}: {option.selector}")

asyncio.run(discover_elements())

Discovery Report Contents

  • text_inputs: Text, email, phone, password fields
  • textareas: Multi-line text areas
  • selects: Native dropdown elements with options
  • radio_groups: Grouped radio buttons
  • checkboxes: Checkbox inputs
  • buttons: Clickable buttons
  • links: Anchor elements
  • date_inputs: Date picker fields
  • file_inputs: File upload fields
  • sliders: Range inputs
  • custom_components: Non-standard interactive elements
  • iframes: Discovered iframes with their elements

AI Agent Guidance

This section provides instructions for AI agents tasked with creating instructions.json and input_data files.

Workflow for Creating Instructions

  1. Discovery Phase: Use ElementDiscovery to scan each page/step of the form
  2. Mapping Phase: Map discovered elements to field definitions
  3. Flow Definition: Define step transitions and actions
  4. Data Schema: Create the input_data structure

Step-by-Step Process

1. Analyze the Form Structure

  • Identify how many pages/steps the form has
  • Note the URL pattern changes (if any)
  • Identify what element appears when each step loads

2. For Each Step, Define:

{
  "name": "<descriptive_step_name>",
  "wait_for": "<selector_that_confirms_step_loaded>",
  "timeout_ms": 15000,
  "fields": [...],
  "next_action": {...}
}

Naming conventions:

  • Use snake_case for step names: personal_info, payment_details
  • Use descriptive data_keys: first_name, email_address, accepts_terms

3. Selector Priority

When choosing selectors, prefer in order:

  1. [data-testid='...'] or [data-cy='...'] - Most stable
  2. [aria-label='...'] or [aria-labelledby='...'] - Accessible and stable
  3. input[name='...'] - Form field names
  4. :has-text('...') - Text content (use for buttons/labels)
  5. CSS class selectors - Least stable, avoid if possible

4. Handle Dynamic Content

For AJAX-loaded content:

  • Use wait action before interacting
  • Add field_visible_timeout_ms to field definitions
  • Use post_click_delay_ms for fields that trigger updates

5. Radio Button Strategy

Option A - When radio values are meaningful:

{
  "type": "radio",
  "selector": "input[type='radio'][value='${value}']",
  "data_key": "payment_type"
}
// data: { "payment_type": "credit_card" }

Option B - When you need individual control:

{
  "type": "radio",
  "selector": "[role='radio']:has-text('Credit Card')",
  "data_key": "payment_credit_card",
  "force_click": true
},
{
  "type": "radio",
  "selector": "[role='radio']:has-text('PayPal')",
  "data_key": "payment_paypal",
  "force_click": true
}
// data: { "payment_credit_card": true, "payment_paypal": null }

6. Iframe Handling

When elements are inside iframes:

{
  "type": "text",
  "selector": "input[name='card_number']",
  "iframe_selector": "iframe#payment-iframe",
  "data_key": "card_number"
}

Creating input_data

1. Analyze Required Fields

From the instructions, extract all unique data_key values:

data_keys = set()
for step in instructions["steps"]:
    for field in step["fields"]:
        if field.get("data_key"):
            data_keys.add(field["data_key"])

2. Determine Value Types

| Field Type | Data Type | Example |
|---|---|---|
| text, textarea | string | "John Doe" |
| dropdown | string | "Germany" |
| radio (value-driven) | string | "option_a" |
| radio (boolean) | boolean/null | true or null |
| checkbox | boolean | true / false |
| date | string (ISO) | "1985-06-15" |
| slider | number | 50000 |
| file | string (path) | "/path/to/file.pdf" |

3. Handle Mutually Exclusive Options

For radio groups with boolean flags, only ONE should be true:

{
  "employment_fulltime": true,
  "employment_parttime": null,
  "employment_selfemployed": null,
  "employment_unemployed": null
}

4. Date Format

Always provide dates in ISO format in input_data:

{
  "birthdate": "1985-06-15",
  "policy_start": "2024-01-01"
}

The type_config.format in the field definition specifies the output format for the specific form.

Common Patterns

Multi-Step Wizard

{
  "steps": [
    {
      "name": "step_1_personal",
      "wait_for": "input[name='firstName']",
      "fields": [...],
      "next_action": { "type": "click", "selector": "button:has-text('Next')" }
    },
    {
      "name": "step_2_address",
      "wait_for": "input[name='street']",
      "fields": [...],
      "next_action": { "type": "click", "selector": "button:has-text('Next')" }
    }
  ]
}

Form with Loading States

{
  "next_action": {
    "type": "click",
    "selector": "button[type='submit']",
    "post_action_delay_ms": 1000
  }
}

Conditional Fields

{
  "type": "conditional",
  "condition": {
    "type": "data_equals",
    "key": "has_additional_driver",
    "value": true
  },
  "actions": [
    {
      "type": "click",
      "selector": "button:has-text('Add Driver')"
    }
  ]
}

Retry and Resilience

The framework provides built-in retry and fallback mechanisms for handling transient failures during web interactions.

Retry Configuration

Add retry_config to any field, action, step extraction, or data extraction to enable automatic retry with exponential backoff:

{
  "type": "text",
  "selector": "input[name='email']",
  "data_key": "email",
  "retry_config": {
    "max_attempts": 3,
    "base_delay_ms": 500,
    "backoff_multiplier": 1.5
  }
}

| Parameter | Type | Required | Description |
|---|---|---|---|
| max_attempts | integer | Yes | Total attempts (1 = no retry, 2 = one retry, etc.) |
| base_delay_ms | integer | Yes | Base delay between retries in milliseconds |
| backoff_multiplier | float | No | Multiplier per retry (default: 1.5). Delay = base_delay_ms * backoff_multiplier^(attempt-1) |

Retryable errors include Playwright TimeoutError and Error (element not found, intercepted clicks, etc.).
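For the configuration above (max_attempts=3, base_delay_ms=500, backoff_multiplier=1.5), the waits between attempts work out like this:

base_delay_ms, multiplier, max_attempts = 500, 1.5, 3

for attempt in range(1, max_attempts):  # no wait after the final attempt
    delay = base_delay_ms * multiplier ** (attempt - 1)
    print(f"after failed attempt {attempt}: wait {delay:.0f} ms")
# after failed attempt 1: wait 500 ms
# after failed attempt 2: wait 750 ms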

Click Fallback Chain

All click interactions (click actions, click_only, radio, click_select fields) use an escalating fallback strategy when a click fails:

  1. Normal click (or force click if force_first=True)
  2. Force click (or normal click if force_first=True) -- bypasses actionability checks
  3. JS dispatchEvent -- fires a MouseEvent via element.dispatchEvent()
  4. JS element.click() -- calls element.click() via document.querySelector()

This handles common real-world issues like overlay interception, sticky headers, and elements that pass Playwright's visibility checks but fail to receive pointer events.
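A simplified sketch of that escalation using the Playwright Python API (illustrative only, not the framework's actual code):

from playwright.async_api import Error as PlaywrightError, Locator

async def click_with_fallbacks(locator: Locator, selector: str) -> None:
    for attempt in (
        lambda: locator.click(),                  # 1. normal click
        lambda: locator.click(force=True),        # 2. force click (skips actionability checks)
        lambda: locator.dispatch_event("click"),  # 3. JS dispatchEvent
    ):
        try:
            await attempt()
            return
        except PlaywrightError:
            continue
    # 4. element.click() via document.querySelector()
    await locator.page.evaluate("sel => document.querySelector(sel).click()", selector)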

Value Verification

text and textarea fields verify the actual input value after filling. If the value written to the field does not match the expected value, a FieldInteractionError is raised. This catches silent data corruption from autofill interference, input masks, or JavaScript reformatting.

Set skip_verification: true on the field to disable this check for fields where the site intentionally transforms the input (e.g., phone number formatting).
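For example, a phone field whose input mask reformats what is typed (selector and key are hypothetical):

{
  "type": "text",
  "selector": "input[name='phone']",
  "data_key": "phone",
  "skip_verification": true
}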

ComboBox Degraded Retry

When retry_config is set on a combobox field, retries use degraded parameters to handle slow autocomplete responses:

  • Typing delay scales by 1.5^(attempt-1) (slower typing gives autocomplete more time)
  • The field is always cleared before retyping on retry attempts

Error Handling and Diagnostics

The framework provides detailed error information when failures occur.

Exception Types

All CrawlerError subclasses can carry a partial_result attribute populated via with_partial_result(). When the crawl was started with output_dir=None, this is an InMemoryCrawlResult holding whatever state had been collected before the failure (screenshots, extracted_data, steps_completed, profiler_visited_urls, platform_logs). Catch the exception and read e.partial_result to recover that data — see In-Memory Error Diagnostics below.

| Exception | When raised |
|---|---|
| FieldNotFoundError | Selector doesn't match any element (not raised for optional fields) |
| FieldInteractionError | Element found but interaction failed, or value mismatch after fill (not raised for optional fields) |
| CrawlerTimeoutError | A step's timeout_ms budget was exhausted (initial wait_for OR the rest of the step body: extractions, fields, action, retries), a restart budget was exhausted, OR the crawl's global_timeout_ms wall-clock cap was hit |
| NavigationError | Navigation action failed |
| MissingDataError | Required data_key not in input_data |
| InvalidInstructionError | Malformed instructions JSON |
| UnsupportedFieldTypeError | Unknown field type specified |
| UnsupportedActionTypeError | Unknown action type specified |
| IframeNotFoundError | Specified iframe not found |
| DataExtractionError | Data extraction from the final page failed |

Timeout Budgets

The crawler enforces two independent wall-clock budgets. Both translate their exhaustion into CrawlerTimeoutError (or CrawlRestartError when restart is configured) and attach a partial result.

1. Per-step budget — step.timeout_ms

Hard upper bound on the total wall-clock duration of one step. Covers wait_for, data_extraction, field handlers (including retries), next_action (including retries and delays) and post_field_extraction. The same field is also used as the timeout for the initial wait_for_selector call, so a step that uses its full budget waiting for the selector leaves no time for the rest of the body. On exhaustion, the priority chain restart_crawl > optional > strict decides the outcome:

  • restart_crawl=TrueCrawlRestartError → full browser restart (up to max_restarts).
  • optional=True → step is skipped silently.
  • strict=TrueCrawlerTimeoutError.
  • strict=False → warning logged, step returns.

2. Crawl-level budget — crawler.crawl(..., global_timeout_ms=...)

Optional wall-clock cap on the entire crawl (profiler, navigation, all steps and the final-page capture). None (the default) disables it. When the deadline fires, the crawl task is cancelled and CrawlerTimeoutError(step_name="<global_timeout>", selector="<global>", timeout_ms=...) is raised with a partial InMemoryCrawlResult attached. The partial carries screenshots, extracted_data, steps_completed, profiler_visited_urls and platform_logs collected up to the moment of cancellation. html is empty and final_url falls back to the start URL because the live page is being torn down by the cancellation and cannot be read.

from azcrawlerpy.crawling.exceptions import CrawlerTimeoutError

try:
    result = await crawler.crawl(
        url=url,
        input_data=input_data,
        instructions=instructions,
        output_dir=None,
        debug_mode=DebugMode.ALL,
        global_timeout_ms=300_000,  # 5 minutes
    )
except CrawlerTimeoutError as e:
    if e.step_name == "<global_timeout>":
        # Global budget hit -- inspect partial_result for what was collected.
        partial = e.partial_result
        if partial is not None:
            print(f"Timed out after {partial.steps_completed} step(s)")
            print(f"Screenshots: {len(partial.screenshots)}")
    else:
        # Per-step budget or restart-budget exhaustion.
        raise

Debug Mode

Enable debug mode to capture screenshots at various stages:

from azcrawlerpy import DebugMode

result = await crawler.crawl(
    ...,
    debug_mode=DebugMode.ALL,  # Capture all screenshots
)

| Mode | Description |
|---|---|
| NONE | No debug screenshots |
| START | Screenshot at form start |
| END | Screenshot at form end |
| ALL | Screenshots after every field and action |

AI Diagnostics

When errors occur with debug mode enabled, the framework captures:

  • Current page URL and title
  • Available data-cy and data-testid selectors
  • Visible buttons and input fields
  • Similar selectors (fuzzy matching suggestions)
  • Console errors and warnings
  • Failed network requests
  • XHR/fetch API responses (URL, status, method) for debugging SPA extraction failures
  • HTML snippet of the form area
  • Error screenshot (saved to disk or captured in-memory)

This information is included in the exception message and saved to error_diagnostics.json (when output_dir is set).

In-Memory Error Diagnostics

When running with output_dir=None, error diagnostics are fully available in-memory. The error screenshot is captured as bytes and accessible in two ways:

  1. On the exception's diagnostics: exception.diagnostics.screenshot_bytes
  2. As the last entry in the partial result: exception.partial_result.screenshots[-1]

try:
    result = await crawler.crawl(
        url=url,
        input_data=input_data,
        instructions=instructions,
        output_dir=None,
        debug_mode=DebugMode.ALL,
    )
except CrawlerError as e:
    # Access the error screenshot bytes directly from diagnostics
    if e.diagnostics and e.diagnostics.screenshot_bytes:
        error_screenshot = e.diagnostics.screenshot_bytes

    # Or from the partial result (includes all screenshots taken during the crawl)
    if e.partial_result:
        all_screenshots = e.partial_result.screenshots  # includes error screenshot as last item
        extracted_so_far = e.partial_result.extracted_data
        steps_done = e.partial_result.steps_completed

Azure App Service Log Capture

When the crawler runs on an Azure App Service Plan target (Web App, Function App, or Web App / Function App in a container), debug_mode=DebugMode.ALL automatically captures the platform log tree so it can be attached to the run's output artifacts. This is where Playwright's own DEBUG=pw:api,pw:browser* stderr ends up in Azure, so it is often the only way to see low-level browser-launch failures in post-mortem.

When it activates (two gating conditions):

  • debug_mode=DebugMode.ALL, AND
  • $HOME is set and $HOME/LogFiles/ is a directory.

output_dir is no longer a gating condition — it selects the sink (see below). If either gating check fails, the crawl is otherwise untouched and a clear reason is logged (see "Log output" below).

Two sinks, chosen by output_dir:

| output_dir | Sink | Where to find the logs |
|---|---|---|
| a Path (disk mode) | Full recursive copy to <output_dir>/platform_logs/ | On the filesystem alongside result.html, result.png, etc. |
| None (in-memory mode) | Full read into a dict[str, bytes] | InMemoryCrawlResult.platform_logs (keys are forward-slash relative paths like "Application/Functions/Worker/worker.log"; values are raw bytes). Also attached to CrawlerError.partial_result.platform_logs on graceful failures so harnesses that catch CrawlerError can still recover them. |

What gets copied:

The entire $HOME/LogFiles/ tree, recursively. No filtering. Files include:

  • Application/Functions/Worker/*.log (Python worker stdout/stderr on Functions)
  • Application/Functions/Host/ (Functions host logs)
  • <instance>_default_docker.log (Web App container stdout/stderr)
  • kudu/, Nginx/, docker.log, etc.

On Linux App Service that path is /home/LogFiles/; on Windows App Service it is typically C:\home\LogFiles\ or D:\home\LogFiles\. The detection uses $HOME so both work without configuration.

In-memory usage example (harness uploads to its own storage):

result = await crawler.crawl(
    url=url,
    input_data=input_data,
    instructions=instructions,
    output_dir=None,           # in-memory mode
    debug_mode=DebugMode.ALL,  # activates platform log capture
)

# result.platform_logs is a dict[str, bytes]
for rel_path, content in result.platform_logs.items():
    blob_client.upload_blob(
        name=f"{run_id}/platform_logs/{rel_path}",
        data=content,
        overwrite=True,
    )

Crash-safe:

The copy is wrapped in try / finally around the entire crawl body, so it runs on:

  • Successful return
  • CrawlerTimeoutError from an exhausted restart budget
  • Any CrawlerError wrapping a browser crash
  • Profiler-phase failures (the profiler is inside the try)
  • KeyboardInterrupt / task cancellation

Per-file errors (Windows log rotation, permission issues, races) are logged as warnings and counted; they never mask the crawl's original exception.

Log output:

Every invocation emits one start line plus one outcome line via the configured logger (flows to Application Insights through azpaddypy):

| Situation | Level | Message |
|---|---|---|
| Helper invoked | INFO | Platform log capture starting: dest='<output_dir>' |
| $HOME not set (local dev, non-Azure) | INFO | Platform log capture skipped: $HOME not set (not an Azure App Service target) |
| $HOME/LogFiles missing | INFO | Platform log capture skipped: '<path>' not found (not an Azure App Service target) |
| $HOME/LogFiles exists but is a file | WARNING | Platform log capture skipped: '<path>' exists but is not a directory |
| Cannot stat $HOME/LogFiles (permissions) | WARNING | Platform log capture skipped: cannot stat '<path>' error=<exc> |
| Source found but empty | WARNING | Platform log capture found source but copied nothing: source='<path>' (directory is empty or contains no regular files) |
| Files copied | INFO | Platform logs copied: source='<s>' dest='<d>' files=<N> bytes=<B> errors=<E> |

Deployment requirements per target:

| Target | Works out of the box? | Notes |
|---|---|---|
| Azure Functions on App Service Plan | Yes | Functions host always writes to /home/LogFiles/Application/Functions/Worker/. Playwright Node stderr inherits that fd. |
| Web App for Containers on App Service Plan | Yes, with two app settings | WEBSITES_ENABLE_APP_SERVICE_STORAGE=true (default) so /home is the Azure Files mount, and App Service Logs → Filesystem enabled so stdout/stderr is written there. Enable via az webapp log config --docker-container-logging filesystem --level information. |
| Pure Azure Container Apps | No | Container Apps does not mount /home/LogFiles/; logs go to Log Analytics only. The helper correctly INFO-logs "not found" and skips. |
| Local dev (docker run, native Python) | No (by design) | $HOME/LogFiles/ doesn't exist; the helper skips and logs the reason. |

In-memory mode delivery:

When output_dir=None the collected bytes land in InMemoryCrawlResult.platform_logs on success. On graceful failure (any CrawlerError, including CrawlerTimeoutError from an exhausted restart budget), they land on exception.partial_result.platform_logs. On hard asyncio.CancelledError from a harness timeout, the bytes are collected into state but cannot be attached to any result object because no result exists at that point — use disk mode if your harness cancels and you need the logs.

Shared-worker noise:

/home/LogFiles/Application/Functions/Worker/*.log is shared across all concurrent function invocations on the same worker instance. The copied snapshot therefore includes any overlapping activity — there is no per-invocation filtering. This is an intentional "full copy always" trade for forensic completeness.

Browser Profile Building (Profiler)

The profiler visits random sites before the main crawl to accumulate cookies and storage, making the browser appear more natural. This is configured via the profiler field in instructions.json.

Profiler Modes

The profiler supports two storage modes:

Disk mode (storage_path set): Persists the browser profile to disk.

  • Camoufox: saves as a Firefox user data directory ({storage_path}.camoufox/)
  • Chromium: saves as a JSON file ({storage_path}.chromium.json)

In-memory mode (storage_path omitted or null): No permanent files written.

  • Chromium: returns storage state as a dict, passed directly to browser.new_context(storage_state=dict)
  • Camoufox: uses a temporary directory (cleaned up on error, passed as user_data_dir during crawl)

Profiler Configuration

{
  "profiler": {
    "sites": [
      {
        "url": "https://www.google.de",
        "cookie_consent": {
          "banner_selector": "[role='dialog'], .cookie-banner",
          "accept_selector": "button[id*='accept'], button[id*='agree']",
          "js_fallback_texts": ["akzept", "accept", "zustimm"]
        },
        "browse_delay_ms": 2000
      },
      {
        "url": "https://en.wikipedia.org",
        "browse_delay_ms": 3000
      }
    ],
    "visit_count": 3,
    "ignore_errors": true,
    "storage_path": null,
    "inter_site_delay_ms": 1000
  }
}

ProfilerConfig Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| sites | array | Yes | List of ProfilerSiteConfig objects to randomly select from |
| visit_count | integer | Yes | Number of sites to randomly visit (must be <= length of sites) |
| ignore_errors | boolean | Yes | If true, continue profiling when a single site visit fails |
| storage_path | string or null | No | Base path for profile storage. When null (the default), runs in in-memory mode |
| inter_site_delay_ms | integer | No | Delay in ms between visiting each site |

ProfilerSiteConfig Parameters

Each site in the sites array supports per-site cookie consent handling:

| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to visit for profile building |
| cookie_consent | object | No | Cookie consent config for this specific site (same schema as the top-level cookie_consent) |
| browse_delay_ms | integer | No | Time in ms to linger on the page after cookie consent handling |

ProfilerResult

The profiler returns a ProfilerResult with:

| Field | Type | Description |
|---|---|---|
| visited_urls | list[str] | URLs that were visited during profiling |
| storage_state_path | Path or None | Path to the persisted profile (disk mode), or the temp dir path (Camoufox in-memory) |
| storage_state | dict or None | In-memory storage state dict (Chromium in-memory mode only) |

The crawler automatically passes the profiler result to the browser context creation, so no manual wiring is needed.
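For example, in in-memory mode the visited URLs surface on the crawl result (a sketch; it assumes the profiler_visited_urls field that InMemoryCrawlResult is documented to carry):

result = await crawler.crawl(
    url=url,
    input_data=input_data,
    instructions=instructions,  # contains the "profiler" block above
    output_dir=None,            # in-memory mode
)
print(result.profiler_visited_urls)  # e.g. ["https://www.google.de", ...]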

Examples

Insurance Quote Form

instructions.json:

{
  "url": "https://insurance.example.com/quote",
  "browser_config": {
    "viewport_width": 1920,
    "viewport_height": 1080
  },
  "cookie_consent": {
    "banner_selector": "#cookie-banner",
    "accept_selector": "button:has-text('Accept')"
  },
  "steps": [
    {
      "name": "vehicle_info",
      "wait_for": "input[name='hsn']",
      "timeout_ms": 15000,
      "fields": [
        {
          "type": "text",
          "selector": "input[name='hsn']",
          "data_key": "vehicle_hsn"
        },
        {
          "type": "text",
          "selector": "input[name='tsn']",
          "data_key": "vehicle_tsn"
        },
        {
          "type": "date",
          "selector": "input[name='registration_date']",
          "data_key": "first_registration",
          "type_config": {
            "format": "MM.YYYY"
          }
        }
      ],
      "next_action": {
        "type": "click",
        "selector": "button:has-text('Continue')"
      }
    },
    {
      "name": "personal_info",
      "wait_for": "input[name='birthdate']",
      "timeout_ms": 15000,
      "fields": [
        {
          "type": "date",
          "selector": "input[name='birthdate']",
          "data_key": "birthdate",
          "type_config": {
            "format": "DD.MM.YYYY"
          }
        },
        {
          "type": "text",
          "selector": "input[name='zipcode']",
          "data_key": "postal_code"
        }
      ],
      "next_action": {
        "type": "click",
        "selector": "button:has-text('Get Quote')"
      }
    }
  ],
  "final_page": {
    "wait_for": ".quote-result",
    "timeout_ms": 60000,
    "screenshot_selector": ".quote-panel"
  }
}

data_row.json:

{
  "vehicle_hsn": "0603",
  "vehicle_tsn": "AKZ",
  "first_registration": "2020-03-15",
  "birthdate": "1985-06-20",
  "postal_code": "80331"
}

Form with Iframes

{
  "steps": [
    {
      "name": "embedded_form",
      "wait_for": "iframe#form-frame",
      "timeout_ms": 15000,
      "fields": [
        {
          "type": "text",
          "selector": "input[name='email']",
          "iframe_selector": "iframe#form-frame",
          "data_key": "email"
        },
        {
          "type": "dropdown",
          "selector": "select[name='plan']",
          "iframe_selector": "iframe#form-frame",
          "data_key": "selected_plan",
          "type_config": {
            "select_by": "text"
          }
        }
      ],
      "next_action": {
        "type": "click",
        "selector": "button:has-text('Submit')",
        "iframe_selector": "iframe#form-frame"
      }
    }
  ]
}

License

MIT
