StepWright
A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.
Features
- Declarative Scraping: Define scraping workflows using Python dictionaries or dataclasses
- Pagination Support: Built-in support for next-button and scroll-based pagination
- Data Collection: Extract text, HTML, values, and files from web pages
- Multi-tab Support: Handle multiple tabs and complex navigation flows
- PDF Generation: Save pages as PDFs or trigger print-to-PDF actions
- File Downloads: Download files with automatic directory creation
- Looping & Iteration: ForEach loops for processing multiple elements
- Streaming Results: Real-time result processing with callbacks
- Error Handling: Graceful error handling with configurable termination
- Flexible Selectors: Support for ID, class, tag, and XPath selectors
- Retry Logic: Automatic retry on failure with configurable delays
- Conditional Execution: Skip or execute steps based on JavaScript conditions
- Smart Waiting: Wait for selectors before actions with configurable timeouts
- Fallback Selectors: Multiple selector fallbacks for increased robustness
- Enhanced Clicks: Double-click, right-click, modifier keys, and force clicks
- Input Enhancements: Clear before input, human-like typing delays
- Data Transformations: Regex extraction, JavaScript transformations, default values
- Page Actions: Reload, get URL/title, meta tags, cookies, localStorage, viewport
- Human-like Behavior: Random delays to mimic human interaction
- Element State Checks: Require visible/enabled before actions
- IFrame Support: Interact with elements inside nested IFrames
- Virtual Scroll: Efficiently collect data from infinite-scroll or virtualized lists
- Advanced Interactions: Support for hover, drag & drop, and multi-select actions
- File Uploads: Native support for uploading files to input elements
- Parallel Execution: Run multiple scraping tasks concurrently with `ParallelTemplate` and `ParameterizedTemplate`
- Advanced Data Flows: Read and write data from JSON, CSV, Excel, and text files directly in the flow
- Custom Callbacks: Extend functionality with custom Python closure functions for actions and file formats
Installation
# Using pip
pip install stepwright
# Using pip with development dependencies
pip install stepwright[dev]
# From source
git clone https://github.com/lablnet/stepwright.git
cd stepwright
pip install -e .
Quick Start
Basic Usage
import asyncio
from stepwright import run_scraper, TabTemplate, BaseStep
async def main():
templates = [
TabTemplate(
tab="example",
steps=[
BaseStep(
id="navigate",
action="navigate",
value="https://example.com"
),
BaseStep(
id="get_title",
action="data",
object_type="tag",
object="h1",
key="title",
data_type="text"
)
]
)
]
results = await run_scraper(templates)
print(results)
if __name__ == "__main__":
asyncio.run(main())
API Reference
Core Functions
run_scraper(templates, options=None)
Main function to execute scraping templates.
Parameters:
- `templates`: List of `TabTemplate` objects
- `options`: Optional `RunOptions` object
Returns: List[Dict[str, Any]]
results = await run_scraper(templates, RunOptions(
browser={"headless": True}
))
run_scraper_with_callback(templates, on_result, options=None)
Execute scraping with streaming results via callback.
Parameters:
- `templates`: List of `TabTemplate` objects
- `on_result`: Callback function for each result (can be sync or async)
- `options`: Optional `RunOptions` object
async def process_result(result, index):
print(f"Result {index}: {result}")
await run_scraper_with_callback(templates, process_result)
Types
TabTemplate
@dataclass
class TabTemplate:
tab: str
initSteps: Optional[List[BaseStep]] = None # Steps executed once before pagination
perPageSteps: Optional[List[BaseStep]] = None # Steps executed for each page
steps: Optional[List[BaseStep]] = None # Single steps array
pagination: Optional[PaginationConfig] = None
BaseStep
@dataclass
class BaseStep:
id: str
description: Optional[str] = None
object_type: Optional[SelectorType] = None # 'id' | 'class' | 'tag' | 'xpath'
object: Optional[str] = None
action: Literal[
"navigate", "input", "click", "data", "scroll",
"eventBaseDownload", "foreach", "open", "savePDF",
"printToPDF", "downloadPDF", "downloadFile",
"reload", "getUrl", "getTitle", "getMeta", "getCookies",
"setCookies", "getLocalStorage", "setLocalStorage",
"getSessionStorage", "setSessionStorage", "getViewportSize",
"setViewportSize", "screenshot", "waitForSelector", "evaluate",
"hover", "select", "dragAndDrop", "uploadFile", "virtualScroll"
] = "navigate"
value: Optional[str] = None
key: Optional[str] = None
index_key: Optional[str] = None # Custom index placeholder char (default: 'i')
data_type: Optional[DataType] = None # 'text' | 'html' | 'value' | 'default' | 'attribute'
wait: Optional[int] = None
terminateonerror: Optional[bool] = None
subSteps: Optional[List["BaseStep"]] = None
autoScroll: Optional[bool] = None
# IFrame scoping
frameSelector: Optional[str] = None
frameSelectorType: Optional[SelectorType] = None
# Virtual Scroll settings
virtualScrollOffset: Optional[int] = None
virtualScrollDelay: Optional[int] = None
virtualScrollUniqueKey: Optional[str] = None
virtualScrollLimit: Optional[int] = None
virtualScrollContainer: Optional[str] = None
virtualScrollContainerType: Optional[SelectorType] = None
# Retry configuration
retry: Optional[int] = None # Number of retries on failure (default: 0)
retryDelay: Optional[int] = None # Delay between retries in ms (default: 1000)
# Conditional execution
skipIf: Optional[str] = None # JavaScript expression - skip step if true
onlyIf: Optional[str] = None # JavaScript expression - execute only if true
# Element waiting and state
waitForSelector: Optional[str] = None # Wait for selector before action
waitForSelectorTimeout: Optional[int] = None # Timeout for waitForSelector in ms (default: 30000)
waitForSelectorState: Optional[Literal["visible", "hidden", "attached", "detached"]] = None
# Multiple selector fallbacks
fallbackSelectors: Optional[List[Dict[str, str]]] = None # List of {object_type, object}
# Click enhancements
clickModifiers: Optional[List[ClickModifier]] = None # ['Control', 'Meta', 'Shift', 'Alt']
doubleClick: Optional[bool] = None # Perform double click
forceClick: Optional[bool] = None # Force click even if not visible/actionable
rightClick: Optional[bool] = None # Perform right click
# Input enhancements
clearBeforeInput: Optional[bool] = None # Clear input before typing (default: True)
inputDelay: Optional[int] = None # Delay between keystrokes in ms
# Data extraction enhancements
required: Optional[bool] = None # Raise error if extraction returns None/empty
defaultValue: Optional[str] = None # Default value if extraction fails
regex: Optional[str] = None # Regex pattern to extract from data
regexGroup: Optional[int] = None # Regex group to extract (default: 0)
transform: Optional[str] = None # JavaScript expression to transform data
# Timeout configuration
timeout: Optional[int] = None # Step-specific timeout in ms
# Navigation enhancements
waitUntil: Optional[Literal["load", "domcontentloaded", "networkidle", "commit"]] = None
# Human-like behavior
randomDelay: Optional[Dict[str, int]] = None # {min: ms, max: ms} for random delay
# Element state checks
requireVisible: Optional[bool] = None # Require element visible (default: True for click)
requireEnabled: Optional[bool] = None # Require element enabled
# Skip/continue logic
skipOnError: Optional[bool] = None # Skip step if error occurs (default: False)
continueOnEmpty: Optional[bool] = None # Continue if element not found (default: True)
# Drag and Drop settings
targetObject: Optional[str] = None
targetObjectType: Optional[SelectorType] = None
RunOptions
@dataclass
class RunOptions:
browser: Optional[dict] = None # Playwright launch options
onResult: Optional[Callable] = None
Step Actions
IFrame Support
All actions can be scoped within IFrames by providing frameSelector and frameSelectorType.
BaseStep(
id="iframe_action",
action="click",
frameSelector="my-iframe-id",
frameSelectorType="id",
object_type="tag",
object="button"
)
Navigate
Navigate to a URL.
BaseStep(
id="go_to_page",
action="navigate",
value="https://example.com"
)
Input
Fill form fields.
BaseStep(
id="search",
action="input",
object_type="id",
object="search-box",
value="search term"
)
Click
Click on elements.
BaseStep(
id="submit",
action="click",
object_type="class",
object="submit-button"
)
Data Extraction
Extract data from elements.
BaseStep(
id="get_title",
action="data",
object_type="tag",
object="h1",
key="title",
data_type="text"
)
### ForEach Loops
Process multiple elements, including nested loops:
```python
BaseStep(
    id="process_categories",
    action="foreach",
    object_type="class",
    object="category",
    index_key="i",  # Default index char
    subSteps=[
        BaseStep(
            id="get_category_name",
            action="data",
            object_type="tag",
            object="h1",
            key="category",
            data_type="text"
        ),
        BaseStep(
            id="process_sub_items",
            action="foreach",
            object_type="xpath",
            # Use index placeholder in nested selector
            object="(//div[@class='category'])[{{i_plus1}}]//li[@class='item']",
            index_key="j",  # Custom index for nested loop
            key="products",  # Results collected into an array under this key
            subSteps=[
                BaseStep(
                    id="get_item_name",
                    action="data",
                    object_type="tag",
                    object="span",
                    key="name",
                    data_type="text"
                )
            ]
        )
    ]
)
```
#### Context Merging in Nested Loops
StepWright automatically handles context merging in nested loops.
- If you **do not** specify a `key` for an inner loop, the result will be flattened, and all parent data (like `category`) will be merged into every child record.
- If you **do** specify a `key` (e.g., `key="products"`), the items will be collected into a structured array, keeping your data hierarchical.
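The two merging rules can be sketched in plain Python. This is an illustrative model of the described behavior, not StepWright's internal code; the names `merge_results`, `parent`, and `children` are hypothetical.

```python
# Hypothetical sketch of the context-merging rules for nested loops.
def merge_results(parent, children, key=None):
    """Combine a parent record with child records from an inner loop."""
    if key is None:
        # Flatten: parent fields (e.g. 'category') repeat on every child.
        return [{**parent, **child} for child in children]
    # Hierarchical: one parent record with a nested array under `key`.
    return [{**parent, key: children}]

parent = {"category": "Phones"}
children = [{"name": "A1"}, {"name": "A2"}]

flat = merge_results(parent, children)                 # flattened records
nested = merge_results(parent, children, "products")   # one nested record
```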
### Virtual Scroll
Extract results from infinite-scroll or virtualized lists.
```python
BaseStep(
id="collect_virtual_items",
action="virtualScroll",
object_type="class",
object="list-item",
virtualScrollUniqueKey="id", # Field to use for deduplication
virtualScrollLimit=100, # Max items to collect
virtualScrollOffset=500, # Scroll increment in pixels
virtualScrollDelay=1000, # Delay after each scroll in ms
virtualScrollContainer="container", # Optional: element to scroll
virtualScrollContainerType="id", # Optional: container selector type
key="items",
subSteps=[
BaseStep(id="name", action="data", object_type="tag", object="h3", key="name")
]
)
```
File Operations
Event-Based Download
BaseStep(
id="download_file",
action="eventBaseDownload",
object_type="class",
object="download-link",
value="./downloads/file.pdf",
key="downloaded_file"
)
Download PDF/File
BaseStep(
id="download_pdf",
action="downloadPDF",
object_type="class",
object="pdf-link",
value="./output/document.pdf",
key="pdf_file"
)
Save PDF
BaseStep(
id="save_pdf",
action="savePDF",
value="./output/page.pdf",
key="pdf_file"
)
Pagination
Next Button Pagination
PaginationConfig(
strategy="next",
nextButton=NextButtonConfig(
object_type="class",
object="next-page",
wait=2000
),
maxPages=10
)
Scroll Pagination
PaginationConfig(
strategy="scroll",
scroll=ScrollConfig(
offset=800,
delay=1500
),
maxPages=5
)
Pagination Strategies
paginationFirst
Paginate first, then collect data from each page:
TabTemplate(
tab="news",
initSteps=[...],
perPageSteps=[...], # Collect data from each page
pagination=PaginationConfig(
strategy="next",
nextButton=NextButtonConfig(...),
paginationFirst=True # Go to next page before collecting
)
)
paginateAllFirst
Paginate through all pages first, then collect all data at once:
TabTemplate(
tab="articles",
initSteps=[...],
perPageSteps=[...], # Collect all data after all pagination
pagination=PaginationConfig(
strategy="next",
nextButton=NextButtonConfig(...),
paginateAllFirst=True # Load all pages first
)
)
Advanced Features
Proxy Support
from stepwright import run_scraper, RunOptions
results = await run_scraper(templates, RunOptions(
browser={
"proxy": {
"server": "http://proxy-server:8080",
"username": "user",
"password": "pass"
}
}
))
Custom Browser Options
results = await run_scraper(templates, RunOptions(
browser={
"headless": False,
"slow_mo": 1000,
"args": ["--no-sandbox", "--disable-setuid-sandbox"]
}
))
Streaming Results
async def process_result(result, index):
print(f"Result {index}: {result}")
# Process result immediately (e.g., save to database)
await save_to_database(result)
await run_scraper_with_callback(
templates,
process_result,
RunOptions(browser={"headless": True})
)
Data Placeholders
Use collected data in subsequent steps:
BaseStep(
id="get_title",
action="data",
object_type="id",
object="page-title",
key="page_title",
data_type="text"
),
BaseStep(
id="save_with_title",
action="savePDF",
value="./output/{{page_title}}.pdf", # Uses collected page_title
key="pdf_file"
)
Loop indices are also available as placeholders:
```python
BaseStep(
    id="process_categories",
    action="foreach",
    object_type="class",
    object="category",
    index_key="i",  # You can define custom index letters
    subSteps=[
        BaseStep(
            id="process_items",
            action="foreach",
            object_type="xpath",
            # Use outer loop index 'i' in inner selector
            object="(//div[@class='category'])[{{i_plus1}}]//li",
            index_key="j",  # Inner loop uses 'j'
            subSteps=[
                BaseStep(
                    id="save_item",
                    action="savePDF",
                    # Access inner index 'j' and outer index 'i'
                    value="./output/cat_{{i}}/item_{{j}}.pdf"
                )
            ]
        )
    ]
)
```
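The `{{key}}` substitution used above can be sketched with a small helper. This is an assumption about the mechanism, not the library's actual `replace_data_placeholders` implementation; derived tokens like `{{i_plus1}}` would need extra handling beyond this sketch.

```python
import re

def fill_placeholders(template: str, context: dict) -> str:
    """Replace {{key}} tokens with values from collected data or loop indices."""
    def sub(match):
        key = match.group(1)
        # Leave unknown placeholders untouched rather than erasing them.
        return str(context.get(key, match.group(0)))
    return re.sub(r"\{\{(\w+)\}\}", sub, template)

path = fill_placeholders("./output/cat_{{i}}/item_{{j}}.pdf", {"i": 0, "j": 2})
```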
## Error Handling
Steps can be configured to terminate on error:
```python
BaseStep(
id="critical_step",
action="click",
object_type="id",
object="important-button",
terminateonerror=True # Stop execution if this fails
)
```
Without `terminateonerror=True`, errors are logged but execution continues.
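The terminate-vs-continue semantics can be sketched as a simple executor loop. This is an illustrative model under the stated assumptions (a `run_steps` helper over dict-shaped steps), not StepWright's actual executor.

```python
# Hypothetical sketch of terminate-on-error vs. log-and-continue.
def run_steps(steps):
    results = []
    for step in steps:
        try:
            results.append(step["run"]())
        except Exception as exc:
            if step.get("terminateonerror"):
                raise  # stop the whole run
            print(f"step {step['id']} failed: {exc}")  # log and continue
    return results

ok = {"id": "a", "run": lambda: "done"}
bad = {"id": "b", "run": lambda: 1 / 0}  # fails, but run continues
```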
Advanced Step Options
Retry Logic
Automatically retry failed steps with configurable delays:
BaseStep(
id="click_button",
action="click",
object_type="id",
object="flaky-button",
retry=3, # Retry up to 3 times
retryDelay=1000 # Wait 1 second between retries
)
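The `retry`/`retryDelay` semantics amount to a bounded retry loop with a sleep between attempts. A minimal sketch, assuming a generic `with_retry` helper (hypothetical, not the library's code):

```python
import time

def with_retry(action, retry=0, retry_delay_ms=1000):
    """Run action, retrying up to `retry` extra times on failure."""
    attempts = retry + 1
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: propagate the last error
            time.sleep(retry_delay_ms / 1000)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("not yet")
    return "clicked"

assert with_retry(flaky, retry=3, retry_delay_ms=1) == "clicked"
```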
Conditional Execution
Execute or skip steps based on JavaScript conditions:
# Skip step if condition is true
BaseStep(
id="optional_click",
action="click",
object_type="id",
object="optional-button",
skipIf="document.querySelector('.modal').classList.contains('hidden')"
)
# Execute only if condition is true
BaseStep(
id="conditional_data",
action="data",
object_type="id",
object="dynamic-content",
key="content",
onlyIf="document.querySelector('#dynamic-content') !== null"
)
Wait for Selector
Wait for elements to appear before performing actions:
BaseStep(
id="click_after_load",
action="click",
object_type="id",
object="target-button",
waitForSelector="#loading-indicator", # Wait for this selector
waitForSelectorTimeout=5000, # Timeout: 5 seconds
waitForSelectorState="hidden" # Wait until hidden
)
Fallback Selectors
Provide multiple selector options for increased robustness:
BaseStep(
id="click_with_fallback",
action="click",
object_type="id",
object="primary-button", # Try this first
fallbackSelectors=[
{"object_type": "class", "object": "btn-primary"},
{"object_type": "class", "object": "submit-btn"},
{"object_type": "xpath", "object": "//button[contains(text(), 'Submit')]"}
]
)
Click Enhancements
Advanced click options for different interaction types:
# Double click
BaseStep(
id="double_click",
action="click",
object_type="id",
object="item",
doubleClick=True
)
# Right click (context menu)
BaseStep(
id="right_click",
action="click",
object_type="id",
object="context-menu-trigger",
rightClick=True
)
# Click with modifier keys (Ctrl/Cmd+Click)
BaseStep(
id="multi_select",
action="click",
object_type="class",
object="item",
clickModifiers=["Control"] # or ["Meta"] for Mac
)
# Force click (click hidden elements)
BaseStep(
id="force_click",
action="click",
object_type="id",
object="hidden-button",
forceClick=True
)
Input Enhancements
More control over input behavior:
# Clear input before typing (default: True)
BaseStep(
id="clear_and_input",
action="input",
object_type="id",
object="search-box",
value="new search term",
clearBeforeInput=True # Clear existing value first
)
# Human-like typing with delays
BaseStep(
id="human_like_input",
action="input",
object_type="id",
object="form-field",
value="slowly typed text",
inputDelay=100 # 100ms delay between each character
)
Data Extraction Enhancements
Advanced data extraction and transformation options:
# Extract with regex
BaseStep(
id="extract_price",
action="data",
object_type="id",
object="price",
key="price",
regex=r"\$(\d+\.\d+)", # Extract dollar amount
regexGroup=1 # Get first capture group
)
# Transform extracted data with JavaScript
BaseStep(
id="transform_data",
action="data",
object_type="id",
object="raw-data",
key="processed",
transform="value.toUpperCase().trim()" # JavaScript transformation
)
# Required field with default value
BaseStep(
id="get_required_data",
action="data",
object_type="id",
object="important-field",
key="important",
required=True, # Raise error if not found
defaultValue="N/A" # Use if extraction fails
)
# Continue even if element not found
BaseStep(
id="optional_data",
action="data",
object_type="id",
object="optional-content",
key="optional",
continueOnEmpty=True # Don't raise error if not found
)
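The extraction pipeline above (`regex`, `regexGroup`, `defaultValue`) can be sketched as a small post-processing function. This is an assumption about the order of operations, not the library's actual data handler:

```python
import re

def postprocess(raw, regex=None, regex_group=0, default=None):
    """Apply regex extraction to a raw value, falling back to a default."""
    if raw is None:
        return default  # extraction failed entirely
    if regex:
        match = re.search(regex, raw)
        if not match:
            return default  # pattern not found
        return match.group(regex_group)
    return raw

price = postprocess("Price: $19.99", regex=r"\$(\d+\.\d+)", regex_group=1)
```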
Element State Checks
Validate element state before actions:
BaseStep(
id="click_visible",
action="click",
object_type="id",
object="button",
requireVisible=True, # Ensure element is visible
requireEnabled=True # Ensure element is enabled
)
Random Delays
Add human-like random delays to actions:
BaseStep(
id="human_like_action",
action="click",
object_type="id",
object="button",
randomDelay={"min": 500, "max": 2000} # Random delay between 500-2000ms
)
Skip on Error
Skip steps that fail instead of stopping execution:
BaseStep(
id="optional_step",
action="click",
object_type="id",
object="optional-button",
skipOnError=True # Continue even if this step fails
)
Page Actions
Reload Page
Reload the current page with optional wait conditions:
BaseStep(
id="reload",
action="reload",
waitUntil="networkidle" # Wait for network to be idle
)
Get Current URL
BaseStep(
id="get_url",
action="getUrl",
key="current_url" # Store in collector
)
Get Page Title
BaseStep(
id="get_title",
action="getTitle",
key="page_title"
)
Get Meta Tags
# Get specific meta tag
BaseStep(
id="get_description",
action="getMeta",
object="description", # Meta name or property
key="meta_description"
)
# Get all meta tags
BaseStep(
id="get_all_meta",
action="getMeta",
key="all_meta_tags" # Returns dictionary of all meta tags
)
Cookies Management
# Get all cookies
BaseStep(
id="get_cookies",
action="getCookies",
key="cookies"
)
# Get specific cookie
BaseStep(
id="get_session_cookie",
action="getCookies",
object="session_id",
key="session"
)
# Set cookie
BaseStep(
id="set_cookie",
action="setCookies",
object="preference",
value="dark_mode"
)
LocalStorage & SessionStorage
# Get localStorage value
BaseStep(
id="get_storage",
action="getLocalStorage",
object="user_preference",
key="preference"
)
# Set localStorage value
BaseStep(
id="set_storage",
action="setLocalStorage",
object="theme",
value="dark"
)
# Get all localStorage items
BaseStep(
id="get_all_storage",
action="getLocalStorage",
key="all_storage"
)
# SessionStorage (same pattern)
BaseStep(
id="get_session",
action="getSessionStorage",
object="temp_data",
key="data"
)
Viewport Operations
# Get viewport size
BaseStep(
id="get_viewport",
action="getViewportSize",
key="viewport"
)
# Set viewport size
BaseStep(
id="set_viewport",
action="setViewportSize",
value="1920x1080" # or "1920,1080" or "1920 1080"
)
Screenshot
# Full page screenshot
BaseStep(
id="screenshot",
action="screenshot",
value="./screenshots/page.png",
data_type="full" # Full page, omit for viewport only
)
# Element screenshot
BaseStep(
id="element_screenshot",
action="screenshot",
object_type="id",
object="content-area",
value="./screenshots/element.png",
key="screenshot_path"
)
Wait for Selector
Explicit wait for element state:
BaseStep(
id="wait_for_element",
action="waitForSelector",
object_type="id",
object="dynamic-content",
value="visible", # visible, hidden, attached, detached
wait=5000, # Timeout in ms
key="wait_result" # Stores True/False
)
Evaluate JavaScript
Execute custom JavaScript:
BaseStep(
id="custom_js",
action="evaluate",
value="() => document.querySelector('.counter').textContent",
key="counter_value"
)
Advanced UI Interactions
Hover
BaseStep(
id="hover_menu",
action="hover",
object_type="id",
object="menu-trigger"
)
Select (Standard & Multi-select)
BaseStep(
id="select_colors",
action="select",
object_type="id",
object="color-picker",
value="red,green,blue" # Comma separated for multi-select
)
Drag and Drop
BaseStep(
id="drag_item",
action="dragAndDrop",
object_type="id",
object="source-item",
targetObject="drop-zone",
targetObjectType="id"
)
File Upload
BaseStep(
id="upload_cv",
action="uploadFile",
object_type="id",
object="file-input",
value="/path/to/my-cv.pdf"
)
Complete Example
import asyncio
from pathlib import Path
from stepwright import (
run_scraper,
TabTemplate,
BaseStep,
PaginationConfig,
NextButtonConfig,
RunOptions
)
async def main():
templates = [
TabTemplate(
tab="news_scraper",
initSteps=[
BaseStep(
id="navigate",
action="navigate",
value="https://news-site.com"
),
BaseStep(
id="search",
action="input",
object_type="id",
object="search-box",
value="technology"
)
],
perPageSteps=[
BaseStep(
id="collect_articles",
action="foreach",
object_type="class",
object="article",
subSteps=[
BaseStep(
id="get_title",
action="data",
object_type="tag",
object="h2",
key="title",
data_type="text"
),
BaseStep(
id="get_content",
action="data",
object_type="tag",
object="p",
key="content",
data_type="text"
),
BaseStep(
id="get_link",
action="data",
object_type="tag",
object="a",
key="link",
data_type="value"
)
]
)
],
pagination=PaginationConfig(
strategy="next",
nextButton=NextButtonConfig(
object_type="id",
object="next-page",
wait=2000
),
maxPages=5
)
)
]
# Run scraper
results = await run_scraper(templates, RunOptions(
browser={"headless": True}
))
# Process results
for i, article in enumerate(results):
print(f"\nArticle {i + 1}:")
print(f"Title: {article.get('title')}")
print(f"Content: {(article.get('content') or '')[:100]}...")
print(f"Link: {article.get('link')}")
if __name__ == "__main__":
asyncio.run(main())
Advanced Features
Parallelism and Concurrency
StepWright allows you to run multiple templates concurrently to significantly speed up your scraping tasks.
ParameterizedTemplate
Run the same scraping logic for multiple parameters (e.g., keywords) simultaneously.
import asyncio
from stepwright import run_scraper, TabTemplate, BaseStep, ParameterizedTemplate
async def main():
# Define a base template with a placeholder
search_base = TabTemplate(
tab="search_{{keyword}}",
steps=[
BaseStep(id="nav", action="navigate", value="https://example.com/search?q={{keyword}}"),
BaseStep(id="data", action="data", object="h1", key="title")
]
)
# Run for multiple keywords in parallel
task = ParameterizedTemplate(
template=search_base,
parameter_key="keyword",
values=["laptop", "phone", "monitor"]
)
results = await run_scraper([task])
print(f"Scraped {len(results)} items concurrently.")
if __name__ == "__main__":
asyncio.run(main())
Advanced Data Flows
You can read input data from files and write results directly to various formats (JSON, CSV, Excel, Text).
steps = [
# Read search terms from a JSON file
BaseStep(id="load", action="readData", value="keywords.json", data_type="json", key="queue"),
# Loop over the loaded list
BaseStep(id="loop", action="foreach", value="{{queue}}", subSteps=[
BaseStep(id="nav", action="navigate", value="https://example.com/item/{{item}}"),
BaseStep(id="extract", action="data", object=".name", key="name")
], key="results"),
# Write all results to a CSV
BaseStep(id="save", action="writeData", value="results.csv", data_type="csv", key="results")
]
Custom Callbacks
Extend StepWright by providing your own Python logic for actions or file parsing.
Custom Action Callback
Perform complex interactions or calculations directly with the Playwright page and collector.
def my_custom_logic(page, collector, step):
# Calculate something or modify collector
current_url = page.url
return f"Processed {current_url} at step {step.id}"
step = BaseStep(
id="custom-hook",
action="custom",
callback=my_custom_logic,
key="status_msg"
)
Custom File Format Callback
Handle proprietary or unsupported file formats by providing a custom reader/writer.
def my_xml_reader(path, step):
import xml.etree.ElementTree as ET
tree = ET.parse(path)
return [el.text for el in tree.findall('.//item')]
step = BaseStep(
id="load-xml",
action="readData",
value="data.xml",
data_type="custom",
callback=my_xml_reader,
key="items"
)
Development
Setup
# Clone repository
git clone https://github.com/lablnet/stepwright.git
cd stepwright
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Install Playwright browsers
playwright install chromium
Running Tests
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_scraper.py
# Run specific test class
pytest tests/test_scraper.py::TestGetBrowser
# Run specific test
pytest tests/test_scraper.py::TestGetBrowser::test_create_browser_instance
# Run with coverage
pytest --cov=src --cov-report=html
# Run integration tests only
pytest tests/test_integration.py
Project Structure
stepwright/
├── src/
│   ├── __init__.py
│   ├── step_types.py              # Type definitions and dataclasses
│   ├── helpers.py                 # Utility functions
│   ├── executor.py                # Core step execution logic
│   ├── parser.py                  # Public API (run_scraper)
│   ├── scraper.py                 # Low-level browser automation
│   ├── handlers/                  # Action-specific handlers
│   │   ├── __init__.py
│   │   ├── data_handlers.py       # Data extraction handlers
│   │   ├── file_handlers.py       # File download/PDF handlers
│   │   ├── loop_handlers.py       # Foreach/open handlers
│   │   ├── page_actions.py        # Page-related actions (reload, getUrl, etc.)
│   │   ├── interaction_handlers.py  # Hover, select, drag&drop, virtual scroll
│   │   └── data_flow_handlers.py  # NEW: File I/O and custom callbacks
│   └── scraper_parser.py          # Backward compatibility
├── tests/
│   ├── __init__.py
│   ├── conftest.py                # Pytest configuration
│   ├── parallel_demo.html         # Fixture for parallel/flow testing
│   ├── test_parallel_flows.py     # NEW: Parallelism and data flow tests
│   ├── test_page.html             # Test HTML page
│   ├── test_page_enhanced.html    # Enhanced test page for expansion
│   ├── advanced_demo.html         # Demo page for advanced interactions
│   ├── test_scraper.py            # Core scraper tests
│   ├── test_parser.py             # Parser function tests
│   ├── test_nested_loops.py       # Tests for nested foreach loops
│   ├── test_advanced_features.py  # Tests for hover, drag&drop, vscroll, iframe
│   └── test_integration.py        # Integration tests
├── pyproject.toml                 # Package configuration
├── setup.py                       # Setup script
├── pytest.ini                     # Pytest configuration
├── README.md                      # This file
└── README_TESTS.md                # Detailed test documentation
Code Quality
# Format code with black
black src/ tests/
# Lint with flake8
flake8 src/ tests/
# Type checking with mypy
mypy src/
Module Organization
The codebase follows separation of concerns:
- step_types.py: All type definitions (BaseStep, TabTemplate, etc.)
- helpers.py: Utility functions (placeholder replacement, locator creation, condition evaluation)
- executor.py: Core execution logic (execute steps, handle pagination, retry logic)
- parser.py: Public API (run_scraper, run_scraper_with_callback)
- scraper.py: Low-level Playwright wrapper (navigate, click, get_data)
- handlers/: Action-specific handlers organized by functionality
- data_handlers.py: Data extraction logic with transformations
- file_handlers.py: File download and PDF operations
- loop_handlers.py: Foreach loops and new tab/window handling
- page_actions.py: Page-related actions (reload, getUrl, cookies, storage, etc.)
- interaction_handlers.py: Hover, select, drag and drop, file upload, and virtual scroll logic
- scraper_parser.py: Backward compatibility wrapper
You can import from the main module or specific submodules:
# From main module (recommended)
from stepwright import run_scraper, TabTemplate, BaseStep
# From specific modules
from stepwright.step_types import TabTemplate, BaseStep
from stepwright.parser import run_scraper
from stepwright.helpers import replace_data_placeholders
Testing
See README_TESTS.md for detailed testing documentation.
Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Add tests for new functionality
- Ensure all tests pass (`pytest`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
- Documentation: README.md and README_TESTS.md
- Discussions: GitHub Discussions
Acknowledgments
- Built with Playwright
- Inspired by declarative web scraping patterns
- Original TypeScript version: framework-Island/stepwright
Author
Muhammad Umer Farooq (@lablnet)