StepWright
A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.
Features
- Declarative Scraping: Define scraping workflows using Python dictionaries or dataclasses
- Pagination Support: Built-in support for next-button and scroll-based pagination
- Data Collection: Extract text, HTML, values, and files from web pages
- Multi-tab Support: Handle multiple tabs and complex navigation flows
- PDF Generation: Save pages as PDFs or trigger print-to-PDF actions
- File Downloads: Download files with automatic directory creation
- Looping & Iteration: ForEach loops for processing multiple elements
- Streaming Results: Real-time result processing with callbacks
- Error Handling: Graceful error handling with configurable termination
- Flexible Selectors: Support for ID, class, tag, and XPath selectors
Installation
# Using pip
pip install stepwright
# Using pip with development dependencies
pip install stepwright[dev]
# From source
git clone https://github.com/lablnet/stepwright.git
cd stepwright
pip install -e .
Quick Start
Basic Usage
import asyncio
from stepwright import run_scraper, TabTemplate, BaseStep
async def main():
templates = [
TabTemplate(
tab="example",
steps=[
BaseStep(
id="navigate",
action="navigate",
value="https://example.com"
),
BaseStep(
id="get_title",
action="data",
object_type="tag",
object="h1",
key="title",
data_type="text"
)
]
)
]
results = await run_scraper(templates)
print(results)
if __name__ == "__main__":
asyncio.run(main())
API Reference
Core Functions
run_scraper(templates, options=None)
Main function to execute scraping templates.
Parameters:
- templates: List of TabTemplate objects
- options: Optional RunOptions object
Returns: List[Dict[str, Any]]
results = await run_scraper(templates, RunOptions(
browser={"headless": True}
))
run_scraper_with_callback(templates, on_result, options=None)
Execute scraping with streaming results via callback.
Parameters:
- templates: List of TabTemplate objects
- on_result: Callback function for each result (can be sync or async)
- options: Optional RunOptions object
async def process_result(result, index):
print(f"Result {index}: {result}")
await run_scraper_with_callback(templates, process_result)
Types
TabTemplate
@dataclass
class TabTemplate:
tab: str
initSteps: Optional[List[BaseStep]] = None # Steps executed once before pagination
perPageSteps: Optional[List[BaseStep]] = None # Steps executed for each page
steps: Optional[List[BaseStep]] = None # Single steps array
pagination: Optional[PaginationConfig] = None
BaseStep
@dataclass
class BaseStep:
id: str
description: Optional[str] = None
object_type: Optional[SelectorType] = None # 'id' | 'class' | 'tag' | 'xpath'
object: Optional[str] = None
action: Literal[
"navigate", "input", "click", "data", "scroll",
"eventBaseDownload", "foreach", "open", "savePDF",
"printToPDF", "downloadPDF", "downloadFile"
] = "navigate"
value: Optional[str] = None
key: Optional[str] = None
data_type: Optional[DataType] = None # 'text' | 'html' | 'value' | 'default' | 'attribute'
wait: Optional[int] = None
terminateonerror: Optional[bool] = None
subSteps: Optional[List["BaseStep"]] = None
autoScroll: Optional[bool] = None
RunOptions
@dataclass
class RunOptions:
browser: Optional[dict] = None # Playwright launch options
onResult: Optional[Callable] = None
Step Actions
Navigate
Navigate to a URL.
BaseStep(
id="go_to_page",
action="navigate",
value="https://example.com"
)
Input
Fill form fields.
BaseStep(
id="search",
action="input",
object_type="id",
object="search-box",
value="search term"
)
Click
Click on elements.
BaseStep(
id="submit",
action="click",
object_type="class",
object="submit-button"
)
Data Extraction
Extract data from elements.
BaseStep(
id="get_title",
action="data",
object_type="tag",
object="h1",
key="title",
data_type="text"
)
ForEach Loop
Process multiple elements.
BaseStep(
id="process_items",
action="foreach",
object_type="class",
object="item",
subSteps=[
BaseStep(
id="get_item_title",
action="data",
object_type="tag",
object="h2",
key="title",
data_type="text"
)
]
)
File Operations
Event-Based Download
BaseStep(
id="download_file",
action="eventBaseDownload",
object_type="class",
object="download-link",
value="./downloads/file.pdf",
key="downloaded_file"
)
Download PDF/File
BaseStep(
id="download_pdf",
action="downloadPDF",
object_type="class",
object="pdf-link",
value="./output/document.pdf",
key="pdf_file"
)
Save PDF
BaseStep(
id="save_pdf",
action="savePDF",
value="./output/page.pdf",
key="pdf_file"
)
Pagination
Next Button Pagination
PaginationConfig(
strategy="next",
nextButton=NextButtonConfig(
object_type="class",
object="next-page",
wait=2000
),
maxPages=10
)
Scroll Pagination
PaginationConfig(
strategy="scroll",
scroll=ScrollConfig(
offset=800,
delay=1500
),
maxPages=5
)
Pagination Strategies
paginationFirst
Paginate first, then collect data from each page:
TabTemplate(
tab="news",
initSteps=[...],
perPageSteps=[...], # Collect data from each page
pagination=PaginationConfig(
strategy="next",
nextButton=NextButtonConfig(...),
paginationFirst=True # Go to next page before collecting
)
)
paginateAllFirst
Paginate through all pages first, then collect all data at once:
TabTemplate(
tab="articles",
initSteps=[...],
perPageSteps=[...], # Collect all data after all pagination
pagination=PaginationConfig(
strategy="next",
nextButton=NextButtonConfig(...),
paginateAllFirst=True # Load all pages first
)
)
Advanced Features
Proxy Support
from stepwright import run_scraper, RunOptions
results = await run_scraper(templates, RunOptions(
browser={
"proxy": {
"server": "http://proxy-server:8080",
"username": "user",
"password": "pass"
}
}
))
Custom Browser Options
results = await run_scraper(templates, RunOptions(
browser={
"headless": False,
"slow_mo": 1000,
"args": ["--no-sandbox", "--disable-setuid-sandbox"]
}
))
Streaming Results
async def process_result(result, index):
print(f"Result {index}: {result}")
# Process result immediately (e.g., save to database)
await save_to_database(result)
await run_scraper_with_callback(
templates,
process_result,
RunOptions(browser={"headless": True})
)
Data Placeholders
Use collected data in subsequent steps:
BaseStep(
id="get_title",
action="data",
object_type="id",
object="page-title",
key="page_title",
data_type="text"
),
BaseStep(
id="save_with_title",
action="savePDF",
value="./output/{{page_title}}.pdf", # Uses collected page_title
key="pdf_file"
)
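The substitution itself can be pictured as a simple template pass over the collected data. The sketch below is illustrative only and assumes {{key}} syntax as shown above; the library's actual helper is replace_data_placeholders in stepwright.helpers, whose implementation may differ:

```python
import re

def fill_placeholders(template: str, collected: dict) -> str:
    """Replace each {{key}} in template with the collected value.

    Illustrative sketch only; unknown keys are left untouched.
    """
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(collected.get(m.group(1), m.group(0))),
        template,
    )

path = fill_placeholders("./output/{{page_title}}.pdf", {"page_title": "Home"})
```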
Index Placeholders
Use loop index in foreach steps:
BaseStep(
id="process_items",
action="foreach",
object_type="class",
object="item",
subSteps=[
BaseStep(
id="save_item",
action="savePDF",
value="./output/item_{{i}}.pdf",  # i = 0, 1, 2, ...
# or use value="./output/item_{{i_plus1}}.pdf" for i_plus1 = 1, 2, 3, ...
)
]
)
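To make the two index forms concrete, here is a minimal plain-Python sketch of the assumed expansion semantics (based on the comments above, not the library's code):

```python
def expand_index(template: str, i: int) -> str:
    # {{i}} is the zero-based loop index; {{i_plus1}} is one-based.
    return template.replace("{{i_plus1}}", str(i + 1)).replace("{{i}}", str(i))

paths = [expand_index("./output/item_{{i_plus1}}.pdf", i) for i in range(3)]
```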
Error Handling
Steps can be configured to terminate on error:
BaseStep(
id="critical_step",
action="click",
object_type="id",
object="important-button",
terminateonerror=True # Stop execution if this fails
)
Without terminateonerror=True, errors are logged but execution continues.
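The continue-versus-terminate behavior can be pictured with a plain-Python sketch. This is illustrative only (the real executor lives in stepwright's executor module); the step dicts and run callables here are hypothetical stand-ins:

```python
def run_steps(steps):
    """Run step callables in order; log failures, stop only on flagged ones."""
    errors = []
    for step in steps:
        try:
            step["run"]()
        except Exception as exc:
            errors.append((step["id"], exc))
            if step.get("terminateonerror"):
                break  # critical step failed: abort remaining steps
    return errors

def boom():
    raise RuntimeError("element not found")

ran = []
errors = run_steps([
    {"id": "optional", "run": boom},                            # logged, run continues
    {"id": "critical", "run": boom, "terminateonerror": True},  # run stops here
    {"id": "never_reached", "run": lambda: ran.append("x")},
])
```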
Complete Example
import asyncio
from pathlib import Path
from stepwright import (
run_scraper,
TabTemplate,
BaseStep,
PaginationConfig,
NextButtonConfig,
RunOptions
)
async def main():
templates = [
TabTemplate(
tab="news_scraper",
initSteps=[
BaseStep(
id="navigate",
action="navigate",
value="https://news-site.com"
),
BaseStep(
id="search",
action="input",
object_type="id",
object="search-box",
value="technology"
)
],
perPageSteps=[
BaseStep(
id="collect_articles",
action="foreach",
object_type="class",
object="article",
subSteps=[
BaseStep(
id="get_title",
action="data",
object_type="tag",
object="h2",
key="title",
data_type="text"
),
BaseStep(
id="get_content",
action="data",
object_type="tag",
object="p",
key="content",
data_type="text"
),
BaseStep(
id="get_link",
action="data",
object_type="tag",
object="a",
key="link",
data_type="value"
)
]
)
],
pagination=PaginationConfig(
strategy="next",
nextButton=NextButtonConfig(
object_type="id",
object="next-page",
wait=2000
),
maxPages=5
)
)
]
# Run scraper
results = await run_scraper(templates, RunOptions(
browser={"headless": True}
))
# Process results
for i, article in enumerate(results):
print(f"\nArticle {i + 1}:")
print(f"Title: {article.get('title')}")
print(f"Content: {article.get('content')[:100]}...")
print(f"Link: {article.get('link')}")
if __name__ == "__main__":
asyncio.run(main())
Development
Setup
# Clone repository
git clone https://github.com/lablnet/stepwright.git
cd stepwright
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Install Playwright browsers
playwright install chromium
Running Tests
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_scraper.py
# Run specific test class
pytest tests/test_scraper.py::TestGetBrowser
# Run specific test
pytest tests/test_scraper.py::TestGetBrowser::test_create_browser_instance
# Run with coverage
pytest --cov=src --cov-report=html
# Run integration tests only
pytest tests/test_integration.py
Project Structure
stepwright/
├── src/
│   ├── __init__.py
│   ├── step_types.py        # Type definitions and dataclasses
│   ├── helpers.py           # Utility functions
│   ├── executor.py          # Core step execution logic
│   ├── parser.py            # Public API (run_scraper)
│   ├── scraper.py           # Low-level browser automation
│   └── scraper_parser.py    # Backward compatibility
├── tests/
│   ├── __init__.py
│   ├── conftest.py          # Pytest configuration
│   ├── test_page.html       # Test HTML page
│   ├── test_scraper.py      # Core scraper tests
│   ├── test_parser.py       # Parser function tests
│   └── test_integration.py  # Integration tests
├── pyproject.toml           # Package configuration
├── setup.py                 # Setup script
├── pytest.ini               # Pytest configuration
├── README.md                # This file
└── README_TESTS.md          # Detailed test documentation
Code Quality
# Format code with black
black src/ tests/
# Lint with flake8
flake8 src/ tests/
# Type checking with mypy
mypy src/
Module Organization
The codebase follows separation of concerns:
- step_types.py: All type definitions (BaseStep, TabTemplate, etc.)
- helpers.py: Utility functions (placeholder replacement, locator creation)
- executor.py: Core execution logic (execute steps, handle pagination)
- parser.py: Public API (run_scraper, run_scraper_with_callback)
- scraper.py: Low-level Playwright wrapper (navigate, click, get_data)
- scraper_parser.py: Backward compatibility wrapper
You can import from the main module or specific submodules:
# From main module (recommended)
from stepwright import run_scraper, TabTemplate, BaseStep
# From specific modules
from stepwright.step_types import TabTemplate, BaseStep
from stepwright.parser import run_scraper
from stepwright.helpers import replace_data_placeholders
Testing
See README_TESTS.md for detailed testing documentation.
Contributing
1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Make your changes
4. Add tests for new functionality
5. Ensure all tests pass (pytest)
6. Commit your changes (git commit -m 'Add amazing feature')
7. Push to the branch (git push origin feature/amazing-feature)
8. Open a Pull Request
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
- Documentation: README.md and README_TESTS.md
- Discussions: GitHub Discussions
Acknowledgments
- Built with Playwright
- Inspired by declarative web scraping patterns
- Original TypeScript version: framework-Island/stepwright
Author
Muhammad Umer Farooq (@lablnet)