ScrapeFlow
An opinionated scraping workflow engine built on Playwright
ScrapeFlow is a production-ready Python library that transforms Playwright into a powerful, enterprise-grade web scraping framework. It handles the common challenges of web scraping: retries, rate limiting, anti-detection, error recovery, and workflow orchestration.
🚀 Features
- 🔄 Intelligent Retry Logic: Automatic retries with exponential backoff and jitter
- ⚡ Rate Limiting: Token bucket algorithm to respect server limits
- 🕵️ Anti-Detection: Stealth mode, user agent rotation, and proxy support
- 📊 Workflow Engine: Define complex scraping workflows with steps and conditions
- 📈 Monitoring & Metrics: Built-in performance monitoring and logging
- 🛠️ Data Extraction: Powerful utilities for extracting structured data
- 🔧 Error Handling: Comprehensive error classification and recovery
- 📝 Type Hints: Full type support for better IDE experience
📦 Installation
pip install scrapeflow-py
Or install from source:
git clone https://github.com/irfanalidv/ScrapeFlow.git
cd ScrapeFlow
pip install -e .
Note: After installation, install Playwright browsers:
playwright install
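If you only need the default Chromium engine, Playwright can install that browser alone:

playwright install chromium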
🎯 Quick Start
Basic Usage
import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig

async def main():
    config = ScrapeFlowConfig()
    config.browser.headless = False

    async with ScrapeFlow(config) as scraper:
        await scraper.navigate("https://example.com")
        title = await scraper.page.title()
        print(f"Page title: {title}")

asyncio.run(main())
Workflow Example
import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.extractors import Extractor

async def extract_data(page, context):
    return {
        "title": await Extractor.extract_text(page, "h1"),
        "links": await Extractor.extract_links(page, "a"),
    }

async def main():
    async with ScrapeFlow() as scraper:
        workflow = Workflow(name="my_scraper")
        workflow.add_step("navigate", lambda page, context: scraper.navigate("https://example.com"))
        workflow.add_step("extract", extract_data)

        result = await scraper.run_workflow(workflow)
        print(result.final_data)

asyncio.run(main())
📚 Documentation
Configuration
ScrapeFlow is highly configurable:
from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig,
    RetryConfig,
    BrowserConfig,
    BrowserType,
)

config = ScrapeFlowConfig(
    browser=BrowserConfig(
        browser_type=BrowserType.CHROMIUM,
        headless=True,
        timeout=30000,
    ),
    retry=RetryConfig(
        max_retries=5,
        initial_delay=1.0,
        max_delay=60.0,
        exponential_base=2.0,
    ),
    rate_limit=RateLimitConfig(
        requests_per_second=2.0,
        burst_size=5,
    ),
    anti_detection=AntiDetectionConfig(
        rotate_user_agents=True,
        stealth_mode=True,
        viewport_width=1920,
        viewport_height=1080,
    ),
    log_level="INFO",
)
Anti-Detection
ScrapeFlow includes several anti-detection features:
from scrapeflow.config import AntiDetectionConfig

# User agent rotation
anti_detection = AntiDetectionConfig(
    rotate_user_agents=True,
    user_agents=[
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        # Add your custom user agents
    ],
)

# Proxy rotation
anti_detection = AntiDetectionConfig(
    rotate_proxies=True,
    proxies=[
        {"server": "http://proxy1:8080"},
        {"server": "http://proxy2:8080"},
    ],
)

# Stealth mode (removes automation indicators)
anti_detection = AntiDetectionConfig(stealth_mode=True)
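These settings plug into the top-level ScrapeFlowConfig shown earlier. A minimal sketch combining stealth mode with proxy rotation (the proxy address is a placeholder):

from scrapeflow.config import ScrapeFlowConfig, AntiDetectionConfig

config = ScrapeFlowConfig(
    anti_detection=AntiDetectionConfig(
        stealth_mode=True,
        rotate_proxies=True,
        proxies=[{"server": "http://proxy1:8080"}],  # placeholder address
    ),
)
# Pass it to the engine as in the Quick Start:
# async with ScrapeFlow(config) as scraper: ...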
Rate Limiting
Control request frequency to avoid being blocked:
from scrapeflow.config import RateLimitConfig

rate_limit = RateLimitConfig(
    requests_per_second=1.0,   # 1 request per second
    requests_per_minute=60.0,  # or 60 per minute
    burst_size=5,              # allow bursts of 5 requests
)
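For intuition, a token bucket refills at a steady rate and allows short bursts up to its capacity. The standalone sketch below illustrates the algorithm; it is not ScrapeFlow's internal class, and all names here are illustrative:

import time

class TokenBucket:
    """Illustrative token bucket: refills at `rate` tokens/second, capped at `burst`."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self) -> None:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never beyond the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            # Bucket empty: wait until one full token has accrued
            time.sleep((1 - self.tokens) / self.rate)
            self.tokens = 1.0
        self.tokens -= 1

bucket = TokenBucket(rate=1.0, burst=5)  # mirrors the config above
for i in range(7):
    bucket.acquire()  # the first 5 pass immediately, the rest are paced at 1/s
    print("request", i)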
Retry Logic
Automatic retries with exponential backoff:
from scrapeflow.config import RetryConfig

retry = RetryConfig(
    max_retries=5,
    initial_delay=1.0,     # start with 1 second
    max_delay=60.0,        # cap at 60 seconds
    exponential_base=2.0,  # double the delay each retry
    jitter=True,           # add randomness to avoid a thundering herd
)
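With these values, the wait before retry n is roughly min(max_delay, initial_delay * exponential_base ** n), optionally multiplied by random jitter. A standalone sketch of that schedule (illustrative, not ScrapeFlow's internal implementation):

import random

def backoff_delay(attempt: int, initial: float = 1.0, base: float = 2.0,
                  cap: float = 60.0, jitter: bool = True) -> float:
    """Delay before retry number `attempt` (0-indexed)."""
    delay = min(cap, initial * base ** attempt)
    if jitter:
        delay *= random.uniform(0.5, 1.5)  # spread retries so clients don't sync up
    return delay

# Without jitter the schedule is 1s, 2s, 4s, 8s, 16s, 32s, then capped at 60s
print([backoff_delay(n, jitter=False) for n in range(7)])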
Data Extraction
ScrapeFlow provides powerful extraction utilities:
from scrapeflow.extractors import Extractor, StructuredExtractor

# Simple extraction
title = await Extractor.extract_text(page, "h1")
links = await Extractor.extract_links(page, "a")
images = await Extractor.extract_images(page, "img")

# Table extraction
table_data = await Extractor.extract_table(page, "table")

# Structured extraction with schema
schema = {
    "title": "h1",
    "description": ".description",
    "items": {
        "items": ".item",
        "schema": {
            "name": ".name",
            "price": ".price",
        },
    },
}
extractor = StructuredExtractor(schema)
data = await extractor.extract(page)
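For a page matching this schema, data should come back as a nested structure along these lines (the values are illustrative and the exact shape may vary by version):

# data ≈ {
#     "title": "Product Catalog",
#     "description": "Our featured items",
#     "items": [
#         {"name": "Widget", "price": "$9.99"},
#         {"name": "Gadget", "price": "$19.99"},
#     ],
# }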
Workflows
Build complex scraping workflows:
from scrapeflow import Workflow

workflow = Workflow(name="product_scraper")

# Add steps
workflow.add_step(
    name="navigate",
    func=lambda page, context: scraper.navigate("https://example.com"),
    required=True,  # stop the workflow if this step fails
)
workflow.add_step(
    name="extract",
    func=extract_data,
    retryable=True,
    on_success=save_data,   # callback on success
    on_error=handle_error,  # callback on error
    condition=lambda ctx: ctx.get("should_extract", True),  # conditional execution
)

# Execute
result = await scraper.run_workflow(workflow)
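save_data and handle_error above are user-supplied callbacks. Their exact signatures aren't documented in this README, so the sketch below assumes each receives the step's result or exception plus the shared context:

import json

async def save_data(result, context):
    # Hypothetical signature: persist whatever the extract step returned
    with open("output.json", "w") as f:
        json.dump(result, f, indent=2)

async def handle_error(error, context):
    # Hypothetical signature: record the failure and skip later extraction
    print(f"Step failed: {error}")
    context["should_extract"] = False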
Monitoring & Metrics
Track scraping performance:
# Get metrics
metrics = scraper.get_metrics()
print(f"Success rate: {metrics.get_success_rate():.2f}%")
print(f"Total requests: {metrics.total_requests}")
print(f"Average response time: {metrics.average_response_time:.2f}s")
print(f"Errors by type: {metrics.errors_by_type}")
# Reset metrics
scraper.reset_metrics()
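Putting it together, you might report a summary after a batch of requests; this sketch uses only the calls shown above:

import asyncio
from scrapeflow import ScrapeFlow

async def main():
    async with ScrapeFlow() as scraper:
        for url in ["https://example.com", "https://example.org"]:
            await scraper.navigate(url)
        metrics = scraper.get_metrics()
        print(f"{metrics.total_requests} requests, "
              f"{metrics.get_success_rate():.1f}% success")
        scraper.reset_metrics()  # start fresh for the next batch

asyncio.run(main())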
Error Handling
ScrapeFlow provides custom exceptions:
from scrapeflow.exceptions import (
    ScrapeFlowError,
    ScrapeFlowRetryError,
    ScrapeFlowTimeoutError,
    ScrapeFlowBlockedError,
)

try:
    await scraper.navigate("https://example.com")
except ScrapeFlowBlockedError as e:
    print(f"Blocked! Retry after {e.retry_after} seconds")
except ScrapeFlowTimeoutError:
    print("Request timed out")
except ScrapeFlowRetryError as e:
    print(f"Failed after {e.retry_count} retries")
🎨 Examples
Check out the examples/ directory for more examples:
- basic_usage.py - Simple scraping example
- workflow_example.py - Workflow orchestration
- advanced_example.py - All features combined
🏗️ Architecture
ScrapeFlow is built with a modular architecture:
scrapeflow/
├── engine.py # Main ScrapeFlow engine
├── workflow.py # Workflow definition and execution
├── config.py # Configuration classes
├── anti_detection.py # Anti-detection utilities
├── rate_limiter.py # Rate limiting implementation
├── retry.py # Retry logic and error classification
├── monitoring.py # Metrics and logging
├── extractors.py # Data extraction utilities
└── exceptions.py # Custom exceptions
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built on top of Playwright - an amazing browser automation library
- Inspired by the need for production-ready scraping solutions
📧 Contact
Irfan Ali - GitHub
Project Link: https://github.com/irfanalidv/ScrapeFlow
Made with ❤️ for the scraping community