ScrapeFlow
An opinionated scraping workflow engine built on Playwright
ScrapeFlow is a production-ready Python library that transforms Playwright into a powerful, enterprise-grade web scraping framework. It handles the common challenges of web scraping: retries, rate limiting, anti-detection, error recovery, and workflow orchestration.
Features
- Specification-Driven Extraction: Declarative Pydantic models define fields, types, and validation, decoupling field definitions from page structure
- robots.txt Compliance: Built-in robots.txt parsing and enforcement; ethical crawling by design
- Ethical Crawling (GDPR/CCPA): Configurable data retention, anonymization, and consent options in the specification layer
- Component Registry: Shared, versioned selectors, pagination handlers, and login flows; platform thinking over one-off scrapers
- Monitoring & Alerting: Alert callbacks on failure thresholds; rollback hooks for failed extraction runs
- MCP Extensibility: Pluggable backends for Scrapy MCP Server, Playwright MCP, or LLM-based semantic extraction
- Intelligent Retry Logic: Automatic retries with exponential backoff and jitter
- Rate Limiting: Token bucket algorithm to respect server limits
- Anti-Detection: Stealth mode, user agent rotation, and proxy support
- Workflow Engine: Define complex scraping workflows with steps and conditions
- Monitoring & Metrics: Built-in performance monitoring and logging
- Data Extraction: Powerful utilities for extracting structured data
- Error Handling: Comprehensive error classification and recovery
- Type Hints: Full type support for a better IDE experience
Installation
pip install scrapeflow-py
Or install from source:
git clone https://github.com/irfanalidv/scrapeflow-py.git
cd scrapeflow-py
pip install -e .
Note: After installation, install Playwright browsers:
playwright install
Real-World Use Cases
ScrapeFlow is used in production for:
- E-commerce Price Monitoring - Track competitor prices, monitor deals, and optimize pricing strategies
- News & Content Aggregation - Collect articles from multiple sources for content platforms
- Job Listings Scraping - Aggregate job postings from various job boards
- Real Estate Data Collection - Monitor property listings, prices, and market trends
- Product Review Analysis - Extract and analyze customer reviews for market research
- Market Research - Gather competitor data, customer sentiment, and industry trends
- Lead Generation - Extract contact information from business directories
- Financial Data Collection - Monitor stock prices, cryptocurrency data, and market indicators
Quick Start
Use Case 1: Scraping Quotes with Retry & Rate Limiting
Real-world scenario: Collecting inspirational quotes from quotes.toscrape.com - a real website designed for scraping practice.
import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RateLimitConfig, RetryConfig
from scrapeflow.extractors import Extractor

async def main():
    # Configure for production scraping
    config = ScrapeFlowConfig(
        rate_limit=RateLimitConfig(requests_per_second=2.0),  # Respect server limits
        retry=RetryConfig(max_retries=3, initial_delay=1.0),  # Auto-retry on failures
    )
    async with ScrapeFlow(config) as scraper:
        await scraper.navigate("https://quotes.toscrape.com/")

        # Extract all quotes from the page
        quotes = []
        quote_elements = scraper.page.locator(".quote")
        count = await quote_elements.count()
        for i in range(count):
            quote_elem = quote_elements.nth(i)
            text = await quote_elem.locator(".text").text_content()
            author = await quote_elem.locator(".author").text_content()
            tags = await Extractor.extract_texts(quote_elem, ".tag")
            quotes.append({
                "quote": text.strip() if text else "",
                "author": author.strip() if author else "",
                "tags": tags
            })

        print(f"Scraped {len(quotes)} quotes")
        for quote in quotes[:3]:  # Show first 3
            print(f"\n{quote['quote']}\n- {quote['author']}")

asyncio.run(main())
Real Output:
Scraped 10 quotes from quotes.toscrape.com
1. Quote: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
Author: Albert Einstein
Tags: ['change', 'deep-thoughts', 'thinking', 'world']
2. Quote: "It is our choices, Harry, that show what we truly are, far more than our abilities."
Author: J.K. Rowling
Tags: ['abilities', 'choices']
3. Quote: "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
Author: Albert Einstein
Tags: ['inspirational', 'life', 'live', 'miracle', 'miracles']
Use Case 2: E-commerce Book Scraping Workflow
Real-world scenario: Scraping book data from books.toscrape.com - a real e-commerce site designed for scraping practice.
import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.config import ScrapeFlowConfig
from scrapeflow.extractors import StructuredExtractor

async def scrape_books(page, context):
    """Extract book listings from the page."""
    schema = {
        "books": {
            "items": "article.product_pod",
            "schema": {
                "title": "h3 a",
                "price": ".price_color",
                "availability": ".instock.availability"
            }
        }
    }
    extractor = StructuredExtractor(schema)
    return await extractor.extract(page)

async def check_affordable_books(data, context):
    """Callback to find affordable books."""
    for book in data.get("books", []):
        price_str = book.get("price", "").replace("£", "").strip()
        try:
            price = float(price_str)
            if price < 20.0:  # Books under £20
                print(f"Affordable: {book['title'][:50]}... - £{price}")
        except ValueError:
            pass

async def main():
    config = ScrapeFlowConfig()
    async with ScrapeFlow(config) as scraper:
        workflow = Workflow(name="book_scraper")

        # Step 1: Navigate to the books page
        async def navigate_to_books(page, context):
            scraper = context["scraper"]
            await scraper.navigate("https://books.toscrape.com/")
            await scraper.wait_for_selector("article.product_pod", timeout=10000)

        # Step 2: Extract book data
        workflow.add_step("navigate", navigate_to_books, required=True)
        workflow.add_step("extract", scrape_books, on_success=check_affordable_books)

        # Execute workflow
        result = await scraper.run_workflow(workflow)
        print(f"Scraped {len(result.final_data.get('books', []))} books")

asyncio.run(main())
Real Output:
Workflow 'book_scraper' completed. Success: True, Steps: 2/2
Scraped 20 books
Use Case 3: Quote Aggregation with Anti-Detection
Real-world scenario: Collecting quotes from quotes.toscrape.com while avoiding detection using stealth mode and user agent rotation.
import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig
)
from scrapeflow.extractors import Extractor

async def scrape_quotes_with_stealth():
    """Scrape quotes with anti-detection enabled."""
    config = ScrapeFlowConfig(
        anti_detection=AntiDetectionConfig(
            rotate_user_agents=True,  # Rotate user agents
            stealth_mode=True,        # Remove automation indicators
            viewport_width=1920,
            viewport_height=1080
        ),
        rate_limit=RateLimitConfig(requests_per_second=1.0)  # Be respectful
    )
    async with ScrapeFlow(config) as scraper:
        # Navigate to the quotes site
        await scraper.navigate("https://quotes.toscrape.com/")

        # Verify stealth mode is working
        user_agent = await scraper.page.evaluate("() => navigator.userAgent")
        page_title = await scraper.page.title()

        # Extract quote data
        quotes = []
        quote_elements = scraper.page.locator(".quote")
        count = await quote_elements.count()
        for i in range(count):
            quote_elem = quote_elements.nth(i)
            text = await quote_elem.locator(".text").text_content()
            author = await quote_elem.locator(".author").text_content()
            quotes.append({
                "quote": text.strip() if text else "",
                "author": author.strip() if author else "",
                "url": scraper.page.url  # page.url is a property, not awaitable
            })
        return quotes, user_agent, page_title

# Run the scraper
quotes, ua, title = asyncio.run(scrape_quotes_with_stealth())
print(f"Collected {len(quotes)} quotes from {title}")
print(f"User Agent: {ua[:60]}...")
print(f"\nFirst quote: {quotes[0]['quote'][:80]}...")
Real Output:
Collected 10 quotes from Quotes to Scrape
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101...
First quote: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
Use Case 4: Multi-Page Scraping with Error Handling & Metrics
Real-world scenario: Scraping multiple pages from quotes.toscrape.com with comprehensive error handling and performance monitoring.
import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RetryConfig
from scrapeflow.exceptions import (
    ScrapeFlowBlockedError,
    ScrapeFlowTimeoutError,
    ScrapeFlowRetryError
)
from scrapeflow.extractors import Extractor

async def scrape_multiple_pages():
    config = ScrapeFlowConfig(
        retry=RetryConfig(max_retries=5, initial_delay=2.0),
        log_level="INFO"
    )
    try:
        async with ScrapeFlow(config) as scraper:
            # Scrape multiple pages
            all_quotes = []
            pages = [
                "https://quotes.toscrape.com/",
                "https://quotes.toscrape.com/page/2/",
            ]
            for url in pages:
                await scraper.navigate(url)
                # Extract quotes
                quote_elements = scraper.page.locator(".quote")
                count = await quote_elements.count()
                for i in range(count):
                    quote_elem = quote_elements.nth(i)
                    text = await quote_elem.locator(".text").text_content()
                    author = await quote_elem.locator(".author").text_content()
                    all_quotes.append({
                        "quote": text.strip() if text else "",
                        "author": author.strip() if author else "",
                    })

            # Get performance metrics
            metrics = scraper.get_metrics()
            print(f"Success rate: {metrics.get_success_rate():.2f}%")
            print(f"Total requests: {metrics.total_requests}")
            print(f"Average response time: {metrics.average_response_time:.2f}s")
            return all_quotes
    except ScrapeFlowBlockedError as e:
        print(f"Blocked! Retry after {e.retry_after} seconds")
        return []
    except ScrapeFlowTimeoutError:
        print("Request timed out")
        return []
    except ScrapeFlowRetryError as e:
        print(f"Failed after {e.retry_count} retries")
        return []

quotes = asyncio.run(scrape_multiple_pages())
print(f"Found {len(quotes)} quotes across pages")
Real Output:
Success rate: 100.00%
Total requests: 2
Average response time: 1.10s
Found 20 quotes across pages
Documentation
Configuration
Use Case: Setting up a production-ready scraper for monitoring competitor prices across multiple sites.
from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig,
    RetryConfig,
    BrowserConfig,
    BrowserType,
)

# Production configuration for price monitoring
config = ScrapeFlowConfig(
    browser=BrowserConfig(
        browser_type=BrowserType.CHROMIUM,
        headless=True,            # Run in background
        timeout=30000,            # 30 second timeout
    ),
    retry=RetryConfig(
        max_retries=5,            # Retry up to 5 times
        initial_delay=1.0,        # Start with a 1 second delay
        max_delay=60.0,           # Cap at 60 seconds
        exponential_base=2.0,     # Double the delay each retry
        jitter=True,              # Add randomness
    ),
    rate_limit=RateLimitConfig(
        requests_per_second=2.0,  # Max 2 requests/second
        burst_size=5,             # Allow bursts of 5
    ),
    anti_detection=AntiDetectionConfig(
        rotate_user_agents=True,  # Rotate user agents
        stealth_mode=True,        # Remove automation traces
        viewport_width=1920,
        viewport_height=1080,
    ),
    log_level="INFO",             # Log important events
)
Specification-Driven Extraction (Pydantic)
Use Case: Declarative extraction with validation - fields, types, and rules live in specs, not fragile XPaths.
from pydantic import BaseModel
from scrapeflow import ScrapeFlow, SpecificationExtractor
from scrapeflow.specifications import FieldSpec, ItemSpec, ProductPriceSpec

# Model for a list of products
class BookListing(BaseModel):
    books: list[ProductPriceSpec]

# Schema maps fields to selectors
schema = {
    "books": ItemSpec(
        items_selector="article.product_pod",
        fields={
            "title": FieldSpec(selector="h3 a"),
            "price": FieldSpec(selector=".price_color"),
            "availability": FieldSpec(selector=".instock.availability", default=""),
            "url": FieldSpec(selector="h3 a", type="attribute", attribute="href"),
        },
    )
}

async with ScrapeFlow() as scraper:
    await scraper.navigate("https://books.toscrape.com/")
    extractor = SpecificationExtractor(BookListing, schema=schema)
    # Extract and validate in one step
    data = await extractor.extract(scraper.page)
    for book in data.books:
        print(f"{book.title}: {book.price}")
Ethical Crawling & robots.txt
Use Case: GDPR/CCPA compliance and robots.txt respect built into the specification layer.
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, EthicalCrawlingConfig

config = ScrapeFlowConfig(
    ethical_crawling=EthicalCrawlingConfig(
        respect_robots_txt=True,            # Check robots.txt before each request
        user_agent_for_robots="ScrapeFlow",
        anonymize_ip=True,                  # GDPR: minimize personal data
        data_retention_days=30,             # Document retention policy
    )
)

async with ScrapeFlow(config) as scraper:
    # navigate() automatically checks robots.txt
    await scraper.navigate("https://example.com/page")
Anti-Detection
Use Case: Scraping protected e-commerce sites that block automated access.
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, AntiDetectionConfig

# Configure anti-detection for protected sites
config = ScrapeFlowConfig(
    anti_detection=AntiDetectionConfig(
        # Rotate user agents to appear as different browsers
        rotate_user_agents=True,
        user_agents=[
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Firefox/120.0",
        ],
        # Enable stealth mode to remove automation indicators
        stealth_mode=True,  # Removes webdriver property, mocks plugins, etc.
        # Use realistic viewport sizes
        viewport_width=1920,
        viewport_height=1080,
        # Optional: rotate proxies for additional protection
        rotate_proxies=True,
        proxies=[
            {"server": "http://proxy1.example.com:8080"},
            {"server": "http://proxy2.example.com:8080"},
        ],
    )
)

async with ScrapeFlow(config) as scraper:
    # This will use stealth techniques automatically
    await scraper.navigate("https://protected-site.com")
    # Your scraping code here...
Rate Limiting
Use Case: Respecting API rate limits when scraping multiple pages to avoid getting blocked.
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RateLimitConfig

config = ScrapeFlowConfig(
    rate_limit=RateLimitConfig(
        requests_per_second=1.0,   # Max 1 request per second
        requests_per_minute=60.0,  # Or 60 requests per minute
        burst_size=5,              # Allow bursts of 5 requests
    )
)

async with ScrapeFlow(config) as scraper:
    # Scrape multiple pages - the rate limiter ensures we don't exceed limits
    urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
    ]
    for url in urls:
        await scraper.navigate(url)  # Automatically rate-limited
        # Extract data...
        print(f"Scraped: {url}")
    # The rate limiter inserts the proper delays between requests
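The feature list describes the limiter as a token bucket. The sketch below shows that idea in plain Python so the RateLimitConfig values are easier to reason about; it is an illustration of the algorithm, not ScrapeFlow's actual rate_limiter module.

import asyncio
import time

class TokenBucket:
    """Illustrative token bucket: `rate` tokens per second, at most `burst` stored."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to arrive
            await asyncio.sleep((1 - self.tokens) / self.rate)

# Roughly equivalent to requests_per_second=1.0, burst_size=5
bucket = TokenBucket(rate=1.0, burst=5)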
Retry Logic
Use Case: Handling network failures and temporary server errors when scraping unreliable sources.
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RetryConfig

config = ScrapeFlowConfig(
    retry=RetryConfig(
        max_retries=5,          # Retry up to 5 times
        initial_delay=1.0,      # Start with a 1 second delay
        max_delay=60.0,         # Cap at 60 seconds
        exponential_base=2.0,   # Double the delay each retry (1s, 2s, 4s, 8s...)
        jitter=True,            # Add randomness to avoid a thundering herd
    )
)

async with ScrapeFlow(config) as scraper:
    # If this fails, it will automatically retry with exponential backoff
    await scraper.navigate("https://unreliable-site.com/products")

# Retry logic handles:
# - Network timeouts
# - 500/502/503 server errors
# - Connection errors
# - Temporary blocks
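To make the backoff schedule concrete, the helper below computes the delay for each attempt from the same parameters (initial_delay, exponential_base, max_delay, jitter). It is a sketch of the standard exponential-backoff-with-jitter formula, not ScrapeFlow's retry module, and the jitter strategy shown (full jitter) is an assumption.

import random

def backoff_delay(attempt: int, initial_delay: float = 1.0,
                  exponential_base: float = 2.0, max_delay: float = 60.0,
                  jitter: bool = True) -> float:
    """Delay before retry `attempt` (0-based): initial * base**attempt, capped at max_delay."""
    delay = min(max_delay, initial_delay * (exponential_base ** attempt))
    if jitter:
        # Full jitter: pick a random point in [0, delay] so clients don't retry in lockstep
        delay = random.uniform(0, delay)
    return delay

# Without jitter the schedule is 1s, 2s, 4s, 8s, 16s for the first five retries
print([round(backoff_delay(i, jitter=False), 1) for i in range(5)])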
Data Extraction
Use Case: Extracting structured data from quotes.toscrape.com and books.toscrape.com.
import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.extractors import Extractor, StructuredExtractor

async def main():
    async with ScrapeFlow() as scraper:
        await scraper.navigate("https://quotes.toscrape.com/")

        # Method 1: Simple extraction
        page_title = await Extractor.extract_text(scraper.page, "h1")
        all_links = await Extractor.extract_links(scraper.page, "a")

        # Method 2: Structured extraction with a schema (best for complex pages)
        schema = {
            "page_title": "h1",
            "quotes": {
                "items": ".quote",        # Find all quote elements
                "schema": {
                    "text": ".text",      # Extract quote text
                    "author": ".author",  # Extract author
                    "tags": ".tag",       # Extract all tags
                },
            },
        }
        extractor = StructuredExtractor(schema)
        structured_data = await extractor.extract(scraper.page)

        print(f"Page: {structured_data['page_title']}")
        print(f"Quotes found: {len(structured_data['quotes'])}")
        if structured_data['quotes']:
            first = structured_data['quotes'][0]
            print(f"First quote: {first['text'][:60]}...")
            print(f"Author: {first['author']}")
            print(f"Tags: {first['tags']}")

asyncio.run(main())
Real Output:
Page: Quotes to Scrape
Quotes found: 10
First quote: "The world as we have created it is a process of our thinkin...
Author: Albert Einstein
Tags: ['change', 'deep-thoughts', 'thinking', 'world']
Workflows
Use Case: Building a multi-step scraper that navigates, extracts, and processes data with error handling.
import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.extractors import Extractor

async def login_step(page, context):
    """Step 1: Log in to the site."""
    # The scraper is automatically available in the context
    scraper = context["scraper"]
    await scraper.navigate("https://example.com/login")
    await scraper.fill("#username", context["username"])
    await scraper.fill("#password", context["password"])
    await scraper.click("button[type='submit']")
    await scraper.wait_for_selector(".dashboard", timeout=10000)

async def extract_products(page, context):
    """Step 2: Extract product data."""
    products = []
    product_elements = page.locator(".product")
    count = await product_elements.count()
    for i in range(count):
        product = product_elements.nth(i)
        products.append({
            "name": await Extractor.extract_text(product, ".name"),
            "price": await Extractor.extract_text(product, ".price"),
        })
    return products

async def save_to_database(data, context):
    """Callback: save extracted data."""
    print(f"Saving {len(data)} products to database...")
    # Your database save logic here

async def handle_error(error, context):
    """Callback: handle errors."""
    print(f"Error in workflow: {error}")
    # Your error handling logic here

async def main():
    async with ScrapeFlow() as scraper:
        workflow = Workflow(name="product_scraper")

        # Step 1: Login (required - stops the workflow if it fails)
        workflow.add_step(
            name="login",
            func=login_step,
            required=True,
            retryable=True,
        )

        # Step 2: Extract products (only if login succeeded)
        workflow.add_step(
            name="extract",
            func=extract_products,
            retryable=True,
            on_success=save_to_database,
            on_error=handle_error,
            condition=lambda ctx: ctx.get("logged_in", False),  # Conditional step
        )

        # Set context
        workflow.set_context("username", "user@example.com")
        workflow.set_context("password", "secret123")

        # Execute workflow
        result = await scraper.run_workflow(workflow)
        print(f"Workflow completed: {result.success}")

asyncio.run(main())
Monitoring & Metrics
Use Case: Monitoring scraping performance when scraping multiple pages from quotes.toscrape.com.
import asyncio
from scrapeflow import ScrapeFlow

async def main():
    async with ScrapeFlow() as scraper:
        # Perform multiple scraping operations
        urls = [
            "https://quotes.toscrape.com/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            await scraper.navigate(url)
            # Extract data...

        # Get comprehensive metrics
        metrics = scraper.get_metrics()
        print("Performance Metrics:")
        print(f"  Success Rate: {metrics.get_success_rate():.2f}%")
        print(f"  Total Requests: {metrics.total_requests}")
        print(f"  Successful: {metrics.successful_requests}")
        print(f"  Failed: {metrics.failed_requests}")
        print(f"  Retries: {metrics.retry_count}")
        print(f"  Avg Response Time: {metrics.average_response_time:.2f}s")
        print(f"  Total Duration: {metrics.total_duration:.2f}s")

        # Reset metrics for the next batch
        scraper.reset_metrics()

asyncio.run(main())
Real Output:
Performance Metrics:
Success Rate: 100.00%
Total Requests: 2
Successful: 2
Failed: 0
Retries: 0
Avg Response Time: 1.10s
Total Duration: 2.20s
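The feature list also mentions alert callbacks on failure thresholds. If you just want something similar without depending on a specific alerting API, you can wrap the metrics object shown above in your own check. A minimal sketch, assuming only the metrics attributes used in this section; the helper name and threshold are illustrative, not part of ScrapeFlow:

def check_failure_threshold(metrics, max_failure_rate: float = 10.0) -> None:
    """Fire an alert when the failure rate crosses a threshold (illustrative only)."""
    if metrics.total_requests == 0:
        return
    failure_rate = 100.0 - metrics.get_success_rate()
    if failure_rate > max_failure_rate:
        # Replace with your real alerting channel (email, Slack, PagerDuty, ...)
        print(f"ALERT: failure rate {failure_rate:.1f}% exceeds {max_failure_rate:.1f}%")

# Usage after a scraping batch:
# check_failure_threshold(scraper.get_metrics(), max_failure_rate=5.0)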
Error Handling
Use Case: Gracefully handling different types of errors when scraping quotes.toscrape.com.
import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.exceptions import (
    ScrapeFlowError,
    ScrapeFlowRetryError,
    ScrapeFlowTimeoutError,
    ScrapeFlowBlockedError,
)

async def scrape_with_error_handling():
    async with ScrapeFlow() as scraper:
        try:
            await scraper.navigate("https://quotes.toscrape.com/")
            title = await scraper.page.title()
            print(f"Successfully scraped: {title}")
        except ScrapeFlowBlockedError as e:
            # The site blocked us - wait and retry later
            print(f"Blocked! Retry after {e.retry_after} seconds")
        except ScrapeFlowTimeoutError:
            # The request took too long
            print("Request timed out - the site may be slow")
        except ScrapeFlowRetryError as e:
            # All retries exhausted
            print(f"Failed after {e.retry_count} retries")
        except ScrapeFlowError as e:
            # Generic ScrapeFlow error
            print(f"ScrapeFlow error: {e}")
        except Exception as e:
            # Other unexpected errors
            print(f"Unexpected error: {e}")

asyncio.run(scrape_with_error_handling())
Real Output:
Successfully scraped: Quotes to Scrape
Complete Examples
Check out the examples/ directory for more examples:
- basic_usage.py - Simple scraping example
- workflow_example.py - Workflow orchestration
- advanced_example.py - All features combined
Example: Complete Book Scraper with All Features
Real-world scenario: Complete example scraping books from books.toscrape.com using all ScrapeFlow features.
import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig,
    RetryConfig
)
from scrapeflow.extractors import StructuredExtractor

async def scrape_books_complete():
    """Complete book scraping solution with all ScrapeFlow features."""
    config = ScrapeFlowConfig(
        anti_detection=AntiDetectionConfig(
            rotate_user_agents=True,
            stealth_mode=True,
        ),
        rate_limit=RateLimitConfig(requests_per_second=1.0),
        retry=RetryConfig(max_retries=3),
    )
    async with ScrapeFlow(config) as scraper:
        workflow = Workflow(name="book_monitor")

        async def extract_books(page, context):
            schema = {
                "books": {
                    "items": "article.product_pod",
                    "schema": {
                        "title": "h3 a",
                        "price": ".price_color",
                    }
                }
            }
            extractor = StructuredExtractor(schema)
            return await extractor.extract(page)

        async def navigate_to_books(page, context):
            scraper = context["scraper"]
            await scraper.navigate("https://books.toscrape.com/")
            await scraper.wait_for_selector("article.product_pod", timeout=10000)

        workflow.add_step("navigate", navigate_to_books, required=True)
        workflow.add_step("extract", extract_books)

        result = await scraper.run_workflow(workflow)

        # Get metrics
        metrics = scraper.get_metrics()
        books = result.final_data.get("books", [])

        print(f"Scraped {len(books)} books")
        print(f"Success rate: {metrics.get_success_rate():.2f}%")
        print(f"Average response time: {metrics.average_response_time:.2f}s")
        if books:
            print("\nSample books:")
            for book in books[:3]:
                print(f"  - {book.get('title', '')[:40]}... - {book.get('price', '')}")
        return result.final_data

asyncio.run(scrape_books_complete())
Real Output:
Workflow 'book_monitor' completed. Success: True, Steps: 2/2
Scraped 20 books
Success rate: 100.00%
Average response time: 1.15s

Sample books:
  - A Light in the Attic... - £51.77
  - Tipping the Velvet... - £53.74
  - Soumission... - £50.10
Architecture
ScrapeFlow is built with a modular architecture:
scrapeflow/
├── engine.py             # Main ScrapeFlow engine
├── ports.py              # Protocols for dependency inversion
├── browser_runtime.py    # Playwright runtime adapter
├── workflow.py           # Workflow definition entities
├── workflow_executor.py  # Workflow execution service
├── config.py             # Configuration classes (incl. EthicalCrawlingConfig)
├── specifications.py     # Pydantic specification-driven extraction
├── schema_library.py     # Reusable schema definitions
├── robots.py             # robots.txt parsing and enforcement
├── registry.py           # Shared selector/component registry
├── mcp_backend.py        # MCP integration extensibility
├── anti_detection.py     # Anti-detection utilities
├── rate_limiter.py       # Rate limiting implementation
├── retry.py              # Retry logic and error classification
├── monitoring.py         # Metrics, logging, alerting
├── extractors.py         # Data extraction utilities
└── exceptions.py         # Custom exceptions
For deeper design details, see ARCHITECTURE.md.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built on top of Playwright - an amazing browser automation library
- Inspired by the need for production-ready scraping solutions
Contact
Irfan Ali - GitHub
Project Link: https://github.com/irfanalidv/scrapeflow-py
Made with ❤️ for the scraping community