ScrapeFlow
An opinionated scraping workflow engine built on Playwright
ScrapeFlow is a production-ready Python library that transforms Playwright into a powerful, enterprise-grade web scraping framework. It handles the common challenges of web scraping: retries, rate limiting, anti-detection, error recovery, and workflow orchestration.
🚀 Features
- 🔄 Intelligent Retry Logic: Automatic retries with exponential backoff and jitter
- ⚡ Rate Limiting: Token bucket algorithm to respect server limits
- 🕵️ Anti-Detection: Stealth mode, user agent rotation, and proxy support
- 📊 Workflow Engine: Define complex scraping workflows with steps and conditions
- 📈 Monitoring & Metrics: Built-in performance monitoring and logging
- 🛠️ Data Extraction: Powerful utilities for extracting structured data
- 🔧 Error Handling: Comprehensive error classification and recovery
- 📝 Type Hints: Full type support for better IDE experience
📦 Installation
pip install scrapeflow-py
Or install from source:
git clone https://github.com/irfanalidv/ScrapeFlow.git
cd ScrapeFlow
pip install -e .
Note: After installation, install Playwright browsers:
playwright install
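If you only need the default Chromium engine, Playwright can install that browser alone:

playwright install chromium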
🎯 Quick Start
Basic Usage
import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig

async def main():
    config = ScrapeFlowConfig()
    config.browser.headless = False

    async with ScrapeFlow(config) as scraper:
        await scraper.navigate("https://example.com")
        title = await scraper.page.title()
        print(f"Page title: {title}")

asyncio.run(main())
Workflow Example
import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.extractors import Extractor

async def extract_data(page, context):
    return {
        "title": await Extractor.extract_text(page, "h1"),
        "links": await Extractor.extract_links(page, "a"),
    }

async def main():
    async with ScrapeFlow() as scraper:
        workflow = Workflow(name="my_scraper")
        workflow.add_step("navigate", lambda page, context: scraper.navigate("https://example.com"))
        workflow.add_step("extract", extract_data)

        result = await scraper.run_workflow(workflow)
        print(result.final_data)

asyncio.run(main())
📚 Documentation
Configuration
ScrapeFlow is highly configurable:
from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig,
    RetryConfig,
    BrowserConfig,
    BrowserType,
)

config = ScrapeFlowConfig(
    browser=BrowserConfig(
        browser_type=BrowserType.CHROMIUM,
        headless=True,
        timeout=30000,
    ),
    retry=RetryConfig(
        max_retries=5,
        initial_delay=1.0,
        max_delay=60.0,
        exponential_base=2.0,
    ),
    rate_limit=RateLimitConfig(
        requests_per_second=2.0,
        burst_size=5,
    ),
    anti_detection=AntiDetectionConfig(
        rotate_user_agents=True,
        stealth_mode=True,
        viewport_width=1920,
        viewport_height=1080,
    ),
    log_level="INFO",
)
Anti-Detection
ScrapeFlow includes several anti-detection features:
from scrapeflow.config import AntiDetectionConfig

# User agent rotation
anti_detection = AntiDetectionConfig(
    rotate_user_agents=True,
    user_agents=[
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        # Add your custom user agents
    ],
)

# Proxy rotation
anti_detection = AntiDetectionConfig(
    rotate_proxies=True,
    proxies=[
        {"server": "http://proxy1:8080"},
        {"server": "http://proxy2:8080"},
    ],
)

# Stealth mode (removes automation indicators)
anti_detection = AntiDetectionConfig(stealth_mode=True)
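These settings plug into the top-level ScrapeFlowConfig shown earlier. A minimal sketch combining stealth mode with proxy rotation (the proxy address is a placeholder):

from scrapeflow.config import ScrapeFlowConfig, AntiDetectionConfig

config = ScrapeFlowConfig(
    anti_detection=AntiDetectionConfig(
        stealth_mode=True,
        rotate_proxies=True,
        proxies=[{"server": "http://proxy1:8080"}],  # placeholder address
    ),
)
# Pass it to the engine as in the Quick Start:
# async with ScrapeFlow(config) as scraper: ...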
Rate Limiting
Control request frequency to avoid being blocked:
from scrapeflow.config import RateLimitConfig

rate_limit = RateLimitConfig(
    requests_per_second=1.0,   # 1 request per second
    requests_per_minute=60.0,  # or 60 per minute
    burst_size=5,              # allow bursts of 5 requests
)
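For intuition, a token bucket refills at a steady rate and allows short bursts up to its capacity. The standalone sketch below illustrates the algorithm; it is not ScrapeFlow's internal class, and all names here are illustrative:

import time

class TokenBucket:
    """Illustrative token bucket: refills at `rate` tokens/second, capped at `burst`."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self) -> None:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never beyond the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            # Bucket empty: wait until one full token has accrued
            time.sleep((1 - self.tokens) / self.rate)
            self.tokens = 1.0
        self.tokens -= 1

bucket = TokenBucket(rate=1.0, burst=5)  # mirrors the config above
for i in range(7):
    bucket.acquire()  # the first 5 pass immediately, the rest are paced at 1/s
    print("request", i)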
Retry Logic
Automatic retries with exponential backoff:
from scrapeflow.config import RetryConfig

retry = RetryConfig(
    max_retries=5,
    initial_delay=1.0,     # start with 1 second
    max_delay=60.0,        # cap at 60 seconds
    exponential_base=2.0,  # double the delay each retry
    jitter=True,           # add randomness to avoid a thundering herd
)
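With these values, the wait before retry n is roughly min(max_delay, initial_delay * exponential_base ** n), optionally multiplied by random jitter. A standalone sketch of that schedule (illustrative, not ScrapeFlow's internal implementation):

import random

def backoff_delay(attempt: int, initial: float = 1.0, base: float = 2.0,
                  cap: float = 60.0, jitter: bool = True) -> float:
    """Delay before retry number `attempt` (0-indexed)."""
    delay = min(cap, initial * base ** attempt)
    if jitter:
        delay *= random.uniform(0.5, 1.5)  # spread retries so clients don't sync up
    return delay

# Without jitter the schedule is 1s, 2s, 4s, 8s, 16s, 32s, then capped at 60s
print([backoff_delay(n, jitter=False) for n in range(7)])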
Data Extraction
ScrapeFlow provides powerful extraction utilities:
from scrapeflow.extractors import Extractor, StructuredExtractor

# Simple extraction
title = await Extractor.extract_text(page, "h1")
links = await Extractor.extract_links(page, "a")
images = await Extractor.extract_images(page, "img")

# Table extraction
table_data = await Extractor.extract_table(page, "table")

# Structured extraction with schema
schema = {
    "title": "h1",
    "description": ".description",
    "items": {
        "items": ".item",
        "schema": {
            "name": ".name",
            "price": ".price",
        },
    },
}
extractor = StructuredExtractor(schema)
data = await extractor.extract(page)
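For a page matching this schema, data should come back as a nested structure along these lines (the values are illustrative and the exact shape may vary by version):

# data ≈ {
#     "title": "Product Catalog",
#     "description": "Our featured items",
#     "items": [
#         {"name": "Widget", "price": "$9.99"},
#         {"name": "Gadget", "price": "$19.99"},
#     ],
# }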
Workflows
Build complex scraping workflows:
from scrapeflow import Workflow

workflow = Workflow(name="product_scraper")

# Add steps
workflow.add_step(
    name="navigate",
    func=lambda page, context: scraper.navigate("https://example.com"),
    required=True,  # stop the workflow if this step fails
)
workflow.add_step(
    name="extract",
    func=extract_data,
    retryable=True,
    on_success=save_data,   # callback on success
    on_error=handle_error,  # callback on error
    condition=lambda ctx: ctx.get("should_extract", True),  # conditional execution
)

# Execute
result = await scraper.run_workflow(workflow)
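save_data and handle_error above are user-supplied callbacks. Their exact signatures aren't documented in this README, so the sketch below assumes each receives the step's result or exception plus the shared context:

import json

async def save_data(result, context):
    # Hypothetical signature: persist whatever the extract step returned
    with open("output.json", "w") as f:
        json.dump(result, f, indent=2)

async def handle_error(error, context):
    # Hypothetical signature: record the failure and skip later extraction
    print(f"Step failed: {error}")
    context["should_extract"] = False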
Monitoring & Metrics
Track scraping performance:
# Get metrics
metrics = scraper.get_metrics()
print(f"Success rate: {metrics.get_success_rate():.2f}%")
print(f"Total requests: {metrics.total_requests}")
print(f"Average response time: {metrics.average_response_time:.2f}s")
print(f"Errors by type: {metrics.errors_by_type}")
# Reset metrics
scraper.reset_metrics()
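Putting it together, you might report a summary after a batch of requests; this sketch uses only the calls shown above:

import asyncio
from scrapeflow import ScrapeFlow

async def main():
    async with ScrapeFlow() as scraper:
        for url in ["https://example.com", "https://example.org"]:
            await scraper.navigate(url)
        metrics = scraper.get_metrics()
        print(f"{metrics.total_requests} requests, "
              f"{metrics.get_success_rate():.1f}% success")
        scraper.reset_metrics()  # start fresh for the next batch

asyncio.run(main())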
Error Handling
ScrapeFlow provides custom exceptions:
from scrapeflow.exceptions import (
    ScrapeFlowError,
    ScrapeFlowRetryError,
    ScrapeFlowTimeoutError,
    ScrapeFlowBlockedError,
)

try:
    await scraper.navigate("https://example.com")
except ScrapeFlowBlockedError as e:
    print(f"Blocked! Retry after {e.retry_after} seconds")
except ScrapeFlowTimeoutError:
    print("Request timed out")
except ScrapeFlowRetryError as e:
    print(f"Failed after {e.retry_count} retries")
🎨 Examples
Check out the examples/ directory for more examples:
- basic_usage.py - Simple scraping example
- workflow_example.py - Workflow orchestration
- advanced_example.py - All features combined
🏗️ Architecture
ScrapeFlow is built with a modular architecture:
scrapeflow/
├── engine.py # Main ScrapeFlow engine
├── workflow.py # Workflow definition and execution
├── config.py # Configuration classes
├── anti_detection.py # Anti-detection utilities
├── rate_limiter.py # Rate limiting implementation
├── retry.py # Retry logic and error classification
├── monitoring.py # Metrics and logging
├── extractors.py # Data extraction utilities
└── exceptions.py # Custom exceptions
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built on top of Playwright - an amazing browser automation library
- Inspired by the need for production-ready scraping solutions
📧 Contact
Irfan Ali - GitHub
Project Link: https://github.com/irfanalidv/ScrapeFlow
Made with ❤️ for the scraping community