
ScrapeFlow

An opinionated scraping workflow engine built on Playwright



ScrapeFlow is a production-ready Python library that transforms Playwright into a powerful, enterprise-grade web scraping framework. It handles the common challenges of web scraping: retries, rate limiting, anti-detection, error recovery, and workflow orchestration.

🚀 Features

  • 📋 Specification-Driven Extraction: Declarative Pydantic models define fields, types, and validation; field definitions stay decoupled from page structure
  • 🤖 robots.txt Compliance: Built-in robots.txt parsing and enforcement; ethical crawling by design
  • ⚖️ Ethical Crawling (GDPR/CCPA): Configurable data retention, anonymization, and consent options in the specification layer
  • 📦 Component Registry: Shared, versioned selectors, pagination handlers, and login flows; platform thinking over one-off scrapers
  • 🔔 Monitoring & Alerting: Alert callbacks on failure thresholds; rollback hooks for failed extraction runs
  • 🔌 MCP Extensibility: Pluggable backends for Scrapy MCP Server, Playwright MCP, or LLM-based semantic extraction
  • 🤖 Mistral LLM Extraction: Generate schemas from natural language and extract without selectors; uses MISTRAL_API_KEY
  • 🔄 Hybrid Extraction: Selector-first, with LLM fallback when validation fails; self-healing spiders
  • 🛡️ Selector Fallback Chain: Try multiple selectors per field ([".price", ".cost"]) for resilience
  • 🫙 Session Persistence: Save/load cookies and localStorage via save_storage_state() and storage_state_path
  • 🧹 Content Cleaning: Strip scripts/styles before LLM extraction for better accuracy and lower token usage
  • 🔄 Intelligent Retry Logic: Automatic retries with exponential backoff and jitter
  • ⚡ Rate Limiting: Token bucket algorithm to respect server limits
  • 🕵️ Anti-Detection: Stealth mode, user agent rotation, and proxy support
  • 📊 Workflow Engine: Define complex scraping workflows with steps and conditions
  • 📈 Monitoring & Metrics: Built-in performance monitoring and logging
  • 🛠️ Data Extraction: Powerful utilities for extracting structured data
  • 🔧 Error Handling: Comprehensive error classification and recovery
  • 📝 Type Hints: Full type support for a better IDE experience

ScrapeFlow Architecture

📦 Installation

pip install scrapeflow-py

Or install from source:

git clone https://github.com/irfanalidv/scrapeflow-py.git
cd scrapeflow-py
pip install -e .

Note: After installation, install Playwright browsers:

playwright install

🎯 Real-World Use Cases

ScrapeFlow is used in production for:

  • 💰 E-commerce Price Monitoring - Track competitor prices, monitor deals, and optimize pricing strategies
  • 📰 News & Content Aggregation - Collect articles from multiple sources for content platforms
  • 💼 Job Listings Scraping - Aggregate job postings from various job boards
  • 🏠 Real Estate Data Collection - Monitor property listings, prices, and market trends
  • ⭐ Product Review Analysis - Extract and analyze customer reviews for market research
  • 📊 Market Research - Gather competitor data, customer sentiment, and industry trends
  • 🔍 Lead Generation - Extract contact information from business directories
  • 📈 Financial Data Collection - Monitor stock prices, cryptocurrency data, and market indicators

🚀 Quick Start

Use Case 1: Scraping Quotes with Retry & Rate Limiting

Real-world scenario: Collecting inspirational quotes from quotes.toscrape.com - a real website designed for scraping practice.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RateLimitConfig, RetryConfig
from scrapeflow.extractors import Extractor

async def main():
    # Configure for production scraping
    config = ScrapeFlowConfig(
        rate_limit=RateLimitConfig(requests_per_second=2.0),  # Respect server limits
        retry=RetryConfig(max_retries=3, initial_delay=1.0),  # Auto-retry on failures
    )

    async with ScrapeFlow(config) as scraper:
        await scraper.navigate("https://quotes.toscrape.com/")

        # Extract all quotes from the page
        quotes = []
        quote_elements = scraper.page.locator(".quote")
        count = await quote_elements.count()

        for i in range(count):
            quote_elem = quote_elements.nth(i)
            text = await quote_elem.locator(".text").text_content()
            author = await quote_elem.locator(".author").text_content()
            tags = await Extractor.extract_texts(quote_elem, ".tag")

            quotes.append({
                "quote": text.strip() if text else "",
                "author": author.strip() if author else "",
                "tags": tags
            })

        print(f"Scraped {len(quotes)} quotes from quotes.toscrape.com")
        for i, quote in enumerate(quotes[:3], 1):  # Show first 3
            print(f"\n{i}. Quote: {quote['quote']}")
            print(f"   Author: {quote['author']}")
            print(f"   Tags: {quote['tags']}")

asyncio.run(main())

Real Output:

Scraped 10 quotes from quotes.toscrape.com

1. Quote: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
   Author: Albert Einstein
   Tags: ['change', 'deep-thoughts', 'thinking', 'world']

2. Quote: "It is our choices, Harry, that show what we truly are, far more than our abilities."
   Author: J.K. Rowling
   Tags: ['abilities', 'choices']

3. Quote: "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
   Author: Albert Einstein
   Tags: ['inspirational', 'life', 'live', 'miracle', 'miracles']

Use Case 2: E-commerce Book Scraping Workflow

Real-world scenario: Scraping book data from books.toscrape.com - a real e-commerce site designed for scraping practice.

import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.config import ScrapeFlowConfig
from scrapeflow.extractors import StructuredExtractor

async def scrape_books(page, context):
    """Extract book listings from the page."""
    schema = {
        "books": {
            "items": "article.product_pod",
            "schema": {
                "title": "h3 a",
                "price": ".price_color",
                "availability": ".instock.availability"
            }
        }
    }
    extractor = StructuredExtractor(schema)
    return await extractor.extract(page)

async def check_affordable_books(data, context):
    """Callback to find affordable books."""
    for book in data.get("books", []):
        price_str = book.get("price", "").replace("£", "").strip()
        try:
            price = float(price_str)
            if price < 20.0:  # Books under £20
                print(f"💰 Affordable: {book['title'][:50]}... - £{price}")
        except ValueError:
            pass

async def main():
    config = ScrapeFlowConfig()
    async with ScrapeFlow(config) as scraper:
        workflow = Workflow(name="book_scraper")

        # Step 1: Navigate to books page
        async def navigate_to_books(page, context):
            scraper = context["scraper"]
            await scraper.navigate("https://books.toscrape.com/")
            await scraper.wait_for_selector("article.product_pod", timeout=10000)

        # Step 2: Extract book data
        workflow.add_step("navigate", navigate_to_books, required=True)
        workflow.add_step("extract", scrape_books, on_success=check_affordable_books)

        # Execute workflow
        result = await scraper.run_workflow(workflow)
        print(f"✅ Scraped {len(result.final_data.get('books', []))} books")

asyncio.run(main())

Real Output:

Workflow 'book_scraper' completed. Success: True, Steps: 2/2
✅ Scraped 20 books

Use Case 3: Quote Aggregation with Anti-Detection

Real-world scenario: Collecting quotes from quotes.toscrape.com while avoiding detection using stealth mode and user agent rotation.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig
)
from scrapeflow.extractors import Extractor

async def scrape_quotes_with_stealth():
    """Scrape quotes with anti-detection enabled."""
    config = ScrapeFlowConfig(
        anti_detection=AntiDetectionConfig(
            rotate_user_agents=True,  # Rotate user agents
            stealth_mode=True,        # Remove automation indicators
            viewport_width=1920,
            viewport_height=1080
        ),
        rate_limit=RateLimitConfig(requests_per_second=1.0)  # Be respectful
    )

    async with ScrapeFlow(config) as scraper:
        # Navigate to quotes site
        await scraper.navigate("https://quotes.toscrape.com/")

        # Verify stealth mode is working
        user_agent = await scraper.page.evaluate("() => navigator.userAgent")
        page_title = await scraper.page.title()

        # Extract quote data
        quotes = []
        quote_elements = scraper.page.locator(".quote")
        count = await quote_elements.count()

        for i in range(count):
            quote_elem = quote_elements.nth(i)
            text = await quote_elem.locator(".text").text_content()
            author = await quote_elem.locator(".author").text_content()

            quotes.append({
                "quote": text.strip() if text else "",
                "author": author.strip() if author else "",
                "url": scraper.page.url  # page.url is a synchronous property, not a coroutine
            })

        return quotes, user_agent, page_title

# Run the scraper
quotes, ua, title = asyncio.run(scrape_quotes_with_stealth())
print(f"📰 Collected {len(quotes)} quotes from {title}")
print(f"🕵️ User Agent: {ua[:60]}...")
print(f"\nFirst quote: {quotes[0]['quote'][:80]}...")

Real Output:

📰 Collected 10 quotes from Quotes to Scrape
🕵️ User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101...

First quote: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."

Use Case 4: Multi-Page Scraping with Error Handling & Metrics

Real-world scenario: Scraping multiple pages from quotes.toscrape.com with comprehensive error handling and performance monitoring.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RetryConfig
from scrapeflow.exceptions import (
    ScrapeFlowBlockedError,
    ScrapeFlowTimeoutError,
    ScrapeFlowRetryError
)
from scrapeflow.extractors import Extractor

async def scrape_multiple_pages():
    config = ScrapeFlowConfig(
        retry=RetryConfig(max_retries=5, initial_delay=2.0),
        log_level="INFO"
    )

    try:
        async with ScrapeFlow(config) as scraper:
            # Scrape multiple pages
            all_quotes = []
            pages = [
                "https://quotes.toscrape.com/",
                "https://quotes.toscrape.com/page/2/",
            ]

            for url in pages:
                await scraper.navigate(url)

                # Extract quotes
                quote_elements = scraper.page.locator(".quote")
                count = await quote_elements.count()

                for i in range(count):
                    quote_elem = quote_elements.nth(i)
                    text = await quote_elem.locator(".text").text_content()
                    author = await quote_elem.locator(".author").text_content()

                    all_quotes.append({
                        "quote": text.strip() if text else "",
                        "author": author.strip() if author else "",
                    })

            # Get performance metrics
            metrics = scraper.get_metrics()
            print(f"📊 Success rate: {metrics.get_success_rate():.2f}%")
            print(f"📊 Total requests: {metrics.total_requests}")
            print(f"📊 Average response time: {metrics.average_response_time:.2f}s")

            return all_quotes

    except ScrapeFlowBlockedError as e:
        print(f"🚫 Blocked! Retry after {e.retry_after} seconds")
        return []
    except ScrapeFlowTimeoutError:
        print("⏱️ Request timed out")
        return []
    except ScrapeFlowRetryError as e:
        print(f"❌ Failed after {e.retry_count} retries")
        return []

quotes = asyncio.run(scrape_multiple_pages())
print(f"💼 Found {len(quotes)} quotes across pages")

Real Output:

📊 Success rate: 100.00%
📊 Total requests: 2
📊 Average response time: 1.10s
💼 Found 20 quotes across pages

Use Case 5: LLM-Powered Extraction (No Selectors)

Real-world scenario: Extract structured data using Mistral AI - no CSS selectors, just natural language field descriptions. Ideal for pages with inconsistent markup or when you want semantic understanding.

Requires: MISTRAL_API_KEY in .env or environment.

import asyncio
import os

try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

from scrapeflow import ScrapeFlow, create_mcp_backend
from scrapeflow.config import ScrapeFlowConfig, EthicalCrawlingConfig
from scrapeflow.llm_extract import generate_schema_from_prompt, extract_with_schema

async def run_llm_extraction():
    if not os.environ.get("MISTRAL_API_KEY"):
        print("Set MISTRAL_API_KEY in .env to run this example")
        return

    config = ScrapeFlowConfig(
        ethical_crawling=EthicalCrawlingConfig(respect_robots_txt=True),
    )

    async with ScrapeFlow(config) as scraper:
        await scraper.navigate("https://quotes.toscrape.com/")

        # Option 1: Generate schema from natural language
        schema = generate_schema_from_prompt(
            "Extract quote text, author name, and list of tags"
        )

        # Option 2: Extract using schema + page content (no selectors!)
        content = await scraper.page.evaluate("() => document.body.innerText")
        data = extract_with_schema(
            content,
            schema,
            prompt="Extract all quotes from this page with text, author, and tags",
        )
        print("LLM extracted:", data)

        # Option 3: MistralLLMBackend - field descriptions only
        backend = create_mcp_backend("mistral")
        semantic_data = await backend.extract_with_semantics(
            scraper.page,
            field_descriptions={
                "quote": "The quote text",
                "author": "Author name",
                "tags": "Comma-separated tags",
            },
            context="Extract the first quote on the page",
        )
        print("Semantic extraction:", semantic_data)

asyncio.run(run_llm_extraction())

Real Output:

Generated schema: {'type': 'object', 'properties': {'quote_text': {...}, 'author_name': {...}, 'tags': {...}}, 'required': ['quote_text', 'author_name', 'tags']}

LLM extracted: [{'quote_text': 'The world as we have created it is a process of our thinking...', 'author_name': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}, ...]

Semantic extraction: {'quote': 'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', 'author': 'Albert Einstein', 'tags': 'change, deep-thoughts, thinking, world'}

What you get: Schema generation from prompts, extraction without brittle selectors, and semantic understanding of page content - powerful for dynamic or inconsistent sites.

Use Case 6: Login + Session Persistence

Real-world scenario: Log in once, save cookies, then reuse the session on future runs - no re-login needed.

import asyncio
import os
from scrapeflow import ScrapeFlow, SpecificationExtractor, get_registry, register_quotes_login_handler
from scrapeflow.config import ScrapeFlowConfig, EthicalCrawlingConfig, BrowserConfig
from scrapeflow.specifications import FieldSpec, ItemSpec
from pydantic import BaseModel

class QuoteItem(BaseModel):
    text: str
    author: str

class QuotesPage(BaseModel):
    quotes: list[QuoteItem]

async def run():
    state_file = "/tmp/scrapeflow_session.json"
    schema = {"quotes": ItemSpec(items_selector=".quote", fields={"text": ".text", "author": ".author"})}

    # Step 1: Login and save (if first run)
    if not os.path.exists(state_file):
        async with ScrapeFlow(ScrapeFlowConfig(ethical_crawling=EthicalCrawlingConfig(respect_robots_txt=True))) as scraper:
            register_quotes_login_handler(get_registry())
            handler = get_registry().get_login("quotes_login")
            await scraper.login_with_handler("https://quotes.toscrape.com/login", "admin", "admin", handler)
            await scraper.save_storage_state(state_file)

    # Step 2: Use saved session (no login)
    config = ScrapeFlowConfig(ethical_crawling=EthicalCrawlingConfig(respect_robots_txt=True),
                             browser=BrowserConfig(storage_state_path=state_file))
    async with ScrapeFlow(config) as scraper:
        await scraper.navigate("https://quotes.toscrape.com/")
        data = await SpecificationExtractor(QuotesPage, schema=schema).extract(scraper.page)
        print(f"Extracted {len(data.quotes)} quotes (authenticated)")

asyncio.run(run())

Real Output:

Extracted 10 quotes (authenticated)

(On first run: login + save; on subsequent runs: load session and extract without re-login.)

What you get: login_with_handler, save_storage_state, storage_state_path - reusable authenticated sessions.

📚 Documentation

Configuration

Use Case: Setting up a production-ready scraper for monitoring competitor prices across multiple sites.

from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig,
    RetryConfig,
    BrowserConfig,
    BrowserType,
)

# Production configuration for price monitoring
config = ScrapeFlowConfig(
    browser=BrowserConfig(
        browser_type=BrowserType.CHROMIUM,
        headless=True,  # Run in background
        timeout=30000,   # 30 second timeout
    ),
    retry=RetryConfig(
        max_retries=5,           # Retry up to 5 times
        initial_delay=1.0,       # Start with 1 second delay
        max_delay=60.0,          # Cap at 60 seconds
        exponential_base=2.0,     # Double delay each retry
        jitter=True,              # Add randomness
    ),
    rate_limit=RateLimitConfig(
        requests_per_second=2.0,  # Max 2 requests/second
        burst_size=5,              # Allow bursts of 5
    ),
    anti_detection=AntiDetectionConfig(
        rotate_user_agents=True,   # Rotate user agents
        stealth_mode=True,          # Remove automation traces
        viewport_width=1920,
        viewport_height=1080,
    ),
    log_level="INFO",  # Log important events
)

Specification-Driven Extraction (Pydantic)

Use Case: Declarative extraction with validation - fields, types, and rules in specs, not fragile XPaths.

from pydantic import BaseModel
from scrapeflow import ScrapeFlow, SpecificationExtractor
from scrapeflow.specifications import FieldSpec, ItemSpec, ProductPriceSpec

# Model for list of products
class BookListing(BaseModel):
    books: list[ProductPriceSpec]

# Schema maps fields to selectors
schema = {
    "books": ItemSpec(
        items_selector="article.product_pod",
        fields={
            "title": FieldSpec(selector="h3 a"),
            "price": FieldSpec(selector=".price_color"),
            "availability": FieldSpec(selector=".instock.availability", default=""),
            "url": FieldSpec(selector="h3 a", type="attribute", attribute="href"),
        },
    )
}

async with ScrapeFlow() as scraper:
    await scraper.navigate("https://books.toscrape.com/")
    extractor = SpecificationExtractor(BookListing, schema=schema)
    # Extract and validate in one step
    data = await extractor.extract(scraper.page)
    for book in data.books:
        print(f"{book.title}: {book.price}")
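
The selector fallback chain from the feature list ([".price", ".cost"]) follows the same spec-driven idea: try each candidate selector in order and keep the first one that yields a value. A minimal, selector-free sketch of that logic (a hypothetical helper, not ScrapeFlow's actual API; `lookup` stands in for a page query):

```python
def first_match(lookup, selectors):
    """Return the first non-empty value among candidate selectors.

    `lookup` maps a selector to extracted text (or None); in real code it
    would wrap something like page.locator(sel).text_content().
    """
    for sel in selectors:
        value = lookup(sel)
        if value:
            return value
    return None

# Fake page: the site renamed .price to .cost, so the old selector now misses
fake_page = {".cost": "£19.99"}
print(first_match(fake_page.get, [".price", ".cost"]))  # £19.99
```

Because the chain only falls through on a miss, adding extra selectors costs nothing on pages where the primary selector still works.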

Ethical Crawling & robots.txt

Use Case: GDPR/CCPA compliance and robots.txt respect built into the specification layer.

from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, EthicalCrawlingConfig

config = ScrapeFlowConfig(
    ethical_crawling=EthicalCrawlingConfig(
        respect_robots_txt=True,      # Check robots.txt before each request
        user_agent_for_robots="ScrapeFlow",
        anonymize_ip=True,            # GDPR: minimize personal data
        data_retention_days=30,       # Document retention policy
    )
)

async with ScrapeFlow(config) as scraper:
    # navigate() automatically checks robots.txt
    await scraper.navigate("https://example.com/page")
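
Under the hood, robots.txt enforcement amounts to checking each URL against the site's rules before fetching. The same check can be sketched with Python's standard urllib.robotparser (illustrative only; ScrapeFlow's internal implementation may differ):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body, normally fetched from https://example.com/robots.txt
robots_body = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_body.splitlines())

# What a respect_robots_txt=True navigate() conceptually does per request
print(parser.can_fetch("ScrapeFlow", "https://example.com/page"))       # True
print(parser.can_fetch("ScrapeFlow", "https://example.com/private/x"))  # False
```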

Anti-Detection

Use Case: Scraping protected e-commerce sites that block automated access.

from scrapeflow.config import ScrapeFlowConfig, AntiDetectionConfig

# Configure anti-detection for protected sites
config = ScrapeFlowConfig(
    anti_detection=AntiDetectionConfig(
        # Rotate user agents to appear as different browsers
        rotate_user_agents=True,
        user_agents=[
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Firefox/120.0",
        ],
        # Enable stealth mode to remove automation indicators
        stealth_mode=True,  # Removes webdriver property, mocks plugins, etc.
        # Use realistic viewport sizes
        viewport_width=1920,
        viewport_height=1080,
        # Optional: Rotate proxies for additional protection
        rotate_proxies=True,
        proxies=[
            {"server": "http://proxy1.example.com:8080"},
            {"server": "http://proxy2.example.com:8080"},
        ],
    )
)

async with ScrapeFlow(config) as scraper:
    # This will use stealth techniques automatically
    await scraper.navigate("https://protected-site.com")
    # Your scraping code here...

Rate Limiting

Use Case: Respecting API rate limits when scraping multiple pages to avoid getting blocked.

from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RateLimitConfig

config = ScrapeFlowConfig(
    rate_limit=RateLimitConfig(
        requests_per_second=1.0,    # Max 1 request per second
        requests_per_minute=60.0,   # Or 60 requests per minute
        burst_size=5,                # Allow bursts of 5 requests
    )
)

async with ScrapeFlow(config) as scraper:
    # Scrape multiple pages - rate limiter ensures we don't exceed limits
    urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
    ]

    for url in urls:
        await scraper.navigate(url)  # Automatically rate-limited
        # Extract data...
        print(f"Scraped: {url}")
        # Rate limiter ensures proper delays between requests
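
The token bucket behind RateLimitConfig can be modeled in a few lines: the bucket refills at requests_per_second and holds at most burst_size tokens; each request spends one token or waits for the deficit. This is an illustrative sketch of the algorithm, not ScrapeFlow's actual implementation:

```python
import time

class TokenBucket:
    """Minimal token bucket: refill at `rate` tokens/sec, cap at `burst`."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self) -> float:
        """Take one token; return how long the caller should sleep first."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 0.0
        wait = (1 - self.tokens) / self.rate  # time until one token is available
        self.tokens -= 1.0                    # go into debt for the reserved token
        return wait

bucket = TokenBucket(rate=1.0, burst=5)
delays = [bucket.acquire() for _ in range(7)]
# The first 5 calls spend the burst immediately; calls 6 and 7 must wait
print([round(d, 2) for d in delays])  # → [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0] (rounded)
```

Allowing the token count to go negative is what makes queued callers space out by 1/rate instead of all waking at once.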

Retry Logic

Use Case: Handling network failures and temporary server errors when scraping unreliable sources.

from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RetryConfig

config = ScrapeFlowConfig(
    retry=RetryConfig(
        max_retries=5,              # Retry up to 5 times
        initial_delay=1.0,          # Start with 1 second delay
        max_delay=60.0,             # Cap at 60 seconds
        exponential_base=2.0,       # Double delay each retry (1s, 2s, 4s, 8s...)
        jitter=True,                 # Add randomness to avoid thundering herd
    )
)

async with ScrapeFlow(config) as scraper:
    # If this fails, it will automatically retry with exponential backoff
    await scraper.navigate("https://unreliable-site.com/products")
    # Retry logic handles:
    # - Network timeouts
    # - 500/502/503 server errors
    # - Connection errors
    # - Temporary blocks
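
The delay schedule implied by a RetryConfig like the one above is delay = min(initial_delay * exponential_base ** attempt, max_delay), plus optional jitter. A small sketch of that arithmetic (illustrative; the library's exact jitter formula is not documented here):

```python
import random

def backoff_delays(max_retries=5, initial_delay=1.0, max_delay=60.0,
                   exponential_base=2.0, jitter=False, rng=random.random):
    """Exponential backoff schedule, capped at max_delay, with optional jitter."""
    delays = []
    for attempt in range(max_retries):
        delay = min(initial_delay * exponential_base ** attempt, max_delay)
        if jitter:
            delay *= 0.5 + rng()  # spread retries to avoid a thundering herd
        delays.append(delay)
    return delays

print(backoff_delays())              # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(max_retries=8)) # last entries capped at max_delay: ... 32.0, 60.0, 60.0
```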

Pagination

Use Case: Scrape multiple pages (e.g. search results, product listings) with limits.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, PaginationConfig
from scrapeflow.pagination import paginate
from scrapeflow.registry import get_registry

async def extract_quotes(page, context):
    quotes = []
    for i in range(await page.locator(".quote").count()):
        q = page.locator(".quote").nth(i)
        text = await q.locator(".text").text_content()
        author = await q.locator(".author").text_content()
        quotes.append({"text": text or "", "author": author or ""})
    return quotes

async def main():
    get_registry().register_pagination("quotes", next_selector="li.next a", has_next="li.next")
    handler = get_registry().get_pagination("quotes")
    config = PaginationConfig(max_pages=3, max_results=25)

    async with ScrapeFlow(ScrapeFlowConfig()) as scraper:
        async for page_data in paginate(scraper, "https://quotes.toscrape.com/", handler, extract_quotes, config):
            print(f"Page: {len(page_data)} quotes")
            for q in page_data[:2]:
                print(f"  - {q['author']}: {q['text'][:40]}...")

asyncio.run(main())
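
Conceptually, paginate with a PaginationConfig loops like this: extract the current page, stop once max_pages or max_results is reached (or the has_next selector is absent), otherwise follow next_selector. A selector-free sketch of that loop over fake pages (illustrative only; the real helper also performs the navigation):

```python
def paginate_sketch(pages, max_pages=3, max_results=25):
    """Yield per-page item lists, honoring max_pages/max_results limits."""
    total = 0
    for page_number, items in enumerate(pages, start=1):
        if page_number > max_pages:
            break
        items = items[: max_results - total]  # trim to stay within max_results
        total += len(items)
        yield items
        if total >= max_results:
            break

fake_pages = [[f"q{i}" for i in range(10)]] * 4  # four pages of 10 quotes each
counts = [len(batch) for batch in paginate_sketch(fake_pages)]
print(counts)  # [10, 10, 5] -> stops mid-page once max_results=25 is hit
```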

Data Extraction

Use Case: Extracting structured data from quotes.toscrape.com and books.toscrape.com.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.extractors import Extractor, StructuredExtractor

async def main():
    async with ScrapeFlow() as scraper:
        await scraper.navigate("https://quotes.toscrape.com/")

        # Method 1: Simple extraction
        page_title = await Extractor.extract_text(scraper.page, "h1")
        all_links = await Extractor.extract_links(scraper.page, "a")

        # Method 2: Structured extraction with schema (Best for complex pages)
        schema = {
            "page_title": "h1",
            "quotes": {
                "items": ".quote",  # Find all quote elements
                "schema": {
                    "text": ".text",           # Extract quote text
                    "author": ".author",       # Extract author
                    "tags": ".tag",            # Extract all tags
                },
            },
        }
        extractor = StructuredExtractor(schema)
        structured_data = await extractor.extract(scraper.page)

        print(f"Page: {structured_data['page_title']}")
        print(f"Quotes found: {len(structured_data['quotes'])}")
        if structured_data['quotes']:
            first = structured_data['quotes'][0]
            print(f"First quote: {first['text'][:60]}...")
            print(f"Author: {first['author']}")
            print(f"Tags: {first['tags']}")

asyncio.run(main())

Real Output:

Page: Quotes to Scrape
Quotes found: 10
First quote: "The world as we have created it is a process of our thinkin...
Author: Albert Einstein
Tags: ['change', 'deep-thoughts', 'thinking', 'world']

Workflows

Use Case: Building a multi-step scraper that navigates, extracts, and processes data with error handling.

import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.extractors import Extractor

async def login_step(page, context):
    """Step 1: Login to the site"""
    # Scraper is automatically available in context
    scraper = context["scraper"]
    await scraper.navigate("https://example.com/login")
    await scraper.fill("#username", context["username"])
    await scraper.fill("#password", context["password"])
    await scraper.click("button[type='submit']")
    await scraper.wait_for_selector(".dashboard", timeout=10000)
    context["logged_in"] = True  # Mark success so the conditional extract step runs

async def extract_products(page, context):
    """Step 2: Extract product data"""
    products = []
    product_elements = page.locator(".product")
    count = await product_elements.count()

    for i in range(count):
        product = product_elements.nth(i)
        products.append({
            "name": await Extractor.extract_text(product, ".name"),
            "price": await Extractor.extract_text(product, ".price"),
        })
    return products

async def save_to_database(data, context):
    """Callback: Save extracted data"""
    print(f"💾 Saving {len(data)} products to database...")
    # Your database save logic here

async def handle_error(error, context):
    """Callback: Handle errors"""
    print(f"โŒ Error in workflow: {error}")
    # Your error handling logic here

async def main():
    async with ScrapeFlow() as scraper:
        workflow = Workflow(name="product_scraper")

        # Step 1: Login (required - stops workflow if fails)
        workflow.add_step(
            name="login",
            func=login_step,
            required=True,
            retryable=True,
        )

        # Step 2: Extract products (only if login succeeded)
        workflow.add_step(
            name="extract",
            func=extract_products,
            retryable=True,
            on_success=save_to_database,
            on_error=handle_error,
            condition=lambda ctx: ctx.get("logged_in", False),  # Conditional
        )

        # Set context
        workflow.set_context("username", "user@example.com")
        workflow.set_context("password", "secret123")

        # Execute workflow
        result = await scraper.run_workflow(workflow)
        print(f"✅ Workflow completed: {result.success}")

asyncio.run(main())

Monitoring & Metrics

Use Case: Monitoring scraping performance when scraping multiple pages from quotes.toscrape.com.

import asyncio
from scrapeflow import ScrapeFlow

async def main():
    async with ScrapeFlow() as scraper:
        # Perform multiple scraping operations
        urls = [
            "https://quotes.toscrape.com/",
            "https://quotes.toscrape.com/page/2/",
        ]

        for url in urls:
            await scraper.navigate(url)
            # Extract data...

        # Get comprehensive metrics
        metrics = scraper.get_metrics()

        print("📊 Performance Metrics:")
        print(f"   Success Rate: {metrics.get_success_rate():.2f}%")
        print(f"   Total Requests: {metrics.total_requests}")
        print(f"   Successful: {metrics.successful_requests}")
        print(f"   Failed: {metrics.failed_requests}")
        print(f"   Retries: {metrics.retry_count}")
        print(f"   Avg Response Time: {metrics.average_response_time:.2f}s")
        print(f"   Total Duration: {metrics.total_duration:.2f}s")

        # Reset metrics for next batch
        scraper.reset_metrics()

asyncio.run(main())

Real Output:

📊 Performance Metrics:
   Success Rate: 100.00%
   Total Requests: 2
   Successful: 2
   Failed: 0
   Retries: 0
   Avg Response Time: 1.10s
   Total Duration: 2.20s

Error Handling

Use Case: Gracefully handling different types of errors when scraping quotes.toscrape.com.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.exceptions import (
    ScrapeFlowError,
    ScrapeFlowRetryError,
    ScrapeFlowTimeoutError,
    ScrapeFlowBlockedError,
)

async def scrape_with_error_handling():
    async with ScrapeFlow() as scraper:
        try:
            await scraper.navigate("https://quotes.toscrape.com/")
            title = await scraper.page.title()
            print(f"✅ Successfully scraped: {title}")

        except ScrapeFlowBlockedError as e:
            # Site blocked us - wait and retry later
            print(f"🚫 Blocked! Retry after {e.retry_after} seconds")

        except ScrapeFlowTimeoutError:
            # Request took too long
            print("⏱️ Request timed out - site may be slow")

        except ScrapeFlowRetryError as e:
            # All retries exhausted
            print(f"❌ Failed after {e.retry_count} retries")

        except ScrapeFlowError as e:
            # Generic ScrapeFlow error
            print(f"⚠️ ScrapeFlow error: {e}")

        except Exception as e:
            # Other unexpected errors
            print(f"💥 Unexpected error: {e}")

asyncio.run(scrape_with_error_handling())

Real Output:

✅ Successfully scraped: Quotes to Scrape

📋 Feature Coverage - Nothing Hidden

Every ScrapeFlow feature is demonstrated somewhere. Quick reference:

Feature | Where to See It
------- | ---------------
Extractors | Use Case 1, basic_usage.py, advanced_example.py
StructuredExtractor | Use Case 2, workflow_example.py, advanced_example.py
SpecificationExtractor | Use Case 6, Doc section, specification_driven_example.py
HybridExtractor | hybrid_extraction_example.py
FieldSpec, ItemSpec | Doc section, specification_driven_example.py, hybrid_extraction_example.py
Selector fallback [".a", ".b"] | hybrid_extraction_example.py
LLM extraction | Use Case 5, llm_extraction_example.py
Login | Use Case 6, authenticated_quotes_example.py, session_persistence_example.py
Session persistence | Use Case 6, session_persistence_example.py
Registry, LoginHandler | authenticated_quotes_example.py, specification_driven_example.py
Workflow | Use Case 2, workflow_example.py
Rate limit, Retry | Use Cases 1, 4, basic_usage.py, advanced_example.py
Anti-detection | Use Case 3, advanced_example.py
Ethical crawling, robots.txt | specification_driven_example.py, authenticated_quotes_example.py
Pagination | Doc section, scrapeflow.pagination.paginate, PaginationConfig
Content cleaning | Used internally by MistralLLMBackend; scrapeflow.content_utils
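The content-cleaning step in the last row (stripping scripts and styles before LLM extraction) can be illustrated with the standard library's HTMLParser. This is a self-contained sketch of the idea; ScrapeFlow's scrapeflow.content_utils may be implemented differently:

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collect text content, skipping everything inside <script>/<style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # HTMLParser delivers <script>/<style> bodies via handle_data too,
        # so the depth counter filters them out here.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def clean_html(html):
    parser = _TextOnly()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><head><style>.q{color:red}</style></head>"
        "<body><script>var x=1;</script><p>A quote</p></body></html>")
print(clean_html(page))  # A quote
```

Dropping markup and scripts before sending a page to an LLM both improves accuracy and cuts token usage, as noted in the feature list.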

🎨 Complete Examples

Check out the examples/ directory; each example showcases specific features:

Example | Features Demonstrated
------- | ---------------------
basic_usage.py | ScrapeFlow, Extractor, rate limit, retry, metrics
workflow_example.py | Workflow, steps, on_success/on_error, StructuredExtractor
advanced_example.py | Anti-detection, rate limit, retry, StructuredExtractor
specification_driven_example.py | SpecificationExtractor, FieldSpec, ItemSpec, ethical crawling, registry
authenticated_quotes_example.py | Login, login_with_handler, registry, LoginHandler
llm_extraction_example.py | Mistral LLM, generate_schema_from_prompt, extract_with_schema, MistralLLMBackend (requires MISTRAL_API_KEY)
hybrid_extraction_example.py | HybridExtractor, selector fallback chain [".a", ".b"], LLM fallback
session_persistence_example.py | save_storage_state, storage_state_path, skip login on reuse

Real Outputs from Examples

Outputs below are from running each example against live sites (quotes.toscrape.com, books.toscrape.com).

basic_usage.py

Page title: Quotes to Scrape

Extracted 5 quotes:

1. "The world as we have created it is a process of our thinkin...
   Author: Albert Einstein
   Tags: change, deep-thoughts, thinking, world

2. "It is our choices, Harry, that show what we truly are, far ...
   Author: J.K. Rowling
   Tags: abilities, choices

3. "There are only two ways to live your life. One is as though...
   Author: Albert Einstein
   Tags: inspirational, life, live, miracle, miracles

4. "The person, be it gentleman or lady, who has not pleasure i...
   Author: Jane Austen
   Tags: aliteracy, books, classic, humor

5. "Imperfection is beauty, madness is genius and it's better t...
   Author: Marilyn Monroe
   Tags: be-yourself, inspirational

📊 Metrics:
   Success rate: 100.00%
   Total requests: 1
   Average response time: 1.72s

advanced_example.py

๐Ÿ“ Extracted quotes:

1. "The world as we have created it is a process of our thinkin...
   Author: Albert Einstein
   Tags: change, deep-thoughts, thinking, world

2. "It is our choices, Harry, that show what we truly are, far ...
   Author: J.K. Rowling
   Tags: abilities, choices

3. "There are only two ways to live your life. One is as though...
   Author: Albert Einstein
   Tags: inspirational, life, live, miracle, miracles

4. "The person, be it gentleman or lady, who has not pleasure i...
   Author: Jane Austen
   Tags: aliteracy, books, classic, humor

5. "Imperfection is beauty, madness is genius and it's better t...
   Author: Marilyn Monroe
   Tags: be-yourself, inspirational

Scraping completed:
  Success rate: 100.00%
  Total requests: 1
  Retries: 0

workflow_example.py

💾 Saving 20 books to database...
   - A Light in the ...... - £51.77
   - Tipping the Velvet... - £53.74
   - Soumission... - £50.10

✅ Workflow completed successfully!
📚 Extracted 20 books

📊 Metrics:
   Total requests: 3
   Success rate: 100.00%
   Average response time: 5.85s

specification_driven_example.py

Extracted 10 items (validated via Pydantic)
  1. "The world as we have created it is a process of o... | Albert Einstein
  2. "It is our choices, Harry, that show what we truly... | J.K. Rowling
  3. "There are only two ways to live your life. One is... | Albert Einstein

Metrics: 1 requests, 100.0% success

authenticated_quotes_example.py

Authenticated login: True
Quotes extracted: 10
1. Albert Einstein: "The world as we have created it is a process of our thinking. It cannot be chan...
2. J.K. Rowling: "It is our choices, Harry, that show what we truly are, far more than our abilit...
3. Albert Einstein: "There are only two ways to live your life. One is as though nothing is a miracl...

hybrid_extraction_example.py

Extracted (selector or LLM):
  Title: A Light in the Attic
  Price: £51.77
  Availability: In stock (22 available)
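The selector fallback chain behind this hybrid example (try each selector in order, take the first hit, only then fall back to the LLM) reduces to a small helper. This sketch runs against a plain dict standing in for a live page, not ScrapeFlow's real page object:

```python
def first_match(lookup, selectors):
    """Return the first non-empty result produced by any selector in the chain."""
    for selector in selectors:
        value = lookup(selector)
        if value:
            return value
    return None

# Stand-in for a page query function: a dict mapping selector -> extracted text.
fake_page = {".price_color": "£51.77"}
price = first_match(fake_page.get, [".price", ".cost", ".price_color"])
print(price)  # £51.77
```

When every selector in the chain misses, a hybrid extractor would hand the field over to the LLM backend instead of failing outright.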

llm_extraction_example.py (requires MISTRAL_API_KEY)

Generated schema: {'type': 'object', 'properties': {'quote_text': {'type': 'string', 'description': 'The text of the quote'}, 'author_name': {'type': 'string', 'description': 'The name of the author of the quote'}, 'tags': {'type': 'array', 'description': 'A list of tags associated with the quote'}}, 'required': ['quote_text', 'author_name', 'tags']}

LLM extracted: [{'quote_text': 'The world as we have created it is a process of our thinking...', 'author_name': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}, ...]

Semantic extraction: {'quote': 'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', 'author': 'Albert Einstein', 'tags': 'change, deep-thoughts, thinking, world'}
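Before trusting LLM output, it is worth checking it against the generated schema's required fields. A minimal check (illustrative only; ScrapeFlow's own validation is described as Pydantic-based) could be:

```python
def missing_required(item, schema):
    """List the schema's required fields that are absent from an extracted item."""
    return [key for key in schema.get("required", []) if key not in item]

# Same shape as the schema printed above.
schema = {
    "type": "object",
    "properties": {
        "quote_text": {"type": "string"},
        "author_name": {"type": "string"},
        "tags": {"type": "array"},
    },
    "required": ["quote_text", "author_name", "tags"],
}

good = {"quote_text": "The world as we have created it...",
        "author_name": "Albert Einstein", "tags": ["change"]}
bad = {"quote_text": "orphaned quote"}
print(missing_required(good, schema))  # []
print(missing_required(bad, schema))   # ['author_name', 'tags']
```

A failed check is exactly the trigger a hybrid pipeline can use to retry extraction or fall back to another backend.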

session_persistence_example.py

Session saved to /tmp/scrapeflow_quotes_session.json
Loaded session: extracted 10 quotes (authenticated)
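Playwright's storage state is plain JSON (a dict with "cookies" and "origins" keys), so the skip-login-on-reuse pattern shown above reduces to a file check. A sketch with hypothetical helper names, not ScrapeFlow's API:

```python
import json
import tempfile
from pathlib import Path

def save_session(path, storage_state):
    """Persist cookies/localStorage to disk in Playwright's storage_state shape."""
    Path(path).write_text(json.dumps(storage_state))

def load_session(path):
    """Return the saved state, or None when no session file exists yet."""
    path = Path(path)
    return json.loads(path.read_text()) if path.exists() else None

session_file = Path(tempfile.gettempdir()) / "scrapeflow_demo_session.json"
save_session(session_file, {"cookies": [{"name": "sessionid", "value": "abc"}],
                            "origins": []})
state = load_session(session_file)
print("skip login" if state else "need to log in")  # skip login
```

On the next run, a non-None state means the saved cookies can be injected into the browser context and the login flow skipped entirely.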

Example: Complete Book Scraper with All Features

Real-world scenario: a complete example that scrapes books from books.toscrape.com using all of ScrapeFlow's major features.

import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig,
    RetryConfig
)
from scrapeflow.extractors import StructuredExtractor

async def scrape_books_complete():
    """Complete book scraping solution with all ScrapeFlow features."""

    config = ScrapeFlowConfig(
        anti_detection=AntiDetectionConfig(
            rotate_user_agents=True,
            stealth_mode=True,
        ),
        rate_limit=RateLimitConfig(requests_per_second=1.0),
        retry=RetryConfig(max_retries=3),
    )

    async with ScrapeFlow(config) as scraper:
        workflow = Workflow(name="book_monitor")

        async def extract_books(page, context):
            schema = {
                "books": {
                    "items": "article.product_pod",
                    "schema": {
                        "title": "h3 a",
                        "price": ".price_color",
                    }
                }
            }
            extractor = StructuredExtractor(schema)
            return await extractor.extract(page)

        async def navigate_to_books(page, context):
            scraper = context["scraper"]
            await scraper.navigate("https://books.toscrape.com/")
            await scraper.wait_for_selector("article.product_pod", timeout=10000)

        workflow.add_step("navigate", navigate_to_books, required=True)
        workflow.add_step("extract", extract_books)

        result = await scraper.run_workflow(workflow)

        # Get metrics
        metrics = scraper.get_metrics()
        books = result.final_data.get("books", [])
        print(f"✅ Scraped {len(books)} books")
        print(f"📊 Success rate: {metrics.get_success_rate():.2f}%")
        print(f"📊 Average response time: {metrics.average_response_time:.2f}s")

        if books:
            print(f"\n📚 Sample books:")
            for book in books[:3]:
                print(f"   - {book.get('title', '')[:40]}... - {book.get('price', '')}")

        return result.final_data

asyncio.run(scrape_books_complete())

Real Output:

Workflow 'book_monitor' completed. Success: True, Steps: 2/2
✅ Scraped 20 books
📊 Success rate: 100.00%
📊 Average response time: 1.15s

📚 Sample books:
   - A Light in the Attic... - £51.77
   - Tipping the Velvet... - £53.74
   - Soumission... - £50.10
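The rate limiting configured above (requests_per_second=1.0) is described in the feature list as a token bucket. A minimal synchronous sketch of the algorithm follows; ScrapeFlow's actual implementation lives in rate_limiter.py and is async, so treat this as an illustration of the idea only:

```python
class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, then spend one token if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2.0)
results = [bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)]
print(results)  # [True, True, False, True]
```

The third request is denied because the burst allowance is spent and not enough time has passed to refill a full token; by t=1.5 the bucket has refilled and the request goes through.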

๐Ÿ—๏ธ Architecture

ScrapeFlow is built with a modular architecture:

scrapeflow/
├── engine.py            # Main ScrapeFlow engine
├── ports.py             # Protocols for dependency inversion
├── browser_runtime.py   # Playwright runtime adapter
├── workflow.py          # Workflow definition entities
├── workflow_executor.py # Workflow execution service
├── config.py            # Configuration (EthicalCrawling, Pagination, etc.)
├── specifications.py    # SpecificationExtractor, HybridExtractor, FieldSpec
├── schema_library.py    # Reusable schema definitions
├── extractors.py        # Extractor, StructuredExtractor
├── llm_extract.py       # Mistral LLM schema + extraction
├── content_utils.py     # HTML cleaning for LLM
├── mcp_backend.py       # MCPBackend, MistralLLMBackend
├── pagination.py        # paginate() helper
├── robots.py            # robots.txt parsing and enforcement
├── registry.py          # Selectors, login handlers, pagination
├── anti_detection.py    # Stealth mode, user agent rotation
├── rate_limiter.py      # Rate limiting implementation
├── retry.py             # Retry logic and error classification
├── monitoring.py        # Metrics, logging, alerting
└── exceptions.py        # Custom exceptions

For deeper design details, see ARCHITECTURE.md.
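Of these modules, robots.py handles robots.txt parsing and enforcement. The parsing side is already covered by Python's standard library, as this stand-alone sketch shows (ScrapeFlow's own API may differ):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse in-memory instead of fetching over HTTP

print(rp.can_fetch("ScrapeFlow", "https://example.com/quotes/"))    # True
print(rp.can_fetch("ScrapeFlow", "https://example.com/private/x"))  # False
print(rp.crawl_delay("ScrapeFlow"))                                 # 2
```

An enforcement layer would consult can_fetch() before every navigation and feed crawl_delay() into the rate limiter, which is presumably how the "ethical crawling by design" guarantee is wired together.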

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built on top of Playwright - an amazing browser automation library
  • Inspired by the need for production-ready scraping solutions

📧 Contact

Irfan Ali - GitHub

Project Link: https://github.com/irfanalidv/scrapeflow-py


Made with โค๏ธ for the scraping community

Download files

Download the file for your platform.

Source Distribution

scrapeflow_py-0.3.0.tar.gz (67.3 kB)

Uploaded Source

Built Distribution


scrapeflow_py-0.3.0-py3-none-any.whl (59.2 kB)

Uploaded Python 3

File details

Details for the file scrapeflow_py-0.3.0.tar.gz.

File metadata

  • Download URL: scrapeflow_py-0.3.0.tar.gz
  • Upload date:
  • Size: 67.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for scrapeflow_py-0.3.0.tar.gz

Algorithm | Hash digest
--------- | -----------
SHA256 | c986b1542e2d1ea273629023680ea53fb14bbce53b87b43eceb0020afc8211ee
MD5 | 4677117e7758d658999aa9d5191deccf
BLAKE2b-256 | 0a564f601fb8f06d77b2f25a6db3cf3f434ba8369203d34d7613451039e4fe28


File details

Details for the file scrapeflow_py-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: scrapeflow_py-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 59.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for scrapeflow_py-0.3.0-py3-none-any.whl

Algorithm | Hash digest
--------- | -----------
SHA256 | 6672de8c25619f9865a1fe4b3ad5bdf8596bfbff51b4ab158300242bd3f48a99
MD5 | a50ff3835785bd3c15715980a2f1d5b8
BLAKE2b-256 | 89c95bf6a7dde46ed8e2657ce58b19368021c5e926cc04a71d04fd2640d04c0e

