Silk

A flexible browser automation library with support for multiple drivers

Silk is a functional web scraping framework for Python that reimagines how web automation should work. Built around composable "Actions" and the Expression library, Silk enables you to write elegant, maintainable, and resilient web scrapers with true functional programming patterns.

Unlike traditional scraping libraries, Silk embraces Railway-Oriented Programming for robust error handling, uses immutable data structures for predictability, and provides an expressive, composable API that makes even complex scraping workflows readable and maintainable.

Why Silk?

Traditional web scraping approaches in Python often lead to complex, brittle code that's difficult to maintain. Silk solves these common challenges:

  • No More Callback Hell: Replace nested try/except blocks with elegant Railway-Oriented Programming
  • Resilient Scraping: Built-in retry mechanisms, fallback selectors, and error recovery
  • Composable Actions: Chain operations with intuitive operators (>>, &, |) for cleaner code
  • Type-Safe: Full typing support with Mypy and Pydantic for fewer runtime errors
  • Browser Agnostic: Same API for Playwright, Selenium, or any other browser automation tool
  • Parallelization Made Easy: Run operations concurrently with the & operator

Whether you're building a small data collection script or a large-scale scraping system, Silk's functional approach scales with your needs while keeping your codebase clean and maintainable.

Features

  • Purely Functional Design: Built on Expression library for robust functional programming in Python
  • Immutable Data Structures: Uses immutable collections for thread-safety and predictability
  • Railway-Oriented Programming: Elegant error handling with Result types
  • Functional & Composable API: Build pipelines with intuitive operators (>>, &, |)
  • Browser Abstraction: Works with Playwright, Selenium, or any other browser automation tool
  • Resilient Selectors: Fallback mechanisms to handle changing website structures
  • Type Safety: Leverages Pydantic, Mypy and Python's type hints for static type checking
  • Parallel Execution: Easy concurrent scraping with functional composition

Installation

You can install Silk with your preferred browser driver:

# Base installation (no drivers)
pip install silk-scraper

# With Playwright support
pip install silk-scraper[playwright]

# With Selenium support
pip install silk-scraper[selenium]

# With Puppeteer support
pip install silk-scraper[puppeteer]

# With all drivers
pip install silk-scraper[all]

Quick Start

Basic Example

Here's a minimal example to get you started with Silk:

import asyncio
from silk.actions.navigation import Navigate
from silk.actions.extraction import GetText
from silk.browsers.manager import BrowserManager

async def main():
    # Create a browser manager (defaults to Playwright)
    async with BrowserManager() as manager:
        # Define a simple scraping pipeline
        pipeline = (
            Navigate("https://example.com") 
            >> GetText("h1")
        )
        
        # Execute the pipeline
        result = await pipeline(manager)
        
        if result.is_ok():
            print(f"Page title: {result.default_value(None)}")
        else:
            print(f"Error: {result.error}")

if __name__ == "__main__":
    asyncio.run(main())

Configuring the Browser

Silk supports different browser drivers. You can configure them like this:

from silk.models.browser import BrowserOptions
from silk.browsers.manager import BrowserManager

# Configure browser options
options = BrowserOptions(
    headless=False,  # Set to False to see the browser UI
    browser_name="chromium",  # Choose "chromium", "firefox", or "webkit"
    slow_mo=50,  # Slow down operations by 50ms (useful for debugging)
    viewport={"width": 1280, "height": 800}
)

# Create a manager with specific driver and options
manager = BrowserManager(driver_type="playwright", default_options=options)
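
The configured manager drives pipelines the same way as the default one in the Quick Start. A brief sketch reusing the actions shown above:

from silk.actions.navigation import Navigate
from silk.actions.extraction import GetText

async def run():
    async with BrowserManager(driver_type="playwright", default_options=options) as manager:
        # Call the pipeline with the manager, just as in the Quick Start example
        result = await (Navigate("https://example.com") >> GetText("h1"))(manager)
        print(result.default_value(None))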

Creating Custom Actions

You can easily create your own actions for reusable scraping logic:

from silk.actions.base import Action
from silk.actions.decorators import action
from expression.core import Ok, Error
from silk.models.browser import ActionContext

@action()
async def extract_price(context, selector):
    """Extract and parse a price from the page"""
    page_result = await context.get_page()
    if page_result.is_error():
        return page_result
        
    page = page_result.default_value(None)
    if page is None:
        return Error("No page found")   
    
    element_result = await page.query_selector(selector)
    
    if element_result.is_error():
        return Error(f"Element not found: {selector}")
        
    element = element_result.default_value(None)
    if element is None:
        return Error("No element found")
    
    text_result = await element.get_text()
    
    if text_result.is_error():
        return text_result
        
    text = text_result.default_value(None)
    if text is None:
        return Error("No text found")
    
    try:
        # Remove currency symbol and convert to float
        price = float(text.replace('$', '').strip())
        return Ok(price)
    except ValueError:
        return Error(f"Failed to parse price from: {text}")

Core Concepts

Actions

The fundamental building block in Silk is the Action. An Action represents a pure operation that can be composed with other actions using functional programming patterns. Each Action takes an ActionContext and returns a Result containing either the operation's result or an error.

class FindElement(Action[ElementHandle]):
    """Action to find an element on the page"""
    
    def __init__(self, selector: str):
        self.selector = selector
        
    async def execute(self, context: ActionContext) -> Result[ElementHandle, Exception]:
        try:
            page_result = await context.get_page()
            if page_result.is_error():
                return page_result
                
            page = page_result.default_value(None)
            if page is None:
                return Error("No page found")
            
            return await page.query_selector(self.selector)
        except Exception as e:
            return Error(e)

ActionContext

The ActionContext carries references to the browser, page, and other execution context information. Actions use this context to interact with the browser.
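
Every example in this README follows the same pattern: fetch the current page from the context, then act on it. A condensed sketch of those calls:

# Inside an Action's execute() or an @action-decorated function:
page_result = await context.get_page()  # returns a Result wrapping the page
if page_result.is_ok():
    page = page_result.default_value(None)
    element_result = await page.query_selector("h1")  # also returns a Result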

Result Type

Silk uses the Result[T, E] type from the Expression library for error handling. Rather than relying on exceptions, actions return Ok(value) for success or Error(exception) for failures.
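
A minimal sketch, using only the Result operations that appear in this README:

from expression.core import Ok, Error

def parse_quantity(text):
    try:
        return Ok(int(text))
    except ValueError:
        return Error(f"Not a number: {text!r}")

result = parse_quantity("42")
if result.is_ok():
    print(result.default_value(None))  # 42
else:
    print(result.error)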

Composition Operators

Silk provides powerful operators for composing actions:

  • >> (then): Chain actions sequentially
  • & (and): Run actions in parallel
  • | (or): Try one action, fall back to another if it fails

These operators make it easy to build complex scraping workflows with clear, readable code.
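
As a quick illustration, here is a sketch built only from actions used elsewhere in this README, combining all three operators:

# For each of two pages fetched in parallel, try a primary selector
# and fall back to the page heading if it fails
(
    Navigate("https://example.com/a") >> (GetText(".title") | GetText("h1"))
) & (
    Navigate("https://example.com/b") >> (GetText(".title") | GetText("h1"))
)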

Detailed Examples

Handling Complex Selectors

Silk provides robust ways to handle changing website structures through selector groups. A selector group is a collection of selectors that are tried in order until one succeeds.

from silk.selectors.selector import SelectorGroup, css, xpath

# Create a selector group with fallback options
product_price = SelectorGroup(
    "product_price",
    css(".current-price"),             # Try this first
    css(".product-price .amount"),     # Fall back to this
    xpath("//div[contains(@class, 'price')]//span")  # Last resort
)

# Use it in an extraction action
extract_price = GetText(product_price)

Resilient Scraping with Retry and Fallbacks

from silk.actions.flow import retry, fallback
from silk.actions.extraction import GetText
from silk.actions.navigation import Navigate

# Retry navigation up to 3 times with 2s delay
resilient_navigation = retry(
    Navigate("https://example.com"),
    max_attempts=3,
    delay_ms=2000
)

# Try multiple selectors for extracting data
extract_title = fallback(
    GetText(".main-title"),
    GetText("h1.title"),
    GetText("#product-name")
)

# Combine into a pipeline
pipeline = resilient_navigation >> extract_title

Parallel Extraction

Extract multiple pieces of information at once:

from silk.actions.composition import parallel
from silk.actions.extraction import GetText, GetAttribute

# Extract product details in parallel
product_details = parallel(
    GetText(".product-name"),
    GetText(".product-price"),
    GetAttribute(".product-image", "src"),
    GetText(".product-description")
)

# Use in a pipeline
pipeline = Navigate(product_url) >> product_details

# Results come back as a collection
result = await pipeline(manager)
if result.is_ok():
    details = result.default_value(None)
    if details is None:
        print("No product details found")
    else:
        name, price, image_url, description = details
        print(f"Product: {name}, Price: {price}")

Form Filling and Submission

from silk.actions.input import Fill, Click
from silk.actions.flow import compose

login_action = compose(
    Navigate("https://example.com/login"),
    Fill("#username", "user@example.com"),
    Fill("#password", "password123"),
    Click("button[type='submit']")
)

Handling Dynamic Content

from silk.actions.flow import wait, loop_until
from silk.actions.conditions import ElementExists
from silk.actions.input import Click
from silk.actions.navigation import Navigate
from silk.actions.extraction import GetText

# Wait for dynamic content to load
wait_for_results = wait(1000) >> ElementExists(".search-results-item")

# Loop until a condition is met
load_all_results = loop_until(
    condition=ElementExists(".no-more-results"),
    body=Click(".load-more-button"),
    max_iterations=10,
    delay_ms=1000
)

# Use in a pipeline
search_pipeline = (
    Navigate("https://example.com/search?q=example")
    >> wait_for_results
    >> load_all_results
    >> GetText(".search-results-count")
)

Action Decorator for Custom Functions

Easily convert any function into a composable Action using the @action decorator:

from silk import action, Ok, Error

@action
async def scroll_to_element(driver, selector, smooth=True):
    """Scrolls the page to bring the element into view"""
    try:
        element = await driver.query_selector(selector)
        await element.scroll_into_view({"behavior": "smooth" if smooth else "auto"})
        return Ok("Element scrolled into view")
    except Exception as e:
        return Error(e)

# Use it in a pipeline - the function is now a composable Action!
pipeline = (
    Navigate(url)
    >> scroll_to_element("#my-element")
    >> GetText("#my-element")
)

result = await pipeline(browser)
if result.is_ok():
    print(f"Extracted text after scrolling: {result.default_value(None)}")

Composable Operations

Silk provides intuitive operators for composable scraping:

Sequential Operations (>>)

# Navigate to a page, then click the title element
Navigate(url) >> Click(title_selector)

Parallel Operations (&)

# Navigate to three pages in parallel
# Each action is executed in a new context when using the & operator
Navigate(url) & Navigate(url2) & Navigate(url3)
# Combining parallel and sequential operations
# Each parallel branch can contain its own chain of sequential actions
(
    # First website: Get product details
    (Navigate("https://site1.com/product") 
     >> wait(1000)
     >> GetText(".product-name"))
    &
    # Second website: Search and extract first result
    (Navigate("https://site2.com") 
     >> Fill("#search-input", "smartphone")
     >> Click("#search-button")
     >> wait(2000)
     >> GetText(".first-result .name"))
    &
    # Third website: Login and get account info
    (Navigate("https://site3.com/login")
     >> Fill("#username", "user@example.com")
     >> Fill("#password", "password123")
     >> Click(".login-button")
     >> wait(1500)
     >> GetText(".account-info"))
)
# Results are collected as a Block of 3 items, one from each parallel branch

Fallback Operations (|)

# Try to extract with one selector, fall back to another if it fails
GetText(primary_selector) | GetText(fallback_selector)

Fallback operations are a powerful tool for building resilient scraping pipelines: they try multiple scraping strategies in order and return the first successful result. In combination with SelectorGroups, you can build very robust scraping pipelines.

from silk.actions.navigation import Navigate
from silk.actions.extraction import GetText, GetAttribute, QueryAll, ExtractTable
from silk.actions.input import Click
from silk.actions.flow import wait, retry, fallback
from silk.selectors.selector import SelectorGroup, css, xpath

# Example: Advanced product information scraping with multiple strategies
async def scrape_product(url, manager):
    # Strategy 1: Direct extraction using primary selectors
    primary_strategy = (
        Navigate(url)
        >> GetText(".product-title")
    )
    
    # Strategy 2: Click on a tab first, then extract from revealed content
    secondary_strategy = (
        Navigate(url)
        >> Click(".details-tab")
        >> wait(500)  # Wait for tab content to load
        >> GetText(".tab-content h1")
    )
    
    # Strategy 3: Extract from structured JSON data in script tag
    json_strategy = (
        Navigate(url)
        >> GetAttribute('script[type="application/ld+json"]', "textContent")
        # Additional processing would parse the JSON and extract title
    )
    
    # Combine all strategies with fallback operator
    product_title_pipeline = (
        primary_strategy | secondary_strategy | json_strategy
    )
    
    # Multiple fallback approaches for price extraction
    price_pipeline = (
        # Try special sale price first
        (Navigate(url) >> GetText(".special-price .price-amount"))
        |
        # Then try regular price
        (Navigate(url) >> GetText(".regular-price"))
        |
        # Then try to extract from a pricing table
        (Navigate(url) 
         >> ExtractTable("#pricing-table")
         # Additional processing would extract price from table data
        )
        |
        # Last resort: Try to find price in any element containing "$"
        (Navigate(url)
         >> QueryAll("*:contains('$')")
         # Additional processing would filter and extract price
        )
    )
    
    # Execute both pipelines
    title_result = await product_title_pipeline(manager)
    price_result = await price_pipeline(manager)
    
    return {
        "title": title_result.default_value("Unknown Title"),
        "price": price_result.default_value("Price Unavailable")
    }

# Example with SelectorGroups for even more resilience
def build_robust_product_scraper(url):
    # Create selector groups with multiple options
    title_selectors = SelectorGroup(
        "product_title",
        css(".product-title"),
        css("h1.title"),
        xpath("//div[@class='product-info']//h1"),
        css(".pdp-title")
    )
    
    price_selectors = SelectorGroup(
        "product_price",
        css(".special-price .amount"),
        css(".product-price"),
        xpath("//span[contains(@class, 'price')]"),
        css(".price-info .price")
    )
    
    image_selectors = SelectorGroup(
        "product_image",
        css(".product-image-gallery img"),
        css(".main-image"),
        xpath("//div[contains(@class, 'gallery')]//img")
    )
    
    # Use these groups in a pipeline with retries
    return (
        Navigate(url)
        >> retry(GetText(title_selectors), max_attempts=3, delay_ms=1000)
        >> retry(GetText(price_selectors), max_attempts=3, delay_ms=1000)
        >> retry(GetAttribute(image_selectors, "src"), max_attempts=3, delay_ms=1000)
    )

API Reference

Core Modules

  • silk.actions: Core action classes for browser automation

    • silk.actions.base: Base Action class and core utilities
    • silk.actions.navigation: Actions for navigating between pages
    • silk.actions.extraction: Actions for extracting data from pages
    • silk.actions.input: Actions for interacting with forms and elements
    • silk.actions.flow: Control flow actions like branch, retry, and loop
    • silk.actions.composition: Utilities for composing actions (sequence, parallel, pipe)
    • silk.actions.decorators: Decorators like @action for creating custom actions
  • silk.browsers: Browser management and abstraction layer

    • silk.browsers.manager: BrowserManager for session handling
    • silk.browsers.driver: Abstract BrowserDriver interface
    • silk.browsers.element: ElementHandle for working with DOM elements
  • silk.selectors: Selector utilities

    • silk.selectors.selector: Selector and SelectorGroup classes
  • silk.models: Data models using Pydantic

    • silk.models.browser: BrowserOptions, ActionContext, etc.

Common Action Classes

  • Navigation

    • Navigate(url): Navigate to a URL
    • Reload(): Reload the current page
    • GoBack(): Navigate back in history
    • GoForward(): Navigate forward in history
  • Extraction

    • Query(selector): Find an element
    • QueryAll(selector): Find all matching elements
    • GetText(selector): Extract text from an element
    • GetAttribute(selector, attribute): Get an attribute value
    • GetHtml(selector, outer=True): Get element HTML
    • ExtractTable(table_selector): Extract data from an HTML table
  • Input

    • Click(target): Click an element
    • DoubleClick(target): Double-click an element
    • Fill(target, text): Fill a form field
    • Type(target, text): Type text (alias for Fill)
    • Select(target, value/text): Select an option from a dropdown
    • MouseMove(target): Move the mouse to an element
    • KeyPress(key, modifiers): Press a key or key combination
  • Flow Control

    • branch(condition, if_true, if_false): Conditional branching
    • loop_until(condition, body, max_iterations): Loop until condition is met
    • retry(action, max_attempts, delay_ms): Retry an action on failure
    • retry_with_backoff(action): Retry with exponential backoff
    • with_timeout(action, timeout_ms): Apply a timeout to an action
  • Composition

    • sequence(*actions): Run actions in sequence, collect all results
    • parallel(*actions): Run actions in parallel, collect all results
    • pipe(*actions): Create a pipeline where each action uses the previous result
    • fallback(*actions): Try actions in sequence until one succeeds
    • compose(*actions): Compose actions sequentially, return only the last result
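
To make the differences between the sequential combinators concrete, here is an illustrative sketch based on the descriptions above (return shapes are indicative, not exact):

from silk.actions.composition import sequence, parallel
from silk.actions.flow import compose

steps = (GetText(".name"), GetText(".price"))

sequence(*steps)  # runs in order, result collects both texts
parallel(*steps)  # runs concurrently, result collects both texts
compose(Navigate(url), GetText(".name"))  # runs in order, keeps only the last result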

For a complete API reference, please see the API documentation.

Best Practices

Error Handling

Silk uses Railway-Oriented Programming for error handling. Instead of using try/except, leverage the Result type:

result = await pipeline(manager)
if result.is_ok():
    data = result.default_value(None)
    # Process the data
else:
    # Handle the error
    error = result.error
    logger.error(f"Scraping failed: {error}")

Browser Resources

Always use context managers to ensure browser resources are properly cleaned up:

async with BrowserManager() as manager:
    # Your scraping code here
    pass  # Resources automatically cleaned up

Selector Resilience

Use selector groups for resilient scraping that can handle UI changes:

# Instead of a single brittle selector:
extract_price = GetText(".price-box .price")

# Use a group with fallbacks:
price_selector = SelectorGroup(
    "price",
    css(".price-box .price"),
    css(".product-price"),
    xpath("//span[contains(@class, 'price')]")
)
extract_price = GetText(price_selector)

Action Composition

Build reusable pipelines through composition instead of large monolithic functions:

# Define reusable components
navigate_to_product = Navigate("https://example.com/product")
extract_product_info = parallel(
    GetText(".product-name"),
    GetText(".product-price"),
    GetText(".product-description")
)
extract_related_products = QueryAll(".related-product") >> extract_text_from_elements

# Compose them in different ways
full_scraper = navigate_to_product >> extract_product_info >> extract_related_products
minimal_scraper = navigate_to_product >> extract_product_info

Logging

Enable logging to better debug your scraping pipelines:

import logging
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("silk").setLevel(logging.DEBUG)

Contributing

Contributions to Silk are welcome! Please feel free to submit a Pull Request.

Development Setup

  1. Clone the repository

    git clone https://github.com/galaddirie/silk.git
    cd silk
    
  2. Install Poetry (if not already installed)

    curl -sSL https://install.python-poetry.org | python3 -
    
  3. Install dependencies

    poetry install --all-extras
    
  4. Activate the virtual environment

    poetry shell
    
  5. Run tests

    poetry run pytest
    

Guidelines

  • Follow PEP 8 and use Black for code formatting
  • Write tests for new features
  • Keep the functional programming paradigm in mind
  • Update documentation with new features

Acknowledgements

Silk builds upon several excellent libraries, including Expression (functional programming and Result types), Pydantic (data models and validation), and the Playwright, Selenium, and Puppeteer browser automation tools.

Roadmap

  • Initial release with Playwright support
  • Improve parallel execution
  • Support multiple actions in parallel in the same context/page, e.g. (GetText & GetAttribute & GetHtml), in an ergonomic way
  • Selenium integration
  • Puppeteer integration
  • Add examples
  • Support mapped tasks similar to Airflow tasks, e.g. (QueryAll >> GetText[]) where GetText is applied to each element in the collection
  • Add proxy options
  • Explore stealth options for browser automation (implement Patchright, nodriver, driverless, etc.)
  • Add dependency review
  • Support for task dependencies
  • Action signature validation
  • Data extraction DSL for declarative scraping
  • Support computer-use agents (browser-use, OpenAI CUA, Claude computer use)
  • Enhanced caching mechanisms
  • Distributed scraping support
  • Rate limiting and polite scraping utilities
  • Integration with popular data processing libraries (Pandas, etc.)
  • CLI tool for quick scraping tasks

FAQ

How does Silk compare to other scraping libraries?

Silk differs from traditional scraping libraries like Scrapy, Beautiful Soup, or plain Selenium/Playwright in its functional approach. While these tools focus on imperative code with callbacks and exceptions, Silk embraces functional composition, immutable data structures, and Railway-Oriented Programming for cleaner, more maintainable code.

Can I use Silk with my existing Playwright/Selenium code?

Yes, Silk is designed to work alongside existing browser automation code. You can gradually adopt Silk's patterns while keeping your existing code.

Is Silk suitable for large-scale scraping?

Absolutely. Silk's composable nature makes it excellent for large-scale scraping projects. Its built-in error handling, retries, and parallel execution capabilities are particularly valuable for robust production systems.

How can I handle authentication in Silk?

You can handle authentication like any other browser interaction:

login_action = compose(
    Navigate("https://example.com/login"),
    Fill("#username", "user@example.com"),
    Fill("#password", "password123"),
    Click("button[type='submit']"),
    wait(1000)  # Wait for login to complete
)

# Then use the authenticated context for further actions
pipeline = login_action >> Navigate("https://example.com/protected-content") >> GetText("#protected-data")

You can also save and reuse authentication state with browser context options.

License

Silk is released under the MIT License. See the LICENSE file for details.
