
Ghostscraper

A Playwright-based web scraper with persistent caching, automatic browser installation, and multiple output formats.

Changelog

v0.3.0 (Latest)

  • Added DynamoDB L2 cache support for cross-machine cache sharing
  • Simplified logging to boolean (logging=True/False)
  • Added dynamodb_table parameter to GhostScraper and scrape_many()

v0.2.1

  • Fixed RuntimeError when browser installation check runs within an active event loop
  • Improved compatibility with Linux and other Unix-like systems

v0.2.0

  • Initial stable release

Features

  • Headless Browser Scraping: Uses Playwright for reliable scraping of JavaScript-heavy websites
  • Parallel Scraping: Scrape multiple URLs concurrently with shared browser instances
  • Persistent Caching: Stores scraped data between runs for improved performance
  • DynamoDB L2 Cache: Optional cross-machine cache sharing via AWS DynamoDB
  • Automatic Browser Installation: Self-installs required browsers
  • Multiple Output Formats: HTML, Markdown, Plain Text, BeautifulSoup
  • Boolean Logging: Enable/disable logging with logging=True/False
  • Error Handling: Robust retry mechanism with exponential backoff
  • Asynchronous API: Modern async/await interface
  • Type Hints: Full type annotation support for better IDE integration

Installation

pip install ghostscraper

Basic Usage

Simple Scraping

import asyncio
from ghostscraper import GhostScraper

async def main():
    # Initialize the scraper
    scraper = GhostScraper(url="https://example.com")
    
    # Get the HTML content
    html = await scraper.html()
    print(html)
    
    # Get plain text content
    text = await scraper.text()
    print(text)
    
    # Get markdown version
    markdown = await scraper.markdown()
    print(markdown)

# Run the async function
asyncio.run(main())

Batch Scraping (Parallel)

import asyncio
from ghostscraper import GhostScraper

async def main():
    urls = [
        "https://example.com",
        "https://www.python.org",
        "https://github.com"
    ]
    
    # Scrape multiple URLs in parallel with a shared browser
    scrapers = await GhostScraper.scrape_many(
        urls=urls,
        max_concurrent=3,
        logging=True
    )
    
    # Access results from each scraper
    for scraper in scrapers:
        text = await scraper.text()
        print(f"{scraper.url}: {len(text)} characters")

asyncio.run(main())

With Custom Options

import asyncio
from ghostscraper import GhostScraper

async def main():
    # Initialize with custom options
    scraper = GhostScraper(
        url="https://example.com",
        browser_type="firefox",  # Use Firefox instead of default Chromium
        headless=False,          # Show the browser window
        load_timeout=60000,      # 60 seconds timeout
        clear_cache=True,        # Clear previous cache
        ttl=1,                   # Cache for 1 day
        logging=True             # Enable logging
    )
    
    # Get the HTML content
    html = await scraper.html()
    print(html)

asyncio.run(main())

With DynamoDB Cache

import asyncio
from ghostscraper import GhostScraper

async def main():
    # Single scraper with DynamoDB L2 cache
    scraper = GhostScraper(
        url="https://example.com",
        dynamodb_table="my-cache-table"  # Requires AWS credentials
    )
    html = await scraper.html()

    # Batch scraping with DynamoDB
    scrapers = await GhostScraper.scrape_many(
        urls=["https://example.com", "https://python.org"],
        dynamodb_table="my-cache-table"
    )

asyncio.run(main())

API Reference

GhostScraper

The main class for web scraping with persistent caching.

Constructor

GhostScraper(
    url: str = "",
    clear_cache: bool = False,
    ttl: int = 999,
    markdown_options: Optional[Dict[str, Any]] = None,
    logging: bool = True,
    dynamodb_table: Optional[str] = None,
    **kwargs
)

Parameters:

  • url (str): The URL to scrape.
  • clear_cache (bool): Whether to clear existing cache on initialization.
  • ttl (int): Time-to-live for cached data in days.
  • markdown_options (Dict[str, Any]): Options for HTML to Markdown conversion.
  • logging (bool): Enable/disable logging. Default: True.
  • dynamodb_table (str): DynamoDB table name for cross-machine caching. Default: None.
  • **kwargs: Additional options passed to PlaywrightScraper.

Playwright Options (passed via kwargs; see the sketch after this list):

  • browser_type (str): Browser engine to use, one of "chromium", "firefox", or "webkit". Default: "chromium".
  • headless (bool): Whether to run the browser in headless mode. Default: True.
  • browser_args (Dict[str, Any]): Additional arguments to pass to the browser.
  • context_args (Dict[str, Any]): Additional arguments to pass to the browser context.
  • max_retries (int): Maximum number of retry attempts. Default: 3.
  • backoff_factor (float): Factor for exponential backoff between retries. Default: 2.0.
  • network_idle_timeout (int): Milliseconds to wait for network to be idle. Default: 10000 (10 seconds).
  • load_timeout (int): Milliseconds to wait for page to load. Default: 30000 (30 seconds).
  • wait_for_selectors (List[str]): CSS selectors to wait for before considering page loaded.
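
A hedged sketch passing several of these options through GhostScraper's kwargs. The browser_args payload is illustrative only: its keys follow Playwright's browser launch arguments, of which "args" is a common one.

from ghostscraper import GhostScraper

scraper = GhostScraper(
    url="https://example.com",
    browser_type="chromium",
    max_retries=5,
    backoff_factor=1.5,
    network_idle_timeout=15000,                # wait up to 15 s for network idle
    browser_args={"args": ["--disable-gpu"]},  # illustrative launch flags
)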

Methods

async html() -> str

Returns the raw HTML content of the page.

async response_code() -> int

Returns the HTTP response code from the page request.

async markdown() -> str

Returns the page content converted to Markdown.

async article() -> newspaper.Article

Returns a newspaper.Article object with parsed content.

async text() -> str

Returns the plain text content of the page.

async authors() -> str

Returns the detected authors of the content.

async soup() -> BeautifulSoup

Returns a BeautifulSoup object for the page.
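
A short tour of these accessors, as a minimal sketch (printed values will vary by page):

import asyncio
from ghostscraper import GhostScraper

async def main():
    scraper = GhostScraper(url="https://example.com")

    # Each accessor returns a different view of the same page
    status = await scraper.response_code()
    soup = await scraper.soup()
    article = await scraper.article()
    authors = await scraper.authors()

    print(status)                                         # e.g. 200
    print(soup.title.string if soup.title else "<none>")  # page <title>
    print(article.title)                                  # newspaper-parsed title
    print(authors)

asyncio.run(main())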

@classmethod async scrape_many(urls: List[str], max_concurrent: int = 5, logging: bool = True, **kwargs) -> List[GhostScraper]

Scrape multiple URLs in parallel using a shared browser instance.

Parameters:

  • urls (List[str]): List of URLs to scrape.
  • max_concurrent (int): Maximum number of concurrent page loads. Default: 5.
  • logging (bool): Enable/disable logging. Default: True.
  • **kwargs: Additional options passed to GhostScraper and PlaywrightScraper.

Returns: List of GhostScraper instances with cached results.

PlaywrightScraper

Low-level browser automation class used by GhostScraper.

Constructor

PlaywrightScraper(
    url: str = "",
    browser_type: Literal["chromium", "firefox", "webkit"] = "chromium",
    headless: bool = True,
    browser_args: Optional[Dict[str, Any]] = None,
    context_args: Optional[Dict[str, Any]] = None,
    max_retries: int = 3,
    backoff_factor: float = 2.0,
    network_idle_timeout: int = 10000,
    load_timeout: int = 30000,
    wait_for_selectors: Optional[List[str]] = None,
    logging: bool = True
)

Parameters: Same as the Playwright options listed under GhostScraper above, plus url and logging.

Methods

async fetch() -> Tuple[str, int]

Fetches the page and returns a tuple of (html_content, status_code).

async fetch_url(url: str) -> Tuple[str, int]

Fetches a specific URL using the shared browser instance.

async fetch_many(urls: List[str], max_concurrent: int = 5) -> List[Tuple[str, int]]

Fetches multiple URLs in parallel using a shared browser instance with concurrency control.

async fetch_and_close() -> Tuple[str, int]

Fetches the page, closes the browser, and returns a tuple of (html_content, status_code).

async close() -> None

Closes the browser and Playwright resources.

async check_and_install_browser() -> bool

Checks if the required browser is installed, and installs it if not. Returns True if successful.
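
A minimal sketch of driving the low-level class directly, assuming PlaywrightScraper is importable from the top-level ghostscraper package:

import asyncio
from ghostscraper import PlaywrightScraper

async def main():
    scraper = PlaywrightScraper(url="https://example.com", headless=True)

    # Ensure the browser binary exists before the first fetch
    await scraper.check_and_install_browser()

    # fetch_and_close() returns (html, status) and releases browser resources
    html, status = await scraper.fetch_and_close()
    print(status, len(html))

asyncio.run(main())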

Advanced Usage

Configuring Global Defaults

from ghostscraper import ScraperDefaults

# Modify defaults for all future scraper instances
ScraperDefaults.MAX_CONCURRENT = 20
ScraperDefaults.LOGGING = False
ScraperDefaults.HEADLESS = False
ScraperDefaults.LOAD_TIMEOUT = 30000
ScraperDefaults.DYNAMODB_TABLE = "my-cache-table"

Batch Scraping with Options

import asyncio
from ghostscraper import GhostScraper

async def main():
    urls = [f"https://example.com/page{i}" for i in range(1, 11)]
    
    # Scrape with custom options
    scrapers = await GhostScraper.scrape_many(
        urls=urls,
        max_concurrent=5,
        browser_type="chromium",
        headless=True,
        load_timeout=60000,
        ttl=7,  # Cache for 7 days
        logging=True
    )
    
    # Process results
    for scraper in scrapers:
        markdown = await scraper.markdown()
        print(f"Scraped {scraper.url}")

asyncio.run(main())

Custom Browser Configurations

from ghostscraper import GhostScraper

# Set up a browser with custom viewport size and user agent
browser_context_args = {
    "viewport": {"width": 1920, "height": 1080},
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

scraper = GhostScraper(
    url="https://example.com",
    context_args=browser_context_args
)

Waiting for Dynamic Content

from ghostscraper import GhostScraper

# Wait for specific elements to load before considering the page ready
scraper = GhostScraper(
    url="https://example.com/dynamic-page",
    wait_for_selectors=["#content", ".product-list", "button.load-more"]
)

Custom Markdown Options

from ghostscraper import GhostScraper

# Customize the markdown conversion
markdown_options = {
    "ignore_links": True,
    "ignore_images": True,
    "bullet_character": "*"
}

scraper = GhostScraper(
    url="https://example.com",
    markdown_options=markdown_options
)

Browser Management

from ghostscraper import check_browser_installed, install_browser
import asyncio

async def setup_browsers():
    # Check if browsers are installed
    chromium_installed = await check_browser_installed("chromium")
    firefox_installed = await check_browser_installed("firefox")
    
    # Install browsers if needed
    if not chromium_installed:
        install_browser("chromium")
    
    if not firefox_installed:
        install_browser("firefox")

asyncio.run(setup_browsers())

Performance Considerations

  • Use caching effectively by setting appropriate TTL values
  • Use scrape_many() for batch scraping to share browser instances and reduce memory usage
  • Adjust max_concurrent based on your system resources and target website rate limits
  • Consider browser memory usage when scraping multiple pages
  • For best performance, use "chromium" as it's generally the fastest engine
  • Use logging=False for production to minimize overhead (see the combined sketch below)
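
For example, a cache-friendly production batch might combine these settings (values are illustrative, not universal recommendations):

import asyncio
from ghostscraper import GhostScraper

async def main():
    urls = [f"https://example.com/page{i}" for i in range(1, 51)]

    scrapers = await GhostScraper.scrape_many(
        urls=urls,
        max_concurrent=10,  # tune to your machine and the site's rate limits
        ttl=30,             # keep cached pages for 30 days
        logging=False,      # minimize logging overhead in production
    )
    for scraper in scrapers:
        print(scraper.url, await scraper.response_code())

asyncio.run(main())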

Error Handling

GhostScraper uses a progressive loading strategy:

  1. First attempts with "networkidle" (most reliable)
  2. Falls back to "load" event if timeout occurs
  3. Finally tries "domcontentloaded" (fastest but least complete)

If all strategies fail, it will retry up to max_retries with exponential backoff.
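
Retries are controlled by the max_retries and backoff_factor options documented above; a minimal sketch (the exact wait schedule is implementation-defined):

from ghostscraper import GhostScraper

# With backoff_factor=2.0 the delay between attempts roughly doubles each
# retry; max_retries caps how many times a failed load is retried.
scraper = GhostScraper(
    url="https://example.com/flaky-page",
    max_retries=4,
    backoff_factor=2.0,
    load_timeout=45000,  # give slow pages 45 s before the fallback kicks in
)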

License

This project is licensed under the MIT License.

Dependencies

  • playwright
  • beautifulsoup4
  • html2text
  • newspaper4k
  • python-slugify
  • logorator
  • cacherator
  • lxml_html_clean

Contributing

Contributions are welcome! Visit the GitHub repository: https://github.com/Redundando/ghostscraper
