An asynchronous web scraper using Playwright with HTML to Markdown conversion

GhostScraper

GhostScraper is an asynchronous web scraping library built on top of Playwright that makes it easy to fetch web pages and convert them to Markdown. It handles browser management and retries, and provides a clean interface for working with web content.

Features

  • Asynchronous web scraping with Playwright
  • HTML to Markdown conversion
  • Built-in retry mechanism with exponential backoff
  • Result caching using JSONCache
  • Smart content extraction
  • Support for multiple browser types (Chromium, Firefox, WebKit)

Installation

pip install ghostscraper

GhostScraper automatically installs and manages the required browsers on first run.

Basic Usage

import asyncio
from ghostscraper import GhostScraper

async def main():
    # Create a scraper instance
    scraper = GhostScraper(url="https://example.com")
    
    # Get the HTML content
    html = await scraper.html()
    
    # Get the Markdown converted content
    markdown = await scraper.markdown()
    
    # Get the response code
    status_code = await scraper.response_code()
    
    print(f"Status code: {status_code}")
    print(f"Markdown content:\n{markdown}")

# Run the async function
asyncio.run(main())

API Reference

GhostScraper

The main class for scraping and converting web content.

Constructor

GhostScraper(
    url: str = "",
    clear_cache: bool = False,
    markdown_options: Optional[Dict[str, Any]] = None,
    **kwargs
)
  • url: The URL to scrape
  • clear_cache: Whether to clear the cache before scraping
  • markdown_options: Options for the Markdown converter
  • **kwargs: Additional arguments passed to the PlaywrightScraper

Methods

  • async html() -> str: Get the HTML content of the URL
  • async response_code() -> int: Get the HTTP response code
  • async markdown() -> str: Get the content converted to Markdown
  • async soup() -> BeautifulSoup: Get a BeautifulSoup object for the HTML content (see the example below)
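
For example, soup() combines GhostScraper's fetching with BeautifulSoup's parsing API. The short sketch below uses only the methods documented above to extract every link from a page:

import asyncio
from ghostscraper import GhostScraper

async def main():
    scraper = GhostScraper(url="https://example.com")

    # soup() returns a BeautifulSoup object built from the fetched HTML,
    # so the full BeautifulSoup API is available for extraction
    soup = await scraper.soup()
    for link in soup.find_all("a"):
        print(link.get("href"))

asyncio.run(main())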

**kwargs Keywords

The GhostScraper constructor accepts any keyword arguments and passes them directly to the underlying PlaywrightScraper. This allows you to customize the browser behavior without directly interacting with the PlaywrightScraper class.

# All of these keyword arguments are passed through to PlaywrightScraper
scraper = GhostScraper(
    url="https://example.com",
    browser_type="chromium",     # Browser to use: "chromium", "firefox", or "webkit"
    headless=True,               # Run browser in headless mode
    browser_args={},             # Arguments for browser launcher
    context_args={},             # Arguments for browser context
    max_retries=3,               # Maximum retry attempts
    backoff_factor=2.0,          # Exponential backoff factor
    network_idle_timeout=10000,  # Network idle timeout (ms)
    load_timeout=30000,          # Page load timeout (ms)
    wait_for_selectors=[]        # CSS selectors to wait for
)

These keyword arguments configure how the page is loaded, browser behavior, and retry mechanisms.
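
To make the retry settings concrete, the snippet below prints the wait times a typical exponential backoff schedule would produce. The formula delay = backoff_factor ** attempt is an assumption for illustration; GhostScraper's exact delay calculation is not documented here.

# Hypothetical backoff schedule for max_retries=3, backoff_factor=2.0.
# Assumes delay = backoff_factor ** attempt, a common scheme; the
# library's internal formula may differ.
max_retries = 3
backoff_factor = 2.0

for attempt in range(max_retries):
    delay = backoff_factor ** attempt
    print(f"Attempt {attempt + 1} failed: waiting {delay:.1f}s before retrying")

With these values, the waits would be 1.0 s, 2.0 s, and 4.0 s.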

Advanced Usage

Custom Markdown Options

from ghostscraper import GhostScraper

# Configure the Markdown converter
markdown_options = {
    "strip_tags": ["script", "style", "nav", "footer", "header", "aside"],
    "keep_tags": ["article", "main", "div", "section", "p"],
    "content_selectors": ["article", "main", ".content", "#content"],
    "preserve_images": True,
    "preserve_links": True,
    "preserve_tables": True,
    "include_title": True,
    "compact_output": False
}

# Create a scraper with custom Markdown options
scraper = GhostScraper(
    url="https://example.com",
    markdown_options=markdown_options
)
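
Continuing from the block above, converting with the custom options is the same single call as in the basic example; markdown() applies the configured options when generating its output:

import asyncio

# Uses the scraper configured with markdown_options above
markdown = asyncio.run(scraper.markdown())
print(markdown)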

Custom Browser Configuration

from ghostscraper import GhostScraper

# Create a scraper with custom browser settings
scraper = GhostScraper(
    url="https://example.com",
    # Browser configuration options (passed to PlaywrightScraper)
    browser_type="firefox",                         # Use Firefox instead of Chromium
    headless=False,                                 # Show the browser window
    max_retries=5,                                  # Increase retry attempts
    load_timeout=60000,                             # Increase load timeout to 60 seconds
    wait_for_selectors=[".content", ".main-article"] # Wait for these selectors
)

# You can also pass browser-specific arguments
scraper = GhostScraper(
    url="https://example.com",
    browser_args={
        "proxy": {                                  # Set up a proxy
            "server": "http://myproxy.com:8080",
            "username": "user",
            "password": "pass"
        },
        "slow_mo": 50,                              # Slow down browser operations by 50ms
    },
    context_args={
        "user_agent": "Custom User Agent",          # Set a custom user agent
        "viewport": {"width": 1920, "height": 1080} # Set viewport size
    }
)

Progressive Loading Strategy

GhostScraper uses a progressive loading strategy that tries different methods to load the page:

  1. First tries networkidle - waits until the network is idle
  2. If that fails, tries load - waits for the page's load event
  3. If that fails, tries domcontentloaded - waits until the DOM content has loaded

This ensures maximum compatibility with different websites.
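
Conceptually, the fallback chain looks like the sketch below, written directly against Playwright's async API. This is an illustration of the strategy only, not GhostScraper's actual implementation:

from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout

async def load_with_fallback(url: str) -> str:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        # Try the strictest wait condition first, then fall back to weaker ones
        for wait_until in ("networkidle", "load", "domcontentloaded"):
            try:
                await page.goto(url, wait_until=wait_until, timeout=30000)
                break
            except PlaywrightTimeout:
                continue  # try the next, more permissive condition
        html = await page.content()
        await browser.close()
        return html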

Browser Installation

GhostScraper automatically checks if the required browser is installed and installs it if needed:

# Install browsers manually if needed
from ghostscraper import install_browser

# Install a specific browser type
install_browser("chromium")
install_browser("firefox")
install_browser("webkit")

Using Caching

By default, GhostScraper caches results in the data/ghostscraper directory. To clear the cache:

# Clear cache for a specific URL
scraper = GhostScraper(url="https://example.com", clear_cache=True)
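
Because results are cached, a second scraper pointed at the same URL can return content without reopening a browser. The sketch below assumes the default cache behavior described above:

import asyncio
from ghostscraper import GhostScraper

async def main():
    # First run fetches the page and writes the result to the cache
    first = GhostScraper(url="https://example.com")
    await first.markdown()

    # A second instance for the same URL is served from the cache
    second = GhostScraper(url="https://example.com")
    print(await second.markdown())

asyncio.run(main())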

License

MIT
