
Ghostscraper

A Playwright-based web scraper with persistent caching, parallel scraping, progress callbacks, automatic browser installation, and multiple output formats.

Changelog

See CHANGELOG.md for the full version history.

Overview

GhostScraper wraps Playwright with persistent JSON caching (via cacherator.JSONCache), retry logic, and multiple output formats. The primary interface is GhostScraper. PlaywrightScraper is the lower-level browser automation layer used internally.

All scraping methods are async. Results (html, response_code, response_headers, redirect_chain) are cached to disk on first fetch and restored on subsequent instantiation with the same URL. Derived outputs (markdown, text, authors, soup, seo, article) are computed in-memory from the cached HTML and are not persisted.

Installation

pip install ghostscraper

GhostScraper

Constructor

GhostScraper(
    url: str = "",
    clear_cache: bool = False,
    ttl: int = 999,
    markdown_options: Optional[Dict[str, Any]] = None,
    logging: bool = True,
    dynamodb_table: Optional[str] = None,
    on_progress: Optional[Callable] = None,
    **kwargs  # passed to PlaywrightScraper
)

Parameters:

  • url: The URL to scrape.
  • clear_cache: Clear existing cache on initialization. Default: False.
  • ttl: Cache time-to-live in days. Default: 999.
  • markdown_options: Options forwarded to html2text.HTML2Text for Markdown conversion.
  • logging: Enable/disable logging. Default: True.
  • dynamodb_table: DynamoDB table name for cross-machine caching. Replaces local cache when set. Default: None.
  • on_progress: Callback fired at key scraping events. Accepts sync and async callables. Default: None.
  • **kwargs: Forwarded to PlaywrightScraper (see below).
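For example, markdown_options can carry standard html2text.HTML2Text attributes such as ignore_links, ignore_images, and body_width (the attribute names come from html2text itself; how GhostScraper applies them is an internal detail):

```python
# Hypothetical option set; the keys are standard html2text.HTML2Text
# attributes, forwarded by GhostScraper during Markdown conversion.
markdown_options = {
    "ignore_links": True,    # drop <a href> targets from the Markdown
    "ignore_images": True,   # drop <img> tags entirely
    "body_width": 0,         # disable hard line wrapping
}

# scraper = GhostScraper(url="https://example.com", markdown_options=markdown_options)
```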

PlaywrightScraper kwargs:

  • browser_type: "chromium" | "firefox" | "webkit". Default: "chromium".
  • headless: Run browser headlessly. Default: True.
  • browser_args: Extra args passed to browser.launch().
  • context_args: Extra args passed to browser.new_context().
  • max_retries: Retry attempts per URL. Default: 3.
  • backoff_factor: Exponential backoff multiplier. Default: 2.0.
  • network_idle_timeout: Timeout in ms for the networkidle strategy. Default: 3000.
  • load_timeout: Timeout in ms for all other strategies. Default: 20000.
  • wait_for_selectors: CSS selectors to wait for after page load.
  • load_strategies: Loading strategy chain. Default: ["load", "networkidle", "domcontentloaded"].
  • no_retry_on: Status codes that abort retries immediately (e.g. [404, 410]). Default: None.

Instance Attributes

  • url (str): The URL this scraper was initialized with.
  • error (Exception | None): Set when a fetch fails under fail_fast=False in scrape_many. When set, html() returns "" and response_code() returns None.

Methods

All methods trigger a fetch (or cache restore) on first call. Subsequent calls return the cached/computed value.

async html() -> str

Raw HTML of the page. Returns "" if self.error is set.

async response_code() -> Optional[int]

HTTP status code. Returns None if self.error is set.

async response_headers() -> dict

HTTP response headers from the final response. Cached alongside HTML.

async redirect_chain() -> list

Full list of responses during navigation. Each entry: {"url": str, "status": int}. Cached alongside HTML.

[
    {"url": "https://example.com/old", "status": 301},
    {"url": "https://example.com/new", "status": 200},
]

async final_url() -> str

URL of the last entry in redirect_chain(). Falls back to self.url if chain is empty.

async markdown() -> str

Page content converted to Markdown via html2text. Respects markdown_options.

async text() -> str

Plain text extracted via newspaper4k.

async authors() -> list

Authors detected by newspaper4k.

async article() -> newspaper.Article

Full newspaper.Article object with parsed content.

async soup() -> BeautifulSoup

BeautifulSoup object parsed from the HTML.

async seo() -> dict

SEO metadata parsed from the HTML. Keys whose corresponding tags are absent are omitted.

{
    "title": str,           # <title>
    "description": str,     # <meta name="description">
    "canonical": str,       # <link rel="canonical">
    "robots": {             # <meta name="robots"> — directives as keys
        "noindex": True,
        "nofollow": True,
    },
    "googlebot": { ... },   # <meta name="googlebot">, same shape as robots
    "og": {                 # <meta property="og:*"> keyed by suffix
        "title": str,
        "description": str,
        "image": str,
        "url": str,
    },
    "twitter": { ... },     # <meta name="twitter:*">, same pattern
    "hreflang": {           # <link rel="alternate" hreflang="..."> — values are lists
        "en-us": ["https://..."],
        "de": ["https://..."],
    }
}
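Because absent tags mean absent keys, consumers should read the result defensively. A sketch against a hypothetical result dict (in practice, seo = await scraper.seo()):

```python
# Hypothetical seo() result for a page with no canonical link or og:url.
seo = {
    "title": "Example Domain",
    "robots": {"noindex": True},
    "og": {"title": "Example", "image": "https://example.com/og.jpg"},
}

title = seo.get("title", "")                       # "" when <title> is missing
noindex = seo.get("robots", {}).get("noindex", False)
og_image = seo.get("og", {}).get("image")          # None when og:image is absent
canonical = seo.get("canonical", seo.get("og", {}).get("url"))
```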

@classmethod async fetch_bytes(url, cache=False, clear_cache=False, ttl=999, dynamodb_table=None, logging=True, **kwargs) -> Tuple[bytes, int, dict]

Fetch a URL as raw bytes using the Playwright browser context (inherits UA, cookies, and headers). Useful for CDN-protected resources that block plain HTTP clients.

Parameters:

  • url: URL to fetch.
  • cache: Persist result to disk/DynamoDB. Default: False.
  • clear_cache: Force re-fetch even if cached. Default: False.
  • ttl: Cache TTL in days. Default: 999.
  • dynamodb_table: DynamoDB table for cross-machine caching. Default: None.
  • logging: Enable logging. Default: True.
  • **kwargs: Forwarded to PlaywrightScraper (max_retries, browser_type, no_retry_on, etc.).

Returns: Tuple[bytes, int, dict] of (body, status_code, headers).

@classmethod async scrape_many(urls, max_concurrent=15, logging=True, fail_fast=True, on_scraped=None, **kwargs) -> List[GhostScraper]

Scrape multiple URLs in parallel using a single shared browser instance.

Parameters:

  • urls: List of URLs to scrape.
  • max_concurrent: Max concurrent page loads. Default: 15.
  • logging: Enable logging. Default: True.
  • fail_fast: If True, any unhandled exception aborts the entire batch. If False, failed scrapers have scraper.error set; for those, html() returns "" and response_code() returns None. Default: True.
  • on_scraped: Async or sync callback invoked immediately after each URL is fetched and cached — before the batch returns. Receives the fully-populated GhostScraper instance. Fires for cached URLs too. Default: None.
  • on_progress: Progress callback (sync or async). Default: None.
  • **kwargs: Forwarded to GhostScraper and PlaywrightScraper.

Returns: List[GhostScraper] in the same order as urls. Already-cached URLs skip the network fetch but are still included in the result.

Caching

  • Cached fields: _html, _response_code, _response_headers, _redirect_chain.
  • Cache key: slugified URL.
  • Cache location: data/ghostscraper/ (configurable via ScraperDefaults.CACHE_DIRECTORY).
  • DynamoDB cache replaces local cache when dynamodb_table is set. Requires AWS credentials.
  • clear_cache=True forces a fresh fetch and overwrites the cache.
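The exact key format is an implementation detail of python-slugify; a rough stdlib approximation of how a URL becomes a filesystem-safe cache key (the function name slug_key is hypothetical):

```python
import re

def slug_key(url: str) -> str:
    """Approximate python-slugify: lowercase, collapse non-alphanumerics to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")

slug_key("https://example.com/some/page?q=1")
# 'https-example-com-some-page-q-1'
```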

Loading Strategies

GhostScraper tries Playwright loading strategies in order, falling back on timeout:

  1. load — waits for the load event. Works for most sites.
  2. networkidle — waits until no network activity for 500ms. Better for JS-heavy pages. Uses network_idle_timeout.
  3. domcontentloaded — waits only for HTML parsing. Fastest, least complete.

If all strategies fail, the attempt is retried up to max_retries times with exponential backoff (backoff_factor ** attempt seconds).
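With the defaults (max_retries=3, backoff_factor=2.0), the waits between attempts work out as follows (a sketch of the formula above, assuming attempts are numbered from 1 and no wait follows the final attempt):

```python
backoff_factor = 2.0
max_retries = 3

# One wait after each failed attempt except the last: 2.0s, then 4.0s.
delays = [backoff_factor ** attempt for attempt in range(1, max_retries)]
# [2.0, 4.0]
```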

Override the chain via load_strategies:

scrapers = await GhostScraper.scrape_many(urls=urls, load_strategies=["domcontentloaded"])

# Or globally:
from ghostscraper import ScraperDefaults
ScraperDefaults.LOAD_STRATEGIES = ["domcontentloaded"]

Progress Callbacks

Pass on_progress to receive real-time events. Accepts sync and async callables. Errors raised inside the callback are logged (when logging=True) and swallowed — they will not abort a scrape.

scraper = GhostScraper(url="https://example.com", on_progress=lambda e: print(e["event"]))

Each event is a dict with event (str) and ts (Unix timestamp). Additional fields:

  • started: url. Fired before the fetch begins.
  • browser_installing: browser. First run only; sync callbacks only.
  • browser_ready: browser. Browser check passed.
  • loading_strategy: url, strategy, attempt, max_retries, timeout. Fired once per strategy attempt.
  • retry: url, attempt, max_retries, plus optional reason and status_code. Fires only when another attempt follows.
  • page_loaded: url, completed, total, status_code, scraper. Fires on success or error status; scraper is present only in scrape_many; fires for cached URLs too.
  • error: url, message. Unhandled exception during fetch.
  • batch_started: total, to_fetch, cached. scrape_many only.
  • batch_done: total. scrape_many only.
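A sketch of a callback built from these events (event names and fields are from the table above; the helper name describe is hypothetical):

```python
def describe(event: dict) -> str:
    """Render a progress event as a one-line status message."""
    name = event["event"]
    if name == "batch_started":
        return (f"fetching {event['to_fetch']} of {event['total']} URLs "
                f"({event['cached']} cached)")
    if name == "page_loaded":
        return (f"[{event['completed']}/{event['total']}] "
                f"{event['status_code']} {event['url']}")
    if name == "retry":
        return f"retry {event['attempt']}/{event['max_retries']}: {event['url']}"
    if name == "error":
        return f"error at {event['url']}: {event['message']}"
    return name

# Pass it straight through as the callback:
# scrapers = await GhostScraper.scrape_many(urls=urls, on_progress=lambda e: print(describe(e)))
```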

ScraperDefaults

Global defaults, modifiable at runtime before instantiation:

from ghostscraper import ScraperDefaults

ScraperDefaults.BROWSER_TYPE = "chromium"       # default browser
ScraperDefaults.HEADLESS = True
ScraperDefaults.LOAD_TIMEOUT = 20000            # ms
ScraperDefaults.NETWORK_IDLE_TIMEOUT = 3000     # ms
ScraperDefaults.LOAD_STRATEGIES = ["load", "networkidle", "domcontentloaded"]
ScraperDefaults.MAX_RETRIES = 3
ScraperDefaults.BACKOFF_FACTOR = 2.0
ScraperDefaults.MAX_CONCURRENT = 15
ScraperDefaults.CACHE_TTL = 999                 # days
ScraperDefaults.CACHE_DIRECTORY = "data/ghostscraper"
ScraperDefaults.DYNAMODB_TABLE = None
ScraperDefaults.LOGGING = True

PlaywrightScraper

Low-level browser automation used internally by GhostScraper. Use directly only if you need raw browser control.

Constructor

PlaywrightScraper(
    url: str = "",
    browser_type: Literal["chromium", "firefox", "webkit"] = "chromium",
    headless: bool = True,
    browser_args: Optional[Dict[str, Any]] = None,
    context_args: Optional[Dict[str, Any]] = None,
    max_retries: int = 3,
    backoff_factor: float = 2.0,
    network_idle_timeout: int = 3000,
    load_timeout: int = 20000,
    wait_for_selectors: Optional[List[str]] = None,
    logging: bool = True,
    on_progress: Optional[Callable] = None,
    load_strategies: Optional[List[str]] = None,
    no_retry_on: Optional[List[int]] = None
)

Methods

async fetch() -> Tuple[str, int, dict, list]

Fetch self.url. Returns (html, status_code, headers, redirect_chain).

async fetch_url(url: str) -> Tuple[str, int, dict, list]

Fetch a specific URL using the shared browser instance.

async fetch_many(urls: List[str], max_concurrent: int = 5) -> List[Tuple[str, int, dict, list]]

Fetch multiple URLs in parallel.

async fetch_and_close() -> Tuple[str, int, dict, list]

Fetch and immediately close the browser.

async close() -> None

Close browser and release Playwright resources.

async check_and_install_browser() -> bool

Check if the configured browser is installed; install it if not. Result is cached per process.

Supports async context manager (async with PlaywrightScraper(...) as browser).

Browser Installation Utilities

from ghostscraper import check_browser_installed, install_browser

installed = await check_browser_installed("chromium")  # bool
install_browser("chromium")                            # sync, runs playwright install

Usage Examples

Single URL

import asyncio
from ghostscraper import GhostScraper

async def main():
    scraper = GhostScraper(url="https://example.com")
    html = await scraper.html()
    text = await scraper.text()
    markdown = await scraper.markdown()
    code = await scraper.response_code()
    headers = await scraper.response_headers()
    seo = await scraper.seo()

asyncio.run(main())

Batch Scraping

scrapers = await GhostScraper.scrape_many(
    urls=["https://example.com", "https://python.org"],
    max_concurrent=5,
    ttl=7,
    load_strategies=["domcontentloaded"],
)
for scraper in scrapers:
    print(await scraper.text())

Partial Failure Handling

scrapers = await GhostScraper.scrape_many(urls=urls, fail_fast=False)
for s in scrapers:
    if s.error:
        print(f"FAILED {s.url}: {s.error}")
    else:
        print(f"OK {s.url}: {await s.response_code()}")

Redirect Chain

scraper = GhostScraper(url="https://example.com/redirect")
print(await scraper.final_url())
for hop in await scraper.redirect_chain():
    print(hop["status"], hop["url"])

Fetch Raw Bytes (CDN-protected resources)

body, status_code, headers = await GhostScraper.fetch_bytes(
    "https://example.com/image.jpg",
    cache=True,
)

Skip Retries on Terminal Status Codes

scraper = GhostScraper(url="https://example.com/missing", no_retry_on=[404, 410, 403])
print(await scraper.response_code())

DynamoDB Cache

scraper = GhostScraper(url="https://example.com", dynamodb_table="my-cache-table")
scrapers = await GhostScraper.scrape_many(urls=urls, dynamodb_table="my-cache-table")

Custom Browser Context

scraper = GhostScraper(
    url="https://example.com",
    context_args={"viewport": {"width": 1920, "height": 1080}, "user_agent": "..."},
    wait_for_selectors=["#content", ".product-list"],
)

Memory-Efficient Batch Processing

For large batches, use on_scraped to process and discard each result as it arrives rather than holding all HTML in memory simultaneously:

results = []

async def handle(scraper: GhostScraper) -> None:
    results.append(await scraper.text())
    scraper._html = None  # release — already persisted to cache

await GhostScraper.scrape_many(
    urls=urls,
    max_concurrent=10,
    on_scraped=handle,
)

Dependencies

  • playwright
  • beautifulsoup4
  • html2text
  • newspaper4k
  • python-slugify
  • logorator
  • cacherator
  • lxml_html_clean

License

MIT. Contributions welcome: https://github.com/Redundando/ghostscraper
