A Playwright-based web scraper with persistent caching, parallel scraping, progress callbacks, and multiple output formats

Ghostscraper

Playwright-based async web scraper with persistent caching, subprocess isolation for memory safety, and multiple output formats.

Installation

pip install ghostscraper

Playwright browsers are installed automatically on first run.
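
If you prefer to pre-install the browser yourself (for example in a CI or Docker image), the standard Playwright CLI can be used instead:

playwright install chromium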

Quick Start

import asyncio
from ghostscraper import GhostScraper

async def main():
    scraper = GhostScraper(url="https://example.com")
    print(await scraper.response_code())  # 200
    print(await scraper.text())           # plain text
    print(await scraper.markdown())       # markdown

asyncio.run(main())

GhostScraper

Constructor

GhostScraper(
    url: str = "",
    cache: bool = True,
    clear_cache: bool = False,
    ttl: int = 999,
    markdown_options: Optional[Dict[str, Any]] = None,
    logging: bool = True,
    dynamodb_table: Optional[str] = None,
    on_progress: Optional[Callable] = None,
    lazy: bool = False,
    **kwargs  # forwarded to PlaywrightScraper
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | "" | URL to scrape |
| cache | bool | True | Enable caching. False disables all cache reads/writes |
| clear_cache | bool | False | Delete existing cache entry on init |
| ttl | int | 999 | Cache time-to-live in days |
| markdown_options | dict | None | Options forwarded to html2text.HTML2Text |
| logging | bool | True | Enable/disable log output |
| dynamodb_table | str | None | DynamoDB table name. When set, replaces the local cache with DynamoDB |
| on_progress | Callable | None | Progress callback (sync or async). Errors are swallowed |
| lazy | bool | False | Skip cache restore on init. Used internally by ScrapeStream |
| **kwargs | | | Forwarded to PlaywrightScraper (see below). Note: cache, clear_cache, ttl, lazy, and markdown_options are consumed by GhostScraper and never reach PlaywrightScraper |

PlaywrightScraper kwargs (passable to GhostScraper):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| browser_type | str | "chromium" | "chromium", "firefox", or "webkit" |
| headless | bool | True | Run the browser headlessly |
| browser_args | dict | None | Extra args for browser.launch() |
| context_args | dict | None | Extra args for browser.new_context() |
| max_retries | int | 3 | Retry attempts per URL |
| backoff_factor | float | 2.0 | Exponential backoff multiplier |
| network_idle_timeout | int | 3000 | Timeout (ms) for the networkidle strategy |
| load_timeout | int | 20000 | Timeout (ms) for other strategies |
| wait_for_selectors | list | None | CSS selectors to wait for after page load |
| load_strategies | list | ["load", "networkidle", "domcontentloaded"] | Loading strategy chain, tried in order |
| no_retry_on | list | None | Status codes that skip retries (e.g. [404, 410]) |
| proxy | str | None | Proxy server URL (e.g. "socks5://localhost:1080", "http://proxy:8080"). For a full proxy dict with username/password/bypass, use browser_args={"proxy": {...}} instead |
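
As a sketch of how these options combine, a scraper tuned for a JavaScript-heavy site behind a proxy might look like this (all values are illustrative):

scraper = GhostScraper(
    url="https://example.com",
    browser_type="firefox",
    max_retries=5,
    load_strategies=["networkidle", "load"],
    wait_for_selectors=["#main-content"],   # wait for this element after page load
    proxy="socks5://localhost:1080",
)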

Instance Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| url | str | The URL this scraper was initialized with |
| error | Exception or None | Set when a fetch fails under fail_fast=False. When set, html() returns "" and response_code() returns None |

Async Output Methods

All methods trigger a fetch (or cache restore) on first call. Subsequent calls return the cached/computed value.

| Method | Returns | Description |
| --- | --- | --- |
| await html() | str | Raw HTML. Returns "" if error is set |
| await response_code() | int or None | HTTP status code. Returns None if error is set |
| await response_headers() | dict | HTTP response headers |
| await redirect_chain() | list[dict] | List of {"url": str, "status": int} entries |
| await final_url() | str | Last URL in the redirect chain. Falls back to self.url |
| await markdown() | str | HTML converted to Markdown via html2text |
| await text() | str | Plain text via newspaper4k |
| await authors() | list | Authors detected by newspaper4k |
| await article() | newspaper.Article | Full parsed article object |
| await soup() | BeautifulSoup | Parsed HTML |
| await seo() | dict | SEO metadata (see below) |
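
A minimal sketch combining a few of these outputs (the URL is a placeholder):

scraper = GhostScraper(url="https://example.com")
print(await scraper.response_code())        # e.g. 200
print(await scraper.final_url())            # last URL after redirects
for hop in await scraper.redirect_chain():
    print(hop["url"], hop["status"])        # each redirect hop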

SEO Dict Structure

Each key is omitted if the corresponding tag is absent:

{
    "title": str,           # <title>
    "description": str,     # <meta name="description">
    "canonical": str,       # <link rel="canonical">
    "robots": {             # <meta name="robots">
        "noindex": True,
        "nofollow": True,
    },
    "googlebot": { ... },   # same shape as robots
    "og": {                 # <meta property="og:*">
        "title": str,
        "description": str,
        "image": str,
        "url": str,
    },
    "twitter": { ... },     # <meta name="twitter:*">
    "hreflang": {           # <link rel="alternate" hreflang="...">
        "en-us": ["https://..."],
        "de": ["https://..."],
    }
}
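
Because keys are omitted when a tag is missing, it is safest to read the dict defensively. Given a scraper instance:

seo = await scraper.seo()
title = seo.get("title", "")
og_image = seo.get("og", {}).get("image")
noindex = seo.get("robots", {}).get("noindex", False)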

Cache Methods

| Method | Description |
| --- | --- |
| save_cache() | Persist cached fields to disk/DynamoDB |
| clear_cache_entry() | Delete this URL's cache entry |
| cache_stats() | Returns {"key": str, "exists": bool} |
| cache_list_keys(limit=100, last_key=None) | Returns {"keys": [...], "last_key": ...} |
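
A short sketch of manual cache management (the URL is a placeholder):

scraper = GhostScraper(url="https://example.com")
await scraper.html()           # fetch (or restore from cache)
scraper.save_cache()           # explicitly persist the cached fields
print(scraper.cache_stats())   # {"key": ..., "exists": ...}
scraper.clear_cache_entry()    # delete this URL's cache entry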

Deprecated Methods

These still work but emit a DeprecationWarning:

| Old | New |
| --- | --- |
| json_cache_save() | save_cache() |
| json_cache_save_db() | save_cache() |
| json_cache_clear() | clear_cache_entry() |
| json_cache_stats() | cache_stats() |
| json_cache_list_db_keys() | cache_list_keys() |

Caching

Three cache modes, determined at construction:

| Mode | Condition | Behavior |
| --- | --- | --- |
| Local JSON | Default (cache=True, no dynamodb_table) | JSON files in data/ghostscraper/ |
| DynamoDB | dynamodb_table is set | DynamoDB only, via dynamorator (compressed). No local files |
| Disabled | cache=False | All cache operations are no-ops |

Cached fields: _html, _response_code, _response_headers, _redirect_chain. Cache key: slugified URL. Derived outputs (markdown, text, authors, soup, seo, article) are computed in-memory and not persisted.

# Local cache (default)
scraper = GhostScraper(url="https://example.com")

# DynamoDB cache
scraper = GhostScraper(url="https://example.com", dynamodb_table="my-cache-table")

# No cache
scraper = GhostScraper(url="https://example.com", cache=False)

# Force re-fetch
scraper = GhostScraper(url="https://example.com", clear_cache=True)

Batch Scraping

scrapers = await GhostScraper.scrape_many(
    urls=["https://example.com", "https://python.org"],
    max_concurrent=5,
    ttl=7,
    load_strategies=["domcontentloaded"],
)
for scraper in scrapers:
    print(await scraper.text())

scrape_many Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| urls | list[str] | required | URLs to scrape |
| max_concurrent | int | 15 | Max concurrent page loads |
| logging | bool | True | Enable logging |
| fail_fast | bool | True | True: an exception aborts the batch. False: failed scrapers get error set |
| on_scraped | Callable | None | Callback per URL (sync or async). Fires for cached URLs too |
| browser_restart_every | int | None | Restart the browser every N pages to cap memory growth |
| on_progress | Callable | None | Progress callback |
| **kwargs | | | Forwarded to GhostScraper and PlaywrightScraper |

Returns a List[GhostScraper] in the same order as urls. Already-cached URLs are restored from cache rather than re-fetched.

Partial Failure Handling

scrapers = await GhostScraper.scrape_many(urls=urls, fail_fast=False)
for s in scrapers:
    if s.error:
        print(f"FAILED {s.url}: {s.error}")
    else:
        print(f"OK {s.url}: {await s.response_code()}")

Memory-Efficient Batch Processing

results = []

async def handle(scraper: GhostScraper) -> None:
    results.append(await scraper.text())
    scraper._html = None  # release — already persisted to cache

await GhostScraper.scrape_many(urls=urls, max_concurrent=10, on_scraped=handle)

ScrapeStream

Memory-safe streaming for large URL sets. Each chunk runs in a disposable subprocess — when it exits, the OS reclaims all Chromium memory. Results are yielded one at a time via async for.

Creating a Stream

stream = GhostScraper.create_stream(
    urls=urls,
    dynamodb_table="my-cache-table",
    stream_id="my-seo-audit",
    priority=5,
    subprocess_batch_size=50,
    max_concurrent=10,
    on_progress=my_callback,
    # All other kwargs forwarded to GhostScraper/PlaywrightScraper
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| urls | list[str] | required | URLs to scrape |
| dynamodb_table | str | None | DynamoDB table for the cache. Local cache is used if None |
| stream_id | str | auto UUID | Identifier for monitoring/cancellation |
| priority | int | 5 | Lower = higher priority (0–10) |
| subprocess_batch_size | int | 50 | URLs per subprocess |
| max_concurrent | int | 15 | Concurrent pages within each subprocess |
| on_progress | Callable | None | Progress callback |
| **kwargs | | | Forwarded to GhostScraper/PlaywrightScraper |

Consuming Results

async for scraper in stream:
    if scraper.error:
        print(f"FAILED {scraper.url}: {scraper.error}")
    else:
        text = await scraper.text()
        save_to_db(scraper.url, text)
    # scraper goes out of scope → GC reclaims

Monitoring

status = GhostScraper.get_stream_status("my-seo-audit")
# StreamStatus(stream_id, total, completed, failed, pending, status)
# status: "running" | "completed" | "cancelled"

all_streams = GhostScraper.get_all_streams()  # List[StreamStatus]

Cancellation

GhostScraper.cancel_stream("my-seo-audit")
# Current subprocess chunk finishes gracefully, then the async for loop ends

Shutdown

await GhostScraper.shutdown()
# Waits for running subprocesses, drains queue

Fetch Raw Bytes

Fetch a URL as raw bytes using the Playwright browser context. Useful for CDN-protected resources that block plain HTTP clients.

body, status_code, headers = await GhostScraper.fetch_bytes(
    "https://example.com/image.jpg",
    cache=True,
    ttl=30,
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | required | URL to fetch |
| cache | bool | False | Persist the result to disk/DynamoDB |
| clear_cache | bool | False | Force a re-fetch |
| ttl | int | 999 | Cache TTL in days |
| dynamodb_table | str | None | DynamoDB table |
| logging | bool | True | Enable logging |
| **kwargs | | | Forwarded to PlaywrightScraper |

Returns a Tuple[bytes, int, dict] of (body, status_code, headers).
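
For example, to save a fetched image to disk (the output path is illustrative):

body, status_code, headers = await GhostScraper.fetch_bytes("https://example.com/image.jpg")
if status_code == 200:
    with open("image.jpg", "wb") as f:
        f.write(body)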

Loading Strategies

Playwright loading strategies are tried in order, falling back on timeout:

  1. load — waits for the load event. Works for most sites.
  2. networkidle — waits until no network activity for 500ms. Better for JS-heavy pages.
  3. domcontentloaded — waits only for HTML parsing. Fastest, least complete.

If all strategies fail, the attempt is retried up to max_retries times with exponential backoff.

# Per-scraper override
scraper = GhostScraper(url=url, load_strategies=["domcontentloaded"])

# Global override
ScraperDefaults.LOAD_STRATEGIES = ["domcontentloaded"]

Progress Callbacks

Pass on_progress to receive real-time events. Accepts sync and async callables. Errors inside the callback are swallowed.

scraper = GhostScraper(url="https://example.com", on_progress=lambda e: print(e["event"]))

Each event is a dict with event (str) and ts (Unix timestamp). Additional fields vary by event:

| Event | Extra Fields | Notes |
| --- | --- | --- |
| started | url | Before the fetch begins |
| loading_strategy | url, strategy, attempt, max_retries, timeout | Per strategy attempt |
| retry | url, attempt, max_retries | When another attempt follows |
| page_loaded | url, completed, total, status_code | Success or error status |
| error | url, message | Unhandled exception |
| batch_started | total, to_fetch, cached | scrape_many only |
| batch_done | total | scrape_many only |
| browser_ready | browser | First-run browser check |
| browser_installing | browser | First-run install |
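
A sketch of an async callback that reports per-page progress; field access follows the table above, and urls is assumed to be a list of URLs:

async def report(event: dict) -> None:
    kind = event["event"]
    if kind == "page_loaded":
        print(f"{event['completed']}/{event['total']} {event['url']} -> {event['status_code']}")
    elif kind == "error":
        print(f"error on {event['url']}: {event['message']}")

scrapers = await GhostScraper.scrape_many(urls=urls, on_progress=report)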

ScraperDefaults

Global defaults, modifiable at runtime:

from ghostscraper import ScraperDefaults

ScraperDefaults.BROWSER_TYPE = "chromium"
ScraperDefaults.HEADLESS = True
ScraperDefaults.LOAD_TIMEOUT = 20000            # ms
ScraperDefaults.NETWORK_IDLE_TIMEOUT = 3000     # ms
ScraperDefaults.LOAD_STRATEGIES = ["load", "networkidle", "domcontentloaded"]
ScraperDefaults.MAX_RETRIES = 3
ScraperDefaults.BACKOFF_FACTOR = 2.0
ScraperDefaults.MAX_CONCURRENT = 15
ScraperDefaults.CACHE_TTL = 999                 # days
ScraperDefaults.CACHE_DIRECTORY = "data/ghostscraper"
ScraperDefaults.DYNAMODB_TABLE = None
ScraperDefaults.BROWSER_RESTART_EVERY = None
ScraperDefaults.PROXY = None                    # e.g. "socks5://localhost:1080"
ScraperDefaults.LOGGING = True

# Stream settings
ScraperDefaults.MAX_WORKERS = 2                 # concurrent subprocess workers
ScraperDefaults.SUBPROCESS_BATCH_SIZE = 50      # URLs per subprocess
ScraperDefaults.MAX_QUEUE_SIZE = 500            # max pending chunks in queue
ScraperDefaults.DEFAULT_PRIORITY = 5            # default stream priority (0–10)

PlaywrightScraper

Low-level browser automation used internally. Use it directly only when you need raw browser control.

async with PlaywrightScraper(logging=False) as browser:
    html, status, headers, chain = await browser.fetch_url("https://example.com")
    body, status, headers = await browser.fetch_bytes("https://example.com/image.jpg")

| Method | Returns | Description |
| --- | --- | --- |
| await fetch() | (html, status, headers, chain) | Fetch self.url |
| await fetch_url(url) | (html, status, headers, chain) | Fetch a specific URL |
| await fetch_many(urls, max_concurrent=5) | list[tuple] | Parallel fetch |
| await fetch_and_close() | (html, status, headers, chain) | Fetch and close the browser |
| await fetch_bytes(url) | (bytes, status, headers) | Raw bytes fetch |
| await close() | None | Close the browser |
| await check_and_install_browser() | bool | Check for / install the browser |

Supports use as an async context manager (async with).

Dependencies

  • playwright
  • beautifulsoup4
  • html2text
  • newspaper4k
  • python-slugify
  • logorator
  • dynamorator
  • lxml_html_clean

License

MIT. Contributions welcome: https://github.com/Redundando/ghostscraper
