A Playwright-based web scraper with persistent caching, parallel scraping, progress callbacks, and multiple output formats

Ghostscraper

Playwright-based async web scraper with persistent caching, subprocess isolation for memory safety, and multiple output formats.

Installation

pip install ghostscraper

Playwright browsers are installed automatically on first run.
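
If you prefer to pre-install the browser yourself (for example in a CI or Docker image), the standard Playwright CLI can be used instead:

playwright install chromium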

Quick Start

import asyncio
from ghostscraper import GhostScraper

async def main():
    scraper = GhostScraper(url="https://example.com")
    print(await scraper.response_code())  # 200
    print(await scraper.text())           # plain text
    print(await scraper.markdown())       # markdown

asyncio.run(main())

GhostScraper

Constructor

GhostScraper(
    url: str = "",
    cache: bool = True,
    clear_cache: bool = False,
    ttl: int = 999,
    markdown_options: Optional[Dict[str, Any]] = None,
    logging: bool = True,
    dynamodb_table: Optional[str] = None,
    on_progress: Optional[Callable] = None,
    lazy: bool = False,
    **kwargs  # forwarded to PlaywrightScraper
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | "" | URL to scrape |
| cache | bool | True | Enable caching. False disables all cache reads/writes |
| clear_cache | bool | False | Delete existing cache entry on init |
| ttl | int | 999 | Cache time-to-live in days |
| markdown_options | dict | None | Options forwarded to html2text.HTML2Text |
| logging | bool | True | Enable/disable log output |
| dynamodb_table | str | None | DynamoDB table name. When set, replaces the local cache with DynamoDB |
| on_progress | Callable | None | Progress callback (sync or async). Errors are swallowed |
| lazy | bool | False | Skip cache restore on init. Used internally by ScrapeStream |
| **kwargs | | | Forwarded to PlaywrightScraper (see below). Note: cache, clear_cache, ttl, lazy, and markdown_options are consumed by GhostScraper and never reach PlaywrightScraper |

PlaywrightScraper kwargs (passable to GhostScraper):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| browser_type | str | "chromium" | "chromium", "firefox", or "webkit" |
| headless | bool | True | Run the browser headlessly |
| browser_args | dict | None | Extra args for browser.launch() |
| context_args | dict | None | Extra args for browser.new_context() |
| max_retries | int | 3 | Retry attempts per URL |
| backoff_factor | float | 2.0 | Exponential backoff multiplier |
| network_idle_timeout | int | 3000 | Timeout (ms) for the networkidle strategy |
| load_timeout | int | 20000 | Timeout (ms) for other strategies |
| wait_for_selectors | list | None | CSS selectors to wait for after page load |
| load_strategies | list | ["load", "networkidle", "domcontentloaded"] | Loading strategy chain, tried in order |
| no_retry_on | list | None | Status codes that skip retries (e.g. [404, 410]) |
| proxy | str | None | Proxy server URL (e.g. "socks5://localhost:1080", "http://proxy:8080"). For a full proxy dict with username/password/bypass, use browser_args={"proxy": {...}} instead |
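
As a sketch of how these options combine, a scraper tuned for a JavaScript-heavy site behind a proxy might look like this (all values are illustrative):

scraper = GhostScraper(
    url="https://example.com",
    browser_type="firefox",
    max_retries=5,
    load_strategies=["networkidle", "load"],
    wait_for_selectors=["#main-content"],   # wait for this element after page load
    proxy="socks5://localhost:1080",
)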

Instance Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| url | str | The URL this scraper was initialized with |
| error | Exception or None | Set when a fetch fails under fail_fast=False. When set, html() returns "" and response_code() returns None |

Async Output Methods

All methods trigger a fetch (or cache restore) on first call. Subsequent calls return the cached/computed value.

| Method | Returns | Description |
| --- | --- | --- |
| await html() | str | Raw HTML. Returns "" if error is set |
| await response_code() | int or None | HTTP status code. Returns None if error is set |
| await response_headers() | dict | HTTP response headers |
| await redirect_chain() | list[dict] | List of {"url": str, "status": int} entries |
| await final_url() | str | Last URL in the redirect chain. Falls back to self.url |
| await markdown() | str | HTML converted to Markdown via html2text |
| await text() | str | Plain text via newspaper4k |
| await authors() | list | Authors detected by newspaper4k |
| await article() | newspaper.Article | Full parsed article object |
| await soup() | BeautifulSoup | Parsed HTML |
| await seo() | dict | SEO metadata (see below) |
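
A minimal sketch combining a few of these outputs (the URL is a placeholder):

scraper = GhostScraper(url="https://example.com")
print(await scraper.response_code())        # e.g. 200
print(await scraper.final_url())            # last URL after redirects
for hop in await scraper.redirect_chain():
    print(hop["url"], hop["status"])        # each redirect hop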

SEO Dict Structure

Each key is omitted if the corresponding tag is absent:

{
    "title": str,           # <title>
    "description": str,     # <meta name="description">
    "canonical": str,       # <link rel="canonical">
    "robots": {             # <meta name="robots">
        "noindex": True,
        "nofollow": True,
    },
    "googlebot": { ... },   # same shape as robots
    "og": {                 # <meta property="og:*">
        "title": str,
        "description": str,
        "image": str,
        "url": str,
    },
    "twitter": { ... },     # <meta name="twitter:*">
    "hreflang": {           # <link rel="alternate" hreflang="...">
        "en-us": ["https://..."],
        "de": ["https://..."],
    }
}
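
Because keys are omitted when a tag is missing, it is safest to read the dict defensively. Given a scraper instance:

seo = await scraper.seo()
title = seo.get("title", "")
og_image = seo.get("og", {}).get("image")
noindex = seo.get("robots", {}).get("noindex", False)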

Cache Methods

| Method | Description |
| --- | --- |
| save_cache() | Persist cached fields to disk/DynamoDB |
| clear_cache_entry() | Delete this URL's cache entry |
| cache_stats() | Returns {"key": str, "exists": bool} |
| cache_list_keys(limit=100, last_key=None) | Returns {"keys": [...], "last_key": ...} |
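
A short sketch of manual cache management (the URL is a placeholder):

scraper = GhostScraper(url="https://example.com")
await scraper.html()           # fetch (or restore from cache)
scraper.save_cache()           # explicitly persist the cached fields
print(scraper.cache_stats())   # {"key": ..., "exists": ...}
scraper.clear_cache_entry()    # delete this URL's cache entry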

Deprecated Methods

These still work but emit a DeprecationWarning:

| Old | New |
| --- | --- |
| json_cache_save() | save_cache() |
| json_cache_save_db() | save_cache() |
| json_cache_clear() | clear_cache_entry() |
| json_cache_stats() | cache_stats() |
| json_cache_list_db_keys() | cache_list_keys() |

Caching

Three cache modes, determined at construction:

| Mode | Condition | Behavior |
| --- | --- | --- |
| Local JSON | Default (cache=True, no dynamodb_table) | JSON files in data/ghostscraper/ |
| DynamoDB | dynamodb_table is set | DynamoDB only, via dynamorator (compressed). No local files |
| Disabled | cache=False | All cache operations are no-ops |

Cached fields: _html, _response_code, _response_headers, _redirect_chain. Cache key: slugified URL. Derived outputs (markdown, text, authors, soup, seo, article) are computed in-memory and not persisted.

# Local cache (default)
scraper = GhostScraper(url="https://example.com")

# DynamoDB cache
scraper = GhostScraper(url="https://example.com", dynamodb_table="my-cache-table")

# No cache
scraper = GhostScraper(url="https://example.com", cache=False)

# Force re-fetch
scraper = GhostScraper(url="https://example.com", clear_cache=True)

Batch Scraping

scrapers = await GhostScraper.scrape_many(
    urls=["https://example.com", "https://python.org"],
    max_concurrent=5,
    ttl=7,
    load_strategies=["domcontentloaded"],
)
for scraper in scrapers:
    print(await scraper.text())

scrape_many Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| urls | list[str] | required | URLs to scrape |
| max_concurrent | int | 15 | Max concurrent page loads |
| logging | bool | True | Enable logging |
| fail_fast | bool | True | True: an exception aborts the batch. False: failed scrapers get error set |
| on_scraped | Callable | None | Callback per URL (sync or async). Fires for cached URLs too |
| browser_restart_every | int | None | Restart the browser every N pages to cap memory growth |
| on_progress | Callable | None | Progress callback |
| **kwargs | | | Forwarded to GhostScraper and PlaywrightScraper |

Returns a List[GhostScraper] in the same order as urls. Already-cached URLs are restored from cache rather than re-fetched.

Partial Failure Handling

scrapers = await GhostScraper.scrape_many(urls=urls, fail_fast=False)
for s in scrapers:
    if s.error:
        print(f"FAILED {s.url}: {s.error}")
    else:
        print(f"OK {s.url}: {await s.response_code()}")

Memory-Efficient Batch Processing

results = []

async def handle(scraper: GhostScraper) -> None:
    results.append(await scraper.text())
    scraper._html = None  # release — already persisted to cache

await GhostScraper.scrape_many(urls=urls, max_concurrent=10, on_scraped=handle)

ScrapeStream

Memory-safe streaming for large URL sets. Each chunk runs in a disposable subprocess — when it exits, the OS reclaims all Chromium memory. Results are yielded one at a time via async for.

Creating a Stream

stream = GhostScraper.create_stream(
    urls=urls,
    dynamodb_table="my-cache-table",
    stream_id="my-seo-audit",
    priority=5,
    subprocess_batch_size=50,
    max_concurrent=10,
    on_progress=my_callback,
    # All other kwargs forwarded to GhostScraper/PlaywrightScraper
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| urls | list[str] | required | URLs to scrape |
| dynamodb_table | str | None | DynamoDB table for the cache. Local cache is used if None |
| stream_id | str | auto UUID | Identifier for monitoring/cancellation |
| priority | int | 5 | Lower = higher priority (0–10) |
| subprocess_batch_size | int | 50 | URLs per subprocess |
| max_concurrent | int | 15 | Concurrent pages within each subprocess |
| on_progress | Callable | None | Progress callback |
| **kwargs | | | Forwarded to GhostScraper/PlaywrightScraper |

Consuming Results

async for scraper in stream:
    if scraper.error:
        print(f"FAILED {scraper.url}: {scraper.error}")
    else:
        text = await scraper.text()
        save_to_db(scraper.url, text)
    # scraper goes out of scope → GC reclaims

Monitoring

status = GhostScraper.get_stream_status("my-seo-audit")
# StreamStatus(stream_id, total, completed, failed, pending, status)
# status: "running" | "completed" | "cancelled"

all_streams = GhostScraper.get_all_streams()  # List[StreamStatus]

Cancellation

GhostScraper.cancel_stream("my-seo-audit")
# Current subprocess chunk finishes gracefully, then the async for loop ends

Shutdown

await GhostScraper.shutdown()
# Waits for running subprocesses, drains queue

Fetch Raw Bytes

Fetch a URL as raw bytes using the Playwright browser context. Useful for CDN-protected resources that block plain HTTP clients.

body, status_code, headers = await GhostScraper.fetch_bytes(
    "https://example.com/image.jpg",
    cache=True,
    ttl=30,
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | required | URL to fetch |
| cache | bool | False | Persist the result to disk/DynamoDB |
| clear_cache | bool | False | Force a re-fetch |
| ttl | int | 999 | Cache TTL in days |
| dynamodb_table | str | None | DynamoDB table |
| logging | bool | True | Enable logging |
| **kwargs | | | Forwarded to PlaywrightScraper |

Returns a Tuple[bytes, int, dict] of (body, status_code, headers).
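
For example, to save a fetched image to disk (the output path is illustrative):

body, status_code, headers = await GhostScraper.fetch_bytes("https://example.com/image.jpg")
if status_code == 200:
    with open("image.jpg", "wb") as f:
        f.write(body)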

Loading Strategies

Playwright loading strategies are tried in order, falling back on timeout:

  1. load — waits for the load event. Works for most sites.
  2. networkidle — waits until no network activity for 500ms. Better for JS-heavy pages.
  3. domcontentloaded — waits only for HTML parsing. Fastest, least complete.

If all strategies fail, the attempt is retried up to max_retries times with exponential backoff.

# Per-scraper override
scraper = GhostScraper(url=url, load_strategies=["domcontentloaded"])

# Global override
ScraperDefaults.LOAD_STRATEGIES = ["domcontentloaded"]

Progress Callbacks

Pass on_progress to receive real-time events. Accepts sync and async callables. Errors inside the callback are swallowed.

scraper = GhostScraper(url="https://example.com", on_progress=lambda e: print(e["event"]))

Each event is a dict with event (str) and ts (Unix timestamp). Additional fields vary by event:

| Event | Extra Fields | Notes |
| --- | --- | --- |
| started | url | Before the fetch begins |
| loading_strategy | url, strategy, attempt, max_retries, timeout | Per strategy attempt |
| retry | url, attempt, max_retries | When another attempt follows |
| page_loaded | url, completed, total, status_code | Success or error status |
| error | url, message | Unhandled exception |
| batch_started | total, to_fetch, cached | scrape_many only |
| batch_done | total | scrape_many only |
| browser_ready | browser | First-run browser check |
| browser_installing | browser | First-run install |
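
A sketch of an async callback that reports per-page progress; field access follows the table above, and urls is assumed to be a list of URLs:

async def report(event: dict) -> None:
    kind = event["event"]
    if kind == "page_loaded":
        print(f"{event['completed']}/{event['total']} {event['url']} -> {event['status_code']}")
    elif kind == "error":
        print(f"error on {event['url']}: {event['message']}")

scrapers = await GhostScraper.scrape_many(urls=urls, on_progress=report)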

ScraperDefaults

Global defaults, modifiable at runtime:

from ghostscraper import ScraperDefaults

ScraperDefaults.BROWSER_TYPE = "chromium"
ScraperDefaults.HEADLESS = True
ScraperDefaults.LOAD_TIMEOUT = 20000            # ms
ScraperDefaults.NETWORK_IDLE_TIMEOUT = 3000     # ms
ScraperDefaults.LOAD_STRATEGIES = ["load", "networkidle", "domcontentloaded"]
ScraperDefaults.MAX_RETRIES = 3
ScraperDefaults.BACKOFF_FACTOR = 2.0
ScraperDefaults.MAX_CONCURRENT = 15
ScraperDefaults.CACHE_TTL = 999                 # days
ScraperDefaults.CACHE_DIRECTORY = "data/ghostscraper"
ScraperDefaults.DYNAMODB_TABLE = None
ScraperDefaults.BROWSER_RESTART_EVERY = None
ScraperDefaults.PROXY = None                    # e.g. "socks5://localhost:1080"
ScraperDefaults.LOGGING = True

# Stream settings
ScraperDefaults.MAX_WORKERS = 2                 # concurrent subprocess workers
ScraperDefaults.SUBPROCESS_BATCH_SIZE = 50      # URLs per subprocess
ScraperDefaults.MAX_QUEUE_SIZE = 500            # max pending chunks in queue
ScraperDefaults.DEFAULT_PRIORITY = 5            # default stream priority (0–10)

PlaywrightScraper

Low-level browser automation used internally. Use it directly only when you need raw browser control.

async with PlaywrightScraper(logging=False) as browser:
    html, status, headers, chain = await browser.fetch_url("https://example.com")
    body, status, headers = await browser.fetch_bytes("https://example.com/image.jpg")

| Method | Returns | Description |
| --- | --- | --- |
| await fetch() | (html, status, headers, chain) | Fetch self.url |
| await fetch_url(url) | (html, status, headers, chain) | Fetch a specific URL |
| await fetch_many(urls, max_concurrent=5) | list[tuple] | Parallel fetch |
| await fetch_and_close() | (html, status, headers, chain) | Fetch and close the browser |
| await fetch_bytes(url) | (bytes, status, headers) | Raw bytes fetch |
| await close() | None | Close the browser |
| await check_and_install_browser() | bool | Check for / install the browser |

Supports use as an async context manager (async with).

Dependencies

  • playwright
  • beautifulsoup4
  • html2text
  • newspaper4k
  • python-slugify
  • logorator
  • dynamorator
  • lxml_html_clean

License

MIT. Contributions welcome: https://github.com/Redundando/ghostscraper
