Ghostscraper
A Playwright-based web scraper with persistent caching, parallel scraping, progress callbacks, automatic browser installation, and multiple output formats.
Changelog
See CHANGELOG.md for the full version history.
Overview
GhostScraper wraps Playwright with persistent JSON caching (via cacherator.JSONCache), retry logic, and multiple output formats. The primary interface is GhostScraper. PlaywrightScraper is the lower-level browser automation layer used internally.
All scraping methods are async. Results (html, response_code, response_headers, redirect_chain) are cached to disk on first fetch and restored on subsequent instantiation with the same URL. Derived outputs (markdown, text, authors, soup, seo, article) are computed in-memory from the cached HTML and are not persisted.
Installation
pip install ghostscraper
GhostScraper
Constructor
GhostScraper(
url: str = "",
clear_cache: bool = False,
ttl: int = 999,
markdown_options: Optional[Dict[str, Any]] = None,
logging: bool = True,
dynamodb_table: Optional[str] = None,
on_progress: Optional[Callable] = None,
**kwargs # passed to PlaywrightScraper
)
Parameters:
- url: The URL to scrape.
- clear_cache: Clear existing cache on initialization. Default: False.
- ttl: Cache time-to-live in days. Default: 999.
- markdown_options: Options forwarded to html2text.HTML2Text for Markdown conversion.
- logging: Enable/disable logging. Default: True.
- dynamodb_table: DynamoDB table name for cross-machine caching. Replaces local cache when set. Default: None.
- on_progress: Callback fired at key scraping events. Accepts sync and async callables. Default: None.
- **kwargs: Forwarded to PlaywrightScraper (see below).
PlaywrightScraper kwargs:
- browser_type: "chromium" | "firefox" | "webkit". Default: "chromium".
- headless: Run browser headlessly. Default: True.
- browser_args: Extra args passed to browser.launch().
- context_args: Extra args passed to browser.new_context().
- max_retries: Retry attempts per URL. Default: 3.
- backoff_factor: Exponential backoff multiplier. Default: 2.0.
- network_idle_timeout: Timeout in ms for the networkidle strategy. Default: 3000.
- load_timeout: Timeout in ms for all other strategies. Default: 20000.
- wait_for_selectors: CSS selectors to wait for after page load.
- load_strategies: Loading strategy chain. Default: ["load", "networkidle", "domcontentloaded"].
- no_retry_on: Status codes that abort retries immediately (e.g. [404, 410]). Default: None.
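A brief sketch combining GhostScraper options with forwarded PlaywrightScraper kwargs. The specific values, and the ignore_links html2text option, are illustrative assumptions rather than recommendations:

scraper = GhostScraper(
    url="https://example.com",
    ttl=30,                                   # cache entries expire after 30 days
    markdown_options={"ignore_links": True},  # assumed html2text option; adjust as needed
    browser_type="firefox",                   # forwarded to PlaywrightScraper
    max_retries=5,
    no_retry_on=[404, 410],
)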
Instance Attributes
- url (str): The URL this scraper was initialized with.
- error (Exception | None): Set when a fetch fails under fail_fast=False in scrape_many. When set, html() returns "" and response_code() returns None.
Methods
All methods trigger a fetch (or cache restore) on first call. Subsequent calls return the cached/computed value.
async html() -> str
Raw HTML of the page. Returns "" if self.error is set.
async response_code() -> Optional[int]
HTTP status code. Returns None if self.error is set.
async response_headers() -> dict
HTTP response headers from the final response. Cached alongside HTML.
async redirect_chain() -> list
Full list of responses during navigation. Each entry: {"url": str, "status": int}. Cached alongside HTML.
[
{"url": "https://example.com/old", "status": 301},
{"url": "https://example.com/new", "status": 200},
]
async final_url() -> str
URL of the last entry in redirect_chain(). Falls back to self.url if chain is empty.
async markdown() -> str
Page content converted to Markdown via html2text. Respects markdown_options.
async text() -> str
Plain text extracted via newspaper4k.
async authors() -> list
Authors detected by newspaper4k.
async article() -> newspaper.Article
Full newspaper.Article object with parsed content.
async soup() -> BeautifulSoup
BeautifulSoup object parsed from the HTML.
async seo() -> dict
SEO metadata parsed from the HTML. Each key is omitted if the corresponding tag is absent.
{
"title": str, # <title>
"description": str, # <meta name="description">
"canonical": str, # <link rel="canonical">
"robots": { # <meta name="robots"> — directives as keys
"noindex": True,
"nofollow": True,
},
"googlebot": { ... }, # <meta name="googlebot">, same shape as robots
"og": { # <meta property="og:*"> keyed by suffix
"title": str,
"description": str,
"image": str,
"url": str,
},
"twitter": { ... }, # <meta name="twitter:*">, same pattern
"hreflang": { # <link rel="alternate" hreflang="..."> — values are lists
"en-us": ["https://..."],
"de": ["https://..."],
}
}
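A short sketch of reading a few of these fields. Because keys are omitted when tags are absent, the lookups below are defensive; the chosen fields are illustrative:

seo = await scraper.seo()
print(seo.get("title"))
print(seo.get("og", {}).get("image"))
if "noindex" in seo.get("robots", {}):
    print("Page requests not to be indexed")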
@classmethod async fetch_bytes(url, cache=False, clear_cache=False, ttl=999, dynamodb_table=None, logging=True, **kwargs) -> Tuple[bytes, int, dict]
Fetch a URL as raw bytes using the Playwright browser context (inherits UA, cookies, and headers). Useful for CDN-protected resources that block plain HTTP clients.
Parameters:
- url: URL to fetch.
- cache: Persist result to disk/DynamoDB. Default: False.
- clear_cache: Force re-fetch even if cached. Default: False.
- ttl: Cache TTL in days. Default: 999.
- dynamodb_table: DynamoDB table for cross-machine caching. Default: None.
- logging: Enable logging. Default: True.
- **kwargs: Forwarded to PlaywrightScraper (max_retries, browser_type, no_retry_on, etc.).
Returns: Tuple[bytes, int, dict] — (body, status_code, headers)
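A minimal sketch of saving a fetched asset to disk. The status check and output filename are illustrative, not part of the API:

body, status_code, headers = await GhostScraper.fetch_bytes(
    "https://example.com/image.jpg",
    cache=True,
)
if status_code == 200:
    with open("image.jpg", "wb") as f:  # hypothetical output path
        f.write(body)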
@classmethod async scrape_many(urls, max_concurrent=15, logging=True, fail_fast=True, on_scraped=None, **kwargs) -> List[GhostScraper]
Scrape multiple URLs in parallel using a single shared browser instance.
Parameters:
- urls: List of URLs to scrape.
- max_concurrent: Max concurrent page loads. Default: 15.
- logging: Enable logging. Default: True.
- fail_fast: If True, any unhandled exception aborts the entire batch. If False, failed scrapers have scraper.error set, html() returns "", and response_code() returns None. Default: True.
- on_scraped: Async or sync callback invoked immediately after each URL is fetched and cached, before the batch returns. Receives the fully populated GhostScraper instance. Fires for cached URLs too. Default: None.
- on_progress: Progress callback (sync or async). Default: None.
- **kwargs: Forwarded to GhostScraper and PlaywrightScraper.
Returns: List[GhostScraper] in the same order as urls. Already-cached URLs skip the network fetch but are still returned.
Caching
- Cached fields: _html, _response_code, _response_headers, _redirect_chain.
- Cache key: slugified URL.
- Cache location: data/ghostscraper/ (configurable via ScraperDefaults.CACHE_DIRECTORY).
- DynamoDB cache replaces local cache when dynamodb_table is set. Requires AWS credentials.
- clear_cache=True forces a fresh fetch and overwrites the cache.
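A minimal sketch of the cache controls described above; the URL and TTL are placeholders:

# Reuse a cached page for up to 7 days; force a refetch by clearing the cache.
fresh = GhostScraper(url="https://example.com", ttl=7, clear_cache=True)
html = await fresh.html()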
Loading Strategies
GhostScraper tries Playwright loading strategies in order, falling back on timeout:
- load: waits for the load event. Works for most sites.
- networkidle: waits until no network activity for 500ms. Better for JS-heavy pages. Uses network_idle_timeout.
- domcontentloaded: waits only for HTML parsing. Fastest, least complete.
If all strategies fail, the attempt is retried up to max_retries times with exponential backoff (backoff_factor ** attempt seconds).
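As a rough illustration of that formula, the defaults (max_retries=3, backoff_factor=2.0) would give waits of about 2, 4, and 8 seconds between attempts, assuming attempts are counted from 1. This is a sketch of the stated formula, not the exact internal timing:

backoff_factor = 2.0
for attempt in range(1, 4):
    print(f"attempt {attempt}: wait {backoff_factor ** attempt:.0f}s before retrying")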
Override the chain via load_strategies:
scrapers = await GhostScraper.scrape_many(urls=urls, load_strategies=["domcontentloaded"])
# Or globally:
from ghostscraper import ScraperDefaults
ScraperDefaults.LOAD_STRATEGIES = ["domcontentloaded"]
Progress Callbacks
Pass on_progress to receive real-time events. Accepts sync and async callables. Errors raised inside the callback are logged (when logging=True) and swallowed — they will not abort a scrape.
scraper = GhostScraper(url="https://example.com", on_progress=lambda e: print(e["event"]))
Each event is a dict with event (str) and ts (Unix timestamp). Additional fields:
| event | extra fields | notes |
|---|---|---|
| started | url | fired before fetch begins |
| browser_installing | browser | first-run only; sync callback only |
| browser_ready | browser | browser check passed |
| loading_strategy | url, strategy, attempt, max_retries, timeout | fired per strategy attempt |
| retry | url, attempt, max_retries + optional reason, status_code | only fires when another attempt follows |
| page_loaded | url, completed, total, status_code, scraper | fires on success or error status; scraper only in scrape_many; fires for cached URLs too |
| error | url, message | unhandled exception during fetch |
| batch_started | total, to_fetch, cached | scrape_many only |
| batch_done | total | scrape_many only |
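A hedged sketch of an async on_progress handler that dispatches on a couple of the events above; the log format is illustrative:

import time

async def on_progress(event: dict) -> None:
    ts = time.strftime("%H:%M:%S", time.localtime(event["ts"]))
    if event["event"] == "page_loaded":
        print(f"[{ts}] loaded {event['url']} ({event.get('completed')}/{event.get('total')})")
    elif event["event"] == "retry":
        print(f"[{ts}] retry {event['attempt']}/{event['max_retries']} for {event['url']}")

scrapers = await GhostScraper.scrape_many(urls=urls, on_progress=on_progress)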
ScraperDefaults
Global defaults, modifiable at runtime before instantiation:
from ghostscraper import ScraperDefaults
ScraperDefaults.BROWSER_TYPE = "chromium" # default browser
ScraperDefaults.HEADLESS = True
ScraperDefaults.LOAD_TIMEOUT = 20000 # ms
ScraperDefaults.NETWORK_IDLE_TIMEOUT = 3000 # ms
ScraperDefaults.LOAD_STRATEGIES = ["load", "networkidle", "domcontentloaded"]
ScraperDefaults.MAX_RETRIES = 3
ScraperDefaults.BACKOFF_FACTOR = 2.0
ScraperDefaults.MAX_CONCURRENT = 15
ScraperDefaults.CACHE_TTL = 999 # days
ScraperDefaults.CACHE_DIRECTORY = "data/ghostscraper"
ScraperDefaults.DYNAMODB_TABLE = None
ScraperDefaults.LOGGING = True
PlaywrightScraper
Low-level browser automation used internally by GhostScraper. Use directly only if you need raw browser control.
Constructor
PlaywrightScraper(
url: str = "",
browser_type: Literal["chromium", "firefox", "webkit"] = "chromium",
headless: bool = True,
browser_args: Optional[Dict[str, Any]] = None,
context_args: Optional[Dict[str, Any]] = None,
max_retries: int = 3,
backoff_factor: float = 2.0,
network_idle_timeout: int = 3000,
load_timeout: int = 20000,
wait_for_selectors: Optional[List[str]] = None,
logging: bool = True,
on_progress: Optional[Callable] = None,
load_strategies: Optional[List[str]] = None,
no_retry_on: Optional[List[int]] = None
)
Methods
async fetch() -> Tuple[str, int, dict, list]
Fetch self.url. Returns (html, status_code, headers, redirect_chain).
async fetch_url(url: str) -> Tuple[str, int, dict, list]
Fetch a specific URL using the shared browser instance.
async fetch_many(urls: List[str], max_concurrent: int = 5) -> List[Tuple[str, int, dict, list]]
Fetch multiple URLs in parallel.
async fetch_and_close() -> Tuple[str, int, dict, list]
Fetch and immediately close the browser.
async close() -> None
Close browser and release Playwright resources.
async check_and_install_browser() -> bool
Check if the configured browser is installed; install it if not. Result is cached per process.
Supports async context manager (async with PlaywrightScraper(...) as browser).
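A minimal sketch of the context-manager form, assuming PlaywrightScraper is importable from the top-level package like the other names in this document; the URL and timeout are placeholders, and the browser is closed automatically on exit:

from ghostscraper import PlaywrightScraper

async with PlaywrightScraper(url="https://example.com", load_timeout=10000) as browser:
    html, status_code, headers, redirect_chain = await browser.fetch()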
Browser Installation Utilities
from ghostscraper import check_browser_installed, install_browser
installed = await check_browser_installed("chromium") # bool
install_browser("chromium") # sync, runs playwright install
Usage Examples
Single URL
import asyncio
from ghostscraper import GhostScraper
async def main():
scraper = GhostScraper(url="https://example.com")
html = await scraper.html()
text = await scraper.text()
markdown = await scraper.markdown()
code = await scraper.response_code()
headers = await scraper.response_headers()
seo = await scraper.seo()
asyncio.run(main())
Batch Scraping
scrapers = await GhostScraper.scrape_many(
urls=["https://example.com", "https://python.org"],
max_concurrent=5,
ttl=7,
load_strategies=["domcontentloaded"],
)
for scraper in scrapers:
print(await scraper.text())
Partial Failure Handling
scrapers = await GhostScraper.scrape_many(urls=urls, fail_fast=False)
for s in scrapers:
if s.error:
print(f"FAILED {s.url}: {s.error}")
else:
print(f"OK {s.url}: {await s.response_code()}")
Redirect Chain
scraper = GhostScraper(url="https://example.com/redirect")
print(await scraper.final_url())
for hop in await scraper.redirect_chain():
print(hop["status"], hop["url"])
Fetch Raw Bytes (CDN-protected resources)
body, status_code, headers = await GhostScraper.fetch_bytes(
"https://example.com/image.jpg",
cache=True,
)
Skip Retries on Terminal Status Codes
scraper = GhostScraper(url="https://example.com/missing", no_retry_on=[404, 410, 403])
print(await scraper.response_code())
DynamoDB Cache
scraper = GhostScraper(url="https://example.com", dynamodb_table="my-cache-table")
scrapers = await GhostScraper.scrape_many(urls=urls, dynamodb_table="my-cache-table")
Custom Browser Context
scraper = GhostScraper(
url="https://example.com",
context_args={"viewport": {"width": 1920, "height": 1080}, "user_agent": "..."},
wait_for_selectors=["#content", ".product-list"],
)
Memory-Efficient Batch Processing
For large batches, use on_scraped to process and discard each result as it arrives rather than holding all HTML in memory simultaneously:
results = []
async def handle(scraper: GhostScraper) -> None:
results.append(await scraper.text())
scraper._html = None # release — already persisted to cache
await GhostScraper.scrape_many(
urls=urls,
max_concurrent=10,
on_scraped=handle,
)
Dependencies
- playwright
- beautifulsoup4
- html2text
- newspaper4k
- python-slugify
- logorator
- cacherator
- lxml_html_clean
License
MIT. Contributions welcome: https://github.com/Redundando/ghostscraper