chrome-scraper

Stealth and stateful Chrome scraper, built for AI agents. Shared browser server with profile persistence, plus pre-built CLI scrapers for Google and x.com that dump content as HTML + clean markdown.

chrome-scraper is a Python package built on Patchright (a maintained Playwright fork) that runs a single long-lived Chrome instance behind a FastAPI HTTP API. Clients can connect concurrently, share cookies and sessions via a persistent profile, and avoid the cold-start/teardown cost of launching a browser per task.
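
Because the server speaks plain HTTP, any language with an HTTP client can talk to it. A minimal Python sketch that pings the /status endpoint (listed in the architecture diagram below) before kicking off work; the response shape here is an assumption, not documented API:

import requests

# Check that the shared browser server is up before scraping.
resp = requests.get("http://localhost:9333/status", timeout=5)
resp.raise_for_status()
print(resp.json())  # assumed to be a JSON status blob; exact shape not documented here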

Install

# Core package
uv add chrome-scraper

# Or install from source
uv sync

Supports Python 3.10–3.13.

Quickstart

# 1. Start the shared browser server (keep running)
uv run browser-api [--headless]

# 2. In another terminal — render a URL to markdown
html-to-md https://example.com --output out/example.md

# 3. Search Google and download results
google-fetch --query "machine learning" --max-results 3 --num-pages 1 --out-dir data/research/ml

# 4. Search x.com and download tweets
xcom-fetch --query "ai safety" --max-results 10 --out-dir data/research/xcom-ai

browser-api is a persistent server — start it once, run scrapers against it from multiple terminals or scripts. Chrome keeps cookies, logins, and profile state across sessions.

Architecture

┌───────────────────────────────────────────────────────┐
│                browser-api (port 9333)                │
│  ┌──────────────────────────────────────────────────┐ │
│  │  FastAPI server                                  │ │
│  │  /status  /tabs  /tabs/{id}/goto  /eval  /type   │ │
│  └──────────┬───────────────────────────────────────┘ │
│             │ owns                                    │
│  ┌──────────▼───────────────────────────────────────┐ │
│  │  Chrome (Patchright persistent context)          │ │
│  │  Profile                                         │ │
│  │  Tabs: separate label-keyed sandboxes            │ │
│  └──────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────┘
         ▲ HTTP                         ▲ HTTP
         │                              │
┌────────┴──────────┐    ┌──────────────┴──────────────┐
│  html-to-md       │    │  google-fetch / xcom-fetch  │
│  short-lived      │    │  short-lived CLIs           │
│  open tab →       │    │  open tab → search →        │
│  extract → close  │    │  fetch each result → close  │
└───────────────────┘    └─────────────────────────────┘

The server patches Patchright's crBrowser.js so new tabs open in the background — Chrome never steals focus during concurrent scraping.
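
In practice each short-lived CLI is just a sequence of calls against those endpoints. A rough illustration of that flow; the JSON bodies and the tab-label convention are assumptions for the sketch, not the documented wire format:

import requests

BASE = "http://localhost:9333"

# Hypothetical tab lifecycle against the endpoints in the diagram above.
requests.post(f"{BASE}/tabs", json={"label": "demo"}, timeout=10)   # open a labelled tab
requests.post(f"{BASE}/tabs/demo/goto",                             # navigate it
              json={"url": "https://example.com"}, timeout=30)
result = requests.post(f"{BASE}/eval",                              # evaluate JS in the tab
                       json={"tab": "demo", "expression": "document.title"},
                       timeout=10)
print(result.json())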

CLI reference

browser-api

Start, stop, and check status of the shared browser server.

uv run browser-api                       # start on :9333 (default)
uv run browser-api --port 8080           # custom port
uv run browser-api --headless            # headless mode
uv run browser-api --chrome-path /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome
uv run browser-api --profile-dir ~/my-chrome-profile
uv run browser-api --hide                # hide Chrome window (macOS)
uv run browser-api --proxy http://proxy:8080
uv run browser-api --browser-args="--disable-gpu --no-sandbox"
uv run browser-api status                # check if running
uv run browser-api stop                  # shut down

Stateful features:

  • Persistent profile — cookies, localStorage, and logins survive restarts. Profile lives at ~/Library/Application Support/thebase/playwright/profile/ on macOS.
  • Headless identity — when --headless is used, the server automatically probes a throwaway headless instance, strips the HeadlessChrome token from the User-Agent, and passes the clean UA to the real persistent context (see the sketch after this list).
  • Hide macOS window — --hide runs osascript to hide Chrome from sight without quitting (non-headless only).
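
The UA cleaning itself amounts to string surgery on the probed User-Agent. A toy sketch of the idea, not the package's actual code:

def clean_headless_ua(ua: str) -> str:
    # Turn "HeadlessChrome/124.0" into "Chrome/124.0" so the persistent
    # context presents a normal Chrome token. Illustrative only; the real
    # server reads the UA from a throwaway headless probe first.
    return ua.replace("HeadlessChrome", "Chrome")

print(clean_headless_ua("Mozilla/5.0 (...) HeadlessChrome/124.0.0.0 Safari/537.36"))
# -> Mozilla/5.0 (...) Chrome/124.0.0.0 Safari/537.36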

html-to-md

Render any URL (or local HTML file) as layout-preserving markdown.

# Live URL
html-to-md https://example.com --output out/page.md

# Local HTML file
html-to-md --from-file page.html --output out/page.md

# Print to stdout
html-to-md https://example.com --output -

# Save raw text-node payload alongside markdown
html-to-md https://example.com --save-items

# Skip scroll pass (for already-scrolled SPAs)
html-to-md https://example.com --no-scroll

# Verbose layout diagnostics
html-to-md https://example.com -v

# Custom browser-api URL
html-to-md https://example.com --browser-api http://localhost:9333

Rendering preserves multi-column layout, code blocks, headings, links, inline code, and list structure. Sidebar content is separated by a --- rule. See docs/layout.md for details on the layout-to-markdown algorithm.

google-fetch

Search Google and download each result as HTML + markdown.

# Basic search, 1 page
google-fetch --query "quantum computing"

# Multi-page with output dir
google-fetch --query "machine learning transformers" \
  --num-pages 3 --out-dir data/research/transformers

# Filter by hostname
google-fetch --query "python typing" \
  --allowed-hosts docs.python.org peps.python.org

# Limit total results
google-fetch --query "Rust async" --max-results 5

# Custom browser-api server
google-fetch --query "agents" --browser-api http://localhost:9333

Output layout:

data/research/<query-slug>/<tag>/
├── results.json           # title + URL index
├── 01-introduction-to.html
├── 01-introduction-to.md   # frontmatter + rendered markdown
├── 02-advanced-topics.html
├── 02-advanced-topics.md
└── ...
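
Downstream code can consume that layout directly. A minimal sketch, assuming results.json holds a JSON list of result entries (the exact schema isn't documented here) and using a hypothetical query slug:

import json
from pathlib import Path

run_dir = Path("data/research/machine-learning-transformers")  # hypothetical <query-slug>
for tag_dir in sorted(d for d in run_dir.iterdir() if d.is_dir()):
    index = json.loads((tag_dir / "results.json").read_text(encoding="utf-8"))
    print(f"{tag_dir.name}: {len(index)} results")
    for md in sorted(tag_dir.glob("*.md")):
        print("  ", md.name)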

xcom-fetch

Search x.com and download tweets as HTML + markdown.

# Keyword search
xcom-fetch --query "reinforcement learning"

# Restrict to one account
xcom-fetch --query "safety" --from "Anthropic"

# Limit results
xcom-fetch --query "alignment" --max-results 5

# Custom output dir
xcom-fetch --query "scaling laws" --out-dir data/tweets

Output layout:

data/research/xcom-<query>/
├── results.json           # permalink + author + text snippet
├── 01-anthropic-12345.html
├── 01-anthropic-12345.md   # frontmatter + rendered markdown
└── ...

Python API

from pathlib import Path
from chrome_scraper.html_to_md import extract_from_url, render_page

# Extract text-node payload from a URL
payload = extract_from_url(
    "https://example.com",
    browser_api_url="http://localhost:9333",
    timeout=30.0,
    scroll=True,
)

# Render to layout-preserving markdown
items = payload.get("items", [])
page_width = (payload.get("viewport") or {}).get("scroll_w", 1280)
md = render_page(items, page_width)
Path("out/example.md").write_text(md, encoding="utf-8")

Or manage the browser lifecycle yourself:

from chrome_scraper.browser_api.client import BrowserAPIClient
from chrome_scraper.html_to_md.extract import extract_page

client = BrowserAPIClient(timeout=30.0)
with client.tab("my-tab"):
    payload = extract_page(
        "https://example.com",
        client,
        tab_ref="my-tab",
        timeout=30.0,
        scroll=True,
    )
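
The extracted payload then feeds the same rendering step shown above:

from pathlib import Path
from chrome_scraper.html_to_md import render_page

items = payload.get("items", [])
page_width = (payload.get("viewport") or {}).get("scroll_w", 1280)
Path("out/example.md").write_text(render_page(items, page_width), encoding="utf-8")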

At a glance

browser-api — shared Chrome behind HTTP:

  • Persistent profile with cookies/logins.
  • Label-keyed tabs for concurrent clients.
  • Tab lifecycle isolated per client (open → use → close).
  • Background-tab patch so Chrome stays out of the way.
  • Headless-mode UA cleaning (strips HeadlessChrome).
  • macOS hide support (--hide).

html-to-md — layout-preserving markdown via Chrome CDP:

  • Extracts every rendered text node with position and styling.
  • Detects columns via x-start histogram peaks (toy sketch after this list).
  • Splits main/sidebar content via widest vertical gutter.
  • Preserves code blocks, headings, links, inline code, lists.
  • Row boundaries computed from intersecting column gap sets — long main-column paragraphs stay intact regardless of sidebar density.
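
The column detection is easy to picture: bucket every text node's x-start and treat the tallest histogram buckets as column edges. A toy sketch under an assumed payload shape (dicts with an "x" key), not the package's implementation:

from collections import Counter

def column_starts(items, bucket=16, min_share=0.1):
    # Histogram of x-starts, quantized to `bucket` px; buckets holding at
    # least `min_share` of all nodes are treated as column start positions.
    hist = Counter(round(item["x"] / bucket) * bucket for item in items)
    threshold = max(1, int(min_share * sum(hist.values())))
    return sorted(x for x, n in hist.items() if n >= threshold)

# Two columns: most nodes start near x=0 or x=640.
nodes = [{"x": 2}, {"x": 4}, {"x": 641}, {"x": 638}, {"x": 3}, {"x": 642}]
print(column_starts(nodes))  # -> [0, 640]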

google-fetch — multi-page Google scraping:

  • Paginates through Google result pages.
  • Visits each result link, dumps outerHTML + rendered markdown.
  • Navigates back to search results after each fetch.
  • Optional hostname filtering and result count limits.

xcom-fetch — x.com tweet scraping:

  • Drives x.com's React search UI via native keyboard input (Patchright).
  • Virtual list scrolling to populate results.
  • Visits each tweet permalink, dumps HTML + markdown.
  • SPA-safe navigation; falls back to direct URL navigation if anchor click fails.
