chrome-scraper

Stealth and stateful Chrome scraper, built for AI agents. Shared browser server with profile persistence, plus pre-built CLI scrapers for Google and x.com that dump content as HTML + clean markdown.

chrome-scraper is a Python package built on Patchright (a maintained Playwright fork) that runs a single long-lived Chrome instance behind a FastAPI HTTP API. Clients can connect concurrently, share cookies and sessions via a persistent profile, and avoid the cold-start/teardown cost of launching a browser per task.
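
Because the server speaks plain HTTP, any language with an HTTP client can talk to it. A minimal Python sketch that pings the /status endpoint (listed in the architecture diagram below) before kicking off work; the response shape here is an assumption, not documented API:

import requests

# Check that the shared browser server is up before scraping.
resp = requests.get("http://localhost:9333/status", timeout=5)
resp.raise_for_status()
print(resp.json())  # assumed to be a JSON status blob; exact shape not documented here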

Install

# Core package
uv add chrome-scraper

# Or install from source
uv sync

Supports Python 3.10–3.13.

Quickstart

# 1. Start the shared browser server (keep running)
uv run browser-api [--headless]

# 2. In another terminal — render a URL to markdown
html-to-md https://example.com --output out/example.md

# 3. Search Google and download results
google-fetch --query "machine learning" --max-results 3 --num-pages 1 --out-dir data/research/ml

# 4. Search x.com and download tweets
xcom-fetch --query "ai safety" --max-results 10 --out-dir data/research/xcom-ai

browser-api is a persistent server — start it once, run scrapers against it from multiple terminals or scripts. Chrome keeps cookies, logins, and profile state across sessions.

Architecture

┌───────────────────────────────────────────────────────┐
│                browser-api (port 9333)                │
│  ┌──────────────────────────────────────────────────┐ │
│  │  FastAPI server                                  │ │
│  │  /status  /tabs  /tabs/{id}/goto  /eval  /type   │ │
│  └──────────┬───────────────────────────────────────┘ │
│             │ owns                                    │
│  ┌──────────▼───────────────────────────────────────┐ │
│  │  Chrome (Patchright persistent context)          │ │
│  │  Profile                                         │ │
│  │  Tabs: separate label-keyed sandboxes            │ │
│  └──────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────┘
         ▲ HTTP                         ▲ HTTP
         │                              │
┌────────┴──────────┐    ┌──────────────┴──────────────┐
│  html-to-md       │    │  google-fetch / xcom-fetch  │
│  short-lived      │    │  short-lived CLIs           │
│  open tab →       │    │  open tab → search →        │
│  extract → close  │    │  fetch each result → close  │
└───────────────────┘    └─────────────────────────────┘

The server patches Patchright's crBrowser.js so new tabs open in the background — Chrome never steals focus during concurrent scraping.
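
In practice each short-lived CLI is just a sequence of calls against those endpoints. A rough illustration of that flow; the JSON bodies and the tab-label convention are assumptions for the sketch, not the documented wire format:

import requests

BASE = "http://localhost:9333"

# Hypothetical tab lifecycle against the endpoints in the diagram above.
requests.post(f"{BASE}/tabs", json={"label": "demo"}, timeout=10)   # open a labelled tab
requests.post(f"{BASE}/tabs/demo/goto",                             # navigate it
              json={"url": "https://example.com"}, timeout=30)
result = requests.post(f"{BASE}/eval",                              # evaluate JS in the tab
                       json={"tab": "demo", "expression": "document.title"},
                       timeout=10)
print(result.json())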

CLI reference

browser-api

Start, stop, and check status of the shared browser server.

uv run browser-api                       # start on :9333 (default)
uv run browser-api --port 8080           # custom port
uv run browser-api --headless            # headless mode
uv run browser-api --chrome-path /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome
uv run browser-api --profile-dir ~/my-chrome-profile
uv run browser-api --hide                # hide Chrome window (macOS)
uv run browser-api --proxy http://proxy:8080
uv run browser-api --browser-args="--disable-gpu --no-sandbox"
uv run browser-api status                # check if running
uv run browser-api stop                  # shut down

Stateful features:

  • Persistent profile — cookies, localStorage, and logins survive restarts. Profile lives at ~/Library/Application Support/thebase/playwright/profile/ on macOS.
  • Headless identity — when --headless is used, the server automatically probes a throwaway headless instance, strips the HeadlessChrome token from the User-Agent, and passes the clean UA to the real persistent context (see the sketch after this list).
  • Hide macOS window — --hide runs osascript to hide Chrome from sight without quitting (non-headless only).
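
The UA cleaning itself amounts to string surgery on the probed User-Agent. A toy sketch of the idea, not the package's actual code:

def clean_headless_ua(ua: str) -> str:
    # Turn "HeadlessChrome/124.0" into "Chrome/124.0" so the persistent
    # context presents a normal Chrome token. Illustrative only; the real
    # server reads the UA from a throwaway headless probe first.
    return ua.replace("HeadlessChrome", "Chrome")

print(clean_headless_ua("Mozilla/5.0 (...) HeadlessChrome/124.0.0.0 Safari/537.36"))
# -> Mozilla/5.0 (...) Chrome/124.0.0.0 Safari/537.36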

html-to-md

Render any URL (or local HTML file) as layout-preserving markdown.

# Live URL
html-to-md https://example.com --output out/page.md

# Local HTML file
html-to-md --from-file page.html --output out/page.md

# Print to stdout
html-to-md https://example.com --output -

# Save raw text-node payload alongside markdown
html-to-md https://example.com --save-items

# Skip scroll pass (for already-scrolled SPAs)
html-to-md https://example.com --no-scroll

# Verbose layout diagnostics
html-to-md https://example.com -v

# Custom browser-api URL
html-to-md https://example.com --browser-api http://localhost:9333

Rendering preserves multi-column layout, code blocks, headings, links, inline code, and list structure. Sidebar content is separated by a --- rule. See docs/layout.md for details on the layout-to-markdown algorithm.

google-fetch

Search Google and download each result as HTML + markdown.

# Basic search, 1 page
google-fetch --query "quantum computing"

# Multi-page with output dir
google-fetch --query "machine learning transformers" \
  --num-pages 3 --out-dir data/research/transformers

# Filter by hostname
google-fetch --query "python typing" \
  --allowed-hosts docs.python.org peps.python.org

# Limit total results
google-fetch --query "Rust async" --max-results 5

# Custom browser-api server
google-fetch --query "agents" --browser-api http://localhost:9333

Output layout:

data/research/<query-slug>/<tag>/
├── results.json           # title + URL index
├── 01-introduction-to.html
├── 01-introduction-to.md   # frontmatter + rendered markdown
├── 02-advanced-topics.html
├── 02-advanced-topics.md
└── ...
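
Downstream code can consume that layout directly. A minimal sketch, assuming results.json holds a JSON list of result entries (the exact schema isn't documented here) and using a hypothetical query slug:

import json
from pathlib import Path

run_dir = Path("data/research/machine-learning-transformers")  # hypothetical <query-slug>
for tag_dir in sorted(d for d in run_dir.iterdir() if d.is_dir()):
    index = json.loads((tag_dir / "results.json").read_text(encoding="utf-8"))
    print(f"{tag_dir.name}: {len(index)} results")
    for md in sorted(tag_dir.glob("*.md")):
        print("  ", md.name)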

xcom-fetch

Search x.com and download tweets as HTML + markdown.

# Keyword search
xcom-fetch --query "reinforcement learning"

# Restrict to one account
xcom-fetch --query "safety" --from "Anthropic"

# Limit results
xcom-fetch --query "alignment" --max-results 5

# Custom output dir
xcom-fetch --query "scaling laws" --out-dir data/tweets

Output layout:

data/research/xcom-<query>/
├── results.json           # permalink + author + text snippet
├── 01-anthropic-12345.html
├── 01-anthropic-12345.md   # frontmatter + rendered markdown
└── ...

Python API

from pathlib import Path
from chrome_scraper.html_to_md import extract_from_url, render_page

# Extract text-node payload from a URL
payload = extract_from_url(
    "https://example.com",
    browser_api_url="http://localhost:9333",
    timeout=30.0,
    scroll=True,
)

# Render to layout-preserving markdown
items = payload.get("items", [])
page_width = (payload.get("viewport") or {}).get("scroll_w", 1280)
md = render_page(items, page_width)
Path("out/example.md").write_text(md, encoding="utf-8")

Or manage the browser lifecycle yourself:

from chrome_scraper.browser_api.client import BrowserAPIClient
from chrome_scraper.html_to_md.extract import extract_page

client = BrowserAPIClient(timeout=30.0)
with client.tab("my-tab"):
    payload = extract_page(
        "https://example.com",
        client,
        tab_ref="my-tab",
        timeout=30.0,
        scroll=True,
    )
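
The extracted payload then feeds the same rendering step shown above:

from pathlib import Path
from chrome_scraper.html_to_md import render_page

items = payload.get("items", [])
page_width = (payload.get("viewport") or {}).get("scroll_w", 1280)
Path("out/example.md").write_text(render_page(items, page_width), encoding="utf-8")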

At a glance

browser-api — shared Chrome behind HTTP:

  • Persistent profile with cookies/logins.
  • Label-keyed tabs for concurrent clients.
  • Tab lifecycle isolated per client (open → use → close).
  • Background-tab patch so Chrome stays out of the way.
  • Headless-mode UA cleaning (strips HeadlessChrome).
  • macOS hide support (--hide).

html-to-md — layout-preserving markdown via Chrome CDP:

  • Extracts every rendered text node with position and styling.
  • Detects columns via x-start histogram peaks (toy sketch after this list).
  • Splits main/sidebar content via widest vertical gutter.
  • Preserves code blocks, headings, links, inline code, lists.
  • Row boundaries computed from intersecting column gap sets — long main-column paragraphs stay intact regardless of sidebar density.
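
The column detection is easy to picture: bucket every text node's x-start and treat the tallest histogram buckets as column edges. A toy sketch under an assumed payload shape (dicts with an "x" key), not the package's implementation:

from collections import Counter

def column_starts(items, bucket=16, min_share=0.1):
    # Histogram of x-starts, quantized to `bucket` px; buckets holding at
    # least `min_share` of all nodes are treated as column start positions.
    hist = Counter(round(item["x"] / bucket) * bucket for item in items)
    threshold = max(1, int(min_share * sum(hist.values())))
    return sorted(x for x, n in hist.items() if n >= threshold)

# Two columns: most nodes start near x=0 or x=640.
nodes = [{"x": 2}, {"x": 4}, {"x": 641}, {"x": 638}, {"x": 3}, {"x": 642}]
print(column_starts(nodes))  # -> [0, 640]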

google-fetch — multi-page Google scraping:

  • Paginates through Google result pages.
  • Visits each result link, dumps outerHTML + rendered markdown.
  • Navigates back to search results after each fetch.
  • Optional hostname filtering and result count limits.

xcom-fetch — x.com tweet scraping:

  • Drives x.com's React search UI via native keyboard input (Patchright).
  • Virtual list scrolling to populate results.
  • Visits each tweet permalink, dumps HTML + markdown.
  • SPA-safe navigation; falls back to direct URL navigation if anchor click fails.
