chrome-scraper
Stealth and stateful Chrome scraper, built for AI agents. A shared browser server with profile persistence, plus pre-built CLI scrapers for Google and x.com that dump content as HTML + clean markdown.
chrome-scraper is a Python package built on Patchright (a maintained Playwright fork) that runs a single long-lived Chrome instance behind a FastAPI HTTP API. Clients can connect concurrently, share cookies/sessions via a persistent profile, and avoid the cold-start/teardown cost of per-task browser launch.
Install
```bash
# Core package
uv add chrome-scraper

# Or install from source
uv sync
```
Supports Python 3.10–3.13.
Quickstart
```bash
# 1. Start the shared browser server (keep running)
uv run browser-api [--headless]

# 2. In another terminal — render a URL to markdown
html-to-md https://example.com --output out/example.md

# 3. Search Google and download results
google-fetch --query "machine learning" --max-results 3 --num-pages 1 --out-dir data/research/ml

# 4. Search x.com and download tweets
xcom-fetch --query "ai safety" --max-results 10 --out-dir data/research/xcom-ai
```
`browser-api` is a persistent server — start it once, then run scrapers against it from multiple terminals or scripts. Chrome keeps cookies, logins, and profile state across sessions.
Architecture
```
┌──────────────────────────────────────────────────────┐
│               browser-api (port 9333)                │
│ ┌──────────────────────────────────────────────────┐ │
│ │ FastAPI server                                   │ │
│ │ /status  /tabs  /tabs/{id}/goto  /eval  /type    │ │
│ └──────────┬───────────────────────────────────────┘ │
│            │ owns                                    │
│ ┌──────────▼───────────────────────────────────────┐ │
│ │ Chrome (Patchright persistent context)           │ │
│ │ Profile                                          │ │
│ │ Tabs: separate label-keyed sandboxes             │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
         ▲ HTTP                      ▲ HTTP
         │                           │
┌────────┴──────────┐  ┌──────────────┴──────────────┐
│ html-to-md        │  │ google-fetch / xcom-fetch   │
│ short-lived       │  │ short-lived CLIs            │
│ open tab →        │  │ open tab → search →         │
│ extract → close   │  │ fetch each result → close   │
└───────────────────┘  └─────────────────────────────┘
```
The server patches Patchright's `crBrowser.js` so new tabs open in the background — Chrome never steals focus during concurrent scraping.
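The endpoints in the diagram can be driven directly over HTTP. A minimal sketch using only the standard library — the endpoint paths come from the diagram above, but the request method and JSON body shape shown here are assumptions, not the documented wire format:

```python
# Hedged sketch: building requests against the browser-api HTTP endpoints.
# Payload shapes are illustrative assumptions, not the real protocol.
import json
import urllib.request

BASE = "http://localhost:9333"


def status_request() -> urllib.request.Request:
    """GET /status — check whether the shared server is up."""
    return urllib.request.Request(f"{BASE}/status")


def goto(tab_id: str, url: str) -> urllib.request.Request:
    """POST /tabs/{id}/goto — ask a labeled tab to navigate."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/tabs/{tab_id}/goto",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending a request would look like:
#   urllib.request.urlopen(goto("my-tab", "https://example.com"))
```

In practice the bundled CLIs and `BrowserAPIClient` wrap these calls for you; raw HTTP is only needed for custom tooling.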
CLI reference
browser-api
Start, stop, and check status of the shared browser server.
```bash
uv run browser-api                  # start on :9333 (default)
uv run browser-api --port 8080      # custom port
uv run browser-api --headless       # headless mode
uv run browser-api --chrome-path /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome
uv run browser-api --profile-dir ~/my-chrome-profile
uv run browser-api --hide           # hide Chrome window (macOS)
uv run browser-api --proxy http://proxy:8080
uv run browser-api --browser-args="--disable-gpu --no-sandbox"

uv run browser-api status           # check if running
uv run browser-api stop             # shut down
```
Stateful features:
- Persistent profile — cookies, localStorage, and logins survive restarts. The profile lives at `~/Library/Application Support/thebase/playwright/profile/` on macOS.
- Headless identity — when `--headless` is used, the server automatically probes a throwaway headless instance, strips the `HeadlessChrome` token from the User-Agent, and passes the clean UA to the real persistent context.
- Hide macOS window — `--hide` runs `osascript` to hide Chrome from sight without quitting (non-headless only).
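The UA-cleaning step amounts to rewriting one product token. A toy illustration of the idea — this is a sketch, not the server's actual implementation:

```python
import re


def clean_user_agent(ua: str) -> str:
    """Replace the HeadlessChrome product token with plain Chrome,
    so sites see the UA string a headed browser would send."""
    return re.sub(r"HeadlessChrome", "Chrome", ua)


ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "HeadlessChrome/120.0.0.0 Safari/537.36"
)
cleaned = clean_user_agent(ua)
# "HeadlessChrome/120.0.0.0" becomes "Chrome/120.0.0.0"; everything else is unchanged
```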
html-to-md
Render any URL (or local HTML file) as layout-preserving markdown.
```bash
# Live URL
html-to-md https://example.com --output out/page.md

# Local HTML file
html-to-md --from-file page.html --output out/page.md

# Print to stdout
html-to-md https://example.com --output -

# Save raw text-node payload alongside markdown
html-to-md https://example.com --save-items

# Skip scroll pass (for already-scrolled SPAs)
html-to-md https://example.com --no-scroll

# Verbose layout diagnostics
html-to-md https://example.com -v

# Custom browser-api URL
html-to-md https://example.com --browser-api http://localhost:9333
```
Rendering preserves multi-column layout, code blocks, headings, links, inline code, and list structure. Sidebar content is separated by a `---` rule. See `docs/layout.md` for details on the layout-to-markdown algorithm.
google-fetch
Search Google and download each result as HTML + markdown.
```bash
# Basic search, 1 page
google-fetch --query "quantum computing"

# Multi-page with output dir
google-fetch --query "machine learning transformers" \
  --num-pages 3 --out-dir data/research/transformers

# Filter by hostname
google-fetch --query "python typing" \
  --allowed-hosts docs.python.org peps.python.org

# Limit total results
google-fetch --query "Rust async" --max-results 5

# Custom browser-api server
google-fetch --query "agents" --browser-api http://localhost:9333
```
Output layout:
```
data/research/<query-slug>/<tag>/
├── results.json              # title + URL index
├── 01-introduction-to.html
├── 01-introduction-to.md     # frontmatter + rendered markdown
├── 02-advanced-topics.html
├── 02-advanced-topics.md
└── ...
```
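Downstream tooling can consume `results.json` directly. A hedged sketch, assuming the index is a JSON list of objects with `title` and `url` keys (the exact schema is not documented here and may differ):

```python
import json
from pathlib import Path


def load_results(run_dir: str) -> list[tuple[str, str]]:
    """Read the title + URL index written next to the dumped pages."""
    raw = Path(run_dir, "results.json").read_text(encoding="utf-8")
    return [(entry["title"], entry["url"]) for entry in json.loads(raw)]


# Pair each index entry with its numbered .md dump, e.g. to feed an agent:
#   for i, (title, url) in enumerate(load_results("data/research/ml/run1"), 1):
#       md = Path("data/research/ml/run1", f"{i:02d}-...").  # slugged filename
```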
xcom-fetch
Search x.com and download tweets as HTML + markdown.
```bash
# Keyword search
xcom-fetch --query "reinforcement learning"

# Restrict to one account
xcom-fetch --query "safety" --from "Anthropic"

# Limit results
xcom-fetch --query "alignment" --max-results 5

# Custom output dir
xcom-fetch --query "scaling laws" --out-dir data/tweets
```
Output layout:
```
data/research/xcom-<query>/
├── results.json             # permalink + author + text snippet
├── 01-anthropic-12345.html
├── 01-anthropic-12345.md    # frontmatter + rendered markdown
└── ...
```
Python API
```python
from pathlib import Path

from chrome_scraper.html_to_md import extract_from_url, render_page

# Extract text-node payload from a URL
payload = extract_from_url(
    "https://example.com",
    browser_api_url="http://localhost:9333",
    timeout=30.0,
    scroll=True,
)

# Render to layout-preserving markdown
items = payload.get("items", [])
page_width = (payload.get("viewport") or {}).get("scroll_w", 1280)
md = render_page(items, page_width)
Path("out/example.md").write_text(md, encoding="utf-8")
```
Or manage the browser lifecycle yourself:
```python
from chrome_scraper.browser_api.client import BrowserAPIClient
from chrome_scraper.html_to_md.extract import extract_page

client = BrowserAPIClient(timeout=30.0)
with client.tab("my-tab"):
    payload = extract_page(
        "https://example.com",
        client,
        tab_ref="my-tab",
        timeout=30.0,
        scroll=True,
    )
```
At a glance
browser-api — shared Chrome behind HTTP:
- Persistent profile with cookies/logins.
- Label-keyed tabs for concurrent clients.
- Tab lifecycle isolated per client (open → use → close).
- Background-tab patch so Chrome stays out of the way.
- Headless-mode UA cleaning (strips `HeadlessChrome`).
- macOS hide support (`--hide`).
html-to-md — layout-preserving markdown via Chrome CDP:
- Extracts every rendered text node with position and styling.
- Detects columns via x-start histogram peaks.
- Splits main/sidebar content via widest vertical gutter.
- Preserves code blocks, headings, links, inline code, lists.
- Row boundaries computed from intersecting column gap sets — long main-column paragraphs stay intact regardless of sidebar density.
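The x-start histogram idea can be sketched in a few lines. The real detector in html-to-md is more involved (tolerances, gutter analysis, row grouping), so treat this as illustrative only; the bucket width and peak threshold below are made-up parameters:

```python
from collections import Counter


def column_starts(x_starts: list[float], bucket: int = 8, min_count: int = 3) -> list[int]:
    """Bucket text-node x positions into fixed-width bins and keep bins
    with enough hits: each surviving peak is a candidate column's left edge."""
    hist = Counter(int(x // bucket) * bucket for x in x_starts)
    return sorted(b for b, n in hist.items() if n >= min_count)


# Text nodes clustered near x=40 (main column) and x=520 (sidebar),
# plus one stray node at x=300 that no column claims
xs = [40, 41, 42, 43, 40, 520, 521, 522, 523, 300]
print(column_starts(xs))  # → [40, 520]
```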
google-fetch — multi-page Google scraping:
- Paginates through Google result pages.
- Visits each result link, dumps outerHTML + rendered markdown.
- Navigates back to search results after each fetch.
- Optional hostname filtering and result count limits.
xcom-fetch — x.com tweet scraping:
- Drives x.com's React search UI via native keyboard (Patchright).
- Virtual list scrolling to populate results.
- Visits each tweet permalink, dumps HTML + markdown.
- SPA-safe navigation; falls back to direct URL navigation if anchor click fails.
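The click-then-fallback pattern is generic enough to sketch without the real client. `click_anchor` and `goto_url` below are hypothetical callables standing in for Patchright operations; the actual xcom-fetch logic is not shown here:

```python
from typing import Callable


def navigate_spa(
    click_anchor: Callable[[], bool],
    goto_url: Callable[[str], None],
    url: str,
) -> str:
    """Prefer an in-page anchor click (keeps the SPA's state warm);
    fall back to direct URL navigation if the click fails or raises."""
    try:
        if click_anchor():
            return "clicked"
    except Exception:
        pass  # detached anchor, virtual-list re-render, etc.
    goto_url(url)
    return "direct"
```

The same shape works for any virtualized UI where DOM nodes can detach between query and click.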