rolling-reader

Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction.

Install

pip install rolling-reader

Python 3.11+. No Node.js required.

Note: you do not need to run `playwright install chromium`. rolling-reader connects to your existing Chrome browser; it does not download or manage a browser of its own.

Quick start

Static pages work immediately after install:

rr https://news.ycombinator.com/
rr https://arxiv.org/abs/1706.03762 --clean   # article body only

SPA and login-required pages require Chrome running with remote debugging:

# Step 1: start Chrome with remote debugging (once per session; quit any
#         running Chrome first, otherwise the flag is ignored)
#   macOS:   open -a "Google Chrome" --args --remote-debugging-port=9222
#   Windows: chrome --remote-debugging-port=9222
#   Linux:   google-chrome --remote-debugging-port=9222

# Step 2: scrape — rolling-reader reuses your existing session and cookies
rr https://app.example.com/dashboard
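
Before running Step 2 it can help to confirm that Chrome is actually listening. The sketch below is not part of rolling-reader; it only queries the standard `/json/version` endpoint that Chrome serves when remote debugging is enabled. The function name `cdp_version` is hypothetical, and the default endpoint is taken from the `--cdp` default listed under CLI options.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def cdp_version(endpoint: str = "http://localhost:9222", timeout: float = 2.0) -> Optional[dict]:
    """Return Chrome's /json/version payload, or None if nothing usable answers."""
    url = endpoint.rstrip("/") + "/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not available".
        return None
```

Calling `cdp_version()` returns a dictionary with keys such as `Browser` and `webSocketDebuggerUrl` when Chrome is reachable, and `None` otherwise, in which case Step 1 needs to be repeated.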

How it works

Level	Trigger	Speed
1 HTTP	Standard SSR page	~500 ms
2 CDP	SPA, JS rendering required, or auth-gated	~3 s
3 JS State	Next.js / Nuxt / Redux / Remix state variable detected	~1 s (3–4× faster than Level 2 DOM)

The dispatcher tries the levels in order and stops at the first one that returns usable content. Level 3 runs inside Level 2 rather than after it: if a known JS state variable is found, DOM parsing is skipped entirely.

Levels 2 and 3 reuse your existing Chrome session, including cookies and local storage. No separate login step or credential storage is required.
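
The fall-through order described above can be sketched roughly as follows. This is an illustration of the control flow only, not rolling-reader's actual code: `dispatch`, `Result`, and the three callables are hypothetical names, and "usable content" is simplified here to "a non-empty string".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    level: int    # 1 = HTTP, 2 = CDP DOM, 3 = JS state
    content: str


def dispatch(
    url: str,
    fetch_http: Callable[[str], Optional[str]],
    read_js_state: Callable[[str], Optional[str]],
    read_dom: Callable[[str], Optional[str]],
) -> Optional[Result]:
    # Level 1: plain HTTP request.
    content = fetch_http(url)
    if content:
        return Result(1, content)
    # Level 2 opens the page in Chrome. Level 3 is tried first inside it,
    # so a detected state variable means the DOM is never parsed.
    state = read_js_state(url)
    if state:
        return Result(3, state)
    dom = read_dom(url)
    if dom:
        return Result(2, dom)
    return None  # no level produced usable content
```

Passing the three strategies in as callables keeps the ordering logic separate from the fetching logic, which makes the order easy to test with stubs.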

CLI options

Flag	Description
`--clean` / `-c`	Extract article body only (removes nav, ads, footers)
`--output json|md`	Output format (default: json)
`--force-level 1|2|3`	Skip auto-detection, force a specific level
`--json-path <path>`	Extract a nested field, e.g. `title` or `props.pageProps`
`--no-cache`	Bypass profile cache, always re-explore
`--cdp <endpoint>`	Chrome DevTools endpoint (default: `http://localhost:9222`)
`--verbose` / `-v`	Print level selection and timing to stderr
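
To make the `--json-path` examples concrete, here is one way a dotted path such as `props.pageProps` could be resolved against parsed JSON. It is a sketch under assumptions: `extract_path` is a hypothetical helper, and treating numeric segments as list indexes is a guess about behavior the flag description does not specify.

```python
from typing import Any


def extract_path(data: Any, path: str) -> Any:
    """Follow a dotted path such as 'props.pageProps' through nested dicts and lists."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict):
            current = current[key]        # raises KeyError if the field is missing
        elif isinstance(current, list):
            current = current[int(key)]   # assumption: numeric segments index lists
        else:
            raise KeyError(f"cannot descend into {type(current).__name__} at '{key}'")
    return current
```

With a payload of `{"props": {"pageProps": {"items": [1, 2]}}}`, the path `props.pageProps` yields the inner `{"items": [1, 2]}` object.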

Batch scraping

# Multiple URLs as arguments
rr batch https://example.com https://news.ycombinator.com/

# From a file (one URL per line, # for comments)
rr batch urls.txt

# Pipe-friendly: data goes to stdout, progress to stderr
rr batch urls.txt --clean > results.jsonl

# Control concurrency (default: 3)
rr batch urls.txt --concurrency 10
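
The URL-file format and the concurrency limit can be illustrated with a short sketch. Nothing here is taken from rolling-reader's source: `read_urls`, `run_batch`, and the `scrape` callable are hypothetical, and the sketch assumes `#` marks whole-line comments only.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


def read_urls(lines: Iterable[str]) -> list[str]:
    """Keep one URL per line; skip blank lines and lines starting with '#'."""
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


async def run_batch(
    urls: list[str],
    scrape: Callable[[str], Awaitable[dict]],
    concurrency: int = 3,
) -> list[dict]:
    """Run `scrape` over all URLs with at most `concurrency` calls in flight."""
    limit = asyncio.Semaphore(concurrency)

    async def one(url: str) -> dict:
        async with limit:
            return await scrape(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(one(u) for u in urls))
```

A semaphore caps the number of simultaneous requests without splitting the work into fixed batches, so a slow URL does not hold up the others.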

Why not X

Tool	Limitation
Scrapling	Cannot reuse an existing logged-in Chrome session; no JS state extraction
Firecrawl	Cloud API — data leaves your machine, metered pricing
Jina Reader	Cloud API — data leaves your machine, metered pricing
rolling-reader	Fully local, reuses your Chrome session and cookies, free forever

Supported JS state variables (v0.2+)

The following window.* variables are probed automatically for Level 3 extraction:

window.__NEXT_DATA__ — Next.js
window.__NUXT__ — Nuxt.js
window.__PRELOADED_STATE__ — Redux / custom
window.__INITIAL_STATE__ — various frameworks
window.__REDUX_STATE__ — Redux
window.__APP_STATE__ — various
window.__STATE__ — generic
window.__STORE__ — MobX / custom
window.APP_STATE — no-underscore variant
window.initialState — camelCase variant
window.__remixContext — Remix
window.__staticRouterHydrationData — React Router v6 SSR

Variables not listed above are also detected by an automatic scan for assignments of the form window.VAR = {…}.
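
As an offline illustration of the automatic scan, the sketch below looks for `window.VAR = {...}` assignments in raw HTML and keeps those whose right-hand side parses as strict JSON. This is an assumption about what such a scan might look like, not rolling-reader's implementation, which works against a live Chrome page; `scan_state` and `ASSIGNMENT` are hypothetical names.

```python
import json
import re

# Matches "window.NAME =" only when the next character is an opening brace.
ASSIGNMENT = re.compile(r"window\.([A-Za-z_$][\w$]*)\s*=\s*(?=\{)")


def scan_state(html: str) -> dict:
    """Find `window.VAR = {...}` assignments whose value is valid JSON."""
    found = {}
    decoder = json.JSONDecoder()
    for match in ASSIGNMENT.finditer(html):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it (such as a trailing semicolon).
            value, _ = decoder.raw_decode(html, match.end())
        except json.JSONDecodeError:
            continue  # a JS object literal that is not strict JSON; skipped here
        found[match.group(1)] = value
    return found
```

Object literals with unquoted keys or trailing commas are valid JavaScript but not valid JSON, so this sketch skips them; evaluating the variable in the browser avoids that limitation.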

License

MIT

Project details

Release history

Version	Released
0.6.6	Apr 16, 2026
0.6.5	Apr 16, 2026
0.6.4	Apr 16, 2026
0.6.3 (this version)	Apr 16, 2026
0.6.2	Apr 16, 2026
0.6.1	Apr 16, 2026
0.6.0	Apr 16, 2026
0.5.2	Apr 16, 2026
0.5.1	Apr 16, 2026
0.5.0	Apr 16, 2026
0.4.0	Apr 16, 2026
0.3.0	Apr 16, 2026
0.2.0	Apr 16, 2026
0.1.0	Apr 16, 2026

Download files

Source Distribution

rolling_reader-0.6.3.tar.gz (20.6 kB)

Uploaded Apr 16, 2026 Source

Built Distribution

rolling_reader-0.6.3-py3-none-any.whl (26.9 kB)

Uploaded Apr 16, 2026 Python 3

File details

Details for the file rolling_reader-0.6.3.tar.gz.

File metadata

Download URL: rolling_reader-0.6.3.tar.gz
Upload date: Apr 16, 2026
Size: 20.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for rolling_reader-0.6.3.tar.gz
Algorithm	Hash digest
SHA256	`f0e5d2197e1d6ee030cbcf2545bd17cf0417bfbbd3f7c48484b14cf768a24c81`
MD5	`ce592c21662320271d56393c09e357a8`
BLAKE2b-256	`d8113c3dc3c733d8757433383d89686bfefeb7a789d0ee28d736972e7d3412c1`

File details

Details for the file rolling_reader-0.6.3-py3-none-any.whl.

File metadata

Download URL: rolling_reader-0.6.3-py3-none-any.whl
Upload date: Apr 16, 2026
Size: 26.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for rolling_reader-0.6.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0cb6c0fd283af112c1c1b6a3f1793a775cb828f4f831f4fe74476815ce9aff7b`
MD5	`6336097093177036a623e9d1f2e4bdb2`
BLAKE2b-256	`9213c255fa4e0196bc8a14aefa037c55361f9d336deb7ccf3a8cd43542291f1f`
