rolling-reader
Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction.
Install
pip install rolling-reader
Python 3.11+. No Node.js required.
Note:
playwright install chromium is not needed. rolling-reader connects to your existing Chrome browser; it does not download or manage its own browser.
Quick start
Static pages — works immediately after install:
rr https://news.ycombinator.com/
rr https://arxiv.org/abs/1706.03762 --clean # article body only
SPA / login-required pages — requires Chrome running with remote debugging:
# Step 1: start Chrome with remote debugging (do this once per session)
# macOS: open -a "Google Chrome" --args --remote-debugging-port=9222
# Windows: chrome --remote-debugging-port=9222
# Linux: google-chrome --remote-debugging-port=9222
# Step 2: scrape — rolling-reader reuses your existing session and cookies
rr https://app.example.com/dashboard
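Before running Step 2, it can help to confirm that Chrome is actually listening on the debugging port. The sketch below queries the `/json/version` endpoint that Chrome serves when remote debugging is enabled. The function names are illustrative and not part of rolling-reader; the `fetch` parameter exists only so the check can be exercised without a live browser.

```python
import json
import urllib.request
from typing import Callable, Optional


def fetch_text(url: str, timeout: float = 2.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def chrome_debug_info(
    endpoint: str = "http://localhost:9222",
    fetch: Callable[[str], str] = fetch_text,
) -> Optional[dict]:
    """Return Chrome's /json/version payload, or None if nothing usable answers."""
    try:
        return json.loads(fetch(endpoint.rstrip("/") + "/json/version"))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers a non-JSON reply from some other service.
        return None
```

Calling `chrome_debug_info()` with no arguments returns a dictionary (including a `Browser` entry) when Chrome is reachable, and `None` when Step 1 still needs to be done.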
How it works
| Level | Trigger | Speed |
|---|---|---|
| 1 HTTP | Standard SSR page | ~500 ms |
| 2 CDP | SPA, JS rendering required, or auth-gated | ~3 s |
| 3 JS State | Next.js / Nuxt / Redux / Remix state variable detected | ~1 s (3–4× faster than Level 2 DOM) |
The dispatcher tries each level in order and stops at the first one that returns usable content. Level 3 is attempted inside Level 2 — if a known JS state variable is found, DOM parsing is skipped entirely.
Levels 2 and 3 reuse your existing Chrome session, including cookies and local storage. No separate login step or credential storage is required.
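The rolling order described above can be sketched in a few lines. This is a simplified model, not rolling-reader's actual dispatcher: the three step functions are hypothetical placeholders passed in by the caller, and "usable content" is assumed to mean any non-blank string.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Step = Callable[[str], Optional[str]]  # takes a URL, returns content or None


@dataclass
class Result:
    level: int    # 1 = HTTP, 2 = browser DOM, 3 = JS state
    content: str


def usable(content: Optional[str]) -> bool:
    """Assumed usability test: any non-blank string counts."""
    return bool(content and content.strip())


def roll(url: str, fetch_http: Step, read_js_state: Step, parse_dom: Step) -> Optional[Result]:
    """Try HTTP first, then the browser; inside the browser, prefer JS state over DOM."""
    html = fetch_http(url)
    if usable(html):
        return Result(1, html)
    # Levels 2 and 3 share one browser session, so state is checked before the DOM.
    state = read_js_state(url)
    if usable(state):
        return Result(3, state)  # DOM parsing is skipped entirely
    dom = parse_dom(url)
    if usable(dom):
        return Result(2, dom)
    return None
```

Passing the steps in as plain callables keeps the ordering logic separate from the networking and browser code, which makes the fallthrough behavior easy to test in isolation.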
CLI options
| Flag | Description |
|---|---|
| --clean / -c | Extract article body only (removes nav, ads, footers) |
| --output json\|md | Output format (default: json) |
| --force-level 1\|2\|3 | Skip auto-detection, force a specific level |
| --json-path <path> | Extract a nested field, e.g. title or props.pageProps |
| --no-cache | Bypass profile cache, always re-explore |
| --cdp <endpoint> | Chrome DevTools endpoint (default: http://localhost:9222) |
| --verbose / -v | Print level selection and timing to stderr |
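To illustrate what a path such as props.pageProps selects, here is a minimal sketch of dotted-path lookup over parsed JSON. The exact syntax rolling-reader accepts for --json-path is not specified here; treating numeric segments as list indexes is an assumption, and `get_path` is a hypothetical helper.

```python
from typing import Any


def get_path(data: Any, path: str) -> Any:
    """Follow a dot-separated path through nested dicts and lists."""
    current = data
    for part in path.split("."):
        if isinstance(current, dict):
            current = current[part]        # KeyError if the key is missing
        elif isinstance(current, list):
            current = current[int(part)]   # numeric segment indexes a list (assumption)
        else:
            raise KeyError(f"cannot descend into {type(current).__name__} at {part!r}")
    return current
```

With a page object like `{"title": "Home", "props": {"pageProps": {...}}}`, the path title selects the top-level string and props.pageProps selects the nested object.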
Batch scraping
# Multiple URLs as arguments
rr batch https://example.com https://news.ycombinator.com/
# From a file (one URL per line, # for comments)
rr batch urls.txt
# Pipe-friendly: data goes to stdout, progress to stderr
rr batch urls.txt --clean > results.jsonl
# Control concurrency (default: 3)
rr batch urls.txt --concurrency 10
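The batch behavior above (a URL file with # comments, a concurrency cap, one JSON record per line) can be approximated as follows. This is a sketch under assumptions, not the tool's implementation: `scrape` is a hypothetical async function supplied by the caller, the record fields `url`, `ok`, `data`, and `error` are invented for illustration, and only whole-line comments are handled.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Iterable


def read_url_lines(lines: Iterable[str]) -> list[str]:
    """Keep one URL per line; skip blank lines and lines starting with '#'."""
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


async def run_batch(
    urls: list[str],
    scrape: Callable[[str], Awaitable[Any]],
    concurrency: int = 3,
) -> list[str]:
    """Scrape with at most `concurrency` requests in flight; return JSONL lines in input order."""
    gate = asyncio.Semaphore(concurrency)

    async def one(url: str) -> str:
        async with gate:
            try:
                record = {"url": url, "ok": True, "data": await scrape(url)}
            except Exception as exc:
                # One failing URL should not abort the rest of the batch.
                record = {"url": url, "ok": False, "error": str(exc)}
        return json.dumps(record)

    return await asyncio.gather(*(one(u) for u in urls))
```

A semaphore is used here instead of fixed chunks so that a slow URL does not hold up the other slots while it finishes.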
Why not X
| Tool | Limitation |
|---|---|
| Scrapling | Cannot reuse an existing logged-in Chrome session; no JS state extraction |
| Firecrawl | Cloud API — data leaves your machine, metered pricing |
| Jina Reader | Cloud API — data leaves your machine, metered pricing |
| rolling-reader | Fully local, reuses your Chrome session and cookies, free forever |
Supported JS state variables (v0.2+)
The following window.* variables are probed automatically for Level 3 extraction:
- window.__NEXT_DATA__: Next.js
- window.__NUXT__: Nuxt.js
- window.__PRELOADED_STATE__: Redux / custom
- window.__INITIAL_STATE__: various frameworks
- window.__REDUX_STATE__: Redux
- window.__APP_STATE__: various
- window.__STATE__: generic
- window.__STORE__: MobX / custom
- window.APP_STATE: no-underscore variant
- window.initialState: camelCase variant
- window.__remixContext: Remix
- window.__staticRouterHydrationData: React Router v6 SSR
Unknown variables matching window.VAR = {…} are also detected via automatic scan.
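One way such a scan for unknown variables could work is to search inline script text for window.NAME = { and try to decode what follows as JSON. How rolling-reader performs its scan is not documented here, so the sketch below is only an assumption about the general technique; `scan_state_assignments` is a hypothetical name, and object literals that are not strict JSON are skipped.

```python
import json
import re
from typing import Any

# Matches the start of an assignment such as `window.__STATE__ = {`
ASSIGNMENT = re.compile(r"window\.([A-Za-z_$][\w$]*)\s*=\s*(?=\{)")


def scan_state_assignments(script_text: str) -> dict[str, Any]:
    """Find `window.NAME = {...}` assignments whose right-hand side is valid JSON."""
    decoder = json.JSONDecoder()
    found: dict[str, Any] = {}
    for match in ASSIGNMENT.finditer(script_text):
        try:
            # raw_decode parses one JSON value starting at the given index and
            # ignores whatever follows it (for example a trailing semicolon).
            value, _end = decoder.raw_decode(script_text, match.end())
        except json.JSONDecodeError:
            continue  # a JS object literal that is not strict JSON
        found[match.group(1)] = value
    return found
```

Using `raw_decode` avoids having to find the matching closing brace by hand, since the JSON decoder already knows where the object ends.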
License
MIT
File details
Details for the file rolling_reader-0.6.5.tar.gz.
File metadata
- Download URL: rolling_reader-0.6.5.tar.gz
- Upload date:
- Size: 22.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f59e65ad2b161044e0bb243b2b9abd64893f55783d7a4dfebbc3cc39a2be771 |
| MD5 | 6f7c6de1013df5b5fb08a4c6ff2b422d |
| BLAKE2b-256 | a8f3b0f2b07a91369c2e46b8f3285459fb9e334abf717ea859293352b9b9872c |
File details
Details for the file rolling_reader-0.6.5-py3-none-any.whl.
File metadata
- Download URL: rolling_reader-0.6.5-py3-none-any.whl
- Upload date:
- Size: 28.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | da5f18528928fc9bc0b23282fbde2ea695e75352f57c158379af7afa59aabd04 |
| MD5 | 439dcb01304f69018568e61d9f590eb0 |
| BLAKE2b-256 | cf3cfd2f03e14f1605792b626b621879434a6a4661caf0f66414a21ae9c3be46 |