rolling-reader
Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction.
Install
pip install rolling-reader
playwright install chromium # required for Level 2 / Level 3
Requires Python 3.11 or later. Node.js is not required.
Quick start
Static page (Level 1, no browser needed):
rr https://news.ycombinator.com/
SPA or login-required page (Level 2, reuses your existing Chrome session):
# 1. Start Chrome with remote debugging enabled (see section below)
# 2. Run the command — Level 2 is selected automatically
rr https://app.example.com/dashboard
Output as Markdown:
rr https://example.com --output md
How it works
| Level | Trigger | Speed |
|---|---|---|
| 1 HTTP | Standard SSR page, no JS rendering needed | ~500 ms |
| 2 CDP | SPA, JS rendering required, or auth-gated | ~3 s |
| 3 JS State | Next.js / Nuxt / Redux / Remix state variable detected | ~1 s (3–4x faster than Level 2 DOM parse) |
The dispatcher probes each level in order and stops at the first one that returns usable content. Level 3 is attempted after Level 2 attaches to the browser — if a known JS state variable is found, DOM parsing is skipped entirely.
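The fall-through behaviour described above can be illustrated with a short sketch. This is not rolling-reader's actual dispatcher: the function names, the level signature, and the "usable content" check (a non-empty string) are all assumptions made for illustration. JS state extraction is listed before DOM parsing to mirror the note that DOM parsing is skipped when a state variable is found.

```python
from typing import Callable, Optional

# Assumed signature: a level takes a URL and returns content, or None
# when it cannot produce anything usable.
Level = Callable[[str], Optional[str]]


def dispatch(url: str, levels: list[tuple[str, Level]]) -> tuple[str, str]:
    """Try each level in order; return (name, content) from the first success."""
    for name, level in levels:
        try:
            content = level(url)
        except Exception:
            # A failing level counts as "no usable content"; try the next one.
            continue
        if content and content.strip():
            return name, content
    raise RuntimeError(f"no level returned usable content for {url}")


# Stand-in levels for demonstration only; real levels would do network I/O.
def http_level(url: str) -> Optional[str]:
    return None  # pretend the static fetch returned an empty SPA shell


def js_state_level(url: str) -> Optional[str]:
    return '{"title": "Dashboard"}'  # pretend a state variable was found


def dom_level(url: str) -> Optional[str]:
    return "rendered DOM text"


name, content = dispatch(
    "https://app.example.com/dashboard",
    [("1 HTTP", http_level), ("3 JS State", js_state_level), ("2 CDP DOM", dom_level)],
)
print(name)  # 3 JS State
```

Because the loop returns on the first usable result, dom_level is never called in this example.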
Starting Chrome for Level 2 / Level 3
Chrome must be running with remote debugging before invoking Level 2 or Level 3:
# macOS
open -a "Google Chrome" --args --remote-debugging-port=9222
# Windows
chrome --remote-debugging-port=9222
# Linux
google-chrome --remote-debugging-port=9222
The existing Chrome session (including cookies and local storage) is reused — no separate login step required.
CLI options
| Flag | Values | Description |
|---|---|---|
| --output | json, md | Output format (default: plain text) |
| --force-level | 1, 2, 3 | Skip auto-detection, force a specific level |
| --json-path | dot-notation string | Extract a nested key from JSON output, e.g. title or props.pageProps |
| --no-cache | none | Disable response cache |
| --cdp | none | Force CDP connection (equivalent to --force-level 2) |
| --verbose | none | Print level selection reasoning and timing |
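To make the --json-path notation concrete, here is a minimal sketch of dot-notation lookup over parsed JSON. It illustrates the general idea rather than rolling-reader's own implementation. extract_path is a hypothetical name, and treating numeric segments as list indexes is an assumption that the README does not confirm.

```python
from typing import Any


def extract_path(data: Any, path: str) -> Any:
    """Walk `data` along a dot-separated path such as "props.pageProps"."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict):
            current = current[key]  # KeyError if the key is missing
        elif isinstance(current, list):
            current = current[int(key)]  # assumption: numeric segments index lists
        else:
            raise KeyError(f"cannot descend into {type(current).__name__} at {key!r}")
    return current


page = {"title": "Home", "props": {"pageProps": {"items": [{"id": 7}]}}}
print(extract_path(page, "title"))                       # Home
print(extract_path(page, "props.pageProps"))             # {'items': [{'id': 7}]}
print(extract_path(page, "props.pageProps.items.0.id"))  # 7
```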
Why not X
| Tool | Limitation |
|---|---|
| Scrapling | Cannot reuse an existing logged-in Chrome session; no JS state extraction |
| Firecrawl | Cloud API — data leaves your machine, metered pricing |
| Jina Reader | Cloud API — data leaves your machine, metered pricing |
| rolling-reader | Fully local, reuses your Chrome session and cookies, free forever |
Supported JS state variables (v0.2)
The following window.* variables are probed automatically for Level 3 extraction:
- window.__NEXT_DATA__: Next.js (Vercel ecosystem)
- window.__NUXT__: Nuxt.js
- window.__PRELOADED_STATE__: Redux / custom
- window.__INITIAL_STATE__: various frameworks
- window.__REDUX_STATE__: Redux explicit naming
- window.__APP_STATE__: various frameworks
- window.__STATE__: generic
- window.__STORE__: MobX / custom
- window.APP_STATE: no-underscore variant
- window.initialState: camelCase variant
- window.__remixContext: Remix
- window.__staticRouterHydrationData: React Router v6 SSR
Unknown variables matching the pattern window.VAR = {…} are also detected via regex scan.
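One way such a regex scan could work is sketched below. It assumes the scan runs over the page's HTML source and that the assigned value is valid JSON; object literals that use JavaScript-only syntax (unquoted keys, trailing commas) are skipped. ASSIGN_RE and find_state_objects are hypothetical names, and the pattern rolling-reader actually uses may differ.

```python
import json
import re

# Matches `window.NAME =` and stops right before the opening brace.
ASSIGN_RE = re.compile(r"window\.([A-Za-z_$][\w$]*)\s*=\s*(?=\{)")


def find_state_objects(html: str) -> dict[str, dict]:
    """Return {variable_name: parsed_object} for JSON-compatible assignments."""
    decoder = json.JSONDecoder()
    found: dict[str, dict] = {}
    for match in ASSIGN_RE.finditer(html):
        try:
            # raw_decode parses one JSON value starting at the given index and
            # ignores whatever follows it (for example `;</script>`).
            value, _end = decoder.raw_decode(html, match.end())
        except json.JSONDecodeError:
            continue  # a JS object literal that is not valid JSON
        if isinstance(value, dict):
            found[match.group(1)] = value
    return found


html = '<script>window.__INITIAL_STATE__ = {"user": {"name": "ada"}};</script>'
print(find_state_objects(html))
# {'__INITIAL_STATE__': {'user': {'name': 'ada'}}}
```

Letting the JSON decoder find the end of the object avoids brace counting in the regex, which breaks on nested objects and on braces inside strings.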
License
MIT