rolling-reader

Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction.

Install

pip install rolling-reader
playwright install chromium  # required for Level 2 / Level 3

Requires Python 3.11+. Node.js is not required.

Quick start

Static page (Level 1, no browser needed):

rr https://news.ycombinator.com/

SPA or login-required page (Level 2, reuses your existing Chrome session):

# 1. Start Chrome with remote debugging enabled (see section below)
# 2. Run the command — Level 2 is selected automatically
rr https://app.example.com/dashboard

Output as Markdown:

rr https://example.com --output md
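
The same commands can be driven from a script. The sketch below is only an illustration: `build_rr_command` and `read_json` are hypothetical helper names, the flags are the ones listed under CLI options, and it assumes that `rr` writes its result to standard output.

```python
import json
import subprocess
from typing import Any, List, Optional


def build_rr_command(url: str, output: Optional[str] = None,
                     force_level: Optional[int] = None) -> List[str]:
    """Assemble an rr invocation from the flags listed under CLI options."""
    cmd = ["rr", url]
    if output is not None:
        cmd += ["--output", output]
    if force_level is not None:
        cmd += ["--force-level", str(force_level)]
    return cmd


def read_json(url: str) -> Any:
    """Run rr and parse its output (assumption: JSON arrives on stdout)."""
    proc = subprocess.run(build_rr_command(url, output="json"),
                          capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.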

How it works

| Level | Trigger | Speed |
|---|---|---|
| 1 HTTP | Standard SSR page, no JS rendering needed | ~500 ms |
| 2 CDP | SPA, JS rendering required, or auth-gated | ~3 s |
| 3 JS State | Next.js / Nuxt / Redux / Remix state variable detected | ~1 s (3–4x faster than Level 2 DOM parse) |

The dispatcher tries Level 1 first and moves on to the browser only when plain HTTP returns no usable content. Once Level 2 has attached to the browser, Level 3 is attempted before any DOM parsing: if a known JS state variable is found, DOM parsing is skipped entirely.
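
The fallback order described here can be sketched in a few lines. This is not rolling-reader's actual code: `Result`, `Extractor`, and `dispatch` are hypothetical names, and each extractor is assumed to return `None` when its level yields nothing usable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    level: int
    content: str


# Hypothetical extractor signature: return text, or None when the level
# cannot produce usable content for this URL.
Extractor = Callable[[str], Optional[str]]


def dispatch(url: str, http: Extractor, js_state: Extractor,
             dom: Extractor) -> Optional[Result]:
    """Try Level 1, then Level 3 (state variable) before Level 2 (DOM parse)."""
    text = http(url)              # Level 1: plain HTTP fetch
    if text:
        return Result(1, text)
    # From here on, a browser attachment is assumed to exist.
    text = js_state(url)          # Level 3: read a known JS state variable
    if text:
        return Result(3, text)    # DOM parsing is skipped entirely
    text = dom(url)               # Level 2: parse the rendered DOM
    if text:
        return Result(2, text)
    return None
```

Keeping the extractors as plain callables makes the ordering easy to test with stubs, without a network or a browser.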

Starting Chrome for Level 2 / Level 3

Chrome must be running with remote debugging before invoking Level 2 or Level 3:

# macOS
open -a "Google Chrome" --args --remote-debugging-port=9222

# Windows
chrome --remote-debugging-port=9222

# Linux
google-chrome --remote-debugging-port=9222

The existing Chrome session (including cookies and local storage) is reused — no separate login step required.
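
Before running Level 2 or Level 3, it can help to confirm that Chrome is actually listening on the debugging port. The sketch below queries `/json/version`, the DevTools HTTP endpoint that Chrome serves on that port. The helper names `cdp_version_url` and `probe_cdp` are hypothetical and are not part of rolling-reader.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def cdp_version_url(port: int = 9222, host: str = "127.0.0.1") -> str:
    """Build the URL of Chrome's DevTools HTTP version endpoint."""
    return f"http://{host}:{port}/json/version"


def probe_cdp(port: int = 9222, timeout: float = 2.0) -> Optional[dict]:
    """Return Chrome's version info, or None if nothing answers on the port."""
    try:
        with urllib.request.urlopen(cdp_version_url(port), timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return None
```

When Chrome is reachable, the returned dictionary includes fields such as `Browser` and `webSocketDebuggerUrl`. A `None` result usually means Chrome was started without `--remote-debugging-port=9222`.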

CLI options

| Flag | Values | Description |
|---|---|---|
| --output | json, md | Output format (default: plain text) |
| --force-level | 1, 2, 3 | Skip auto-detection, force a specific level |
| --json-path | dot-notation string | Extract a nested key from JSON output, e.g. title or props.pageProps |
| --no-cache | | Disable response cache |
| --cdp | | Force CDP connection (equivalent to --force-level 2) |
| --verbose | | Print level selection reasoning and timing |
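
To illustrate what a dot-notation path such as `props.pageProps` selects, here is a minimal sketch of the lookup. `extract_path` is a hypothetical helper, and treating numeric segments as list indices is an assumption. The source does not say how `--json-path` handles arrays.

```python
from typing import Any


def extract_path(data: Any, path: str) -> Any:
    """Walk a dot-separated path through nested dicts and lists."""
    current = data
    for key in path.split("."):
        if isinstance(current, list):
            current = current[int(key)]  # assumption: numeric segments index lists
        else:
            current = current[key]
    return current


page = {"title": "Home", "props": {"pageProps": {"items": [{"id": 7}]}}}
print(extract_path(page, "props.pageProps"))
```

A missing key raises `KeyError` in this sketch. A real CLI would more likely report the failing segment to the user.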

Why not X

| Tool | Limitation |
|---|---|
| Scrapling | Cannot reuse an existing logged-in Chrome session; no JS state extraction |
| Firecrawl | Cloud API: data leaves your machine, metered pricing |
| Jina Reader | Cloud API: data leaves your machine, metered pricing |
| rolling-reader | Fully local, reuses your Chrome session and cookies, free forever |

Supported JS state variables (v0.2)

The following window.* variables are probed automatically for Level 3 extraction:

  • window.__NEXT_DATA__ — Next.js (Vercel ecosystem)
  • window.__NUXT__ — Nuxt.js
  • window.__PRELOADED_STATE__ — Redux / custom
  • window.__INITIAL_STATE__ — various frameworks
  • window.__REDUX_STATE__ — Redux explicit naming
  • window.__APP_STATE__ — various frameworks
  • window.__STATE__ — generic
  • window.__STORE__ — MobX / custom
  • window.APP_STATE — no-underscore variant
  • window.initialState — camelCase variant
  • window.__remixContext — Remix
  • window.__staticRouterHydrationData — React Router v6 SSR

Unknown variables matching the pattern window.VAR = {…} are also detected via regex scan.
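
One way such a regex scan could work on raw HTML is sketched below. This is an assumption about the approach, not rolling-reader's implementation: `ASSIGNMENT` and `scan_state_variables` are hypothetical names. The regex only locates the assignment. The object itself is parsed with `json.JSONDecoder.raw_decode`, so nested braces are handled without a brace-counting regex.

```python
import json
import re
from typing import Any, Dict

# Matches "window.NAME =" and stops just before the opening brace.
ASSIGNMENT = re.compile(r"window\.([A-Za-z_$][\w$]*)\s*=\s*(?=\{)")


def scan_state_variables(html: str) -> Dict[str, Any]:
    """Collect window.NAME = {...} assignments whose value is valid JSON."""
    decoder = json.JSONDecoder()
    found: Dict[str, Any] = {}
    for match in ASSIGNMENT.finditer(html):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it (e.g. ";</script>").
            value, _end = decoder.raw_decode(html, match.end())
        except json.JSONDecodeError:
            continue  # a JS object literal that is not strict JSON
        found[match.group(1)] = value
    return found
```

Object literals that are not strict JSON (unquoted keys, trailing commas) are skipped here. Reading the variable from the live page through the browser would not have that limitation.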

License

MIT
