rolling-reader

Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction.

Install

pip install rolling-reader
playwright install chromium  # required for Level 2 / Level 3

Python 3.11+. No Node.js required.

Quick start

Static page (Level 1, no browser needed):

rr https://news.ycombinator.com/

SPA or login-required page (Level 2, reuses your existing Chrome session):

# 1. Start Chrome with remote debugging enabled (see section below)
# 2. Run the command — Level 2 is selected automatically
rr https://app.example.com/dashboard

Output as Markdown:

rr https://example.com --output md

How it works

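Level      | Trigger                                                | Speed
-----------|--------------------------------------------------------|------------------------------------------
1 HTTP     | Standard SSR page, no JS rendering needed              | ~500 ms
2 CDP      | SPA, JS rendering required, or auth-gated              | ~3 s
3 JS State | Next.js / Nuxt / Redux / Remix state variable detected | ~1 s (3–4x faster than Level 2 DOM parse)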

The dispatcher probes each level in order and stops at the first one that returns usable content. Level 1 (plain HTTP) is tried first. If it fails, the dispatcher attaches to Chrome over CDP (Level 2) and attempts Level 3 before any DOM parsing: if a known JS state variable is found, DOM parsing is skipped entirely.
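The fallback order described above can be illustrated with a minimal sketch. This is not rolling-reader's actual code: `dispatch`, `Result`, and the three stub extractors are hypothetical names, and "usable content" is assumed here to mean any non-empty string.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    level: int
    content: str


# Each extractor returns text, or None when it found nothing usable.
Extractor = Callable[[str], Optional[str]]


def dispatch(url: str, levels: list[tuple[int, Extractor]]) -> Optional[Result]:
    """Try each (level, extractor) pair in order; return the first usable result."""
    for level, extract in levels:
        try:
            content = extract(url)
        except Exception:
            # A failing level counts as "no content"; fall through to the next one.
            continue
        if content is not None and content.strip():
            return Result(level=level, content=content)
    return None


# Hypothetical stubs standing in for the real extractors.
def http_fetch(url: str) -> Optional[str]:
    return None  # pretend the static fetch yielded nothing usable


def js_state(url: str) -> Optional[str]:
    return '{"title": "Dashboard"}'  # pretend a state variable was found


def dom_parse(url: str) -> Optional[str]:
    return "Dashboard"


# Level 3 is listed before the Level 2 DOM parse, mirroring the order above.
result = dispatch(
    "https://app.example.com/dashboard",
    [(1, http_fetch), (3, js_state), (2, dom_parse)],
)
```

In this sketch the DOM parser never runs, because the state probe already returned content.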

Starting Chrome for Level 2 / Level 3

Chrome must be running with remote debugging before invoking Level 2 or Level 3:

# macOS
open -a "Google Chrome" --args --remote-debugging-port=9222

# Windows
chrome --remote-debugging-port=9222

# Linux
google-chrome --remote-debugging-port=9222

The existing Chrome session, including cookies and local storage, is reused, so no separate login step is required.
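To confirm that Chrome is actually listening before running a Level 2 or Level 3 fetch, you can query the DevTools HTTP endpoint `/json/version`, which Chrome serves on the debugging port. The sketch below uses only the standard library. The helper names are hypothetical and are not part of rolling-reader.

```python
import http.client
import json
import urllib.request
from typing import Optional


def cdp_version_url(port: int = 9222, host: str = "127.0.0.1") -> str:
    """Build the URL of Chrome's DevTools version endpoint."""
    return f"http://{host}:{port}/json/version"


def parse_cdp_version(payload: bytes) -> Optional[str]:
    """Return the WebSocket debugger URL from a /json/version response, if present."""
    try:
        data = json.loads(payload)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(data, dict):
        return None
    ws_url = data.get("webSocketDebuggerUrl")
    return ws_url if isinstance(ws_url, str) else None


def chrome_debugger_url(port: int = 9222, timeout: float = 2.0) -> Optional[str]:
    """Return the debugger URL if Chrome answers on the port, else None."""
    try:
        with urllib.request.urlopen(cdp_version_url(port), timeout=timeout) as resp:
            return parse_cdp_version(resp.read())
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-HTTP listener on the port.
        return None
```

Calling `chrome_debugger_url()` returns `None` when nothing is listening on port 9222, which usually means Chrome was started without the `--remote-debugging-port` flag.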

CLI options

Flag          | Values              | Description
--------------|---------------------|----------------------------------------------------------------------
--output      | json, md            | Output format (default: plain text)
--force-level | 1, 2, 3             | Skip auto-detection, force a specific level
--json-path   | dot-notation string | Extract a nested key from JSON output, e.g. title or props.pageProps
--no-cache    |                     | Disable response cache
--cdp         |                     | Force CDP connection (equivalent to --force-level 2)
--verbose     |                     | Print level selection reasoning and timing
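To make the dot-notation used by --json-path concrete, here is a minimal sketch of how such a path can be resolved against parsed JSON. `extract_path` is a hypothetical helper, not the tool's implementation. Treating numeric segments as list indexes is an assumption and may not match what rr does.

```python
from typing import Any


def extract_path(data: Any, path: str) -> Any:
    """Follow a dot-separated path through nested dicts and lists."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict):
            current = current[key]       # KeyError if the key is missing
        elif isinstance(current, list):
            current = current[int(key)]  # assumption: numeric segments index lists
        else:
            raise KeyError(f"cannot descend into {type(current).__name__} at {key!r}")
    return current


doc = {"title": "Home", "props": {"pageProps": {"items": [{"id": 7}]}}}
title = extract_path(doc, "title")
page_props = extract_path(doc, "props.pageProps")
first_id = extract_path(doc, "props.pageProps.items.0.id")
```

A path like props.pageProps selects the whole nested object, so the output stays JSON rather than collapsing to a single scalar.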

Why not X

Tool           | Limitation
---------------|-----------------------------------------------------------------------------------
Scrapling      | Cannot reuse an existing logged-in Chrome session; no JS state extraction
Firecrawl      | Cloud API: data leaves your machine, metered pricing
Jina Reader    | Cloud API: data leaves your machine, metered pricing
rolling-reader | Fully local, reuses your Chrome session and cookies, free forever

Supported JS state variables (v0.2)

The following window.* variables are probed automatically for Level 3 extraction:

  • window.__NEXT_DATA__ — Next.js (Vercel ecosystem)
  • window.__NUXT__ — Nuxt.js
  • window.__PRELOADED_STATE__ — Redux / custom
  • window.__INITIAL_STATE__ — various frameworks
  • window.__REDUX_STATE__ — Redux explicit naming
  • window.__APP_STATE__ — various frameworks
  • window.__STATE__ — generic
  • window.__STORE__ — MobX / custom
  • window.APP_STATE — no-underscore variant
  • window.initialState — camelCase variant
  • window.__remixContext — Remix
  • window.__staticRouterHydrationData — React Router v6 SSR

Unknown variables matching the pattern window.VAR = {…} are also detected via regex scan.
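The regex scan for `window.VAR = {…}` assignments can be sketched as follows. This illustrates the idea on raw HTML text only. It is not the tool's code, and the names `ASSIGNMENT` and `find_state_objects` are hypothetical. It also assumes the assigned value is valid JSON, so state written as a JavaScript object literal with unquoted keys is skipped.

```python
import json
import re
from typing import Any

# Matches "window.NAME =" and stops right before the opening brace.
ASSIGNMENT = re.compile(r"window\.([A-Za-z_$][\w$]*)\s*=\s*(?=\{)")


def find_state_objects(html: str) -> dict[str, Any]:
    """Return {variable name: parsed object} for JSON-valued window.* assignments."""
    decoder = json.JSONDecoder()
    found: dict[str, Any] = {}
    for match in ASSIGNMENT.finditer(html):
        try:
            # raw_decode parses one JSON value starting at the brace and
            # ignores whatever follows it (e.g. the trailing semicolon).
            value, _end = decoder.raw_decode(html, match.end())
        except ValueError:
            continue  # not valid JSON; skip this assignment
        found.setdefault(match.group(1), value)  # keep the first occurrence
    return found


page = (
    '<script>window.__INITIAL_STATE__ = {"user": {"name": "Ada"}, "n": 2};</script>'
    "<script>window.notJson = {a: 1};</script>"
)
state = find_state_objects(page)
```

Using `raw_decode` rather than a brace-matching regex avoids miscounting braces that appear inside string values.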

License

MIT
