Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction

rolling-reader

Install

pip install rolling-reader
playwright install chromium  # required for Level 2 / Level 3

Requires Python 3.11 or later. Node.js is not required.

Quick start

Static page (Level 1, no browser needed):

rr https://news.ycombinator.com/

SPA or login-required page (Level 2, reuses your existing Chrome session):

# 1. Start Chrome with remote debugging enabled (see section below)
# 2. Run the command — Level 2 is selected automatically
rr https://app.example.com/dashboard

Output as Markdown:

rr https://example.com --output md

How it works

| Level | Trigger | Speed |
|---|---|---|
| 1 HTTP | Standard SSR page, no JS rendering needed | ~500 ms |
| 2 CDP | SPA, JS rendering required, or auth-gated | ~3 s |
| 3 JS State | Next.js / Nuxt / Redux / Remix state variable detected | ~1 s (3–4x faster than Level 2 DOM parse) |

The dispatcher tries the levels in order, starting with Level 1, and stops at the first one that returns usable content. Level 3 is not a separate fallback step: it is attempted once Level 2 has attached to the browser, and if a known JS state variable is found, DOM parsing is skipped entirely.
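
The ordering described above can be pictured as a small fallback chain. The sketch below is illustrative only and does not reproduce rolling-reader's internals: `Result`, `dispatch`, and the three extractor callables are hypothetical names, and "usable content" is approximated here as any non-empty string.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# An extractor returns text, or None / "" when it found nothing usable.
Extractor = Callable[[str], Optional[str]]


@dataclass
class Result:
    level: int
    content: str


def dispatch(
    url: str,
    http_fetch: Extractor,   # Level 1: plain HTTP request
    js_state: Extractor,     # Level 3: read a known JS state variable
    dom_parse: Extractor,    # Level 2: parse the rendered DOM
) -> Optional[Result]:
    text = http_fetch(url)
    if text:
        return Result(1, text)

    # From here on a browser is attached (Level 2). The state lookup
    # runs first so that DOM parsing can be skipped when it succeeds.
    state = js_state(url)
    if state:
        return Result(3, state)

    dom = dom_parse(url)
    if dom:
        return Result(2, dom)
    return None
```

With stub extractors, `dispatch(url, lambda u: None, lambda u: '{"a": 1}', lambda u: "<html>")` returns a Level 3 result and never calls the DOM parser's output.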

Starting Chrome for Level 2 / Level 3

Chrome must be running with remote debugging before invoking Level 2 or Level 3:

# macOS
open -a "Google Chrome" --args --remote-debugging-port=9222

# Windows
chrome --remote-debugging-port=9222

# Linux
google-chrome --remote-debugging-port=9222

The existing Chrome session (including cookies and local storage) is reused — no separate login step required.
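
Before running a Level 2 or Level 3 command, it can help to confirm that Chrome is actually listening on the debugging port. Chrome serves a small JSON document at `/json/version` on that port; the sketch below reads it with the standard library. The helper names `parse_version` and `chrome_debug_info` are hypothetical and are not part of rolling-reader.

```python
import json
import urllib.error
import urllib.request


def parse_version(payload: bytes) -> dict:
    """Pick the browser name and WebSocket URL out of a /json/version reply."""
    info = json.loads(payload)
    return {
        "browser": info.get("Browser"),
        "ws_url": info.get("webSocketDebuggerUrl"),
    }


def chrome_debug_info(port: int = 9222, timeout: float = 2.0):
    """Return endpoint info, or None if nothing usable answers on the port."""
    url = f"http://127.0.0.1:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a reply that is not JSON.
        return None
```

If `chrome_debug_info()` returns `None`, Chrome is most likely not running with `--remote-debugging-port=9222`, or it is using a different port.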

CLI options

| Flag | Values | Description |
|---|---|---|
| --output | json, md | Output format (default: plain text) |
| --force-level | 1, 2, 3 | Skip auto-detection, force a specific level |
| --json-path | dot-notation string | Extract a nested key from JSON output, e.g. title or props.pageProps |
| --no-cache | (none) | Disable response cache |
| --cdp | (none) | Force CDP connection (equivalent to --force-level 2) |
| --verbose | (none) | Print level selection reasoning and timing |
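
As an illustration of what a dot-notation path such as `props.pageProps` selects, the lookup can be sketched as a walk through nested dictionaries. This is an assumption about the general idea, not the tool's actual code: `get_path` is a hypothetical name, and how rolling-reader treats lists or missing keys is not stated here.

```python
from typing import Any


def get_path(data: Any, path: str) -> Any:
    """Follow a dot-separated path through nested dicts (hypothetical helper)."""
    current = data
    for key in path.split("."):
        # Stop with a clear error if the path leaves the dict structure.
        if not isinstance(current, dict) or key not in current:
            raise KeyError(path)
        current = current[key]
    return current
```

For example, applying `"props.pageProps"` to `{"props": {"pageProps": {"title": "Home"}}}` yields the inner `{"title": "Home"}` dictionary.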

Why not X

| Tool | Limitation |
|---|---|
| Scrapling | Cannot reuse an existing logged-in Chrome session; no JS state extraction |
| Firecrawl | Cloud API: data leaves your machine; metered pricing |
| Jina Reader | Cloud API: data leaves your machine; metered pricing |
| rolling-reader | Fully local, reuses your Chrome session and cookies, free forever |

Supported JS state variables (v0.2)

The following window.* variables are probed automatically for Level 3 extraction:

  • window.__NEXT_DATA__ — Next.js (Vercel ecosystem)
  • window.__NUXT__ — Nuxt.js
  • window.__PRELOADED_STATE__ — Redux / custom
  • window.__INITIAL_STATE__ — various frameworks
  • window.__REDUX_STATE__ — Redux explicit naming
  • window.__APP_STATE__ — various frameworks
  • window.__STATE__ — generic
  • window.__STORE__ — MobX / custom
  • window.APP_STATE — no-underscore variant
  • window.initialState — camelCase variant
  • window.__remixContext — Remix
  • window.__staticRouterHydrationData — React Router v6 SSR

Unknown variables matching the pattern window.VAR = {…} are also detected via regex scan.
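
A regex scan of this kind could look roughly like the sketch below, which searches raw HTML for `window.NAME = {` and then lets the JSON decoder consume the object that follows. It is a simplified illustration under stated assumptions, not the project's implementation: `ASSIGNMENT` and `scan_state` are hypothetical names, and only objects that are strict JSON are captured (JavaScript literals with unquoted keys are skipped).

```python
import json
import re

# Matches `window.NAME =` when the next character is an opening brace.
ASSIGNMENT = re.compile(r"window\.([A-Za-z_$][\w$]*)\s*=\s*(?=\{)")


def scan_state(html: str) -> dict:
    """Collect `window.NAME = {...}` assignments whose value parses as JSON."""
    found = {}
    decoder = json.JSONDecoder()
    for match in ASSIGNMENT.finditer(html):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it (e.g. the trailing `;`).
            value, _ = decoder.raw_decode(html, match.end())
        except json.JSONDecodeError:
            continue  # object literal is JavaScript, not strict JSON
        found[match.group(1)] = value
    return found
```

Using `raw_decode` rather than a regex for the object body avoids having to balance nested braces by hand.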

License

MIT

Download files

Source Distribution

rolling_reader-0.5.0.tar.gz (17.4 kB)

Uploaded Source

Built Distribution

rolling_reader-0.5.0-py3-none-any.whl (23.5 kB)

Uploaded Python 3

File details

Details for the file rolling_reader-0.5.0.tar.gz.

File metadata

  • Download URL: rolling_reader-0.5.0.tar.gz
  • Upload date:
  • Size: 17.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for rolling_reader-0.5.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | d6c5a6223220a8ff2361f8652628cad72b4e2f01587f88b348ccb25ce7f5eccf |
| MD5 | d76b054fcbed51ca2191f4c7694fd6b2 |
| BLAKE2b-256 | b3c48ddd30dcf2b346736bcd21375932498022709fc2f69740d86d620b95f704 |

File details

Details for the file rolling_reader-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: rolling_reader-0.5.0-py3-none-any.whl
  • Upload date:
  • Size: 23.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for rolling_reader-0.5.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | e065a2bf366ae3c5a1b075a39538601bc1244efa5793fb5c49454ed780f5bf55 |
| MD5 | af1980357fe390c44fa951ed25902ad6 |
| BLAKE2b-256 | e482c92843368c3f45a1564f016caf96c6a0cfac7e7d9bc289972a804dc92b87 |
