rolling-reader

Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction.

Install

pip install rolling-reader

Python 3.11+. No Node.js required.

Note: running playwright install chromium is not needed. rolling-reader connects to your existing Chrome browser and does not download or manage a browser of its own.

Quick start

Static pages — works immediately after install:

rr https://news.ycombinator.com/
rr https://arxiv.org/abs/1706.03762 --clean   # article body only

SPA / login-required pages — requires Chrome running with remote debugging:

# Step 1: start Chrome with remote debugging (do this once per session)
#   macOS:   open -a "Google Chrome" --args --remote-debugging-port=9222
#   Windows: chrome --remote-debugging-port=9222
#   Linux:   google-chrome --remote-debugging-port=9222

# Step 2: scrape — rolling-reader reuses your existing session and cookies
rr https://app.example.com/dashboard

How it works

Level        Trigger                                                   Speed
1 HTTP       Standard SSR page                                         ~500 ms
2 CDP        SPA, JS rendering required, or auth-gated                 ~3 s
3 JS State   Next.js / Nuxt / Redux / Remix state variable detected    ~1 s (3–4× faster than Level 2 DOM)

The dispatcher tries the levels in order and stops at the first one that returns usable content. Level 3 is not a separate step after Level 2: it is attempted inside Level 2, and if a known JS state variable is found, DOM parsing is skipped entirely.

Levels 2 and 3 reuse your existing Chrome session, including cookies and local storage. No separate login step or credential storage is required.

CLI options

Flag                   Description
--clean / -c           Extract article body only (removes nav, ads, footers)
--output json|md       Output format (default: json)
--force-level 1|2|3    Skip auto-detection, force a specific level
--json-path <path>     Extract a nested field, e.g. title or props.pageProps
--no-cache             Bypass profile cache, always re-explore
--cdp <endpoint>       Chrome DevTools endpoint (default: http://localhost:9222)
--verbose / -v         Print level selection and timing to stderr
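
To make the --json-path examples concrete, here is a rough sketch of what a dotted path lookup amounts to. It assumes the path is a plain dot-separated list of keys, as in the two examples above; the exact syntax rolling-reader accepts may differ, and extract_path is a hypothetical name.

```python
from typing import Any


def extract_path(data: Any, path: str) -> Any:
    """Follow a dot-separated path through nested dicts and lists."""
    current = data
    for key in path.split("."):
        if isinstance(current, list):
            # Assumption: numeric segments index into lists.
            current = current[int(key)]
        else:
            current = current[key]
    return current


page = {"title": "Dashboard", "props": {"pageProps": {"items": ["a", "b"]}}}
print(extract_path(page, "title"))
print(extract_path(page, "props.pageProps"))
```

A missing key raises KeyError in this sketch; how the real CLI reports a path that does not exist is not covered here.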

Batch scraping

# Multiple URLs as arguments
rr batch https://example.com https://news.ycombinator.com/

# From a file (one URL per line, # for comments)
rr batch urls.txt

# Pipe-friendly: data goes to stdout, progress to stderr
rr batch urls.txt --clean > results.jsonl

# Control concurrency (default: 3)
rr batch urls.txt --concurrency 10
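
The batch conventions above (one URL per line, # for comments, data on stdout, progress on stderr, bounded concurrency) can be sketched in a few lines. This is an illustration under assumptions, not the tool's implementation: parse_url_lines and run_batch are hypothetical names, the scrape callable stands in for whatever fetches a page, and only whole-line comments are handled because the source does not say whether trailing comments are allowed.

```python
import json
import sys
from collections.abc import Callable, Iterable
from concurrent.futures import ThreadPoolExecutor
from typing import Any, TextIO


def parse_url_lines(lines: Iterable[str]) -> list[str]:
    """Keep one URL per line, skipping blank lines and lines starting with #."""
    urls = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


def run_batch(
    urls: list[str],
    scrape: Callable[[str], Any],
    concurrency: int = 3,
    out: TextIO = sys.stdout,
    err: TextIO = sys.stderr,
) -> None:
    """Scrape with at most `concurrency` workers; write one JSON object per line."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map yields results in input order, whatever order they finish in.
        for done, record in enumerate(pool.map(scrape, urls), start=1):
            print(json.dumps(record), file=out)            # data: safe to redirect
            print(f"[{done}/{len(urls)}] done", file=err)  # progress: stays on the terminal
```

Keeping progress on stderr is what makes a redirect such as > results.jsonl produce a clean file with one record per line.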

Why not X

Tool              Limitation
Scrapling         Cannot reuse an existing logged-in Chrome session; no JS state extraction
Firecrawl         Cloud API: data leaves your machine, metered pricing
Jina Reader       Cloud API: data leaves your machine, metered pricing
rolling-reader    Fully local, reuses your Chrome session and cookies, free forever

Supported JS state variables (v0.2+)

The following window.* variables are probed automatically for Level 3 extraction:

  • window.__NEXT_DATA__ — Next.js
  • window.__NUXT__ — Nuxt.js
  • window.__PRELOADED_STATE__ — Redux / custom
  • window.__INITIAL_STATE__ — various frameworks
  • window.__REDUX_STATE__ — Redux
  • window.__APP_STATE__ — various
  • window.__STATE__ — generic
  • window.__STORE__ — MobX / custom
  • window.APP_STATE — no-underscore variant
  • window.initialState — camelCase variant
  • window.__remixContext — Remix
  • window.__staticRouterHydrationData — React Router v6 SSR

Unknown variables matching window.VAR = {…} are also detected via automatic scan.
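
As a rough offline illustration of the window.VAR = {…} scan, the sketch below looks for such assignments in raw HTML and decodes the object that follows. It is an approximation under assumptions: rolling-reader reads these variables in the live browser, whereas this version only finds values written as strict JSON in the page source, and find_state_assignments is a hypothetical name.

```python
import json
import re

# Matches "window.NAME =" when the next character starts an object literal.
ASSIGNMENT = re.compile(r"window\.([A-Za-z_$][\w$]*)\s*=\s*(?=\{)")


def find_state_assignments(html: str) -> dict[str, dict]:
    """Return {variable name: decoded object} for JSON-valued window assignments."""
    decoder = json.JSONDecoder()
    found: dict[str, dict] = {}
    for match in ASSIGNMENT.finditer(html):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it (such as ";</script>").
            value, _end = decoder.raw_decode(html, match.end())
        except json.JSONDecodeError:
            continue  # a JS object literal that is not strict JSON
        if isinstance(value, dict):
            found[match.group(1)] = value
    return found


html = '<script>window.__INITIAL_STATE__ = {"user": {"id": 7}};</script>'
print(find_state_assignments(html))
```

State that is never written as a plain assignment in the HTML, or that uses unquoted keys or other non-JSON syntax, would be missed by this approach, which is one reason reading it from a running browser is more reliable.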

License

MIT
