

███████╗██╗    ██╗██╗████████╗ ██████╗██╗  ██╗██████╗  █████╗  ██████╗██╗  ██╗
██╔════╝██║    ██║██║╚══██╔══╝██╔════╝██║  ██║██╔══██╗██╔══██╗██╔════╝██║ ██╔╝
███████╗██║ █╗ ██║██║   ██║   ██║     ███████║██████╔╝███████║██║     █████╔╝
╚════██║██║███╗██║██║   ██║   ██║     ██╔══██║██╔══██╗██╔══██║██║     ██╔═██╗
███████║╚███╔███╔╝██║   ██║   ╚██████╗██║  ██║██████╔╝██║  ██║╚██████╗██║  ██╗
╚══════╝ ╚══╝╚══╝ ╚═╝   ╚═╝    ╚═════╝╚═╝  ╚═╝╚═════╝ ╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝

One cost-ordered scrape cascade — HTTP → stealth browser → paid — shared by every tool.

Give it a URL; it tries the cheapest way to get clean Markdown first and only escalates to a heavier (slower, costlier) tier when the cheap one is walled. Stops at the first success.



Why

Most scrapers either give up on hard pages or send everything through an expensive headless browser / paid API. switchback orders the methods by cost and walks them cheapest-first, per host, learning which tier wins where so the next run starts there. The easy majority stays free; only genuinely-walled hosts pay for the heavy tiers.

  • Cost-ordered cascade — free APIs → cheap HTTP → anti-bot solver → stealth browser → paid API.
  • Per-host memory (botwall) — remembers the winning tier per host, skip-lists hard blockers, auto-skips hosts stuck on the paid tier.
  • Cost-scoped residential egress — routes only walled hosts through a residential proxy, never the easy majority.
  • One shape, three entry points — Python library, CLI (JSON on stdout), or an HTTP service.
  • Observable — every attempt is an OpenTelemetry span; logs ship trace-correlated to any OTLP backend (Jaeger, Tempo, SigNoz).
  • Runs with any subset installed — each tier imports its deps lazily; a missing one is just a tier miss.
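The per-host memory idea can be sketched in a few lines. This is an illustration only, not switchback's botwall implementation: `HostMemory`, `TIERS`, and the in-memory dict are hypothetical names, and the real engine keeps its state in files rather than in a Python dict.

```python
from urllib.parse import urlsplit

# Hypothetical tier labels, ordered cheapest-first (not switchback's real identifiers).
TIERS = ["api", "http", "solver", "browser", "paid"]


class HostMemory:
    """Remember which tier last succeeded for each host."""

    def __init__(self):
        self._winner = {}  # host -> tier name

    def record_win(self, url, tier):
        self._winner[urlsplit(url).hostname] = tier

    def plan(self, url):
        """Return the tiers to try, starting at the remembered winner."""
        start = self._winner.get(urlsplit(url).hostname)
        if start is None:
            return list(TIERS)  # unknown host: walk the full cascade
        return TIERS[TIERS.index(start):]


memory = HostMemory()
memory.record_win("https://example.com/a", "browser")
print(memory.plan("https://example.com/b"))   # starts at the remembered tier
print(memory.plan("https://other.example/"))  # unknown host: full cascade
```

Starting at the remembered tier is what keeps a known-hard host from paying for cheap attempts that are likely to fail again.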

Quickstart

pip install switchback                 # core: cheap tiers (0/1) + search

from switchback import scrape

for r in scrape(["https://arxiv.org/abs/1706.03762"]):
    print(r.source_method, len(r.markdown))

python -m switchback https://example.com/article    # JSON on stdout (bridge for any language)

That's the whole loop. Add tiers as you need them (see Install).

The cascade (stop at first success)

| Tier | Strategy | Cost |
|---|---|---|
| 0 | Direct APIs / mirrors (arxiv, wikipedia, EuropePMC; extend: job boards) | free, cleanest |
| 1 | Plain HTTP + TLS impersonation (curl_cffi), incl. PDFs | cheap |
| 2 | Cloudflare / anti-bot solver (cloudscraper, install .[cloudflare]) | cheap-ish (~5s/host) |
| 3 | Stealth headless browser (patchright, Chromium) | heavy |
| 3b | Camoufox (Firefox stealth); on by default (opt out: SCRAPER_DISABLE_CAMOUFOX) | heavy + slow (~40s on hard CF) |
| 3c | Residential-IP browser over CDP (BU_CDP_URL); off unless configured | heavy (remote egress) |
| 4 | Firecrawl (paid, env-gated, audited) | paid, last resort |

Every URL has a wall-clock budget (SCRAPER_DEADLINE_S, default 45s), checked between tiers so that a single URL cannot sit through every tier's timeout in sequence. Each tier attempt records its latency and outcome (ok / short_content / rate_limited / miss / not_applicable) to its span and to the botwall event log; the root span carries the total latency and the final outcome (including deadline_exceeded).
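As a rough sketch of the loop described above (cheapest tier first, stop at the first success, check the budget between tiers), consider the following. It is not the engine's orchestrator: `run_cascade`, the tier names, and the stand-in fetch functions are all hypothetical, and only two of the documented outcomes are modeled.

```python
import time


def run_cascade(url, tiers, deadline_s=45.0, clock=time.monotonic):
    """Try (name, fetch) pairs cheapest-first and stop at the first success.

    `fetch(url)` returns Markdown text, or None for a miss. The budget is
    checked between tiers, never in the middle of an attempt.
    """
    started = clock()
    attempts = []
    for name, fetch in tiers:
        if clock() - started >= deadline_s:
            return {"outcome": "deadline_exceeded", "attempts": attempts}
        t0 = clock()
        text = fetch(url)
        outcome = "ok" if text else "miss"
        attempts.append({"tier": name, "outcome": outcome, "latency_s": clock() - t0})
        if text:
            return {"outcome": "ok", "source_method": name,
                    "markdown": text, "attempts": attempts}
    return {"outcome": "miss", "attempts": attempts}


# Stand-in tiers: the cheap one misses, the heavier one succeeds.
tiers = [
    ("tier1_http", lambda url: None),
    ("tier3_browser", lambda url: "# Title\n\nBody text"),
    ("tier4_paid", lambda url: "never reached"),
]
result = run_cascade("https://example.com/article", tiers)
print(result["source_method"], [a["tier"] for a in result["attempts"]])
```

Because the deadline is only consulted between tiers, each tier still needs its own timeout; the budget bounds how many tiers get a turn, not how long one attempt may run.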

Search (query → URLs) is separate from the scrape cascade: switchback.search() / python -m switchback.api --search <query>, backed by a local SearXNG.

Install

pip install switchback                 # core: normalization + cheap tiers (0/1) + search
pip install "switchback[cloudflare]"   # + Tier 2 Cloudflare/anti-bot solver (cloudscraper)
pip install "switchback[server]"       # + HTTP service (fastapi, uvicorn) incl. /metrics + /traces
pip install "switchback[browser]" && patchright install chromium   # + Tier 3 stealth Chromium
pip install "switchback[camoufox]" && camoufox fetch               # + Tier 3b Firefox stealth
pip install "switchback[firecrawl]"    # + Tier 4 paid API (needs FIRECRAWL_API_KEY)
pip install "switchback[tracing]"      # + OpenTelemetry -> any OTLP backend
pip install "switchback[all]"          # everything

For Tier 2's full v3 JS-VM, Turnstile, and stealth support, install the Enhanced Edition 3.x fork. The cloudscraper on PyPI is the older v1/v2, and PyPI forbids a git-URL dependency inside a published package, so install the fork alongside switchback:

pip install "cloudscraper @ git+https://github.com/VeNoMouS/cloudscraper@3.0.0"

Or run the whole thing as a container: docker build -t switchback . && docker run -p 8799:8799 switchback.

Production / cold-start deployment

The two heavy tiers pull dependencies that often can't be baked into a base image and land after boot (e.g. an async install thread on Azure). Until they're ready, those tiers report unavailable (a distinct outcome carrying the exact fix) and the cascade falls through — they are never silently skipped. Checklist:

  • Tier 3 is the real workhorse for Cloudflare/JS sites — make sure its browser is installed: patchright install chromium (note: patchright, not vanilla playwright). On a cold start, run this in your post-boot install step/thread; Tier 3 flips to ready once it finishes.
  • Tier 2 needs the cloudscraper 3.x fork (above) to attempt stealth. With the frozen PyPI cloudscraper it reports unavailable and fails fast (no wasted solve budget) instead of erroring mid-cascade. Tier 2 is a weak solver for modern Cloudflare — treat it as a cheap try before the browser, not the primary.
  • Install Node.js for Tier 2's v3 JS-VM challenges — faster and thread-safe vs. the pure-Python js2py fallback (relevant under concurrent load).
  • Bound Tier 2's solve budget with SCRAPER_CLOUDSCRAPER_TIMEOUT_S (default 25) so an unsolvable challenge can't eat the per-URL deadline before the browser tier runs. Lower it (e.g. 12) if Tier 2 rarely wins on your hosts.
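The "unavailable, with the exact fix" behaviour in the checklist above might look roughly like this. It is a minimal sketch under stated assumptions: `tier_status` and the missing module name are hypothetical, and it only checks that a Python module is importable, whereas a real readiness check would presumably also verify things like the installed browser binary.

```python
import importlib.util


def tier_status(module_name, fix_hint):
    """Return 'ready' if a top-level module is importable, else 'unavailable' plus a fix."""
    if importlib.util.find_spec(module_name) is None:
        return {"status": "unavailable", "fix": fix_hint}
    return {"status": "ready", "fix": None}


# 'json' ships with Python, so it stands in for an installed dependency.
print(tier_status("json", "nothing to do"))
# A deliberately nonexistent module stands in for a dependency that has not landed yet.
print(tier_status("switchback_example_missing_dep", "pip install <the missing extra>"))
```

Checking availability up front, without importing the dependency, lets a tier fail fast instead of erroring partway through an attempt.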

Verify readiness on the box with the preflight check (doubles as a healthcheck — exit 0 when the capable tiers are ready):

switchback --doctor          # or: python -m switchback --doctor

Use it from your app

Three interchangeable entry points — all return the same shape ([{url, source_method, markdown}], successes only):

Python library

from switchback import scrape
for r in scrape(["https://arxiv.org/abs/1706.03762"]):
    print(r.source_method, len(r.markdown))

# Need failures + reasons too? scrape_detailed returns a ScrapeOutcome per URL
# (ok, final_outcome, error_class, status_code, and the per-tier attempts):
from switchback import scrape_detailed
for o in scrape_detailed(["https://www.pcmag.com/news"]):
    if not o.ok:
        print(o.url, o.final_outcome, o.error_class, o.status_code)

CLI (JSON on stdout — bridge for any language)

python -m switchback https://example.com/article        # or: switchback <url>

HTTP service (language-agnostic; one warm process keeps the browser pool hot)

switchback-server                                    # listens on :8799
curl -s localhost:8799/scrape -d '{"urls":["https://example.com"]}'
curl 'localhost:8799/search?q=web+scraping'

Non-Python callers: see clients/node_bridge.md. Python callers that want HTTP-with-CLI-fallback can drop in clients/python_client.py.
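For a Python caller that wants to talk to the HTTP service directly, a standard-library sketch could look like the one below. It is not the shipped clients/python_client.py: `build_scrape_request` and `scrape_over_http` are hypothetical helpers, and the JSON Content-Type header is an assumption (the curl example above sends the same body without setting one).

```python
import json
import urllib.request


def build_scrape_request(urls, base_url="http://localhost:8799", fmt=None):
    """Build the POST /scrape request shown in the curl example."""
    payload = {"urls": list(urls)}
    if fmt is not None:
        payload["format"] = fmt
    return urllib.request.Request(
        f"{base_url}/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
        method="POST",
    )


def scrape_over_http(urls, **kwargs):
    """Send the request; needs a running switchback-server."""
    with urllib.request.urlopen(build_scrape_request(urls, **kwargs)) as resp:
        return json.loads(resp.read())


# Building the request needs no server, so it can be inspected offline.
req = build_scrape_request(["https://example.com"])
print(req.get_method(), req.full_url, req.data.decode())
```

Calling `scrape_over_http(["https://example.com"])` against a running server should return the same list of `{url, source_method, markdown}` items as the other entry points.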

Cost-scoped residential egress

The dominant reason hard hosts wall you is the datacenter IP, not the fingerprint. When a host repeatedly walls the local tiers (a 403/429 or a bot-wall page, SCRAPER_BOTWALL_EGRESS_AFTER times) it's flagged needs_egress and the cascade reruns through a residential proxy — but only for that host:

export SCRAPER_EGRESS_PROXY="http://user:pass@p.webshare.io:80"

The easy majority that already succeeds free at the datacenter IP stays direct, so you never spend (often metered) residential bandwidth on it. Escalation tries the cheap HTTP tiers through the proxy first (~0.2MB/page) before the heavier browser tiers. Webshare's free plan includes ~1GB/mo of residential bandwidth — enough for low-volume hard-host recovery at $0. Use SCRAPER_PROXY instead to force every request through a proxy.
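The host-scoped routing rule can be illustrated with a small policy object. This is a sketch, not the engine's code: `EgressPolicy` is a hypothetical name, the wall counter lives in memory rather than in the per-host DB, and the proxy URL is a placeholder.

```python
import os
from collections import Counter
from urllib.parse import urlsplit


class EgressPolicy:
    """Decide per URL whether a request goes direct or through a proxy."""

    def __init__(self, env=os.environ):
        self.after = int(env.get("SCRAPER_BOTWALL_EGRESS_AFTER", "2"))
        self.egress_proxy = env.get("SCRAPER_EGRESS_PROXY")
        self.force_proxy = env.get("SCRAPER_PROXY")
        self.walls = Counter()  # host -> number of walled local attempts

    def record_wall(self, url):
        self.walls[urlsplit(url).hostname] += 1

    def proxy_for(self, url):
        if self.force_proxy:  # SCRAPER_PROXY forces every request through it
            return self.force_proxy
        host = urlsplit(url).hostname
        if self.egress_proxy and self.walls[host] >= self.after:
            return self.egress_proxy  # host is flagged: use residential egress
        return None  # easy majority stays direct


policy = EgressPolicy(env={"SCRAPER_EGRESS_PROXY": "http://user:pass@proxy.example:80"})
hard = "https://hard.example/page"
print(policy.proxy_for(hard))            # not walled yet: direct
policy.record_wall(hard)
policy.record_wall(hard)
print(policy.proxy_for(hard))            # walled twice: proxied
print(policy.proxy_for("https://easy.example/"))  # other hosts stay direct
```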

Metrics & reporting

The engine derives all metrics from its own state files (no external store): the botwall event log (one row per tier attempt, incl. the detected challenge vendor) and the per-host DB (winning tier, per-vendor challenge_counts).

curl localhost:8799/metrics            # cost savings vs Firecrawl, coverage,
                                       # overall + per-tier latency, outcomes
curl localhost:8799/metrics/domains    # per-domain: error codes, challenges, latency
python -m switchback.flags             # periodic digest: domains stuck on Firecrawl,
                                       # escalated to egress, top challenged (cron-friendly)

Both endpoints accept ?minutes=N to window the event-derived sections. The savings figure compares engine spend (Firecrawl invocations only) against a Firecrawl-everything baseline, charging the hard-page credit multiplier (BENCH_FIRECRAWL_HARD_MULT) for URLs that needed a browser/residential tier or hit a challenge — i.e. exactly the ones Firecrawl bills more for.
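The arithmetic behind the savings figure can be worked through with made-up numbers. Everything in this sketch is an assumption for illustration: the `savings` helper, the per-URL record fields, the price of 1.0 per page, the multiplier of 5, and the choice to charge the multiplier on the engine's own Firecrawl calls as well as on the baseline.

```python
def savings(urls, price_usd, hard_mult):
    """Return (baseline, engine_spend, saved) for a batch of scraped URLs.

    Each item carries 'hard' (needed a browser/residential tier or hit a
    challenge) and 'used_firecrawl' (the engine actually called the paid tier).
    """
    def firecrawl_cost(u):
        return price_usd * (hard_mult if u["hard"] else 1)

    baseline = sum(firecrawl_cost(u) for u in urls)  # Firecrawl-everything
    spend = sum(firecrawl_cost(u) for u in urls if u["used_firecrawl"])
    return baseline, spend, baseline - spend


batch = (
    [{"hard": False, "used_firecrawl": False}] * 8  # easy pages, scraped free
    + [{"hard": True, "used_firecrawl": False}]     # hard page, won by the browser tier
    + [{"hard": True, "used_firecrawl": True}]      # hard page that fell through to Firecrawl
)
baseline, spend, saved = savings(batch, price_usd=1.0, hard_mult=5)
print(baseline, spend, saved)
```

With these placeholder values the baseline is 8 × 1 + 2 × 5 = 18, the engine spends 5 on its single paid call, and the reported saving is 13.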

Configuration

All configuration is via environment variables. The engine runs with missing pieces: each tier imports its deps lazily and a missing one just counts as a tier miss. Tracing no-ops if OTel isn't installed/configured.

Tracing (optional)

export OTEL_SERVICE_NAME=switchback
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
Env gates — enable/disable tiers and integrations
  • SCRAPER_DISABLE_FIRECRAWL — skip Tier 4
  • FIRECRAWL_API_KEY — enable Tier 4
  • SCRAPER_DISABLE_CAMOUFOX — turn off Tier 3b (on by default; needs pip install camoufox + camoufox fetch)
  • BU_CDP_URL — enable Tier 3c residential browser by pointing at a CDP endpoint
  • SCRAPER_PROXY — route all tiers/URLs through a proxy
  • SCRAPER_EGRESS_PROXY — route only walled hosts through a proxy (see Cost-scoped residential egress)
  • SEARXNG_URL — defaults to http://localhost:8888
  • SCRAPER_STATE_DIR — where the botwall DB/event log + session cache live
  • SCRAPER_COOKIES_FILE — Netscape cookies.txt to scrape login-gated hosts (injected into the HTTP and browser tiers)
  • SCRAPER_CAPTCHA_PROVIDER + SCRAPER_CAPTCHA_API_KEY — opt-in, off by default: wire a third-party solver (2captcha/capsolver/capmonster/anticaptcha/deathbycaptcha/9kw) into Tier 2 for Turnstile/reCAPTCHA/hCaptcha on CF hosts. Paid, billed per solve by the provider.
Tunables — budgets, timeouts, caches, backoff
  • SCRAPER_OUTPUT_FORMAT — output shape: markdown (default) · markdown_trimmed · html · html_selectors (see Output formats)
  • SCRAPER_DEADLINE_S — per-URL budget (45s)
  • SCRAPER_FIRECRAWL_FALLBACK_AFTER_S — after this many seconds on a URL, stop trying the local tiers and fall back to Firecrawl, so a hard host doesn't burn the whole deadline before the paid last resort gets a turn (25s; 0 = off)
  • SCRAPER_CAMOUFOX_TIMEOUT_MS — (45000)
  • SCRAPER_BROWSER_CONCURRENCY — max simultaneous headless browsers (default 1)
  • SCRAPER_BOTWALL_URL_SKIP_COOLDOWN_H — auto-skip re-test window (24h; 0 = never)
  • SCRAPER_BOTWALL_EGRESS_AFTER — local-tier failures before a host escalates to the residential tier (default 2)
  • SCRAPER_SESSION_TTL_S — cf_clearance reuse window (1800s)
  • SCRAPER_DISABLE_SESSION_CACHE — turn off cf_clearance reuse
  • SCRAPER_CONTENT_TTL_S — URL→result cache TTL (0 = off; set e.g. 86400 to skip re-scraping a page within a day)
  • SCRAPER_BACKOFF_BASE_MS / SCRAPER_BACKOFF_MAX_MS — exponential backoff between tiers after a rate-limit/timeout (base 0 = off)
  • SCRAPER_TIER_RETRIES — same-tier retries before falling through (default 0 = off; N → up to 1+N tries per tier), with per-tier overrides SCRAPER_TIER_RETRIES_<TIER> (e.g. SCRAPER_TIER_RETRIES_TIER3_BROWSER=2)
  • SCRAPER_TIER_RETRY_ON — failure classes eligible for a same-tier retry (default timeout,rate_limited,connection; widen to include botwall,http_block behind a rotating residential proxy so each retry gets a fresh IP). Retries are bounded by SCRAPER_DEADLINE_S; enabling them on the paid Firecrawl tier bills per attempt
  • SCRAPER_LOGIN_HOOK — pkg.module:func returning {cookie: value} for a host (see Logged-in sessions)
  • SCRAPER_EXTRACTION_FILE — per-domain extraction prefs JSON (default config/extraction.json)
  • SCRAPER_TRACE_SESSION — opt-in: capture a Playwright trace (screenshots + DOM + network) per browser-tier attempt, written to state/traces/
  • BENCH_FIRECRAWL_USD / BENCH_FIRECRAWL_HARD_MULT — cost model for the savings report
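To make the backoff and retry tunables concrete, here is one plausible reading of them. The doubling schedule is an assumption (the list above only says "exponential"), and `backoff_ms` and `tries_per_tier` are hypothetical helper names rather than engine functions.

```python
def backoff_ms(failures, base_ms, max_ms):
    """Delay before the next attempt after `failures` consecutive rate-limits/timeouts.

    Assumes the delay doubles each time and is capped at max_ms; base 0 turns it off.
    """
    if base_ms <= 0 or failures <= 0:
        return 0
    return min(max_ms, base_ms * 2 ** (failures - 1))


def tries_per_tier(retries):
    """SCRAPER_TIER_RETRIES=N allows up to 1+N tries on the same tier."""
    return 1 + max(0, retries)


print([backoff_ms(n, 250, 2000) for n in range(1, 6)])  # [250, 500, 1000, 2000, 2000]
print(tries_per_tier(0), tries_per_tier(2))              # 1 3
```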

Logged-in sessions

Beyond a static SCRAPER_COOKIES_FILE, wire SCRAPER_LOGIN_HOOK to a callable func(host) -> {cookie: value}. When an authenticated host trips a login/bot wall, the engine calls the hook once, persists the returned cookies per host, and overlays them on every tier (and future runs), then re-runs that URL on a fresh budget. The hook owns the site-specific login mechanics; the engine stays generic.
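A hook and the way a "pkg.module:func" string resolves to it might look like the sketch below. The names `load_hook` and `example_login`, the host, the cookie name, and the empty-dict return for unknown hosts are all illustrative assumptions; only the `func(host) -> {cookie: value}` shape comes from the description above.

```python
import importlib


def load_hook(spec):
    """Resolve a 'pkg.module:func' string to the callable it names."""
    module_name, _, func_name = spec.partition(":")
    return getattr(importlib.import_module(module_name), func_name)


def example_login(host):
    """Hypothetical hook: return cookies for hosts we know how to log in to."""
    if host == "members.example.com":
        # A real hook would run the site's login flow here and return its cookies.
        return {"sessionid": "REPLACE_WITH_REAL_SESSION_COOKIE"}
    return {}  # assumption: no cookies for hosts the hook does not handle


# The resolver works on any importable 'module:func' pair; math:sqrt is a stand-in.
sqrt = load_hook("math:sqrt")
print(sqrt(9.0))
print(example_login("members.example.com"))
```

If `example_login` lived in a module named `myapp.auth`, the setting would be `SCRAPER_LOGIN_HOOK=myapp.auth:example_login`.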

Session traces

With SCRAPER_TRACE_SESSION=1, each browser-tier attempt writes a Playwright trace zip to state/traces/. Manage them over HTTP — GET /traces (list), GET /traces/{id} (download), DELETE /traces/{id} — and open one with playwright show-trace <zip>. Off by default (traces are MBs each).

Output formats

Markdown is the default and is unchanged. Pick a different shape globally with SCRAPER_OUTPUT_FORMAT, or per call:

from switchback import scrape
scrape(["https://example.com/article"])                    # markdown (default)
scrape(["https://example.com/article"], fmt="html")        # raw HTML
scrape(["https://example.com/article"], fmt="markdown_trimmed")

switchback --format html_selectors https://example.com/article
curl -s localhost:8799/scrape -d '{"urls":["https://example.com"],"format":"html"}'

| format | what you get |
|---|---|
| markdown | whole-page markdown (boilerplate stripped + per-domain prefs); the default |
| markdown_trimmed | markdown with extra ad/nav/boilerplate lines removed |
| html | the raw HTML exactly as fetched, untouched |
| html_selectors | cleaned HTML (boilerplate strip + per-domain drop/selector), not converted |

The chosen content rides in the result's markdown field; in the CLI/server JSON the key is markdown for markdown formats and html for html formats. The API/PDF tiers (arXiv synth, PDF→text) have no HTML, so html formats fall back to their text for those sources.
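A caller reading the CLI or server JSON therefore has to pick the key by format. A small sketch of that lookup, where `content_key` and `extract_content` are hypothetical helpers and the sample item is invented for illustration:

```python
MARKDOWN_FORMATS = {"markdown", "markdown_trimmed"}
HTML_FORMATS = {"html", "html_selectors"}


def content_key(fmt="markdown"):
    """Name of the JSON key that carries the content for a given output format."""
    if fmt in MARKDOWN_FORMATS:
        return "markdown"
    if fmt in HTML_FORMATS:
        return "html"
    raise ValueError(f"unknown output format: {fmt!r}")


def extract_content(item, fmt="markdown"):
    """Pull the content out of one result item from the CLI/server JSON."""
    return item[content_key(fmt)]


# Invented sample item in the shape described above.
item = {"url": "https://example.com", "source_method": "tier1_http", "html": "<p>hi</p>"}
print(extract_content(item, "html_selectors"))
```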

Per-domain extraction

Markdown of the whole page is the default. To scope a site to its content node or strip site-specific noise, declare prefs per host in config/extraction.json (see config/extraction.example.json); every tier's normalize step picks them up automatically.

Contributing

Issues and PRs welcome — see CONTRIBUTING.md. Start with the cascade runner in switchback/orchestrator.py.

Responsible use

This engine is for lawful data collection. You are responsible for respecting each target site's Terms of Service, robots.txt, and rate limits, and for having the right to access the content you fetch. The stealth / anti-bot tiers (cloudscraper, patchright, camoufox) exist to handle legitimate access friction (e.g. generic bot interstitials on public pages) — not to evade access controls, paywalls, or authentication you aren't authorized to bypass. The software is provided "as is", without warranty (see LICENSE).

License

MIT — see LICENSE. Third-party dependencies and their licenses are listed in NOTICE; all are permissive (MIT / BSD-3-Clause / Apache-2.0) and compatible with this project's MIT license.

