One cost-ordered scrape cascade (HTTP → stealth browser → paid), shared by every tool.

███████╗██╗    ██╗██╗████████╗ ██████╗██╗  ██╗██████╗  █████╗  ██████╗██╗  ██╗
██╔════╝██║    ██║██║╚══██╔══╝██╔════╝██║  ██║██╔══██╗██╔══██╗██╔════╝██║ ██╔╝
███████╗██║ █╗ ██║██║   ██║   ██║     ███████║██████╔╝███████║██║     █████╔╝
╚════██║██║███╗██║██║   ██║   ██║     ██╔══██║██╔══██╗██╔══██║██║     ██╔═██╗
███████║╚███╔███╔╝██║   ██║   ╚██████╗██║  ██║██████╔╝██║  ██║╚██████╗██║  ██╗
╚══════╝ ╚══╝╚══╝ ╚═╝   ╚═╝    ╚═════╝╚═╝  ╚═╝╚═════╝ ╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝

Give it a URL; it tries the cheapest way to get clean Markdown first and only escalates to a heavier (slower, costlier) tier when the cheap one is walled. Stops at the first success.


Why

Most scrapers either give up on hard pages or send everything through an expensive headless browser / paid API. switchback orders the methods by cost and walks them cheapest-first, per host, learning which tier wins where so the next run starts there. The easy majority stays free; only genuinely-walled hosts pay for the heavy tiers.

  • Cost-ordered cascade — free APIs → cheap HTTP → anti-bot solver → stealth browser → paid API.
  • Per-host memory (botwall) — remembers the winning tier per host, skip-lists hard blockers, auto-skips hosts stuck on the paid tier.
  • Cost-scoped residential egress — routes only walled hosts through a residential proxy, never the easy majority.
  • One shape, three entry points — Python library, CLI (JSON on stdout), or an HTTP service.
  • Observable — every attempt is an OpenTelemetry span; logs ship trace-correlated to any OTLP backend (Jaeger, Tempo, SigNoz).
  • Runs with any subset installed — each tier imports its deps lazily; a missing one is just a tier miss.
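The per-host memory idea can be pictured with a small sketch. This is not switchback's botwall implementation: the `HostMemory` class, the tier labels, and the in-memory dict are assumptions made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical tier labels, cheapest first (mirrors the cascade order above).
TIERS = ["api", "http", "solver", "browser", "paid"]


class HostMemory:
    """Remember which tier last succeeded for each host (illustrative only)."""

    def __init__(self):
        self._winner = {}  # host -> index into TIERS

    def record_win(self, url, tier):
        self._winner[urlsplit(url).hostname] = TIERS.index(tier)

    def plan(self, url):
        """Return the tiers to try, starting at the remembered winner."""
        start = self._winner.get(urlsplit(url).hostname, 0)
        return TIERS[start:]


memory = HostMemory()
memory.record_win("https://example.com/a", "browser")
print(memory.plan("https://example.com/b"))   # same host: starts at "browser"
print(memory.plan("https://other.example/"))  # unknown host: full cascade
```

A host with no recorded winner gets the full cascade, so only hosts that have already proven hard skip the cheap tiers.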

Quickstart

pip install switchback                 # core: cheap tiers (0/1) + search

from switchback import scrape

for r in scrape(["https://arxiv.org/abs/1706.03762"]):
    print(r.source_method, len(r.markdown))

python -m switchback https://example.com/article    # JSON on stdout, a bridge for any language

That's the whole loop. Add tiers as you need them (see Install).

The cascade (stop at first success)

| Tier | Strategy | Cost |
|------|----------|------|
| 0 | Direct APIs / mirrors (arxiv, wikipedia, EuropePMC; extend: job boards) | free, cleanest |
| 1 | Plain HTTP + TLS impersonation (curl_cffi), incl. PDFs | cheap |
| 2 | Cloudflare / anti-bot solver (cloudscraper, install .[cloudflare]) | cheap-ish (~5s/host) |
| 3 | Stealth headless browser (patchright, Chromium) | heavy |
| 3b | Camoufox (Firefox stealth), on by default (opt out: SCRAPER_DISABLE_CAMOUFOX) | heavy + slow (~40s on hard CF) |
| 3c | Residential-IP browser over CDP (BU_CDP_URL), off unless configured | heavy (remote egress) |
| 4 | Firecrawl (paid, env-gated, audited) | paid, last resort |

Every URL has a wall-clock budget (SCRAPER_DEADLINE_S, default 45s), checked between tiers, so a single URL can't sit through every tier's timeout back to back. Each tier attempt records its latency and outcome (ok / short_content / rate_limited / miss / not_applicable) to its span and to the botwall event log; the root span carries the total latency and the final outcome (including deadline_exceeded).
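As a rough illustration of "stop at first success" with a budget checked between tiers, here is a minimal sketch. It is not the code in switchback/orchestrator.py: the `run_cascade` function, the `(name, fetch)` tier shape, and the stand-in tiers are all hypothetical.

```python
import time


def run_cascade(url, tiers, deadline_s=45.0, clock=time.monotonic):
    """Try (name, fetch) pairs cheapest-first; stop at the first success.

    `fetch(url)` returns Markdown text, or None for a miss. The budget is
    checked between tiers only, so a tier already running is not interrupted.
    """
    start = clock()
    attempts = []  # (tier name, outcome, latency in seconds)
    for name, fetch in tiers:
        if clock() - start >= deadline_s:
            return None, "deadline_exceeded", attempts
        t0 = clock()
        markdown = fetch(url)
        outcome = "ok" if markdown else "miss"
        attempts.append((name, outcome, clock() - t0))
        if markdown:
            return markdown, "ok", attempts
    return None, "miss", attempts


# Stand-in tiers: the cheap one misses, the heavier one succeeds.
tiers = [
    ("http", lambda url: None),
    ("browser", lambda url: "# Title\n\nBody text"),
    ("paid", lambda url: "never reached"),
]
markdown, outcome, attempts = run_cascade("https://example.com", tiers)
print(outcome, [name for name, _, _ in attempts])
```

The paid stand-in is never called here, which is the point of ordering by cost: a success at any tier ends the walk.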

Search (query → URLs) is separate from the scrape cascade: switchback.search() / python -m switchback.api --search <query>, backed by a local SearXNG.

Install

pip install switchback                 # core: normalization + cheap tiers (0/1) + search
pip install "switchback[cloudflare]"   # + Tier 2 Cloudflare/anti-bot solver (cloudscraper)
pip install "switchback[server]"       # + HTTP service (fastapi, uvicorn) incl. /metrics + /traces
pip install "switchback[browser]" && patchright install chromium   # + Tier 3 stealth Chromium
pip install "switchback[camoufox]" && camoufox fetch               # + Tier 3b Firefox stealth
pip install "switchback[firecrawl]"    # + Tier 4 paid API (needs FIRECRAWL_API_KEY)
pip install "switchback[tracing]"      # + OpenTelemetry -> any OTLP backend
pip install "switchback[all]"          # everything

Tier 2's full v3 feature set (JS VM, Turnstile, stealth) needs the Enhanced Edition 3.x fork of cloudscraper. The cloudscraper on PyPI is the older v1/v2, and PyPI forbids a published package from pinning a git-URL dependency, so install the fork alongside switchback:

pip install "cloudscraper @ git+https://github.com/VeNoMouS/cloudscraper@3.0.0"

Or run the whole thing as a container: docker build -t switchback . && docker run -p 8799:8799 switchback.

Use it from your app

Three interchangeable entry points — all return the same shape ([{url, source_method, markdown}], successes only):

Python library

from switchback import scrape
for r in scrape(["https://arxiv.org/abs/1706.03762"]):
    print(r.source_method, len(r.markdown))

# Need failures + reasons too? scrape_detailed returns a ScrapeOutcome per URL
# (ok, final_outcome, error_class, status_code, and the per-tier attempts):
from switchback import scrape_detailed
for o in scrape_detailed(["https://www.pcmag.com/news"]):
    if not o.ok:
        print(o.url, o.final_outcome, o.error_class, o.status_code)

CLI (JSON on stdout — bridge for any language)

python -m switchback https://example.com/article        # or: switchback <url>

HTTP service (language-agnostic; one warm process keeps the browser pool hot)

switchback-server                                    # listens on :8799
curl -s localhost:8799/scrape -d '{"urls":["https://example.com"]}'
curl 'localhost:8799/search?q=web+scraping'

Non-Python callers: see clients/node_bridge.md. Python callers that want HTTP-with-CLI-fallback can drop in clients/python_client.py.
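For callers that prefer the standard library over curl, the POST above might be built as in the sketch below. The helper names are hypothetical, the `Content-Type` header is an assumption (the curl example does not set one), and this is not the code in clients/python_client.py.

```python
import json
import urllib.request


def build_scrape_request(urls, base_url="http://localhost:8799"):
    """Build the POST that mirrors the curl example above."""
    body = json.dumps({"urls": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/scrape",
        data=body,
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
        method="POST",
    )


def scrape_over_http(urls, timeout=120):
    """Send the request; assumes switchback-server is already listening."""
    with urllib.request.urlopen(build_scrape_request(urls), timeout=timeout) as resp:
        return json.load(resp)


req = build_scrape_request(["https://example.com"])
print(req.get_method(), req.full_url)
```

Calling `scrape_over_http([...])` against a running server should yield the same list-of-results shape described above, though the exact response envelope is not confirmed here.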

Cost-scoped residential egress

The dominant reason hard hosts wall you is the datacenter IP, not the fingerprint. When a host walls the local tiers SCRAPER_BOTWALL_EGRESS_AFTER times (a wall being a 403/429 response or a bot-wall page), it is flagged needs_egress and the cascade reruns through a residential proxy, but only for that host:

export SCRAPER_EGRESS_PROXY="http://user:pass@p.webshare.io:80"

The easy majority that already succeeds free at the datacenter IP stays direct, so you never spend (often metered) residential bandwidth on it. Escalation tries the cheap HTTP tiers through the proxy first (~0.2MB/page) before the heavier browser tiers. Webshare's free plan includes ~1GB/mo of residential bandwidth — enough for low-volume hard-host recovery at $0. Use SCRAPER_PROXY instead to force every request through a proxy.
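The escalation rule can be sketched as a per-host counter. This is an illustration under assumptions, not switchback's botwall code: `EgressTracker`, its methods, and the proxy URL are hypothetical, and the real engine may count and persist walls differently.

```python
from collections import defaultdict
from urllib.parse import urlsplit

WALL_STATUSES = {403, 429}


class EgressTracker:
    """Flag a host as needing residential egress after repeated walls."""

    def __init__(self, egress_after=2):
        self.egress_after = egress_after
        self._walls = defaultdict(int)  # host -> count of walled attempts

    def record(self, url, status_code, bot_wall_page=False):
        if status_code in WALL_STATUSES or bot_wall_page:
            self._walls[urlsplit(url).hostname] += 1

    def proxy_for(self, url, egress_proxy):
        """Return the proxy only for flagged hosts; None means go direct."""
        host = urlsplit(url).hostname
        if self._walls.get(host, 0) >= self.egress_after:
            return egress_proxy
        return None


tracker = EgressTracker(egress_after=2)
proxy = "http://proxy.example:8080"  # placeholder, not a real endpoint
tracker.record("https://hard.example/a", 403)
tracker.record("https://hard.example/b", 429)
tracker.record("https://easy.example/", 200)
print(tracker.proxy_for("https://hard.example/c", proxy))  # flagged host
print(tracker.proxy_for("https://easy.example/x", proxy))  # stays direct
```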

Metrics & reporting

The engine derives all metrics from its own state files (no external store): the botwall event log (one row per tier attempt, incl. the detected challenge vendor) and the per-host DB (winning tier, per-vendor challenge_counts).

curl localhost:8799/metrics            # cost savings vs Firecrawl, coverage,
                                       # overall + per-tier latency, outcomes
curl localhost:8799/metrics/domains    # per-domain: error codes, challenges, latency
python -m switchback.flags             # periodic digest: domains stuck on Firecrawl,
                                       # escalated to egress, top challenged (cron-friendly)

Both endpoints accept ?minutes=N to window the event-derived sections. The savings figure compares engine spend (Firecrawl invocations only) against a Firecrawl-everything baseline, charging the hard-page credit multiplier (BENCH_FIRECRAWL_HARD_MULT) for URLs that needed a browser/residential tier or hit a challenge — i.e. exactly the ones Firecrawl bills more for.
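The savings comparison is simple arithmetic, sketched below. The function and field names are hypothetical, the prices are made-up placeholders rather than Firecrawl's actual rates, and charging the multiplier on the engine's own Firecrawl calls for hard URLs is an assumption about how the two sides are kept comparable.

```python
def savings_report(urls, firecrawl_usd, hard_mult):
    """Compare engine spend with a Firecrawl-everything baseline.

    `urls` is an iterable of dicts with two booleans:
      hard           - needed a browser/residential tier or hit a challenge
      used_firecrawl - the engine actually invoked the paid tier
    """
    baseline = 0.0
    spend = 0.0
    for u in urls:
        cost = firecrawl_usd * (hard_mult if u["hard"] else 1.0)
        baseline += cost          # what Firecrawl-for-everything would cost
        if u["used_firecrawl"]:
            spend += cost         # what the engine actually paid
    return baseline, spend, baseline - spend


# Placeholder numbers: 8 easy URLs, 2 hard ones, one of which reached the paid tier.
sample = (
    [{"hard": False, "used_firecrawl": False}] * 8
    + [{"hard": True, "used_firecrawl": False},
       {"hard": True, "used_firecrawl": True}]
)
baseline, spend, saved = savings_report(sample, firecrawl_usd=0.01, hard_mult=5)
print(f"baseline={baseline:.2f} spend={spend:.2f} saved={saved:.2f}")
```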

Configuration

All configuration is via environment variables. The engine runs with missing pieces: each tier imports its deps lazily and a missing one just counts as a tier miss. Tracing no-ops if OTel isn't installed/configured.

Tracing (optional)

export OTEL_SERVICE_NAME=switchback
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

Env gates — enable/disable tiers and integrations
  • SCRAPER_DISABLE_FIRECRAWL — skip Tier 4
  • FIRECRAWL_API_KEY — enable Tier 4
  • SCRAPER_DISABLE_CAMOUFOX — turn off Tier 3b (on by default; needs pip install camoufox + camoufox fetch)
  • BU_CDP_URL — enable Tier 3c residential browser by pointing at a CDP endpoint
  • SCRAPER_PROXY — route all tiers/URLs through a proxy
  • SCRAPER_EGRESS_PROXY — route only walled hosts through a proxy (see Cost-scoped residential egress)
  • SEARXNG_URL — defaults to http://localhost:8888
  • SCRAPER_STATE_DIR — where the botwall DB/event log + session cache live
  • SCRAPER_COOKIES_FILE — Netscape cookies.txt to scrape login-gated hosts (injected into the HTTP and browser tiers)
  • SCRAPER_CAPTCHA_PROVIDER + SCRAPER_CAPTCHA_API_KEY — opt-in, off by default: wire a third-party solver (2captcha/capsolver/capmonster/anticaptcha/deathbycaptcha/9kw) into Tier 2 for Turnstile/reCAPTCHA/hCaptcha on CF hosts. Paid, billed per solve by the provider.
Tunables — budgets, timeouts, caches, backoff
  • SCRAPER_DEADLINE_S — per-URL budget (45s)
  • SCRAPER_CAMOUFOX_TIMEOUT_MS — Camoufox (Tier 3b) timeout in milliseconds (45000)
  • SCRAPER_BROWSER_CONCURRENCY — max simultaneous headless browsers (default 1)
  • SCRAPER_BOTWALL_URL_SKIP_COOLDOWN_H — auto-skip re-test window (24h; 0 = never)
  • SCRAPER_BOTWALL_EGRESS_AFTER — local-tier failures before a host escalates to the residential tier (default 2)
  • SCRAPER_SESSION_TTL_S — cf_clearance reuse window (1800s)
  • SCRAPER_DISABLE_SESSION_CACHE — turn off cf_clearance reuse
  • SCRAPER_CONTENT_TTL_S — URL→result cache TTL (0 = off; set e.g. 86400 to skip re-scraping a page within a day)
  • SCRAPER_BACKOFF_BASE_MS / SCRAPER_BACKOFF_MAX_MS — exponential backoff between tiers after a rate-limit/timeout (base 0 = off)
  • SCRAPER_LOGIN_HOOK — pkg.module:func returning {cookie: value} for a host (see Logged-in sessions)
  • SCRAPER_EXTRACTION_FILE — per-domain extraction prefs JSON (default config/extraction.json)
  • SCRAPER_TRACE_SESSION — opt-in: capture a Playwright trace (screenshots + DOM + network) per browser-tier attempt, written to state/traces/
  • BENCH_FIRECRAWL_USD / BENCH_FIRECRAWL_HARD_MULT — cost model for the savings report
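To make the backoff tunables concrete, here is one common way a base/max pair is turned into delays. The doubling schedule, the lack of jitter, and the `backoff_ms` helper are assumptions; the source only says the backoff is exponential and that a base of 0 turns it off.

```python
def backoff_ms(attempt, base_ms, max_ms):
    """Delay before the next tier; `attempt` counts from 0. base_ms=0 disables."""
    if base_ms <= 0:
        return 0
    return min(base_ms * 2 ** attempt, max_ms)


print([backoff_ms(i, base_ms=250, max_ms=2000) for i in range(5)])
# [250, 500, 1000, 2000, 2000]
print(backoff_ms(3, base_ms=0, max_ms=2000))
# 0
```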

Logged-in sessions

Beyond a static SCRAPER_COOKIES_FILE, wire SCRAPER_LOGIN_HOOK to a callable func(host) -> {cookie: value}. When an authenticated host trips a login/bot wall, the engine calls the hook once, persists the returned cookies per host, and overlays them on every tier (and future runs), then re-runs that URL on a fresh budget. The hook owns the site-specific login mechanics; the engine stays generic.
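A hook might look like the sketch below. Only the `func(host) -> {cookie: value}` contract comes from the text above; the module path `myhooks/auth.py`, the host name, the cookie name, and the `EXAMPLE_SESSION_TOKEN` variable are hypothetical, and how the engine treats an empty dict is not specified here.

```python
# Hypothetical file: myhooks/auth.py
# Wired up with: export SCRAPER_LOGIN_HOOK=myhooks.auth:login
import os


def login(host):
    """Return {cookie_name: value} for `host`, or {} if we have nothing."""
    if host == "members.example.com":
        # A real hook would perform the site's login flow here; this sketch
        # just reads a token the operator already obtained out of band.
        token = os.environ.get("EXAMPLE_SESSION_TOKEN", "")
        return {"sessionid": token} if token else {}
    return {}
```

Keeping credentials in the environment rather than in the hook's source follows the same pattern as the rest of the configuration.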

Session traces

With SCRAPER_TRACE_SESSION=1, each browser-tier attempt writes a Playwright trace zip to state/traces/. Manage them over HTTP — GET /traces (list), GET /traces/{id} (download), DELETE /traces/{id} — and open one with playwright show-trace <zip>. Off by default (traces are MBs each).

Per-domain extraction

Markdown of the whole page is the default. To scope a site to its content node or strip site-specific noise, declare prefs per host in config/extraction.json (see config/extraction.example.json); every tier's normalize step picks them up automatically.

Contributing

Issues and PRs welcome — see CONTRIBUTING.md. Start with the cascade runner in switchback/orchestrator.py.

Responsible use

This engine is for lawful data collection. You are responsible for respecting each target site's Terms of Service, robots.txt, and rate limits, and for having the right to access the content you fetch. The stealth / anti-bot tiers (cloudscraper, patchright, camoufox) exist to handle legitimate access friction (e.g. generic bot interstitials on public pages) — not to evade access controls, paywalls, or authentication you aren't authorized to bypass. The software is provided "as is", without warranty (see LICENSE).

License

MIT — see LICENSE. Third-party dependencies and their licenses are listed in NOTICE; all are permissive (MIT / BSD-3-Clause / Apache-2.0) and compatible with this project's MIT license.
