One cost-ordered scrape cascade (HTTP → stealth browser → paid), shared by every tool.
switchback
Give it a URL; it tries the cheapest way to get clean Markdown first and only escalates to a heavier (slower, costlier) tier when the cheap one is walled. Stops at the first success.
Why
Most scrapers either give up on hard pages or send everything through an expensive headless browser / paid API. switchback orders the methods by cost and walks them cheapest-first, per host, learning which tier wins where so the next run starts there. The easy majority stays free; only genuinely-walled hosts pay for the heavy tiers.
- Cost-ordered cascade — free APIs → cheap HTTP → anti-bot solver → stealth browser → paid API.
- Per-host memory (botwall) — remembers the winning tier per host, skip-lists hard blockers, auto-skips hosts stuck on the paid tier.
- Cost-scoped residential egress — routes only walled hosts through a residential proxy, never the easy majority.
- One shape, three entry points — Python library, CLI (JSON on stdout), or an HTTP service.
- Observable — every attempt is an OpenTelemetry span; logs ship trace-correlated to any OTLP backend (Jaeger, Tempo, SigNoz).
- Runs with any subset installed — each tier imports its deps lazily; a missing one is just a tier miss.
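The per-host memory idea from the list above can be sketched in a few lines. This is only an illustration, not switchback's botwall code: the `HostMemory` class, the tier labels, and the in-memory dict are hypothetical stand-ins for the engine's own state files.

```python
from urllib.parse import urlsplit

# Hypothetical tier labels, cheapest first (illustration only).
TIERS = ["tier0_api", "tier1_http", "tier2_solver", "tier3_browser", "tier4_paid"]


class HostMemory:
    """Remember which tier last succeeded for each host."""

    def __init__(self):
        self._winner = {}  # host -> index into TIERS

    def record_win(self, url, tier):
        self._winner[urlsplit(url).hostname] = TIERS.index(tier)

    def plan(self, url):
        """Return the tiers to try, starting at the remembered winner."""
        start = self._winner.get(urlsplit(url).hostname, 0)
        return TIERS[start:]


memory = HostMemory()
memory.record_win("https://example.com/a", "tier3_browser")
print(memory.plan("https://example.com/b"))  # starts at the browser tier
print(memory.plan("https://example.org/"))   # unknown host: full cascade
```

Keying on the hostname rather than the full URL is what lets one hard page teach the engine about every other page on the same site.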
Quickstart
pip install switchback # core: cheap tiers (0/1) + search
from switchback import scrape
for r in scrape(["https://arxiv.org/abs/1706.03762"]):
    print(r.source_method, len(r.markdown))
python -m switchback https://example.com/article # JSON on stdout — bridge for any language
That's the whole loop. Add tiers as you need them (see Install).
The cascade (stop at first success)
| Tier | Strategy | Cost |
|---|---|---|
| 0 | Direct APIs / mirrors (arxiv, wikipedia, EuropePMC; extend: job boards) | free, cleanest |
| 1 | Plain HTTP + TLS impersonation (curl_cffi), incl. PDFs | cheap |
| 2 | Cloudflare / anti-bot solver (cloudscraper, install .[cloudflare]) | cheap-ish (~5s/host) |
| 3 | Stealth headless browser (patchright, Chromium) | heavy |
| 3b | Camoufox (Firefox stealth), on by default (opt out: SCRAPER_DISABLE_CAMOUFOX) | heavy + slow (~40s on hard CF) |
| 3c | Residential-IP browser over CDP (BU_CDP_URL), off unless configured | heavy (remote egress) |
| 4 | Firecrawl (paid, env-gated, audited) | paid, last resort |
Every URL has a wall-clock budget (SCRAPER_DEADLINE_S, default 45s) checked between
tiers so one URL can't run the whole cascade of timeouts. Each tier attempt records
latency + outcome (ok / short_content / rate_limited / miss / not_applicable)
to its span and the botwall event log; the root span carries total latency and the final
outcome (incl. deadline_exceeded).
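A rough sketch of the loop described above: walk the tiers in cost order, check the wall-clock budget between tiers, and record latency and outcome per attempt. The `run_cascade` function and the stand-in tiers are hypothetical, and using `miss` as the final outcome when every tier fails is an assumption; only the outcome names and the 45s default come from the text.

```python
import time


def run_cascade(url, tiers, deadline_s=45.0, clock=time.monotonic):
    """Try each (name, fetch) pair in order; stop at the first success.

    fetch(url) returns text on success or None on a miss. The deadline is
    checked only between tiers, so a tier already running is not interrupted.
    """
    start = clock()
    attempts = []  # (tier name, outcome, latency in seconds)
    for name, fetch in tiers:
        if clock() - start >= deadline_s:
            return None, "deadline_exceeded", attempts
        t0 = clock()
        text = fetch(url)
        outcome = "ok" if text else "miss"
        attempts.append((name, outcome, clock() - t0))
        if text:
            return text, "ok", attempts
    return None, "miss", attempts


# Stand-in tiers: the cheap one misses, the heavier one succeeds.
tiers = [
    ("tier1_http", lambda url: None),
    ("tier3_browser", lambda url: "# Title\n\nBody"),
]
text, final, attempts = run_cascade("https://example.com", tiers)
print(final, [(name, outcome) for name, outcome, _ in attempts])
```

Passing the clock in as a parameter keeps the budget logic testable without real sleeps.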
Search (query → URLs) is separate from the scrape cascade: switchback.search() /
python -m switchback.api --search <query>, backed by a local SearXNG.
Install
pip install switchback # core: normalization + cheap tiers (0/1) + search
pip install "switchback[cloudflare]" # + Tier 2 Cloudflare/anti-bot solver (cloudscraper)
pip install "switchback[server]" # + HTTP service (fastapi, uvicorn) incl. /metrics + /traces
pip install "switchback[browser]" && patchright install chromium # + Tier 3 stealth Chromium
pip install "switchback[camoufox]" && camoufox fetch # + Tier 3b Firefox stealth
pip install "switchback[firecrawl]" # + Tier 4 paid API (needs FIRECRAWL_API_KEY)
pip install "switchback[tracing]" # + OpenTelemetry -> any OTLP backend
pip install "switchback[all]" # everything
For Tier 2's full v3 JS-VM + Turnstile + stealth support, install the Enhanced
Edition 3.x fork alongside switchback. PyPI's cloudscraper is the older v1/v2, and
PyPI forbids a git-URL dependency inside a published package, so the fork has to
be installed separately:
pip install "cloudscraper @ git+https://github.com/VeNoMouS/cloudscraper@3.0.0"
Or run the whole thing as a container:
docker build -t switchback . && docker run -p 8799:8799 switchback
Production / cold-start deployment
The two heavy tiers pull dependencies that often can't be baked into a base image
and land after boot (e.g. an async install thread on Azure). Until they're
ready, those tiers report unavailable (a distinct outcome carrying the exact
fix) and the cascade falls through — they are never silently skipped. Checklist:
- Tier 3 is the real workhorse for Cloudflare/JS sites. Make sure its browser is installed: patchright install chromium (note: patchright, not vanilla playwright). On a cold start, run this in your post-boot install step/thread; Tier 3 flips to ready once it finishes.
- Tier 2 needs the cloudscraper 3.x fork (above) to attempt stealth. With the frozen PyPI cloudscraper it reports unavailable and fails fast (no wasted solve budget) instead of erroring mid-cascade. Tier 2 is a weak solver for modern Cloudflare; treat it as a cheap try before the browser, not the primary.
- Install Node.js for Tier 2's v3 JS-VM challenges: it is faster and thread-safe compared with the pure-Python js2py fallback (relevant under concurrent load).
- Bound Tier 2's solve budget with SCRAPER_CLOUDSCRAPER_TIMEOUT_S (default 25) so an unsolvable challenge can't eat the per-URL deadline before the browser tier runs. Lower it (e.g. 12) if Tier 2 rarely wins on your hosts.
Verify readiness on the box with the preflight check (doubles as a healthcheck — exit 0 when the capable tiers are ready):
switchback --doctor # or: python -m switchback --doctor
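If the healthcheck lives in Python rather than in a shell probe, the exit code can be wrapped as below. This is a sketch: the `tiers_ready` helper and its timeout are hypothetical, and the only behavior taken from the text is that the command exits 0 when the capable tiers are ready.

```python
import subprocess
import sys


def tiers_ready(cmd=("switchback", "--doctor"), timeout_s=60):
    """Return True when the preflight command exits with status 0."""
    try:
        result = subprocess.run(list(cmd), capture_output=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False  # command not installed yet, or the check hung
    return result.returncode == 0


# Stand-in command so the sketch runs without switchback installed;
# a real probe would call tiers_ready() with the default command.
print(tiers_ready([sys.executable, "-c", "raise SystemExit(0)"]))
```

Treating a missing executable as "not ready" suits the cold-start case, where the probe may run before the async install has finished.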
Use it from your app
Three interchangeable entry points — all return the same shape
([{url, source_method, markdown}], successes only):
Python library
from switchback import scrape
for r in scrape(["https://arxiv.org/abs/1706.03762"]):
    print(r.source_method, len(r.markdown))
# Need failures + reasons too? scrape_detailed returns a ScrapeOutcome per URL
# (ok, final_outcome, error_class, status_code, and the per-tier attempts):
from switchback import scrape_detailed
for o in scrape_detailed(["https://www.pcmag.com/news"]):
    if not o.ok:
        print(o.url, o.final_outcome, o.error_class, o.status_code)
CLI (JSON on stdout — bridge for any language)
python -m switchback https://example.com/article # or: switchback <url>
HTTP service (language-agnostic; one warm process keeps the browser pool hot)
switchback-server # listens on :8799
curl -s localhost:8799/scrape -d '{"urls":["https://example.com"]}'
curl 'localhost:8799/search?q=web+scraping'
Non-Python callers: see clients/node_bridge.md. Python callers that want HTTP-with-CLI-fallback can drop in clients/python_client.py.
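For a caller that only has the standard library, the two curl calls above translate roughly as follows. This is a sketch, not the shipped clients/python_client.py: the helper names are hypothetical, and sending a JSON Content-Type header is an assumption (the curl example does not set one).

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "http://localhost:8799"  # default port from the section above


def scrape_request(urls, base=BASE):
    """Build the POST /scrape request (same body as the curl example)."""
    body = json.dumps({"urls": list(urls)}).encode("utf-8")
    # Assumption: the service accepts a JSON content type.
    return Request(f"{base}/scrape", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


def search_url(query, base=BASE):
    """Build the GET /search URL with the query URL-encoded."""
    return f"{base}/search?{urlencode({'q': query})}"


def scrape_over_http(urls, base=BASE, timeout=120):
    """Send the request and return the decoded JSON response."""
    with urlopen(scrape_request(urls, base), timeout=timeout) as resp:
        return json.load(resp)


print(search_url("web scraping"))
```

With switchback-server running, `scrape_over_http(["https://example.com"])` should return the `[{url, source_method, markdown}]` shape described above.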
Cost-scoped residential egress
The dominant reason hard hosts wall you is the datacenter IP, not the
fingerprint. When a host repeatedly walls the local tiers (a 403/429 or a
bot-wall page, SCRAPER_BOTWALL_EGRESS_AFTER times) it's flagged needs_egress
and the cascade reruns through a residential proxy — but only for that host:
export SCRAPER_EGRESS_PROXY="http://user:pass@p.webshare.io:80"
The easy majority that already succeeds free at the datacenter IP stays direct,
so you never spend (often metered) residential bandwidth on it. Escalation tries
the cheap HTTP tiers through the proxy first (~0.2MB/page) before the heavier
browser tiers. Webshare's free plan includes ~1GB/mo
of residential bandwidth — enough for low-volume hard-host recovery at $0. Use
SCRAPER_PROXY instead to force every request through a proxy.
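The routing rule can be pictured as a small per-host counter. The sketch below is hypothetical (the `EgressPolicy` class is not part of switchback); it reuses the three environment variable names from this section and assumes that SCRAPER_PROXY takes precedence when both proxies are set.

```python
import os
from collections import Counter
from urllib.parse import urlsplit


class EgressPolicy:
    """Route a host through the egress proxy only after repeated walls."""

    def __init__(self, env=os.environ):
        self.force_proxy = env.get("SCRAPER_PROXY")          # every request
        self.egress_proxy = env.get("SCRAPER_EGRESS_PROXY")  # walled hosts only
        self.threshold = int(env.get("SCRAPER_BOTWALL_EGRESS_AFTER", "2"))
        self.walls = Counter()  # host -> walled local-tier attempts

    def record_wall(self, url):
        self.walls[urlsplit(url).hostname] += 1

    def proxy_for(self, url):
        """Return the proxy URL for this request, or None to go direct."""
        if self.force_proxy:
            return self.force_proxy
        host = urlsplit(url).hostname
        if self.egress_proxy and self.walls[host] >= self.threshold:
            return self.egress_proxy
        return None


policy = EgressPolicy({"SCRAPER_EGRESS_PROXY": "http://user:pass@proxy.example:80"})
hard = "https://hard.example.com/page"
policy.record_wall(hard)
policy.record_wall(hard)
print(policy.proxy_for(hard))                         # walled twice: proxied
print(policy.proxy_for("https://easy.example.org/"))  # never walled: direct
```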
Metrics & reporting
The engine derives all metrics from its own state files (no external store): the
botwall event log (one row per tier attempt, incl. the detected challenge vendor)
and the per-host DB (winning tier, per-vendor challenge_counts).
curl localhost:8799/metrics # cost savings vs Firecrawl, coverage,
# overall + per-tier latency, outcomes
curl localhost:8799/metrics/domains # per-domain: error codes, challenges, latency
python -m switchback.flags # periodic digest: domains stuck on Firecrawl,
# escalated to egress, top challenged (cron-friendly)
Both endpoints accept ?minutes=N to window the event-derived sections. The
savings figure compares engine spend (Firecrawl invocations only) against a
Firecrawl-everything baseline, charging the hard-page credit multiplier
(BENCH_FIRECRAWL_HARD_MULT) for URLs that needed a browser/residential tier or
hit a challenge — i.e. exactly the ones Firecrawl bills more for.
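As a worked example of the comparison, the sketch below prices a small batch both ways. The `savings` function and the numbers are illustrative only; it also assumes that an actual Firecrawl invocation is charged at the same per-URL rate the baseline uses, which the text does not spell out.

```python
def savings(urls, unit_cost, hard_mult):
    """Compare a Firecrawl-everything baseline with actual Firecrawl spend.

    urls: list of dicts with two booleans:
      "hard"           -> needed a browser/residential tier or hit a challenge
      "used_firecrawl" -> the engine actually invoked Firecrawl for it
    """
    def price(u):
        return unit_cost * (hard_mult if u["hard"] else 1)

    baseline = sum(price(u) for u in urls)
    spend = sum(price(u) for u in urls if u["used_firecrawl"])
    return baseline, spend, baseline - spend


# Illustrative batch: 8 easy pages, 1 hard page solved locally,
# 1 hard page that fell through to Firecrawl.
batch = (
    [{"hard": False, "used_firecrawl": False}] * 8
    + [{"hard": True, "used_firecrawl": False}]
    + [{"hard": True, "used_firecrawl": True}]
)
print(savings(batch, unit_cost=1.0, hard_mult=5.0))  # (18.0, 5.0, 13.0)
```

With a unit cost of 1 and a hard-page multiplier of 5, the baseline is 8 × 1 + 2 × 5 = 18, the engine spends 5 on the single Firecrawl call, and the reported saving is 13.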
Configuration
All configuration is via environment variables. The engine runs with missing pieces: each tier imports its deps lazily and a missing one just counts as a tier miss. Tracing no-ops if OTel isn't installed/configured.
Tracing (optional)
export OTEL_SERVICE_NAME=switchback
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
Env gates — enable/disable tiers and integrations
- SCRAPER_DISABLE_FIRECRAWL: skip Tier 4
- FIRECRAWL_API_KEY: enable Tier 4
- SCRAPER_DISABLE_CAMOUFOX: turn off Tier 3b (on by default; needs pip install camoufox + camoufox fetch)
- BU_CDP_URL: enable Tier 3c residential browser by pointing at a CDP endpoint
- SCRAPER_PROXY: route all tiers/URLs through a proxy
- SCRAPER_EGRESS_PROXY: route only walled hosts through a proxy (see Cost-scoped residential egress)
- SEARXNG_URL: defaults to http://localhost:8888
- SCRAPER_STATE_DIR: where the botwall DB/event log + session cache live
- SCRAPER_COOKIES_FILE: Netscape cookies.txt to scrape login-gated hosts (injected into the HTTP and browser tiers)
- SCRAPER_CAPTCHA_PROVIDER + SCRAPER_CAPTCHA_API_KEY: opt-in, off by default. Wires a third-party solver (2captcha/capsolver/capmonster/anticaptcha/deathbycaptcha/9kw) into Tier 2 for Turnstile/reCAPTCHA/hCaptcha on CF hosts. Paid, billed per solve by the provider.
Tunables — budgets, timeouts, caches, backoff
- SCRAPER_OUTPUT_FORMAT: output shape, one of markdown (default) · markdown_trimmed · html · html_selectors (see Output formats)
- SCRAPER_DEADLINE_S: per-URL budget (45s)
- SCRAPER_FIRECRAWL_FALLBACK_AFTER_S: after this many seconds on a URL, stop trying the local tiers and fall back to Firecrawl, so a hard host doesn't burn the whole deadline before the paid last resort gets a turn (25s; 0 = off)
- SCRAPER_CAMOUFOX_TIMEOUT_MS: Camoufox timeout in milliseconds (45000)
- SCRAPER_BROWSER_CONCURRENCY: max simultaneous headless browsers (default 1)
- SCRAPER_BOTWALL_URL_SKIP_COOLDOWN_H: auto-skip re-test window (24h; 0 = never)
- SCRAPER_BOTWALL_EGRESS_AFTER: local-tier failures before a host escalates to the residential tier (default 2)
- SCRAPER_SESSION_TTL_S: cf_clearance reuse window (1800s)
- SCRAPER_DISABLE_SESSION_CACHE: turn off cf_clearance reuse
- SCRAPER_CONTENT_TTL_S: URL→result cache TTL (0 = off; set e.g. 86400 to skip re-scraping a page within a day)
- SCRAPER_BACKOFF_BASE_MS / SCRAPER_BACKOFF_MAX_MS: exponential backoff between tiers after a rate-limit/timeout (base 0 = off)
- SCRAPER_TIER_RETRIES: same-tier retries before falling through (default 0 = off; N → up to 1+N tries per tier), with per-tier overrides SCRAPER_TIER_RETRIES_<TIER> (e.g. SCRAPER_TIER_RETRIES_TIER3_BROWSER=2)
- SCRAPER_TIER_RETRY_ON: failure classes eligible for a same-tier retry (default timeout,rate_limited,connection; widen to include botwall,http_block behind a rotating residential proxy so each retry gets a fresh IP). Retries are bounded by SCRAPER_DEADLINE_S; enabling them on the paid Firecrawl tier bills per attempt
- SCRAPER_LOGIN_HOOK: pkg.module:func returning {cookie: value} for a host (see Logged-in sessions)
- SCRAPER_EXTRACTION_FILE: per-domain extraction prefs JSON (default config/extraction.json)
- SCRAPER_TRACE_SESSION: opt-in. Captures a Playwright trace (screenshots + DOM + network) per browser-tier attempt, written to state/traces/
- BENCH_FIRECRAWL_USD / BENCH_FIRECRAWL_HARD_MULT: cost model for the savings report
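To make the retry and backoff tunables concrete, here is one way they could be resolved from the environment. This is a sketch, not the engine's code: the helper names and the `tier1_http` label are hypothetical, and both the doubling schedule and treating an unset max as "no cap" are assumptions.

```python
def tries_for(tier, env):
    """Total tries for a tier: 1 + N, with the per-tier override winning."""
    default = env.get("SCRAPER_TIER_RETRIES", "0")
    retries = int(env.get(f"SCRAPER_TIER_RETRIES_{tier.upper()}", default))
    return 1 + retries


def backoff_ms(step, env):
    """Delay for the given 0-based step, doubling each time up to the cap.

    Assumptions: the delay doubles per step, and an unset max means no cap.
    A base of 0 disables backoff.
    """
    base = int(env.get("SCRAPER_BACKOFF_BASE_MS", "0"))
    cap = int(env.get("SCRAPER_BACKOFF_MAX_MS", "0"))
    if base <= 0:
        return 0
    delay = base * (2 ** step)
    return min(delay, cap) if cap > 0 else delay


env = {
    "SCRAPER_TIER_RETRIES": "1",
    "SCRAPER_TIER_RETRIES_TIER3_BROWSER": "2",
    "SCRAPER_BACKOFF_BASE_MS": "250",
    "SCRAPER_BACKOFF_MAX_MS": "1000",
}
print(tries_for("tier3_browser", env))        # 3 (per-tier override)
print(tries_for("tier1_http", env))           # 2 (global default)
print([backoff_ms(s, env) for s in range(4)]) # [250, 500, 1000, 1000]
```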
Logged-in sessions
Beyond a static SCRAPER_COOKIES_FILE, wire SCRAPER_LOGIN_HOOK to a callable
func(host) -> {cookie: value}. When an authenticated host trips a login/bot
wall, the engine calls the hook once, persists the returned cookies per host, and
overlays them on every tier (and future runs), then re-runs that URL on a fresh
budget. The hook owns the site-specific login mechanics; the engine stays generic.
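A hook can be as small as the sketch below. The module path, the host name, the cookie name, and the EXAMPLE_SESSION_TOKEN variable are all hypothetical, and returning an empty dict for hosts the hook does not handle is an assumption; only the `func(host) -> {cookie: value}` signature comes from the text.

```python
# myapp/logins.py (hypothetical module), wired with:
#   export SCRAPER_LOGIN_HOOK="myapp.logins:login"
import os


def login(host):
    """Return {cookie_name: cookie_value} for a host, or {} if not handled.

    A real hook would run the site's login flow here. This sketch only reads
    a pre-issued session token from the environment.
    """
    if host == "members.example.com":
        token = os.environ.get("EXAMPLE_SESSION_TOKEN", "")
        return {"sessionid": token} if token else {}
    return {}


print(login("other.example.org"))  # {} (host not handled by this hook)
```

Keeping credentials in the environment rather than in the hook's source keeps them out of version control.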
Session traces
With SCRAPER_TRACE_SESSION=1, each browser-tier attempt writes a Playwright
trace zip to state/traces/. Manage them over HTTP — GET /traces (list),
GET /traces/{id} (download), DELETE /traces/{id} — and open one with
playwright show-trace <zip>. Off by default (traces are MBs each).
Output formats
Markdown is the default and is unchanged. Pick a different shape globally with
SCRAPER_OUTPUT_FORMAT, or per call:
from switchback import scrape
scrape(["https://example.com/article"]) # markdown (default)
scrape(["https://example.com/article"], fmt="html") # raw HTML
scrape(["https://example.com/article"], fmt="markdown_trimmed")
switchback --format html_selectors https://example.com/article
curl -s localhost:8799/scrape -d '{"urls":["https://example.com"],"format":"html"}'
| format | what you get |
|---|---|
| markdown | whole-page markdown (boilerplate stripped + per-domain prefs); the default |
| markdown_trimmed | markdown with extra ad/nav/boilerplate lines removed |
| html | the raw HTML exactly as fetched, untouched |
| html_selectors | cleaned HTML (boilerplate strip + per-domain drop/selector), not converted |
The chosen content rides in the result's markdown field; in the CLI/server JSON
the key is markdown for markdown formats and html for html formats. The
API/PDF tiers (arXiv synth, PDF→text) have no HTML, so html formats fall back to
their text for those sources.
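A client that parses the CLI/server JSON needs to pick the right key for the format it asked for. The helpers below are a hypothetical sketch of that rule, using only the four format names listed above.

```python
FORMATS = ("markdown", "markdown_trimmed", "html", "html_selectors")


def content_key(fmt="markdown"):
    """JSON key that holds the content for a given output format."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return "html" if fmt.startswith("html") else "markdown"


def content_of(item, fmt="markdown"):
    """Pull the content out of one CLI/server result object."""
    return item[content_key(fmt)]


item = {"url": "https://example.com", "source_method": "tier1", "html": "<p>hi</p>"}
print(content_key("html_selectors"))  # html
print(content_of(item, "html"))       # <p>hi</p>
```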
Per-domain extraction
Markdown of the whole page is the default. To scope a site to its content node or
strip site-specific noise, declare prefs per host in config/extraction.json
(see config/extraction.example.json); every
tier's normalize step picks them up automatically.
Contributing
Issues and PRs welcome — see CONTRIBUTING.md. Start with the
cascade runner in switchback/orchestrator.py.
Responsible use
This engine is for lawful data collection. You are responsible for respecting
each target site's Terms of Service, robots.txt, and rate limits, and for
having the right to access the content you fetch. The stealth / anti-bot tiers
(cloudscraper, patchright, camoufox) exist to handle legitimate access
friction (e.g. generic bot interstitials on public pages) — not to evade access
controls, paywalls, or authentication you aren't authorized to bypass. The
software is provided "as is", without warranty (see LICENSE).
License
MIT — see LICENSE. Third-party dependencies and their licenses are listed in NOTICE; all are permissive (MIT / BSD-3-Clause / Apache-2.0) and compatible with this project's MIT license.