Resilient web-fetch MCP server: 3-tier escalation (curl_cffi -> Patchright -> nodriver) that fails honestly, with article-extraction mode and automatic content-type handling (HTML/JSON/PDF/image).
web-fetch-mcp
A web-fetch MCP server for LLM agents that
fails honestly — it raises FetchBlocked instead of silently handing your model
a CAPTCHA or login page as if it were the article.
Naive fetchers poison an agent's context: when a site returns a JavaScript
interstitial or a login wall with HTTP 200, the agent reads the challenge page as
if it were content and reasons from garbage. web-fetch-mcp detects that and
either escalates to a stronger strategy or fails loudly.
Status: early / alpha. The escalation logic and helpers are unit-tested, but real-world bypass rates are not yet benchmarked; see
assets/benchmarks.md and the roadmap in TODO.md.
How it works
A cheapest-first escalation ladder. Each tier targets a different layer of bot-detection, and the server only pays for the expensive ones when it has to:
| Tier | Engine | Targets | Speed |
|---|---|---|---|
| 1 | curl_cffi (Chrome TLS/HTTP2 fingerprint) | TLS (JA3/JA4) + HTTP/2 fingerprinting | ~500 ms |
| 2 | Patchright (real headful Chrome) | JavaScript fingerprinting; renders SPAs | ~1–3 s |
| 3 | nodriver (custom CDP) | automation-protocol (CDP) detection | ~2–4 s |
Every tier's output is checked for hard blocks (403/429/503) and soft
blocks (HTTP-200 challenge or login bodies served in place of content).
Transient failures retry with exponential backoff + jitter (honoring
Retry-After) before escalating. If everything is blocked, it raises
FetchBlocked with a remedy hint — it never returns a block page as content.
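The hard-block / soft-block distinction above can be illustrated with a minimal sketch. The `Response` type, the `classify` function, and the marker phrases are all hypothetical and are not taken from the project's detection module; only the three status codes come from the text.

```python
from dataclasses import dataclass

# Status codes the text names as hard blocks.
HARD_BLOCK_STATUSES = {403, 429, 503}

# Hypothetical marker phrases; a real detector needs a curated, tested list.
SOFT_BLOCK_MARKERS = (
    "verify you are human",
    "enable javascript and cookies",
    "please sign in to continue",
)


@dataclass
class Response:
    """Illustrative stand-in for whatever a tier returns."""
    status: int
    body: str


def classify(resp: Response) -> str:
    """Return 'hard', 'soft', or 'ok' for a fetched response."""
    if resp.status in HARD_BLOCK_STATUSES:
        return "hard"
    # A soft block is an HTTP-200 body that is a challenge or login page.
    lowered = resp.body.lower()
    if any(marker in lowered for marker in SOFT_BLOCK_MARKERS):
        return "soft"
    return "ok"
```

Substring matching like this is only a starting point: it can misfire on articles that quote such phrases, so a production detector would likely combine it with length and structure checks.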
Escalation path (mode="auto")
flowchart TD
A["fetch(url)"] --> B{dismiss_selector set?}
B -- "no" --> T1["Tier 1 · curl_cffi<br/>static fetch"]
B -- "yes (can't click)" --> T2
T1 --> C1{blocked or<br/>empty SPA shell?}
C1 -- "no" --> OK["render_by_type → return"]
C1 -- "yes, escalate" --> T2["Tier 2 · Patchright<br/>headful Chrome, JS render"]
T2 --> C2{blocked?}
C2 -- "no" --> OK
C2 -- "yes, escalate" --> T3["Tier 3 · nodriver<br/>custom-CDP stealth"]
T3 --> C3{blocked?}
C3 -- "no" --> OK
C3 -- "yes" --> X["raise FetchBlocked<br/>(suggest residential proxy)"]
OK:::done
X:::fail
classDef done fill:#1f7a1f,color:#fff,stroke:#0d4d0d;
classDef fail fill:#a11,color:#fff,stroke:#600;
Each tier runs through with_retry (exponential backoff + jitter, honoring
Retry-After) before the chain escalates. Tier 1 must clear the strict check
(not blocked and not an unrendered SPA shell); Tiers 2–3 only need to be
not-blocked. The single-tier modes (static/dynamic/stealth) run exactly one
box and skip the chain.
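As a rough sketch of the chain described above (not the code in service/escalation.py), the loop below walks a list of tier callables, applies the strict check only to the first tier, and raises `FetchBlocked` when nothing passes. The `FetchResult` fields, the `escalate` signature, and the `skip_static` flag (standing in for the dismiss_selector branch) are assumptions made for illustration; retries are omitted to keep it short.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List


class FetchBlocked(Exception):
    """Raised when every tier returned a block page."""


@dataclass
class FetchResult:
    tier: str
    body: str
    blocked: bool = False
    spa_shell: bool = False  # unrendered single-page-app shell


Tier = Callable[[str], Awaitable[FetchResult]]


async def escalate(url: str, tiers: List[Tier], skip_static: bool = False) -> FetchResult:
    # skip_static mirrors the dismiss_selector branch: tier 1 cannot click.
    start = 1 if skip_static else 0
    for index, tier in enumerate(tiers[start:], start=start):
        result = await tier(url)
        # Tier 1 must also not be an empty SPA shell; later tiers only need
        # to be not-blocked.
        strict_fail = index == 0 and result.spa_shell
        if not result.blocked and not strict_fail:
            return result
    raise FetchBlocked(f"all tiers blocked for {url}; consider a residential proxy")


# Fake tiers so the sketch runs without a network or a browser.
async def fake_static(url: str) -> FetchResult:
    return FetchResult("static", "<div id='root'></div>", spa_shell=True)


async def fake_dynamic(url: str) -> FetchResult:
    return FetchResult("dynamic", "challenge page", blocked=True)


async def fake_stealth(url: str) -> FetchResult:
    return FetchResult("stealth", "<article>ok</article>")


result = asyncio.run(escalate("https://example.com", [fake_static, fake_dynamic, fake_stealth]))
print(result.tier)
```

With these fakes the first tier fails the strict check, the second is blocked, and the third returns content, so the chain ends on the stealth tier.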
Tools
- fetch: retrieve a page as markdown / text / html / article (main-content extraction via trafilatura). Non-HTML URLs are auto-handled: JSON is pretty-printed, PDFs are text-extracted, images return a note to use screenshot.
- screenshot: render a page in real Chrome and return a PNG.
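The automatic content-type handling could look roughly like the dispatcher below. This is a hypothetical sketch, not the project's rendering code: the function name `dispatch_by_content_type` is invented here, and the PDF branch returns a placeholder instead of calling a real extraction library.

```python
import json


def dispatch_by_content_type(content_type: str, body: bytes) -> str:
    """Pick a rendering based on the response's Content-Type header."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "application/json" or mime.endswith("+json"):
        return json.dumps(json.loads(body), indent=2, ensure_ascii=False)
    if mime == "application/pdf":
        # Placeholder: a real implementation would extract text here.
        return "[pdf: text extraction would run here]"
    if mime.startswith("image/"):
        return f"[{mime} image: use the screenshot tool to view it]"
    # Fall back to treating the body as HTML or plain text.
    return body.decode("utf-8", errors="replace")


print(dispatch_by_content_type("application/json; charset=utf-8", b'{"a": 1}'))
```

Dispatching on the response header instead of the URL suffix handles endpoints that serve JSON or PDFs from extensionless paths.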
Architecture
A layered package (src/web_fetch_mcp/), dependencies pointing inward:
controller (FastMCP tools, lifespan) controller/app.py
-> service (retry decorator, strategy registry, escalation, facade)
-> accessor (curl_cffi / Patchright / nodriver, BrowserManager)
-> core (models, config, detection, rendering, proxy, backoff)
- Strategy: the three tiers are interchangeable async (request) -> FetchResult callables in a registry (service/strategies.py).
- Chain of Responsibility (intent): auto mode walks the tiers cheapest-first, escalating until one yields usable content (service/escalation.py).
- Decorator: with_retry adds exponential-backoff + Retry-After to any tier (service/retry.py), hand-rolled on the stdlib (no tenacity).
- Manager: BrowserManager owns one reused Chromium and closes it on the FastMCP lifespan shutdown (accessor/browser.py).
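To make the Decorator item concrete, here is a minimal stdlib-only sketch of a retry decorator with exponential backoff, jitter, and Retry-After support. It is not the code in service/retry.py: the `TransientError` class, the `backoff_delay` helper, the parameter names, and the default values are all assumptions, and the real `with_retry` signature may differ.

```python
import asyncio
import functools
import random
from typing import Optional


class TransientError(Exception):
    """Hypothetical error a tier raises for retryable failures."""

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds, parsed from Retry-After


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return random.uniform(0, min(cap, base * 2 ** attempt))


def with_retry(max_attempts: int = 3, sleep=asyncio.sleep):
    """Wrap an async callable so transient failures are retried."""

    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return await fn(*args, **kwargs)
                except TransientError as exc:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts; let the chain escalate
                    # A server-provided Retry-After wins over the computed delay.
                    if exc.retry_after is not None:
                        delay = exc.retry_after
                    else:
                        delay = backoff_delay(attempt)
                    await sleep(delay)

        return wrapper

    return decorator
```

The `sleep` parameter is injectable so the delay schedule can be unit-tested without waiting; applying the decorator to each registry entry keeps the tier functions themselves free of retry logic.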
Quickstart
uv sync
uv pip install -e . # installs the `web-fetch-mcp` console command
web-fetch-mcp # run the stdio MCP server
Register it with any MCP-compatible client as a stdio server that runs the
web-fetch-mcp command (or python -m web_fetch_mcp.controller.app).
fetch("https://example.com/article", output="article") # clean main content
fetch("https://api.site/data.json") # pretty-printed JSON
fetch("https://spa.example.com", mode="dynamic") # force a JS render
Responsible use
This tool is for fetching content you are authorized to access. You are
solely responsible for complying with each site's Terms of Service, robots.txt,
and applicable law. It honors Retry-After and backs off by default; please
rate-limit responsibly. It does not solve CAPTCHAs or bypass authentication
you do not hold. Provided as-is, without warranty.
License