URL in, LLM-ready markdown out. Stealth fetch with anti-bot bypass.
StealthFetch
URL in, LLM-ready markdown out.
from stealthfetch import fetch_markdown
md = fetch_markdown("https://en.wikipedia.org/wiki/Web_scraping")
Fetches any web page, strips nav, ads, and boilerplate, returns clean markdown. If the site blocks you, it auto-escalates to a stealth browser. One function, no config.
StealthFetch doesn't reinvent the hard parts: curl_cffi, trafilatura, html-to-markdown, Camoufox, and Patchright do the heavy lifting. StealthFetch is the orchestration layer: wiring them together, detecting blocks, deciding when to escalate, and handling the security concerns most tools skip.
How It Works
URL
│
▼
┌───────────────────────────────────────────┐
│ FETCH curl_cffi │
│ Chrome TLS fingerprint │
│ ↓ blocked? │
│ auto-escalate to stealth │
│ browser (Camoufox / │
│ Patchright) │
└─────────────────┬─────────────────────────┘
│
┌─────────────────▼─────────────────────────┐
│ EXTRACT trafilatura │
│ strips nav, ads, │
│ boilerplate │
└─────────────────┬─────────────────────────┘
│
┌─────────────────▼─────────────────────────┐
│ CONVERT html-to-markdown (Rust) │
└─────────────────┬─────────────────────────┘
│
▼
markdown
Each layer is one library call. The libraries do the hard work.
What StealthFetch Owns
Block Detection
Most anti-bot systems give themselves away before you ever see a captcha. StealthFetch uses status codes (403, 429, 503) as a fast first pass, then pattern-matches HTML signatures from Cloudflare, DataDome, PerimeterX, and Akamai. The trick is knowing when not to check: vendor-specific signatures (like _cf_chl_opt or perimeterx) are always checked because they never appear in real content. Generic phrases like "just a moment" or "access denied" are only checked on small pages (< 15k chars) since on a real article those strings are just words.
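The two-tier check described above can be sketched roughly as follows. This is a minimal illustration, not StealthFetch's actual code: the function name `looks_blocked`, the constant names, and the short signature lists are assumptions, and treating a 403/429/503 as blocked on its own is one reading of "fast first pass".

```python
# Illustrative sketch only; names and signature lists are assumptions.
BLOCK_STATUS_CODES = {403, 429, 503}

# Vendor-specific markers: assumed never to occur in genuine page content,
# so they are checked regardless of page size.
VENDOR_SIGNATURES = ("_cf_chl_opt", "perimeterx")

# Generic phrases: ordinary words in a long article, so they are only
# trusted as a block signal on small pages.
GENERIC_PHRASES = ("just a moment", "access denied")
SMALL_PAGE_LIMIT = 15_000  # characters


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True when a response looks like an anti-bot block page."""
    # Fast first pass: status codes commonly used by anti-bot systems.
    if status_code in BLOCK_STATUS_CODES:
        return True

    lowered = html.lower()

    # Vendor signatures are always checked.
    if any(signature in lowered for signature in VENDOR_SIGNATURES):
        return True

    # Generic phrases only count on small pages.
    if len(html) < SMALL_PAGE_LIMIT:
        return any(phrase in lowered for phrase in GENERIC_PHRASES)

    return False
```

The size gate is what keeps a long article that happens to contain "access denied" from triggering a needless browser launch.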
Auto-Escalation
Headless browsers are slow, heavy, and detectable in their own right. An HTTP request with a Chrome TLS fingerprint (via curl_cffi) gets through most sites just fine. So StealthFetch always tries HTTP first and only spins up a stealth browser when the response actually looks blocked. The interesting part isn't the browser itself; it's the decision of when to use it.
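The HTTP-first decision can be sketched with injected callables, so the control flow is visible without any real network or browser. Everything here is hypothetical: `fetch_with_escalation`, the stub fetchers, and the status-only check are stand-ins for illustration, not StealthFetch's internals.

```python
from typing import Callable


def fetch_with_escalation(
    url: str,
    http_fetch: Callable[[str], tuple[int, str]],
    browser_fetch: Callable[[str], str],
    is_blocked: Callable[[int, str], bool],
) -> tuple[str, str]:
    """Try the cheap HTTP path first; use the browser only when blocked.

    Returns (html, method_used).
    """
    status, html = http_fetch(url)
    if not is_blocked(status, html):
        return html, "http"
    # Only now pay the cost of launching a browser.
    return browser_fetch(url), "browser"


# Stub fetchers for demonstration; a real setup would call an HTTP client
# and a stealth browser here.
def fake_http(url: str) -> tuple[int, str]:
    if "blocked" in url:
        return 403, "<title>Just a moment...</title>"
    return 200, "<article>Real content</article>"


def fake_browser(url: str) -> str:
    return "<article>Rendered content</article>"


def status_only_check(status: int, html: str) -> bool:
    return status in {403, 429, 503}


if __name__ == "__main__":
    for target in ("https://example.com/ok", "https://example.com/blocked"):
        html, method = fetch_with_escalation(
            target, fake_http, fake_browser, status_only_check
        )
        print(target, "->", method)
```

Passing the fetchers in as arguments keeps the decision logic testable on its own, which is the part the section argues matters most.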
SSRF Protection
Most scraping tools, including ones with 60-85k GitHub stars, trust whatever URL you hand them. StealthFetch doesn't. A hostname that resolves to 127.0.0.1? Rejected. A redirect chain that bounces through three domains and lands on a private IP? Caught. IPv4-mapped IPv6 bypasses and link-local addresses are covered too: every address is validated before the request goes out, and again after redirects resolve.
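A minimal sketch of this kind of validation, using only the standard library, might look like the following. The helper names `is_public_ip` and `validate_url` are assumptions for illustration; StealthFetch's real checks may differ in scope and detail.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """Return True only for addresses that are safe to fetch from."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so IPv4 rules apply.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def validate_url(url: str) -> None:
    """Raise ValueError if the URL resolves to a non-public address."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"unsupported URL: {url!r}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # Check every address the hostname resolves to, not just the first.
    for info in socket.getaddrinfo(parts.hostname, port, proto=socket.IPPROTO_TCP):
        address = info[4][0]
        if not is_public_ip(address):
            raise ValueError(f"{parts.hostname} resolves to blocked address {address}")
```

To cover redirect chains, the same `validate_url` call would be repeated on each redirect target before following it, rather than only on the URL the caller supplied.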
Works On
Most sites return clean markdown in under a second. Sites that fight back (Reddit, Amazon) get auto-escalated to a stealth browser. That takes 5–8 seconds, but you don't have to think about it.
| Site | What You Get |
|---|---|
| Wikipedia, Reuters, BBC News, TechCrunch | Articles and news — straight through |
| Hacker News | Threads and comments |
| Stack Overflow | Q&A with code blocks |
| Medium | Articles — Cloudflare-protected, but no false-positive escalation (passive JS, not a block page) |
| Reddit | Blocked by challenge page → auto-escalates to browser |
| Amazon | Blocked by CAPTCHA → auto-escalates to browser |
Install
Try it — no install needed (requires uv):
uvx stealthfetch https://en.wikipedia.org/wiki/Web_scraping
Install as a library:
pip install stealthfetch
Note: trafilatura brings ~20 transitive dependencies (lxml, charset-normalizer, etc.). Total install is ~50 packages.
Add stealth browser support (necessary for escalation logic):
pip install "stealthfetch[browser]"
camoufox fetch
CLI
stealthfetch https://en.wikipedia.org/wiki/Web_scraping
stealthfetch https://spa-app.com -m browser
stealthfetch https://example.com --no-links --no-tables
stealthfetch https://example.com --header "Cookie: session=abc"
MCP Server
StealthFetch is an MCP server — any MCP client (Claude Desktop, Claude Code, Cursor, etc.) can call it as a tool to fetch web pages as markdown.
No install needed — add this to your MCP client config:
{
"mcpServers": {
"stealthfetch": {
"command": "uvx",
"args": ["--from", "stealthfetch[mcp]", "stealthfetch-mcp"]
}
}
}
Or if you prefer a persistent install:
pip install "stealthfetch[mcp]"
{
"mcpServers": {
"stealthfetch": {
"command": "stealthfetch-mcp"
}
}
}
API
fetch_markdown(url, **kwargs) -> str
Also available as afetch_markdown — same signature, async. Extraction and conversion run off the event loop via asyncio.to_thread.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | required | URL to fetch |
| `method` | `str` | `"auto"` | `"auto"`, `"http"`, or `"browser"` |
| `browser_backend` | `str` | `"auto"` | `"auto"`, `"camoufox"`, or `"patchright"` |
| `include_links` | `bool` | `True` | Preserve hyperlinks |
| `include_images` | `bool` | `False` | Preserve image references |
| `include_tables` | `bool` | `True` | Preserve tables |
| `timeout` | `int` | `30` | Timeout in seconds |
| `proxy` | `dict` | `None` | `{"server": "...", "username": "...", "password": "..."}` |
| `headers` | `dict` | `None` | Additional HTTP headers |
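The `asyncio.to_thread` pattern mentioned for the async variant can be sketched in isolation. The functions below are hypothetical stand-ins (a sleep plus a trivial string replace in place of real extraction and conversion); they only show how blocking work is pushed off the event loop so several pages can be processed concurrently.

```python
import asyncio
import time


def extract_and_convert(html: str) -> str:
    """Hypothetical stand-in for blocking extraction + markdown conversion."""
    time.sleep(0.05)  # simulate blocking work
    return html.replace("<p>", "").replace("</p>", "\n")


async def to_markdown(html: str) -> str:
    # Run the blocking function in a worker thread so the event loop
    # stays free to service other tasks.
    return await asyncio.to_thread(extract_and_convert, html)


async def main() -> list[str]:
    pages = ["<p>one</p>", "<p>two</p>"]
    # gather preserves input order in its results.
    return await asyncio.gather(*(to_markdown(page) for page in pages))


if __name__ == "__main__":
    print(asyncio.run(main()))
```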
fetch_result(url, **kwargs) -> FetchResult
Same fetch/extract/convert pipeline as fetch_markdown, but returns a structured dataclass with the markdown and page metadata extracted as a free side-effect of parsing.
from stealthfetch import fetch_result
r = fetch_result("https://en.wikipedia.org/wiki/Web_scraping", method="http")
print(r.title) # "Web scraping"
print(r.author) # "Wikipedia contributors" (when available)
print(r.date) # ISO 8601 date (when available)
print(r.markdown[:200])
FetchResult fields:
| Field | Type | Description |
|---|---|---|
| `markdown` | `str` | Cleaned markdown content |
| `title` | `str \| None` | Page title |
| `author` | `str \| None` | Author name |
| `date` | `str \| None` | Publication date (ISO 8601 when available) |
| `description` | `str \| None` | Meta description |
| `url` | `str \| None` | Canonical URL (may differ from input) |
| `hostname` | `str \| None` | Hostname |
| `sitename` | `str \| None` | Publisher name |
To get a plain dict: dataclasses.asdict(result).
afetch_result has the same signature, async.
Optional Dependencies
| Extra | What it adds |
|---|---|
| `stealthfetch[camoufox]` | Camoufox stealth Firefox |
| `stealthfetch[patchright]` | Patchright stealth Chromium |
| `stealthfetch[browser]` | Both |
| `stealthfetch[mcp]` | MCP server |
Python 3.10+. Tested on 3.10–3.13, Linux and macOS.
Roadmap
Things that would make sense if this gets traction:
- Homebrew tap: `brew install stealthfetch` for people who don't want to think about Python
- Docker image: bundle browser backends pre-installed, no `camoufox fetch` step, plays well with Docker's MCP Catalog
Contributions welcome.
License
MIT