Unified interface for web scraping engines — site to markdown with stealth, JS rendering, and LLM-ready output.
scrapefold
Unified Python library for web scraping — single URL or whole-site → markdown, with stealth, JS rendering, and LLM-ready output. Wraps 16 vendor APIs and local stealth browsers behind one async interface.
Status: v0.1.0a0 — scaffold. Engines land incrementally; see docs/README.md for the roadmap.
Why
The web is hostile. A real scraping pipeline has to cascade through cheap-and-fast → stealth-browser → paid-residential-proxy until something works. Hand-rolling that cascade per project means 2000 LOC of glue code per repo. scrapefold gives you one async call:
from scrapefold import scrape, ScrapeOptions
res = await scrape("https://example.com")
res.text # always
res.markdown # always
res.html # when the engine returned HTML
res.json # when the engine returned structured data
The same call works against a static blog (one requests call, ~200 ms, $0) and against a Datadome-protected site (auto-escalates through Scrapling → Cloakbrowser → Firecrawl → Bright Data Unlocker, stops at the first one that succeeds).
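The escalation idea can be illustrated with a minimal sketch. This is not scrapefold's internal code: the `escalate` function, the `Attempt` record, and the two stand-in engines are hypothetical names chosen for illustration, and the engines do no network I/O. The sketch assumes each engine is an async callable that either returns page text or raises.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional, Sequence

# Hypothetical engine signature: takes a URL, returns page text or raises.
Engine = Callable[[str], Awaitable[str]]


@dataclass
class Attempt:
    engine: str
    ok: bool
    error: Optional[str] = None


async def escalate(
    url: str, ladder: Sequence[tuple[str, Engine]]
) -> tuple[str, list[Attempt]]:
    """Try each engine in order; return the first success plus the attempt log."""
    attempts: list[Attempt] = []
    for name, engine in ladder:
        try:
            text = await engine(url)
        except Exception as exc:  # a real ladder would catch narrower errors
            attempts.append(Attempt(name, ok=False, error=str(exc)))
            continue
        attempts.append(Attempt(name, ok=True))
        return text, attempts  # stop at the first engine that succeeds
    raise RuntimeError(f"all {len(attempts)} engines failed for {url}")


# Stand-in engines for the demo (assumptions, not real scrapefold engines).
async def cheap_engine(url: str) -> str:
    raise PermissionError("blocked by anti-bot challenge")


async def stealth_engine(url: str) -> str:
    return f"# content of {url}"


def demo() -> tuple[str, list[Attempt]]:
    ladder = [("cheap", cheap_engine), ("stealth", stealth_engine)]
    return asyncio.run(escalate("https://example.com", ladder))
```

Ordering the ladder from cheapest to most expensive means a page that the first engine can fetch never incurs the cost of the later ones, and the attempt log records why each earlier rung was skipped.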
Install
pip install scrapefold # core + baseline requests engine
pip install "scrapefold[firecrawl]" # one specific vendor
pip install "scrapefold[all]" # everything
pip install "scrapefold[mcp]" # for the MCP server
Quick start
import asyncio
from scrapefold import scrape, crawl_site, ScrapeOptions

async def main():
    # Single URL, auto-engine
    res = await scrape("https://example.com")
    print(res.markdown)

    # Russian-domain example: same opts work for every engine
    opts = ScrapeOptions(language="ru", country="ru", render_js=True, stealth=True)
    res = await scrape("https://lenta.ru", opts=opts)

    # Whole site → one big markdown file
    await crawl_site(
        "https://docs.example.com",
        opts=ScrapeOptions(max_pages=50, max_depth=3),
        output="site.md",
        cache_dir="~/.scrapefold/cache",
        cache_ttl_hours=24,
    )

asyncio.run(main())
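The `cache_dir` and `cache_ttl_hours` arguments suggest a time-to-live file cache. How scrapefold actually lays out its cache is not documented here, so the following is only a sketch of the general technique: the helper names (`cache_path`, `cache_get`, `cache_put`), the SHA-256 file naming, and the use of file modification time as the freshness check are all assumptions.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional


def cache_path(cache_dir: str, url: str) -> Path:
    """Map a URL to a file under cache_dir, keyed by its SHA-256 hex digest."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir).expanduser() / f"{key}.md"


def cache_get(cache_dir: str, url: str, ttl_hours: float) -> Optional[str]:
    """Return cached markdown if the file exists and is younger than ttl_hours."""
    path = cache_path(cache_dir, url)
    if not path.exists():
        return None
    age_seconds = time.time() - path.stat().st_mtime
    if age_seconds > ttl_hours * 3600:
        return None  # stale entry: the caller should fetch again
    return path.read_text(encoding="utf-8")


def cache_put(cache_dir: str, url: str, markdown: str) -> Path:
    """Write markdown for a URL, creating cache_dir if needed."""
    path = cache_path(cache_dir, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return path
```

With a 24-hour TTL, a second crawl of the same site within a day would read pages from disk instead of hitting the network or a paid engine again.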
CLI
scrapefold scrape https://example.com --engine firecrawl --language ru --json
scrapefold crawl https://docs.example.com --max-pages 50 --output site.md
scrapefold list-engines
scrapefold inspect-opts firecrawl
MCP server (for Claude Code, Cursor, agents)
pip install "scrapefold[mcp]"
scrapefold-mcp
Drop into ~/.claude/mcp.json:
{ "mcpServers": { "scrapefold": { "command": "scrapefold-mcp", "args": [] } } }
Exposes scrape_url, crawl_site, list_engines, inspect_options tools and scrapefold://cache/*, scrapefold://engines resources.
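If the config file already lists other MCP servers, the entry above has to be merged in rather than pasted over the file. A minimal sketch of that merge, using only the standard library; `register_mcp_server` is a hypothetical helper, not part of scrapefold, and it assumes the file (if present) contains valid JSON.

```python
import json
from pathlib import Path


def register_mcp_server(
    config_path: str,
    name: str = "scrapefold",
    command: str = "scrapefold-mcp",
) -> dict:
    """Add or update one entry under "mcpServers", leaving other servers untouched."""
    path = Path(config_path).expanduser()
    # Start from the existing config when there is one, else from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": []}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `register_mcp_server("~/.claude/mcp.json")` would write the same entry shown above while keeping any servers that were already configured.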
Engines (v0.1, 16 total)
Local (free, no key): requests, scrapling, crawl4ai, cloakbrowser, obscura, selenium (deprecated).
SaaS (paid): firecrawl, scrapingbee, scrapingdog, jina, cloudflare, outscraper, apify_linkedin, anysite, brightdata_unlocker, brightdata_browser.
See docs/architecture/overview.md § Anti-bot escalation ladder for the full cascade.
Documentation
- docs/README.md — index
- docs/architecture/overview.md — module map, data flow, escalation ladder
- docs/workflows/development.md — clone, install, run
- docs/workflows/testing.md — marker strategy
- docs/conventions/golden-rules.md — invariants
- docs/tools/agent-mode.md — --json, MCP server
- CONTRIBUTING.md — how to add a new engine
License
MIT — see LICENSE.
Download files
File details
Details for the file scrapefold-0.1.0a2.tar.gz.
File metadata
- Download URL: scrapefold-0.1.0a2.tar.gz
- Upload date:
- Size: 179.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e0ddfc56ae1d2d86962545b42227166b0bb241ef5f8445794a846a510df3e11c |
| MD5 | 59b90884055a22de8093eb672ed44b34 |
| BLAKE2b-256 | fd963cadfccc0de0d632a75c13789b99a29def82a19014f05f49d0a4324dd7d5 |
Provenance
The following attestation bundles were made for scrapefold-0.1.0a2.tar.gz:
Publisher: ci.yml on Mihailorama/scrapefold
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: scrapefold-0.1.0a2.tar.gz
  - Subject digest: e0ddfc56ae1d2d86962545b42227166b0bb241ef5f8445794a846a510df3e11c
  - Sigstore transparency entry: 1624814504
  - Sigstore integration time:
- Permalink: Mihailorama/scrapefold@9737f96639068a301fdd7604aa2d30546e0a6c54
- Branch / Tag: refs/tags/v0.1.0a2
- Owner: https://github.com/Mihailorama
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@9737f96639068a301fdd7604aa2d30546e0a6c54
- Trigger Event: push
File details
Details for the file scrapefold-0.1.0a2-py3-none-any.whl.
File metadata
- Download URL: scrapefold-0.1.0a2-py3-none-any.whl
- Upload date:
- Size: 69.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 413e3bf9ac621a74e1d45b44477322051a3d5ddba80370ce099d6e1ef6863493 |
| MD5 | 0213f3760c47eef3a030d57dce35b27e |
| BLAKE2b-256 | 99b1a531026de17fa1ba5f63210fe010479a649ace57117824e558dc623cb68e |
Provenance
The following attestation bundles were made for scrapefold-0.1.0a2-py3-none-any.whl:
Publisher: ci.yml on Mihailorama/scrapefold
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: scrapefold-0.1.0a2-py3-none-any.whl
  - Subject digest: 413e3bf9ac621a74e1d45b44477322051a3d5ddba80370ce099d6e1ef6863493
  - Sigstore transparency entry: 1624814547
  - Sigstore integration time:
- Permalink: Mihailorama/scrapefold@9737f96639068a301fdd7604aa2d30546e0a6c54
- Branch / Tag: refs/tags/v0.1.0a2
- Owner: https://github.com/Mihailorama
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@9737f96639068a301fdd7604aa2d30546e0a6c54
- Trigger Event: push