Unified interface for web scraping engines — site to markdown with stealth, JS rendering, and LLM-ready output.

scrapefold

Unified Python library for web scraping: single URL or whole site → markdown, with stealth, JS rendering, and LLM-ready output. Wraps 16 engines (10 vendor APIs and 6 local engines, including stealth browsers) behind one async interface.

Status: v0.1.0a0 — scaffold. Engines land incrementally; see docs/README.md for the roadmap.

Why

The web is hostile. A real scraping pipeline has to cascade through cheap-and-fast → stealth-browser → paid-residential-proxy until something works. Hand-rolling that cascade means roughly 2000 LOC of glue code in every repo. scrapefold gives you one async call:

from scrapefold import scrape

# Run inside an async function (or an asyncio-aware REPL)
res = await scrape("https://example.com")
res.text       # always populated
res.markdown   # always populated
res.html       # only when the engine returned HTML
res.json       # only when the engine returned structured data

The same call works against a static blog (one requests call, ~200 ms, $0) and against a Datadome-protected site, where it auto-escalates through Scrapling → Cloakbrowser → Firecrawl → Bright Data Unlocker and stops at the first engine that succeeds.
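
To make the escalation idea concrete, here is a minimal sketch of a "first engine that succeeds" cascade. It does not use scrapefold itself: `Result`, `EngineFailed`, `cascade`, and the three stand-in engines are hypothetical names invented for illustration, and the library's real internals may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Sequence


@dataclass
class Result:
    """Hypothetical result shape, not scrapefold's real class."""
    engine: str
    text: str


class EngineFailed(Exception):
    """Raised by a stand-in engine when it is blocked or errors out."""


# An engine is any async callable that takes a URL and returns a Result.
Engine = Callable[[str], Awaitable[Result]]


async def cascade(url: str, ladder: Sequence[tuple[str, Engine]]) -> Result:
    """Try each engine in order and return the first successful result."""
    errors: list[str] = []
    for name, engine in ladder:
        try:
            return await engine(url)
        except EngineFailed as exc:
            # Remember why this rung failed, then escalate to the next one.
            errors.append(f"{name}: {exc}")
    raise EngineFailed("all engines failed: " + "; ".join(errors))


# Stand-in engines (assumptions): the cheap one is blocked, the others work.
async def cheap_requests(url: str) -> Result:
    raise EngineFailed("blocked (HTTP 403)")


async def stealth_browser(url: str) -> Result:
    return Result(engine="stealth_browser", text=f"content of {url}")


async def paid_unlocker(url: str) -> Result:
    return Result(engine="paid_unlocker", text=f"content of {url}")


# Ordered from cheapest to most expensive.
LADDER = [
    ("cheap_requests", cheap_requests),
    ("stealth_browser", stealth_browser),
    ("paid_unlocker", paid_unlocker),
]

if __name__ == "__main__":
    res = asyncio.run(cascade("https://example.com", LADDER))
    print(res.engine)  # prints: stealth_browser
```

Ordering the ladder by cost means the paid rung is only reached when every cheaper one has failed, which is the property the paragraph above describes.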

Install

pip install scrapefold                      # core + baseline requests engine
pip install "scrapefold[firecrawl]"         # one specific vendor
pip install "scrapefold[all]"               # everything
pip install "scrapefold[mcp]"               # for the MCP server

Quick start

import asyncio
from scrapefold import scrape, crawl_site, ScrapeOptions

async def main():
    # Single URL, auto-engine
    res = await scrape("https://example.com")
    print(res.markdown)

    # Russian-domain example — same opts work for every engine
    opts = ScrapeOptions(language="ru", country="ru", render_js=True, stealth=True)
    res = await scrape("https://lenta.ru", opts=opts)

    # Whole site → one big markdown file
    await crawl_site(
        "https://docs.example.com",
        opts=ScrapeOptions(max_pages=50, max_depth=3),
        output="site.md",
        cache_dir="~/.scrapefold/cache",
        cache_ttl_hours=24,
    )

asyncio.run(main())
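
The cache_dir and cache_ttl_hours arguments suggest a time-based page cache. The sketch below shows one way such a cache could work, assuming one file per URL named by a SHA-256 digest and freshness judged by file modification time. The helpers `cache_path`, `write_cached`, and `read_fresh` are hypothetical and are not part of scrapefold's API; the library's actual on-disk layout is not documented here.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional


def cache_path(cache_dir: str, url: str) -> Path:
    """Map a URL to a file under cache_dir (hypothetical layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # expanduser() resolves a leading "~", as in "~/.scrapefold/cache".
    return Path(cache_dir).expanduser() / f"{digest}.md"


def write_cached(cache_dir: str, url: str, markdown: str) -> Path:
    """Store the markdown for a URL, creating the cache directory if needed."""
    path = cache_path(cache_dir, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return path


def read_fresh(
    cache_dir: str,
    url: str,
    ttl_hours: float,
    now: Optional[float] = None,
) -> Optional[str]:
    """Return cached markdown if it is younger than ttl_hours, else None."""
    path = cache_path(cache_dir, url)
    if not path.exists():
        return None
    now = time.time() if now is None else now
    age_hours = (now - path.stat().st_mtime) / 3600
    if age_hours > ttl_hours:
        return None  # stale entry: the caller should scrape again
    return path.read_text(encoding="utf-8")
```

A crawler built this way would call `read_fresh` before each request and `write_cached` after each successful scrape, so re-running a crawl within the TTL window costs no network calls.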

CLI

scrapefold scrape https://example.com --engine firecrawl --language ru --json
scrapefold crawl https://docs.example.com --max-pages 50 --output site.md
scrapefold list-engines
scrapefold inspect-opts firecrawl

MCP server (for Claude Code, Cursor, agents)

pip install "scrapefold[mcp]"
scrapefold-mcp

Drop into ~/.claude/mcp.json:

{ "mcpServers": { "scrapefold": { "command": "scrapefold-mcp", "args": [] } } }

The server exposes four tools (scrape_url, crawl_site, list_engines, inspect_options) and two resources (scrapefold://cache/*, scrapefold://engines).
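
If the MCP config file already lists other servers, pasting the snippet over it would drop them. As a sketch of a safer approach, the hypothetical helper below (`add_mcp_server` is not part of scrapefold) merges the entry into an existing JSON config using only the standard library. It assumes the file follows the `mcpServers` layout shown above.

```python
import json
from pathlib import Path
from typing import Optional, Sequence


def add_mcp_server(
    config_path: Path,
    name: str,
    command: str,
    args: Optional[Sequence[str]] = None,
) -> dict:
    """Add or update one server entry while keeping all other settings."""
    config: dict = {}
    if config_path.exists():
        raw = config_path.read_text(encoding="utf-8")
        if raw.strip():  # tolerate an empty file
            config = json.loads(raw)

    # Create the "mcpServers" mapping if missing, then set our entry.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": list(args or [])}

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    path = Path("~/.claude/mcp.json").expanduser()
    add_mcp_server(path, "scrapefold", "scrapefold-mcp")
```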

Engines (v0.1, 16 total)

Local (free, no key): requests, scrapling, crawl4ai, cloakbrowser, obscura, selenium (deprecated).

SaaS (paid): firecrawl, scrapingbee, scrapingdog, jina, cloudflare, outscraper, apify_linkedin, anysite, brightdata_unlocker, brightdata_browser.

See docs/architecture/overview.md § Anti-bot escalation ladder for the full cascade.

License

MIT — see LICENSE.
