
Headless web-scraping MCP server built on Scrapy: fetch, extract (CSS/XPath), links, tables, sitemaps, robots, and async crawls.


scrapy-mcp

A headless web-scraping MCP server built on Scrapy. It exposes Scrapy's scraping primitives — polite fetching, CSS/XPath extraction, link and table extraction, sitemap and robots.txt reading, and bounded asynchronous crawls — as MCP tools an agent can call over stdio.

  • Headless, no rendering. Pages are fetched and parsed as HTML; there is no browser and no JavaScript execution, so content that a page renders client-side is not visible. This keeps the footprint small enough to run comfortably on low-resource machines.
  • Reactor-safe. Every operation runs in a short-lived Scrapy subprocess, so Twisted's reactor never lives inside the asyncio MCP server (no ReactorNotRestartable), and memory is reclaimed after each call.
  • Polite by default. Obeys robots.txt, throttles with AutoThrottle, and enforces hard page/depth caps so a crawl can't run away.
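
The reactor-safe point can be illustrated with a small sketch. This is not scrapy-mcp's actual code: the child script, the run_isolated helper, and the JSON-over-stdout convention are assumptions made for illustration. The idea is only that each operation gets a fresh interpreter, so anything process-global in the child (such as a Twisted reactor) is gone when it exits.

```python
import json
import subprocess
import sys

# Stand-in for the real work: a child script that prints one JSON result.
# In scrapy-mcp the child would run Scrapy; this worker is purely illustrative.
CHILD_CODE = """
import json, sys
url = sys.argv[1]
print(json.dumps({"url": url, "status": "ok"}))
"""


def run_isolated(url: str, timeout: float = 60.0) -> dict:
    """Run one operation in a fresh interpreter and return its JSON result."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, url],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,
        timeout=timeout,      # wall-clock cap so a stuck child cannot hang the caller
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)


result = run_isolated("https://example.com/")
```

Because the child exits after every call, its memory goes back to the operating system and the parent never has to restart anything.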

Install / run

Run straight from PyPI with uv — no install step:

uvx scrapy-mcp

Or install it:

uv pip install scrapy-mcp
scrapy-mcp

The server speaks MCP over stdio. Point any MCP client at it. For Claude Desktop, add to claude_desktop_config.json:

{
  "mcpServers": {
    "scrapy": {
      "command": "uvx",
      "args": ["scrapy-mcp"]
    }
  }
}
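
For clients written by hand, it may help to see roughly what travels over stdio. MCP is built on JSON-RPC 2.0, tools are invoked with the tools/call method, and the stdio transport carries one JSON message per line. The sketch below only builds such a message; the helper name tool_call_message is made up for illustration, and a real client must complete the MCP initialize handshake first, so an MCP SDK is usually the better choice.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids should be unique within a session


def tool_call_message(tool: str, arguments: dict) -> str:
    """Serialize one MCP tools/call request as a single newline-terminated line."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps escapes newlines inside strings, so the message stays on one line.
    return json.dumps(request, separators=(",", ":")) + "\n"


line = tool_call_message(
    "fetch_page",
    {"url": "https://example.com/", "format": "text"},
)
```

The tool name and argument names come from the Tools table; the result shape is defined by the server and is not shown here.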

Tools

| Tool | What it does |
| --- | --- |
| fetch_page(url, format, max_bytes, obey_robots) | Fetch one page as markdown (default), text, or html. |
| extract(url, selectors, obey_robots) | Pull structured fields with CSS/XPath selectors. |
| extract_tables(url, max_tables, obey_robots) | Extract every HTML <table> as {headers, rows}. |
| extract_links(url, same_domain, pattern, limit, obey_robots) | List de-duplicated links on a page. |
| get_sitemap(url, limit, obey_robots) | Read a sitemap (gzip + sitemap-index aware). |
| check_robots(url, user_agent) | Is a URL crawlable? Returns the crawl-delay and sitemaps. |
| start_crawl(start_url, allow_patterns, deny_patterns, max_pages, max_depth, same_domain, selectors, ...) | Start a bounded BFS crawl; returns a job_id. |
| crawl_status(job_id) | State + pages scraped for a crawl. |
| crawl_results(job_id, cursor, limit) | Page through a crawl's scraped items. |
| cancel_crawl(job_id) | Stop a running crawl; keep results so far. |
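
To make the check_robots row concrete, here is a sketch of the same three questions (is the URL allowed, what is the crawl-delay, which sitemaps are listed) answered with Python's standard urllib.robotparser. It is an illustration only: scrapy-mcp's own implementation and its exact return fields are not documented here, so the check helper and the dictionary keys are assumptions. The robots.txt text is inlined so the example needs no network.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2

Sitemap: https://example.com/sitemap.xml
"""


def check(robots_txt: str, url: str, user_agent: str = "*") -> dict:
    """Answer the check_robots questions from an in-memory robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "allowed": parser.can_fetch(user_agent, url),
        "crawl_delay": parser.crawl_delay(user_agent),
        # site_maps() returns None when no Sitemap lines exist (Python 3.8+).
        "sitemaps": parser.site_maps() or [],
    }


blocked = check(ROBOTS_TXT, "https://example.com/private/report")
allowed = check(ROBOTS_TXT, "https://example.com/public")
```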

Selector format (extract / start_crawl)

selectors maps an output field to a selector. Each value is either a CSS string (first match) or an object for more control:

{
  "title": "h1::text",
  "price": "span.price::text",
  "all_links": {"css": "a::attr(href)", "all": true},
  "first_heading": {"xpath": "//h1/text()"}
}
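
One way to read this format is as shorthand that expands to a uniform shape. The sketch below normalizes each value; the normalize_selector helper, the output keys, and the rule that an object carries exactly one of css or xpath are assumptions for illustration, not the server's documented behavior.

```python
from typing import Union

SelectorSpec = Union[str, dict]


def normalize_selector(spec: SelectorSpec) -> dict:
    """Expand a selector value to {"kind", "query", "all"}."""
    if isinstance(spec, str):
        # A bare string is a CSS selector that returns the first match.
        return {"kind": "css", "query": spec, "all": False}
    if not isinstance(spec, dict):
        raise TypeError("selector must be a string or an object")
    kinds = [k for k in ("css", "xpath") if k in spec]
    if len(kinds) != 1:
        # Assumption: an object names exactly one selector language.
        raise ValueError("selector object needs exactly one of 'css' or 'xpath'")
    kind = kinds[0]
    return {"kind": kind, "query": spec[kind], "all": bool(spec.get("all", False))}


selectors = {
    "title": "h1::text",
    "all_links": {"css": "a::attr(href)", "all": True},
    "first_heading": {"xpath": "//h1/text()"},
}
normalized = {field: normalize_selector(spec) for field, spec in selectors.items()}
```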

"all": true returns every match as a list; otherwise the first match is returned.

Crawls are asynchronous

start_crawl returns immediately with a job_id. The crawl runs as a detached worker that streams results to disk, so it survives a server restart. Poll crawl_status(job_id) until the crawl finishes, then read items with crawl_results(job_id); calling crawl_results mid-crawl is safe and returns the partial results so far. Jobs are stored under the system temp dir (SCRAPY_MCP_JOB_DIR) and deleted after 7 days (SCRAPY_MCP_JOB_TTL_DAYS).
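
The poll-then-page flow might look like the sketch below. The tool and argument names come from the Tools table, but everything about the responses is assumed: the "state", "items", and "next_cursor" fields and the terminal state names are hypothetical, so check the real payloads before relying on them. call_tool stands for whatever function your MCP client uses to invoke a tool and get a dict back.

```python
import time
from typing import Any, Callable, Iterator, Tuple

# Hypothetical client hook: (tool name, arguments) -> decoded result dict.
ToolCaller = Callable[[str, dict], dict]


def wait_for_crawl(
    call_tool: ToolCaller,
    job_id: str,
    poll_seconds: float = 2.0,
    done_states: Tuple[str, ...] = ("finished", "cancelled", "failed"),  # assumed names
) -> dict:
    """Poll crawl_status until the (assumed) state field is terminal."""
    while True:
        status = call_tool("crawl_status", {"job_id": job_id})
        if status.get("state") in done_states:
            return status
        time.sleep(poll_seconds)


def iter_results(call_tool: ToolCaller, job_id: str, limit: int = 50) -> Iterator[Any]:
    """Yield items page by page, following an assumed next_cursor field."""
    cursor = 0
    while True:
        page = call_tool(
            "crawl_results", {"job_id": job_id, "cursor": cursor, "limit": limit}
        )
        items = page.get("items", [])
        yield from items
        next_cursor = page.get("next_cursor")
        if not items or next_cursor is None:
            return
        cursor = next_cursor
```

Since crawl_results is safe to call mid-crawl, iter_results can also run before wait_for_crawl returns to peek at partial output.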

Configuration

All settings are optional environment variables. The defaults are polite and tuned for a low-resource host, so set a variable only to override one, for example in the environment of a uvx scrapy-mcp deployment.

| Variable | Default | Meaning |
| --- | --- | --- |
| SCRAPY_MCP_USER_AGENT | scrapy-mcp/<version> … | User-Agent header. |
| SCRAPY_MCP_OBEY_ROBOTS | true | Obey robots.txt. |
| SCRAPY_MCP_DOWNLOAD_DELAY | 0.5 | Seconds between requests to a host. |
| SCRAPY_MCP_CONCURRENT_REQUESTS | 8 | Global concurrency. |
| SCRAPY_MCP_CONCURRENT_REQUESTS_PER_DOMAIN | 4 | Per-host concurrency. |
| SCRAPY_MCP_DOWNLOAD_TIMEOUT | 30 | Per-request timeout (s). |
| SCRAPY_MCP_RETRY_TIMES | 2 | Retries on transient failures. |
| SCRAPY_MCP_AUTOTHROTTLE | true | Adapt delay to server latency. |
| SCRAPY_MCP_MAX_BYTES | 50000 | Max characters returned per page (then truncated). |
| SCRAPY_MCP_REQUEST_TIMEOUT | 60 | Wall-clock cap for a blocking single fetch (s). |
| SCRAPY_MCP_DEFAULT_MAX_PAGES / _MAX_PAGES_CAP | 50 / 1000 | Crawl page default / hard cap. |
| SCRAPY_MCP_DEFAULT_MAX_DEPTH / _MAX_DEPTH_CAP | 2 / 10 | Crawl depth default / hard cap. |
| SCRAPY_MCP_JOB_DIR | <tmp>/scrapy_mcp_jobs | Where crawl jobs are stored. |
| SCRAPY_MCP_JOB_TTL_DAYS | 7 | Delete crawl jobs older than this (0 disables). |
| SCRAPY_MCP_LOG_LEVEL | ERROR | Scrapy log level (to stderr). |
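
The "default / hard cap" pairs can be read as: use the default when the caller gives no value, and never exceed the cap. The sketch below shows that reading for crawl pages. It is an assumption about the behavior, not the server's code: the helper names are made up, the accepted boolean spellings are a guess, and the cap variable is assumed to expand to SCRAPY_MCP_MAX_PAGES_CAP.

```python
import os
from typing import Mapping, Optional


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Read a boolean setting; the accepted spellings here are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(env: Mapping[str, str], name: str, default: int) -> int:
    """Read an integer setting, falling back to the documented default."""
    raw = env.get(name)
    return default if raw is None else int(raw)


def resolve_max_pages(
    requested: Optional[int], env: Mapping[str, str] = os.environ
) -> int:
    """Apply the default when nothing is requested, then clamp to the hard cap."""
    default = env_int(env, "SCRAPY_MCP_DEFAULT_MAX_PAGES", 50)
    cap = env_int(env, "SCRAPY_MCP_MAX_PAGES_CAP", 1000)  # assumed variable name
    pages = default if requested is None else requested
    return min(pages, cap)
```

Under this reading, a request for 5000 pages with default settings would be reduced to 1000 rather than rejected; whether the server clamps or errors is not stated above.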

Development

uv venv
uv pip install -e ".[dev]"
uv run pytest          # unit tests (no network)
uv build               # build wheel + sdist into dist/

License

MIT © Eitan Hadar
