Headless web-scraping MCP server built on Scrapy: fetch, extract (CSS/XPath), links, tables, sitemaps, robots, and async crawls.

scrapy-mcp

A headless web-scraping MCP server built on Scrapy. It exposes Scrapy's scraping primitives — polite fetching, CSS/XPath extraction, link and table extraction, sitemap and robots.txt reading, and bounded asynchronous crawls — as MCP tools an agent can call over stdio.

Headless, no rendering. Pages are fetched and parsed as HTML; no browser, no JavaScript execution. This keeps the footprint tiny — it runs comfortably on weak machines.
Reactor-safe. Every operation runs in a short-lived Scrapy subprocess, so Twisted's reactor never lives inside the asyncio MCP server (no ReactorNotRestartable), and memory is reclaimed after each call.
Polite by default. Obeys robots.txt, throttles with AutoThrottle, and enforces hard page/depth caps so a crawl can't run away.
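
The subprocess isolation described in the "Reactor-safe" point can be illustrated with the standard library alone. This is a minimal sketch, not scrapy-mcp's actual code: the `WORKER` script and the `run_isolated` helper are hypothetical stand-ins for a real Scrapy worker, and the JSON-over-stdin/stdout contract is an assumption made for the example.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a Scrapy worker: reads a JSON job from stdin
# and writes a JSON result to stdout. Not scrapy-mcp's real worker.
WORKER = """
import json, sys
job = json.load(sys.stdin)
json.dump({"url": job["url"], "ok": True}, sys.stdout)
"""


def run_isolated(job: dict, timeout: float = 60.0) -> dict:
    """Run one job in a fresh interpreter and return its JSON result."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER],
        input=json.dumps(job),   # job description goes in on stdin
        capture_output=True,     # result comes back on stdout
        text=True,
        timeout=timeout,         # wall-clock cap for the whole call
        check=True,              # raise if the worker exits non-zero
    )
    return json.loads(proc.stdout)


if __name__ == "__main__":
    print(run_isolated({"url": "https://example.com"}))
```

Because each call gets its own interpreter, any event loop or reactor started by the worker ends with the process, and the operating system reclaims its memory when it exits.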

Install / run

Run straight from PyPI with uv — no install step:

uvx scrapy-mcp

Or install it:

uv pip install scrapy-mcp
scrapy-mcp

The server speaks MCP over stdio, so any MCP client that can launch a command can use it. For Claude Desktop, add this to `claude_desktop_config.json`:

{
  "mcpServers": {
    "scrapy": {
      "command": "uvx",
      "args": ["scrapy-mcp"]
    }
  }
}
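
If the config file already lists other servers, the entry needs to be merged rather than pasted over them. The sketch below shows one way to do that in Python; the `add_scrapy_server` helper and the `other-server` entry are hypothetical, and locating and writing the file is left to the reader.

```python
import json


def add_scrapy_server(config: dict) -> dict:
    """Return a copy of a Claude Desktop config with the scrapy entry added."""
    merged = dict(config)
    # Copy the existing server map so other entries are preserved.
    servers = dict(merged.get("mcpServers", {}))
    servers["scrapy"] = {"command": "uvx", "args": ["scrapy-mcp"]}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
print(json.dumps(add_scrapy_server(existing), indent=2))
```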

Tools

Tool	What it does
`fetch_page(url, format, max_bytes, obey_robots)`	Fetch one page as `markdown` (default), `text`, or `html`.
`extract(url, selectors, obey_robots)`	Pull structured fields with CSS/XPath selectors.
`extract_tables(url, max_tables, obey_robots)`	Extract every HTML `<table>` as `{headers, rows}`.
`extract_links(url, same_domain, pattern, limit, obey_robots)`	List de-duplicated links on a page.
`get_sitemap(url, limit, obey_robots)`	Read a sitemap (gzip + sitemap-index aware).
`check_robots(url, user_agent)`	Report whether a URL is crawlable, plus the crawl-delay and sitemaps from `robots.txt`.
`start_crawl(start_url, allow_patterns, deny_patterns, max_pages, max_depth, same_domain, selectors, ...)`	Start a bounded BFS crawl; returns a `job_id`.
`crawl_status(job_id)`	State + pages scraped for a crawl.
`crawl_results(job_id, cursor, limit)`	Page through a crawl's scraped items.
`cancel_crawl(job_id)`	Stop a running crawl; keep results so far.
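
On the wire, an MCP client invokes any of these tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for `fetch_page`; the `tool_call` helper is hypothetical, and a real session must complete the MCP `initialize` handshake first, which most client libraries handle for you.

```python
import json


def tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize one MCP tools/call request as a single JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


# Ask for a page as plain text; argument names come from the table above.
print(tool_call(1, "fetch_page", {"url": "https://example.com", "format": "text"}))
```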

Selector format (`extract` / `start_crawl`)

`selectors` maps each output field to a selector. Each value is either a CSS string (first match) or an object for more control:

{
  "title": "h1::text",
  "price": "span.price::text",
  "all_links": {"css": "a::attr(href)", "all": true},
  "first_heading": {"xpath": "//h1/text()"}
}
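
To make the two value shapes concrete, here is a small sketch of how a selector map like the one above could be normalized and applied to a list of matches. It illustrates the documented semantics only and is not scrapy-mcp's implementation: `SelectorSpec`, `normalize`, and `pick` are hypothetical names, and returning `None` when nothing matches is an assumption.

```python
from typing import NamedTuple, Optional, Union


class SelectorSpec(NamedTuple):
    kind: str          # "css" or "xpath"
    expr: str          # the selector expression
    all_matches: bool  # True: return every match; False: first match only


def normalize(spec: Union[str, dict]) -> SelectorSpec:
    """Turn a bare CSS string or a {"css"|"xpath": ..., "all": ...} object into one shape."""
    if isinstance(spec, str):
        return SelectorSpec("css", spec, False)
    for kind in ("css", "xpath"):
        if kind in spec:
            return SelectorSpec(kind, spec[kind], bool(spec.get("all", False)))
    raise ValueError("selector object needs a 'css' or 'xpath' key")


def pick(matches: list, spec: SelectorSpec) -> Optional[object]:
    """Apply the first-match / all-matches rule to already-extracted matches."""
    if spec.all_matches:
        return list(matches)
    return matches[0] if matches else None  # assumption: no match yields None


print(normalize("h1::text"))
print(normalize({"css": "a::attr(href)", "all": True}))
print(pick(["/a", "/b"], normalize({"css": "a::attr(href)", "all": True})))
```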

"all": true returns every match as a list; otherwise the first match is returned.

Crawls are asynchronous

`start_crawl` returns immediately with a `job_id`. The crawl runs as a detached worker that streams results to disk, so it survives a server restart. Poll `crawl_status(job_id)`, then read items with `crawl_results(job_id)`, which is safe to call mid-crawl for partial results. Jobs are stored under the system temp dir and deleted after 7 days (configurable via `SCRAPY_MCP_JOB_TTL_DAYS`).
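
The start, poll, and page-through sequence might look like the sketch below. It is written against a generic `call_tool(name, arguments)` callable that you would supply from your MCP client. The response fields `state`, `items`, and `next_cursor`, the `"running"` state value, and the starting cursor of `0` are all assumptions for illustration; check the actual tool output before relying on them.

```python
import time
from typing import Callable, List

# Any function that sends a tool call and returns the decoded result.
ToolCaller = Callable[[str, dict], dict]


def wait_and_collect(
    call_tool: ToolCaller,
    start_url: str,
    poll_seconds: float = 2.0,
    page_size: int = 50,
) -> List[object]:
    """Start a crawl, wait for it to finish, then read every item."""
    job_id = call_tool("start_crawl", {"start_url": start_url})["job_id"]

    # Poll until the job leaves the (assumed) "running" state.
    while call_tool("crawl_status", {"job_id": job_id})["state"] == "running":
        time.sleep(poll_seconds)

    # Page through results until the (assumed) next_cursor runs out.
    items: List[object] = []
    cursor = 0
    while cursor is not None:
        page = call_tool(
            "crawl_results",
            {"job_id": job_id, "cursor": cursor, "limit": page_size},
        )
        items.extend(page["items"])
        cursor = page.get("next_cursor")
    return items
```

Because `crawl_results` can be called mid-crawl, the same paging loop could also run inside the polling loop to surface partial results early.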

Configuration

Every setting is an optional environment variable, with polite defaults tuned for a low-powered host. Environment variables are also how you configure a `uvx scrapy-mcp` deployment.

Variable	Default	Meaning
`SCRAPY_MCP_USER_AGENT`	`scrapy-mcp/<version> …`	User-Agent header.
`SCRAPY_MCP_OBEY_ROBOTS`	`true`	Obey `robots.txt`.
`SCRAPY_MCP_DOWNLOAD_DELAY`	`0.5`	Seconds between requests to a host.
`SCRAPY_MCP_CONCURRENT_REQUESTS`	`8`	Global concurrency.
`SCRAPY_MCP_CONCURRENT_REQUESTS_PER_DOMAIN`	`4`	Per-host concurrency.
`SCRAPY_MCP_DOWNLOAD_TIMEOUT`	`30`	Per-request timeout (s).
`SCRAPY_MCP_RETRY_TIMES`	`2`	Retries on transient failures.
`SCRAPY_MCP_AUTOTHROTTLE`	`true`	Adapt delay to server latency.
`SCRAPY_MCP_MAX_BYTES`	`50000`	Max characters returned per page (then truncated).
`SCRAPY_MCP_REQUEST_TIMEOUT`	`60`	Wall-clock cap for a blocking single fetch (s).
`SCRAPY_MCP_DEFAULT_MAX_PAGES` / `_MAX_PAGES_CAP`	`50` / `1000`	Crawl page default / hard cap.
`SCRAPY_MCP_DEFAULT_MAX_DEPTH` / `_MAX_DEPTH_CAP`	`2` / `10`	Crawl depth default / hard cap.
`SCRAPY_MCP_JOB_DIR`	`<tmp>/scrapy_mcp_jobs`	Where crawl jobs are stored.
`SCRAPY_MCP_JOB_TTL_DAYS`	`7`	Delete crawl jobs older than this (0 disables).
`SCRAPY_MCP_LOG_LEVEL`	`ERROR`	Scrapy log level (to stderr).
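
The default / hard-cap pairs can be read as "use the default when the caller gives no value, and never exceed the cap". The sketch below shows that reading for the page limit. It is an interpretation rather than the server's code: `resolve_max_pages` is a hypothetical helper, the full name `SCRAPY_MCP_MAX_PAGES_CAP` is inferred from the abbreviated `_MAX_PAGES_CAP` in the table, and clamping (rather than rejecting) an over-limit request is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_max_pages(
    requested: Optional[int],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Pick the effective page limit from the request, the default, and the cap."""
    default = int(env.get("SCRAPY_MCP_DEFAULT_MAX_PAGES", "50"))
    # Assumed full variable name for the abbreviated `_MAX_PAGES_CAP`.
    cap = int(env.get("SCRAPY_MCP_MAX_PAGES_CAP", "1000"))
    wanted = default if requested is None else requested
    # Assumption: values above the cap are clamped, and at least one page is allowed.
    return max(1, min(wanted, cap))


print(resolve_max_pages(None, {}))   # falls back to the default
print(resolve_max_pages(5000, {}))   # clamped to the hard cap
```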

Development

uv venv
uv pip install -e ".[dev]"
uv run pytest          # unit tests (no network)
uv build               # build wheel + sdist into dist/

License

MIT © Eitan Hadar

Project details

This version

0.1.0

Jun 10, 2026

Source Distribution

scrapy_mcp-0.1.0.tar.gz (125.7 kB view details)

Uploaded Jun 10, 2026 Source

Built Distribution

scrapy_mcp-0.1.0-py3-none-any.whl (32.2 kB view details)

Uploaded Jun 10, 2026 Python 3

File details

Details for the file scrapy_mcp-0.1.0.tar.gz.

File metadata

Download URL: scrapy_mcp-0.1.0.tar.gz
Upload date: Jun 10, 2026
Size: 125.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.21 (`uv publish`)

File hashes

Hashes for scrapy_mcp-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`69a6cf0be1b6c50f381be6e5227a6d8985ecf45c42c113b06df4c12ba3beb1b7`
MD5	`215e636a66872b212453215ecdd52bb4`
BLAKE2b-256	`0bc2b86fbaba653707aa509fadcb609f241daa69cf892e5bbe091453009bc0c5`

File details

Details for the file scrapy_mcp-0.1.0-py3-none-any.whl.

File metadata

Download URL: scrapy_mcp-0.1.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 32.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.21 (`uv publish`)

File hashes

Hashes for scrapy_mcp-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`15562d2e9aa00d28d7b056413103a93534f5eaa61350ab280961dd38ac6f9d64`
MD5	`1c800946f1ed61956e587c88bb42456f`
BLAKE2b-256	`8a1ca4ed3ab0e2aadaac1cada8cdae5f982661b02c1c88951fed983add340fba`
