Web scraping MCP server with SSRF protection

ScrapeMCP — Web Scraping MCP Server

MCP server for structured web data extraction. It scrapes pages, tables, lists, sitemaps, and more, and includes built-in SSRF protection.

Features

Tool	Description
`scrape`	Extracts content from a URL using custom CSS selectors
`inspect`	Analyzes the structure of a page (meta tags, headings, links, images, forms, scripts)
`tables`	Extracts all HTML tables from a page
`scrape_list`	Extracts a list of items with custom fields defined by CSS selectors
`scrape_recursive`	Follows linked pages recursively, extracting data from each
`sitemap`	Parses a website's sitemap.xml
`scrape_sitemap`	Scrapes every URL listed in a sitemap
`export`	Exports data to CSV, Markdown, or JSON

SSRF Protection

The server automatically blocks access to:

Private IPs (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
Localhost/loopback (127.0.0.0/8, ::1)
Link-local (169.254.0.0/16, fe80::/10)
Blocked hostnames: localhost, metadata.google.internal, 169.254.169.254
.internal and .local domains

Only the http:// and https:// schemes are allowed.
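
The rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `scrapers/base.py`: the names `is_blocked_ip`, `is_url_allowed`, and the `resolved_ips` parameter are hypothetical, and only the networks, hostnames, suffixes, and schemes come from the list above.

```python
import ipaddress
from collections.abc import Sequence
from urllib.parse import urlsplit

# Values copied from the blocklist above; all names here are hypothetical.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(net)
    for net in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # private ranges
        "127.0.0.0/8", "::1/128",                          # loopback
        "169.254.0.0/16", "fe80::/10",                     # link-local
    )
]
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal", "169.254.169.254"}
BLOCKED_SUFFIXES = (".internal", ".local")
ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_ip(address: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)  # raises ValueError for non-IP strings
    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:192.168.0.1.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in net for net in BLOCKED_NETWORKS)


def is_url_allowed(url: str, resolved_ips: Sequence[str] = ()) -> bool:
    """Apply the scheme, hostname, suffix, and IP-range checks to one URL."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    host = (parts.hostname or "").rstrip(".").lower()
    if not host or host in BLOCKED_HOSTNAMES or host.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        # The host is an IP literal, so check it directly.
        return not is_blocked_ip(host)
    except ValueError:
        pass  # Not an IP literal: treat it as a DNS name.
    # For DNS names, check whatever addresses the caller resolved.
    return not any(is_blocked_ip(ip) for ip in resolved_ips)
```

In a real deployment the caller would presumably resolve the hostname first (for example with `socket.getaddrinfo`) and pass the resulting addresses as `resolved_ips`, so that a public-looking name pointing at a private address is rejected as well.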

Tech Stack

Python: >=3.11
Framework: mcp (FastMCP) via stdio JSON-RPC
HTTP: httpx (async) — replaced sync requests for non-blocking I/O
Parsing: beautifulsoup4 + html5lib
Export: CSV (sanitized), Markdown, JSON
Config: python-dotenv — reads .env file for all settings
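
As a rough sketch of environment-driven configuration, the snippet below reads the four variables this README names under Recent Improvements (`HTTP_TIMEOUT`, `SITEMAP_URL_LIMIT`, `MAX_RECURSIVE_PAGES`, `RECURSIVE_DELAY`). The `Settings` class, the `load_settings` function, and every default value are assumptions made for illustration; the project's real defaults are not documented here.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the settings named in this README."""
    http_timeout: float
    sitemap_url_limit: int
    max_recursive_pages: int
    recursive_delay: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # With python-dotenv, load_dotenv() would typically run before this
    # so that entries from the .env file are present in os.environ.
    # The fallback values below are placeholders, not the project's defaults.
    return Settings(
        http_timeout=float(env.get("HTTP_TIMEOUT", "30")),
        sitemap_url_limit=int(env.get("SITEMAP_URL_LIMIT", "100")),
        max_recursive_pages=int(env.get("MAX_RECURSIVE_PAGES", "10")),
        recursive_delay=float(env.get("RECURSIVE_DELAY", "1.0")),
    )
```

Passing the mapping as a parameter keeps the function easy to test: a plain dict can stand in for the process environment.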

Quick Start

# Install dependencies
pip install mcp httpx beautifulsoup4 html5lib python-dotenv

# Run the server
python server.py

Examples

# These calls assume `session` is an initialized MCP client session connected to the server.
# Scrape a full page
result = await session.call_tool("scrape", {"url": "https://example.com"})

# Scrape with custom selectors
result = await session.call_tool("scrape", {
    "url": "https://example.com",
    "selectors": {"title": "h1", "price": ".price"}
})

# Inspect page structure
result = await session.call_tool("inspect", {"url": "https://example.com"})

# Extract tables
result = await session.call_tool("tables", {
    "url": "https://example.com",
    "selector": "table"
})

# Scrape a list
result = await session.call_tool("scrape_list", {
    "url": "https://example.com/items",
    "item_selector": ".item",
    "fields": {"name": "h2", "price": ".price"}
})

# Scrape recursively
result = await session.call_tool("scrape_recursive", {
    "start_url": "https://example.com/blog",
    "link_selector": "a.post-link",
    "item_selector": "article",
    "fields": {"title": "h1", "content": "p"},
    "max_pages": 10
})

# Export to CSV
result = await session.call_tool("export", {
    "data": '[{"name": "Alice", "age": 30}]',
    "format": "csv"
})
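
To make the `export` call more concrete, here is a minimal sketch of what a sanitized CSV export with a size guard could look like. It is not the code in `exporters.py`. The README says only that CSV output is "sanitized" and that data over 10MB is rejected; reading "sanitized" as a guard against spreadsheet formula injection is an assumption, as are the names `sanitize_cell` and `export_csv` and the exact byte count used for the limit.

```python
import csv
import io
import json

# 10 MB limit named in the README; the exact byte count is an assumption.
MAX_EXPORT_BYTES = 10 * 1024 * 1024
# Leading characters that spreadsheet apps may interpret as a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def sanitize_cell(value):
    """Neutralize string cells that a spreadsheet could treat as formulas."""
    if value is None:
        return ""
    if isinstance(value, str) and value.startswith(FORMULA_PREFIXES):
        return "'" + value  # the apostrophe forces the cell to plain text
    return value


def export_csv(data: str) -> str:
    """Convert a JSON array of objects (as a string) into CSV text."""
    # Check the size before parsing, so oversized input is never decoded.
    if len(data.encode("utf-8")) > MAX_EXPORT_BYTES:
        raise ValueError("data exceeds the 10 MB export limit")
    rows = json.loads(data)
    if not rows:
        return ""
    # Union of all keys, kept in first-seen order.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({key: sanitize_cell(row.get(key)) for key in fieldnames})
    return buffer.getvalue()
```

Only string values are prefixed, so numeric cells such as a negative price are written unchanged.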

Project Structure

scrapemcp/
├── server.py              # MCP server entry point (tools)
├── scrapers/
│   ├── __init__.py
│   ├── base.py            # BaseScraper, ScrapeResult, SSRF validation
│   ├── page.py            # PageScraper (scrape, inspect)
│   ├── table.py           # TableScraper (tables)
│   ├── list_scraper.py    # ListScraper (scrape_list, scrape_recursive)
│   └── sitemap.py         # SitemapScraper (sitemap, scrape_sitemap)
├── exporters.py           # CSV, Markdown, JSON export
├── client.py              # Test client CLI
└── pyproject.toml

🔧 Recent Improvements

SSRF Bypass Fixed — DNS resolution added: URL-encoded private IPs are now properly blocked
Async HTTP — Replaced sync requests with httpx.AsyncClient (non-blocking, async-compatible)
State Race Fixed — Removed shared mutable _last_url/_last_soup (no more cross-call contamination)
.env Support — Reads HTTP_TIMEOUT, SITEMAP_URL_LIMIT, MAX_RECURSIVE_PAGES from environment
Sitemap Discovery — Tries robots.txt for Sitemap: before falling back to /sitemap.xml
Recursive Timeout — 120s wall-clock timeout prevents runaway crawling
Configurable Rate Limit — Recursive crawl delay via RECURSIVE_DELAY env var
Export Size Limit — Rejects data > 10MB before parsing
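
The recursive-crawl safeguards listed above (page cap, crawl delay, wall-clock timeout) can be combined in one loop. The sketch below is a hedged illustration rather than the project's implementation: `crawl` and the `fetch_links` callable are hypothetical, and the only value taken from this README is the 120-second timeout.

```python
import asyncio
import time
from collections import deque


async def crawl(start_url, fetch_links, max_pages=10, delay=0.0, timeout=120.0):
    """Breadth-first crawl bounded by a page count and a wall-clock deadline.

    `fetch_links` is a hypothetical async callable mapping a URL to the
    list of URLs it links to.
    """
    deadline = time.monotonic() + timeout
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wall-clock budget is used up
        url = queue.popleft()
        try:
            # Cap the in-flight request too, so one slow page cannot
            # push the crawl past the deadline.
            links = await asyncio.wait_for(fetch_links(url), timeout=remaining)
        except asyncio.TimeoutError:
            break
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
        if queue and delay > 0:
            await asyncio.sleep(delay)  # rate limit between requests
    return visited
```

Using `time.monotonic()` rather than wall time from `time.time()` keeps the deadline stable if the system clock is adjusted during the crawl.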

Version 1.0.0, released Jun 3, 2026

Download files

Source Distribution

scrapemcp-1.0.0.tar.gz (12.2 kB)

Uploaded Jun 3, 2026 Source

Built Distribution

scrapemcp-1.0.0-py3-none-any.whl (29.9 kB)

Uploaded Jun 3, 2026 Python 3

File details

Details for the file scrapemcp-1.0.0.tar.gz.

File metadata

Download URL: scrapemcp-1.0.0.tar.gz
Upload date: Jun 3, 2026
Size: 12.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for scrapemcp-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`bc4b11705332998d7c1030b61c7f9d203727f47446a96b9c5c14dfb7c9bd093a`
MD5	`7cef0d05287a07030d694d3b38e10ec4`
BLAKE2b-256	`a097413a4e6fcc65408e433ae057a5a5c2b8380622231ba034de37a39ab1b556`

File details

Details for the file scrapemcp-1.0.0-py3-none-any.whl.

File metadata

Download URL: scrapemcp-1.0.0-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 29.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for scrapemcp-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c5dc3cc979023aa6d0b2d7b1dc433b0493c17735fed7d0959cd282ef8330444a`
MD5	`1e98aa7800b778a7c99efbe432d62dcc`
BLAKE2b-256	`1c7f97e6c8a962af72ee73bcf6ffffc576393a0d259a5ad8d394ea84937c0eb4`
