Web scraping MCP server with SSRF protection
ScrapeMCP — Web Scraping MCP Server
MCP server for structured web data extraction. It scrapes pages, tables, lists, sitemaps, and more, and includes built-in SSRF protection.
Features
| Tool | Description |
|---|---|
| `scrape` | Extracts content from a URL using custom CSS selectors |
| `inspect` | Analyzes a page's structure (meta tags, headings, links, images, forms, scripts) |
| `tables` | Extracts all HTML tables from a page |
| `scrape_list` | Extracts a list of items with custom fields from CSS selectors |
| `scrape_recursive` | Follows linked pages recursively, extracting data |
| `sitemap` | Parses a website's sitemap.xml |
| `scrape_sitemap` | Scrapes every URL in a sitemap |
| `export` | Exports data to CSV, Markdown, or JSON |
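The `scrape_recursive` tool follows links from page to page. The sketch below shows one way such a crawl could be bounded, using the `max_pages` cap from the examples, the 120s wall-clock timeout, and the crawl delay mentioned under Recent Improvements. The function name `crawl`, its parameters, and the injected `fetch_links` callback are hypothetical and are not taken from the ScrapeMCP source.

```python
import asyncio
import time
from collections import deque
from collections.abc import Awaitable, Callable


async def crawl(
    start_url: str,
    fetch_links: Callable[[str], Awaitable[list[str]]],  # hypothetical callback
    max_pages: int = 10,
    timeout_s: float = 120.0,
    delay_s: float = 0.0,
) -> list[str]:
    """Breadth-first crawl bounded by a page cap and a wall-clock deadline."""
    deadline = time.monotonic() + timeout_s
    seen = {start_url}
    queue = deque([start_url])
    visited: list[str] = []
    # Stop when the queue is empty, the page cap is hit, or time runs out.
    while queue and len(visited) < max_pages and time.monotonic() < deadline:
        url = queue.popleft()
        visited.append(url)
        for link in await fetch_links(url):
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                queue.append(link)
        if delay_s:
            await asyncio.sleep(delay_s)  # polite pause between requests
    return visited
```

Passing the link fetcher in as a callback keeps the loop testable without network access; a real fetcher would download the page and apply the `link_selector`.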
SSRF Protection
The server automatically blocks access to:
- Private IPs (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`)
- Localhost/loopback (`127.0.0.0/8`, `::1`)
- Link-local (`169.254.0.0/16`, `fe80::/10`)
- Blocked hostnames: `localhost`, `metadata.google.internal`, `169.254.169.254`
- `.internal` and `.local` domains
Only the `http://` and `https://` schemes are allowed.
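The rules above can be sketched with the standard library alone. This is an illustration of the listed checks, not the project's actual validator: the names `is_url_allowed`, `resolve_host`, and the constants are hypothetical, and the injectable `resolver` parameter is an assumption made so the check can be exercised without real DNS lookups.

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Blocklists copied from the README; the names are hypothetical.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal", "169.254.169.254"}
BLOCKED_SUFFIXES = (".internal", ".local")
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # private ranges
        "127.0.0.0/8", "::1/128",                          # loopback
        "169.254.0.0/16", "fe80::/10",                     # link-local
    )
]


def resolve_host(host: str) -> list[str]:
    """Return every IP address the hostname resolves to."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_url_allowed(url: str, resolver=resolve_host) -> bool:
    """Return True only if the URL passes the scheme, hostname, and IP checks."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if not host or host in BLOCKED_HOSTNAMES or host.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        # The host may already be an IP literal.
        addresses = [ipaddress.ip_address(host)]
    except ValueError:
        try:
            # Otherwise resolve it and check every returned address.
            addresses = [ipaddress.ip_address(a) for a in resolver(host)]
        except (OSError, ValueError):
            return False  # unresolvable hosts are rejected
    return not any(addr in net for addr in addresses for net in BLOCKED_NETWORKS)
```

Checking the resolved addresses, rather than only the hostname string, is what catches a public-looking name that points at a private IP.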
Tech Stack
- Python: `>=3.11`
- Framework: `mcp` (FastMCP) via stdio JSON-RPC
- HTTP: `httpx` (async), which replaced sync `requests` for non-blocking I/O
- Parsing: `beautifulsoup4` + `html5lib`
- Export: CSV (sanitized), Markdown, JSON
- Config: `python-dotenv`, which reads the `.env` file for all settings
Quick Start
# Install dependencies
pip install mcp requests beautifulsoup4 html5lib httpx python-dotenv
# Run the server
python server.py
Examples
# Scrape a full page
result = await session.call_tool("scrape", {"url": "https://example.com"})
# Scrape with custom selectors
result = await session.call_tool("scrape", {
    "url": "https://example.com",
    "selectors": {"title": "h1", "price": ".price"}
})
# Inspect page structure
result = await session.call_tool("inspect", {"url": "https://example.com"})
# Extract tables
result = await session.call_tool("tables", {
    "url": "https://example.com",
    "selector": "table"
})
# Scrape a list
result = await session.call_tool("scrape_list", {
    "url": "https://example.com/items",
    "item_selector": ".item",
    "fields": {"name": "h2", "price": ".price"}
})
# Scrape recursively
result = await session.call_tool("scrape_recursive", {
    "start_url": "https://example.com/blog",
    "link_selector": "a.post-link",
    "item_selector": "article",
    "fields": {"title": "h1", "content": "p"},
    "max_pages": 10
})
# Export to CSV
result = await session.call_tool("export", {
    "data": '[{"name": "Alice", "age": 30}]',
    "format": "csv"
})
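The README describes the CSV export as "sanitized" and mentions a 10MB input limit, but does not say how either is implemented. One common reading of "sanitized" is neutralizing spreadsheet formula injection, which the sketch below assumes. The names `export_csv`, `sanitize_cell`, and the constants are hypothetical.

```python
import csv
import io
import json

MAX_EXPORT_BYTES = 10 * 1024 * 1024      # 10MB limit stated in the README
FORMULA_PREFIXES = ("=", "+", "-", "@")  # assumed set of risky leading characters


def sanitize_cell(value) -> str:
    """Convert a value to text and defuse cells that look like formulas."""
    text = "" if value is None else str(value)
    if text.startswith(FORMULA_PREFIXES):
        return "'" + text  # a leading quote makes spreadsheets treat it as text
    return text


def export_csv(data: str) -> str:
    """Turn a JSON array of objects into CSV text, checking size before parsing."""
    if len(data.encode("utf-8")) > MAX_EXPORT_BYTES:
        raise ValueError("export data exceeds the 10MB limit")
    rows = json.loads(data)
    if not rows:
        return ""
    fieldnames = list(rows[0].keys())  # column order follows the first row
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({key: sanitize_cell(row.get(key)) for key in fieldnames})
    return buffer.getvalue()
```

Note the trade-off in this approach: a legitimate negative number such as `-5` also gets the leading quote, so a stricter implementation might exempt values that parse as numbers.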
Project Structure
scrapemcp/
├── server.py # MCP server entry point (tools)
├── scrapers/
│ ├── __init__.py
│ ├── base.py # BaseScraper, ScrapeResult, SSRF validation
│ ├── page.py # PageScraper (scrape, inspect)
│ ├── table.py # TableScraper (tables)
│ ├── list_scraper.py # ListScraper (scrape_list, scrape_recursive)
│ └── sitemap.py # SitemapScraper (sitemap, scrape_sitemap)
├── exporters.py # CSV, Markdown, JSON export
├── client.py # Test client CLI
└── pyproject.toml
🔧 Recent Improvements
- SSRF Bypass Fixed: DNS resolution added, so URL-encoded private IPs are now properly blocked
- Async HTTP: Replaced sync `requests` with `httpx.AsyncClient` (non-blocking, async-compatible)
- State Race Fixed: Removed shared mutable `_last_url`/`_last_soup` (no more cross-call contamination)
- `.env` Support: Reads `HTTP_TIMEOUT`, `SITEMAP_URL_LIMIT`, `MAX_RECURSIVE_PAGES` from environment
- Sitemap Discovery: Tries `robots.txt` for `Sitemap:` before falling back to `/sitemap.xml`
- Recursive Timeout: 120s wall-clock timeout prevents runaway crawling
- Configurable Rate Limit: Recursive crawl delay via `RECURSIVE_DELAY` env var
- Export Size Limit: Rejects data > 10MB before parsing