Web scraping MCP server with SSRF protection

Project description

ScrapeMCP — Web Scraping MCP Server

MCP server for structured web data extraction. It scrapes pages, tables, lists, sitemaps, and more, and includes built-in SSRF protection.

Features

Tool              Description
scrape            Extracts content from a URL using custom CSS selectors
inspect           Analyzes a page's structure (meta tags, headings, links, images, forms, scripts)
tables            Extracts all HTML tables from a page
scrape_list       Extracts a list of items with custom fields defined by CSS selectors
scrape_recursive  Follows linked pages recursively, extracting data from each
sitemap           Parses a website's sitemap.xml
scrape_sitemap    Scrapes every URL listed in a sitemap
export            Exports data to CSV, Markdown, or JSON

SSRF Protection

The server automatically blocks access to:

  • Private IPs (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
  • Localhost/loopback (127.0.0.0/8, ::1)
  • Link-local (169.254.0.0/16, fe80::/10)
  • Blocked hostnames: localhost, metadata.google.internal, 169.254.169.254
  • .internal and .local domains

Only the http:// and https:// schemes are allowed.
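
The rules above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual validator: the names is_blocked_ip and is_url_allowed are hypothetical, and DNS resolution of hostnames is left out so the example stays deterministic. A fuller check would also resolve the hostname and test every resolved address against the same networks.

```python
import ipaddress
from urllib.parse import urlsplit

# Values taken from the list above; the constant names are assumptions.
BLOCKED_HOSTS = {"localhost", "metadata.google.internal", "169.254.169.254"}
BLOCKED_SUFFIXES = (".internal", ".local")
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # private
        "127.0.0.0/8", "::1/128",                          # loopback
        "169.254.0.0/16", "fe80::/10",                     # link-local
    )
]


def is_blocked_ip(host: str) -> bool:
    """Return True if host is an IP literal inside a blocked network."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal; hostname rules apply instead
    return any(ip in net for net in BLOCKED_NETWORKS)


def is_url_allowed(url: str) -> bool:
    """Apply the scheme, hostname, and IP-literal checks (no DNS lookup)."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL, e.g. an invalid bracketed host
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower().rstrip(".")
    if not host:
        return False
    if host in BLOCKED_HOSTS or host.endswith(BLOCKED_SUFFIXES):
        return False
    return not is_blocked_ip(host)
```

For example, is_url_allowed("http://10.1.2.3:8080/") is rejected by the private-range check, while is_url_allowed("https://example.com/page") passes.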

Tech Stack

  • Python>=3.11
  • Framework: mcp (FastMCP) via stdio JSON-RPC
  • HTTP: httpx (async) — replaced sync requests for non-blocking I/O
  • Parsing: beautifulsoup4 + html5lib
  • Export: CSV (sanitized), Markdown, JSON
  • Config: python-dotenv — reads .env file for all settings

Quick Start

# Install dependencies
pip install mcp httpx beautifulsoup4 html5lib python-dotenv

# Run the server
python server.py

Examples

# Scrape a full page
result = await session.call_tool("scrape", {"url": "https://example.com"})

# Scrape with custom selectors
result = await session.call_tool("scrape", {
    "url": "https://example.com",
    "selectors": {"title": "h1", "price": ".price"}
})

# Inspect page structure
result = await session.call_tool("inspect", {"url": "https://example.com"})

# Extract tables
result = await session.call_tool("tables", {
    "url": "https://example.com",
    "selector": "table"
})

# Scrape a list
result = await session.call_tool("scrape_list", {
    "url": "https://example.com/items",
    "item_selector": ".item",
    "fields": {"name": "h2", "price": ".price"}
})

# Scrape recursively
result = await session.call_tool("scrape_recursive", {
    "start_url": "https://example.com/blog",
    "link_selector": "a.post-link",
    "item_selector": "article",
    "fields": {"title": "h1", "content": "p"},
    "max_pages": 10
})

# Export to CSV
result = await session.call_tool("export", {
    "data": '[{"name": "Alice", "age": 30}]',
    "format": "csv"
})
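
The examples above assume an already initialized session. One way to obtain it, sketched here with the MCP Python SDK's stdio client, is to launch server.py as a subprocess and open a ClientSession over its pipes. The function name scrape_once and the "python" launch command are assumptions; adjust them to your environment.

```python
import asyncio


async def scrape_once(url: str, server_script: str = "server.py"):
    """Start the server over stdio, call the scrape tool once, return the result."""
    # Imported here so the module loads even where the SDK is not installed.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    # Assumption: the server is started with `python server.py`.
    params = StdioServerParameters(command="python", args=[server_script])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()  # MCP handshake before any tool call
            return await session.call_tool("scrape", {"url": url})


# To try it: asyncio.run(scrape_once("https://example.com"))
```

Any of the other tool calls shown above can be swapped in where call_tool("scrape", ...) appears, as long as they run inside the same ClientSession block.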

Project Structure

scrapemcp/
├── server.py              # MCP server entry point (tools)
├── scrapers/
│   ├── __init__.py
│   ├── base.py            # BaseScraper, ScrapeResult, SSRF validation
│   ├── page.py            # PageScraper (scrape, inspect)
│   ├── table.py           # TableScraper (tables)
│   ├── list_scraper.py    # ListScraper (scrape_list, scrape_recursive)
│   └── sitemap.py         # SitemapScraper (sitemap, scrape_sitemap)
├── exporters.py           # CSV, Markdown, JSON export
├── client.py              # Test client CLI
└── pyproject.toml

🔧 Recent Improvements

  • SSRF Bypass Fixed — DNS resolution added: URL-encoded private IPs are now properly blocked
  • Async HTTP — Replaced sync requests with httpx.AsyncClient (non-blocking, async-compatible)
  • State Race Fixed — Removed shared mutable _last_url/_last_soup (no more cross-call contamination)
  • .env Support — Reads HTTP_TIMEOUT, SITEMAP_URL_LIMIT, MAX_RECURSIVE_PAGES from environment
  • Sitemap Discovery — Tries robots.txt for Sitemap: before falling back to /sitemap.xml
  • Recursive Timeout — 120s wall-clock timeout prevents runaway crawling
  • Configurable Rate Limit — Recursive crawl delay via RECURSIVE_DELAY env var
  • Export Size Limit — Rejects data > 10MB before parsing
