
Web Scraping MCP Server

A Model Context Protocol (MCP) server that provides utility tools for web scraping using Crawl4AI as the core engine. Supports multiple extraction modes including markdown generation, HTML processing, structured data extraction via CSS selectors, and AI-powered semantic analysis.

Features

  • Multiple Extraction Modes: Support for markdown, HTML, schema-based, and LLM-powered extraction
  • Advanced Browser Automation: Connects to a browser via CDP for full control, with proxy support and stealth features
  • Structured Output: Returns structured JSON data for programmatic consumption
  • Session Management: Persistent sessions for multi-step workflows
  • Comprehensive Error Handling: Detailed, structured error information for every failure
  • Type Safety: Full type hints and Pydantic validation
  • Cost-Aware LLM Usage: Budget controls and response caching to keep LLM costs down

Supported Extraction Modes

1. Markdown Extraction (markdown)

  • Use Case: Clean content extraction for analysis and documentation
  • Output: Clean markdown with citations and structured formatting
  • Best For: Articles, blog posts, documentation

2. HTML Extraction (html)

  • Variants: Clean HTML or raw HTML
  • Use Case: HTML processing and structure analysis
  • Best For: Template extraction, HTML analysis
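
For example, a tool-call payload requesting cleaned HTML (using the htmlVariant field documented under web_scrape below) might look like:

{
  "urls": ["https://example.com/page"],
  "extraction": {
    "mode": "html",
    "htmlVariant": "clean"
  }
}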

3. Schema-Based Extraction (schema)

  • Use Case: Structured data extraction using CSS selectors
  • Output: JSON array of extracted objects
  • Best For: Product catalogs, data tables, listings

4. LLM-Powered Extraction (llm)

  • Use Case: AI-powered semantic extraction and analysis
  • Output: Structured JSON or text based on instructions
  • Best For: Content analysis, sentiment extraction, complex reasoning

Quick Start

Installation

# Install via pip
pip install web-scraping-mcp-server

# Or use with uvx (no installation required)
uvx web-scraping-mcp-server

Bootstrap Setup

The server includes automatic bootstrap functionality to ensure crawl4ai and Playwright are properly configured:

# Manual bootstrap (optional - server does this automatically)
web-scraping-mcp-bootstrap

# Skip auto-setup if needed
export WSMCP_SKIP_AUTO_SETUP=1
web-scraping-mcp-server

The bootstrap process:

  1. Checks if crawl4ai is installed
  2. Runs crawl4ai-setup to install Playwright browsers
  3. Verifies setup with crawl4ai-doctor
  4. Falls back to manual Playwright installation if needed

This ensures the server is ready to use immediately after installation.
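
The following is a minimal sketch of that fallback logic in Python, for illustration only (the package's actual bootstrap code may differ):

import shutil
import subprocess
import sys

def bootstrap() -> None:
    # 1. Check that crawl4ai is importable; install it if missing
    try:
        import crawl4ai  # noqa: F401
    except ImportError:
        subprocess.run([sys.executable, "-m", "pip", "install", "crawl4ai"], check=True)

    # 2. Run crawl4ai-setup to install Playwright browsers
    if shutil.which("crawl4ai-setup"):
        subprocess.run(["crawl4ai-setup"], check=False)

    # 3. Verify the environment with crawl4ai-doctor
    healthy = subprocess.run(["crawl4ai-doctor"], check=False).returncode == 0

    # 4. Fall back to installing a Playwright browser directly if needed
    if not healthy:
        subprocess.run([sys.executable, "-m", "playwright", "install", "chromium"], check=False)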

MCP Client Configuration

Claude Desktop

{
  "mcpServers": {
    "web-scraping": {
      "command": "uvx",
      "args": ["web-scraping-mcp-server"],
      "env": {
        "LITELLM_API_KEY": "your-litellm-api-key",
        "LITELLM_API_BASE": "https://api.openai.com/v1"
      }
    }
  }
}

Python MCP Client

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def scraping_example():
    server_params = StdioServerParameters(
        command="uvx",
        args=["web-scraping-mcp-server"],
        env={
            "LITELLM_API_KEY": "your-litellm-api-key",
            "LITELLM_API_BASE": "https://api.openai.com/v1"
        }
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools = await session.list_tools()
            print(f"Available tools: {[tool.name for tool in tools.tools]}")

            # Perform web scraping
            result = await session.call_tool("web_scrape", {
                "urls": ["https://example.com/article"],
                "extraction": {
                    "mode": "markdown"
                }
            })

            print(f"Scraped content: {result.content}")

asyncio.run(scraping_example())

Available Tools

web_scrape

Performs web scraping using Crawl4AI with configurable extraction modes and browser settings.

Parameters:

  • urls (array, required): List of URLs to scrape (1-100 URLs)
  • extraction (object, required): Extraction configuration
    • mode (string): "markdown", "html", "schema", or "llm"
    • htmlVariant (string, optional): "clean" or "raw" (for HTML mode)
    • schema (object, optional): Schema configuration (for schema mode)
    • llm (object, optional): LLM configuration (for LLM mode)
  • browser (object, optional): Browser configuration
  • page (object, optional): Page interaction settings
  • retry (object, optional): Retry configuration

Example:

{
  "urls": ["https://example.com/products"],
  "extraction": {
    "mode": "schema",
    "schema": {
      "baseSelector": ".product-card",
      "fields": [
        {"name": "title", "selector": "h3", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
      ]
    }
  },
  "browser": {
    "headless": true,
    "userAgentMode": "random"
  }
}

batch_scrape

Performs batch web scraping with parallel processing for multiple URLs.

Parameters:

  • Accepts all web_scrape parameters, plus:
  • concurrency (integer, optional): Maximum concurrent requests (default: 3)
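
Example (an illustrative payload; the extraction block takes the same shape as in web_scrape):

{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "extraction": {
    "mode": "markdown"
  },
  "concurrency": 5
}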

scrape_with_session

Performs web scraping using persistent browser sessions for multi-step workflows.

Parameters:

  • Accepts all web_scrape parameters, plus:
  • sessionId (string, required): Unique session identifier
  • jsCode (array, optional): JavaScript code to execute
  • waitFor (string, optional): Wait condition before extraction

Configuration Examples

Basic Markdown Extraction

{
  "urls": ["https://example.com/article"],
  "extraction": {
    "mode": "markdown"
  }
}

Structured Data Extraction

{
  "urls": ["https://shop.example.com/products"],
  "extraction": {
    "mode": "schema",
    "schema": {
      "baseSelector": ".product-item",
      "fields": [
        {"name": "name", "selector": ".product-name", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"}
      ]
    }
  },
  "browser": {
    "headless": true,
    "textMode": true
  }
}

AI-Powered Content Analysis

{
  "urls": ["https://news.example.com/article"],
  "extraction": {
    "mode": "llm",
    "llm": {
      "instruction": "Extract the article title, author, key topics, and sentiment. Return as structured JSON.",
      "responseFormat": "json",
      "model": "openai/gpt-4o-mini",
      "temperature": 0.1
    }
  }
}

Multi-Step Session Workflow

{
  "urls": ["https://example.com/login"],
  "extraction": {
    "mode": "markdown"
  },
  "sessionId": "workflow_session",
  "jsCode": [
    "document.querySelector('#username').value = 'user';",
    "document.querySelector('#password').value = 'pass';",
    "document.querySelector('#login').click();"
  ],
  "waitFor": "css:.dashboard"
}

Response Structure

All scraping tools return structured results:

Successful Response

{
  "success": true,
  "results": [
    {
      "url": "https://example.com",
      "success": true,
      "content": "Extracted content based on extraction mode",
      "metadata": {
        "title": "Page Title",
        "status_code": 200,
        "final_url": "https://example.com",
        "links": {
          "internal": ["https://example.com/page1"],
          "external": ["https://external.com"]
        },
        "media": {
          "images": 5,
          "videos": 1,
          "audio": 0
        }
      }
    }
  ],
  "summary": {
    "total_urls": 1,
    "successful": 1,
    "failed": 0,
    "extraction_mode": "markdown"
  }
}

Content by Extraction Mode

  • Markdown: Clean markdown string with citations
  • HTML: Raw or cleaned HTML string
  • Schema: Array of structured objects matching the schema
  • LLM: Structured JSON or text based on response format
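
In a Python MCP client, this envelope arrives as the tool result's text content. A minimal parsing sketch, assuming the result is a single text content block containing the JSON shown above:

import json

def parse_scrape_result(result):
    # Tool results carry content blocks; take the first text block
    payload = json.loads(result.content[0].text)
    if not payload["success"]:
        raise RuntimeError(payload.get("error", "scrape failed"))
    for item in payload["results"]:
        status = "ok" if item["success"] else "failed"
        print(f"{item['url']}: {status}")
    return payload["results"]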

Environment Variables

Required for LLM Extraction

# LiteLLM Configuration (supports all major LLM providers)
export LITELLM_API_KEY="your-api-key"        # Required for LLM extraction
export LITELLM_API_BASE="https://api.openai.com/v1"  # Optional, provider base URL

# Examples for different providers:
# OpenAI: LITELLM_API_KEY="sk-..." LITELLM_API_BASE="https://api.openai.com/v1"
# Anthropic: LITELLM_API_KEY="sk-ant-..." LITELLM_API_BASE="https://api.anthropic.com"
# Groq: LITELLM_API_KEY="gsk_..." LITELLM_API_BASE="https://api.groq.com/openai/v1"

Optional Configuration

# Server configuration
export WEB_SCRAPING_SERVER_NAME="web-scraping-mcp"
export WEB_SCRAPING_DEFAULT_TIMEOUT="60000"
export WEB_SCRAPING_MAX_CONCURRENT="5"

# Browser configuration
export WEB_SCRAPING_CDP_URL="http://localhost:9222" # Optional: CDP URL for managed browsers
export WEB_SCRAPING_DEFAULT_HEADLESS="true"

# Logging
export LOG_LEVEL="INFO"

Error Handling

The server provides comprehensive error handling:

  • Configuration Errors: Invalid parameters or missing required fields
  • Network Errors: Connection failures, timeouts, DNS issues
  • Browser Errors: Page load failures, JavaScript errors
  • Extraction Errors: CSS selector failures, LLM API errors
  • Validation Errors: Invalid URLs or malformed requests

Error responses include detailed information for debugging:

{
  "success": false,
  "error": "Page load timeout after 60000ms",
  "error_type": "TimeoutError",
  "url": "https://example.com/slow-page",
  "retry_count": 2
}
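
A client-side handling sketch based on that shape (field names as shown above; the retry policy here is purely illustrative):

import json

def handle_response(raw: str):
    data = json.loads(raw)
    if data.get("success"):
        return data
    # Branch on the reported error_type
    if data.get("error_type") == "TimeoutError":
        print(f"Timed out on {data['url']} after {data.get('retry_count', 0)} retries")
    else:
        print(f"Scrape failed: {data.get('error')} ({data.get('error_type')})")
    return None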

Performance Considerations

Optimization Strategies

  • Selective Extraction: Use CSS selectors to extract only needed content
  • Browser Configuration: Use headless mode and disable images for faster crawling
  • Concurrent Processing: Use batch_scrape for multiple URLs
  • Session Reuse: Use scrape_with_session for multi-step workflows
  • LLM Cost Control: Use schema extraction when possible, LLM only for semantic analysis

Resource Management

  • Memory Usage: Automatic cleanup of browser resources
  • Network Bandwidth: Configurable image loading and content filtering
  • Rate Limiting: Built-in delays and retry logic
  • Connection Pooling: Efficient browser session management

Security and Ethics

Respectful Scraping

  • Rate Limiting: Built-in delays between requests to avoid overloading target sites
  • User Agent: Configurable user agent identification
  • Robots.txt: Respect site scraping policies (compliance is manual; the server does not enforce robots.txt)
  • Terms of Service: Comply with website terms of use

Security Features

  • SSL Verification: Certificate verification enabled by default
  • URL Validation: Input validation and sanitization
  • Timeout Protection: Prevents hanging requests
  • Resource Limits: Configurable memory and connection limits

Development

Setup

git clone https://github.com/realtimex/web-scraping-mcp-server
cd web-scraping-mcp-server
pip install -e ".[dev]"

Testing

pytest                    # Run all tests
pytest --cov            # Run with coverage
pytest -m integration   # Run integration tests

Debug Mode

LITELLM_API_KEY=your-key \
LITELLM_API_BASE=https://api.openai.com/v1 \
LOG_LEVEL=DEBUG \
web-scraping-mcp-server

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please read our contributing guidelines and submit pull requests to our GitHub repository.

Support

For support, please open an issue on our GitHub repository or contact support@realtimex.com.
