Skip to main content

MCP (Model Context Protocol) server for Crawl4AI - Universal web crawling and data extraction

Project description

๐Ÿ•ท๏ธ Crawl4AI MCP Server

PyPI version Python License: MIT Downloads

MCP (Model Context Protocol) server for Crawl4AI - Universal web crawling and data extraction for AI agents.

Integrate powerful web scraping capabilities into Claude, ChatGPT, and any MCP-compatible AI assistant.

๐Ÿ“‘ Table of Contents

๐ŸŽฏ Why This Tool?

The Problem

  • ๐Ÿ”ด No MCP servers for web scraping - AI agents can't access web content
  • ๐Ÿ”ด Complex scraping setup - Crawl4AI requires custom integration
  • ๐Ÿ”ด Limited protocol support - Most tools only support one transport
  • ๐Ÿ”ด Poor AI integration - Existing scrapers aren't optimized for LLMs

Our Solution

  • โœ… First Crawl4AI MCP server - Native MCP integration
  • โœ… All MCP transports - STDIO, SSE, and HTTP support
  • โœ… AI-optimized extraction - Clean markdown, structured data
  • โœ… One-line execution - crawl4ai-mcp --stdio
  • โœ… Production ready - Type hints, tests, Docker support

โšก Quick Start

One-Line Execution

# Install and run
pip install crawl4ai-mcp
crawl4ai-mcp --stdio

With Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "crawl4ai": {
      "command": "crawl4ai-mcp",
      "args": ["--stdio"]
    }
  }
}

With Any MCP Client

# STDIO mode (for CLI tools)
crawl4ai-mcp --stdio

# SSE mode (for web clients)  
crawl4ai-mcp --sse

# HTTP mode (for REST API)
crawl4ai-mcp --http

๐Ÿš€ Features

Core Capabilities

  • ๐ŸŒ Universal Web Scraping - Extract content from any website
  • ๐Ÿ“ Markdown Conversion - Clean, formatted markdown output
  • ๐Ÿ“ธ Screenshots - Capture visual content
  • ๐Ÿ“„ PDF Generation - Save pages as PDF
  • ๐ŸŽญ JavaScript Execution - Interact with dynamic content
  • ๐Ÿ”„ Multiple Transports - STDIO, SSE, HTTP protocols

Why Choose crawl4ai-mcp?

Feature crawl4ai-mcp Other Tools
MCP Protocol Support โœ… Full โŒ None
Crawl4AI Integration โœ… Native โŒ Manual
Transport Protocols โœ… All 3 โš ๏ธ Usually 1
AI Optimization โœ… Built-in โŒ Generic
Production Ready โœ… Yes โš ๏ธ Varies
Docker Support โœ… Yes โš ๏ธ Limited

๐Ÿ“ฆ Installation

From PyPI

pip install crawl4ai-mcp

From Source

git clone https://github.com/stgmt/crawl4ai-mcp.git
cd crawl4ai-mcp
pip install -e .

With Docker

docker pull stgmt/crawl4ai-mcp
docker run -p 3000:3000 stgmt/crawl4ai-mcp

๐Ÿ”ง Usage

Basic Command Line

# Default HTTP mode
crawl4ai-mcp

# STDIO mode for CLI integration
crawl4ai-mcp --stdio

# SSE mode for real-time streaming
crawl4ai-mcp --sse

# HTTP mode explicitly
crawl4ai-mcp --http

Python Integration

import asyncio
from crawl4ai_mcp import Crawl4AIMCPServer

async def main():
    server = Crawl4AIMCPServer()
    
    # Run in STDIO mode
    await server.run_stdio()
    
    # Or run in HTTP mode
    # server.run_http(host="0.0.0.0", port=3000)

asyncio.run(main())

With MCP Clients

Using mcp-client SDK

from mcp import ClientSession, StdioServerParameters
import asyncio

async def main():
    server_params = StdioServerParameters(
        command="crawl4ai-mcp",
        args=["--stdio"]
    )
    
    async with ClientSession(server_params) as session:
        # List available tools
        tools = await session.list_tools()
        
        # Crawl a webpage
        result = await session.call_tool(
            "crawl",
            {"url": "https://example.com"}
        )
        print(result)

asyncio.run(main())

๐Ÿ› ๏ธ Available Tools

1. crawl - Full Web Crawling

Extract complete content from any webpage.

{
  "name": "crawl",
  "arguments": {
    "url": "https://example.com",
    "wait_for": "css:.content",
    "timeout": 30000
  }
}

2. md - Markdown Extraction

Get clean markdown content from webpages.

{
  "name": "md", 
  "arguments": {
    "url": "https://docs.example.com",
    "clean": true
  }
}

3. html - Raw HTML

Retrieve raw HTML content.

{
  "name": "html",
  "arguments": {
    "url": "https://example.com"
  }
}

4. screenshot - Visual Capture

Take screenshots of webpages.

{
  "name": "screenshot",
  "arguments": {
    "url": "https://example.com",
    "full_page": true
  }
}

5. pdf - PDF Generation

Convert webpages to PDF.

{
  "name": "pdf",
  "arguments": {
    "url": "https://example.com",
    "format": "A4"
  }
}

6. execute_js - JavaScript Execution

Execute JavaScript on webpages.

{
  "name": "execute_js",
  "arguments": {
    "url": "https://example.com",
    "script": "document.title"
  }
}

๐ŸŒ Transport Protocols

STDIO Transport

Best for command-line tools and local development.

crawl4ai-mcp --stdio

Use cases:

  • Claude Desktop app
  • Terminal-based MCP clients
  • Local development and testing
  • CI/CD pipelines

SSE Transport (Server-Sent Events)

Ideal for real-time web applications.

crawl4ai-mcp --sse

Use cases:

  • Web-based MCP clients
  • Real-time streaming applications
  • Browser extensions
  • Progressive web apps

HTTP Transport

Standard REST API for maximum compatibility.

crawl4ai-mcp --http

Use cases:

  • REST API clients
  • Microservice architectures
  • Cloud deployments
  • Load-balanced environments

โš™๏ธ Configuration

Environment Variables

# Crawl4AI endpoint (if using remote instance)
CRAWL4AI_ENDPOINT=http://localhost:8000

# Server ports
HTTP_PORT=3000
SSE_PORT=3001

# Logging
LOG_LEVEL=INFO

# Performance
MAX_CONCURRENT_REQUESTS=10
REQUEST_TIMEOUT=30

Configuration File

Create .env file:

CRAWL4AI_ENDPOINT=http://your-crawl4ai-instance.com
HTTP_PORT=3000
SSE_PORT=3001
LOG_LEVEL=DEBUG
DEBUG=true

Advanced Settings

# config/settings.py
from pydantic import BaseSettings

class Settings(BaseSettings):
    crawl4ai_endpoint: str = "http://localhost:8000"
    http_port: int = 3000
    sse_port: int = 3001
    log_level: str = "INFO"
    debug: bool = False
    max_concurrent_requests: int = 10
    request_timeout: int = 30
    
    class Config:
        env_file = ".env"

settings = Settings()

๐Ÿณ Docker Support

Quick Start with Docker

# Run with default settings
docker run -p 3000:3000 stgmt/crawl4ai-mcp

# With environment variables
docker run -p 3000:3000 \
  -e CRAWL4AI_ENDPOINT=http://crawl4ai:8000 \
  -e LOG_LEVEL=DEBUG \
  stgmt/crawl4ai-mcp

# With Docker Compose
docker-compose up

Docker Compose

version: '3.8'

services:
  crawl4ai-mcp:
    image: stgmt/crawl4ai-mcp
    ports:
      - "3000:3000"
      - "3001:3001"
    environment:
      - CRAWL4AI_ENDPOINT=http://crawl4ai:8000
      - LOG_LEVEL=INFO
    depends_on:
      - crawl4ai
      
  crawl4ai:
    image: crawl4ai/crawl4ai
    ports:
      - "8000:8000"

Building Custom Image

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

RUN pip install -e .

EXPOSE 3000 3001

CMD ["crawl4ai-mcp", "--http"]

๐Ÿงช Testing

Running Tests

# Install test dependencies
pip install -e .[test]

# Run all tests
pytest

# Run with coverage
pytest --cov=crawl4ai_mcp

# Run specific test
pytest tests/test_server.py::test_crawl_tool

Testing with MCP Server Tester

This MCP server can be comprehensively tested using the MCP Server Tester that supports all three transport protocols (STDIO, SSE, HTTP).

Install MCP Server Tester

# Option 1: Using Docker (recommended)
docker run -it stgmt/mcp-server-tester test --help

# Option 2: Using NPM
npm install -g mcp-server-tester-sse-http-stdio
mcp-server-tester test --help

# Option 3: Using Python
pip install mcp-server-tester-sse-http-stdio
mcp-server-tester test --help

Test Examples

Test STDIO mode:

# Docker
docker run -it stgmt/mcp-server-tester test \
  --transport stdio \
  --command "crawl4ai-mcp --stdio"

# NPM/Python
mcp-server-tester test \
  --transport stdio \
  --command "crawl4ai-mcp --stdio"

Test HTTP mode:

# Start the server first
crawl4ai-mcp --http &

# Run tests
mcp-server-tester test \
  --transport http \
  --url http://localhost:3000

Test SSE mode:

# Start the server first
crawl4ai-mcp --sse &

# Run tests
mcp-server-tester test \
  --transport sse \
  --url http://localhost:3001

Test with configuration file:

Create test-config.yaml:

name: Crawl4AI Comprehensive Tests
transport: stdio
command: crawl4ai-mcp --stdio
tests:
  - name: Test markdown extraction
    tool: md
    arguments:
      url: https://example.com
      f: fit
    assert:
      - type: contains
        value: "Example Domain"
  
  - name: Test screenshot
    tool: screenshot
    arguments:
      url: https://example.com
      screenshot_wait_for: 2
    assert:
      - type: exists
        path: result
  
  - name: Test HTML extraction
    tool: html
    arguments:
      url: https://example.com
    assert:
      - type: contains
        value: "<html"
  
  - name: Test JavaScript execution
    tool: execute_js
    arguments:
      url: https://example.com
      scripts: ["document.title"]
    assert:
      - type: contains
        value: "Example"

Run the test:

mcp-server-tester test -f test-config.yaml

Interactive Testing

The tester also provides an interactive mode for manual testing:

# Interactive STDIO mode
mcp-server-tester interactive \
  --transport stdio \
  --command "crawl4ai-mcp --stdio"

# Interactive HTTP mode
mcp-server-tester interactive \
  --transport http \
  --url http://localhost:3000

๐Ÿ“š Examples

Example 1: Extract Documentation

# Extract markdown from documentation
result = await session.call_tool("md", {
    "url": "https://docs.python.org/3/",
    "clean": True
})

Example 2: Monitor Price Changes

# Screenshot for visual comparison
screenshot = await session.call_tool("screenshot", {
    "url": "https://store.example.com/product",
    "full_page": False
})

# Extract price via JavaScript
price = await session.call_tool("execute_js", {
    "url": "https://store.example.com/product",
    "script": "document.querySelector('.price').innerText"
})

Example 3: Generate Reports

# Generate PDF report
pdf = await session.call_tool("pdf", {
    "url": "https://analytics.example.com/report",
    "format": "A4",
    "landscape": True
})

๐Ÿค Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Development Setup

# Clone the repository
git clone https://github.com/stgmt/crawl4ai-mcp.git
cd crawl4ai-mcp

# Install in development mode
pip install -e .[dev]

# Run tests
pytest

# Format code
black crawl4ai_mcp tests
ruff check --fix crawl4ai_mcp tests

๐Ÿ“„ License

MIT License - see LICENSE file for details.

๐Ÿ”— Links

๐Ÿ™ Acknowledgments

  • Crawl4AI - The powerful crawling engine
  • MCP - Model Context Protocol specification
  • Anthropic - For creating the MCP standard

Made with โค๏ธ for the AI community

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

crawl4ai_mcp_sse_stdio-1.0.4.tar.gz (17.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

crawl4ai_mcp_sse_stdio-1.0.4-py3-none-any.whl (12.2 kB view details)

Uploaded Python 3

File details

Details for the file crawl4ai_mcp_sse_stdio-1.0.4.tar.gz.

File metadata

  • Download URL: crawl4ai_mcp_sse_stdio-1.0.4.tar.gz
  • Upload date:
  • Size: 17.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for crawl4ai_mcp_sse_stdio-1.0.4.tar.gz
Algorithm Hash digest
SHA256 0a333b7079f0f5d9aed6a015a7277a1d531ad1bd71b1b681c3afc51c32f661a3
MD5 2c022a449607708e1333c9d26c609bb0
BLAKE2b-256 1e650e9f296f629c1ad66903441896280924f3829bd01d800daad216ca08e848

See more details on using hashes here.

File details

Details for the file crawl4ai_mcp_sse_stdio-1.0.4-py3-none-any.whl.

File metadata

File hashes

Hashes for crawl4ai_mcp_sse_stdio-1.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 f7aff9c86eb91d63619fe7de7fcc73672383d6025889e3686a7535adab1778e3
MD5 ab0a93ec92caec2a8207bc4d75ef2192
BLAKE2b-256 2233d6171705a09c469a96cee9bc5220b6f354f142d52aa7c0aa0a72076a401a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page