MCP (Model Context Protocol) server for Crawl4AI - Universal web crawling and data extraction
Project description
๐ท๏ธ Crawl4AI MCP Server
MCP (Model Context Protocol) server for Crawl4AI - Universal web crawling and data extraction for AI agents.
Integrate powerful web scraping capabilities into Claude, ChatGPT, and any MCP-compatible AI assistant.
๐ Table of Contents
- ๐ฏ Why This Tool?
- โก Quick Start
- ๐ Features
- ๐ฆ Installation
- ๐ง Usage
- ๐ ๏ธ Available Tools
- ๐ Transport Protocols
- โ๏ธ Configuration
- ๐ณ Docker Support
- ๐ค Contributing
- ๐ License
๐ฏ Why This Tool?
The Problem
- ๐ด No MCP servers for web scraping - AI agents can't access web content
- ๐ด Complex scraping setup - Crawl4AI requires custom integration
- ๐ด Limited protocol support - Most tools only support one transport
- ๐ด Poor AI integration - Existing scrapers aren't optimized for LLMs
Our Solution
- โ First Crawl4AI MCP server - Native MCP integration
- โ All MCP transports - STDIO, SSE, and HTTP support
- โ AI-optimized extraction - Clean markdown, structured data
- โ
One-line execution -
crawl4ai-mcp --stdio - โ Production ready - Type hints, tests, Docker support
โก Quick Start
One-Line Execution
# Install and run
pip install crawl4ai-mcp
crawl4ai-mcp --stdio
With Claude Desktop
Add to your claude_desktop_config.json:
{
"mcpServers": {
"crawl4ai": {
"command": "crawl4ai-mcp",
"args": ["--stdio"]
}
}
}
With Any MCP Client
# STDIO mode (for CLI tools)
crawl4ai-mcp --stdio
# SSE mode (for web clients)
crawl4ai-mcp --sse
# HTTP mode (for REST API)
crawl4ai-mcp --http
๐ Features
Core Capabilities
- ๐ Universal Web Scraping - Extract content from any website
- ๐ Markdown Conversion - Clean, formatted markdown output
- ๐ธ Screenshots - Capture visual content
- ๐ PDF Generation - Save pages as PDF
- ๐ญ JavaScript Execution - Interact with dynamic content
- ๐ Multiple Transports - STDIO, SSE, HTTP protocols
Why Choose crawl4ai-mcp?
| Feature | crawl4ai-mcp | Other Tools |
|---|---|---|
| MCP Protocol Support | โ Full | โ None |
| Crawl4AI Integration | โ Native | โ Manual |
| Transport Protocols | โ All 3 | โ ๏ธ Usually 1 |
| AI Optimization | โ Built-in | โ Generic |
| Production Ready | โ Yes | โ ๏ธ Varies |
| Docker Support | โ Yes | โ ๏ธ Limited |
๐ฆ Installation
From PyPI
pip install crawl4ai-mcp
From Source
git clone https://github.com/stgmt/crawl4ai-mcp.git
cd crawl4ai-mcp
pip install -e .
With Docker
docker pull stgmt/crawl4ai-mcp
docker run -p 3000:3000 stgmt/crawl4ai-mcp
๐ง Usage
Basic Command Line
# Default HTTP mode
crawl4ai-mcp
# STDIO mode for CLI integration
crawl4ai-mcp --stdio
# SSE mode for real-time streaming
crawl4ai-mcp --sse
# HTTP mode explicitly
crawl4ai-mcp --http
Python Integration
import asyncio
from crawl4ai_mcp import Crawl4AIMCPServer
async def main():
server = Crawl4AIMCPServer()
# Run in STDIO mode
await server.run_stdio()
# Or run in HTTP mode
# server.run_http(host="0.0.0.0", port=3000)
asyncio.run(main())
With MCP Clients
Using mcp-client SDK
from mcp import ClientSession, StdioServerParameters
import asyncio
async def main():
server_params = StdioServerParameters(
command="crawl4ai-mcp",
args=["--stdio"]
)
async with ClientSession(server_params) as session:
# List available tools
tools = await session.list_tools()
# Crawl a webpage
result = await session.call_tool(
"crawl",
{"url": "https://example.com"}
)
print(result)
asyncio.run(main())
๐ ๏ธ Available Tools
1. crawl - Full Web Crawling
Extract complete content from any webpage.
{
"name": "crawl",
"arguments": {
"url": "https://example.com",
"wait_for": "css:.content",
"timeout": 30000
}
}
2. md - Markdown Extraction
Get clean markdown content from webpages.
{
"name": "md",
"arguments": {
"url": "https://docs.example.com",
"clean": true
}
}
3. html - Raw HTML
Retrieve raw HTML content.
{
"name": "html",
"arguments": {
"url": "https://example.com"
}
}
4. screenshot - Visual Capture
Take screenshots of webpages.
{
"name": "screenshot",
"arguments": {
"url": "https://example.com",
"full_page": true
}
}
5. pdf - PDF Generation
Convert webpages to PDF.
{
"name": "pdf",
"arguments": {
"url": "https://example.com",
"format": "A4"
}
}
6. execute_js - JavaScript Execution
Execute JavaScript on webpages.
{
"name": "execute_js",
"arguments": {
"url": "https://example.com",
"script": "document.title"
}
}
๐ Transport Protocols
STDIO Transport
Best for command-line tools and local development.
crawl4ai-mcp --stdio
Use cases:
- Claude Desktop app
- Terminal-based MCP clients
- Local development and testing
- CI/CD pipelines
SSE Transport (Server-Sent Events)
Ideal for real-time web applications.
crawl4ai-mcp --sse
Use cases:
- Web-based MCP clients
- Real-time streaming applications
- Browser extensions
- Progressive web apps
HTTP Transport
Standard REST API for maximum compatibility.
crawl4ai-mcp --http
Use cases:
- REST API clients
- Microservice architectures
- Cloud deployments
- Load-balanced environments
โ๏ธ Configuration
Environment Variables
# Crawl4AI endpoint (if using remote instance)
CRAWL4AI_ENDPOINT=http://localhost:8000
# Server ports
HTTP_PORT=3000
SSE_PORT=3001
# Logging
LOG_LEVEL=INFO
# Performance
MAX_CONCURRENT_REQUESTS=10
REQUEST_TIMEOUT=30
Configuration File
Create .env file:
CRAWL4AI_ENDPOINT=http://your-crawl4ai-instance.com
HTTP_PORT=3000
SSE_PORT=3001
LOG_LEVEL=DEBUG
DEBUG=true
Advanced Settings
# config/settings.py
from pydantic import BaseSettings
class Settings(BaseSettings):
crawl4ai_endpoint: str = "http://localhost:8000"
http_port: int = 3000
sse_port: int = 3001
log_level: str = "INFO"
debug: bool = False
max_concurrent_requests: int = 10
request_timeout: int = 30
class Config:
env_file = ".env"
settings = Settings()
๐ณ Docker Support
Quick Start with Docker
# Run with default settings
docker run -p 3000:3000 stgmt/crawl4ai-mcp
# With environment variables
docker run -p 3000:3000 \
-e CRAWL4AI_ENDPOINT=http://crawl4ai:8000 \
-e LOG_LEVEL=DEBUG \
stgmt/crawl4ai-mcp
# With Docker Compose
docker-compose up
Docker Compose
version: '3.8'
services:
crawl4ai-mcp:
image: stgmt/crawl4ai-mcp
ports:
- "3000:3000"
- "3001:3001"
environment:
- CRAWL4AI_ENDPOINT=http://crawl4ai:8000
- LOG_LEVEL=INFO
depends_on:
- crawl4ai
crawl4ai:
image: crawl4ai/crawl4ai
ports:
- "8000:8000"
Building Custom Image
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN pip install -e .
EXPOSE 3000 3001
CMD ["crawl4ai-mcp", "--http"]
๐งช Testing
Running Tests
# Install test dependencies
pip install -e .[test]
# Run all tests
pytest
# Run with coverage
pytest --cov=crawl4ai_mcp
# Run specific test
pytest tests/test_server.py::test_crawl_tool
Testing with MCP Server Tester
# Install the tester
npm install -g mcp-server-tester
# Test the server
mcp-server-tester test.yaml --server-config server.json
Example test.yaml:
name: Crawl4AI Tests
tests:
- name: Test markdown extraction
tool: md
arguments:
url: https://example.com
assert:
- type: contains
value: "Example Domain"
๐ Examples
Example 1: Extract Documentation
# Extract markdown from documentation
result = await session.call_tool("md", {
"url": "https://docs.python.org/3/",
"clean": True
})
Example 2: Monitor Price Changes
# Screenshot for visual comparison
screenshot = await session.call_tool("screenshot", {
"url": "https://store.example.com/product",
"full_page": False
})
# Extract price via JavaScript
price = await session.call_tool("execute_js", {
"url": "https://store.example.com/product",
"script": "document.querySelector('.price').innerText"
})
Example 3: Generate Reports
# Generate PDF report
pdf = await session.call_tool("pdf", {
"url": "https://analytics.example.com/report",
"format": "A4",
"landscape": True
})
๐ค Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Development Setup
# Clone the repository
git clone https://github.com/stgmt/crawl4ai-mcp.git
cd crawl4ai-mcp
# Install in development mode
pip install -e .[dev]
# Run tests
pytest
# Format code
black crawl4ai_mcp tests
ruff check --fix crawl4ai_mcp tests
๐ License
MIT License - see LICENSE file for details.
๐ Links
- PyPI Package: https://pypi.org/project/crawl4ai-mcp/
- GitHub Repository: https://github.com/stgmt/crawl4ai-mcp
- Documentation: https://github.com/stgmt/crawl4ai-mcp#readme
- Issues: https://github.com/stgmt/crawl4ai-mcp/issues
๐ Acknowledgments
- Crawl4AI - The powerful crawling engine
- MCP - Model Context Protocol specification
- Anthropic - For creating the MCP standard
Made with โค๏ธ for the AI community
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file crawl4ai_mcp_sse_stdio-1.0.0.tar.gz.
File metadata
- Download URL: crawl4ai_mcp_sse_stdio-1.0.0.tar.gz
- Upload date:
- Size: 16.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c271d88e98b2a443c2b54c738dac9c43709c6ff1516dcdd6e31f2b258d294854
|
|
| MD5 |
6e6b7e706ad9eaf2e0b3fd53db13fb96
|
|
| BLAKE2b-256 |
6a61ff2fb181771e53c3fe0aa280d064fa9b6c4618aec30fce5da11f461ad602
|
File details
Details for the file crawl4ai_mcp_sse_stdio-1.0.0-py3-none-any.whl.
File metadata
- Download URL: crawl4ai_mcp_sse_stdio-1.0.0-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
cf0ed49dc936a2f14dc21f18cc36220548e3b0f3213ea338ee660de82e748002
|
|
| MD5 |
feaaa7c78c125d2725c5b4d2db38aca2
|
|
| BLAKE2b-256 |
551ee5fde00b7ed603b5480594130bfff21cf1b70c1e998e7ca945b40c0d04df
|