Crawl4AI MCP Server
A lightweight Model Context Protocol (MCP) server that exposes Crawl4AI web scraping and crawling capabilities as tools for AI agents.
Similar to Firecrawl's API but self-hosted and free. Perfect for integrating web scraping into your AI workflows with OpenAI Agents SDK, Cursor, Claude Code, and other MCP-compatible tools.
Features
- MCP Tools: Exposes 4 tools (`scrape`, `crawl`, `crawl_site`, `crawl_sitemap`) via a stdio MCP server
- Web Scraping: Single-page scraping with markdown extraction
- Web Crawling: Multi-page breadth-first crawling with depth control
- Adaptive Crawling: Smart crawling that stops once sufficient content has been gathered
- Safety: Blocks internal networks, localhost, and private IPs
- Agent Ready: Works with the OpenAI Agents SDK, Cursor, and Claude Code
- Fast: Powered by Playwright and Crawl4AI's optimized extraction
Quick Start
Choose between Docker (recommended) or manual installation:
Option A: Docker (Recommended)
Docker eliminates all setup complexity and provides a consistent environment:
Option A1: Use the Pre-built Image (Fastest)
# No setup required! Just pull and run the published image
docker pull uysalsadi/crawl4ai-mcp-server:latest
# Test it works
python test-config.py
# Use directly in MCP configurations (see examples below)
Option A2: Build Yourself
# Clone the repository
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
# Quick build and test (simplified)
docker build -f Dockerfile.simple -t crawl4ai-mcp .
echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test", "version": "1.0"}}}' | docker run --rm -i crawl4ai-mcp
# Or use helper script (full build with Playwright)
./docker-run.sh build
./docker-run.sh test
./docker-run.sh run
Docker Quick Commands:
- `./docker-run.sh build` - Build the image
- `./docker-run.sh run` - Run the MCP server (stdio mode)
- `./docker-run.sh test` - Run smoke tests
- `./docker-run.sh dev` - Development mode with shell access
Option B: Manual Installation
# Clone and setup
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Install Playwright browsers
python -m playwright install chromium
# Test basic functionality
python -m crawler_agent.smoke_client
# Test adaptive crawling
python test_adaptive.py
Use with OpenAI Agents SDK
# Set your OpenAI API key
export OPENAI_API_KEY="your-key-here"
# Docker: Run the example agent
docker-compose run --rm -e OPENAI_API_KEY crawl4ai-mcp python -m crawler_agent.agents_example
# Manual: Run the example agent
python -m crawler_agent.agents_example
Tools Reference
The MCP server exposes 4 production-ready tools with content-hiding support: each tool can persist results to disk and return only metadata, keeping large page content out of the agent's context.
scrape
Fetch a single URL and return markdown content.
Arguments:
- `url` (required): The URL to scrape
- `output_dir` (optional): If provided, persists content to disk and returns metadata only
- `crawler`: Optional Crawl4AI crawler config overrides
- `browser`: Optional Crawl4AI browser config overrides
- `script`: Optional C4A-Script for page interaction
- `timeout_sec`: Request timeout (default: 45s, max: 600s)
Returns (without output_dir):
{
  "url": "https://example.com",
  "markdown": "# Page Title\n\nContent...",
  "links": ["https://example.com/page1", "..."],
  "metadata": {}
}
Returns (with output_dir):
{
  "run_id": "scrape_20250815_122811_4fb188",
  "file_path": "/path/to/output/pages/example.com_index.md",
  "manifest_path": "/path/to/output/manifest.json",
  "bytes_written": 230
}
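Because the return shape depends on whether `output_dir` was supplied, a client may want to branch on the keys. A minimal sketch (the key names follow the two return examples above; the helper itself is hypothetical, not part of the server):

```python
def extract_markdown(result: dict) -> str:
    """Return page markdown from a scrape result, reading from disk
    when the server persisted the content instead of inlining it."""
    if "markdown" in result:  # inline shape (no output_dir)
        return result["markdown"]
    # persisted shape: only metadata came back, content is on disk
    with open(result["file_path"], encoding="utf-8") as f:
        return f.read()

# Inline shape: markdown comes straight back in the payload.
inline = {"url": "https://example.com", "markdown": "# Title", "links": [], "metadata": {}}
print(extract_markdown(inline))  # # Title
```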
crawl
Multi-page breadth-first crawling with filtering and adaptive stopping.
Arguments:
- `seed_url` (required): Starting URL for the crawl
- `max_depth`: Maximum link depth to follow (default: 1, max: 4)
- `max_pages`: Maximum pages to crawl (default: 5, max: 100)
- `same_domain_only`: Stay within the same domain (default: true)
- `include_patterns`: Regex patterns URLs must match
- `exclude_patterns`: Regex patterns to exclude URLs
- `adaptive`: Enable adaptive crawling (default: false)
- `output_dir` (optional): If provided, persists content to disk and returns metadata only
- `crawler`, `browser`, `script`, `timeout_sec`: Same as `scrape`
Returns (without output_dir):
{
  "start_url": "https://example.com",
  "pages": [
    {
      "url": "https://example.com/page1",
      "markdown": "Content...",
      "links": ["..."]
    }
  ],
  "total_pages": 3
}
Returns (with output_dir):
{
  "run_id": "crawl_20250815_122828_464944",
  "pages_ok": 3,
  "pages_failed": 0,
  "manifest_path": "/path/to/output/manifest.json",
  "bytes_written": 690
}
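The depth and page limits above map onto a standard breadth-first traversal. A minimal sketch over an in-memory link graph (the graph and the `bfs_crawl` helper are hypothetical stand-ins; the real tool fetches pages with Crawl4AI):

```python
from collections import deque

def bfs_crawl(seed, links, max_depth=1, max_pages=5):
    """Visit pages breadth-first, honoring depth and page budgets."""
    visited, order = {seed}, []
    queue = deque([(seed, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # don't expand links past the depth limit
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order

# Hypothetical link graph standing in for fetched pages
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(bfs_crawl("a", graph, max_depth=2, max_pages=3))  # ['a', 'b', 'c']
```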
crawl_site
Comprehensive site crawling with persistence (always requires output_dir).
Arguments:
- `entry_url` (required): Starting URL for the site crawl
- `output_dir` (required): Directory to persist results
- `max_depth`: Maximum crawl depth (default: 2, max: 6)
- `max_pages`: Maximum pages to crawl (default: 200, max: 5000)
- Additional config options for filtering and performance
Returns:
{
  "run_id": "site_20250815_122851_0e2455",
  "output_dir": "/path/to/output",
  "manifest_path": "/path/to/output/manifest.json",
  "pages_ok": 15,
  "pages_failed": 2,
  "bytes_written": 45672
}
crawl_sitemap
Sitemap-based crawling with persistence (always requires output_dir).
Arguments:
- `sitemap_url` (required): URL of the sitemap.xml
- `output_dir` (required): Directory to persist results
- `max_entries`: Maximum sitemap entries to process (default: 1000)
- Additional config options for filtering and performance
Returns:
{
  "run_id": "sitemap_20250815_123006_667d71",
  "output_dir": "/path/to/output",
  "manifest_path": "/path/to/output/manifest.json",
  "pages_ok": 25,
  "pages_failed": 0,
  "bytes_written": 123456
}
Usage Examples
Standalone MCP Server
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="python",
    args=["-m", "crawler_agent.mcp_server"]
)

async with stdio_client(params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()

        # Scrape a single page
        result = await session.call_tool("scrape", {
            "url": "https://example.com"
        })

        # Crawl with adaptive stopping
        result = await session.call_tool("crawl", {
            "seed_url": "https://docs.example.com",
            "max_pages": 10,
            "adaptive": True
        })
OpenAI Agents SDK Integration
from agents import Agent, Runner
from agents.mcp.server import MCPServerStdio

async with MCPServerStdio(
    params={
        "command": "python",
        "args": ["-m", "crawler_agent.mcp_server"]
    },
    cache_tools_list=True
) as server:
    agent = Agent(
        name="Research Assistant",
        instructions="Use scrape and crawl tools to research topics.",
        mcp_servers=[server],
        mcp_config={"convert_schemas_to_strict": True}
    )
    result = await Runner.run(
        agent,
        "Research the latest AI safety papers"
    )
Cursor/Claude Code Integration
This project supports both Cursor and Claude Code with synchronized configuration:
Cursor Setup
- Uses `.cursorrules` for AI assistant behavior and project guidance
- Automatic MCP tool detection and integration
- Project-specific rules and workflow guidance
Claude Code Setup
- Uses `CLAUDE.md` for comprehensive project context and memory bank
- Native MCP integration via `mcp__crawl4ai-mcp__*` tool calls
- Global config: `~/.claude/claude_desktop_config.json`; project config: `.mcp.json`
Docker Integration for Both Environments
Both Cursor and Claude Code can use the Dockerized MCP server:
// .mcp.json (project-scoped configuration)
{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "./crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}
Dual Environment Features
- Synchronized Documentation: Changes to documentation are maintained across both environments
- Shared MCP Configuration: Both use the same MCP server and tool schemas
- Docker Compatibility: Consistent environment across both Cursor and Claude Code
- Cross-Compatible: Projects work seamlessly whether using Cursor or Claude Code
Docker Usage
Quick Start with Docker
The Docker approach eliminates all manual setup and provides a consistent environment:
# 1. Clone and build
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
./docker-run.sh build
# 2. Test the installation
./docker-run.sh test
# 3. Run MCP server
./docker-run.sh run
Docker Commands Reference
| Command | Description |
|---|---|
| `./docker-run.sh build` | Build the Docker image |
| `./docker-run.sh run` | Run the MCP server in stdio mode |
| `./docker-run.sh test` | Run smoke tests |
| `./docker-run.sh dev` | Development mode with shell access |
| `./docker-run.sh stop` | Stop running containers |
| `./docker-run.sh clean` | Remove containers and images |
| `./docker-run.sh logs` | Show container logs |
Docker Compose Services
The docker-compose.yml provides multiple service configurations:
- `crawl4ai-mcp`: Production MCP server
- `crawl4ai-mcp-dev`: Development container with shell access
- `crawl4ai-mcp-test`: Test runner for smoke tests
MCP Integration with Docker
Global MCP Configuration (Claude Code)
Add to ~/.claude/claude_desktop_config.json:
{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "/tmp/crawl4ai-crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}
Copy this exact config: it uses the published Docker image, so no local build is required.
Global MCP Configuration (Cursor)
Add to ~/.cursor/mcp.json:
{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "/tmp/crawl4ai-crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}
This configuration is tested and working.
Project-Scoped Configuration
Add to your project's .mcp.json:
{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "./crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}
This project already includes this configuration; see `.mcp.json`.
Configuration Validation
After setting up any MCP configuration:
1. Test that the Docker image works: `python test-config.py`
2. Restart your editor (Cursor/Claude Code) to reload the MCP configuration
3. Verify the tools are available: look for `crawl4ai-mcp` in the MCP tools panel; you should see 4 tools: `scrape`, `crawl`, `crawl_site`, `crawl_sitemap`
If the tools don't appear, check that:
- Docker is running and the image is accessible
- The MCP configuration file is valid JSON
- The editor has been restarted after config changes
Docker Advantages
- Zero Setup: No need for a Python venv, pip, or Playwright installation
- Consistent Environment: Same behavior across all platforms
- Isolated Dependencies: No conflicts with your system Python
- Easy Updates: `docker pull` to get the latest version
- Portable: Works anywhere Docker runs
- Volume Persistence: Crawl outputs are saved to the host filesystem
Environment Variables
Set these when running Docker containers:
# Using docker-compose
OPENAI_API_KEY=your-key docker-compose up crawl4ai-mcp
# Using docker run directly
docker run --rm -i \
-e OPENAI_API_KEY=your-key \
-e CRAWL4AI_MCP_LOG=DEBUG \
-v ./crawls:/app/crawls \
crawl4ai-mcp-server:latest
Configuration
Environment Variables
- `TARGET_URL`: Default URL for smoke testing (default: https://modelcontextprotocol.io/docs)
- `RESEARCH_TASK`: Custom research task for the agents example
- `OPENAI_API_KEY`: Required for the OpenAI Agents SDK
Safety Settings
The server blocks these URL patterns by default:
- `localhost`, `127.0.0.1`, `::1`
- Private IP ranges (RFC 1918)
- `file://` schemes
- `.local`, `.internal`, `.lan` domains
Advanced Features
Adaptive Crawling
When `adaptive: true` is set, the crawler uses a simple content-based stopping strategy:
- Stops when total content exceeds 5,000 characters
- Prevents over-crawling during information-gathering tasks
- More sophisticated LLM-based strategies can be added
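The stopping rule above amounts to a running character budget. A minimal sketch (the 5,000-character threshold comes from the description above; the `should_stop` helper is illustrative, not the server's code):

```python
def should_stop(markdown_pages, budget=5_000):
    """Stop adaptive crawling once accumulated content exceeds the budget."""
    return sum(len(md) for md in markdown_pages) > budget

# 3,000 + 2,500 = 5,500 characters, which exceeds the 5,000 budget
pages = ["x" * 3_000, "y" * 2_500]
print(should_stop(pages))  # True
```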
Custom Filtering
Use regex patterns to control which URLs are crawled:
await session.call_tool("crawl", {
    "seed_url": "https://docs.example.com",
    "include_patterns": [r"/docs/", r"/api/"],
    "exclude_patterns": [r"/old/", r"\.pdf$"]
})
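The usual semantics for filters like these: a URL must match at least one include pattern (when any are given) and no exclude pattern. A sketch under that assumption (the `url_passes` helper is illustrative, not the server's implementation):

```python
import re

def url_passes(url, include, exclude):
    """Apply include/exclude regex filters to a candidate URL."""
    if include and not any(re.search(p, url) for p in include):
        return False  # include patterns given, none matched
    return not any(re.search(p, url) for p in exclude)

print(url_passes("https://docs.example.com/docs/intro", [r"/docs/"], [r"\.pdf$"]))  # True
print(url_passes("https://docs.example.com/old/x.pdf", [r"/docs/"], [r"\.pdf$"]))   # False
```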
Browser Configuration
Pass custom browser/crawler settings:
await session.call_tool("scrape", {
    "url": "https://example.com",
    "browser": {"headless": True, "viewport": {"width": 1280, "height": 720}},
    "crawler": {"verbose": True}
})
Architecture
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│     AI Agent     │────▶│    MCP Server    │────▶│     Crawl4AI     │
│ (Cursor/Agents)  │     │     (stdio)      │     │   (Playwright)   │
└──────────────────┘     └──────────────────┘     └──────────────────┘
                                  │
                                  ▼
                         ┌──────────────────┐
                         │  Safety Guards   │
                         │ (URL validation) │
                         └──────────────────┘
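The stdio transport between agent and server exchanges newline-delimited JSON-RPC messages. A minimal sketch of building the same `initialize` request used in the Docker quick-start test above (the field values are taken from that example):

```python
import json

def initialize_request(request_id=1):
    """Serialize an MCP initialize request for a stdio transport."""
    msg = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "test", "version": "1.0"},
        },
    }
    return json.dumps(msg)

# This line could be piped to the server's stdin, like the echo example above.
line = initialize_request()
print(json.loads(line)["method"])  # initialize
```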
Publishing to Docker Hub
For maintainers who want to publish updates to the Docker registry:
Publishing Process
# 1. Login to Docker Hub (one time setup)
./docker-push.sh login
# 2. Build, push, and test everything
./docker-push.sh all
# Or do steps individually:
./docker-push.sh build # Build and tag image
./docker-push.sh push # Push to Docker Hub
./docker-push.sh test # Test the published image
Docker Hub Repository
The image is published at: uysalsadi/crawl4ai-mcp-server
- Latest: `uysalsadi/crawl4ai-mcp-server:latest`
- Versioned: `uysalsadi/crawl4ai-mcp-server:v1.0.0`
Using the Published Image
Users can pull and use the image without any local setup:
docker pull uysalsadi/crawl4ai-mcp-server:latest
Development
Project Structure
crawler_agent/
├── __init__.py
├── mcp_server.py          # Main MCP server
├── safety.py              # URL safety validation
├── adaptive_strategy.py   # Adaptive crawling logic
├── smoke_client.py        # Basic testing client
└── agents_example.py      # OpenAI Agents SDK example
Testing
# Test MCP server
python -m crawler_agent.smoke_client
# Test adaptive crawling
python test_adaptive.py
# Test with agents (requires OPENAI_API_KEY)
python -m crawler_agent.agents_example
Contributing
- Environment Setup: Always activate `.venv` before development work
- Documentation: Follow the Documentation Memory Rule: synchronize changes across CLAUDE.md, .cursorrules, and README.md
- Code Style: Follow `.cursorrules` for Cursor or `CLAUDE.md` for Claude Code
- Testing: Use `python -m crawler_agent.smoke_client` for validation
- Safety: Ensure security guards are maintained for all new features
- Dual Compatibility: Verify changes work in both Cursor and Claude Code environments
License
MIT License - see the LICENSE file for details.
Acknowledgments
This project uses Crawl4AI by UncleCode for its web scraping capabilities. Crawl4AI is an excellent open-source, LLM-friendly web crawler and scraper.
Contributing
Contributions are welcome! Please feel free to submit a pull request. For major changes, open an issue first to discuss what you would like to change.
Support
If this project helped you, please give it a star! It helps others discover the project.
Issues
Found a bug or have a feature request? Please open an issue.