FreeCrawl MCP Server - Self-hosted web scraping and document processing as a Firecrawl replacement
FreeCrawl MCP Server
A production-ready Model Context Protocol (MCP) server for web scraping and document processing, designed as a self-hosted replacement for Firecrawl.
🚀 Features
- JavaScript-enabled web scraping with Playwright and anti-detection measures
- Document processing with fallback support for various formats
- Concurrent batch processing with configurable limits
- Intelligent caching with SQLite backend
- Rate limiting per domain
- Comprehensive error handling with retry logic
- Easy installation via uvx or local development setup
- Health monitoring and metrics collection
📦 Installation & Usage
Quick Start with uvx (Recommended)
The easiest way to use FreeCrawl is with uvx, which automatically manages dependencies:
# Install and run directly
uvx freecrawl-mcp
# Install browsers on first run
uvx freecrawl-mcp --install-browsers
# Test functionality
uvx freecrawl-mcp --test
# Get help
uvx freecrawl-mcp --help
Local Development Setup
For local development or customization:
- Clone from GitHub:
git clone https://github.com/dylan-gluck/freecrawl-mcp.git
cd freecrawl-mcp
- Set up environment:
# Sync dependencies
uv sync
# Install browser dependencies
uv run freecrawl-mcp --install-browsers
# Run tests
uv run freecrawl-mcp --test
- Run the server:
uv run freecrawl-mcp
🛠 Configuration
Configure FreeCrawl using environment variables:
Basic Configuration
# Transport (stdio for MCP, http for REST API)
export FREECRAWL_TRANSPORT=stdio
# Browser pool settings
export FREECRAWL_MAX_BROWSERS=3
export FREECRAWL_HEADLESS=true
# Concurrency limits
export FREECRAWL_MAX_CONCURRENT=10
export FREECRAWL_MAX_PER_DOMAIN=3
# Cache settings
export FREECRAWL_CACHE=true
export FREECRAWL_CACHE_DIR=/tmp/freecrawl_cache
export FREECRAWL_CACHE_TTL=3600
export FREECRAWL_CACHE_SIZE=536870912 # 512MB
# Rate limiting
export FREECRAWL_RATE_LIMIT=60 # requests per minute
# Logging
export FREECRAWL_LOG_LEVEL=INFO
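As an illustration of how these variables might be consumed, the following is a minimal Python sketch that reads them into a typed settings object. The `Settings` class and `load_settings` function are hypothetical names, not part of FreeCrawl, and the fallback values simply mirror the example values shown above; they are not confirmed defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(raw: Optional[str], default: bool) -> bool:
    """Interpret common truthy strings; use the default when the variable is unset."""
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    transport: str
    max_browsers: int
    headless: bool
    max_concurrent: int
    max_per_domain: int
    cache_enabled: bool
    cache_dir: str
    cache_ttl: int
    cache_size: int
    rate_limit: int
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    # Fallbacks below are assumptions copied from the README examples.
    return Settings(
        transport=env.get("FREECRAWL_TRANSPORT", "stdio"),
        max_browsers=int(env.get("FREECRAWL_MAX_BROWSERS", "3")),
        headless=_as_bool(env.get("FREECRAWL_HEADLESS"), True),
        max_concurrent=int(env.get("FREECRAWL_MAX_CONCURRENT", "10")),
        max_per_domain=int(env.get("FREECRAWL_MAX_PER_DOMAIN", "3")),
        cache_enabled=_as_bool(env.get("FREECRAWL_CACHE"), True),
        cache_dir=env.get("FREECRAWL_CACHE_DIR", "/tmp/freecrawl_cache"),
        cache_ttl=int(env.get("FREECRAWL_CACHE_TTL", "3600")),
        cache_size=int(env.get("FREECRAWL_CACHE_SIZE", "536870912")),
        rate_limit=int(env.get("FREECRAWL_RATE_LIMIT", "60")),
        log_level=env.get("FREECRAWL_LOG_LEVEL", "INFO").upper(),
    )
```

Passing an explicit mapping instead of relying on `os.environ` keeps the loader easy to test.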
Security Settings
# API authentication (optional)
export FREECRAWL_REQUIRE_API_KEY=false
export FREECRAWL_API_KEYS=key1,key2,key3
# Domain blocking
export FREECRAWL_BLOCKED_DOMAINS=localhost,127.0.0.1
# Anti-detection
export FREECRAWL_ANTI_DETECT=true
export FREECRAWL_ROTATE_UA=true
🔧 MCP Tools
FreeCrawl provides the following MCP tools:
freecrawl_scrape
Scrape content from a single URL with advanced options.
Parameters:
- url (string): URL to scrape
- formats (array): Output formats - ["markdown", "html", "text", "screenshot", "structured"]
- javascript (boolean): Enable JavaScript execution (default: true)
- wait_for (string, optional): CSS selector or time (ms) to wait
- anti_bot (boolean): Enable anti-detection measures (default: true)
- headers (object, optional): Custom HTTP headers
- cookies (object, optional): Custom cookies
- cache (boolean): Use cached results if available (default: true)
- timeout (number): Total timeout in milliseconds (default: 30000)
Example:
{
"name": "freecrawl_scrape",
"arguments": {
"url": "https://example.com",
"formats": ["markdown", "screenshot"],
"javascript": true,
"wait_for": "2000"
}
}
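Because wait_for accepts either a CSS selector or a time in milliseconds, a caller or server has to tell the two apart. The sketch below shows one plausible rule (all-digit strings are delays, anything else is a selector). The function name `parse_wait_for` and the rule itself are assumptions for illustration; the actual server may decide differently.

```python
from typing import Tuple, Union


def parse_wait_for(value: str) -> Tuple[str, Union[int, str]]:
    """Classify a wait_for value as a millisecond delay or a CSS selector.

    Assumed rule: a string made only of ASCII digits is a delay in ms.
    """
    text = value.strip()
    if not text:
        raise ValueError("wait_for must not be empty")
    if text.isascii() and text.isdigit():
        return ("delay_ms", int(text))
    return ("selector", text)
```

With this rule, the "2000" in the example above would be treated as a two-second delay, while a value such as "#content" would be treated as a selector.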
freecrawl_batch_scrape
Scrape multiple URLs concurrently.
Parameters:
- urls (array): List of URLs to scrape (max 100)
- concurrency (number): Maximum concurrent requests (default: 5)
- formats (array): Output formats (default: ["markdown"])
- common_options (object, optional): Options applied to all URLs
- continue_on_error (boolean): Continue if individual URLs fail (default: true)
Example:
{
"name": "freecrawl_batch_scrape",
"arguments": {
"urls": [
"https://example.com/page1",
"https://example.com/page2"
],
"concurrency": 3,
"formats": ["markdown", "text"]
}
}
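The concurrency and continue_on_error parameters describe a bounded fan-out pattern. The following is a minimal asyncio sketch of that pattern, not FreeCrawl's implementation: `batch_scrape` and the `fetch` callable are hypothetical names, and the result dictionaries are an illustrative shape only.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def batch_scrape(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],  # hypothetical per-URL scraper
    concurrency: int = 5,
    continue_on_error: bool = True,
) -> List[Dict[str, Any]]:
    """Run fetch over urls with at most `concurrency` calls in flight."""
    if len(urls) > 100:
        raise ValueError("at most 100 URLs per batch")
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url: str) -> Dict[str, Any]:
        async with semaphore:  # blocks while `concurrency` workers are active
            try:
                return {"url": url, "ok": True, "content": await fetch(url)}
            except Exception as exc:
                if not continue_on_error:
                    raise  # abort the whole batch on the first failure
                return {"url": url, "ok": False, "error": str(exc)}

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(worker(u) for u in urls))
```

A semaphore is used here rather than fixed-size chunks so that a slow URL does not hold up the start of unrelated requests.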
freecrawl_extract
Extract structured data using a schema-driven approach.
Parameters:
- url (string): URL to extract data from
- schema (object): JSON Schema or Pydantic model definition
- prompt (string, optional): Custom extraction instructions
- validation (boolean): Validate against schema (default: true)
- multiple (boolean): Extract multiple matching items (default: false)
Example:
{
"name": "freecrawl_extract",
"arguments": {
"url": "https://example.com/product",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"}
}
}
}
}
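To make the validation option concrete, here is a deliberately small sketch of checking an extracted record against a schema like the one above. It covers only the "type" and "properties" keywords and is not a full JSON Schema validator; `check_against_schema` is a hypothetical helper, and how FreeCrawl validates internally is not stated in this document.

```python
from typing import Any, Dict, List

# Mapping from JSON Schema type names to Python types (subset only).
_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_against_schema(data: Any, schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []
    expected = schema.get("type")
    if expected is not None:
        py_type = _TYPES.get(expected)
        # bool is a subclass of int in Python, so reject it for numeric types
        bool_as_number = isinstance(data, bool) and expected in ("number", "integer")
        if py_type is None:
            errors.append(f"unsupported type {expected}")
        elif bool_as_number or not isinstance(data, py_type):
            errors.append(f"expected {expected}, got {type(data).__name__}")
    if isinstance(data, dict):
        for key, sub_schema in schema.get("properties", {}).items():
            if key in data:  # missing keys are allowed unless "required" is handled
                errors.extend(f"{key}: {e}" for e in check_against_schema(data[key], sub_schema))
    return errors
```

For production use, a complete validator library would be the safer choice; this sketch only shows the shape of the check.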
freecrawl_process_document
Process documents (PDF, DOCX, etc.) with OCR support.
Parameters:
- file_path (string, optional): Path to document file
- url (string, optional): URL to download document from
- strategy (string): Processing strategy - "fast", "hi_res", "ocr_only" (default: "hi_res")
- formats (array): Output formats - ["markdown", "structured", "text"]
- languages (array, optional): OCR languages (e.g., ["eng", "fra"])
- extract_images (boolean): Extract embedded images (default: false)
- extract_tables (boolean): Extract and structure tables (default: true)
Example:
{
"name": "freecrawl_process_document",
"arguments": {
"url": "https://example.com/document.pdf",
"strategy": "hi_res",
"formats": ["markdown", "structured"]
}
}
freecrawl_health_check
Get server health status and metrics.
Example:
{
"name": "freecrawl_health_check",
"arguments": {}
}
🔄 Integration with Claude Code
MCP Configuration
Add FreeCrawl to your MCP configuration:
Using uvx (Recommended):
{
"mcpServers": {
"freecrawl": {
"command": "uvx",
"args": ["freecrawl-mcp"]
}
}
}
Using local development setup:
{
"mcpServers": {
"freecrawl": {
"command": "uv",
"args": ["run", "freecrawl-mcp"],
"cwd": "/path/to/freecrawl-mcp"
}
}
}
Usage in Prompts
Please scrape the content from https://example.com and extract the main article text in markdown format.
Claude Code will automatically use the freecrawl_scrape tool to fetch and process the content.
🚀 Performance & Scalability
Resource Usage
- Memory: ~100MB base + ~50MB per browser instance
- CPU: Moderate usage during active scraping
- Storage: Cache grows based on configured limits
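The memory figures above can be turned into a quick sizing estimate. This is plain arithmetic on the approximate numbers quoted in this section; `estimate_memory_mb` is a hypothetical helper and real usage will vary with the pages being rendered.

```python
def estimate_memory_mb(browsers: int, base_mb: int = 100, per_browser_mb: int = 50) -> int:
    """Rough memory estimate: base footprint plus a fixed cost per browser instance."""
    if browsers < 0:
        raise ValueError("browsers must be non-negative")
    return base_mb + per_browser_mb * browsers


# With FREECRAWL_MAX_BROWSERS=3: 100 + 3 * 50 = 250 MB
print(estimate_memory_mb(3))
```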
Throughput
- Single requests: 2-5 seconds typical response time
- Batch processing: 10-50 concurrent requests depending on configuration
- Cache hit ratio: 30%+ for repeated content
Optimization Tips
- Enable caching for frequently accessed content
- Adjust concurrency based on target site rate limits
- Use appropriate formats - markdown is faster than screenshots
- Configure rate limiting to avoid being blocked
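To show what a TTL-based SQLite cache looks like in practice, here is a minimal sketch using only the standard library. The class name `TTLCache`, the table layout, and the expiry-on-read behavior are assumptions for illustration; they are not taken from FreeCrawl's CacheManager.

```python
import sqlite3
import time
from typing import Optional


class TTLCache:
    """Tiny URL -> body cache with time-to-live, backed by SQLite."""

    def __init__(self, path: str = ":memory:", ttl_seconds: int = 3600) -> None:
        self.ttl = ttl_seconds
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "url TEXT PRIMARY KEY, body TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def put(self, url: str, body: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO cache (url, body, stored_at) VALUES (?, ?, ?)",
                (url, body, now),
            )

    def get(self, url: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT body, stored_at FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            return None
        body, stored_at = row
        if now - stored_at > self.ttl:
            # Entry is stale: drop it and report a miss
            with self.db:
                self.db.execute("DELETE FROM cache WHERE url = ?", (url,))
            return None
        return body
```

The optional `now` argument exists only to make expiry testable without sleeping. A longer TTL raises the hit ratio at the cost of serving older content.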
🛡 Security Considerations
Anti-Detection
- Rotating user agents
- Realistic browser fingerprints
- Request timing randomization
- JavaScript execution in sandboxed environment
Input Validation
- URL format validation
- Private IP blocking
- Domain blocklist support
- Request size limits
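The first three checks in this list can be sketched with the standard library alone. The function below is an assumption-laden illustration, not FreeCrawl's validator: `is_url_allowed` is a hypothetical name, the default blocklist copies the FREECRAWL_BLOCKED_DOMAINS example, and hostnames are not resolved, so a DNS name pointing at a private address would pass this check.

```python
import ipaddress
from typing import Iterable
from urllib.parse import urlsplit


def is_url_allowed(url: str, blocked_domains: Iterable[str] = ("localhost", "127.0.0.1")) -> bool:
    """Return True when the URL is http(s), not blocklisted, and not a private IP literal."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:  # e.g. malformed IPv6 literal
        return False
    if parts.scheme not in ("http", "https") or not host:
        return False
    host = host.lower()
    if host in {d.strip().lower() for d in blocked_domains}:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a DNS name; resolution is out of scope for this sketch
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```

A stricter variant would resolve the hostname and re-check the resulting addresses before connecting.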
Resource Protection
- Memory usage monitoring
- Browser pool size limits
- Request timeout enforcement
- Rate limiting per domain
🔧 Troubleshooting
Common Issues
| Issue | Possible Cause | Solution |
|---|---|---|
| High memory usage | Too many browser instances | Reduce FREECRAWL_MAX_BROWSERS |
| Slow responses | JavaScript-heavy sites | Increase timeout or disable JS |
| Bot detection | Missing anti-detection | Ensure FREECRAWL_ANTI_DETECT=true |
| Cache misses | TTL too short | Increase FREECRAWL_CACHE_TTL |
| Import errors | Missing dependencies | Run uvx freecrawl-mcp --test |
Debug Mode
With uvx:
export FREECRAWL_LOG_LEVEL=DEBUG
uvx freecrawl-mcp --test
Local development:
export FREECRAWL_LOG_LEVEL=DEBUG
uv run freecrawl-mcp --test
📈 Monitoring & Observability
Health Metrics
- Browser pool status
- Memory and CPU usage
- Cache hit rates
- Request success rates
- Response times
Logging
FreeCrawl provides structured logging with configurable levels:
- ERROR: Critical failures
- WARNING: Recoverable issues
- INFO: General operations
- DEBUG: Detailed troubleshooting
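As a sketch of how the four levels above could map onto Python's standard logging module, the helper below reads FREECRAWL_LOG_LEVEL and configures the root logger. The function name `configure_logging`, the log format, and the fallback to INFO for unknown values are assumptions, not documented FreeCrawl behavior.

```python
import logging
import os
from typing import Mapping, Optional

# Only the four levels documented above are accepted.
_LEVELS = {
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def configure_logging(env: Optional[Mapping[str, str]] = None) -> int:
    """Configure root logging from FREECRAWL_LOG_LEVEL and return the level used."""
    env = os.environ if env is None else env
    name = env.get("FREECRAWL_LOG_LEVEL", "INFO").strip().upper()
    level = _LEVELS.get(name, logging.INFO)  # unknown names fall back to INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    return level
```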
🔧 Development
Running Tests
With uvx:
# Basic functionality test
uvx freecrawl-mcp --test
Local development:
# Basic functionality test
uv run freecrawl-mcp --test
Code Structure
- Core server: FreeCrawlServer class
- Browser management: BrowserPool for resource pooling
- Content extraction: ContentExtractor with multiple strategies
- Caching: CacheManager with SQLite backend
- Rate limiting: RateLimiter with token bucket algorithm
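For readers unfamiliar with the token bucket algorithm named above, here is a minimal generic sketch. It is not the project's RateLimiter: the class `TokenBucket`, its parameters, and the injectable clock are illustrative assumptions. A per-domain limiter could keep one such bucket per hostname.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Token bucket: tokens refill at a steady rate up to a fixed capacity."""

    def __init__(
        self,
        rate_per_minute: float = 60,
        capacity: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(rate_per_minute if capacity is None else capacity)
        self.tokens = self.capacity  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return False when rate-limited."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the elapsed time, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity controls how large a burst is tolerated, while the rate controls the sustained request throughput.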
📄 License
This project is licensed under the MIT License - see the technical specification for details.
🤝 Contributing
- Fork the repository at https://github.com/dylan-gluck/freecrawl-mcp
- Create a feature branch
- Set up local development: uv sync
- Run tests: uv run freecrawl-mcp --test
- Submit a pull request
📚 Technical Specification
For detailed technical information, see ai_docs/FREECRAWL_TECHNICAL_SPEC.md.
FreeCrawl MCP Server - Self-hosted web scraping for the modern web 🚀
File details
Details for the file freecrawl_mcp-0.1.1.tar.gz.
File metadata
- Download URL: freecrawl_mcp-0.1.1.tar.gz
- Upload date:
- Size: 99.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 836c99c1d12288e9b8624efc31866f214e8748f508e4223663af9f4ced83dc88 |
| MD5 | 894c93403cba12de524be5d893766d4b |
| BLAKE2b-256 | 7720bf8d0696531a61b4eda389d715754277dcff57fb2eda44753acf48d93055 |
File details
Details for the file freecrawl_mcp-0.1.1-py3-none-any.whl.
File metadata
- Download URL: freecrawl_mcp-0.1.1-py3-none-any.whl
- Upload date:
- Size: 23.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f56d3973299a151cf38c44209c144c00a33542c9899082fb84ebd065f6dddeb |
| MD5 | 4998fb60f0d4143c12da7c1c3632d5f0 |
| BLAKE2b-256 | 09cfadede97358acb38b2f8b55637690c35c346c21e1fec3e55a5731d04bbcc4 |