Advanced Web Crawling Platform with Deep Analysis and MCP Server
Crawilfy MCP Server
Advanced web crawling platform with deep analysis capabilities, automatic API discovery, and crawler generation. Built as an MCP (Model Context Protocol) server for seamless integration with AI assistants like Cursor, Claude Code, and Windsurf.
Quick Start (Single Command)
Option 1: Using uvx (Recommended - No Installation Required)
The simplest way to use Crawilfy. Just add this to your MCP configuration:
{
"mcpServers": {
"crawilfy": {
"command": "uvx",
"args": ["crawilfy-mcp-server"]
}
}
}
Note: Requires uv to be installed. Install with:
curl -LsSf https://astral.sh/uv/install.sh | sh
Option 2: Using pipx
{
"mcpServers": {
"crawilfy": {
"command": "pipx",
"args": ["run", "crawilfy-mcp-server"]
}
}
}
Option 3: Using pip (Global Install)
pip install crawilfy-mcp-server
playwright install chromium
Then add to your MCP configuration:
{
"mcpServers": {
"crawilfy": {
"command": "python",
"args": ["-m", "src.mcp.server"]
}
}
}
Where to Add MCP Configuration
For Cursor IDE
- Open Settings (`Cmd/Ctrl + ,`)
- Search for "MCP"
- Click "Edit in settings.json"
- Add the configuration under `mcpServers`
For Claude Code
- Open the MCP settings file at `~/.config/claude/mcp_settings.json`
- Add the configuration
For Windsurf
- Open Settings → MCP Servers
- Add the configuration
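Hand-editing these JSON files can accidentally overwrite servers that are already registered. As a minimal sketch (not part of Crawilfy itself), the helper below merges the uvx entry from the Quick Start into an existing configuration and leaves other entries alone. The names `merge_server_entry` and `update_config_file` are hypothetical, and the file is assumed to be plain JSON with a top-level `mcpServers` object.

```python
import json
from pathlib import Path
from typing import Optional

# The server entry from the uvx option in the Quick Start.
CRAWILFY_ENTRY = {"command": "uvx", "args": ["crawilfy-mcp-server"]}


def merge_server_entry(config: dict, name: str = "crawilfy",
                       entry: Optional[dict] = None) -> dict:
    """Return a copy of `config` with `entry` registered under mcpServers[name]."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = dict(entry or CRAWILFY_ENTRY)
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read an MCP config file if it exists, add the entry, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_server_entry(existing), indent=2) + "\n")
```

For example, `update_config_file(Path.home() / ".config/claude/mcp_settings.json")` would target the Claude Code path listed above. Back up the file first if it contains settings you care about.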
Available Tools (55 Total)
Test Status Legend: ✅ Tested & Working | ⚠️ Works with limitations | 🔧 Requires config | 🆓 No paid API needed
Deep Analysis & Discovery
| Tool | Status | Description | Notes |
|---|---|---|---|
| `deep_analyze` | ✅ | Comprehensive analysis of a website (network + JS + security) | |
| `discover_apis` | ✅ | Discover all REST and GraphQL APIs including hidden endpoints | |
| `introspect_graphql` | ✅ | Extract complete GraphQL schema using introspection | |
| `execute_graphql` | ✅ | Execute GraphQL queries and mutations | |
| `analyze_websocket` | ✅ | Intercept and analyze WebSocket connections | Returns empty if no WS found |
| `analyze_auth` | ✅ | Analyze authentication flow and mechanisms | |
| `detect_protection` | ✅ | Detect anti-bot systems, CAPTCHAs, and fingerprinting | |
| `detect_technology` | ✅ | Detect technology stack (CMS, frameworks, CDN, analytics) | |
JavaScript Analysis
| Tool | Status | Description | Notes |
|---|---|---|---|
| `deobfuscate_js` | ✅ 🆓 | Deobfuscate JavaScript code with multiple techniques | No browser needed |
| `extract_from_js` | ✅ 🆓 | Extract API endpoints, URLs, constants, and auth logic from JS | No browser needed |
Session Recording & Crawlers
| Tool | Status | Description | Notes |
|---|---|---|---|
| `record_session` | ✅ | Start recording an interactive browser session | |
| `stop_recording` | ✅ | Stop an active recording and save it | |
| `list_recordings` | ✅ | List all available recordings (active and saved) | |
| `get_recording_status` | ✅ | Get status and details of a specific recording | |
| `delete_recording` | ✅ | Delete a saved recording | |
| `export_recording` | ✅ | Export recording to JSON, HAR, or Playwright test format | |
| `generate_crawler` | ✅ | Generate crawler script from recording (YAML, Python, Playwright) | |
Content Extraction
| Tool | Status | Description | Notes |
|---|---|---|---|
| `extract_article` | ✅ | Extract clean article content with intelligent parsing | |
| `convert_to_markdown` | ✅ | Convert webpage to clean markdown for LLM consumption | |
| `smart_extract` | ✅ 🆓 | Extract data using natural language queries | Works without LLM; optionally enhanced with free providers |
| `extract_links` | ✅ | Extract all links with filtering options | |
| `extract_forms` | ✅ | Extract all forms with field details | |
| `extract_metadata` | ✅ | Extract OG tags, Twitter cards, JSON-LD structured data | |
| `extract_tables` | ✅ | Extract tables as JSON, CSV, or Markdown | |
| `wait_and_extract` | ✅ | Wait for dynamic elements and extract content | |
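To make the kind of data `extract_metadata` reports more concrete, here is a minimal standard-library sketch that collects Open Graph and Twitter card tags from an HTML string. It is an illustration only, not Crawilfy's implementation: `MetaTagParser` and `extract_meta` are hypothetical names, and JSON-LD handling is left out.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta> tags whose key starts with "og:" or "twitter:"."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}
        self.twitter = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph uses property="og:..."; Twitter cards usually use name="twitter:...".
        key = attr.get("property") or attr.get("name") or ""
        content = attr.get("content")
        if content is None:
            return
        if key.startswith("og:"):
            self.open_graph[key] = content
        elif key.startswith("twitter:"):
            self.twitter[key] = content


def extract_meta(html: str) -> dict:
    """Parse an HTML document and return the social metadata it declares."""
    parser = MetaTagParser()
    parser.feed(html)
    return {"open_graph": parser.open_graph, "twitter": parser.twitter}
```

Calling `extract_meta(page_html)` returns two dictionaries keyed by the tag name, which is roughly the shape one would expect a metadata extractor to produce.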
Network & Sitemap
| Tool | Status | Description | Notes |
|---|---|---|---|
| `analyze_sitemap` | ✅ | Analyze sitemap.xml to extract URLs and metadata | |
| `check_robots` | ✅ | Analyze robots.txt for crawl rules and sitemaps | |
| `monitor_network` | ✅ | Monitor network traffic for a specified duration | |
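For readers unfamiliar with what a robots.txt analysis yields, the sketch below uses Python's standard `urllib.robotparser` to pull out crawl rules, the crawl delay, and sitemap URLs. It is not how `check_robots` is implemented; the `summarize_robots` helper and the inlined robots.txt content are assumptions made so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""


def summarize_robots(text: str, user_agent: str = "*") -> dict:
    """Parse robots.txt text and report the rules that apply to `user_agent`."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return {
        "crawl_delay": parser.crawl_delay(user_agent),
        "sitemaps": parser.site_maps() or [],  # site_maps() needs Python 3.8+
        "can_fetch_admin": parser.can_fetch(user_agent, "https://example.com/admin/users"),
        "can_fetch_home": parser.can_fetch(user_agent, "https://example.com/"),
    }
```

The sitemap URLs found this way are natural inputs for a follow-up sitemap analysis.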
Page Interaction
| Tool | Status | Description | Notes |
|---|---|---|---|
| `take_screenshot` | ✅ | Take full-page or viewport screenshots | |
| `execute_js` | ✅ | Execute JavaScript on a page and return results | |
| `get_cookies` | ✅ | Get all cookies from a page/domain | |
| `get_storage` | ✅ | Get localStorage and sessionStorage | |
| `fill_form` | ✅ | Automatically fill form fields with provided data | |
Session & Proxy Management
| Tool | Status | Description | Notes |
|---|---|---|---|
| `save_session` | ✅ | Save browser session (cookies, storage) for reuse | |
| `load_session` | ✅ | Load a previously saved session | |
| `list_sessions` | ✅ | List all saved sessions | |
| `configure_proxies` | ✅ | Configure proxy pool with rotation strategies | |
| `get_proxy_stats` | ✅ | Get proxy pool health and usage statistics | |
| `add_proxy` | ✅ | Add a proxy to the pool | |
| `remove_proxy` | ✅ | Remove a proxy from the pool | |
| `test_proxy` | ✅ | Test a proxy's connectivity | |
Performance & Analysis
| Tool | Status | Description | Notes |
|---|---|---|---|
| `measure_performance` | ✅ | Measure page load timing and Core Web Vitals | |
| `analyze_resources` | ✅ | Analyze all loaded resources (scripts, images, fonts) | |
| `check_accessibility` | ✅ | Run accessibility checks and report issues | |
| `compare_pages` | ✅ | Compare two pages for structure/content differences | |
Stealth & Anti-Detection
| Tool | Status | Description | Notes |
|---|---|---|---|
| `stealth_request` | ✅ | Make HTTP requests with TLS fingerprint impersonation | |
| `solve_captcha` | 🔧 | Detect and solve CAPTCHAs (reCAPTCHA, hCaptcha, Turnstile) | Requires `ANTICAPTCHA_API_KEY` or `CAPSOLVER_API_KEY` |
Advanced (CDP & Cache)
| Tool | Status | Description | Notes |
|---|---|---|---|
| `execute_cdp` | ✅ | Execute raw Chrome DevTools Protocol commands | |
| `get_dom_tree` | ✅ | Get full DOM tree via CDP | |
| `clear_cache` | ✅ | Clear cached pages, responses, or state snapshots | |
| `get_cache_stats` | ✅ | Get cache statistics | |
| `configure_rate_limit` | ✅ | Configure rate limiting per domain | |
| `get_rate_limit_stats` | ✅ | Get rate limiter statistics | |
System
| Tool | Status | Description | Notes |
|---|---|---|---|
| `health_check` | ✅ | Check health of server, browser pool, and storage | |
Features
- ✅ 55 Powerful Tools - From deep analysis to crawler generation
- ✅ Stealth Mode - TLS fingerprint impersonation, anti-detection
- ✅ AI-Powered Extraction - Natural language queries for data extraction
- ✅ Session Recording - Record and replay browser sessions
- ✅ Auto Crawler Generation - Generate Python/Playwright/YAML crawlers
- ✅ Proxy Pool - Rotation strategies, health checking
- ✅ Rate Limiting - Per-domain rate limits with backoff
- ✅ CAPTCHA Solving - reCAPTCHA, hCaptcha, Cloudflare Turnstile
- ✅ Technology Detection - Detect CMS, frameworks, CDNs
- ✅ Performance Metrics - Core Web Vitals, resource analysis
- ✅ Accessibility Checks - Automated a11y auditing
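The idea behind per-domain rate limits with backoff can be pictured with a small sketch. The class below is hypothetical and is not Crawilfy's rate limiter: it assumes a fixed minimum interval per domain that doubles after each consecutive failure, up to a cap. The caller passes in the current time, which keeps the logic deterministic and easy to test.

```python
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Track, per domain, the earliest time the next request may be sent."""

    def __init__(self, min_interval: float = 1.0, max_backoff: float = 60.0):
        self.min_interval = min_interval
        self.max_backoff = max_backoff
        self._next_allowed = {}  # domain -> earliest time of the next request
        self._failures = {}      # domain -> consecutive failure count

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to wait before `url` may be requested at time `now`."""
        domain = urlsplit(url).netloc
        return max(0.0, self._next_allowed.get(domain, 0.0) - now)

    def record(self, url: str, now: float, ok: bool) -> None:
        """Record a finished request; failures double the delay up to the cap."""
        domain = urlsplit(url).netloc
        failures = 0 if ok else self._failures.get(domain, 0) + 1
        self._failures[domain] = failures
        delay = min(self.min_interval * (2 ** failures), self.max_backoff)
        self._next_allowed[domain] = now + delay
```

In a real crawler, `now` would typically come from `time.monotonic()`, and the caller would sleep for `wait_time(...)` seconds before each request.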
Configuration (Optional)
Customize behavior with environment variables:
{
"mcpServers": {
"crawilfy": {
"command": "uvx",
"args": ["crawilfy-mcp-server"],
"env": {
"CRAWILFY_HEADLESS": "true",
"CRAWILFY_BROWSER": "chromium",
"CRAWILFY_NAV_TIMEOUT": "30.0",
"CRAWILFY_OP_TIMEOUT": "60.0",
"CRAWILFY_POOL_SIZE": "5"
}
}
}
}
| Variable | Description | Default |
|---|---|---|
| `CRAWILFY_HEADLESS` | Run browser in background | `true` |
| `CRAWILFY_BROWSER` | Browser type (chromium/firefox/webkit) | `chromium` |
| `CRAWILFY_NAV_TIMEOUT` | Page load timeout (seconds) | `30.0` |
| `CRAWILFY_OP_TIMEOUT` | Operation timeout (seconds) | `60.0` |
| `CRAWILFY_POOL_SIZE` | Max browser instances | `5` |
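Environment variables always reach a process as strings, which is why every value in the `env` block is quoted. The sketch below shows one way such variables could be turned into typed settings using the defaults from the table. It is an assumption about how parsing might look, not the server's actual code: `CrawlerSettings` and `load_settings` are hypothetical names, and accepting `1` or `yes` as truthy is a choice made for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CrawlerSettings:
    headless: bool = True
    browser: str = "chromium"
    nav_timeout: float = 30.0
    op_timeout: float = 60.0
    pool_size: int = 5


def load_settings(env: Mapping[str, str] = os.environ) -> CrawlerSettings:
    """Build typed settings from CRAWILFY_* variables, falling back to defaults."""
    browser = env.get("CRAWILFY_BROWSER", "chromium").lower()
    if browser not in {"chromium", "firefox", "webkit"}:
        raise ValueError(f"unsupported browser: {browser!r}")
    headless_raw = env.get("CRAWILFY_HEADLESS", "true").strip().lower()
    return CrawlerSettings(
        headless=headless_raw in {"1", "true", "yes"},  # assumed truthy spellings
        browser=browser,
        nav_timeout=float(env.get("CRAWILFY_NAV_TIMEOUT", "30.0")),
        op_timeout=float(env.get("CRAWILFY_OP_TIMEOUT", "60.0")),
        pool_size=int(env.get("CRAWILFY_POOL_SIZE", "5")),
    )
```

Passing a plain dictionary instead of `os.environ` makes it easy to check a configuration before putting it in the MCP settings file.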
AI-Powered Smart Extraction (Optional)
The smart_extract tool works without any paid API using pattern matching. Optionally enable LLM enhancement for better accuracy with any OpenAI-compatible API - including FREE options!
Option 1: OpenRouter (Recommended - FREE Models Available!)
{
"mcpServers": {
"crawilfy": {
"command": "uvx",
"args": ["crawilfy-mcp-server"],
"env": {
"CRAWILFY_LLM_PROVIDER": "openrouter",
"CRAWILFY_LLM_API_KEY": "sk-or-v1-your-key-here",
"CRAWILFY_LLM_MODEL": "meta-llama/llama-3.2-3b-instruct:free"
}
}
}
}
Free models: meta-llama/llama-3.2-3b-instruct:free, google/gemma-2-9b-it:free, qwen/qwen-2-7b-instruct:free
Get your API key at: openrouter.ai/keys
Option 2: Groq (FREE Tier, Very Fast!)
{
"env": {
"CRAWILFY_LLM_PROVIDER": "groq",
"CRAWILFY_LLM_API_KEY": "gsk_your-key-here",
"CRAWILFY_LLM_MODEL": "llama-3.1-8b-instant"
}
}
Get your API key at: console.groq.com/keys
Option 3: Ollama (100% FREE - Runs Locally)
{
"env": {
"CRAWILFY_LLM_PROVIDER": "ollama",
"CRAWILFY_LLM_MODEL": "llama3.2"
}
}
Install Ollama from ollama.ai, then run: ollama pull llama3.2
No API key needed!
Option 4: Any OpenAI-Compatible API
For custom providers (Factory.ai, KiloCode, MegaLLM, etc.):
{
"env": {
"CRAWILFY_LLM_BASE_URL": "https://your-api.com/v1",
"CRAWILFY_LLM_API_KEY": "your-api-key",
"CRAWILFY_LLM_MODEL": "your-model-name"
}
}
LLM Configuration Variables
| Variable | Description | Default |
|---|---|---|
| `CRAWILFY_LLM_PROVIDER` | Provider shortcut: `openrouter`, `groq`, `ollama`, `together`, `deepseek`, `openai` | - |
| `CRAWILFY_LLM_API_KEY` | API key for the provider (not needed for Ollama) | - |
| `CRAWILFY_LLM_BASE_URL` | Custom API base URL (auto-set if using provider) | - |
| `CRAWILFY_LLM_MODEL` | Model name (auto-selected per provider if not set) | varies |
| `OPENAI_API_KEY` | Legacy: also works for OpenAI provider | - |
See llm-config-examples.env for more examples.
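The interplay between the provider shortcut, the custom base URL, and the API key can be sketched as follows. This is a guess at the resolution order, not the package's actual logic: `resolve_llm_config` is a hypothetical function, the base URLs are the publicly documented OpenAI-compatible endpoints of each service (only four providers are listed here), and the precedence of `CRAWILFY_LLM_BASE_URL` over the shortcut is an assumption.

```python
from typing import Mapping, Optional

# Publicly documented OpenAI-compatible endpoints; the package's own table may differ.
PROVIDER_BASE_URLS = {
    "openrouter": "https://openrouter.ai/api/v1",
    "groq": "https://api.groq.com/openai/v1",
    "ollama": "http://localhost:11434/v1",
    "openai": "https://api.openai.com/v1",
}


def resolve_llm_config(env: Mapping[str, str]) -> Optional[dict]:
    """Return LLM connection settings, or None when no LLM is configured."""
    provider = env.get("CRAWILFY_LLM_PROVIDER", "").lower()
    # Assumption: an explicit base URL wins over the provider shortcut.
    base_url = env.get("CRAWILFY_LLM_BASE_URL") or PROVIDER_BASE_URLS.get(provider)
    if not base_url:
        return None  # nothing configured: pattern matching only
    api_key = env.get("CRAWILFY_LLM_API_KEY")
    if not api_key and provider == "openai":
        api_key = env.get("OPENAI_API_KEY")  # legacy variable from the table
    if provider != "ollama" and not api_key:
        raise ValueError("an API key is required for this provider")
    return {
        "base_url": base_url,
        "api_key": api_key,
        "model": env.get("CRAWILFY_LLM_MODEL"),
    }
```

Returning `None` when nothing is set mirrors the documented behaviour that `smart_extract` still works without an LLM.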
Manual Installation (For Development)
# Clone the repository
git clone https://github.com/emad-dev/crawilfy-mcp-server.git
cd crawilfy-mcp-server
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install with dependencies
pip install -e .
# Install browser
playwright install chromium
Then configure MCP with local path:
{
"mcpServers": {
"crawilfy": {
"command": "/path/to/crawilfy-mcp-server/venv/bin/python",
"args": ["-m", "src.mcp.server"],
"cwd": "/path/to/crawilfy-mcp-server"
}
}
}
Python API
Use Crawilfy programmatically in your own code:
import asyncio

from src.core.browser.pool import BrowserPool
from src.core.browser.stealth import create_stealth_context
from src.intelligence.network.api_discovery import APIDiscoveryEngine


async def analyze_site(url):
    pool = BrowserPool()
    await pool.initialize()
    try:
        context = await create_stealth_context(pool)
        page = await context.new_page()
        await page.goto(url)
        # Your analysis code here
        await context.close()
    finally:
        await pool.close()


asyncio.run(analyze_site("https://example.com"))
CLI Usage
# Deep analysis
crawl deep-analyze https://example.com --full
# Discover APIs
crawl discover-apis https://example.com --include-hidden
# Record session
crawl record https://example.com --output session.json
# Generate crawler
crawl generate --from-recording session.json --output crawler.yaml
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
# Development setup
pip install -e ".[dev]"
# Run tests
pytest
# Code formatting
black src tests
ruff check src tests
License
MIT License - see LICENSE file for details.
Made with ❤️ by emad.dev