Advanced Web Crawling Platform with Deep Analysis and MCP Server

Crawilfy MCP Server

Python 3.10+ | License: MIT | Code style: black

Advanced web crawling platform with deep analysis capabilities, automatic API discovery, and crawler generation. Built as an MCP (Model Context Protocol) server for seamless integration with AI assistants like Cursor, Claude Code, and Windsurf.


⚡ Quick Start (Single Command)

Option 1: Using uvx (Recommended - No Installation Required)

This is the simplest way to use Crawilfy. Add the following to your MCP configuration:

{
  "mcpServers": {
    "crawilfy": {
      "command": "uvx",
      "args": ["crawilfy-mcp-server"]
    }
  }
}

Note: Requires uv to be installed. Install with: curl -LsSf https://astral.sh/uv/install.sh | sh
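
If you would rather script this step than edit the file by hand, the sketch below merges the uvx entry shown above into an existing settings dictionary without touching other servers. The helper names add_mcp_server and update_settings_file are hypothetical, and the location of the settings file depends on your MCP client, so treat this as an illustration rather than an official installer.

```python
import json
from pathlib import Path


def add_mcp_server(settings: dict, name: str, command: str, args: list[str]) -> dict:
    """Return a copy of `settings` with one entry added under "mcpServers"."""
    merged = dict(settings)
    # Copy the existing server map so the caller's dictionary is not mutated.
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged


def update_settings_file(path: Path) -> None:
    """Read a JSON settings file (if present), add the Crawilfy entry, write it back."""
    settings = json.loads(path.read_text()) if path.exists() else {}
    updated = add_mcp_server(settings, "crawilfy", "uvx", ["crawilfy-mcp-server"])
    path.write_text(json.dumps(updated, indent=2) + "\n")
```

Calling update_settings_file(Path("settings.json")) on a file that already lists other servers keeps them and adds the crawilfy entry alongside.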

Option 2: Using pipx

{
  "mcpServers": {
    "crawilfy": {
      "command": "pipx",
      "args": ["run", "crawilfy-mcp-server"]
    }
  }
}

Option 3: Using pip (Global Install)

pip install crawilfy-mcp-server
playwright install chromium

Then add to your MCP configuration:

{
  "mcpServers": {
    "crawilfy": {
      "command": "python",
      "args": ["-m", "src.mcp.server"]
    }
  }
}

🔧 Where to Add MCP Configuration

For Cursor IDE

  1. Open Settings (Cmd/Ctrl + ,)
  2. Search for "MCP"
  3. Click "Edit in settings.json"
  4. Add the configuration under mcpServers

For Claude Code

  1. Open the MCP settings file at ~/.config/claude/mcp_settings.json
  2. Add the configuration

For Windsurf

  1. Open Settings → MCP Servers
  2. Add the configuration

๐Ÿ› ๏ธ Available Tools (55 Total)

Test Status Legend: โœ… Tested & Working | โš ๏ธ Works with limitations | ๐Ÿ”ง Requires config | ๐Ÿ†“ No paid API needed

๐Ÿ” Deep Analysis & Discovery

Tool Status Description Notes
deep_analyze โœ… Comprehensive analysis of a website (network + JS + security)
discover_apis โœ… Discover all REST and GraphQL APIs including hidden endpoints
introspect_graphql โœ… Extract complete GraphQL schema using introspection
execute_graphql โœ… Execute GraphQL queries and mutations
analyze_websocket โœ… Intercept and analyze WebSocket connections Returns empty if no WS found
analyze_auth โœ… Analyze authentication flow and mechanisms
detect_protection โœ… Detect anti-bot systems, CAPTCHAs, and fingerprinting
detect_technology โœ… Detect technology stack (CMS, frameworks, CDN, analytics)

📜 JavaScript Analysis

| Tool | Status | Description | Notes |
|------|--------|-------------|-------|
| deobfuscate_js | ✅ 🆓 | Deobfuscate JavaScript code with multiple techniques | No browser needed |
| extract_from_js | ✅ 🆓 | Extract API endpoints, URLs, constants, and auth logic from JS | No browser needed |

🎬 Session Recording & Crawlers

| Tool | Status | Description | Notes |
|------|--------|-------------|-------|
| record_session | ✅ | Start recording an interactive browser session | |
| stop_recording | ✅ | Stop an active recording and save it | |
| list_recordings | ✅ | List all available recordings (active and saved) | |
| get_recording_status | ✅ | Get status and details of a specific recording | |
| delete_recording | ✅ | Delete a saved recording | |
| export_recording | ✅ | Export recording to JSON, HAR, or Playwright test format | |
| generate_crawler | ✅ | Generate crawler script from recording (YAML, Python, Playwright) | |

📄 Content Extraction

| Tool | Status | Description | Notes |
|------|--------|-------------|-------|
| extract_article | ✅ | Extract clean article content with intelligent parsing | |
| convert_to_markdown | ✅ | Convert webpage to clean markdown for LLM consumption | |
| smart_extract | ✅ 🆓 | Extract data using natural language queries | Works without LLM; optionally enhanced with free providers |
| extract_links | ✅ | Extract all links with filtering options | |
| extract_forms | ✅ | Extract all forms with field details | |
| extract_metadata | ✅ | Extract OG tags, Twitter cards, JSON-LD structured data | |
| extract_tables | ✅ | Extract tables as JSON, CSV, or Markdown | |
| wait_and_extract | ✅ | Wait for dynamic elements and extract content | |

๐ŸŒ Network & Sitemap

Tool Status Description Notes
analyze_sitemap โœ… Analyze sitemap.xml to extract URLs and metadata
check_robots โœ… Analyze robots.txt for crawl rules and sitemaps
monitor_network โœ… Monitor network traffic for a specified duration
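
To give a feel for what analyzing robots.txt for crawl rules and sitemaps involves, here is a small standalone sketch using Python's standard urllib.robotparser. It is not the implementation behind check_robots, and its output shape is an assumption made for illustration; the sample file and the summarize_robots helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Allow: /
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""


def summarize_robots(robots_txt: str, user_agent: str, urls: list[str]) -> dict:
    """Parse robots.txt text and report permissions, crawl delay, and sitemaps."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return {
        "allowed": {url: parser.can_fetch(user_agent, url) for url in urls},
        "crawl_delay": parser.crawl_delay(user_agent),
        # site_maps() returns None when no Sitemap lines are present.
        "sitemaps": parser.site_maps() or [],
    }
```

For the sample above, a URL under /admin/ is reported as disallowed, other paths as allowed, the crawl delay as 2, and one sitemap URL is listed.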

๐Ÿ–ฅ๏ธ Page Interaction

Tool Status Description Notes
take_screenshot โœ… Take full-page or viewport screenshots
execute_js โœ… Execute JavaScript on a page and return results
get_cookies โœ… Get all cookies from a page/domain
get_storage โœ… Get localStorage and sessionStorage
fill_form โœ… Automatically fill form fields with provided data

๐Ÿ” Session & Proxy Management

Tool Status Description Notes
save_session โœ… Save browser session (cookies, storage) for reuse
load_session โœ… Load a previously saved session
list_sessions โœ… List all saved sessions
configure_proxies โœ… Configure proxy pool with rotation strategies
get_proxy_stats โœ… Get proxy pool health and usage statistics
add_proxy โœ… Add a proxy to the pool
remove_proxy โœ… Remove a proxy from the pool
test_proxy โœ… Test a proxy's connectivity

📊 Performance & Analysis

| Tool | Status | Description | Notes |
|------|--------|-------------|-------|
| measure_performance | ✅ | Measure page load timing and Core Web Vitals | |
| analyze_resources | ✅ | Analyze all loaded resources (scripts, images, fonts) | |
| check_accessibility | ✅ | Run accessibility checks and report issues | |
| compare_pages | ✅ | Compare two pages for structure/content differences | |

๐Ÿ›ก๏ธ Stealth & Anti-Detection

Tool Status Description Notes
stealth_request โœ… Make HTTP requests with TLS fingerprint impersonation
solve_captcha ๐Ÿ”ง Detect and solve CAPTCHAs (reCAPTCHA, hCaptcha, Turnstile) Requires ANTICAPTCHA_API_KEY or CAPSOLVER_API_KEY

โš™๏ธ Advanced (CDP & Cache)

Tool Status Description Notes
execute_cdp โœ… Execute raw Chrome DevTools Protocol commands
get_dom_tree โœ… Get full DOM tree via CDP
clear_cache โœ… Clear cached pages, responses, or state snapshots
get_cache_stats โœ… Get cache statistics
configure_rate_limit โœ… Configure rate limiting per domain
get_rate_limit_stats โœ… Get rate limiter statistics

🔧 System

| Tool | Status | Description | Notes |
|------|--------|-------------|-------|
| health_check | ✅ | Check health of server, browser pool, and storage | |
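
AI assistants normally call these tools for you, but it can help to see what a call looks like on the wire. In the Model Context Protocol, a client invokes a tool by sending a JSON-RPC 2.0 request with the method tools/call. The sketch below only builds such a message; the build_tool_call helper is hypothetical, and the url argument is an assumed parameter name, since tool parameters are not documented here.

```python
import json
from itertools import count

# Simple incrementing request ids, as JSON-RPC requires a unique id per request.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request for the given tool and arguments."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# "url" is an assumed argument name used only for illustration.
print(build_tool_call("deep_analyze", {"url": "https://example.com"}))
```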

✨ Features

  • ✅ 55 Powerful Tools - From deep analysis to crawler generation
  • ✅ Stealth Mode - TLS fingerprint impersonation, anti-detection
  • ✅ AI-Powered Extraction - Natural language queries for data extraction
  • ✅ Session Recording - Record and replay browser sessions
  • ✅ Auto Crawler Generation - Generate Python/Playwright/YAML crawlers
  • ✅ Proxy Pool - Rotation strategies, health checking
  • ✅ Rate Limiting - Per-domain rate limits with backoff
  • ✅ CAPTCHA Solving - reCAPTCHA, hCaptcha, Cloudflare Turnstile
  • ✅ Technology Detection - Detect CMS, frameworks, CDNs
  • ✅ Performance Metrics - Core Web Vitals, resource analysis
  • ✅ Accessibility Checks - Automated a11y auditing

🔧 Configuration (Optional)

Customize behavior with environment variables:

{
  "mcpServers": {
    "crawilfy": {
      "command": "uvx",
      "args": ["crawilfy-mcp-server"],
      "env": {
        "CRAWILFY_HEADLESS": "true",
        "CRAWILFY_BROWSER": "chromium",
        "CRAWILFY_NAV_TIMEOUT": "30.0",
        "CRAWILFY_OP_TIMEOUT": "60.0",
        "CRAWILFY_POOL_SIZE": "5"
      }
    }
  }
}

| Variable | Description | Default |
|----------|-------------|---------|
| CRAWILFY_HEADLESS | Run browser in background | true |
| CRAWILFY_BROWSER | Browser type (chromium/firefox/webkit) | chromium |
| CRAWILFY_NAV_TIMEOUT | Page load timeout (seconds) | 30.0 |
| CRAWILFY_OP_TIMEOUT | Operation timeout (seconds) | 60.0 |
| CRAWILFY_POOL_SIZE | Max browser instances | 5 |

🤖 AI-Powered Smart Extraction (Optional)

The smart_extract tool works without any paid API by using pattern matching. For better accuracy, you can optionally enable LLM enhancement through any OpenAI-compatible API, including FREE options!

Option 1: OpenRouter (Recommended - FREE Models Available!)

{
  "mcpServers": {
    "crawilfy": {
      "command": "uvx",
      "args": ["crawilfy-mcp-server"],
      "env": {
        "CRAWILFY_LLM_PROVIDER": "openrouter",
        "CRAWILFY_LLM_API_KEY": "sk-or-v1-your-key-here",
        "CRAWILFY_LLM_MODEL": "meta-llama/llama-3.2-3b-instruct:free"
      }
    }
  }
}

Free models: meta-llama/llama-3.2-3b-instruct:free, google/gemma-2-9b-it:free, qwen/qwen-2-7b-instruct:free

Get your API key at: openrouter.ai/keys

Option 2: Groq (FREE Tier, Very Fast!)

{
  "env": {
    "CRAWILFY_LLM_PROVIDER": "groq",
    "CRAWILFY_LLM_API_KEY": "gsk_your-key-here",
    "CRAWILFY_LLM_MODEL": "llama-3.1-8b-instant"
  }
}

Get your API key at: console.groq.com/keys

Option 3: Ollama (100% FREE - Runs Locally)

{
  "env": {
    "CRAWILFY_LLM_PROVIDER": "ollama",
    "CRAWILFY_LLM_MODEL": "llama3.2"
  }
}

Install Ollama from ollama.ai, then run: ollama pull llama3.2

No API key needed!

Option 4: Any OpenAI-Compatible API

For custom providers (Factory.ai, KiloCode, MegaLLM, etc.):

{
  "env": {
    "CRAWILFY_LLM_BASE_URL": "https://your-api.com/v1",
    "CRAWILFY_LLM_API_KEY": "your-api-key",
    "CRAWILFY_LLM_MODEL": "your-model-name"
  }
}

LLM Configuration Variables

| Variable | Description | Default |
|----------|-------------|---------|
| CRAWILFY_LLM_PROVIDER | Provider shortcut: openrouter, groq, ollama, together, deepseek, openai | - |
| CRAWILFY_LLM_API_KEY | API key for the provider (not needed for Ollama) | - |
| CRAWILFY_LLM_BASE_URL | Custom API base URL (auto-set if using provider) | - |
| CRAWILFY_LLM_MODEL | Model name (auto-selected per provider if not set) | varies |
| OPENAI_API_KEY | Legacy: also works for OpenAI provider | - |

See llm-config-examples.env for more examples.


📦 Manual Installation (For Development)

# Clone the repository
git clone https://github.com/emad-dev/crawilfy-mcp-server.git
cd crawilfy-mcp-server

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install with dependencies
pip install -e .

# Install browser
playwright install chromium

Then configure MCP with local path:

{
  "mcpServers": {
    "crawilfy": {
      "command": "/path/to/crawilfy-mcp-server/venv/bin/python",
      "args": ["-m", "src.mcp.server"],
      "cwd": "/path/to/crawilfy-mcp-server"
    }
  }
}

💻 Python API

Use Crawilfy programmatically in your own code:

import asyncio
from src.core.browser.pool import BrowserPool
from src.core.browser.stealth import create_stealth_context
from src.intelligence.network.api_discovery import APIDiscoveryEngine

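async def analyze_site(url):
    # Start the shared browser pool before creating any contexts.
    pool = BrowserPool()
    await pool.initialize()

    try:
        context = await create_stealth_context(pool)
        try:
            page = await context.new_page()
            await page.goto(url)

            # Your analysis code here
        finally:
            # Close the context even if navigation or analysis raises.
            await context.close()
    finally:
        # Always release the pool's browsers.
        await pool.close()

asyncio.run(analyze_site("https://example.com"))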

🧪 CLI Usage

# Deep analysis
crawl deep-analyze https://example.com --full

# Discover APIs
crawl discover-apis https://example.com --include-hidden

# Record session
crawl record https://example.com --output session.json

# Generate crawler
crawl generate --from-recording session.json --output crawler.yaml

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

# Development setup
pip install -e ".[dev]"

# Run tests
pytest

# Code formatting
black src tests
ruff check src tests

📄 License

MIT License - see LICENSE file for details.


Made with ❤️ by emad.dev

Source Distribution

crawilfy_mcp_server-0.3.5.tar.gz (143.3 kB, Source)

  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 68e723dbd2cc8a833c34464c98e105ce6c916277eab2355103f4d743a73501d1 |
| MD5 | af7e5a136634be6ec4dfd498c9850233 |
| BLAKE2b-256 | ea28f2994f344e20a69dc4665ca458b8599428a0fcd49638a0b5511a3965e2c9 |

Built Distribution

crawilfy_mcp_server-0.3.5-py3-none-any.whl (121.7 kB, Python 3)

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 0e04be1fc723a55a94fe51472ba71f9601318acf5b12e60af1afe8c8f1fd73eb |
| MD5 | d7b945107b63e9d505c6f5f9bdd19162 |
| BLAKE2b-256 | 312339afcf0774780ae528db9a4c3d57b26bbf95716d930799efaed2ce356324 |
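
If you download one of these files manually, you can compare its SHA256 digest with the value listed above before installing it. The following is a minimal sketch using Python's standard hashlib; the helper names are hypothetical, and the file path in the usage note is a placeholder for wherever you saved the download.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_hash(Path("crawilfy_mcp_server-0.3.5.tar.gz"), "68e723db...") should return True only when the full published SHA256 value is passed and the file is intact.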
