Advanced Web Crawling Platform with Deep Analysis and MCP Server

These details have not been verified by PyPI

Project description

Crawilfy MCP Server

Advanced web crawling platform with deep analysis capabilities, automatic API discovery, and crawler generation. Built as an MCP (Model Context Protocol) server for seamless integration with AI assistants like Cursor, Claude Code, and Windsurf.

⚡ Quick Start (Single Command)

Option 1: Using uvx (Recommended - No Installation Required)

The simplest way to use Crawilfy is to add this to your MCP configuration:

{
  "mcpServers": {
    "crawilfy": {
      "command": "uvx",
      "args": ["crawilfy-mcp-server"]
    }
  }
}

Note: Requires uv to be installed. Install with: curl -LsSf https://astral.sh/uv/install.sh | sh

Option 2: Using pipx

{
  "mcpServers": {
    "crawilfy": {
      "command": "pipx",
      "args": ["run", "crawilfy-mcp-server"]
    }
  }
}

Option 3: Using pip (Global Install)

pip install crawilfy-mcp-server
playwright install chromium

Then add to your MCP configuration:

{
  "mcpServers": {
    "crawilfy": {
      "command": "python",
      "args": ["-m", "src.mcp.server"]
    }
  }
}
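The three options above differ only in the launcher command, so the choice can be scripted. The following is a minimal sketch, not part of Crawilfy: the helper names `pick_server_entry` and `build_config` are hypothetical, and it assumes a launcher counts as usable when it is found on PATH (for the `python` entry, the package must already be installed with pip).

```python
import shutil

# Launcher entries copied from Options 1-3 above, in order of preference.
LAUNCHERS = [
    ("uvx", {"command": "uvx", "args": ["crawilfy-mcp-server"]}),
    ("pipx", {"command": "pipx", "args": ["run", "crawilfy-mcp-server"]}),
    ("python", {"command": "python", "args": ["-m", "src.mcp.server"]}),
]


def pick_server_entry(which=shutil.which):
    """Return the config entry for the first launcher found on PATH.

    `which` is injectable so the choice can be checked without touching PATH.
    """
    for executable, entry in LAUNCHERS:
        if which(executable):
            return entry
    raise RuntimeError("none of uvx, pipx, or python was found on PATH")


def build_config(which=shutil.which):
    """Wrap the chosen entry in the mcpServers structure shown above."""
    return {"mcpServers": {"crawilfy": pick_server_entry(which)}}
```

Calling `json.dumps(build_config(), indent=2)` would then produce a block that can be pasted into the MCP configuration.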

🔧 Where to Add MCP Configuration

For Cursor IDE

1. Open Settings (Cmd/Ctrl + ,)
2. Search for "MCP"
3. Click "Edit in settings.json"
4. Add the configuration under mcpServers

For Claude Code

1. Open the MCP settings file at ~/.config/claude/mcp_settings.json
2. Add the configuration

For Windsurf

1. Open Settings → MCP Servers
2. Add the configuration
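If the settings file already lists other MCP servers, the Crawilfy entry has to be merged in rather than pasted over them. The sketch below shows one way to do that with the standard library. The function name `add_crawilfy` is hypothetical, and it assumes the settings file is strict JSON (a file containing comments would fail to parse).

```python
import json
from pathlib import Path

# Entry copied from Option 1 (uvx) in the Quick Start.
CRAWILFY_ENTRY = {"command": "uvx", "args": ["crawilfy-mcp-server"]}


def add_crawilfy(settings_path):
    """Add the crawilfy entry to a JSON settings file, keeping other servers."""
    path = Path(settings_path)
    # Start from the existing settings if the file is present and non-empty.
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    settings = json.loads(text) if text.strip() else {}
    # setdefault keeps any servers that are already configured.
    settings.setdefault("mcpServers", {})["crawilfy"] = CRAWILFY_ENTRY
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

For example, `add_crawilfy("/path/to/settings.json")` updates the file in place; back up the file first if it holds other configuration you care about.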

🛠️ Available Tools (55 Total)

Test Status Legend: ✅ Tested & Working | ⚠️ Works with limitations | 🔧 Requires config | 🆓 No paid API needed

🔍 Deep Analysis & Discovery

Tool	Status	Description	Notes
`deep_analyze`	✅	Comprehensive analysis of a website (network + JS + security)
`discover_apis`	✅	Discover all REST and GraphQL APIs including hidden endpoints
`introspect_graphql`	✅	Extract complete GraphQL schema using introspection
`execute_graphql`	✅	Execute GraphQL queries and mutations
`analyze_websocket`	✅	Intercept and analyze WebSocket connections	Returns empty if no WS found
`analyze_auth`	✅	Analyze authentication flow and mechanisms
`detect_protection`	✅	Detect anti-bot systems, CAPTCHAs, and fingerprinting
`detect_technology`	✅	Detect technology stack (CMS, frameworks, CDN, analytics)

📜 JavaScript Analysis

Tool	Status	Description	Notes
`deobfuscate_js`	✅ 🆓	Deobfuscate JavaScript code with multiple techniques	No browser needed
`extract_from_js`	✅ 🆓	Extract API endpoints, URLs, constants, and auth logic from JS	No browser needed

🎬 Session Recording & Crawlers

Tool	Status	Description
`record_session`	✅	Start recording an interactive browser session
`stop_recording`	✅	Stop an active recording and save it
`list_recordings`	✅	List all available recordings (active and saved)
`get_recording_status`	✅	Get status and details of a specific recording
`delete_recording`	✅	Delete a saved recording
`export_recording`	✅	Export recording to JSON, HAR, or Playwright test format
`generate_crawler`	✅	Generate crawler script from recording (YAML, Python, Playwright)

📄 Content Extraction

Tool	Status	Description	Notes
`extract_article`	✅	Extract clean article content with intelligent parsing
`convert_to_markdown`	✅	Convert webpage to clean markdown for LLM consumption
`smart_extract`	✅ 🆓	Extract data using natural language queries	Works without LLM; optionally enhanced with free providers
`extract_links`	✅	Extract all links with filtering options
`extract_forms`	✅	Extract all forms with field details
`extract_metadata`	✅	Extract OG tags, Twitter cards, JSON-LD structured data
`extract_tables`	✅	Extract tables as JSON, CSV, or Markdown
`wait_and_extract`	✅	Wait for dynamic elements and extract content

🌐 Network & Sitemap

Tool	Status	Description
`analyze_sitemap`	✅	Analyze sitemap.xml to extract URLs and metadata
`check_robots`	✅	Analyze robots.txt for crawl rules and sitemaps
`monitor_network`	✅	Monitor network traffic for a specified duration

🖥️ Page Interaction

Tool	Status	Description
`take_screenshot`	✅	Take full-page or viewport screenshots
`execute_js`	✅	Execute JavaScript on a page and return results
`get_cookies`	✅	Get all cookies from a page/domain
`get_storage`	✅	Get localStorage and sessionStorage
`fill_form`	✅	Automatically fill form fields with provided data

🔐 Session & Proxy Management

Tool	Status	Description
`save_session`	✅	Save browser session (cookies, storage) for reuse
`load_session`	✅	Load a previously saved session
`list_sessions`	✅	List all saved sessions
`configure_proxies`	✅	Configure proxy pool with rotation strategies
`get_proxy_stats`	✅	Get proxy pool health and usage statistics
`add_proxy`	✅	Add a proxy to the pool
`remove_proxy`	✅	Remove a proxy from the pool
`test_proxy`	✅	Test a proxy's connectivity

📊 Performance & Analysis

Tool	Status	Description
`measure_performance`	✅	Measure page load timing and Core Web Vitals
`analyze_resources`	✅	Analyze all loaded resources (scripts, images, fonts)
`check_accessibility`	✅	Run accessibility checks and report issues
`compare_pages`	✅	Compare two pages for structure/content differences

🛡️ Stealth & Anti-Detection

Tool	Status	Description	Notes
`stealth_request`	✅	Make HTTP requests with TLS fingerprint impersonation
`solve_captcha`	🔧	Detect and solve CAPTCHAs (reCAPTCHA, hCaptcha, Turnstile)	Requires ANTICAPTCHA_API_KEY or CAPSOLVER_API_KEY
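Because `solve_captcha` is the only tool that needs extra configuration, it can help to check which of the two key variables is set before relying on it. This is a small sketch with a hypothetical helper name; it only inspects the environment and makes no claim about which provider the server prefers when both keys are present.

```python
import os

# Environment variable names taken from the solve_captcha note above.
CAPTCHA_KEY_VARS = ("ANTICAPTCHA_API_KEY", "CAPSOLVER_API_KEY")


def configured_captcha_keys(env=None):
    """Return the names of the CAPTCHA key variables that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in CAPTCHA_KEY_VARS if env.get(name, "").strip()]
```

An empty list means `solve_captcha` has no solver key available in the environment that was checked.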

⚙️ Advanced (CDP & Cache)

Tool	Status	Description
`execute_cdp`	✅	Execute raw Chrome DevTools Protocol commands
`get_dom_tree`	✅	Get full DOM tree via CDP
`clear_cache`	✅	Clear cached pages, responses, or state snapshots
`get_cache_stats`	✅	Get cache statistics
`configure_rate_limit`	✅	Configure rate limiting per domain
`get_rate_limit_stats`	✅	Get rate limiter statistics

🔧 System

Tool	Status	Description	Notes
`health_check`	✅	Check health of server, browser pool, and storage
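AI assistants normally invoke these tools for you, but it can be useful to see what a call looks like on the wire. MCP uses JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. The sketch below builds such a request with the standard library only. The helper name is hypothetical, and the `url` argument for `check_robots` is an assumption: consult the tool's input schema (returned by `tools/list`) for the real parameter names.

```python
import json
from itertools import count

# Simple increasing request ids, as JSON-RPC requires a unique id per request.
_request_ids = count(1)


def tool_call_request(tool_name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request in the shape MCP clients send."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# The "url" argument name is an assumption, not confirmed by the tool table.
request = tool_call_request("check_robots", {"url": "https://example.com"})
print(json.dumps(request, indent=2))
```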

✨ Features

✅ 55 Powerful Tools - From deep analysis to crawler generation
✅ Stealth Mode - TLS fingerprint impersonation, anti-detection
✅ AI-Powered Extraction - Natural language queries for data extraction
✅ Session Recording - Record and replay browser sessions
✅ Auto Crawler Generation - Generate Python/Playwright/YAML crawlers
✅ Proxy Pool - Rotation strategies, health checking
✅ Rate Limiting - Per-domain rate limits with backoff
✅ CAPTCHA Solving - reCAPTCHA, hCaptcha, Cloudflare Turnstile
✅ Technology Detection - Detect CMS, frameworks, CDNs
✅ Performance Metrics - Core Web Vitals, resource analysis
✅ Accessibility Checks - Automated a11y auditing

🔧 Configuration (Optional)

Customize behavior with environment variables:

{
  "mcpServers": {
    "crawilfy": {
      "command": "uvx",
      "args": ["crawilfy-mcp-server"],
      "env": {
        "CRAWILFY_HEADLESS": "true",
        "CRAWILFY_BROWSER": "chromium",
        "CRAWILFY_NAV_TIMEOUT": "30.0",
        "CRAWILFY_OP_TIMEOUT": "60.0",
        "CRAWILFY_POOL_SIZE": "5"
      }
    }
  }
}

Variable	Description	Default
`CRAWILFY_HEADLESS`	Run browser in background	`true`
`CRAWILFY_BROWSER`	Browser type (chromium/firefox/webkit)	`chromium`
`CRAWILFY_NAV_TIMEOUT`	Page load timeout (seconds)	`30.0`
`CRAWILFY_OP_TIMEOUT`	Operation timeout (seconds)	`60.0`
`CRAWILFY_POOL_SIZE`	Max browser instances	`5`
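All values in the `env` block are strings, so they have to be converted to booleans and numbers somewhere. The sketch below illustrates how the variables and defaults from the table could be parsed. It is an assumption about reasonable parsing rules, not the server's actual code, and the names `CrawilfySettings` and `load_settings` are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawilfySettings:
    # Defaults mirror the table above.
    headless: bool = True
    browser: str = "chromium"
    nav_timeout: float = 30.0
    op_timeout: float = 60.0
    pool_size: int = 5


def load_settings(env=None):
    """Parse the CRAWILFY_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    browser = env.get("CRAWILFY_BROWSER", "chromium").strip().lower()
    if browser not in {"chromium", "firefox", "webkit"}:
        raise ValueError(f"unsupported CRAWILFY_BROWSER: {browser!r}")
    return CrawilfySettings(
        # Assumption: only the string "true" (any case) enables headless mode.
        headless=env.get("CRAWILFY_HEADLESS", "true").strip().lower() == "true",
        browser=browser,
        nav_timeout=float(env.get("CRAWILFY_NAV_TIMEOUT", "30.0")),
        op_timeout=float(env.get("CRAWILFY_OP_TIMEOUT", "60.0")),
        pool_size=int(env.get("CRAWILFY_POOL_SIZE", "5")),
    )
```

Running `load_settings({"CRAWILFY_HEADLESS": "false"})` is a quick way to sanity-check a planned `env` block before adding it to the MCP configuration.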

🤖 AI-Powered Smart Extraction (Optional)

The `smart_extract` tool works without any paid API by falling back to pattern matching. Optionally, enable LLM enhancement for better accuracy with any OpenAI-compatible API, including FREE options!

Option 1: OpenRouter (Recommended - FREE Models Available!)

{
  "mcpServers": {
    "crawilfy": {
      "command": "uvx",
      "args": ["crawilfy-mcp-server"],
      "env": {
        "CRAWILFY_LLM_PROVIDER": "openrouter",
        "CRAWILFY_LLM_API_KEY": "sk-or-v1-your-key-here",
        "CRAWILFY_LLM_MODEL": "meta-llama/llama-3.2-3b-instruct:free"
      }
    }
  }
}

Free models: meta-llama/llama-3.2-3b-instruct:free, google/gemma-2-9b-it:free, qwen/qwen-2-7b-instruct:free

Get your API key at: openrouter.ai/keys

Option 2: Groq (FREE Tier, Very Fast!)

{
  "env": {
    "CRAWILFY_LLM_PROVIDER": "groq",
    "CRAWILFY_LLM_API_KEY": "gsk_your-key-here",
    "CRAWILFY_LLM_MODEL": "llama-3.1-8b-instant"
  }
}

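Get your API key at: console.groq.com/keys

The snippets for Options 2 to 4 show only the env block; place it inside the crawilfy entry, as in Option 1.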

Option 3: Ollama (100% FREE - Runs Locally)

{
  "env": {
    "CRAWILFY_LLM_PROVIDER": "ollama",
    "CRAWILFY_LLM_MODEL": "llama3.2"
  }
}

Install Ollama from ollama.ai, then run: ollama pull llama3.2

No API key needed!

Option 4: Any OpenAI-Compatible API

For custom providers (Factory.ai, KiloCode, MegaLLM, etc.):

{
  "env": {
    "CRAWILFY_LLM_BASE_URL": "https://your-api.com/v1",
    "CRAWILFY_LLM_API_KEY": "your-api-key",
    "CRAWILFY_LLM_MODEL": "your-model-name"
  }
}

LLM Configuration Variables

Variable	Description	Default
`CRAWILFY_LLM_PROVIDER`	Provider shortcut: `openrouter`, `groq`, `ollama`, `together`, `deepseek`, `openai`	-
`CRAWILFY_LLM_API_KEY`	API key for the provider (not needed for Ollama)	-
`CRAWILFY_LLM_BASE_URL`	Custom API base URL (auto-set if using provider)	-
`CRAWILFY_LLM_MODEL`	Model name (auto-selected per provider if not set)	varies
`OPENAI_API_KEY`	Legacy: also works for OpenAI provider	-

See llm-config-examples.env for more examples.
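The table implies a resolution order: a provider shortcut selects a base URL, an explicit base URL overrides it, and Ollama needs no key. The sketch below spells out one plausible reading of those rules. It is an illustration, not Crawilfy's implementation: the function name is hypothetical, the precedence is an assumption, and the URLs are the providers' public OpenAI-compatible endpoints as commonly documented (only four of the six shortcuts are listed).

```python
import os

# Assumed mapping; the server's real values may differ.
PROVIDER_BASE_URLS = {
    "openrouter": "https://openrouter.ai/api/v1",
    "groq": "https://api.groq.com/openai/v1",
    "ollama": "http://localhost:11434/v1",
    "openai": "https://api.openai.com/v1",
}


def resolve_llm_config(env=None):
    """Return base URL, API key, and model, or None when no LLM is configured."""
    env = os.environ if env is None else env
    provider = env.get("CRAWILFY_LLM_PROVIDER", "").strip().lower()
    # Assumption: an explicit base URL wins over the provider shortcut.
    base_url = env.get("CRAWILFY_LLM_BASE_URL") or PROVIDER_BASE_URLS.get(provider)
    if base_url is None:
        return None  # smart_extract would fall back to pattern matching
    api_key = env.get("CRAWILFY_LLM_API_KEY")
    if api_key is None and provider == "openai":
        api_key = env.get("OPENAI_API_KEY")  # legacy variable from the table
    hosted = provider in PROVIDER_BASE_URLS and provider != "ollama"
    if api_key is None and hosted:
        raise ValueError(f"CRAWILFY_LLM_API_KEY is required for {provider}")
    # The model may be None; the table says it is auto-selected per provider.
    return {"base_url": base_url, "api_key": api_key, "model": env.get("CRAWILFY_LLM_MODEL")}
```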

📦 Manual Installation (For Development)

# Clone the repository
git clone https://github.com/emad-dev/crawilfy-mcp-server.git
cd crawilfy-mcp-server

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install with dependencies
pip install -e .

# Install browser
playwright install chromium

Then configure MCP with the local path:

{
  "mcpServers": {
    "crawilfy": {
      "command": "/path/to/crawilfy-mcp-server/venv/bin/python",
      "args": ["-m", "src.mcp.server"],
      "cwd": "/path/to/crawilfy-mcp-server"
    }
  }
}

💻 Python API

Use Crawilfy programmatically in your own code:

import asyncio
from src.core.browser.pool import BrowserPool
from src.core.browser.stealth import create_stealth_context
from src.intelligence.network.api_discovery import APIDiscoveryEngine  # for use in your analysis code

async def analyze_site(url):
    # Start the browser pool before creating any contexts
    pool = BrowserPool()
    await pool.initialize()

    try:
        context = await create_stealth_context(pool)
        try:
            page = await context.new_page()
            await page.goto(url)

            # Your analysis code here
        finally:
            # Close the context even if navigation or analysis fails
            await context.close()
    finally:
        # Always release the pool's browsers
        await pool.close()

asyncio.run(analyze_site("https://example.com"))

🧪 CLI Usage

# Deep analysis
crawl deep-analyze https://example.com --full

# Discover APIs
crawl discover-apis https://example.com --include-hidden

# Record session
crawl record https://example.com --output session.json

# Generate crawler
crawl generate --from-recording session.json --output crawler.yaml
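The record and generate commands form a two-step pipeline, which can be driven from Python with `subprocess`. This is a sketch under assumptions: the `crawl` executable is on PATH, it exits with a non-zero status on failure, and the helper names are hypothetical. The flags are the ones shown in the commands above.

```python
import subprocess


def record_then_generate_commands(url, recording="session.json", crawler="crawler.yaml"):
    """Return the argv lists for recording a session and generating a crawler."""
    return [
        ["crawl", "record", url, "--output", recording],
        ["crawl", "generate", "--from-recording", recording, "--output", crawler],
    ]


def run_pipeline(url):
    """Run both steps in order, stopping at the first failing command."""
    for argv in record_then_generate_commands(url):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(argv, check=True)
```

Passing argument lists rather than a shell string avoids quoting problems with URLs that contain characters such as `&` or `?`.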

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

# Development setup
pip install -e ".[dev]"

# Run tests
pytest

# Code formatting
black src tests
ruff check src tests

📄 License

MIT License - see LICENSE file for details.

Made with ❤️ by emad.dev

Project details

Release history

Version	Released
1.1.3	May 4, 2026
1.1.2	May 4, 2026
1.1.1	May 4, 2026
1.1.0	May 4, 2026
1.0.0	Apr 14, 2026
0.3.5 (this version)	Apr 14, 2026
0.3.4	Dec 4, 2025
0.3.3	Dec 4, 2025
0.3.2	Dec 4, 2025
0.3.1	Dec 4, 2025
0.3.0	Dec 4, 2025
0.2.0	Dec 4, 2025
0.1.2	Dec 4, 2025
0.1.1	Dec 4, 2025
0.1.0	Dec 4, 2025

Download files

Source Distribution

crawilfy_mcp_server-0.3.5.tar.gz (143.3 kB)

Uploaded Apr 14, 2026 Source

Built Distribution

crawilfy_mcp_server-0.3.5-py3-none-any.whl (121.7 kB)

Uploaded Apr 14, 2026 Python 3

File details

Details for the file crawilfy_mcp_server-0.3.5.tar.gz.

File metadata

Download URL: crawilfy_mcp_server-0.3.5.tar.gz
Upload date: Apr 14, 2026
Size: 143.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for crawilfy_mcp_server-0.3.5.tar.gz
Algorithm	Hash digest
SHA256	`68e723dbd2cc8a833c34464c98e105ce6c916277eab2355103f4d743a73501d1`
MD5	`af7e5a136634be6ec4dfd498c9850233`
BLAKE2b-256	`ea28f2994f344e20a69dc4665ca458b8599428a0fcd49638a0b5511a3965e2c9`
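A downloaded archive can be checked against the SHA256 digest listed above using `hashlib`. The sketch below hashes the file in chunks so large files do not need to fit in memory. The function names are hypothetical, and the expected value is the SHA256 digest from the table for the 0.3.5 source distribution.

```python
import hashlib

# SHA256 digest from the table above (crawilfy_mcp_server-0.3.5.tar.gz).
EXPECTED_SHA256 = "68e723dbd2cc8a833c34464c98e105ce6c916277eab2355103f4d743a73501d1"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest matches the expected hex string."""
    return sha256_of(path) == expected.strip().lower()
```

For example, `verify("crawilfy_mcp_server-0.3.5.tar.gz")` should return True for an unmodified download of that file.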

File details

Details for the file crawilfy_mcp_server-0.3.5-py3-none-any.whl.

File metadata

Download URL: crawilfy_mcp_server-0.3.5-py3-none-any.whl
Upload date: Apr 14, 2026
Size: 121.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for crawilfy_mcp_server-0.3.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0e04be1fc723a55a94fe51472ba71f9601318acf5b12e60af1afe8c8f1fd73eb`
MD5	`d7b945107b63e9d505c6f5f9bdd19162`
BLAKE2b-256	`312339afcf0774780ae528db9a4c3d57b26bbf95716d930799efaed2ce356324`
