
Official MCP Server for Thordata - Give your AI Agents real-time web scraping superpowers.


Thordata MCP Server

Give your AI Agents real-time web scraping superpowers.

Python 3.10+ | License: MIT

A production-ready MCP (Model Context Protocol) server that gives AI agents web scraping capabilities. It is optimized for LLM-friendly interactions, with comprehensive error handling, batch operations, and intelligent tool selection.

✨ Features

🎯 Core Capabilities

  • ๐Ÿ” Search Engine Tools: High-level web search with LLM-friendly results

    • search_engine: Single-query search with light JSON results
    • search_engine_batch: Batch search with concurrent processing
    • Supports Google, Bing, Yandex with pagination
  • ๐ŸŒ Universal Web Scraper: Extract content from any webpage

    • unlocker: Universal page unlocking with JS rendering & anti-bot handling
    • unlocker_batch: Batch scraping with error isolation
    • Output formats: HTML, Markdown, PNG
    • Smart error handling for HTTP status codes
  • 🤖 Browser Automation: Full browser-level scraping

    • browser: Navigate and capture ARIA/DOM snapshots
    • JavaScript rendering support
    • Filtered accessibility tree for AI-friendly output
  • 🧠 Smart Scraping: Intelligent tool selection

    • smart_scrape: Auto-selects best scraper (SERP, Web Scraper, Unlocker)
    • Automatic fallback to universal scraper
    • Structured data extraction when available
  • 📊 SERP API: Low-level search result scraping

    • serp: Advanced SERP operations with full parameter control
    • Batch search support
    • Multiple output formats

🚀 Key Highlights

  • ✅ Production Ready: Tested across 60+ scenarios, with comprehensive error handling
  • 🎯 LLM Optimized: Clean tool surface designed for AI agents
  • ⚡ High Performance: Concurrent batch operations, optimized response times
  • 🛡️ Robust Error Handling: Detailed error messages with diagnostic information
  • 📦 Batch Support: Efficient batch processing for multiple URLs/queries
  • 🌐 Multi-Engine: Support for Google, Bing, and Yandex search engines

📦 Installation

Prerequisites

  • Python 3.10 or later
  • Thordata API credentials (see Configuration)

Install from PyPI

pip install thordata-mcp-server

Install from Source

# Clone the repository
git clone https://github.com/thordata/thordata-mcp-server.git
cd thordata-mcp-server

# Install dependencies
pip install -e .

# Install Playwright browsers (for browser automation)
playwright install chromium

🔧 Configuration

Environment Variables

Create a .env file in the root directory or set environment variables:

# Required: Thordata API Credentials
THORDATA_SCRAPER_TOKEN=your_scraper_token
THORDATA_PUBLIC_TOKEN=your_public_token
THORDATA_PUBLIC_KEY=your_public_key

# Optional: Browser Automation (for browser tool)
THORDATA_BROWSER_USERNAME=your_browser_username
THORDATA_BROWSER_PASSWORD=your_browser_password
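
A missing credential can be caught before the server starts. The sketch below is a hypothetical pre-flight check, not part of thordata-mcp-server: the function names are invented for illustration, and only the variable names come from the list above.

```python
import os

# Variable names are taken from the configuration section above.
REQUIRED_VARS = (
    "THORDATA_SCRAPER_TOKEN",
    "THORDATA_PUBLIC_TOKEN",
    "THORDATA_PUBLIC_KEY",
)
BROWSER_VARS = ("THORDATA_BROWSER_USERNAME", "THORDATA_BROWSER_PASSWORD")


def missing_vars(env, names):
    """Return the names from `names` that are unset or blank in `env`."""
    return [name for name in names if not env.get(name, "").strip()]


def check_environment(env=None):
    """Hypothetical helper: report which credentials are missing.

    `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return {
        "missing_required": missing_vars(env, REQUIRED_VARS),
        # The browser tool needs both optional browser credentials.
        "browser_ready": not missing_vars(env, BROWSER_VARS),
    }
```

Calling check_environment() with no arguments inspects the current shell environment; passing a dict makes the check easy to test.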

Tool Exposure Control

Control which tools are exposed via environment variables:

# Expose all tools (default: compact set)
THORDATA_MODE=pro

# Or specify tools explicitly
THORDATA_TOOLS=search_engine,search_engine_batch,unlocker,unlocker_batch,serp,browser,smart_scrape

๐Ÿƒ Quick Start

Running Locally (Stdio - Recommended)

Standard mode for MCP clients (Claude Desktop, Cursor, etc.):

thordata-mcp

Or run it as a Python module:

python -m thordata_mcp.main --transport stdio

Running over HTTP (Streamable HTTP)

For remote debugging or HTTP-based clients:

thordata-mcp --transport streamable-http --port 8000

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "thordata": {
      "command": "thordata-mcp",
      "env": {
        "THORDATA_SCRAPER_TOKEN": "your_scraper_token",
        "THORDATA_PUBLIC_TOKEN": "your_public_token",
        "THORDATA_PUBLIC_KEY": "your_public_key"
      }
    }
  }
}
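
If the config file already lists other MCP servers, the entry has to be merged in, not pasted over them. The following is a minimal sketch under that assumption; add_thordata_server and update_config_file are hypothetical helpers, and the file path is left to the caller because it differs by platform.

```python
import json
from pathlib import Path


def add_thordata_server(config, scraper_token, public_token, public_key):
    """Return a copy of an MCP client config with a "thordata" entry added.

    Existing entries under "mcpServers" are preserved; the input is not mutated.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["thordata"] = {
        "command": "thordata-mcp",
        "env": {
            "THORDATA_SCRAPER_TOKEN": scraper_token,
            "THORDATA_PUBLIC_TOKEN": public_token,
            "THORDATA_PUBLIC_KEY": public_key,
        },
    }
    merged["mcpServers"] = servers
    return merged


def update_config_file(path, scraper_token, public_token, public_key):
    """Read the JSON config at `path` (or start empty), merge, and write it back."""
    path = Path(path)
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    merged = add_thordata_server(config, scraper_token, public_token, public_key)
    path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
```

The same helper should apply to any client whose config uses the "mcpServers" shape shown in this README.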

Cursor Configuration

Add to your ~/.cursor/mcp.json:

{
  "mcpServers": {
    "thordata": {
      "command": "thordata-mcp",
      "env": {
        "THORDATA_SCRAPER_TOKEN": "your_scraper_token",
        "THORDATA_PUBLIC_TOKEN": "your_public_token",
        "THORDATA_PUBLIC_KEY": "your_public_key"
      }
    }
  }
}

๐Ÿ› ๏ธ Available Tools

Default Tools (Compact Surface)

By default, the server exposes a compact, LLM-friendly tool set:

1. search_engine - Web Search

High-level web search wrapper optimized for LLMs.

Parameters:

  • q (required): Search query string
  • engine (default: "google"): Search engine ("google", "bing", "yandex")
  • num (default: 10): Number of results (1-50)
  • start (default: 0): Starting position for pagination
  • country: Country code for geolocation (e.g., "US", "JP")
  • language: Language code (e.g., "en", "ja")

Example:

{
  "q": "Python web scraping",
  "engine": "google",
  "num": 10
}

Response:

{
  "ok": true,
  "output": {
    "results": [
      {
        "title": "Web Scraping with Python",
        "link": "https://example.com",
        "description": "Learn web scraping..."
      }
    ],
    "meta": {
      "engine": "google",
      "q": "Python web scraping",
      "num": 10
    }
  }
}
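
On the client side, a response of this shape can be reduced to the fields an agent usually needs. This is a sketch that assumes the response has already been decoded into a Python dict; extract_results is a hypothetical helper, and the error fields it reads follow the error format documented in this README.

```python
def extract_results(response):
    """Return (title, link) pairs from a decoded search_engine response.

    Raises RuntimeError when the response reports a failure ("ok" is false).
    """
    if not response.get("ok"):
        error = response.get("error", {})
        raise RuntimeError(
            f"{error.get('type', 'unknown')}: {error.get('message', '')}"
        )
    results = response.get("output", {}).get("results", [])
    # Missing fields become empty strings so callers can rely on two-item tuples.
    return [(item.get("title", ""), item.get("link", "")) for item in results]
```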

2. search_engine_batch - Batch Web Search

Batch search with concurrent processing and per-item error handling.

Parameters:

  • requests (required): Array of search request objects
  • concurrency (default: 5): Number of concurrent requests (1-20)
  • engine (default: "google"): Default engine for all requests
  • num (default: 10): Default number of results per request

Example:

{
  "requests": [
    {"q": "Python programming"},
    {"q": "JavaScript frameworks"},
    {"q": "Machine learning"}
  ],
  "concurrency": 3
}
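
A payload like the one above can be assembled from a plain list of queries. The sketch below uses a hypothetical build_search_batch helper; it validates concurrency against the documented 1-20 range on the client side, which is an optional convenience, not something the server is known to require.

```python
def build_search_batch(queries, concurrency=5, engine="google", num=10):
    """Build arguments for search_engine_batch from a list of query strings.

    Blank queries are dropped; concurrency must fall in the documented 1-20 range.
    """
    cleaned = [q.strip() for q in queries if q and q.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty query is required")
    if not 1 <= concurrency <= 20:
        raise ValueError("concurrency must be between 1 and 20")
    return {
        "requests": [{"q": q} for q in cleaned],
        "concurrency": concurrency,
        "engine": engine,
        "num": num,
    }
```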

3. unlocker - Universal Web Scraper

Extract content from any webpage with JavaScript rendering support.

Parameters:

  • url (required): Target URL to scrape
  • js_render (default: false): Enable JavaScript rendering
  • output_format (default: "html"): Output format ("html", "markdown", "png")
  • country: Country code for geolocation
  • wait_ms: Wait time in milliseconds before capture
  • wait_for: CSS selector or text to wait for
  • block_resources: Block resource types ("script", "image", "video")

Example:

{
  "url": "https://example.com",
  "js_render": true,
  "output_format": "markdown"
}

Response:

{
  "ok": true,
  "output": {
    "markdown": "# Example Page\n\nContent here...",
    "format": "markdown"
  }
}
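
Outside a desktop client, the tool can also be called from a script. The sketch below assumes the official MCP Python SDK (the `mcp` package) is installed and that `thordata-mcp` is on the PATH; unlocker_arguments and call_unlocker are hypothetical helper names, and how the server packs its JSON into the returned content blocks is an assumption to verify against a real response.

```python
def unlocker_arguments(url, js_render=False, output_format="html"):
    """Build arguments for the unlocker tool using the documented parameter names."""
    if output_format not in ("html", "markdown", "png"):
        raise ValueError(f"unsupported output_format: {output_format!r}")
    return {"url": url, "js_render": js_render, "output_format": output_format}


async def call_unlocker(url, env):
    """Start the server over stdio, call unlocker once, and return any text blocks.

    `env` is a dict holding the THORDATA_* credentials for the server process.
    """
    # Imported lazily so unlocker_arguments works without the `mcp` package.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command="thordata-mcp", args=[], env=env)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "unlocker",
                arguments=unlocker_arguments(
                    url, js_render=True, output_format="markdown"
                ),
            )
            # Assumption: the tool result arrives as text content blocks.
            return [
                block.text
                for block in result.content
                if getattr(block, "type", None) == "text"
            ]
```

To try it, run asyncio.run(call_unlocker("https://example.com", env)) with real credentials in env.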

4. unlocker_batch - Batch Web Scraping

Batch web scraping with concurrent processing and error isolation.

Parameters:

  • requests (required): Array of request objects with url and optional parameters
  • concurrency (default: 5): Number of concurrent requests (1-20)

Example:

{
  "requests": [
    {"url": "https://example.com", "js_render": true},
    {"url": "https://example.org", "output_format": "markdown"}
  ],
  "concurrency": 3
}

5. browser - Browser Scraper

Navigate and capture ARIA/DOM snapshots using Playwright.

Parameters:

  • url (required): Target URL to navigate
  • filtered (default: true): Return filtered ARIA snapshot
  • mode (default: "accessibility"): Snapshot mode ("accessibility" or "dom")
  • max_items (default: 100): Maximum items in snapshot (1-500)
  • max_chars (default: 20000): Maximum characters in snapshot
  • include_dom (default: false): Include DOM snapshot

Example:

{
  "url": "https://example.com",
  "filtered": true,
  "max_items": 50
}

6. smart_scrape - Intelligent Scraping

Automatically selects the best scraping method for any URL.

Parameters:

  • url (required): Target URL to scrape
  • prefer_structured (default: true): Prefer structured data extraction
  • preview (default: true): Include raw HTML/JSON preview
  • preview_max_chars (default: 20000): Maximum characters in preview
  • max_wait_seconds (default: 300): Maximum wait time for task completion
  • unlocker_output (default: "markdown"): Output format when using Unlocker fallback

Example:

{
  "url": "https://amazon.com/dp/B08N5WRWNW",
  "prefer_structured": true
}

Response:

{
  "ok": true,
  "output": {
    "tool_used": "amazon_product",
    "structured_data": {
      "title": "Product Title",
      "price": "$99.99",
      ...
    },
    "preview": "..."
  }
}
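
Because smart_scrape may return either structured data or only a preview, callers usually need to branch on which one arrived. A minimal sketch, assuming the response is already a decoded dict with the fields shown above; summarize_smart_scrape is a hypothetical helper.

```python
def summarize_smart_scrape(response):
    """Pick the most useful payload from a decoded smart_scrape response.

    Returns a dict with the tool that ran, where the data came from
    ("structured", "preview", or "error"), and the data itself.
    """
    if not response.get("ok"):
        return {"tool_used": None, "source": "error", "data": response.get("error")}
    output = response.get("output", {})
    structured = output.get("structured_data")
    if structured:
        return {
            "tool_used": output.get("tool_used"),
            "source": "structured",
            "data": structured,
        }
    # No structured data: fall back to the raw preview text.
    return {
        "tool_used": output.get("tool_used"),
        "source": "preview",
        "data": output.get("preview", ""),
    }
```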

7. serp - SERP API (Advanced)

Low-level SERP scraper with full parameter control.

Parameters:

  • action (required): Action to perform ("search" or "batch_search")
  • params (required): Parameters dictionary

Example:

{
  "action": "search",
  "params": {
    "q": "Python programming",
    "engine": "google",
    "num": 10,
    "format": "light_json"
  }
}

🎯 Error Handling

The server provides comprehensive error handling with detailed diagnostic information:

Error Response Format

{
  "ok": false,
  "error": {
    "type": "not_found",
    "code": "E3003",
    "message": "HTTP 404 error: Page returned empty content...",
    "details": {
      "url": "https://example.com/not-found",
      "status_code": 404
    }
  },
  "request_id": "unique-request-id"
}

Error Types

  • validation_error: Invalid parameters (E4001)
  • not_found: Resource not found (E3003)
  • permission_denied: Access forbidden (E1004)
  • upstream_internal_error: Server errors (E2106)
  • timeout_error: Request timeout (E2003)
  • network_error: Network issues (E2001)
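
The error type can drive a simple client-side retry policy. In the sketch below, the choice of which types count as transient is an assumption made for illustration, not documented server behavior, and is_retryable and call_with_retry are hypothetical helpers.

```python
import time

# Assumption: these types indicate transient failures worth retrying.
RETRYABLE_TYPES = {"timeout_error", "network_error", "upstream_internal_error"}


def is_retryable(response):
    """Return True when a decoded response failed with a transient error type."""
    if response.get("ok"):
        return False
    return response.get("error", {}).get("type") in RETRYABLE_TYPES


def call_with_retry(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Invoke `call` up to `attempts` times with exponential backoff.

    `call` is any zero-argument function returning a decoded response dict.
    The last response is returned even if it is still an error.
    """
    response = {}
    for attempt in range(attempts):
        response = call()
        if not is_retryable(response):
            return response
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return response
```

Errors such as validation_error or not_found are returned immediately, since repeating the same request is unlikely to change the outcome.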

Special Features

  • Special Character Detection: Automatically detects and reports problematic characters in search queries
  • HTTP Status Code Mapping: Clear error messages for 404, 500, 403, etc.
  • Empty Result Hints: Helpful notes for empty search results (e.g., Chinese query limitations)
  • Batch Error Isolation: Individual request failures don't affect batch operations

📊 Performance

  • Response Time: 0.4-2 seconds for most operations
  • Concurrent Processing: Supports up to 20 concurrent requests
  • Batch Operations: Efficient batch processing with error isolation
  • Resource Optimization: Smart caching and request optimization

🧪 Testing

The server has been tested against 60+ scenarios, covering:

  • ✅ HTTP Error Handling: All status codes properly handled
  • ✅ Special Character Processing: Automatic detection and clear error messages
  • ✅ Batch Operations: Concurrent processing with error isolation
  • ✅ Empty Result Handling: Helpful hints for empty results
  • ✅ Performance: Optimized response times and resource usage

Result: 100% of reported issues resolved

๐Ÿ—๏ธ Architecture

thordata_mcp/
├── main.py              # Entry point
├── registry.py          # Tool registration
├── config.py            # Configuration management
├── context.py           # Server context (client, browser session)
├── utils.py             # Common utilities (error handling, responses)
├── browser_session.py   # Browser session management (Playwright)
├── aria_snapshot.py     # ARIA snapshot filtering
└── tools/
    ├── product_compact.py  # Main tool definitions (compact surface)
    ├── product.py          # Full product implementation
    └── data/               # Data plane tools
        ├── serp.py         # SERP backend integration
        ├── universal.py    # Universal scraper integration
        ├── browser.py      # Browser automation
        └── tasks.py        # Structured scraping tasks

🎯 Design Principles

  1. LLM-Friendly: Clean tool surface optimized for AI agents
  2. Robust Error Handling: Detailed error messages with diagnostic information
  3. Batch Support: Efficient concurrent processing
  4. Performance Optimized: Fast response times and resource efficiency
  5. Production Ready: Comprehensive testing and error handling

🚀 Deployment

Docker

docker build -t thordata-mcp-server .
docker run \
  -e THORDATA_SCRAPER_TOKEN=... \
  -e THORDATA_PUBLIC_TOKEN=... \
  -e THORDATA_PUBLIC_KEY=... \
  thordata-mcp-server

Docker Compose

See docker-compose.yml for a complete setup with Caddy reverse proxy.

๐Ÿ“ License

MIT License. Copyright (c) 2026 Thordata.

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📞 Support

๐Ÿ™ Acknowledgments

Built with:


Ready to give your AI agents web scraping superpowers? 🚀

Install now: pip install thordata-mcp-server
