Official MCP Server for Thordata - Give your AI Agents real-time web scraping superpowers.

Thordata MCP Server

Give your AI Agents real-time web scraping superpowers.

A production-ready MCP (Model Context Protocol) server that provides AI agents with powerful web scraping capabilities. Optimized for LLM-friendly interactions with comprehensive error handling, batch operations, and intelligent tool selection.

✨ Features

🎯 Core Capabilities

🔍 Search Engine Tools: High-level web search with LLM-friendly results
- search_engine: Single-query search with light JSON results
- search_engine_batch: Batch search with concurrent processing
- Supports Google, Bing, and Yandex, with pagination
🌐 Universal Web Scraper: Extract content from any webpage
- unlocker: Universal page unlocking with JS rendering & anti-bot handling
- unlocker_batch: Batch scraping with error isolation
- Output formats: HTML, Markdown, PNG
- Smart error handling for HTTP status codes
🤖 Browser Automation: Full browser-level scraping
- browser: Navigate and capture ARIA/DOM snapshots
- JavaScript rendering support
- Filtered accessibility tree for AI-friendly output
🧠 Smart Scraping: Intelligent tool selection
- smart_scrape: Auto-selects best scraper (SERP, Web Scraper, Unlocker)
- Automatic fallback to universal scraper
- Structured data extraction when available
📊 SERP API: Low-level search result scraping
- serp: Advanced SERP operations with full parameter control
- Batch search support
- Multiple output formats

🚀 Key Highlights

✅ Production Ready: Tested against 60+ scenarios, with comprehensive error handling
🎯 LLM Optimized: Clean tool surface designed for AI agents
⚡ High Performance: Concurrent batch operations, optimized response times
🛡️ Robust Error Handling: Detailed error messages with diagnostic information
📦 Batch Support: Efficient batch processing for multiple URLs/queries
🌍 Multi-Engine: Support for Google, Bing, Yandex search engines

📦 Installation

Prerequisites

Python 3.10 or higher
Thordata API credentials (Get your tokens)

Install from PyPI

pip install thordata-mcp-server

Install from Source

# Clone the repository
git clone https://github.com/thordata/thordata-mcp-server.git
cd thordata-mcp-server

# Install the package and its dependencies in editable mode
pip install -e .

# Install Playwright browsers (for browser automation)
playwright install chromium

🔧 Configuration

Environment Variables

Create a .env file in the root directory or set environment variables:

# Required: Thordata API Credentials
THORDATA_SCRAPER_TOKEN=your_scraper_token
THORDATA_PUBLIC_TOKEN=your_public_token
THORDATA_PUBLIC_KEY=your_public_key

# Optional: Browser Automation (for browser tool)
THORDATA_BROWSER_USERNAME=your_browser_username
THORDATA_BROWSER_PASSWORD=your_browser_password
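
A small pre-flight check can catch a missing credential before the server is launched. The sketch below is illustrative only: the variable names are the required ones listed above, while `missing_credentials` is a hypothetical helper and not part of the package.

```python
import os

# Required variable names, taken from the configuration listing above.
REQUIRED_VARS = (
    "THORDATA_SCRAPER_TOKEN",
    "THORDATA_PUBLIC_TOKEN",
    "THORDATA_PUBLIC_KEY",
)


def missing_credentials(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All required Thordata credentials are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests.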

Tool Exposure Control

Control which tools are exposed via environment variables:

# Expose all tools (default: compact set)
THORDATA_MODE=pro

# Or specify tools explicitly
THORDATA_TOOLS=search_engine,search_engine_batch,unlocker,unlocker_batch,serp,browser,smart_scrape
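
When the tool list is assembled by a script, a typo in a tool name is easy to miss. The following sketch builds a `THORDATA_TOOLS` value from a Python list and rejects names that do not appear in this README. `tools_env_value` is a hypothetical helper, and how the server itself treats unknown names is not documented here.

```python
# Tool names as listed in this README.
KNOWN_TOOLS = (
    "search_engine",
    "search_engine_batch",
    "unlocker",
    "unlocker_batch",
    "serp",
    "browser",
    "smart_scrape",
)


def tools_env_value(selected):
    """Build a comma-separated THORDATA_TOOLS value, rejecting unknown names."""
    unknown = [name for name in selected if name not in KNOWN_TOOLS]
    if unknown:
        raise ValueError(f"Unknown tool name(s): {', '.join(unknown)}")
    # dict.fromkeys keeps the first occurrence of each name, in order.
    return ",".join(dict.fromkeys(selected))


print(tools_env_value(["search_engine", "unlocker", "smart_scrape"]))
```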

🏃 Quick Start

Running Locally (Stdio - Recommended)

Standard mode for MCP clients (Claude Desktop, Cursor, etc.):

thordata-mcp

Or run it as a Python module:

python -m thordata_mcp.main --transport stdio

Running with HTTP (Streamable HTTP)

For remote debugging or HTTP-based clients, start the server with the streamable HTTP transport:

thordata-mcp --transport streamable-http --port 8000

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "thordata": {
      "command": "thordata-mcp",
      "env": {
        "THORDATA_SCRAPER_TOKEN": "your_scraper_token",
        "THORDATA_PUBLIC_TOKEN": "your_public_token",
        "THORDATA_PUBLIC_KEY": "your_public_key"
      }
    }
  }
}

Cursor Configuration

Add to your ~/.cursor/mcp.json:

{
  "mcpServers": {
    "thordata": {
      "command": "thordata-mcp",
      "env": {
        "THORDATA_SCRAPER_TOKEN": "your_scraper_token",
        "THORDATA_PUBLIC_TOKEN": "your_public_token",
        "THORDATA_PUBLIC_KEY": "your_public_key"
      }
    }
  }
}
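
Both client configurations share the same `mcpServers` shape, so the entry can be generated rather than pasted by hand. The sketch below adds a `thordata` entry to an existing configuration dictionary without touching other servers. `add_thordata_server` is a hypothetical helper, and reading or writing the actual config file is left out.

```python
import json


def add_thordata_server(config, scraper_token, public_token, public_key):
    """Return a copy of `config` with a "thordata" entry under "mcpServers"."""
    servers = dict(config.get("mcpServers", {}))
    servers["thordata"] = {
        "command": "thordata-mcp",
        "env": {
            "THORDATA_SCRAPER_TOKEN": scraper_token,
            "THORDATA_PUBLIC_TOKEN": public_token,
            "THORDATA_PUBLIC_KEY": public_key,
        },
    }
    return {**config, "mcpServers": servers}


# "other" stands in for any server already present in the client's config.
existing = {"mcpServers": {"other": {"command": "other-server"}}}
updated = add_thordata_server(
    existing, "your_scraper_token", "your_public_token", "your_public_key"
)
print(json.dumps(updated, indent=2))
```

Returning a new dictionary instead of mutating the input keeps the original configuration available if the write needs to be rolled back.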

🛠️ Available Tools

Default Tools (Compact Surface)

By default, the server exposes a compact, LLM-friendly tool set:

1. `search_engine` - Web Search

High-level web search wrapper optimized for LLMs.

Parameters:

q (required): Search query string
engine (default: "google"): Search engine ("google", "bing", "yandex")
num (default: 10): Number of results (1-50)
start (default: 0): Starting position for pagination
country: Country code for geolocation (e.g., "US", "JP")
language: Language code (e.g., "en", "ja")

Example:

{
  "q": "Python web scraping",
  "engine": "google",
  "num": 10
}

Response:

{
  "ok": true,
  "output": {
    "results": [
      {
        "title": "Web Scraping with Python",
        "link": "https://example.com",
        "description": "Learn web scraping..."
      }
    ],
    "meta": {
      "engine": "google",
      "q": "Python web scraping",
      "num": 10
    }
  }
}
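
On the client side, a response of this shape can be reduced to the fields an agent usually needs. This is a minimal sketch that assumes the response has already been decoded into a dictionary matching the example above. `extract_links` is a hypothetical helper.

```python
def extract_links(response):
    """Return (title, link) pairs from a search_engine response, or [] on failure."""
    if not response.get("ok"):
        return []
    results = response.get("output", {}).get("results", [])
    return [(item.get("title", ""), item.get("link", "")) for item in results]


# Sample data copied from the example response above.
sample = {
    "ok": True,
    "output": {
        "results": [
            {
                "title": "Web Scraping with Python",
                "link": "https://example.com",
                "description": "Learn web scraping...",
            }
        ],
        "meta": {"engine": "google", "q": "Python web scraping", "num": 10},
    },
}

for title, link in extract_links(sample):
    print(f"{title}: {link}")
```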

2. `search_engine_batch` - Batch Web Search

Batch search with concurrent processing and per-item error handling.

Parameters:

requests (required): Array of search request objects
concurrency (default: 5): Number of concurrent requests (1-20)
engine (default: "google"): Default engine for all requests
num (default: 10): Default number of results per request

Example:

{
  "requests": [
    {"q": "Python programming"},
    {"q": "JavaScript frameworks"},
    {"q": "Machine learning"}
  ],
  "concurrency": 3
}
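
A batch payload like the one above can be built from a plain list of queries. The sketch below keeps `concurrency` inside the documented 1-20 range. `build_search_batch` is a hypothetical helper, and clamping out-of-range values (rather than rejecting them) is a choice made for this example.

```python
def build_search_batch(queries, concurrency=5, engine="google", num=10):
    """Build search_engine_batch arguments from a list of query strings."""
    if not queries:
        raise ValueError("At least one query is required")
    return {
        "requests": [{"q": query} for query in queries],
        # Clamp to the documented range of 1-20 concurrent requests.
        "concurrency": max(1, min(20, concurrency)),
        "engine": engine,
        "num": num,
    }


payload = build_search_batch(
    ["Python programming", "JavaScript frameworks", "Machine learning"],
    concurrency=3,
)
print(payload)
```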

3. `unlocker` - Universal Web Scraper

Extract content from any webpage with JavaScript rendering support.

Parameters:

url (required): Target URL to scrape
js_render (default: false): Enable JavaScript rendering
output_format (default: "html"): Output format ("html", "markdown", "png")
country: Country code for geolocation
wait_ms: Wait time in milliseconds before capture
wait_for: CSS selector or text to wait for
block_resources: Block resource types ("script", "image", "video")

Example:

{
  "url": "https://example.com",
  "js_render": true,
  "output_format": "markdown"
}

Response:

{
  "ok": true,
  "output": {
    "markdown": "# Example Page\n\nContent here...",
    "format": "markdown"
  }
}
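
Validating arguments before the call avoids a round trip for obviously bad input. This sketch checks the URL scheme and the `output_format` value against the options listed above. `unlocker_args` is a hypothetical helper, and the server may apply additional validation of its own.

```python
from urllib.parse import urlparse

# Output formats listed for the unlocker tool.
OUTPUT_FORMATS = ("html", "markdown", "png")


def unlocker_args(url, output_format="html", js_render=False, **optional):
    """Build unlocker arguments, rejecting malformed URLs and unknown formats."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not an absolute http(s) URL: {url!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"output_format must be one of {OUTPUT_FORMATS}")
    args = {"url": url, "js_render": js_render, "output_format": output_format}
    # Optional parameters (country, wait_ms, wait_for, ...) pass through
    # unchanged; entries set to None are dropped.
    args.update({key: value for key, value in optional.items() if value is not None})
    return args


print(unlocker_args("https://example.com", output_format="markdown", js_render=True))
```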

4. `unlocker_batch` - Batch Web Scraping

Batch web scraping with concurrent processing and error isolation.

Parameters:

requests (required): Array of request objects with url and optional parameters
concurrency (default: 5): Number of concurrent requests (1-20)

Example:

{
  "requests": [
    {"url": "https://example.com", "js_render": true},
    {"url": "https://example.org", "output_format": "markdown"}
  ],
  "concurrency": 3
}

5. `browser` - Browser Scraper

Navigate and capture ARIA/DOM snapshots using Playwright.

Parameters:

url (required): Target URL to navigate
filtered (default: true): Return filtered ARIA snapshot
mode (default: "accessibility"): Snapshot mode ("accessibility" or "dom")
max_items (default: 100): Maximum items in snapshot (1-500)
max_chars (default: 20000): Maximum characters in snapshot
include_dom (default: false): Include DOM snapshot

Example:

{
  "url": "https://example.com",
  "filtered": true,
  "max_items": 50
}

6. `smart_scrape` - Intelligent Scraping

Automatically selects the best scraping method for any URL.

Parameters:

url (required): Target URL to scrape
prefer_structured (default: true): Prefer structured data extraction
preview (default: true): Include raw HTML/JSON preview
preview_max_chars (default: 20000): Maximum characters in preview
max_wait_seconds (default: 300): Maximum wait time for task completion
unlocker_output (default: "markdown"): Output format when using Unlocker fallback

Example:

{
  "url": "https://amazon.com/dp/B08N5WRWNW",
  "prefer_structured": true
}

Response:

{
  "ok": true,
  "output": {
    "tool_used": "amazon_product",
    "structured_data": {
      "title": "Product Title",
      "price": "$99.99",
      ...
    },
    "preview": "..."
  }
}
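
Because `smart_scrape` picks the scraper itself, a caller often wants to know which tool ran and which fields came back. A minimal sketch, assuming a decoded response shaped like the example above. `summarize_scrape` is a hypothetical helper, and the sample contains only the two fields shown in that example.

```python
def summarize_scrape(response):
    """Return (tool_used, sorted field names), or None if the call failed."""
    if not response.get("ok"):
        return None
    output = response.get("output", {})
    # structured_data may be absent when the Unlocker fallback was used.
    fields = sorted(output.get("structured_data") or {})
    return output.get("tool_used"), fields


sample = {
    "ok": True,
    "output": {
        "tool_used": "amazon_product",
        "structured_data": {"title": "Product Title", "price": "$99.99"},
        "preview": "...",
    },
}
print(summarize_scrape(sample))
```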

7. `serp` - SERP API (Advanced)

Low-level SERP scraper with full parameter control.

Parameters:

action (required): Action to perform ("search" or "batch_search")
params (required): Parameters dictionary

Example:

{
  "action": "search",
  "params": {
    "q": "Python programming",
    "engine": "google",
    "num": 10,
    "format": "light_json"
  }
}

🎯 Error Handling

The server provides comprehensive error handling with detailed diagnostic information:

Error Response Format

{
  "ok": false,
  "error": {
    "type": "not_found",
    "code": "E3003",
    "message": "HTTP 404 error: Page returned empty content...",
    "details": {
      "url": "https://example.com/not-found",
      "status_code": 404
    }
  },
  "request_id": "unique-request-id"
}

Error Types

validation_error: Invalid parameters (E4001)
not_found: Resource not found (E3003)
permission_denied: Access forbidden (E1004)
upstream_internal_error: Server errors (E2106)
timeout_error: Request timeout (E2003)
network_error: Network issues (E2001)
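
A client can use the `type` field to decide how to react to a failure. The sketch below formats the error envelope into one line and flags errors that may be worth retrying. Both helpers are hypothetical, and the set of retryable types is an assumption made for this example: the README does not define retry semantics.

```python
# Assumption: treating these types as transient is a client-side policy
# chosen for this sketch, not documented server behavior.
RETRYABLE_TYPES = {"timeout_error", "network_error", "upstream_internal_error"}


def describe_error(response):
    """Return None for a successful response, else a one-line summary."""
    if response.get("ok"):
        return None
    error = response.get("error", {})
    code = error.get("code", "unknown")
    kind = error.get("type", "unknown")
    message = error.get("message", "")
    return f"[{code}] {kind}: {message}"


def should_retry(response):
    """Return True if the response is a failure of a type assumed transient."""
    if response.get("ok"):
        return False
    return response.get("error", {}).get("type") in RETRYABLE_TYPES


# Sample data copied from the error response format above.
sample = {
    "ok": False,
    "error": {
        "type": "not_found",
        "code": "E3003",
        "message": "HTTP 404 error: Page returned empty content...",
        "details": {"url": "https://example.com/not-found", "status_code": 404},
    },
    "request_id": "unique-request-id",
}
print(describe_error(sample))
print(should_retry(sample))
```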

Special Features

Special Character Detection: Automatically detects and reports problematic characters in search queries
HTTP Status Code Mapping: Clear error messages for 404, 500, 403, etc.
Empty Result Hints: Helpful notes for empty search results (e.g., Chinese query limitations)
Batch Error Isolation: Individual request failures don't affect batch operations

📊 Performance

Response Time: 0.4-2 seconds for most operations
Concurrent Processing: Supports up to 20 concurrent requests
Batch Operations: Efficient batch processing with error isolation
Resource Optimization: Smart caching and request optimization

🧪 Testing

The server has been extensively tested with 60+ test scenarios:

✅ HTTP Error Handling: All status codes properly handled
✅ Special Character Processing: Automatic detection and clear error messages
✅ Batch Operations: Concurrent processing with error isolation
✅ Empty Result Handling: Helpful hints for empty results
✅ Performance: Optimized response times and resource usage

Issue Resolution: 100% of reported issues resolved

🏗️ Architecture

thordata_mcp/
├── main.py              # Entry point
├── registry.py          # Tool registration
├── config.py            # Configuration management
├── context.py           # Server context (client, browser session)
├── utils.py             # Common utilities (error handling, responses)
├── browser_session.py   # Browser session management (Playwright)
├── aria_snapshot.py     # ARIA snapshot filtering
└── tools/
    ├── product_compact.py  # Main tool definitions (compact surface)
    ├── product.py          # Full product implementation
    └── data/               # Data plane tools
        ├── serp.py         # SERP backend integration
        ├── universal.py    # Universal scraper integration
        ├── browser.py      # Browser automation
        └── tasks.py        # Structured scraping tasks

🎯 Design Principles

LLM-Friendly: Clean tool surface optimized for AI agents
Robust Error Handling: Detailed error messages with diagnostic information
Batch Support: Efficient concurrent processing
Performance Optimized: Fast response times and resource efficiency
Production Ready: Comprehensive testing and error handling

🚀 Deployment

Docker

docker build -t thordata-mcp-server .
docker run -e THORDATA_SCRAPER_TOKEN=... -e THORDATA_PUBLIC_TOKEN=... -e THORDATA_PUBLIC_KEY=... thordata-mcp-server

Docker Compose

See docker-compose.yml for a complete setup with a Caddy reverse proxy.

📝 License

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📞 Support

Documentation: GitHub Wiki
Issues: GitHub Issues
Email: support@thordata.com

🙏 Acknowledgments

Built with:

MCP - Model Context Protocol
Thordata SDK - Web scraping SDK
Playwright - Browser automation

Ready to give your AI agents web scraping superpowers? 🚀

Install now: pip install thordata-mcp-server

This version

0.6.0

Feb 9, 2026
