Official MCP Server for Thordata - Give your AI Agents real-time web scraping superpowers.
Thordata MCP Server
Give your AI Agents real-time web scraping superpowers.
A production-ready MCP (Model Context Protocol) server that provides AI agents with powerful web scraping capabilities. Optimized for LLM-friendly interactions with comprehensive error handling, batch operations, and intelligent tool selection.
Features
Core Capabilities
- Search Engine Tools: High-level web search with LLM-friendly results
  - search_engine: Single-query search with light JSON results
  - search_engine_batch: Batch search with concurrent processing
  - Supports Google, Bing, and Yandex with pagination
- Universal Web Scraper: Extract content from any webpage
  - unlocker: Universal page unlocking with JS rendering and anti-bot handling
  - unlocker_batch: Batch scraping with error isolation
  - Output formats: HTML, Markdown, PNG
  - Smart error handling for HTTP status codes
- Browser Automation: Full browser-level scraping
  - browser: Navigate and capture ARIA/DOM snapshots
  - JavaScript rendering support
  - Filtered accessibility tree for AI-friendly output
- Smart Scraping: Intelligent tool selection
  - smart_scrape: Auto-selects the best scraper (SERP, Web Scraper, Unlocker)
  - Automatic fallback to universal scraper
  - Structured data extraction when available
- SERP API: Low-level search result scraping
  - serp: Advanced SERP operations with full parameter control
  - Batch search support
  - Multiple output formats
Key Highlights
- Production Ready: 100% test coverage with comprehensive error handling
- LLM Optimized: Clean tool surface designed for AI agents
- High Performance: Concurrent batch operations, optimized response times
- Robust Error Handling: Detailed error messages with diagnostic information
- Batch Support: Efficient batch processing for multiple URLs/queries
- Multi-Engine: Support for Google, Bing, and Yandex search engines
Installation
Prerequisites
- Python 3.10 or higher
- Thordata API credentials (Get your tokens)
Install from PyPI
pip install thordata-mcp-server
Install from Source
# Clone the repository
git clone https://github.com/thordata/thordata-mcp-server.git
cd thordata-mcp-server
# Install dependencies
pip install -e .
# Install Playwright browsers (for browser automation)
playwright install chromium
Configuration
Environment Variables
Create a .env file in the root directory or set environment variables:
# Required: Thordata API Credentials
THORDATA_SCRAPER_TOKEN=your_scraper_token
THORDATA_PUBLIC_TOKEN=your_public_token
THORDATA_PUBLIC_KEY=your_public_key
# Optional: Browser Automation (for browser tool)
THORDATA_BROWSER_USERNAME=your_browser_username
THORDATA_BROWSER_PASSWORD=your_browser_password
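Before launching the server, it can help to confirm that the three required variables are actually set. The following is a minimal sketch, not part of the package: the helper name missing_credentials is hypothetical, and only the variable names come from the list above.

```python
import os

# Required variable names, taken from the configuration list above.
REQUIRED_VARS = (
    "THORDATA_SCRAPER_TOKEN",
    "THORDATA_PUBLIC_TOKEN",
    "THORDATA_PUBLIC_KEY",
)


def missing_credentials(env=None):
    """Return the required variable names that are unset or blank.

    Hypothetical helper for illustration; it is not shipped with the server.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing required variables: " + ", ".join(missing))
    else:
        print("All required Thordata credentials are set.")
```

Passing a plain dict instead of os.environ makes the check easy to exercise without touching the real environment.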
Tool Exposure Control
Control which tools are exposed via environment variables:
# Expose all tools (default: compact set)
THORDATA_MODE=pro
# Or specify tools explicitly
THORDATA_TOOLS=search_engine,search_engine_batch,unlocker,unlocker_batch,serp,browser,smart_scrape
Quick Start
Running Locally (Stdio - Recommended)
Standard mode for MCP clients (Claude Desktop, Cursor, etc.):
thordata-mcp
Or using Python module:
python -m thordata_mcp.main --transport stdio
Running with HTTP (Streamable HTTP)
For remote debugging or HTTP-based clients:
thordata-mcp --transport streamable-http --port 8000
Claude Desktop Configuration
Add to your claude_desktop_config.json:
{
"mcpServers": {
"thordata": {
"command": "thordata-mcp",
"env": {
"THORDATA_SCRAPER_TOKEN": "your_token",
"THORDATA_PUBLIC_TOKEN": "your_token",
"THORDATA_PUBLIC_KEY": "your_key"
}
}
}
}
Cursor Configuration
Add to your ~/.cursor/mcp.json:
{
"mcpServers": {
"thordata": {
"command": "thordata-mcp",
"env": {
"THORDATA_SCRAPER_TOKEN": "your_token",
"THORDATA_PUBLIC_TOKEN": "your_token",
"THORDATA_PUBLIC_KEY": "your_key"
}
}
}
}
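Both client configurations share the same shape, so the entry can be generated rather than typed by hand. The sketch below is an illustration under that assumption: build_mcp_config and merge_servers are hypothetical helper names, and the script only builds the JSON in memory without writing to either config file.

```python
import json


def build_mcp_config(scraper_token, public_token, public_key):
    """Build the "thordata" server entry in the shape shown above."""
    return {
        "mcpServers": {
            "thordata": {
                "command": "thordata-mcp",
                "env": {
                    "THORDATA_SCRAPER_TOKEN": scraper_token,
                    "THORDATA_PUBLIC_TOKEN": public_token,
                    "THORDATA_PUBLIC_KEY": public_key,
                },
            }
        }
    }


def merge_servers(existing, new):
    """Return a copy of `existing` with the servers from `new` added.

    Other servers already present in `existing` are left untouched.
    """
    merged = dict(existing)
    servers = dict(merged.get("mcpServers", {}))
    servers.update(new.get("mcpServers", {}))
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    entry = build_mcp_config("your_token", "your_token", "your_key")
    current = {"mcpServers": {"other": {"command": "other-server"}}}
    print(json.dumps(merge_servers(current, entry), indent=2))
```

Merging instead of overwriting keeps any other MCP servers that are already configured in the same file.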
Available Tools
Default Tools (Compact Surface)
By default, the server exposes a compact, LLM-friendly tool set:
1. search_engine - Web Search
High-level web search wrapper optimized for LLMs.
Parameters:
- q (required): Search query string
- engine (default: "google"): Search engine ("google", "bing", "yandex")
- num (default: 10): Number of results (1-50)
- start (default: 0): Starting position for pagination
- country: Country code for geolocation (e.g., "US", "JP")
- language: Language code (e.g., "en", "ja")
Example:
{
"q": "Python web scraping",
"engine": "google",
"num": 10
}
Response:
{
"ok": true,
"output": {
"results": [
{
"title": "Web Scraping with Python",
"link": "https://example.com",
"description": "Learn web scraping..."
}
],
"meta": {
"engine": "google",
"q": "Python web scraping",
"num": 10
}
}
}
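On the client side, a response in this shape can be reduced to title and link pairs. This is a minimal sketch that assumes the fields shown in the example response above; extract_results is a hypothetical helper, not part of the server.

```python
def extract_results(response):
    """Return (title, link) pairs from a search_engine response.

    Returns an empty list when the call failed ("ok" is false) or when
    no results are present.
    """
    if not response.get("ok"):
        return []
    results = response.get("output", {}).get("results", [])
    return [(item.get("title", ""), item.get("link", "")) for item in results]


# Sample taken from the example response above.
sample = {
    "ok": True,
    "output": {
        "results": [
            {
                "title": "Web Scraping with Python",
                "link": "https://example.com",
                "description": "Learn web scraping...",
            }
        ],
        "meta": {"engine": "google", "q": "Python web scraping", "num": 10},
    },
}

if __name__ == "__main__":
    for title, link in extract_results(sample):
        print(title + " -> " + link)
```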
2. search_engine_batch - Batch Web Search
Batch search with concurrent processing and per-item error handling.
Parameters:
- requests (required): Array of search request objects
- concurrency (default: 5): Number of concurrent requests (1-20)
- engine (default: "google"): Default engine for all requests
- num (default: 10): Default number of results per request
Example:
{
"requests": [
{"q": "Python programming"},
{"q": "JavaScript frameworks"},
{"q": "Machine learning"}
],
"concurrency": 3
}
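A batch payload like the one above can be assembled from a plain list of queries. The sketch below is illustrative only: build_search_batch is a hypothetical helper, and the only constraint it enforces is the documented concurrency range of 1-20.

```python
def build_search_batch(queries, concurrency=5, engine="google", num=10):
    """Build a search_engine_batch payload from a list of query strings."""
    if not queries:
        raise ValueError("at least one query is required")
    if not 1 <= concurrency <= 20:
        raise ValueError("concurrency must be between 1 and 20")
    return {
        "requests": [{"q": query} for query in queries],
        "concurrency": concurrency,
        "engine": engine,
        "num": num,
    }


if __name__ == "__main__":
    payload = build_search_batch(
        ["Python programming", "JavaScript frameworks", "Machine learning"],
        concurrency=3,
    )
    print(payload)
```

Validating locally surfaces an out-of-range value before the request is sent.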
3. unlocker - Universal Web Scraper
Extract content from any webpage with JavaScript rendering support.
Parameters:
- url (required): Target URL to scrape
- js_render (default: false): Enable JavaScript rendering
- output_format (default: "html"): Output format ("html", "markdown", "png")
- country: Country code for geolocation
- wait_ms: Wait time in milliseconds before capture
- wait_for: CSS selector or text to wait for
- block_resources: Block resource types ("script", "image", "video")
Example:
{
"url": "https://example.com",
"js_render": true,
"output_format": "markdown"
}
Response:
{
"ok": true,
"output": {
"markdown": "# Example Page\n\nContent here...",
"format": "markdown"
}
}
4. unlocker_batch - Batch Web Scraping
Batch web scraping with concurrent processing and error isolation.
Parameters:
- requests (required): Array of request objects with url and optional parameters
- concurrency (default: 5): Number of concurrent requests (1-20)
Example:
{
"requests": [
{"url": "https://example.com", "js_render": true},
{"url": "https://example.org", "output_format": "markdown"}
],
"concurrency": 3
}
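The idea of concurrent processing with error isolation can be sketched with asyncio: a semaphore caps the number of in-flight requests, and each item's exception is caught and reported in its own result. This shows the general technique only and is not the server's actual implementation; run_batch and fake_fetch are hypothetical names, and fake_fetch stands in for a real scraping call.

```python
import asyncio


async def run_batch(items, worker, concurrency=5):
    """Run `worker` over `items` with bounded concurrency.

    A failure in one item is captured in that item's result and does not
    cancel or affect the others. Results keep the input order.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(item):
        async with semaphore:
            try:
                return {"ok": True, "output": await worker(item)}
            except Exception as exc:
                return {"ok": False, "error": {"message": str(exc)}}

    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_fetch(url):
    """Stand-in worker that fails for URLs containing "bad"."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise RuntimeError("simulated failure for " + url)
    return "content of " + url


def demo():
    urls = ["https://example.com", "https://bad.example", "https://example.org"]
    return asyncio.run(run_batch(urls, fake_fetch, concurrency=2))


if __name__ == "__main__":
    for result in demo():
        print(result)
```

Because every worker call is wrapped individually, the caller always receives one result per input, each marked with its own ok flag.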
5. browser - Browser Scraper
Navigate and capture ARIA/DOM snapshots using Playwright.
Parameters:
- url (required): Target URL to navigate
- filtered (default: true): Return filtered ARIA snapshot
- mode (default: "accessibility"): Snapshot mode ("accessibility" or "dom")
- max_items (default: 100): Maximum items in snapshot (1-500)
- max_chars (default: 20000): Maximum characters in snapshot
- include_dom (default: false): Include DOM snapshot
Example:
{
"url": "https://example.com",
"filtered": true,
"max_items": 50
}
6. smart_scrape - Intelligent Scraping
Automatically selects the best scraping method for any URL.
Parameters:
- url (required): Target URL to scrape
- prefer_structured (default: true): Prefer structured data extraction
- preview (default: true): Include raw HTML/JSON preview
- preview_max_chars (default: 20000): Maximum characters in preview
- max_wait_seconds (default: 300): Maximum wait time for task completion
- unlocker_output (default: "markdown"): Output format when using Unlocker fallback
Example:
{
"url": "https://amazon.com/dp/B08N5WRWNW",
"prefer_structured": true
}
Response:
{
"ok": true,
"output": {
"tool_used": "amazon_product",
"structured_data": {
"title": "Product Title",
"price": "$99.99",
...
},
"preview": "..."
}
}
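A caller will usually want the structured data when it is present and the preview otherwise. The following sketch assumes only the fields visible in the example response above (tool_used, structured_data, preview); pick_content is a hypothetical helper.

```python
def pick_content(response):
    """Return a (kind, payload) pair from a smart_scrape response.

    kind is "structured" when structured_data is non-empty, "preview"
    when only the preview is available, and "error" when ok is false.
    """
    if not response.get("ok"):
        return ("error", response.get("error"))
    output = response.get("output", {})
    if output.get("structured_data"):
        return ("structured", output["structured_data"])
    return ("preview", output.get("preview", ""))


if __name__ == "__main__":
    structured = {
        "ok": True,
        "output": {
            "tool_used": "amazon_product",
            "structured_data": {"title": "Product Title", "price": "$99.99"},
            "preview": "...",
        },
    }
    print(pick_content(structured))
```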
7. serp - SERP API (Advanced)
Low-level SERP scraper with full parameter control.
Parameters:
- action (required): Action to perform ("search" or "batch_search")
- params (required): Parameters dictionary
Example:
{
"action": "search",
"params": {
"q": "Python programming",
"engine": "google",
"num": 10,
"format": "light_json"
}
}
Error Handling
The server provides comprehensive error handling with detailed diagnostic information:
Error Response Format
{
"ok": false,
"error": {
"type": "not_found",
"code": "E3003",
"message": "HTTP 404 error: Page returned empty content...",
"details": {
"url": "https://example.com/not-found",
"status_code": 404
}
},
"request_id": "unique-request-id"
}
Error Types
- validation_error: Invalid parameters (E4001)
- not_found: Resource not found (E3003)
- permission_denied: Access forbidden (E1004)
- upstream_internal_error: Server errors (E2106)
- timeout_error: Request timeout (E2003)
- network_error: Network issues (E2001)
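The error types above lend themselves to a small client-side lookup. In the sketch below the type-to-code mapping is copied from the list, while the choice of which types count as retryable is an assumption made for illustration and is not something the server documents; should_retry is a hypothetical helper.

```python
# Error types and codes as listed above.
ERROR_CODES = {
    "validation_error": "E4001",
    "not_found": "E3003",
    "permission_denied": "E1004",
    "upstream_internal_error": "E2106",
    "timeout_error": "E2003",
    "network_error": "E2001",
}

# Assumption: these types are treated as transient and worth retrying.
RETRYABLE_TYPES = {"timeout_error", "network_error", "upstream_internal_error"}


def should_retry(response):
    """Return True when a failed response has a type assumed to be transient."""
    if response.get("ok"):
        return False
    error_type = response.get("error", {}).get("type")
    return error_type in RETRYABLE_TYPES


if __name__ == "__main__":
    not_found = {
        "ok": False,
        "error": {"type": "not_found", "code": "E3003"},
        "request_id": "unique-request-id",
    }
    print(should_retry(not_found))
```

Retrying a validation_error or not_found response would repeat the same failure, which is why those types are left out of the retryable set here.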
Special Features
- Special Character Detection: Automatically detects and reports problematic characters in search queries
- HTTP Status Code Mapping: Clear error messages for 404, 500, 403, etc.
- Empty Result Hints: Helpful notes for empty search results (e.g., Chinese query limitations)
- Batch Error Isolation: Individual request failures don't affect batch operations
Performance
- Response Time: 0.4-2 seconds for most operations
- Concurrent Processing: Supports up to 20 concurrent requests
- Batch Operations: Efficient batch processing with error isolation
- Resource Optimization: Smart caching and request optimization
Testing
The server has been extensively tested with 60+ test scenarios:
- HTTP Error Handling: All status codes properly handled
- Special Character Processing: Automatic detection and clear error messages
- Batch Operations: Concurrent processing with error isolation
- Empty Result Handling: Helpful hints for empty results
- Performance: Optimized response times and resource usage
Test Coverage: 100% of reported issues resolved
Architecture
thordata_mcp/
├── main.py                  # Entry point
├── registry.py              # Tool registration
├── config.py                # Configuration management
├── context.py               # Server context (client, browser session)
├── utils.py                 # Common utilities (error handling, responses)
├── browser_session.py       # Browser session management (Playwright)
├── aria_snapshot.py         # ARIA snapshot filtering
└── tools/
    ├── product_compact.py   # Main tool definitions (compact surface)
    ├── product.py           # Full product implementation
    └── data/                # Data plane tools
        ├── serp.py          # SERP backend integration
        ├── universal.py     # Universal scraper integration
        ├── browser.py       # Browser automation
        └── tasks.py         # Structured scraping tasks
Design Principles
- LLM-Friendly: Clean tool surface optimized for AI agents
- Robust Error Handling: Detailed error messages with diagnostic information
- Batch Support: Efficient concurrent processing
- Performance Optimized: Fast response times and resource efficiency
- Production Ready: Comprehensive testing and error handling
Deployment
Docker
docker build -t thordata-mcp-server .
docker run -e THORDATA_SCRAPER_TOKEN=... thordata-mcp-server
Docker Compose
See docker-compose.yml for a complete setup with Caddy reverse proxy.
License
MIT License. Copyright (c) 2026 Thordata.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Support
- Documentation: GitHub Wiki
- Issues: GitHub Issues
- Email: support@thordata.com
Acknowledgments
Built with:
- MCP - Model Context Protocol
- Thordata SDK - Web scraping SDK
- Playwright - Browser automation
Ready to give your AI agents web scraping superpowers?
Install now: pip install thordata-mcp-server