FetchV2 MCP Server

Python 3.10+ | License: MIT | Code style: ruff

A robust Model Context Protocol server for fetching and extracting web content using Trafilatura. Optimized for AI agents with clean markdown output.

Why FetchV2?

Trafilatura is the real star. Unlike basic HTML-to-markdown converters, Trafilatura is specifically designed for web content extraction:

  • Removes boilerplate (navbars, footers, ads, cookie banners)
  • Preserves article structure and tables
  • Extracts metadata (title, author, date) automatically
  • Handles edge cases like minimal-content SPAs

Graceful robots.txt handling. Instead of failing hard when robots.txt is unreachable, FetchV2 treats a timed-out or unavailable robots.txt as "allowed", which is more practical for real-world use.
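The fallback described above can be sketched with the standard library alone. This is an illustration of the policy, not FetchV2's actual code: the function name `is_allowed` and the convention of passing `None` for an unreachable robots.txt are assumptions made for the example.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: Optional[str], url: str, user_agent: str = "*") -> bool:
    """Return True if `url` may be fetched.

    `robots_txt` is the body of robots.txt, or None when it could not be
    retrieved (timeout, connection error, server unavailable).
    Hypothetical helper, not part of FetchV2's public API.
    """
    if robots_txt is None:
        # Unreachable robots.txt: fall back to "allowed" instead of failing.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private\n"
print(is_allowed(None, "https://example.com/private/page"))
print(is_allowed(rules, "https://example.com/private/page"))
print(is_allowed(rules, "https://example.com/docs/"))
```

Only the "could not retrieve" case is treated as permission; a robots.txt that was fetched and disallows the path is still honored.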

Features

  • Superior Content Extraction: Uses Trafilatura for high-quality HTML-to-markdown conversion
  • Robots.txt Compliance: Respects robots.txt by default, gracefully handles timeouts
  • Pagination Support: Read large pages in chunks with the start_index parameter
  • Multi-URL Fetching: Fetch up to 10 URLs in a single request
  • Link Discovery: Extract and filter links from any webpage
  • Raw Mode: Get unprocessed content when needed
  • Markdown Detection: Automatically handles .md files without extraction

Installation

# Clone the repo
git clone https://github.com/praveenc/fetchv2-mcp-server.git
cd fetchv2-mcp-server

# Using uv (recommended)
uv sync
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Or using pip
python -m venv .venv
source .venv/bin/activate
pip install -e .

Available Tools

fetch

Fetch a single webpage and extract its main content as clean markdown.

Use when: Reading an article, documentation page, or blog post.

Parameters:

  • url (required): The webpage URL to fetch
  • max_length (default: 5000): Maximum characters to return (use 1000-2000 for summaries)
  • start_index (default: 0): Character offset for pagination
  • get_raw_html (default: false): Skip extraction, return original HTML
  • include_metadata (default: true): Include title, author, date at top
  • include_tables (default: true): Preserve tables in markdown format
  • include_links (default: false): Preserve hyperlinks in output
  • bypass_robots_txt (default: false): Skip robots.txt check (user-initiated only)
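To show how max_length and start_index interact, here is a minimal sketch of character-offset pagination. The `paginate` helper and its return shape are assumptions for illustration; how the server itself reports that more content remains is not documented here.

```python
def paginate(content: str, start_index: int = 0, max_length: int = 5000):
    """Return (chunk, next_start_index); next is None when nothing is left.

    Hypothetical helper that mirrors the documented parameter defaults.
    """
    chunk = content[start_index:start_index + max_length]
    next_index = start_index + len(chunk)
    if next_index >= len(content):
        return chunk, None
    return chunk, next_index


# Walk a 12,000-character document in default-sized pages.
page = "x" * 12000
start = 0
sizes = []
while start is not None:
    chunk, start = paginate(page, start_index=start)
    sizes.append(len(chunk))
print(sizes)  # [5000, 5000, 2000]
```

A caller that receives a truncated result would repeat the fetch call with start_index set to the number of characters already read.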

fetch_batch

Fetch multiple webpages in a single request. Fewer round trips mean faster workflows.

Use when: You have 2-10 URLs to read (e.g., from discover_links results).

Parameters:

  • urls (required): List of URLs (max 10)
  • max_length_per_url (default: 2000): Character limit per URL
  • get_raw_html (default: false): Skip extraction for all URLs
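A rough sketch of what a batch fetch with these parameters could look like. It assumes the URLs are fetched concurrently and that one failing URL does not abort the rest; neither is confirmed by this README. `fetch_many` and the injected `fetch_one` coroutine are hypothetical names.

```python
import asyncio

MAX_BATCH_URLS = 10  # limit stated in the parameter list above


async def fetch_many(urls, fetch_one, max_length_per_url=2000):
    """Fetch every URL with `fetch_one` and truncate each result.

    `fetch_one` is any coroutine function taking a URL and returning text
    (an assumption; the real server has its own fetch logic).
    """
    if not urls:
        raise ValueError("urls must not be empty")
    if len(urls) > MAX_BATCH_URLS:
        raise ValueError(f"at most {MAX_BATCH_URLS} URLs per batch")
    # return_exceptions=True keeps one failure from cancelling the others.
    results = await asyncio.gather(
        *(fetch_one(url) for url in urls), return_exceptions=True
    )
    output = {}
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            output[url] = f"Error: {result}"
        else:
            output[url] = result[:max_length_per_url]
    return output


async def fake_fetch(url):
    # Stand-in for a real HTTP fetch so the sketch runs offline.
    if "bad" in url:
        raise RuntimeError("boom")
    return url * 1000


batch = asyncio.run(
    fetch_many(["https://a.example", "https://bad.example"], fake_fetch, 50)
)
print(batch)
```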

discover_links

Discover all links on a webpage. Use before fetch_batch to find relevant URLs.

Use when: Exploring a site to find relevant pages before fetching.

Parameters:

  • url (required): The webpage URL to scan for links
  • filter_pattern (optional): Regex to filter links (e.g., /docs/, \.pdf$)
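The effect of filter_pattern can be illustrated with a small standard-library sketch. It assumes the regex is applied with a search (not a full match) against the absolute URL, and that duplicates are dropped; the real tool may differ on both points. `discover` and `_LinkCollector` are names made up for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover(html, base_url, filter_pattern=None):
    """Return absolute, de-duplicated links, optionally filtered by regex."""
    collector = _LinkCollector()
    collector.feed(html)
    pattern = re.compile(filter_pattern) if filter_pattern else None
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if pattern and not pattern.search(absolute):
            continue
        if absolute not in links:
            links.append(absolute)
    return links


sample = (
    '<a href="/docs/hooks/">Hooks</a>'
    '<a href="/blog/">Blog</a>'
    '<a href="guide.pdf">Guide</a>'
    '<a href="/docs/hooks/">Hooks again</a>'
)
print(discover(sample, "https://example.com/docs/", r"/docs/"))
print(discover(sample, "https://example.com/docs/", r"\.pdf$"))
```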

Real-World Use Cases

Discovery → Batch Fetch Workflow

First, discover what pages exist:

discover_links(url="https://kiro.dev/docs/", filter_pattern="/docs/")

Tool Output:

# Links from https://kiro.dev/docs/
Found 11 links

- https://kiro.dev/docs/getting-started/installation/
- https://kiro.dev/docs/getting-started/first-project/
- https://kiro.dev/docs/specs/
- https://kiro.dev/docs/hooks/
- https://kiro.dev/docs/chat/
- https://kiro.dev/docs/steering/
- https://kiro.dev/docs/mcp/
...

Then fetch multiple pages at once:

fetch_batch(
  urls=["https://kiro.dev/docs/specs/", "https://kiro.dev/docs/hooks/", "https://kiro.dev/docs/steering/"],
  max_length_per_url=1500
)

Tool Output:

## https://kiro.dev/docs/specs/
<!-- Type: markdown (extracted) -->

Specs or specifications are structured artifacts that formalize the development
process for complex features in your application...

---

## https://kiro.dev/docs/hooks/
<!-- Type: markdown (extracted) -->

Agent hooks are powerful automation tools that streamline your development
workflow by automatically executing predefined agent actions...

---

## https://kiro.dev/docs/steering/
<!-- Type: markdown (extracted) -->

Steering gives Kiro persistent knowledge about your workspace through markdown
files. Instead of explaining your conventions in every chat...
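Between the two calls above, the link list has to be turned into the urls argument. A minimal sketch, assuming the discover_links output keeps the "- <url>" bullet format shown in the example; `urls_from_link_list` is a hypothetical helper.

```python
def urls_from_link_list(tool_output, limit=10):
    """Pull URLs out of '- <url>' bullet lines, capped at the batch limit."""
    urls = []
    for line in tool_output.splitlines():
        line = line.strip()
        if line.startswith("- http"):
            urls.append(line[2:].strip())
    return urls[:limit]


sample_output = """# Links from https://kiro.dev/docs/
Found 11 links

- https://kiro.dev/docs/specs/
- https://kiro.dev/docs/hooks/
- https://kiro.dev/docs/steering/
"""
print(urls_from_link_list(sample_output))
```

The default limit of 10 matches the documented maximum for fetch_batch, so the result can be passed straight through.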

Use Case Examples

discover_links:

  • Docs crawling - Find all pages before scraping
  • Competitive research - Extract blog post links from a site
  • API discovery - Find all API endpoint documentation pages

fetch_batch:

  • Comparison research - Fetch React, Vue, and Svelte docs to compare approaches
  • Onboarding context - Grab multiple docs pages to understand a new tool
  • Multi-source fact-checking - Get the same topic from different sources

Key value: fewer round trips. Instead of 10 separate fetch calls (10 tool invocations, 10 approvals in supervised mode), you get everything in 1-2 calls.

Configuration

Kiro / VS Code

Add to .kiro/settings/mcp.json:

{
  "mcpServers": {
    "fetchv2": {
      "command": "uv",
      "args": ["--directory", "/path/to/fetchv2-mcp-server", "run", "python", "-m", "fetchv2_mcp_server"]
    }
  }
}

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "fetchv2": {
      "command": "uv",
      "args": ["--directory", "/path/to/fetchv2-mcp-server", "run", "python", "-m", "fetchv2_mcp_server"]
    }
  }
}
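Both clients take the same server entry, so it can be generated instead of hand-edited. The sketch below merges the entry into an existing config dictionary without touching other servers; `with_fetchv2` is a hypothetical helper and the repository path is a placeholder.

```python
import json


def with_fetchv2(config, repo_dir):
    """Return a copy of `config` with the fetchv2 server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["fetchv2"] = {
        "command": "uv",
        "args": ["--directory", repo_dir, "run", "python", "-m", "fetchv2_mcp_server"],
    }
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other": {"command": "npx", "args": ["other-server"]}}}
updated = with_fetchv2(existing, "/path/to/fetchv2-mcp-server")
print(json.dumps(updated, indent=2))
```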

Prompts

  • fetch_manual - User-initiated fetch that bypasses robots.txt
  • research_topic - Research a topic by fetching multiple relevant URLs

Development

# Install dev dependencies
uv sync --dev

# Run with MCP Inspector
mcp dev src/fetchv2_mcp_server/server.py

# Type checking
uv run pyright

# Linting
uv run ruff check .

Project Structure

fetchv2-mcp-server/
├── pyproject.toml
├── README.md
└── src/
    └── fetchv2_mcp_server/
        ├── __init__.py
        ├── __main__.py
        └── server.py

License

MIT
