A robust MCP server for fetching and extracting web content using Trafilatura
FetchV2 MCP Server
Model Context Protocol (MCP) server for web content fetching and extraction.
This MCP server provides tools to fetch webpages, extract clean content using Trafilatura, and discover links for batch processing.
Features
- Fetch Webpages: Extract clean markdown content from any URL
- Batch Fetching: Fetch up to 10 URLs in a single request
- Link Discovery: Find and filter links on any webpage
- llms.txt Support: Parse and fetch LLM-friendly documentation indexes
- Smart Extraction: Trafilatura removes boilerplate (navbars, ads, footers)
- Robots.txt Compliance: Respects robots.txt with graceful timeout handling
- Pagination Support: Handle large pages with the `start_index` parameter
Prerequisites
- Install `uv` from Astral
- Install Python 3.10 or newer using `uv python install 3.10`
Installation
| Cursor | VS Code |
|---|---|
| Install MCP Server | Install on VS Code |
Or configure manually in your MCP client:
{
  "mcpServers": {
    "fetchv2": {
      "command": "uvx",
      "args": ["fetchv2-mcp-server@latest"],
      "disabled": false,
      "autoApprove": []
    }
  }
}
Config file locations:
- Claude Desktop (macOS): `~/Library/Application Support/Claude/claude_desktop_config.json`
- Claude Desktop (Windows): `%APPDATA%\Claude\claude_desktop_config.json`
- Windsurf: `~/.codeium/windsurf/mcp_config.json`
- Kiro: `.kiro/settings/mcp.json` in your project
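If a client already has a config file with other servers in it, the entry can be merged in with a short script instead of by hand. The following is a minimal sketch using only the standard library; the helper name `add_fetchv2_server` and the throwaway demo file are illustrative assumptions and are not part of this project.

```python
import json
import tempfile
from pathlib import Path

# Server entry copied from the manual configuration in this README.
FETCHV2_ENTRY = {
    "command": "uvx",
    "args": ["fetchv2-mcp-server@latest"],
    "disabled": False,
    "autoApprove": [],
}


def add_fetchv2_server(config_path: Path) -> dict:
    """Merge the fetchv2 entry into an MCP client config, keeping other servers.

    Hypothetical helper for illustration; not shipped with fetchv2-mcp-server.
    """
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            config = json.loads(text)
    config.setdefault("mcpServers", {})["fetchv2"] = FETCHV2_ENTRY
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demonstrate against a throwaway file instead of a real client config.
demo_path = Path(tempfile.mkdtemp()) / "mcp_config.json"
demo_path.write_text(
    json.dumps({"mcpServers": {"other": {"command": "npx"}}}), encoding="utf-8"
)
merged = add_fetchv2_server(demo_path)
print(sorted(merged["mcpServers"]))
```

To use it for real, pass one of the config paths listed above, ideally after making a backup copy of the file.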
Install from PyPI
# Using uv
uv add fetchv2-mcp-server
# Using pip
pip install fetchv2-mcp-server
Basic Usage
Example prompts to try:
- "Fetch the documentation from `<URL>`"
- "Find all links on `<docs URL>` that contain 'tutorial'"
- "Read these three pages and summarize the differences: `[url1, url2, url3]`"
Available Tools
fetch
Fetches a webpage and extracts its main content as clean markdown.
fetch(url: str, max_length: int = 5000, start_index: int = 0) -> str
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | required | The webpage URL to fetch |
| `max_length` | `int` | 5000 | Maximum characters to return |
| `start_index` | `int` | 0 | Character offset for pagination |
| `get_raw_html` | `bool` | false | Skip extraction, return raw HTML |
| `include_metadata` | `bool` | true | Include title, author, date |
| `include_tables` | `bool` | true | Preserve tables in markdown |
| `include_links` | `bool` | false | Preserve hyperlinks |
| `bypass_robots_txt` | `bool` | false | Skip robots.txt check |
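The `max_length` and `start_index` parameters together allow a long page to be read in consecutive chunks. The sketch below illustrates the loop a caller might run. It does not contact the server: `fake_fetch` is a hypothetical local stand-in that slices a string the way the parameter descriptions suggest, and the real tool's output may carry metadata or truncation notices that this sketch does not model.

```python
from typing import Callable

# Stand-in for the extracted content of a long page.
PAGE = "".join(f"Paragraph {i}. " for i in range(1000))


def fake_fetch(url: str, max_length: int = 5000, start_index: int = 0) -> str:
    """Hypothetical stand-in for the fetch tool: returns one slice of PAGE."""
    return PAGE[start_index:start_index + max_length]


def read_all(fetch: Callable[..., str], url: str, max_length: int = 5000) -> str:
    """Call fetch repeatedly, advancing start_index until a short chunk arrives."""
    chunks = []
    start_index = 0
    while True:
        chunk = fetch(url, max_length=max_length, start_index=start_index)
        chunks.append(chunk)
        if len(chunk) < max_length:  # last (possibly empty) chunk
            break
        start_index += len(chunk)
    return "".join(chunks)


full_text = read_all(fake_fetch, "https://docs.example.com/guide")
print(len(full_text) == len(PAGE))
```

Advancing by the length of the chunk actually returned, rather than by `max_length`, keeps the loop correct even if a call returns fewer characters than requested.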
fetch_batch
Fetches multiple webpages in a single request.
fetch_batch(urls: list[str], max_length_per_url: int = 2000) -> str
| Parameter | Type | Default | Description |
|---|---|---|---|
| `urls` | `list[str]` | required | List of URLs (max 10) |
| `max_length_per_url` | `int` | 2000 | Character limit per URL |
| `get_raw_html` | `bool` | false | Skip extraction for all URLs |
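Because a single `fetch_batch` call accepts at most 10 URLs, a longer list has to be split before it is submitted. A minimal sketch of that split follows; `chunk_urls` and the example URLs are hypothetical names used only for illustration.

```python
from typing import Iterator

MAX_BATCH_SIZE = 10  # limit stated in the fetch_batch parameter table


def chunk_urls(urls: list[str], size: int = MAX_BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of urls, each holding at most size items."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


urls = [f"https://docs.example.com/page/{n}" for n in range(23)]
batches = list(chunk_urls(urls))
print([len(batch) for batch in batches])  # [10, 10, 3]
```

Each resulting batch can then be passed as the `urls` argument of a separate `fetch_batch` call.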
discover_links
Discovers all links on a webpage with optional filtering.
discover_links(url: str, filter_pattern: str = "") -> str
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | required | The webpage URL to scan |
| `filter_pattern` | `str` | "" | Regex to filter links (e.g., `/docs/`) |
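To get a feel for how a `filter_pattern` narrows a set of links, the pattern can be tried locally first. The sketch below assumes the pattern is applied as a regular-expression search anywhere in the URL and that an empty pattern keeps every link; the server's exact matching rules are not documented here, so treat both as assumptions. `filter_links` and the sample URLs are hypothetical.

```python
import re

links = [
    "https://docs.example.com/docs/intro",
    "https://docs.example.com/docs/api/reference",
    "https://docs.example.com/blog/2024/release",
    "https://docs.example.com/pricing",
]


def filter_links(candidates: list[str], filter_pattern: str = "") -> list[str]:
    """Keep links where the regex matches anywhere; an empty pattern keeps all."""
    if not filter_pattern:
        return list(candidates)
    pattern = re.compile(filter_pattern)
    return [link for link in candidates if pattern.search(link)]


print(filter_links(links, "/docs/"))
print(filter_links(links, r"/(docs|blog)/"))
```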
fetch_llms_txt
Fetch and parse an llms.txt file to discover LLM-friendly documentation.
fetch_llms_txt(url: str, include_content: bool = False) -> str
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | required | URL to an llms.txt file |
| `include_content` | `bool` | false | Also fetch content of all linked pages |
| `max_length_per_url` | `int` | 2000 | When `include_content=True`, max chars per page |
⚠️ Important: By default, only the llms.txt index is fetched; the linked markdown files are NOT downloaded to context. Set `include_content=True` to explicitly fetch all linked pages.
Example:
# DEFAULT: Only fetches the index (lightweight, ~1KB)
fetch_llms_txt(url="https://docs.example.com/llms.txt")
# Returns: title + list of links with descriptions
# EXPLICIT: Fetches index + all linked .md files (can be large)
fetch_llms_txt(url="https://docs.example.com/llms.txt", include_content=True)
# Returns: structure + content of all linked pages
Note: Relative URLs (e.g., /docs/guide.md) are automatically resolved to absolute URLs.
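The resolution of relative URLs can be pictured with the standard library's `urljoin`. The sketch below parses link lines from a small sample index and resolves each URL against the location of the llms.txt file. It assumes the common `- [title](url): description` line shape; the server's own parser may handle more cases, and `parse_links` and the sample content are hypothetical.

```python
import re
from urllib.parse import urljoin

LLMS_TXT_URL = "https://docs.example.com/llms.txt"
SAMPLE = """\
# Example Docs

> Short summary of the project.

## Guides

- [Getting started](/docs/guide.md): Install and first steps
- [API reference](https://api.example.com/reference.md)
"""

# Matches "- [title](url)" with an optional ": description" suffix.
LINK_LINE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)


def parse_links(text: str, base_url: str) -> list[dict[str, str]]:
    """Collect link entries and resolve relative URLs against base_url."""
    entries = []
    for line in text.splitlines():
        match = LINK_LINE.match(line.strip())
        if match:
            entries.append({
                "title": match["title"],
                "url": urljoin(base_url, match["url"]),
                "description": match["desc"] or "",
            })
    return entries


for entry in parse_links(SAMPLE, LLMS_TXT_URL):
    print(entry["url"])
```

With this sample, `/docs/guide.md` resolves to `https://docs.example.com/docs/guide.md`, while the already absolute URL is left unchanged.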
Workflow Example
Step 1: Discover relevant documentation pages
discover_links(url="https://docs.example.com/", filter_pattern="/guide/")
Step 2: Batch fetch the pages you need
fetch_batch(urls=["https://docs.example.com/guide/intro", "https://docs.example.com/guide/setup"])
Prompts
- fetch_manual - User-initiated fetch that bypasses robots.txt
- research_topic - Research a topic by fetching multiple relevant URLs
Development
# Clone and install
git clone https://github.com/praveenc/fetchv2-mcp-server.git
cd fetchv2-mcp-server
uv sync --dev
source .venv/bin/activate
# Run tests
uv run pytest
# Run with MCP Inspector
mcp dev src/fetchv2_mcp_server/server.py
# Linting and type checking
uv run ruff check .
uv run pyright
License
MIT - see LICENSE for details.
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
Support
For issues and questions, use the GitHub issue tracker.