A robust MCP server for fetching and extracting web content using Trafilatura
FetchV2 MCP Server
A robust Model Context Protocol server for fetching and extracting web content using Trafilatura. Optimized for AI agents with clean markdown output.
Why FetchV2?
Trafilatura is the real star. Unlike basic HTML-to-markdown converters, Trafilatura is specifically designed for web content extraction:
- Removes boilerplate (navbars, footers, ads, cookie banners)
- Preserves article structure and tables
- Extracts metadata (title, author, date) automatically
- Handles edge cases like minimal-content SPAs
Graceful robots.txt handling. Instead of failing hard when robots.txt is unreachable, FetchV2 treats a timeout or an unavailable file as "allowed", which is more practical for real-world use.
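The fail-open behaviour described above can be sketched with the standard library alone. This is an illustration of the idea, not FetchV2's actual code: `is_allowed` and the caller-supplied `fetch_text` callable are hypothetical names, and the assumption is that any network failure (including a timeout) surfaces as an `OSError`.

```python
from typing import Callable
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def is_allowed(url: str, user_agent: str, fetch_text: Callable[[str], str]) -> bool:
    """Check robots.txt, treating an unreachable file as 'allowed'."""
    parts = urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    try:
        # fetch_text is a hypothetical downloader supplied by the caller.
        body = fetch_text(robots_url)
    except OSError:
        # TimeoutError and ConnectionError are OSError subclasses:
        # an unreachable robots.txt does not block the fetch.
        return True
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser.can_fetch(user_agent, url)
```

Passing the downloader in as a callable keeps the policy (fail open) separate from the HTTP client, so the same check works with any client that raises `OSError` on failure.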
Features
- Superior Content Extraction: Uses Trafilatura for high-quality HTML-to-markdown conversion
- Robots.txt Compliance: Respects robots.txt by default, gracefully handles timeouts
- Pagination Support: Handle large pages with the start_index parameter
- Multi-URL Fetching: Fetch up to 10 URLs in a single request
- Link Discovery: Extract and filter links from any webpage
- Raw Mode: Get unprocessed content when needed
- Markdown Detection: Automatically handles .md files without extraction
Installation
# Clone the repo
git clone https://github.com/praveenc/fetchv2-mcp-server.git
cd fetchv2-mcp-server
# Using uv (recommended)
uv sync
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Or using pip
python -m venv .venv
source .venv/bin/activate
pip install -e .
Available Tools
fetch
Fetch a single webpage and extract its main content as clean markdown.
Use when: Reading an article, documentation page, or blog post.
Parameters:
- url (required): The webpage URL to fetch
- max_length (default: 5000): Maximum characters to return (use 1000-2000 for summaries)
- start_index (default: 0): Character offset for pagination
- get_raw_html (default: false): Skip extraction, return original HTML
- include_metadata (default: true): Include title, author, date at top
- include_tables (default: true): Preserve tables in markdown format
- include_links (default: false): Preserve hyperlinks in output
- bypass_robots_txt (default: false): Skip robots.txt check (user-initiated only)
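To make the interplay of max_length and start_index concrete, here is a minimal sketch of character-offset pagination. The `paginate` helper and its "next index or None" return value are assumptions for illustration; the README does not say how the server signals that more content remains.

```python
from typing import Optional, Tuple


def paginate(text: str, start_index: int = 0, max_length: int = 5000) -> Tuple[str, Optional[int]]:
    """Return one chunk of text plus the offset to request next (None when done)."""
    chunk = text[start_index:start_index + max_length]
    next_index = start_index + len(chunk)
    # Only report a next offset if there is content left beyond this chunk.
    return chunk, (next_index if next_index < len(text) else None)
```

A caller would feed the returned offset back in as start_index on the following fetch call until it receives None.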
fetch_batch
Fetch multiple webpages in a single request. Fewer round trips = faster workflows.
Use when: You have 2-10 URLs to read (e.g., from discover_links results).
Parameters:
- urls (required): List of URLs (max 10)
- max_length_per_url (default: 2000): Character limit per URL
- get_raw_html (default: false): Skip extraction for all URLs
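Because urls is capped at 10 entries, a client with a longer list has to split it before calling fetch_batch. The helper below is a client-side sketch; `chunk_urls` is a hypothetical name and the de-duplication step is an added convenience, not something the server is documented to require.

```python
from typing import List


def chunk_urls(urls: List[str], batch_size: int = 10) -> List[List[str]]:
    """Split URLs into batches no larger than batch_size, dropping duplicates."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(urls))
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```

Each returned batch can then be passed as the urls argument of one fetch_batch call.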
discover_links
Discover all links on a webpage. Use before fetch_batch to find relevant URLs.
Use when: Exploring a site to find relevant pages before fetching.
Parameters:
- url (required): The webpage URL to scan for links
- filter_pattern (optional): Regex to filter links (e.g., /docs/, \.pdf$)
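As a rough picture of what link discovery with a regex filter involves, the sketch below collects anchor hrefs with the standard-library HTML parser, resolves them against the page URL, and applies the pattern. It is an assumption that the filter is matched with `re.search` against the absolute URL; `extract_links` and `_LinkCollector` are illustrative names, not the server's internals.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str, filter_pattern: Optional[str] = None) -> List[str]:
    """Return unique absolute links, optionally keeping only regex matches."""
    collector = _LinkCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, href) for href in collector.hrefs]
    unique = list(dict.fromkeys(absolute))  # de-duplicate, keep page order
    if filter_pattern is None:
        return unique
    regex = re.compile(filter_pattern)
    return [link for link in unique if regex.search(link)]
```

With this matching rule, a pattern such as /docs/ keeps any link whose URL contains that path segment, while \.pdf$ keeps only links ending in .pdf.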
Real-World Use Cases
Discovery → Batch Fetch Workflow
First, discover what pages exist:
discover_links(url="https://kiro.dev/docs/", filter_pattern="/docs/")
Tool Output:
# Links from https://kiro.dev/docs/
Found 11 links
- https://kiro.dev/docs/getting-started/installation/
- https://kiro.dev/docs/getting-started/first-project/
- https://kiro.dev/docs/specs/
- https://kiro.dev/docs/hooks/
- https://kiro.dev/docs/chat/
- https://kiro.dev/docs/steering/
- https://kiro.dev/docs/mcp/
...
Then fetch multiple pages at once:
fetch_batch(
    urls=["https://kiro.dev/docs/specs/", "https://kiro.dev/docs/hooks/", "https://kiro.dev/docs/steering/"],
    max_length_per_url=1500
)
Tool Output:
## https://kiro.dev/docs/specs/
<!-- Type: markdown (extracted) -->
Specs or specifications are structured artifacts that formalize the development
process for complex features in your application...
---
## https://kiro.dev/docs/hooks/
<!-- Type: markdown (extracted) -->
Agent hooks are powerful automation tools that streamline your development
workflow by automatically executing predefined agent actions...
---
## https://kiro.dev/docs/steering/
<!-- Type: markdown (extracted) -->
Steering gives Kiro persistent knowledge about your workspace through markdown
files. Instead of explaining your conventions in every chat...
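A client that wants one entry per page can split the combined output shown above back into sections. This sketch assumes the layout is exactly as in the sample (a "## URL" heading, an HTML comment line, the body, and "---" between pages); `split_batch_output` is a hypothetical helper, and a page body that itself contains a "---" line would be split incorrectly.

```python
from typing import Dict


def split_batch_output(output: str) -> Dict[str, str]:
    """Map each URL heading in a fetch_batch result to its extracted body."""
    pages: Dict[str, str] = {}
    for section in output.split("\n---\n"):
        lines = section.strip().splitlines()
        if not lines or not lines[0].startswith("## "):
            continue  # skip anything that does not start with a URL heading
        url = lines[0][3:].strip()
        # Drop the "<!-- Type: ... -->" marker line, keep the content.
        body = [line for line in lines[1:] if not line.startswith("<!--")]
        pages[url] = "\n".join(body).strip()
    return pages
```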
Use Case Examples
discover_links:
- Docs crawling - Find all pages before scraping
- Competitive research - Extract blog post links from a site
- API discovery - Find all API endpoint documentation pages
fetch_batch:
- Comparison research - Fetch React, Vue, and Svelte docs to compare approaches
- Onboarding context - Grab multiple docs pages to understand a new tool
- Multi-source fact-checking - Get the same topic from different sources
Key value: fewer round trips. Instead of 10 separate fetch calls (10 tool invocations, 10 approvals in supervised mode), you get everything in 1-2 calls.
Configuration
Kiro / VS Code
Add to .kiro/settings/mcp.json:
{
  "mcpServers": {
    "fetchv2": {
      "command": "uv",
      "args": ["--directory", "/path/to/fetchv2-mcp-server", "run", "python", "-m", "fetchv2_mcp_server"]
    }
  }
}
Claude Desktop
{
  "mcpServers": {
    "fetchv2": {
      "command": "uv",
      "args": ["--directory", "/path/to/fetchv2-mcp-server", "run", "python", "-m", "fetchv2_mcp_server"]
    }
  }
}
Prompts
- fetch_manual - User-initiated fetch that bypasses robots.txt
- research_topic - Research a topic by fetching multiple relevant URLs
Development
# Install dev dependencies
uv sync --dev
# Run with MCP Inspector
mcp dev src/fetchv2_mcp_server/server.py
# Type checking
uv run pyright
# Linting
uv run ruff check .
Project Structure
fetchv2_mcp_server/
├── pyproject.toml
├── README.md
└── src/
    └── fetchv2_mcp_server/
        ├── __init__.py
        ├── __main__.py
        └── server.py
License
MIT