
Single-File MCP Server

A powerful Model Context Protocol (MCP) server that provides intelligent web content extraction using single-file and trafilatura. Perfect for AI agents that need to access and analyze web content from JavaScript-heavy sites.

GitHub Repository: https://github.com/kwinsch/singlefile-mcp

Features

🌐 Universal Web Content Access

  • JavaScript Support: Handles modern SPA/React/Vue apps that require browser rendering
  • Clean Content Extraction: Uses trafilatura to pull the main article content out of the captured page
  • Rich Metadata: Extracts title, author, date, description, and more
  • Multiple Output Formats: Raw HTML or clean markdown-like content

📄 Smart Pagination & Token Management

  • Flexible Pagination: Offset/limit system like file reading tools
  • Token Limits: Configurable max tokens (up to 25,000)
  • Smart Truncation: Summary mode shows beginning + end, truncate mode cuts cleanly
  • Navigation Hints: Clear guidance on how to continue reading large documents

⚡ Performance & Control

  • Selective Loading: Block images/scripts for faster processing
  • Content Compression: Optional HTML compression
  • Timeout Protection: Configurable timeouts prevent hanging
  • Error Handling: Graceful degradation when extraction fails

Installation

Prerequisites

  • Python 3.8+
  • single-file CLI - Web page capture tool
  • Node.js 16+ (for single-file)
  • A supported browser (Chromium, Chrome, Edge, Firefox, etc.)

Install single-file CLI

The single-file CLI is essential for this MCP server to work. It uses a real browser engine to accurately capture JavaScript-rendered content.

npm install -g single-file-cli

Usage with Claude Code

Quick Install (from PyPI)

claude mcp add singlefile-mcp -s user -- uvx singlefile-mcp

This installs and runs the package straight from PyPI, just like the Brave Search MCP server shown below.

Development Install (from local directory)

claude mcp add singlefile-mcp -s user -- uvx --from /path/to/single-file_mcp singlefile-mcp

Remove old server (if upgrading)

claude mcp remove single-file-fetcher --scope user

Optional: Add Brave Search MCP

claude mcp add brave-search -s user -- env BRAVE_API_KEY=YOUR_KEY npx -y @modelcontextprotocol/server-brave-search

API Reference

fetch_webpage

Fetch and process web content with intelligent extraction.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | URL of the webpage to fetch |
| output_content | boolean | true | Whether to return content in the response |
| extract_content | boolean | false | Extract clean text content (recommended) |
| include_metadata | boolean | true | Include page metadata (title, author, etc.) |
| block_images | boolean | false | Block image downloads for faster processing |
| block_scripts | boolean | true | Block JavaScript execution |
| compress_html | boolean | true | Compress HTML output |
| max_tokens | number | 20000 | Maximum tokens in response (max: 25000) |
| truncate_method | string | "truncate" | How to handle large content: "truncate" or "summary" |
| offset | number | 0 | Character offset to start reading from |
| limit | number | null | Maximum characters to return |

Examples

Basic content extraction:

fetch_webpage(
    url="https://example.com/article",
    extract_content=True,
    include_metadata=True
)

Paginated reading of large documents:

# Get overview
fetch_webpage(
    url="https://docs.example.com/guide",
    extract_content=True,
    limit=5000
)

# Continue reading from offset
fetch_webpage(
    url="https://docs.example.com/guide", 
    extract_content=True,
    offset=5000,
    limit=5000
)

Raw HTML for complex parsing:

fetch_webpage(
    url="https://app.example.com/dashboard",
    extract_content=False,
    block_scripts=False,
    max_tokens=15000
)
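
The calls above show the arguments as an MCP client passes them to the tool. To drive the server from your own Python code, one option is the official mcp client SDK; the sketch below is illustrative only and assumes the server is launched with uvx singlefile-mcp, mirroring the install command earlier:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the server over stdio, the same way the claude mcp add command does.
    server = StdioServerParameters(command="uvx", args=["singlefile-mcp"])

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Invoke fetch_webpage with the parameters documented above.
            result = await session.call_tool(
                "fetch_webpage",
                {
                    "url": "https://example.com/article",
                    "extract_content": True,
                    "include_metadata": True,
                    "limit": 5000,
                },
            )
            for block in result.content:
                if block.type == "text":
                    print(block.text)

asyncio.run(main())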

Practical Example: Research Workflow

Here's a real-world example combining Brave Search and Single-File MCP:

Step 1: Search for information

# Using Brave Search MCP
brave_web_search(
    query="artificial intelligence history timeline",
    count=5
)

Step 2: Fetch and analyze Wikipedia article

# Using Single-File MCP to extract content
fetch_webpage(
    url="https://en.wikipedia.org/wiki/History_of_artificial_intelligence",
    extract_content=True,
    include_metadata=True,
    limit=5000  # Get first 5000 chars
)

Result:

Successfully fetched webpage: https://en.wikipedia.org/wiki/History_of_artificial_intelligence

## Metadata
**Title:** History of artificial intelligence - Wikipedia
**Description:** The history of artificial intelligence (AI) began in antiquity...
**Site:** wikipedia.org

## Extracted Content (chars 0-5000 of 45000)
*Note: More content available. Use offset=5000 to continue.*

# History of artificial intelligence

The history of artificial intelligence (AI) began in antiquity, with myths, 
stories and rumors of artificial beings endowed with intelligence...

[Clean, readable article content follows...]

Step 3: Continue reading with pagination

# Get next section
fetch_webpage(
    url="https://en.wikipedia.org/wiki/History_of_artificial_intelligence",
    extract_content=True,
    offset=5000,
    limit=5000
)

This workflow enables AI agents to:

  1. Search for current information beyond their training data
  2. Extract clean, structured content from any webpage
  3. Process JavaScript-heavy sites that plain HTTP fetchers cannot render
  4. Paginate through long documents intelligently (see the sketch below)
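
The offset/limit parameters make that last step easy to automate. Below is a rough sketch of the read-until-done loop; fetch_webpage here stands for a hypothetical wrapper that forwards the tool call and returns the text chunk together with the document's total character count:

CHUNK = 5000  # characters per request, comfortably under the token ceiling

def read_full_document(fetch_webpage, url: str) -> str:
    """Walk through a long document in fixed-size chunks using offset/limit."""
    offset = 0
    parts = []
    while True:
        # Assumed to return (text_chunk, total_chars) for illustration.
        text, total = fetch_webpage(
            url=url,
            extract_content=True,
            offset=offset,
            limit=CHUNK,
        )
        parts.append(text)
        offset += CHUNK
        if offset >= total:  # nothing left to read
            break
    return "".join(parts)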

Output Format

With Content Extraction

Successfully fetched webpage: https://example.com

## Metadata
**Title:** Example Article
**Author:** John Doe
**Date:** 2024-01-15
**Description:** An informative article about...
**Site:** example.com

## Extracted Content (chars 0-5000 of 12000)
*Note: More content available. Use offset=5000 to continue.*

# Article Title

This is the clean, readable content extracted from the webpage...

Pagination Info

When using offset/limit, responses include:

  • Current position: chars 1000-6000 of 12000
  • Navigation hint: Use offset=6000 to continue
  • Total size information

Use Cases

📚 Documentation Analysis

Perfect for reading large technical docs, API references, and guides that span multiple pages.

📰 News & Article Processing

Extract clean article content from news sites, blogs, and publications for analysis.

🔍 Research & Data Gathering

Gather structured data from websites, including metadata and clean text content.

🤖 AI Agent Integration

Enable AI agents to browse and understand web content, even from JavaScript-heavy applications.

⚖️ Legal Document Processing

Handle complex legal documents and government sites that require JavaScript rendering.

Technical Details

Content Extraction Pipeline

  1. single-file: Renders JavaScript and saves the complete webpage
  2. trafilatura: Extracts the main content and metadata from the captured HTML (see the sketch below)
  3. Pagination: Applies offset/limit for manageable chunks
  4. Token Management: Ensures responses fit within LLM context limits
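
A rough sketch of the first two stages follows. This is an illustration, not the server's actual code: the single-file command line and temporary output path are assumptions, and trafilatura is called through its standard extract function.

import subprocess
import tempfile
from pathlib import Path
from typing import Optional

import trafilatura

def capture_and_extract(url: str, timeout: int = 60) -> Optional[str]:
    """Capture a JavaScript-rendered page with single-file, then extract the main text."""
    out = Path(tempfile.mkdtemp()) / "page.html"

    # Stage 1: let single-file drive a real browser and save the rendered page.
    subprocess.run(
        ["single-file", url, str(out)],
        check=True,
        timeout=timeout,  # timeout protection, as described under Error Handling
    )

    # Stage 2: pull the main article content out of the captured HTML.
    html = out.read_text(encoding="utf-8")
    return trafilatura.extract(html, include_comments=False)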

Browser Engine

Uses a browser via single-file for full JavaScript support:

  • Works with any supported browser installed on your system
  • Waits for network idle before capture
  • Removes hidden elements and unused styles
  • Handles dynamic content loading

Metadata Extraction

Automatically extracts:

  • Page title and description
  • Author and publication date
  • Site name and language
  • Categories and tags (when available)
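
This maps onto trafilatura's metadata support. As a small illustration (again, not the server's exact code), requesting JSON output with metadata returns these fields alongside the text:

import json

import trafilatura

def extract_with_metadata(html: str) -> dict:
    """Return title, author, date, sitename, etc. together with the main text."""
    result = trafilatura.extract(html, output_format="json", with_metadata=True)
    return json.loads(result) if result else {}

# doc = extract_with_metadata(captured_html)
# doc.get("title"), doc.get("author"), doc.get("date"), doc.get("sitename")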

Error Handling

  • Network Issues: Graceful timeout with informative errors
  • JavaScript Errors: Continues processing even if some scripts fail
  • Large Content: Automatic truncation with clear indicators
  • Invalid URLs: Clear validation error messages
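
Conceptually, the timeout protection is a thin wrapper around the capture step. A sketch under the same assumptions as the pipeline example above:

import subprocess

def safe_capture(url: str, output_path: str, timeout: int = 60) -> str:
    """Run the capture step and turn failures into informative messages."""
    try:
        subprocess.run(["single-file", url, output_path], check=True, timeout=timeout)
        return f"Successfully fetched webpage: {url}"
    except subprocess.TimeoutExpired:
        return f"Error: fetching {url} timed out after {timeout} seconds"
    except subprocess.CalledProcessError as exc:
        return f"Error: single-file exited with status {exc.returncode} for {url}"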

Development Setup

  1. Clone the repository:
     git clone https://github.com/kwinsch/singlefile-mcp.git
     cd singlefile-mcp
  2. Install dependencies:
     pip install -r requirements.txt
  3. Install in development mode:
     pip install -e .
  4. Test locally with Claude Code:
     claude mcp add singlefile-mcp -s user -- uvx --from . singlefile-mcp

License

MIT License - see LICENSE file for details.

Dependencies

  • single-file - Core web page capture tool that handles JavaScript rendering
  • trafilatura - Main-content and metadata extraction from HTML
  • mcp - Model Context Protocol for AI integration
