MCP server for converting HTML to Markdown with browser support, section extraction, and auto-summary for large documents

HTML to Markdown MCP Server

MCP (Model Context Protocol) server that converts HTML webpages to clean Markdown. It typically reduces page size by 90-95% while preserving tables, images, and other important content, which makes the output well suited to AI context windows.

Features

Converts HTML from URLs to clean Markdown
Preserves tables, images, and links
Removes unnecessary elements (scripts, styles, navigation, footers, headers)
Significant size reduction (typically 90-95% compression)
Configurable options for images, tables, and links
Built with trafilatura and BeautifulSoup4 for robust extraction
Stream processing for efficient handling of large pages
Size limits to prevent downloading excessively large content (1MB-50MB)
Optional caching to speed up repeated conversions of the same URLs
🌐 Browser mode with Playwright - Handles JavaScript-heavy sites and authenticated pages
- Execute JavaScript (perfect for SPAs: React, Vue, Angular)
- Use your browser profile with cookies (access authenticated pages!)
- Support for Chromium, Firefox, WebKit
- Configurable wait strategies for dynamic content
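
The stream processing and size limit features listed above can be pictured with a short sketch. This is not the project's code: `read_limited` and `ContentTooLargeError` are hypothetical names, and treating 1MB as 1024 * 1024 bytes is an assumption. It reads a response body in chunks and stops as soon as the configured limit is exceeded, so an oversized page is never held in memory in full.

```python
import io
from typing import BinaryIO

MIN_MAX_SIZE = 1 * 1024 * 1024    # 1MB lower bound (assumes 1MB = 1024 * 1024 bytes)
MAX_MAX_SIZE = 50 * 1024 * 1024   # 50MB upper bound


class ContentTooLargeError(Exception):
    """Raised when a body exceeds the configured limit (hypothetical name)."""


def read_limited(stream: BinaryIO, max_size: int, chunk_size: int = 64 * 1024) -> bytes:
    """Read a byte stream chunk by chunk, aborting once max_size is exceeded."""
    if not MIN_MAX_SIZE <= max_size <= MAX_MAX_SIZE:
        raise ValueError("max_size must be between 1MB and 50MB")
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        total += len(chunk)
        if total > max_size:
            raise ContentTooLargeError(f"body exceeds {max_size} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# Any file-like object works, for example the response returned by
# urllib.request.urlopen(url, timeout=30). Here an in-memory stream stands in.
body = read_limited(io.BytesIO(b"<html></html>"), max_size=MIN_MAX_SIZE)
```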

Installation

Prerequisites

Python 3.10 or higher
uv package manager (recommended) or pip

Install with uv (recommended)

# Clone the repository
git clone <your-repo-url>
cd html2md

# Install dependencies
uv pip install -e .

# Install Playwright browsers (required for browser mode)
playwright install chromium

Install with pip

# Clone the repository
git clone <your-repo-url>
cd html2md

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -e .

# Install Playwright browsers (required for browser mode)
playwright install chromium

Docker Installation (Recommended for Production)

The easiest way to use html2md is with Docker:

# Build the image
docker build -t html2md .

# Or use pre-built image (when published)
docker pull your-registry/html2md:latest

For Claude Desktop, configure with Docker:

{
  "mcpServers": {
    "html2md": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "html2md"
      ]
    }
  }
}

Docker Image Features:

Pre-installed Playwright with Chromium
Optimized for minimal size (~1GB)
Non-root user for security
Ready to use - no additional setup required

Configuration

Add the server to your Claude Desktop configuration file:

macOS

Edit ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "html2md": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/html2md",
        "run",
        "html2md"
      ]
    }
  }
}

Windows

Edit %APPDATA%\Claude\claude_desktop_config.json:

{
  "mcpServers": {
    "html2md": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\absolute\\path\\to\\html2md",
        "run",
        "html2md"
      ]
    }
  }
}

Linux

Edit ~/.config/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "html2md": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/html2md",
        "run",
        "html2md"
      ]
    }
  }
}
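
If you prefer to generate the entry instead of editing the file by hand, the three configurations above can be produced by a small helper. This is a sketch only: the function names are hypothetical, and it builds the data without writing to disk so you can review the result before saving it.

```python
import os
import sys
from pathlib import Path

CONFIG_NAME = "claude_desktop_config.json"


def claude_config_path(platform: str = sys.platform) -> Path:
    """Return the Claude Desktop config location listed above for each OS."""
    if platform == "darwin":
        return Path.home() / "Library" / "Application Support" / "Claude" / CONFIG_NAME
    if platform.startswith("win"):
        return Path(os.environ.get("APPDATA", "")) / "Claude" / CONFIG_NAME
    return Path.home() / ".config" / "Claude" / CONFIG_NAME


def html2md_entry(project_dir: str) -> dict:
    """Build the "html2md" server entry that runs the project through uv."""
    return {"command": "uv", "args": ["--directory", project_dir, "run", "html2md"]}


def merge_entry(config: dict, project_dir: str) -> dict:
    """Return a copy of config with the html2md entry added, keeping other servers."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["html2md"] = html2md_entry(project_dir)
    merged["mcpServers"] = servers
    return merged
```

Load the existing file with json.load, pass it through merge_entry, and write it back with json.dump to keep any other servers you have already configured.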

Usage

Once configured, the MCP server will be available in Claude Desktop. You can use the html_to_markdown tool:

Example 1: Basic conversion

Convert this webpage to markdown: https://example.com/article

Example 2: With options

Use the html_to_markdown tool with:
- url: https://example.com/docs
- include_images: false
- include_tables: true

Example 3: Browser mode for JavaScript-heavy sites

Use the html_to_markdown tool with:
- url: https://spa-application.com
- fetch_method: playwright
- wait_for: networkidle

Example 4: Access authenticated pages

Use the html_to_markdown tool with:
- url: https://private-site.com/dashboard
- fetch_method: playwright
- use_user_profile: true
- browser_type: chromium

Note: For use_user_profile=true, make sure Chrome is closed before running.
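
Claude Desktop builds the tool call for you, but it can help to see what travels over the wire. MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request, and the sketch below serializes one for Example 3. The helper name `tool_call_request` is hypothetical and only illustrates the message shape.

```python
import json


def tool_call_request(request_id: int, arguments: dict) -> str:
    """Serialize a tools/call request for the html_to_markdown tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "html_to_markdown", "arguments": arguments},
    }
    return json.dumps(message)


# Arguments taken from Example 3 above.
request = tool_call_request(
    1,
    {
        "url": "https://spa-application.com",
        "fetch_method": "playwright",
        "wait_for": "networkidle",
    },
)
```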

Tool Parameters

Basic Parameters:

url (required): URL of the webpage to convert
include_images (optional, default: true): Include images in Markdown
include_tables (optional, default: true): Include tables in Markdown
include_links (optional, default: true): Include links in Markdown
timeout (optional, default: 30): Request timeout in seconds (5-120)

Performance Parameters:

max_size (optional, default: 10MB): Maximum size of content to download, in bytes (allowed range: 1MB-50MB)
use_cache (optional, default: false): Enable caching for faster repeated conversions
cache_ttl (optional, default: 3600): Cache time-to-live in seconds (60-86400)

Browser Mode Parameters:

fetch_method (optional, default: "fetch"): Fetch method - "fetch" (fast) or "playwright" (handles JS, auth)
browser_type (optional, default: "chromium"): Browser to use - "chromium", "firefox", or "webkit"
headless (optional, default: true): Run browser in headless mode
wait_for (optional, default: "networkidle"): Wait strategy - "load", "domcontentloaded", or "networkidle"
use_user_profile (optional, default: false): Use your browser profile with cookies (requires Chrome closed)
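
The defaults and ranges above can be summarized as a small validation sketch. `ToolArgs` is a hypothetical container, not part of the server, and the byte value of 1MB (1024 * 1024) is an assumption; how the server itself validates input is not documented here.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the documented "MB" means 1024 * 1024 bytes


@dataclass
class ToolArgs:
    """Hypothetical container mirroring the documented parameters and defaults."""

    url: str
    include_images: bool = True
    include_tables: bool = True
    include_links: bool = True
    timeout: int = 30
    max_size: int = 10 * MB
    use_cache: bool = False
    cache_ttl: int = 3600
    fetch_method: str = "fetch"
    browser_type: str = "chromium"
    headless: bool = True
    wait_for: str = "networkidle"
    use_user_profile: bool = False

    def validate(self) -> None:
        """Raise ValueError when a value falls outside the documented range."""
        if not self.url:
            raise ValueError("url is required")
        if not 5 <= self.timeout <= 120:
            raise ValueError("timeout must be between 5 and 120 seconds")
        if not 1 * MB <= self.max_size <= 50 * MB:
            raise ValueError("max_size must be between 1MB and 50MB")
        if not 60 <= self.cache_ttl <= 86400:
            raise ValueError("cache_ttl must be between 60 and 86400 seconds")
        if self.fetch_method not in {"fetch", "playwright"}:
            raise ValueError("fetch_method must be 'fetch' or 'playwright'")
        if self.browser_type not in {"chromium", "firefox", "webkit"}:
            raise ValueError("browser_type must be chromium, firefox, or webkit")
        if self.wait_for not in {"load", "domcontentloaded", "networkidle"}:
            raise ValueError("wait_for must be load, domcontentloaded, or networkidle")


# The defaults alone are valid once a URL is supplied.
ToolArgs(url="https://example.com/article").validate()
```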

Development

Install development dependencies

uv pip install -e ".[dev]"

Run tests

pytest

Code formatting

# Format with black
black src/ tests/

# Lint with ruff
ruff check src/ tests/

Type checking

mypy src/

Architecture

The project consists of five main modules:

`converter.py`

Core HTML to Markdown conversion functionality:

fetch_html(): Downloads HTML from URL
clean_html(): Removes unnecessary elements with BeautifulSoup
convert_to_markdown(): Converts cleaned HTML to Markdown with trafilatura
html_to_markdown(): Main workflow combining all steps
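
To make the clean-then-convert flow concrete, here is a deliberately tiny stand-in. It does not use BeautifulSoup or trafilatura and is not the project's implementation: it relies only on the standard library's html.parser, drops the element types named in the feature list, and emits headings and paragraphs as Markdown. The names `MiniMarkdown` and `mini_html_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "nav", "footer", "header"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}


class MiniMarkdown(HTMLParser):
    """Tiny stand-in for the clean and convert steps (stdlib only)."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a skipped element
        self.blocks = []      # finished Markdown blocks
        self.current = []     # text pieces of the block being built
        self.prefix = ""      # Markdown prefix for the current block

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in HEADING_PREFIX or tag == "p":
            self.flush()
            self.prefix = HEADING_PREFIX.get(tag, "")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in HEADING_PREFIX or tag == "p":
            self.flush()

    def handle_data(self, data):
        # Keep text only when it is outside every skipped element.
        if self.skip_depth == 0 and data.strip():
            self.current.append(data.strip())

    def flush(self) -> None:
        """Close the block being built and start a new one."""
        if self.current:
            self.blocks.append(self.prefix + " ".join(self.current))
        self.current = []
        self.prefix = ""


def mini_html_to_markdown(html: str) -> str:
    """Convert a small HTML document to Markdown headings and paragraphs."""
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)
```

Tables, images, and links are out of scope for this sketch; the real converter handles them through trafilatura.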

`server.py`

MCP server implementation:

Registers the html_to_markdown tool
Handles tool calls and error responses
Runs async MCP server with stdio transport

`utils.py`

Utility functions:

Hash calculation for caching
Text formatting and truncation
Domain extraction
Filename sanitization
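
The kinds of helpers listed above are small enough to sketch with the standard library. The function names and exact behavior below are assumptions for illustration, not the contents of utils.py.

```python
import re
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    """Return the lower-cased host part of a URL."""
    return urlparse(url).netloc.lower()


def sanitize_filename(name: str, max_length: int = 100) -> str:
    """Replace runs of characters that are unsafe in filenames with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("._")
    return cleaned[:max_length] or "untitled"


def truncate_text(text: str, limit: int, suffix: str = "...") -> str:
    """Shorten text to at most limit characters, marking the cut with suffix."""
    if len(text) <= limit:
        return text
    return text[: max(0, limit - len(suffix))] + suffix
```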

`cache.py`

In-memory caching system:

SimpleCache class with TTL support
Global cache instance management
Automatic expiration of old entries
Hash-based cache keys for URL + parameters
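
A minimal sketch of how a TTL cache with hash-based keys can work is shown below. Only the class name `SimpleCache` comes from the list above; the method names, the `cache_key` helper, and the injectable clock are assumptions made for the example.

```python
import hashlib
import json
import time
from typing import Any, Callable, Optional


def cache_key(url: str, params: dict) -> str:
    """Hash the URL together with the conversion parameters."""
    payload = json.dumps({"url": url, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SimpleCache:
    """In-memory cache whose entries expire after ttl seconds."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._entries: dict = {}   # key -> (expiry time, value)
        self._clock = clock        # injectable so expiry can be tested

    def set(self, key: str, value: Any, ttl: int = 3600) -> None:
        """Store value under key until ttl seconds from now."""
        self._entries[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expire lazily on access
            return None
        return value
```

Including the parameters in the key means the same URL converted with include_images set to true and to false is cached as two separate entries.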

`browser.py`

Playwright browser automation:

fetch_html_playwright(): Async browser-based HTML fetching
Support for Chromium, Firefox, WebKit
User profile integration for authenticated access
Configurable wait strategies for dynamic content

Troubleshooting

Server not appearing in Claude Desktop

Check that the path in claude_desktop_config.json is absolute and correct
Restart Claude Desktop completely
Check Claude Desktop logs for errors

Installation issues

# Verify Python version
python --version  # Should be 3.10+

# Try reinstalling dependencies
uv pip install --force-reinstall -e .

Conversion errors

Timeout errors: Increase the timeout parameter
Empty content: Some websites may block automated requests or use JavaScript rendering
- Solution: Use fetch_method: playwright to execute JavaScript
Parse errors: The webpage structure may be unusual or malformed
Content too large: Increase the max_size parameter (up to 50MB). Pages larger than 50MB exceed the supported limit.
Cache issues: Disable caching with use_cache: false if you need fresh content

Browser mode issues

Playwright not installed: Run playwright install chromium
Browser launch fails: Check that you have sufficient permissions and disk space
User profile error: Make sure Chrome is completely closed before using use_user_profile: true
Page doesn't load fully: Try different wait_for strategies:
- "domcontentloaded" - fastest, waits only until the DOM has been parsed
- "load" - waits for the page load event, including stylesheets and images
- "networkidle" - slowest but most reliable for dynamic content, waits for the network to be idle
Authentication not working: Ensure you're using browser_type: chromium and use_user_profile: true
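
Trying wait strategies one after another can be automated on the caller's side. The sketch below is hypothetical: `fetch_with_fallback` is not part of html2md, and the `fetch` argument stands for whatever function you use to request a page with a given wait_for value. It moves from the earliest browser event to the latest and stops at the first result that contains a reasonable amount of HTML.

```python
from typing import Callable, Optional

# Ordered from the earliest browser event to the latest.
WAIT_STRATEGIES = ("domcontentloaded", "load", "networkidle")


def fetch_with_fallback(
    fetch: Callable[[str, str], str],
    url: str,
    min_length: int = 200,
) -> Optional[str]:
    """Try each wait strategy in turn until the page yields enough HTML."""
    for strategy in WAIT_STRATEGIES:
        try:
            html = fetch(url, strategy)
        except TimeoutError:
            continue  # this strategy timed out; try the next one
        if len(html) >= min_length:
            return html
    return None  # every strategy failed or returned too little content
```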

Performance

Typical conversion results:

Original HTML: ~500KB - 2MB
Markdown output: ~25KB - 100KB
Compression: 90-95%
Processing time: 2-10 seconds (depending on page size and network)
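
The compression figure follows directly from the two sizes. As a worked check, the hypothetical helper below computes the reduction for both ends of the typical range quoted above.

```python
def size_reduction(original_bytes: int, markdown_bytes: int) -> float:
    """Return the percentage by which the Markdown is smaller than the HTML."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return (1 - markdown_bytes / original_bytes) * 100


# 500KB of HTML reduced to 25KB of Markdown, and 2MB reduced to 100KB.
small = size_reduction(500_000, 25_000)
large = size_reduction(2_000_000, 100_000)
print(f"{small:.0f}% and {large:.0f}%")  # prints "95% and 95%"
```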

License

MIT

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Credits

Built with:

MCP SDK - Model Context Protocol
trafilatura - Web content extraction
BeautifulSoup4 - HTML parsing

Release history

0.3.0 (this version): Oct 31, 2025
0.2.0: Oct 31, 2025
0.1.0: Oct 31, 2025

Download files

Source Distribution

html2md_mcp-0.3.0.tar.gz
Upload date: Oct 31, 2025
Size: 16.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5
SHA256: `9b45c31ff7f3ffcba8bfc49ca08be5d0cf8c7e31bd812e5033eb5329031f1cf9`
MD5: `da6e014bafbcde83a57b2d6942c7b796`
BLAKE2b-256: `c3e001405a7fe437d55e7045c58893d37fb57291adc764cf9388b3b26bce2de3`

Built Distribution

html2md_mcp-0.3.0-py3-none-any.whl
Upload date: Oct 31, 2025
Size: 20.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5
SHA256: `9db7f70bd0bfc7a5267d7b79f7035f98b7242f1922bb3673d7c74d81922cb967`
MD5: `3e55ee64e06dcfa01c2c44370c1c55c3`
BLAKE2b-256: `598b22fda45a7d43f11347f7511e9d6c25732dc5bfb8a6a31f34a77e5643ea3c`
