MCP Document Converter - An MCP tool for multi-format document conversion

MCP Document Converter

MCP Document Converter is a Model Context Protocol (MCP) tool that converts documents between Markdown, HTML, DOCX, PDF, and plain text, so that AI agents can transform documents through a single tool interface.

Python 3.10+ | License: MIT | MCP Protocol

Features

  • Multi-format Support: Handles five mainstream document formats: Markdown, HTML, DOCX, PDF, and Text
  • Bidirectional Conversion: Each of the 5 input formats can be converted to each of the 5 output formats (5×5=25 conversion combinations)
  • MCP Protocol: Complies with the MCP standard and can be used as a tool by AI assistants and MCP clients such as Trae IDE
  • Plugin Architecture: Easy to extend with new parsers and renderers
  • Syntax Highlighting: HTML and PDF outputs support code syntax highlighting
  • Style Customization: Support for custom CSS styles
  • Metadata Preservation: Preserves document title, author, creation time, and other metadata during conversion

Supported Formats

Input Formats (Parsers)

| Format | Extensions | MIME Type | Features |
| --- | --- | --- | --- |
| Markdown | .md, .markdown, .mdown, .mkd | text/markdown | YAML Front Matter, GFM extensions |
| HTML | .html, .htm | text/html | Semantic tag parsing |
| DOCX | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Styles, tables, images |
| PDF | .pdf | application/pdf | Text extraction and structure recognition |
| Text | .txt, .text | text/plain | Auto encoding detection and structure recognition |
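
The extensions listed above are what auto-detection of source_format relies on. A minimal sketch of such a lookup, assuming a plain dictionary built from this table (the names EXTENSION_TO_FORMAT and detect_format are illustrative and not part of the package's API):

```python
from pathlib import Path

# Extension-to-format mapping copied from the Input Formats table.
# The dictionary and function names are hypothetical, for illustration only.
EXTENSION_TO_FORMAT = {
    ".md": "markdown", ".markdown": "markdown", ".mdown": "markdown", ".mkd": "markdown",
    ".html": "html", ".htm": "html",
    ".docx": "docx",
    ".pdf": "pdf",
    ".txt": "text", ".text": "text",
}


def detect_format(path: str) -> str:
    """Return the format name for a path, based on its (case-insensitive) extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


print(detect_format("notes/README.MD"))  # markdown
print(detect_format("report.htm"))       # html
```

When a file has no extension, or a misleading one, passing source_format explicitly avoids the lookup altogether.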

Output Formats (Renderers)

| Format | Extension | MIME Type | Features |
| --- | --- | --- | --- |
| HTML | .html | text/html | Beautiful styling, code highlighting, responsive design |
| Markdown | .md | text/markdown | Standard Markdown format, YAML Front Matter |
| DOCX | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word document format, style preservation |
| PDF | .pdf | application/pdf | Generated with WeasyPrint, pagination support |
| Text | .txt | text/plain | Plain text, basic formatting preserved |

Conversion Matrix

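| Source \ Target | HTML | PDF | Markdown | DOCX | Text |
| --- | --- | --- | --- | --- | --- |
| Markdown | ✅ | ✅ | ✅ | ✅ | ✅ |
| HTML | ✅ | ✅ | ✅ | ✅ | ✅ |
| DOCX | ✅ | ✅ | ✅ | ✅ | ✅ |
| PDF | ✅ | ✅ | ✅ | ✅ | ✅ |
| Text | ✅ | ✅ | ✅ | ✅ | ✅ |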
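
Since the feature list states that all 25 source/target pairs are supported, the matrix amounts to the cross product of the input and output format lists. A small sketch under that assumption (the variable names are illustrative, and the get_conversion_matrix tool may return a differently shaped result):

```python
from itertools import product

# Format names as accepted by the target_format argument.
SOURCE_FORMATS = ["markdown", "html", "docx", "pdf", "text"]
TARGET_FORMATS = ["html", "pdf", "markdown", "docx", "text"]

# Every (source, target) pair is supported, so each cell of the matrix is True.
matrix = {
    source: {target: True for target in TARGET_FORMATS}
    for source in SOURCE_FORMATS
}

# Collect the supported pairs by walking the full cross product.
pairs = [(s, t) for s, t in product(SOURCE_FORMATS, TARGET_FORMATS) if matrix[s][t]]
print(len(pairs))  # 25
```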

Installation

Using pip (Recommended)

pip install mcp-document-converter

From Source

git clone https://github.com/xt765/mcp-document-converter.git
cd mcp-document-converter
pip install -e .

MCP Tools

This server provides the following tools:

convert_document

Convert a document from one format to another.

Arguments:

  • source_path (string, required): Path to the source document.
  • target_format (string, required): Target format (html, pdf, markdown, docx, text).
  • output_path (string, optional): Path for the output file.
  • source_format (string, optional): Format of the source file (auto-detected if not provided).
  • options (object, optional): Additional options like template, css, and preserve_metadata.
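
A caller may want to check arguments before invoking the tool. The helper below is a hypothetical client-side sketch, not part of the package; it only encodes the rules listed above (two required strings, a fixed set of target formats, and optional extras that are left out when unset).

```python
from typing import Any, Optional

TARGET_FORMATS = {"html", "pdf", "markdown", "docx", "text"}


def build_arguments(
    source_path: str,
    target_format: str,
    output_path: Optional[str] = None,
    source_format: Optional[str] = None,
    options: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble a convert_document argument dict, omitting unset optionals."""
    if not source_path:
        raise ValueError("source_path is required")
    if target_format not in TARGET_FORMATS:
        raise ValueError(f"target_format must be one of {sorted(TARGET_FORMATS)}")

    arguments: dict[str, Any] = {
        "source_path": source_path,
        "target_format": target_format,
    }
    if output_path is not None:
        arguments["output_path"] = output_path
    if source_format is not None:
        arguments["source_format"] = source_format
    if options:
        arguments["options"] = options
    return arguments


print(build_arguments("document.md", "html", options={"preserve_metadata": True}))
```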

Configuration

Using in Trae IDE

Add the following to your Trae IDE MCP configuration:

Option 1: Using GitHub repository (Recommended)

{
  "mcpServers": {
    "mcp-document-converter": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/xt765/mcp-document-converter",
        "mcp-document-converter"
      ]
    }
  }
}

Option 2: Using Gitee repository (Faster access in China)

{
  "mcpServers": {
    "mcp-document-converter": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://gitee.com/xt765/mcp-document-converter",
        "mcp-document-converter"
      ]
    }
  }
}

Usage

As an MCP Tool

After configuration, AI assistants can directly call the following tools:

1. convert_document (Recommended)

Use a unified interface to convert any supported document type.

# Markdown to HTML
convert_document(
    source_path="document.md",
    target_format="html"
)

# HTML to PDF
convert_document(
    source_path="document.html",
    target_format="pdf"
)

# DOCX to Markdown
convert_document(
    source_path="document.docx",
    target_format="markdown"
)

# Conversion with options
convert_document(
    source_path="document.md",
    target_format="html",
    output_path="output.html",
    options={
        "css": "custom.css",
        "preserve_metadata": True
    }
)

2. list_supported_formats

List all supported document formats.

list_supported_formats()

3. get_conversion_matrix

Get the complete format conversion matrix.

get_conversion_matrix()

4. can_convert

Check if conversion from source format to target format is supported.

can_convert(source_format="markdown", target_format="pdf")

5. get_format_info

Get detailed information about a specific format.

get_format_info(format="markdown")

As a Python Library

from mcp_document_converter import DocumentConverter
from mcp_document_converter.registry import get_registry
from mcp_document_converter.parsers import MarkdownParser, HTMLParser
from mcp_document_converter.renderers import HTMLRenderer, PDFRenderer

# Register parsers and renderers
registry = get_registry()
registry.register_parser(MarkdownParser())
registry.register_parser(HTMLParser())
registry.register_renderer(HTMLRenderer())
registry.register_renderer(PDFRenderer())

# Create converter
converter = DocumentConverter(registry)

# Convert document
result = converter.convert(
    source="input.md",
    target_format="html",
    output_path="output.html"
)

if result.success:
    print(f"✅ Conversion successful: {result.output_path}")
else:
    print(f"❌ Conversion failed: {result.error_message}")

Tool Interface Details

convert_document

Convert a document from one format to another.

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| source_path | string | Yes | Source file path, supports absolute or relative paths |
| target_format | string | Yes | Target format: html, pdf, markdown, docx, text |
| output_path | string | No | Output file path (defaults to source filename) |
| source_format | string | No | Source format (auto-detected from file extension) |
| options | object | No | Conversion options |

Options:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| template | string | - | Template name |
| css | string | - | Custom CSS styles |
| preserve_metadata | boolean | true | Whether to preserve metadata |
| extract_images | boolean | true | Whether to extract images |

Example:

{
  "source_path": "/path/to/document.md",
  "target_format": "html",
  "output_path": "/path/to/output.html",
  "options": {
    "css": "body { font-family: Arial; }",
    "preserve_metadata": true
  }
}
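
On the wire, an MCP client wraps these arguments in a JSON-RPC 2.0 tools/call request. The sketch below builds such a message with the standard library only; the request id is arbitrary, and the paths are the placeholder values from the example above.

```python
import json

arguments = {
    "source_path": "/path/to/document.md",
    "target_format": "html",
    "output_path": "/path/to/output.html",
    "options": {"css": "body { font-family: Arial; }", "preserve_metadata": True},
}

# MCP tool invocations use the JSON-RPC 2.0 method "tools/call",
# with the tool name and its arguments under "params".
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "convert_document", "arguments": arguments},
}

message = json.dumps(request)
print(message)
```

In practice an MCP client library or the host IDE builds this envelope; the sketch only makes the shape of the call visible.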

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    MCP Document Converter                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Parsers                          Renderers                     │
│   ┌─────────────┐                  ┌─────────────┐              │
│   │ Markdown    │ ───────────────→ │ HTML        │              │
│   │ DOCX        │ ───────────────→ │ PDF         │              │
│   │ HTML        │ ───────────────→ │ Markdown    │              │
│   │ PDF         │ ───────────────→ │ DOCX        │              │
│   │ Text        │ ───────────────→ │ Text        │              │
│   └─────────────┘                  └─────────────┘              │
│          ↓                                ↓                     │
│   ┌─────────────────────────────────────────────────────┐       │
│   │         Intermediate Representation (IR)             │       │
│   │  - Document Tree                                     │       │
│   │  - Metadata                                          │       │
│   │  - Assets (images, attachments, etc.)                │       │
│   └─────────────────────────────────────────────────────┘       │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Core Components

  1. DocumentIR (Intermediate Representation): Unified abstraction for all documents, containing document tree, metadata, assets, etc.
  2. BaseParser (Parser Base Class): Defines the parser interface, parses various formats into DocumentIR
  3. BaseRenderer (Renderer Base Class): Defines the renderer interface, renders DocumentIR into various formats
  4. ConverterRegistry (Registry): Manages all parsers and renderers, provides format lookup and auto-matching
  5. DocumentConverter (Conversion Engine): Coordinates parsers and renderers to complete document conversion
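
How these components fit together can be sketched in a few lines. The classes below are deliberately simplified stand-ins written for this example (MiniIR, PlainTextParser, MarkdownRenderer, MiniRegistry, and convert are all hypothetical names); they illustrate the parse-to-IR and IR-to-render flow, not the package's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class MiniIR:
    """Toy intermediate representation: a title plus a list of paragraphs."""
    title: str = ""
    paragraphs: list[str] = field(default_factory=list)


class PlainTextParser:
    format_name = "text"

    def parse(self, source: str) -> MiniIR:
        # First non-empty line becomes the title; the rest become paragraphs.
        lines = [line.strip() for line in source.splitlines() if line.strip()]
        return MiniIR(title=lines[0] if lines else "", paragraphs=lines[1:])


class MarkdownRenderer:
    format_name = "markdown"

    def render(self, document: MiniIR) -> str:
        parts = [f"# {document.title}"] if document.title else []
        parts.extend(document.paragraphs)
        return "\n\n".join(parts)


class MiniRegistry:
    """Looks up parsers and renderers by format name."""

    def __init__(self) -> None:
        self.parsers: dict[str, PlainTextParser] = {}
        self.renderers: dict[str, MarkdownRenderer] = {}

    def register_parser(self, parser: PlainTextParser) -> None:
        self.parsers[parser.format_name] = parser

    def register_renderer(self, renderer: MarkdownRenderer) -> None:
        self.renderers[renderer.format_name] = renderer


def convert(registry: MiniRegistry, source: str, source_format: str, target_format: str) -> str:
    # No direct format-to-format step: the source is parsed into the IR,
    # and the IR is then rendered into the target format.
    document = registry.parsers[source_format].parse(source)
    return registry.renderers[target_format].render(document)


registry = MiniRegistry()
registry.register_parser(PlainTextParser())
registry.register_renderer(MarkdownRenderer())
result = convert(registry, "Report\nFirst paragraph.\nSecond paragraph.", "text", "markdown")
print(result)
```

With N parsers and M renderers sharing one IR, N + M components cover N × M conversions, which is why supporting a new format takes one parser and one renderer rather than a converter per format pair.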

Extension Development

Adding a New Parser

from typing import List, Union
from pathlib import Path
from mcp_document_converter.core.parser import BaseParser
from mcp_document_converter.core.ir import DocumentIR, Node, NodeType

class MyParser(BaseParser):
    @property
    def supported_extensions(self) -> List[str]:
        return [".myext"]
    
    @property
    def format_name(self) -> str:
        return "myformat"
    
    @property
    def mime_types(self) -> List[str]:
        return ["application/x-myformat"]
    
    def parse(self, source: Union[str, Path, bytes], **options) -> DocumentIR:
        # Read source file
        content = self._read_source(source)
        
        # Parse into DocumentIR
        document = DocumentIR()
        document.title = "My Document"
        
        # Add content nodes
        document.add_node(Node(
            type=NodeType.PARAGRAPH,
            content=[Node(type=NodeType.TEXT, content="Hello World")]
        ))
        
        return document

Adding a New Renderer

from typing import Any
from mcp_document_converter.core.renderer import BaseRenderer
from mcp_document_converter.core.ir import DocumentIR

class MyRenderer(BaseRenderer):
    @property
    def output_extension(self) -> str:
        return ".myext"
    
    @property
    def format_name(self) -> str:
        return "myformat"
    
    @property
    def mime_type(self) -> str:
        return "application/x-myformat"
    
    def render(self, document: DocumentIR, **options: Any) -> str:
        # Render DocumentIR to target format
        parts = []
        
        if document.title:
            parts.append(f"# {document.title}")
        
        for node in document.content:
            # Render each node
            pass
        
        return "\n".join(parts)

Registering Extensions

from mcp_document_converter.registry import get_registry

# Register new parser and renderer
registry = get_registry()
registry.register_parser(MyParser())
registry.register_renderer(MyRenderer())

Testing

# Run all tests (the :: selector below is pytest syntax, so pytest is assumed)
pytest tests/test_conversion.py

# Run a specific test
pytest tests/test_conversion.py::test_markdown_to_html

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| MCP_CONVERTER_LOG_LEVEL | Log level | INFO |
| MCP_CONVERTER_TEMP_DIR | Temporary files directory | System temp directory |
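
A minimal sketch of how these variables could be resolved with their documented defaults, using only the standard library. The function name load_settings is illustrative, and how the package itself consumes the variables is not shown in this document.

```python
import os
import tempfile


def load_settings(environ=os.environ) -> dict:
    """Resolve converter settings from the environment, applying the documented defaults."""
    # Log level defaults to INFO; normalise case so "debug" and "DEBUG" are equivalent.
    log_level = environ.get("MCP_CONVERTER_LOG_LEVEL", "INFO").upper()
    # Fall back to the system temp directory when the variable is unset or empty.
    temp_dir = environ.get("MCP_CONVERTER_TEMP_DIR") or tempfile.gettempdir()
    return {"log_level": log_level, "temp_dir": temp_dir}


print(load_settings())
```

Passing the mapping as a parameter keeps the function easy to exercise without touching the real process environment.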

Dependencies

Core Dependencies

  • mcp >= 1.0.0 - MCP protocol implementation
  • pydantic >= 2.0.0 - Data validation

Parser Dependencies

  • markdown >= 3.5.0 - Markdown parsing
  • beautifulsoup4 >= 4.12.0 - HTML parsing
  • python-docx >= 1.1.0 - DOCX parsing
  • PyPDF2 >= 3.0.0 - PDF parsing
  • chardet >= 5.0.0 - Encoding detection
  • pyyaml >= 6.0.0 - YAML parsing

Renderer Dependencies

  • weasyprint >= 60.0 - PDF rendering
  • pygments >= 2.17.0 - Code highlighting
  • jinja2 >= 3.1.0 - Template engine

License

MIT License

Contributing

Issues and Pull Requests are welcome!
