High-quality PDF ↔ Markdown converter with MCP integration and Unicode support
Huoshui PDF Converter (活水 PDF 转换器)
A high-quality, cross-platform PDF ↔ Markdown converter implemented as an MCP (Model Context Protocol) server. It converts in both directions and fully supports Unicode, including CJK characters.
Features
Core Capabilities
- PDF → Markdown: Extract text and images with layout preservation
- Markdown → PDF: Generate beautiful PDFs with multiple rendering engines
- Unicode Support: Full support for Chinese, Japanese, Korean, and other Unicode characters
- Cross-Platform: Works on Windows, macOS, and Linux
- MCP Integration: Use with Claude Desktop or any MCP-compatible client
Technical Features
- Pure Python: No external system dependencies required
- Automatic Font Detection: Finds and uses system Unicode fonts
- Smart Engine Selection: Automatically switches engines based on content
- Comprehensive Error Handling: Graceful degradation and detailed logging
- Async Architecture: Non-blocking operations for better performance
Installation
From MCP Registry (Recommended)
This server is available in the Model Context Protocol Registry. Install it using your MCP client.
mcp-name: io.github.huoshuiai42/huoshui-pdf-converter
As a Python Package
pip install huoshui-pdf-converter
Or using uv (recommended):
uv pip install huoshui-pdf-converter
As an MCP Server
Add to your Claude Desktop configuration:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "huoshui-pdf-converter": {
      "command": "uvx",
      "args": ["huoshui-pdf-converter"],
      "env": {}
    }
  }
}
Or if you prefer to use a specific Python environment:
{
  "mcpServers": {
    "huoshui-pdf-converter": {
      "command": "python",
      "args": ["-m", "huoshui_pdf_converter.server"],
      "env": {}
    }
  }
}
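The configuration paths and the server entry above can be combined in a small helper script. This is a minimal sketch and not part of the package: `config_path` and `add_server` are hypothetical names, the paths are the ones listed above, and the fallback used when `APPDATA` is unset is the usual Windows default.

```python
import json
import os
import platform
from pathlib import Path

CONFIG_NAME = "claude_desktop_config.json"


def config_path(system=None, env=None):
    """Return the Claude Desktop config path listed above for the given OS."""
    system = system or platform.system()
    env = os.environ if env is None else env
    if system == "Darwin":
        return Path.home() / "Library" / "Application Support" / "Claude" / CONFIG_NAME
    if system == "Windows":
        # %APPDATA% normally points at <home>\AppData\Roaming.
        default = Path.home() / "AppData" / "Roaming"
        return Path(env.get("APPDATA", str(default))) / "Claude" / CONFIG_NAME
    return Path.home() / ".config" / "Claude" / CONFIG_NAME


def add_server(config):
    """Return a copy of `config` with the uvx-based server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["huoshui-pdf-converter"] = {
        "command": "uvx",
        "args": ["huoshui-pdf-converter"],
        "env": {},
    }
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    print(config_path())
    print(json.dumps(add_server({}), indent=2))
```

`add_server` leaves any existing entries under `mcpServers` in place, so the result can be written back to the file without dropping other servers.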
Usage
Command Line Interface
# Convert PDF to Markdown
huoshui-pdf pdf-to-md input.pdf output.md
# Convert Markdown to PDF
huoshui-pdf md-to-pdf input.md output.pdf
# With options
huoshui-pdf md-to-pdf input.md output.pdf --page-size A4 --margin 2cm --font-size 12
As a Python Library
import asyncio
from huoshui_pdf_converter import PDFToMarkdownConverter, MarkdownToPDFConverter

async def main():
    # PDF to Markdown
    pdf_converter = PDFToMarkdownConverter()
    result = await pdf_converter.convert(
        pdf_path="input.pdf",
        output_path="output.md",
        extract_images=True,
        preserve_formatting=True
    )

    # Markdown to PDF
    md_converter = MarkdownToPDFConverter()
    result = await md_converter.convert(
        markdown_path="input.md",
        output_path="output.pdf",
        page_size="A4",
        margin="2cm",
        font_size=12
    )

asyncio.run(main())
MCP Tools
When used as an MCP server, the following tools are available:
- pdf_to_markdown: Convert PDF files to Markdown
  { "pdf_path": "path/to/input.pdf", "output_path": "path/to/output.md", "extract_images": true, "preserve_formatting": true }
- markdown_to_pdf: Convert Markdown files to PDF
  { "markdown_path": "path/to/input.md", "output_path": "path/to/output.pdf", "page_size": "A4", "margin": "2cm", "font_size": 12 }
- list_supported_formats: Get supported formats and engines
- validate_file: Validate input files before conversion
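Over the wire, an MCP client invokes a tool with a JSON-RPC 2.0 `tools/call` request. The sketch below only builds such a request for `pdf_to_markdown` using the arguments listed above. The helper name `tool_call_request` is hypothetical, and in practice an MCP client library would construct and send this message for you.

```python
import json


def tool_call_request(request_id, name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request in the shape MCP uses."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = tool_call_request(
    1,
    "pdf_to_markdown",
    {
        "pdf_path": "path/to/input.pdf",
        "output_path": "path/to/output.md",
        "extract_images": True,
        "preserve_formatting": True,
    },
)
print(json.dumps(request, indent=2))
```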
Supported Formats
Input Formats
- PDF: All standard PDF files (PDF 1.0 - 1.7)
- Markdown: CommonMark and GitHub Flavored Markdown
Output Options
- Page Sizes: A4, A3, Letter, Legal
- Margins: Customizable (e.g., "1cm", "0.5in")
- Font Sizes: Any size in points
- Images: PNG, JPEG extraction from PDFs
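Because font sizes are given in points while margins carry a unit, it can help to see both on the same scale. The following is a rough sketch of converting the margin strings shown above into points. `margin_to_points` is a hypothetical helper, not the package's parser, and it handles only the two units that appear in the examples (cm and in).

```python
import re

# 1 inch = 72 points; 1 inch = 2.54 cm.
POINTS_PER_UNIT = {"in": 72.0, "cm": 72.0 / 2.54}

MARGIN_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(cm|in)\s*")


def margin_to_points(margin):
    """Convert a margin string such as "1cm" or "0.5in" to points."""
    match = MARGIN_PATTERN.fullmatch(margin)
    if match is None:
        raise ValueError(f"unsupported margin: {margin!r}")
    value, unit = match.groups()
    return float(value) * POINTS_PER_UNIT[unit]


if __name__ == "__main__":
    for example in ("1cm", "2cm", "0.5in"):
        print(example, "=", round(margin_to_points(example), 2), "pt")
```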
Unicode and Font Support
The converter automatically detects and uses appropriate fonts for different languages:
- macOS: Arial Unicode, PingFang SC, STHeiti
- Windows: Microsoft YaHei, SimSun, Arial Unicode MS
- Linux: Noto Sans CJK, Source Han Sans, WenQuanYi
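The per-platform preference order above can be sketched as a simple lookup. How the converter actually enumerates installed fonts is not described here, so this illustration takes the installed font names as an argument. `FONT_CANDIDATES` and `pick_font` are hypothetical names.

```python
import platform

# Candidate fonts per OS, in the order listed above.
FONT_CANDIDATES = {
    "Darwin": ["Arial Unicode", "PingFang SC", "STHeiti"],
    "Windows": ["Microsoft YaHei", "SimSun", "Arial Unicode MS"],
    "Linux": ["Noto Sans CJK", "Source Han Sans", "WenQuanYi"],
}


def pick_font(installed, system=None):
    """Return the first candidate for this OS found in `installed`, else None."""
    system = system or platform.system()
    installed_lower = {name.lower() for name in installed}
    for candidate in FONT_CANDIDATES.get(system, []):
        if candidate.lower() in installed_lower:
            return candidate
    return None


if __name__ == "__main__":
    print(pick_font(["Arial", "SimSun"], system="Windows"))
```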
Architecture
Conversion Engines
PDF → Markdown
- PyMuPDF (MuPDF): High-quality text and image extraction
Markdown → PDF
- ReportLab: Best Unicode support, cross-platform compatibility
- xhtml2pdf: Good HTML/CSS rendering (fallback)
- fpdf2: Basic PDF generation (last resort)
Engine Selection Logic
- Detects CJK characters → Uses ReportLab
- Complex formatting → Uses xhtml2pdf
- Basic documents → Uses any available engine
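The selection rules above can be sketched roughly as follows. This is an illustration, not the package's implementation: `select_engine` is a hypothetical function, the CJK check covers only the main ideograph, kana, and Hangul ranges, and treating tables or inline HTML as "complex formatting" is an assumption.

```python
import re

# CJK Unified Ideographs, Hiragana and Katakana, Hangul syllables.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

# Assumption: a pipe-table row or an inline HTML tag counts as complex formatting.
COMPLEX_PATTERN = re.compile(r"^\s*\|.*\|\s*$|<[a-zA-Z][^>]*>", re.MULTILINE)


def select_engine(markdown_text, available=("reportlab", "xhtml2pdf", "fpdf2")):
    """Pick a rendering engine name following the rules listed above."""
    if not available:
        raise ValueError("no rendering engine available")
    if CJK_PATTERN.search(markdown_text) and "reportlab" in available:
        return "reportlab"
    if COMPLEX_PATTERN.search(markdown_text) and "xhtml2pdf" in available:
        return "xhtml2pdf"
    # Basic documents: any available engine will do, so take the first.
    return available[0]


if __name__ == "__main__":
    print(select_engine("# 你好，世界"))
    print(select_engine("| a | b |\n|---|---|"))
    print(select_engine("Plain text"))
```

If a preferred engine is not installed, the sketch falls through to the next rule, which mirrors the fallback order described under Markdown → PDF.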
Development
Setup Development Environment
# Clone the repository
git clone https://github.com/yourusername/huoshui-pdf-converter.git
cd huoshui-pdf-converter
# Install dependencies
uv pip install -e ".[dev]"
# Run tests
python test_converter.py
Project Structure
huoshui-pdf-converter/
├── huoshui_pdf_converter/
│ ├── __init__.py
│ ├── server.py # MCP server implementation
│ ├── pdf_converter.py # PDF to Markdown converter
│ └── markdown_converter.py # Markdown to PDF converter
├── pyproject.toml
├── README.md
├── LICENSE
└── test_converter.py
Troubleshooting
Common Issues
- Chinese characters not displaying:
  - Ensure Arial Unicode or similar fonts are installed
  - The converter will automatically detect and use appropriate fonts
- Import errors:
  - Install all dependencies: pip install huoshui-pdf-converter[all]
- MCP connection issues:
  - Check Claude Desktop logs
  - Ensure Python is in your PATH
Logging
Enable debug logging:
import logging
logging.basicConfig(level=logging.DEBUG)
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with FastMCP for Model Context Protocol support
- Uses PyMuPDF for PDF parsing
- Uses ReportLab for PDF generation
- Inspired by the need for better PDF ↔ Markdown conversion tools
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: your.email@example.com