MCP server for comprehensive PDF processing with 18 specialized tools
Project description
PDF Reader MCP Server
An MCP (Model Context Protocol) server that provides comprehensive PDF processing capabilities with 18 powerful tools for text extraction, OCR, image conversion, metadata management, and optimization.
🚀 Latest Updates
- ✅ All 18 tools fully tested and working (September 2025)
- 🔧 Fixed JSON serialization issues - 100% compatibility achieved
- ⚡ Enhanced performance with intelligent caching system
- 🌍 Multi-language OCR support with Chinese and English optimization
Features
📖 Smart Text Extraction
- Intelligent PDF parsing with
pdfplumberfor high-quality text extraction - Automatic quality detection to identify when OCR is needed
- Page-wise processing with flexible page range syntax
🔍 Advanced OCR Support
- Tesseract integration for scanned documents and image-based PDFs
- Multi-language support with focus on Chinese and English
- Confidence scoring for OCR quality assessment
- Windows-friendly installation and setup
⚡ Performance Optimized
- Smart caching system to avoid reprocessing unchanged files
- Chunking strategies for handling large documents
- Parallel page processing for improved performance
🎯 Flexible Page Selection
Support for complex page ranges:
"1,3,5"- Specific pages"1-10"- Page ranges"-1"- Last page"1,3,5-10,-1"- Combined syntax
Installation
🚀 Quick Installation (Recommended)
Install and run with uvx (easiest method):
# Install and run directly with uvx (no setup required)
uvx pdfmcp-tools
# Or install globally for repeated use
uv tool install pdfmcp-tools
pdfmcp-tools
Install from PyPI with pip:
# Install from PyPI
pip install pdfmcp-tools
# Run the server (both commands work)
pdfmcp-tools
# or
pdfreadermcp
Prerequisites
- Python 3.11+ (automatically handled by uvx/pip)
- Tesseract OCR engine (for OCR functionality)
Install Tesseract OCR Engine
macOS:
# Using Homebrew (recommended)
brew install tesseract tesseract-lang
Linux (Ubuntu/Debian):
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra
Windows:
- Download from: https://github.com/UB-Mannheim/tesseract/wiki
- Install the latest version (recommended: tesseract-ocr-w64-setup-v5.3.3.20231005.exe)
- During installation, select "Additional Language Data" and install Chinese language packs
- Add Tesseract to your PATH, or note the installation path for configuration
Development Installation (Advanced)
For development or local modification:
-
Install uv package manager (if not already installed):
macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
Windows:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
-
Clone and install for development:
git clone https://github.com/lihongwen/pdfreadermcp.git cd pdfreadermcp uv sync --dev uv run pdfreadermcp
Usage
🚀 Running the Server
With uvx (recommended):
# Run directly (auto-downloads and starts)
uvx pdfreadermcp
# Or if globally installed
pdfreadermcp
With pip installation:
# After pip install pdfreadermcp
pdfreadermcp
Development mode:
# In project directory
uv run pdfreadermcp
Integration with Claude Desktop
Add to your Claude Desktop MCP configuration file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Option 1: Using uvx (recommended):
{
"mcpServers": {
"pdfreadermcp": {
"command": "uvx",
"args": ["pdfmcp-tools"]
}
}
}
Option 2: Using global installation:
{
"mcpServers": {
"pdfreadermcp": {
"command": "pdfmcp-tools"
}
}
}
Option 3: Development/local installation:
{
"mcpServers": {
"pdfreadermcp": {
"command": "uv",
"args": [
"--directory",
"/path/to/your/pdfreadermcp",
"run",
"pdfreadermcp"
]
}
}
}
📚 Complete Tool Suite (18 Tools)
All tools have been thoroughly tested and are fully functional. The server provides comprehensive PDF processing capabilities across multiple categories:
📖 Text Processing Tools (5 tools)
read_pdf- Intelligent text extraction with chunkingextract_page_text- Single page text extraction with multiple modessearch_pdf_text- Advanced text search with regex and context supportfind_and_highlight_text- Text search with highlighting coordinatesget_pdf_metadata- Comprehensive metadata reading with XMP support
📄 Document Operations Tools (5 tools)
split_pdf- Split PDFs into multiple files by page rangesextract_pages- Extract specific pages to new PDF filemerge_pdfs- Combine multiple PDFs into single documentset_pdf_metadata- Write/update PDF metadata fieldsremove_pdf_metadata- Remove specific or all metadata fields
🖼️ Image Conversion Tools (3 tools)
pdf_to_images- Convert PDF pages to high-quality imagesimages_to_pdf- Convert multiple images to single PDFextract_pdf_images- Extract embedded images from PDF pages
🔍 OCR Tool (1 tool)
ocr_pdf- Advanced OCR with multi-language support and confidence scoring
⚡ Optimization Tools (4 tools)
optimize_pdf- Comprehensive PDF optimization with multiple levelscompress_pdf_images- Image compression within PDF documentsremove_pdf_content- Remove specific content to reduce file sizeanalyze_pdf_size- File size analysis and optimization recommendations
Tools
read_pdf - Text Extraction Tool
Extracts text from PDF files with intelligent processing.
Parameters:
file_path(required): Path to PDF filepages(optional): Page range string (e.g., "1,3,5-10,-1")chunk_size(optional): Maximum chunk size (default: 1000)chunk_overlap(optional): Chunk overlap (default: 100)
Example:
Extract text from document.pdf, pages 1-5 and last page
ocr_pdf - OCR Recognition Tool
Performs OCR on PDF pages using Tesseract for scanned documents and image-based PDFs.
Parameters:
file_path(required): Path to PDF filepages(optional): Page range string (e.g., "1,3,5-10,-1")language(optional): OCR language code (default: "chi_sim" for Chinese)chunk_size(optional): Maximum chunk size (default: 1000)chunk_overlap(optional): Chunk overlap (default: 100)dpi(optional): DPI for PDF to image conversion (default: 200)
Supported Languages:
chi_sim: Simplified Chinese (默认)chi_tra: Traditional Chineseeng: Englishchi_sim+eng: Chinese and English mixed
Example:
Perform OCR on scanned_doc.pdf with Chinese text recognition
split_pdf - PDF Splitting Tool
Split PDF into multiple files based on page ranges.
Parameters:
file_path(required): Path to source PDF filesplit_ranges(required): List of page ranges (e.g., ["1-5", "6-10", "11-15"])output_dir(optional): Output directory (defaults to source file directory)prefix(optional): Output file prefix (defaults to source filename)
Example:
Split document.pdf into multiple files: pages 1-10, 11-20, 21-30
extract_pages - Page Extraction Tool
Extract specific pages from PDF to a new file.
Parameters:
file_path(required): Path to source PDF filepages(required): Page range (e.g., "1,3,5-7")output_file(optional): Output filename (auto-generated if not provided)output_dir(optional): Output directory (defaults to source file directory)
Example:
Extract pages 1, 5-8, and 15 from document.pdf
merge_pdfs - PDF Merging Tool
Merge multiple PDF files into a single file.
Parameters:
file_paths(required): List of PDF file paths to mergeoutput_file(optional): Output filename (auto-generated if not provided)output_dir(optional): Output directory (defaults to first file's directory)
Example:
Merge file1.pdf, file2.pdf, and file3.pdf into a single document
pdf_to_images - PDF to Images Converter
Convert PDF pages to high-quality images using pdf2image.
Parameters:
file_path(required): Path to PDF filepages(optional): Page range (e.g., "1,3,5-10,-1")dpi(optional): Resolution for conversion (default: 200)image_format(optional): Output format ('PNG', 'JPEG', etc.)output_dir(optional): Directory to save imagessave_to_disk(optional): Save to disk or keep in memory (default: True)
Example:
Convert first 5 pages of document.pdf to PNG images at 300 DPI
images_to_pdf - Images to PDF Converter
Convert multiple images into a single PDF document.
Parameters:
image_paths(required): List of image file pathsoutput_file(required): Output PDF file pathpage_size(optional): Page size ('A4', 'Letter', 'Legal', 'auto')quality(optional): JPEG compression quality (1-100, default: 95)title(optional): PDF document titleauthor(optional): PDF document author
Example:
Convert scan1.jpg, scan2.jpg, scan3.jpg to a single PDF with A4 pages
extract_pdf_images - PDF Image Extractor
Extract all embedded images from PDF pages.
Parameters:
file_path(required): Path to PDF filepages(optional): Page range (e.g., "1,3,5-10,-1")min_size(optional): Minimum image size ("WIDTHxHEIGHT", default: "100x100")output_dir(optional): Directory to save extracted images
Example:
Extract all images larger than 200x200 pixels from PDF pages 1-10
get_pdf_metadata - PDF Metadata Reader
Read comprehensive metadata information from PDF documents.
Parameters:
file_path(required): Path to PDF fileinclude_xmp(optional): Include advanced XMP metadata (default: False)
Example:
Read all metadata from document.pdf including title, author, creation date
set_pdf_metadata - PDF Metadata Writer
Write or update PDF metadata fields.
Parameters:
file_path(required): Path to source PDF fileoutput_file(optional): Output PDF file pathtitle(optional): Document titleauthor(optional): Document authorsubject(optional): Document subjectcreator(optional): Creator application nameproducer(optional): Producer application namekeywords(optional): Keywords or tagspreserve_existing(optional): Preserve existing metadata (default: True)
Example:
Set metadata for report.pdf with title "Annual Report 2024" and author "John Doe"
remove_pdf_metadata - PDF Metadata Remover
Remove specific metadata fields or all metadata from PDF.
Parameters:
file_path(required): Path to source PDF fileoutput_file(optional): Output PDF file pathfields_to_remove(optional): List of specific fields to removeremove_all(optional): Remove all metadata (default: False)
Example:
Remove author and title metadata from sensitive_document.pdf
search_pdf_text - PDF Text Search Engine
Search for text content across PDF pages with detailed match information.
Parameters:
file_path(required): Path to PDF filequery(required): Text to search for (or regex pattern)pages(optional): Page range (e.g., "1,3,5-10,-1")case_sensitive(optional): Case-sensitive search (default: False)regex_search(optional): Treat query as regex pattern (default: False)context_chars(optional): Context characters around matches (default: 100)max_matches(optional): Maximum matches to return (default: 100)
Example:
Search for "financial report" in document.pdf with case-insensitive matching
extract_page_text - Single Page Text Extractor
Extract text from a specific PDF page with various extraction options.
Parameters:
file_path(required): Path to PDF filepage_number(required): Page number to extract (1-based)extraction_mode(optional): Extraction mode ("default", "layout", "simple")
Example:
Extract text from page 5 of document.pdf with layout preservation
find_and_highlight_text - Text Highlighting Tool
Find text and return information for highlighting matches.
Parameters:
file_path(required): Path to PDF filequery(required): Text to search forpages(optional): Page range (e.g., "1,3,5-10,-1")case_sensitive(optional): Case-sensitive search (default: False)
Example:
Find all instances of "important" in document.pdf for highlighting
optimize_pdf - PDF Optimization Tool
Optimize PDF file using various compression techniques.
Parameters:
file_path(required): Path to source PDF fileoutput_file(optional): Output PDF file pathoptimization_level(optional): Optimization preset ("light", "medium", "heavy", "maximum")
Example:
Optimize large_document.pdf using medium compression level
compress_pdf_images - PDF Image Compression
Compress images in PDF while preserving document structure.
Parameters:
file_path(required): Path to source PDF fileoutput_file(optional): Output PDF file pathquality(optional): Image compression quality (1-100, default: 80)
Example:
Compress images in photo_heavy.pdf to 60% quality
remove_pdf_content - PDF Content Remover
Remove specific content from PDF to reduce file size.
Parameters:
file_path(required): Path to source PDF fileoutput_file(optional): Output PDF file pathremove_images(optional): Remove all images (default: False)remove_annotations(optional): Remove annotations (default: False)compress_streams(optional): Compress content streams (default: True)
Example:
Remove all images and annotations from document.pdf to reduce size
analyze_pdf_size - PDF Size Analysis Tool
Analyze PDF file to identify optimization opportunities.
Parameters:
file_path(required): Path to PDF file to analyze
Example:
Analyze large_file.pdf to get optimization recommendations
Output Format
All tools return structured JSON containing relevant data. Text extraction and OCR tools return:
{
"success": true,
"file_path": "/path/to/file.pdf",
"total_pages": 10,
"processed_pages": [1, 2, 3],
"chunks": [
{
"content": "Extracted text...",
"page_number": 1,
"chunk_index": 0,
"metadata": {
"quality_score": 0.95,
"word_count": 150
}
}
],
"summary": {
"total_chunks": 5,
"total_chars": 2500,
"pages": [1, 2, 3]
},
"extraction_method": "text_extraction"
}
Language Support
OCR Languages
The ocr_pdf tool supports multiple languages via Tesseract:
- Chinese:
chi_sim(Simplified),chi_tra(Traditional) - English:
eng - Combined:
chi_sim+eng(mixed Chinese and English) - Others: Available based on your Tesseract installation
Performance Features
Caching System
- File-based invalidation - Cache automatically invalidates when files change
- Operation-specific caching - Different cache entries for different operations
- Memory management - Configurable cache size and TTL
Text Quality Analysis
The system automatically analyzes extracted text quality using:
- Character-to-word ratios
- Sentence structure analysis
- Letter-to-character ratios
- Special character detection
Low-quality text triggers OCR recommendations.
Chunking Strategy
- Recursive character splitting with semantic separators
- Configurable overlap to preserve context
- Metadata preservation including page numbers and positions
Error Handling
The server provides detailed error information:
- Missing file errors
- Invalid page range errors
- OCR engine initialization errors
- Processing timeout errors
Development
Project Structure
pdfreadermcp/
\x00\x00 pyproject.toml # uv project configuration
\x00\x00 README.md
\x00\x00 src/pdfreadermcp/
\x00\x00 __init__.py
\x00\x00 __main__.py # Entry point
\x00\x00 server.py # MCP server implementation
\x00\x00 tools/
\x00\x00 pdf_reader.py # Text extraction tool
\x00\x00 pdf_ocr.py # OCR processing tool
pdf_operations.py # PDF splitting, merging, extraction
pdf_image_converter.py # PDF-image conversion tools
pdf_metadata.py # PDF metadata management
pdf_text_search.py # PDF text search and highlighting
pdf_optimizer.py # PDF compression and optimization
\x00\x00 utils/
\x00\x00 chunker.py # Text chunking utilities
\x00\x00 cache.py # Caching system
\x00\x00 file_handler.py # File operations
Running Tests
# Install with dev dependencies
uv sync --dev
# Run tests (when available)
uv run pytest
Dependencies
Core Dependencies
- mcp - Model Context Protocol server framework
- pypdf - PDF text extraction and manipulation
- pdf2image - PDF to image conversion
- pytesseract - Python wrapper for Tesseract OCR
- tesseract - OCR engine
- pillow - Image processing and manipulation
System Requirements
- For OCR: Tesseract OCR engine must be installed
- For PDF conversion: poppler-utils may be required on some systems
Troubleshooting
Common Issues
1. Tesseract OCR Installation Issues
If Tesseract is not found, you may see errors like "TesseractNotFoundError". Solutions:
Windows:
- Ensure Tesseract is installed and added to PATH
- Or set the path manually in your environment:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
macOS/Linux:
- Install via package manager:
brew install tesseract(macOS) orapt install tesseract-ocr(Ubuntu) - Make sure Chinese language packs are installed
2. pdf2image Dependencies On Linux, you may need to install poppler:
# Ubuntu/Debian
sudo apt-get install poppler-utils
# CentOS/RHEL
sudo yum install poppler-utils
3. Chinese Language Pack Issues
If OCR fails for Chinese text or produces poor results:
- Windows: During Tesseract installation, select "Additional Language Data" and install Chinese packs
- macOS:
brew install tesseract-lang - Linux:
sudo apt install tesseract-ocr-chi-sim tesseract-ocr-chi-tra
Verify language packs are installed:
tesseract --list-langs
4. Memory Issues with Large PDFs
- Reduce
chunk_sizeparameter - Process pages in smaller ranges
- Ensure sufficient system memory
- Lower
dpiparameter for faster processing
Performance Tips
- Use caching - The same file with same parameters will use cached results
- Process specific pages - Use page ranges instead of processing entire documents
- Adjust chunk sizes - Smaller chunks for memory-constrained environments
- Choose appropriate tools - Use
read_pdffirst, thenocr_pdfif needed - OCR optimization:
- Lower
dpi(150-200) for faster processing - Use
chi_simonly if document is purely Chinese - Process problematic pages only, not entire document
- Lower
🧪 Testing & Quality Assurance
This project has been thoroughly tested with comprehensive test coverage:
- ✅ 18/18 tools fully functional (100% success rate)
- 🔧 All JSON serialization issues resolved
- 📋 Extensive testing with real PDF documents
- ⚡ Performance validation with caching system
- 🌍 Multi-language OCR testing (Chinese/English)
License
This project is licensed under the MIT License.
Contributing
Contributions are welcome! Please feel free to submit issues and enhancement requests.
Support
For questions and support:
- Create an issue in the project repository
- Check the troubleshooting section above
- Review the MCP documentation at https://modelcontextprotocol.io
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pdfmcp_tools-0.1.0.tar.gz.
File metadata
- Download URL: pdfmcp_tools-0.1.0.tar.gz
- Upload date:
- Size: 241.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.16
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
af1de4122d939232fb5d40dfde7ef0f091b1fd0d23d45d1d4d70c4ff61f92f39
|
|
| MD5 |
bb7737ea8b2c424bb376dc16ceb11237
|
|
| BLAKE2b-256 |
72b68a8724212865dd5f64285e7745e8d52e81c7c910aba38e909d68761aa2f8
|
File details
Details for the file pdfmcp_tools-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pdfmcp_tools-0.1.0-py3-none-any.whl
- Upload date:
- Size: 45.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.16
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4c15dfc6d6552018bd145efdba259319dd9d970226b876963b0ea0a1a83cb5e4
|
|
| MD5 |
3b11ce5b0de638da33e73a78bc2b74a6
|
|
| BLAKE2b-256 |
ee7e544a50dac430f1820b87059388143a2e95a5473e2b38b8f2fec65e37ad2a
|