MCP server for comprehensive PDF processing with 18 specialized tools

These details have not been verified by PyPI

Project links

Project description

PDF Reader MCP Server

An MCP (Model Context Protocol) server that provides comprehensive PDF processing capabilities with 18 powerful tools for text extraction, OCR, image conversion, metadata management, and optimization.

🚀 Latest Updates

✅ All 18 tools fully tested and working (September 2025)
🔧 Fixed JSON serialization issues - 100% compatibility achieved
⚡ Enhanced performance with intelligent caching system
🌍 Multi-language OCR support with Chinese and English optimization

Features

📖 Smart Text Extraction

Intelligent PDF parsing with pdfplumber for high-quality text extraction
Automatic quality detection to identify when OCR is needed
Page-wise processing with flexible page range syntax

🔍 Advanced OCR Support

Tesseract integration for scanned documents and image-based PDFs
Multi-language support with focus on Chinese and English
Confidence scoring for OCR quality assessment
Windows-friendly installation and setup

⚡ Performance Optimized

Smart caching system to avoid reprocessing unchanged files
Chunking strategies for handling large documents
Parallel page processing for improved performance

🎯 Flexible Page Selection

Support for complex page ranges:

"1,3,5" - Specific pages
"1-10" - Page ranges
"-1" - Last page
"1,3,5-10,-1" - Combined syntax

Installation

🚀 Quick Installation (Recommended)

Install and run with uvx (easiest method):

# Install and run directly with uvx (no setup required)
uvx pdfmcp-tools

# Or install globally for repeated use
uv tool install pdfmcp-tools
pdfmcp-tools

Install from PyPI with pip:

# Install from PyPI
pip install pdfmcp-tools

# Run the server (both commands work)
pdfmcp-tools
# or
pdfreadermcp

Prerequisites

Python 3.11+ (automatically handled by uvx/pip)
Tesseract OCR engine (for OCR functionality)

Install Tesseract OCR Engine

macOS:

# Using Homebrew (recommended)
brew install tesseract tesseract-lang

Linux (Ubuntu/Debian):

sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra

Windows:

Download from: https://github.com/UB-Mannheim/tesseract/wiki
Install the latest version (recommended: tesseract-ocr-w64-setup-v5.3.3.20231005.exe)
During installation, select "Additional Language Data" and install Chinese language packs
Add Tesseract to your PATH, or note the installation path for configuration

Development Installation (Advanced)

For development or local modification:

Install uv package manager (if not already installed):

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Clone and install for development:

git clone https://github.com/lihongwen/pdfreadermcp.git
cd pdfreadermcp
uv sync --dev
uv run pdfreadermcp

Usage

🚀 Running the Server

With uvx (recommended):

# Run directly (auto-downloads and starts)
uvx pdfreadermcp

# Or if globally installed
pdfreadermcp

With pip installation:

# After pip install pdfreadermcp
pdfreadermcp

Development mode:

# In project directory
uv run pdfreadermcp

Integration with Claude Desktop

Add to your Claude Desktop MCP configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json Windows: %APPDATA%\Claude\claude_desktop_config.json

Option 1: Using uvx (recommended):

{
  "mcpServers": {
    "pdfreadermcp": {
      "command": "uvx",
      "args": ["pdfmcp-tools"]
    }
  }
}

Option 2: Using global installation:

{
  "mcpServers": {
    "pdfreadermcp": {
      "command": "pdfmcp-tools"
    }
  }
}

Option 3: Development/local installation:

{
  "mcpServers": {
    "pdfreadermcp": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/your/pdfreadermcp",
        "run",
        "pdfreadermcp"
      ]
    }
  }
}

📚 Complete Tool Suite (18 Tools)

All tools have been thoroughly tested and are fully functional. The server provides comprehensive PDF processing capabilities across multiple categories:

📖 Text Processing Tools (5 tools)

read_pdf - Intelligent text extraction with chunking
extract_page_text - Single page text extraction with multiple modes
search_pdf_text - Advanced text search with regex and context support
find_and_highlight_text - Text search with highlighting coordinates
get_pdf_metadata - Comprehensive metadata reading with XMP support

📄 Document Operations Tools (5 tools)

split_pdf - Split PDFs into multiple files by page ranges
extract_pages - Extract specific pages to new PDF file
merge_pdfs - Combine multiple PDFs into single document
set_pdf_metadata - Write/update PDF metadata fields
remove_pdf_metadata - Remove specific or all metadata fields

🖼️ Image Conversion Tools (3 tools)

pdf_to_images - Convert PDF pages to high-quality images
images_to_pdf - Convert multiple images to single PDF
extract_pdf_images - Extract embedded images from PDF pages

🔍 OCR Tool (1 tool)

ocr_pdf - Advanced OCR with multi-language support and confidence scoring

⚡ Optimization Tools (4 tools)

optimize_pdf - Comprehensive PDF optimization with multiple levels
compress_pdf_images - Image compression within PDF documents
remove_pdf_content - Remove specific content to reduce file size
analyze_pdf_size - File size analysis and optimization recommendations

Tools

`read_pdf` - Text Extraction Tool

Extracts text from PDF files with intelligent processing.

Parameters:

file_path (required): Path to PDF file
pages (optional): Page range string (e.g., "1,3,5-10,-1")
chunk_size (optional): Maximum chunk size (default: 1000)
chunk_overlap (optional): Chunk overlap (default: 100)

Example:

Extract text from document.pdf, pages 1-5 and last page

`ocr_pdf` - OCR Recognition Tool

Performs OCR on PDF pages using Tesseract for scanned documents and image-based PDFs.

Parameters:

file_path (required): Path to PDF file
pages (optional): Page range string (e.g., "1,3,5-10,-1")
language (optional): OCR language code (default: "chi_sim" for Chinese)
chunk_size (optional): Maximum chunk size (default: 1000)
chunk_overlap (optional): Chunk overlap (default: 100)
dpi (optional): DPI for PDF to image conversion (default: 200)

Supported Languages:

chi_sim: Simplified Chinese (默认)
chi_tra: Traditional Chinese
eng: English
chi_sim+eng: Chinese and English mixed

Example:

Perform OCR on scanned_doc.pdf with Chinese text recognition

`split_pdf` - PDF Splitting Tool

Split PDF into multiple files based on page ranges.

Parameters:

file_path (required): Path to source PDF file
split_ranges (required): List of page ranges (e.g., ["1-5", "6-10", "11-15"])
output_dir (optional): Output directory (defaults to source file directory)
prefix (optional): Output file prefix (defaults to source filename)

Example:

Split document.pdf into multiple files: pages 1-10, 11-20, 21-30

`extract_pages` - Page Extraction Tool

Extract specific pages from PDF to a new file.

Parameters:

file_path (required): Path to source PDF file
pages (required): Page range (e.g., "1,3,5-7")
output_file (optional): Output filename (auto-generated if not provided)
output_dir (optional): Output directory (defaults to source file directory)

Example:

Extract pages 1, 5-8, and 15 from document.pdf

`merge_pdfs` - PDF Merging Tool

Merge multiple PDF files into a single file.

Parameters:

file_paths (required): List of PDF file paths to merge
output_file (optional): Output filename (auto-generated if not provided)
output_dir (optional): Output directory (defaults to first file's directory)

Example:

Merge file1.pdf, file2.pdf, and file3.pdf into a single document

`pdf_to_images` - PDF to Images Converter

Convert PDF pages to high-quality images using pdf2image.

Parameters:

file_path (required): Path to PDF file
pages (optional): Page range (e.g., "1,3,5-10,-1")
dpi (optional): Resolution for conversion (default: 200)
image_format (optional): Output format ('PNG', 'JPEG', etc.)
output_dir (optional): Directory to save images
save_to_disk (optional): Save to disk or keep in memory (default: True)

Example:

Convert first 5 pages of document.pdf to PNG images at 300 DPI

`images_to_pdf` - Images to PDF Converter

Convert multiple images into a single PDF document.

Parameters:

image_paths (required): List of image file paths
output_file (required): Output PDF file path
page_size (optional): Page size ('A4', 'Letter', 'Legal', 'auto')
quality (optional): JPEG compression quality (1-100, default: 95)
title (optional): PDF document title
author (optional): PDF document author

Example:

Convert scan1.jpg, scan2.jpg, scan3.jpg to a single PDF with A4 pages

`extract_pdf_images` - PDF Image Extractor

Extract all embedded images from PDF pages.

Parameters:

file_path (required): Path to PDF file
pages (optional): Page range (e.g., "1,3,5-10,-1")
min_size (optional): Minimum image size ("WIDTHxHEIGHT", default: "100x100")
output_dir (optional): Directory to save extracted images

Example:

Extract all images larger than 200x200 pixels from PDF pages 1-10

`get_pdf_metadata` - PDF Metadata Reader

Read comprehensive metadata information from PDF documents.

Parameters:

file_path (required): Path to PDF file
include_xmp (optional): Include advanced XMP metadata (default: False)

Example:

Read all metadata from document.pdf including title, author, creation date

`set_pdf_metadata` - PDF Metadata Writer

Write or update PDF metadata fields.

Parameters:

file_path (required): Path to source PDF file
output_file (optional): Output PDF file path
title (optional): Document title
author (optional): Document author
subject (optional): Document subject
creator (optional): Creator application name
producer (optional): Producer application name
keywords (optional): Keywords or tags
preserve_existing (optional): Preserve existing metadata (default: True)

Example:

Set metadata for report.pdf with title "Annual Report 2024" and author "John Doe"

`remove_pdf_metadata` - PDF Metadata Remover

Remove specific metadata fields or all metadata from PDF.

Parameters:

file_path (required): Path to source PDF file
output_file (optional): Output PDF file path
fields_to_remove (optional): List of specific fields to remove
remove_all (optional): Remove all metadata (default: False)

Example:

Remove author and title metadata from sensitive_document.pdf

`search_pdf_text` - PDF Text Search Engine

Search for text content across PDF pages with detailed match information.

Parameters:

file_path (required): Path to PDF file
query (required): Text to search for (or regex pattern)
pages (optional): Page range (e.g., "1,3,5-10,-1")
case_sensitive (optional): Case-sensitive search (default: False)
regex_search (optional): Treat query as regex pattern (default: False)
context_chars (optional): Context characters around matches (default: 100)
max_matches (optional): Maximum matches to return (default: 100)

Example:

Search for "financial report" in document.pdf with case-insensitive matching

`extract_page_text` - Single Page Text Extractor

Extract text from a specific PDF page with various extraction options.

Parameters:

file_path (required): Path to PDF file
page_number (required): Page number to extract (1-based)
extraction_mode (optional): Extraction mode ("default", "layout", "simple")

Example:

Extract text from page 5 of document.pdf with layout preservation

`find_and_highlight_text` - Text Highlighting Tool

Find text and return information for highlighting matches.

Parameters:

file_path (required): Path to PDF file
query (required): Text to search for
pages (optional): Page range (e.g., "1,3,5-10,-1")
case_sensitive (optional): Case-sensitive search (default: False)

Example:

Find all instances of "important" in document.pdf for highlighting

`optimize_pdf` - PDF Optimization Tool

Optimize PDF file using various compression techniques.

Parameters:

file_path (required): Path to source PDF file
output_file (optional): Output PDF file path
optimization_level (optional): Optimization preset ("light", "medium", "heavy", "maximum")

Example:

Optimize large_document.pdf using medium compression level

`compress_pdf_images` - PDF Image Compression

Compress images in PDF while preserving document structure.

Parameters:

file_path (required): Path to source PDF file
output_file (optional): Output PDF file path
quality (optional): Image compression quality (1-100, default: 80)

Example:

Compress images in photo_heavy.pdf to 60% quality

`remove_pdf_content` - PDF Content Remover

Remove specific content from PDF to reduce file size.

Parameters:

file_path (required): Path to source PDF file
output_file (optional): Output PDF file path
remove_images (optional): Remove all images (default: False)
remove_annotations (optional): Remove annotations (default: False)
compress_streams (optional): Compress content streams (default: True)

Example:

Remove all images and annotations from document.pdf to reduce size

`analyze_pdf_size` - PDF Size Analysis Tool

Analyze PDF file to identify optimization opportunities.

Parameters:

file_path (required): Path to PDF file to analyze

Example:

Analyze large_file.pdf to get optimization recommendations

Output Format

All tools return structured JSON containing relevant data. Text extraction and OCR tools return:

{
  "success": true,
  "file_path": "/path/to/file.pdf",
  "total_pages": 10,
  "processed_pages": [1, 2, 3],
  "chunks": [
    {
      "content": "Extracted text...",
      "page_number": 1,
      "chunk_index": 0,
      "metadata": {
        "quality_score": 0.95,
        "word_count": 150
      }
    }
  ],
  "summary": {
    "total_chunks": 5,
    "total_chars": 2500,
    "pages": [1, 2, 3]
  },
  "extraction_method": "text_extraction"
}

Language Support

OCR Languages

The ocr_pdf tool supports multiple languages via Tesseract:

Chinese: chi_sim (Simplified), chi_tra (Traditional)
English: eng
Combined: chi_sim+eng (mixed Chinese and English)
Others: Available based on your Tesseract installation

Performance Features

Caching System

File-based invalidation - Cache automatically invalidates when files change
Operation-specific caching - Different cache entries for different operations
Memory management - Configurable cache size and TTL

Text Quality Analysis

The system automatically analyzes extracted text quality using:

Character-to-word ratios
Sentence structure analysis
Letter-to-character ratios
Special character detection

Low-quality text triggers OCR recommendations.

Chunking Strategy

Recursive character splitting with semantic separators
Configurable overlap to preserve context
Metadata preservation including page numbers and positions

Error Handling

The server provides detailed error information:

Missing file errors
Invalid page range errors
OCR engine initialization errors
Processing timeout errors

Development

Project Structure

pdfreadermcp/
\x00\x00 pyproject.toml              # uv project configuration
\x00\x00 README.md
\x00\x00 src/pdfreadermcp/
    \x00\x00 __init__.py
    \x00\x00 __main__.py             # Entry point
    \x00\x00 server.py               # MCP server implementation
    \x00\x00 tools/
       \x00\x00 pdf_reader.py       # Text extraction tool
       \x00\x00 pdf_ocr.py          # OCR processing tool
          pdf_operations.py   # PDF splitting, merging, extraction
          pdf_image_converter.py  # PDF-image conversion tools
          pdf_metadata.py         # PDF metadata management
          pdf_text_search.py      # PDF text search and highlighting
          pdf_optimizer.py        # PDF compression and optimization
    \x00\x00 utils/
        \x00\x00 chunker.py          # Text chunking utilities
        \x00\x00 cache.py            # Caching system
        \x00\x00 file_handler.py     # File operations

Running Tests

# Install with dev dependencies
uv sync --dev

# Run tests (when available)
uv run pytest

Dependencies

Core Dependencies

mcp - Model Context Protocol server framework
pypdf - PDF text extraction and manipulation
pdf2image - PDF to image conversion
pytesseract - Python wrapper for Tesseract OCR
tesseract - OCR engine
pillow - Image processing and manipulation

System Requirements

For OCR: Tesseract OCR engine must be installed
For PDF conversion: poppler-utils may be required on some systems

Troubleshooting

Common Issues

1. Tesseract OCR Installation Issues

If Tesseract is not found, you may see errors like "TesseractNotFoundError". Solutions:

Windows:

Ensure Tesseract is installed and added to PATH
Or set the path manually in your environment:

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

macOS/Linux:

Install via package manager: brew install tesseract (macOS) or apt install tesseract-ocr (Ubuntu)
Make sure Chinese language packs are installed

2. pdf2image Dependencies On Linux, you may need to install poppler:

# Ubuntu/Debian
sudo apt-get install poppler-utils

# CentOS/RHEL  
sudo yum install poppler-utils

3. Chinese Language Pack Issues

If OCR fails for Chinese text or produces poor results:

Windows: During Tesseract installation, select "Additional Language Data" and install Chinese packs
macOS: brew install tesseract-lang
Linux: sudo apt install tesseract-ocr-chi-sim tesseract-ocr-chi-tra

Verify language packs are installed:

tesseract --list-langs

4. Memory Issues with Large PDFs

Reduce chunk_size parameter
Process pages in smaller ranges
Ensure sufficient system memory
Lower dpi parameter for faster processing

Performance Tips

Use caching - The same file with same parameters will use cached results
Process specific pages - Use page ranges instead of processing entire documents
Adjust chunk sizes - Smaller chunks for memory-constrained environments
Choose appropriate tools - Use read_pdf first, then ocr_pdf if needed
OCR optimization:
- Lower dpi (150-200) for faster processing
- Use chi_sim only if document is purely Chinese
- Process problematic pages only, not entire document

🧪 Testing & Quality Assurance

This project has been thoroughly tested with comprehensive test coverage:

✅ 18/18 tools fully functional (100% success rate)
🔧 All JSON serialization issues resolved
📋 Extensive testing with real PDF documents
⚡ Performance validation with caching system
🌍 Multi-language OCR testing (Chinese/English)

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! Please feel free to submit issues and enhancement requests.

Support

For questions and support:

Create an issue in the project repository
Check the troubleshooting section above
Review the MCP documentation at https://modelcontextprotocol.io

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.1

Sep 25, 2025

This version

0.1.0

Sep 25, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdfmcp_tools-0.1.0.tar.gz (241.8 kB view details)

Uploaded Sep 25, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pdfmcp_tools-0.1.0-py3-none-any.whl (45.0 kB view details)

Uploaded Sep 25, 2025 Python 3

File details

Details for the file pdfmcp_tools-0.1.0.tar.gz.

File metadata

Download URL: pdfmcp_tools-0.1.0.tar.gz
Upload date: Sep 25, 2025
Size: 241.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.16

File hashes

Hashes for pdfmcp_tools-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`af1de4122d939232fb5d40dfde7ef0f091b1fd0d23d45d1d4d70c4ff61f92f39`
MD5	`bb7737ea8b2c424bb376dc16ceb11237`
BLAKE2b-256	`72b68a8724212865dd5f64285e7745e8d52e81c7c910aba38e909d68761aa2f8`

See more details on using hashes here.

File details

Details for the file pdfmcp_tools-0.1.0-py3-none-any.whl.

File metadata

Download URL: pdfmcp_tools-0.1.0-py3-none-any.whl
Upload date: Sep 25, 2025
Size: 45.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.16

File hashes

Hashes for pdfmcp_tools-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4c15dfc6d6552018bd145efdba259319dd9d970226b876963b0ea0a1a83cb5e4`
MD5	`3b11ce5b0de638da33e73a78bc2b74a6`
BLAKE2b-256	`ee7e544a50dac430f1820b87059388143a2e95a5473e2b38b8f2fec65e37ad2a`

See more details on using hashes here.

pdfmcp-tools 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

PDF Reader MCP Server

🚀 Latest Updates

Features

📖 Smart Text Extraction

🔍 Advanced OCR Support

⚡ Performance Optimized

🎯 Flexible Page Selection

Installation

🚀 Quick Installation (Recommended)

Prerequisites

Install Tesseract OCR Engine

Development Installation (Advanced)

Usage

🚀 Running the Server

Integration with Claude Desktop

📚 Complete Tool Suite (18 Tools)

📖 Text Processing Tools (5 tools)

📄 Document Operations Tools (5 tools)

🖼️ Image Conversion Tools (3 tools)

🔍 OCR Tool (1 tool)

⚡ Optimization Tools (4 tools)

Tools

read_pdf - Text Extraction Tool

ocr_pdf - OCR Recognition Tool

split_pdf - PDF Splitting Tool

extract_pages - Page Extraction Tool

merge_pdfs - PDF Merging Tool

pdf_to_images - PDF to Images Converter

images_to_pdf - Images to PDF Converter

extract_pdf_images - PDF Image Extractor

get_pdf_metadata - PDF Metadata Reader

set_pdf_metadata - PDF Metadata Writer

remove_pdf_metadata - PDF Metadata Remover

search_pdf_text - PDF Text Search Engine

extract_page_text - Single Page Text Extractor

find_and_highlight_text - Text Highlighting Tool

optimize_pdf - PDF Optimization Tool

compress_pdf_images - PDF Image Compression

remove_pdf_content - PDF Content Remover

analyze_pdf_size - PDF Size Analysis Tool

Output Format

Language Support

OCR Languages

Performance Features

Caching System

Text Quality Analysis

Chunking Strategy

Error Handling

Development

Project Structure

Running Tests

Dependencies

Core Dependencies

System Requirements

Troubleshooting

Common Issues

Performance Tips

🧪 Testing & Quality Assurance

License

Contributing

Support

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

`read_pdf` - Text Extraction Tool

`ocr_pdf` - OCR Recognition Tool

`split_pdf` - PDF Splitting Tool

`extract_pages` - Page Extraction Tool

`merge_pdfs` - PDF Merging Tool

`pdf_to_images` - PDF to Images Converter

`images_to_pdf` - Images to PDF Converter

`extract_pdf_images` - PDF Image Extractor

`get_pdf_metadata` - PDF Metadata Reader

`set_pdf_metadata` - PDF Metadata Writer

`remove_pdf_metadata` - PDF Metadata Remover

`search_pdf_text` - PDF Text Search Engine

`extract_page_text` - Single Page Text Extractor

`find_and_highlight_text` - Text Highlighting Tool

`optimize_pdf` - PDF Optimization Tool

`compress_pdf_images` - PDF Image Compression

`remove_pdf_content` - PDF Content Remover

`analyze_pdf_size` - PDF Size Analysis Tool