MCP-compatible PDF reading server with intelligent file search and extraction

MokuPDF - Intelligent PDF Reading Server for AI

Python 3.8+ | License: MIT | MCP Compatible

MokuPDF is an MCP (Model Context Protocol) compatible server that lets AI applications read and process PDF files. It combines intelligent file search, text extraction, image extraction, and optional OCR support to handle PDFs ranging from simple text documents to complex scanned documents.

Perfect for Claude Desktop, ChatGPT plugins, and any AI application that needs PDF processing capabilities!

Key Features

Intelligent PDF Processing

  • Full Text Extraction - Extract all text content from any PDF
  • Advanced Image Handling - Extract embedded images as base64 PNG with proper format conversion
  • Scanned PDF Support - Auto-detects and renders image-based/scanned PDFs at high resolution
  • Optional OCR Integration - Extract text from scanned documents using Tesseract (optional)
  • Page-by-Page Processing - Handle large PDFs efficiently without memory issues

Smart File Operations

  • Intelligent File Search - Find PDFs using natural language: "find the report", "open invoice"
  • Multi-Location Search - Automatically searches Desktop, Downloads, Documents, and OneDrive
  • Fuzzy Matching - Handles typos and partial filenames intelligently
  • Advanced Text Search - Search within PDFs with regex support and context

AI Integration

  • MCP Protocol Compliant - Seamlessly integrates with Claude Desktop and other AI tools
  • FastMCP Architecture - Built on the official MCP Python SDK for reliability
  • JSON-RPC Interface - Clean, standardized API for easy integration
  • Configurable & Lightweight - Minimal dependencies, fast startup, customizable options

Installation

From Source

# Clone the repository
git clone https://github.com/jameslovespancakes/mokupdf.git
cd mokupdf

# Install the package
pip install .

# Or install in development mode
pip install -e .

Using pip

# Basic installation
pip install mokupdf

# With OCR support for scanned PDFs
pip install "mokupdf[ocr]"

Note: For OCR functionality, you'll also need Tesseract installed on your system:

  • Windows: Download from GitHub releases
  • Mac: brew install tesseract
  • Linux: sudo apt-get install tesseract-ocr

Quick Start

Running the Server

# Start with default settings (port 8000)
mokupdf

# Start with custom port
mokupdf --port 8080

# Enable verbose logging
mokupdf --verbose

# Set custom PDF directory
mokupdf --base-dir ./documents

Command Line Options

| Option | Description | Default |
| --- | --- | --- |
| --port | Port to listen on | 8000 |
| --verbose | Enable verbose logging | False |
| --base-dir | Base directory for PDF files | Current directory |
| --max-file-size | Maximum PDF file size in MB | 100 |
| --version | Show version information | - |
| --help | Show help message | - |

MCP Configuration

Add MokuPDF to your MCP configuration file:

{
  "mcpServers": {
    "mokupdf": {
      "command": "python",
      "args": ["-m", "mokupdf", "--port", "8000"],
      "name": "MokuPDF",
      "description": "PDF reading server with text and image extraction",
      "env": {
        "PYTHONUNBUFFERED": "1"
      }
    }
  }
}
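
If you prefer to generate this entry from a script, the sketch below merges a mokupdf server into an existing configuration dictionary without touching other servers. The helper name add_mokupdf_server is hypothetical, and where the resulting JSON is saved depends on your MCP client.

```python
import json


def add_mokupdf_server(config, base_dir=None):
    """Return a copy of an MCP config dict with a 'mokupdf' server entry added."""
    args = ["-m", "mokupdf"]
    if base_dir is not None:
        args += ["--base-dir", base_dir]
    # Copy the existing servers so the caller's dict is not modified
    servers = dict(config.get("mcpServers", {}))
    servers["mokupdf"] = {
        "command": "python",
        "args": args,
        "env": {"PYTHONUNBUFFERED": "1"},
    }
    return {**config, "mcpServers": servers}


if __name__ == "__main__":
    print(json.dumps(add_mokupdf_server({}, base_dir="./documents"), indent=2))
```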

Available MCP Tools

1. open_pdf

Open a PDF file for processing.

{
  "tool": "open_pdf",
  "arguments": {
    "file_path": "document.pdf"
  }
}

2. read_pdf

Read PDF pages with text and images. Supports page ranges for efficient processing.

{
  "tool": "read_pdf",
  "arguments": {
    "file_path": "document.pdf",
    "start_page": 1,
    "end_page": 5,
    "max_pages": 10
  }
}

Response includes:

  • Text content with [IMAGE: ...] placeholders
  • Base64-encoded images
  • Page information
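
When calling the tools from your own client, each tool/arguments pair shown in this section is wrapped in a JSON-RPC 2.0 tools/call request. A minimal sketch of that wrapping follows; make_tool_call is a hypothetical helper name, not part of MokuPDF.

```python
import json


def make_tool_call(name, arguments, request_id):
    """Build a JSON-RPC 2.0 'tools/call' request as a single-line JSON string."""
    request = {
        "jsonrpc": "2.0",
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
        "id": request_id,
    }
    return json.dumps(request)


# Example: ask for pages 1 through 5 of document.pdf
line = make_tool_call(
    "read_pdf",
    {"file_path": "document.pdf", "start_page": 1, "end_page": 5},
    request_id=2,
)
print(line)
```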

3. search_text

Search for text within the current PDF.

{
  "tool": "search_text",
  "arguments": {
    "query": "introduction",
    "case_sensitive": false
  }
}
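
To illustrate what a case-insensitive search with surrounding context does, here is a small standalone sketch over plain text. It is not MokuPDF's implementation; the function name, the literal (non-regex) matching, and the context parameter are assumptions made for the example.

```python
import re


def search_with_context(text, query, case_sensitive=False, context=20):
    """Return (offset, snippet) pairs for every literal occurrence of query."""
    flags = 0 if case_sensitive else re.IGNORECASE
    results = []
    # re.escape makes the query match literally, even if it contains regex characters
    for match in re.finditer(re.escape(query), text, flags):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        results.append((match.start(), text[start:end]))
    return results


sample = "Introduction. This introduction is short."
for offset, snippet in search_with_context(sample, "introduction"):
    print(offset, repr(snippet))
```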

4. get_page_text

Extract text from a specific page.

{
  "tool": "get_page_text",
  "arguments": {
    "page_number": 1
  }
}

5. get_metadata

Get metadata from the current PDF.

{
  "tool": "get_metadata",
  "arguments": {}
}

6. close_pdf

Close the current PDF and free memory.

{
  "tool": "close_pdf", 
  "arguments": {}
}

7. find_pdf_files

Find PDF files using intelligent search across common directories.

{
  "tool": "find_pdf_files",
  "arguments": {
    "query": "financial report",
    "limit": 5
  }
}

Usage Examples

Natural Language File Access

# Instead of exact paths, use natural language
User: "Can you read the financial report from last quarter?"
Claude: Uses find_pdf_files("financial report") → Opens Q3_Financial_Report.pdf

User: "Look at the user manual on my desktop"
Claude: Searches Desktop → Finds User_Manual_v2.pdf → Processes it

User: "Find all invoices"
Claude: Returns a list of PDFs whose filenames match "invoice" from common locations

Text-Based PDFs

# Regular PDF with embedded images
{
  "tool": "read_pdf",
  "arguments": {
    "file_path": "annual_report.pdf",
    "start_page": 1,
    "max_pages": 10
  }
}

# Response includes:
# - Extracted text content
# - Image placeholders: [IMAGE: Image 1 - 800x600px]  
# - Base64-encoded images array
# - Page metadata
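
A client can pick the image placeholders out of the returned text with a regular expression. The sketch below assumes every placeholder follows the "[IMAGE: label - WIDTHxHEIGHTpx]" shape shown in these examples; the function name is hypothetical.

```python
import re

# Matches placeholders such as "[IMAGE: Image 1 - 800x600px]"
PLACEHOLDER = re.compile(
    r"\[IMAGE: (?P<label>.+?) - (?P<width>\d+)x(?P<height>\d+)px\]"
)


def find_image_placeholders(text):
    """Return (label, width, height) for each image placeholder found in text."""
    return [
        (m["label"], int(m["width"]), int(m["height"]))
        for m in PLACEHOLDER.finditer(text)
    ]


page_text = "Revenue grew. [IMAGE: Image 1 - 800x600px] See the chart."
print(find_image_placeholders(page_text))
```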

Scanned PDFs (Image-Based)

# Scanned document without OCR
{
  "tool": "read_pdf",
  "arguments": {
    "file_path": "scanned_contract.pdf"
  }
}

# Response:
# - "[SCANNED PAGE: This page appears to be a scanned image]"
# - "[IMAGE: Full Page Scan - 1654x2339px]"
# - High-resolution page image as base64

# With OCR enabled (pip install mokupdf[ocr])
# Response:
# - "[SCANNED PAGE - OCR EXTRACTED TEXT]:"  
# - "Actual extracted text content..."
# - "[IMAGE: Full Page Scan - 1654x2339px]"
# - Original page image as base64
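
How MokuPDF decides that a page is scanned is not documented here. One common heuristic, sketched below as an assumption rather than the actual detection logic, treats a page as scanned when it yields almost no extractable text but contains at least one image. The function name and the 20-character threshold are hypothetical.

```python
def looks_scanned(page_text, image_count, min_chars=20):
    """Guess whether a page is a scan: very little text, but at least one image."""
    has_little_text = len(page_text.strip()) < min_chars
    return has_little_text and image_count >= 1


print(looks_scanned("", image_count=1))
print(looks_scanned("A full paragraph of extracted text.", image_count=1))
```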

Smart Search & Discovery

# Find files by name
{
  "tool": "find_pdf_files", 
  "arguments": {
    "query": "invoice 2024",
    "limit": 5
  }
}

# Response includes:
# - Ranked list of matching files
# - File metadata (size, modification date, location)
# - Relevance scores

Image & Scanned PDF Support

MokuPDF automatically handles different PDF types:

| PDF Type | Text Extraction | Image Handling | OCR Support |
| --- | --- | --- | --- |
| Text-based PDF | Direct extraction | Embedded images extracted | Not needed |
| Mixed PDF | Text + images | All images extracted | Not needed |
| Scanned PDF | Limited or none | Full page rendered | Optional OCR |
| Image-only PDF | None | Full page rendered | Optional OCR |

OCR Installation

# Install with OCR support
pip install "mokupdf[ocr]"

# Install Tesseract system dependency
# Windows: Download from GitHub releases
# Mac: brew install tesseract  
# Linux: sudo apt-get install tesseract-ocr

Smart File Search

MokuPDF's intelligent file finder works with natural language:

Search Patterns

  • Exact matches: "report" → Annual_Report.pdf
  • Partial matches: "ann" → Annual_Report.pdf
  • Multiple terms: "financial report 2024" → Financial_Report_2024.pdf
  • Fuzzy matching: "finacial" → Financial_Report.pdf (handles typos)
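
The patterns above can be approximated with Python's standard difflib module. This is an illustrative sketch, not MokuPDF's matcher: fuzzy_score is a hypothetical name, and treating underscores as word separators is an assumption.

```python
import difflib
from pathlib import Path


def fuzzy_score(query, filename):
    """Score from 0 to 1 for how well query matches a PDF filename."""
    # Compare against the name without extension, lowercased, underscores as spaces
    stem = Path(filename).stem.lower().replace("_", " ")
    query = query.lower()
    if query in stem:
        return 1.0  # exact, partial, and multi-term matches
    # Otherwise compare the query with each word so a typo still scores well
    return max(
        (difflib.SequenceMatcher(None, query, word).ratio() for word in stem.split()),
        default=0.0,
    )


for q in ("ann", "finacial", "zebra"):
    print(q, round(fuzzy_score(q, "Financial_Annual_Report.pdf"), 2))
```

A caller would typically keep only files whose score clears a threshold (for example 0.8) and sort the rest by score.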

Search Locations

  • Current working directory
  • ~/Desktop
  • ~/Downloads
  • ~/Documents
  • ~/OneDrive/Desktop (if available)
  • ~/OneDrive/Documents (if available)
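
A sketch of how these locations could be enumerated with pathlib is shown below. It assumes a non-recursive scan of each directory and skips locations that do not exist; both function names are hypothetical and MokuPDF's own traversal may differ.

```python
from pathlib import Path


def candidate_dirs(home=None, cwd=None):
    """Return the search locations listed above that exist on this machine."""
    home = Path(home) if home else Path.home()
    cwd = Path(cwd) if cwd else Path.cwd()
    candidates = [
        cwd,
        home / "Desktop",
        home / "Downloads",
        home / "Documents",
        home / "OneDrive" / "Desktop",
        home / "OneDrive" / "Documents",
    ]
    return [d for d in candidates if d.is_dir()]


def list_pdfs(directories):
    """List PDF files directly inside each directory (no recursion)."""
    found = []
    for directory in directories:
        found.extend(
            sorted(
                p for p in directory.iterdir()
                if p.is_file() and p.suffix.lower() == ".pdf"
            )
        )
    return found
```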

Ranking System

Files are ranked by:

  • Exact name matches (highest priority)
  • Word boundary matches
  • Partial string matches
  • Recent modification time (boost for recent files)
  • File location (Desktop files prioritized)
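
The ranking criteria above can be combined into a single score. The sketch below is one possible weighting, not MokuPDF's actual scoring: the point values, the 30-day recency window, and the function name are all assumptions chosen for illustration.

```python
import re
import time
from pathlib import Path


def rank_score(query, path, mtime, now=None):
    """Combine name match quality, recency, and location into one score."""
    now = time.time() if now is None else now
    stem = Path(path).stem.lower()
    query = query.lower()
    # Word boundary here means "not adjacent to a letter or digit", so that
    # underscores and hyphens in filenames count as separators
    boundary = r"(?<![a-z0-9])" + re.escape(query) + r"(?![a-z0-9])"
    if stem == query:
        score = 100.0  # exact name match
    elif re.search(boundary, stem):
        score = 60.0   # word boundary match
    elif query in stem:
        score = 30.0   # partial string match
    else:
        return 0.0
    if (now - mtime) / 86400 <= 30:
        score += 10.0  # boost for files modified in the last 30 days
    if "Desktop" in Path(path).parts:
        score += 5.0   # boost for files on the Desktop
    return score
```

Sorting candidate files by this score in descending order gives the ranked list.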

Configuration Options

Command Line Arguments

mokupdf --help

Options:
  --base-dir PATH        Base directory for PDF files (default: current)
  --max-file-size INT    Maximum PDF size in MB (default: 100)
  --port INT            Port number (legacy, ignored by FastMCP)
  --verbose             Enable verbose logging (legacy, ignored)
  --version             Show version information

MCP Server Configuration

{
  "mcpServers": {
    "mokupdf": {
      "command": "python",
      "args": ["-m", "mokupdf", "--base-dir", "./documents", "--max-file-size", "200"],
      "name": "MokuPDF",
      "description": "Advanced PDF processing with smart search and OCR"
    }
  }
}

Development

Project Structure

mokupdf/
├── mokupdf/
│   ├── __init__.py       # Package initialization
│   ├── server.py         # Main server implementation
│   └── __main__.py       # Module entry point
├── setup.py              # Package setup script
├── pyproject.toml        # Modern Python packaging
├── requirements.txt      # Direct dependencies
├── LICENSE               # MIT License
└── README.md             # This file

Running Tests

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=mokupdf

Code Quality

# Format code
black mokupdf/

# Lint code  
flake8 mokupdf/

Architecture

MokuPDF is built using:

  • FastMCP: Official MCP Python SDK for reliable protocol handling
  • PyMuPDF (fitz): High-performance PDF processing and rendering
  • Pillow: Image format conversion and processing
  • pytesseract: Optional OCR text extraction from scanned documents

Troubleshooting

Common Issues

"ModuleNotFoundError: No module named 'mokupdf'"

# Install the package
pip install mokupdf

"No PDF is currently open"

# Always open a PDF first, or provide file_path in read_pdf
{
  "tool": "open_pdf",
  "arguments": {"file_path": "document.pdf"}
}

"PDF file not found"

# Use smart search instead of exact paths
{
  "tool": "find_pdf_files",
  "arguments": {"query": "document"}
}

OCR not working

# Install OCR dependencies
pip install "mokupdf[ocr]"

# Windows: Download Tesseract from GitHub releases
# Mac: brew install tesseract
# Linux: sudo apt-get install tesseract-ocr  

"File too large" errors

# Increase file size limit
mokupdf --max-file-size 500  # Allow 500MB files
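
To check ahead of time whether a file would exceed the limit, a client can compare its size on disk with the configured value. The sketch below assumes 1 MB means 1024 * 1024 bytes, which may not match how MokuPDF counts; exceeds_limit is a hypothetical helper.

```python
import os


def exceeds_limit(path, max_file_size_mb=100):
    """Return True if the file at path is larger than the limit in megabytes."""
    # Assumption: 1 MB = 1024 * 1024 bytes
    return os.path.getsize(path) > max_file_size_mb * 1024 * 1024
```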

Debug Mode

# Enable verbose logging for detailed information
mokupdf --verbose

# Check MCP connection in Claude Desktop developer tools
# Press Ctrl+Shift+I in Claude Desktop

Performance Tips

  • Large PDFs: Use start_page and end_page parameters for chunked processing
  • Memory usage: Close PDFs when done with close_pdf tool
  • OCR speed: OCR processing adds significant time - disable if not needed
  • File search: Search is cached - repeated searches are faster
  • Image quality: Scanned pages rendered at 2x resolution for clarity
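
For the chunked processing mentioned in the first tip, the page ranges can be generated up front. This sketch assumes page numbers are 1-based and that end_page is inclusive, as the read_pdf example suggests; page_ranges is a hypothetical helper.

```python
def page_ranges(total_pages, chunk_size=10):
    """Yield inclusive, 1-based (start_page, end_page) pairs covering a document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for start in range(1, total_pages + 1, chunk_size):
        yield start, min(start + chunk_size - 1, total_pages)


# Build read_pdf arguments for a 25-page document, 10 pages per call
calls = [
    {"file_path": "document.pdf", "start_page": start, "end_page": end}
    for start, end in page_ranges(25, chunk_size=10)
]
print(calls)
```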

Roadmap

  • Advanced OCR: Multiple language support, confidence scores
  • Enhanced Search: Content-based PDF search (search inside PDF text)
  • Batch Processing: Process multiple PDFs simultaneously
  • Format Support: Add support for other document formats (DOCX, PPTX)
  • Cloud Integration: Support for cloud storage (Google Drive, OneDrive API)
  • Performance: Async processing for better concurrent handling

Example Usage

Python Script Example

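import json
import subprocess

# Start the MokuPDF server; requests and responses travel over stdin/stdout
process = subprocess.Popen(
    ["mokupdf"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True
)


def send(message):
    """Write one JSON-RPC message to the server as a single line."""
    process.stdin.write(json.dumps(message) + "\n")
    process.stdin.flush()


# MCP requires an initialize handshake before any tool call
send({
    "jsonrpc": "2.0",
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.1.0"}
    },
    "id": 0
})
json.loads(process.stdout.readline())  # initialize result
send({"jsonrpc": "2.0", "method": "notifications/initialized"})

# Send a request to open a PDF
send({
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
        "name": "open_pdf",
        "arguments": {"file_path": "example.pdf"}
    },
    "id": 1
})

# Read response
response = json.loads(process.stdout.readline())
print(f"PDF opened: {response['result']}")

# Shut the server down cleanly
process.stdin.close()
process.wait()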

Integration with LLMs

MokuPDF is designed to work seamlessly with LLM applications through MCP. The read_pdf tool returns content in a format optimized for LLM consumption:

  1. Text is extracted with page markers
  2. Images are embedded as base64 PNG with placeholders in text
  3. Large PDFs can be read page-by-page to avoid context limits

Troubleshooting

Common Issues

Issue: ModuleNotFoundError: No module named 'mokupdf'

  • Solution: Install the package with pip install .

Issue: Port already in use

  • Solution: Use a different port with --port 8081

Issue: PDF file not found

  • Solution: Check the base directory and ensure paths are relative to it
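
To see which absolute path a relative file_path maps to, you can resolve it against the base directory yourself. The sketch below also rejects paths that escape the base directory; whether MokuPDF enforces the same rule is an assumption, and resolve_in_base is a hypothetical helper.

```python
from pathlib import Path


def resolve_in_base(base_dir, file_path):
    """Resolve file_path against base_dir and refuse paths outside of it."""
    base = Path(base_dir).resolve()
    candidate = (base / file_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not under base
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"{file_path!r} is outside the base directory {base}") from None
    return candidate
```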

Issue: Large PDF causes timeout

  • Solution: Use page-by-page reading with start_page and end_page parameters

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

We welcome contributions! MokuPDF is designed to be the best PDF processing tool for AI applications.

How to Contribute

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes with clear, documented code
  4. Add tests for new functionality
  5. Run code formatting: black mokupdf/
  6. Submit a pull request

Development Setup

# Clone your fork
git clone https://github.com/yourusername/mokupdf.git
cd mokupdf

# Install in development mode with all dependencies  
pip install -e ".[dev,ocr]"

# Run tests
pytest

# Format code
black mokupdf/
flake8 mokupdf/

Contribution Ideas

  • Multi-language OCR support
  • Performance optimizations
  • Advanced search algorithms
  • New document format support
  • Bug fixes and improvements
  • Documentation enhancements

Support & Community

Getting Help

  • Issues: Open a GitHub issue for bugs or feature requests
  • Discussions: Use GitHub Discussions for questions and community support
  • Troubleshooting: Enable --verbose mode for detailed debugging information

Reporting Bugs

When reporting issues, please include:

  • Operating system and Python version
  • MokuPDF version (mokupdf --version)
  • Sample PDF file (if possible)
  • Complete error message and traceback
  • Steps to reproduce the issue

Acknowledgments

MokuPDF is built on the shoulders of giants:

  • PyMuPDF - Exceptional PDF processing and rendering capabilities
  • FastMCP - Official MCP Python SDK for reliable protocol handling
  • Tesseract OCR - Open-source OCR engine for text extraction
  • Pillow - Python Imaging Library for image processing
  • Model Context Protocol - Standardized protocol for AI tool integration

Special thanks to the AI and open-source communities for inspiration and feedback.

MIT License Summary

  • Commercial use - Use in commercial applications
  • Modification - Modify and distribute changes
  • Distribution - Distribute original or modified versions
  • Private use - Use privately without restrictions
  • No warranty - Software provided "as-is"
  • License notice - Include original license in copies

Made with ❤️ for the AI community
