Secure FastMCP server for comprehensive PDF processing - text extraction, OCR, table extraction, forms, annotations, and more

📄 MCP PDF

A FastMCP server for PDF processing

46 tools for text extraction, OCR, tables, forms, annotations, and more

Python 3.11+ | FastMCP | License: MIT | PyPI

Works great with MCP Office Tools


What It Does

MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.

Core capabilities:

  • Text extraction via PyMuPDF, pdfplumber, or pypdf (auto-fallback)
  • Table extraction via Camelot, pdfplumber, or Tabula (auto-fallback)
  • OCR for scanned documents via Tesseract
  • Form handling - extract, fill, and create PDF forms
  • Document assembly - merge, split, reorder pages
  • Annotations - sticky notes, highlights, stamps
  • Vector graphics - extract to SVG for schematics and technical drawings

Quick Start

# Run directly from PyPI (uvx fetches the package on first use)
uvx mcp-pdf

# Or add to Claude Code
claude mcp add pdf-tools uvx mcp-pdf

Development Installation

git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript

# Verify
uv run python examples/verify_installation.py

Tools

Content Extraction

Tool                     What it does
extract_text             Pull text from PDF pages with automatic chunking for large files
extract_tables           Extract tables to JSON, CSV, or Markdown
extract_images           Extract embedded images
extract_links            Get all hyperlinks with page filtering
pdf_to_markdown          Convert PDF to markdown preserving structure
ocr_pdf                  OCR scanned documents using Tesseract
extract_vector_graphics  Export vector graphics to SVG (schematics, charts, drawings)
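
To make the three output formats of extract_tables concrete, here is a minimal sketch of how one extracted table (a header plus rows of strings) could be rendered as JSON, CSV, and Markdown. The function names to_json, to_csv, and to_markdown are hypothetical and chosen for illustration; the server's actual output layout may differ.

```python
import csv
import io
import json
from typing import Sequence

Row = Sequence[str]


def to_json(header: Row, rows: Sequence[Row]) -> str:
    """One JSON object per row, keyed by column header."""
    return json.dumps([dict(zip(header, row)) for row in rows], indent=2)


def to_csv(header: Row, rows: Sequence[Row]) -> str:
    """CSV text; the csv module quotes cells that contain commas or quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def to_markdown(header: Row, rows: Sequence[Row]) -> str:
    """Pipe table; literal pipes inside cells are escaped."""
    def line(cells: Row) -> str:
        return "| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |"

    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator, *(line(row) for row in rows)])


# Illustrative data only: one cell contains a comma, another a pipe.
header = ["part", "qty"]
rows = [["resistor, 10k", "4"], ["cap|film", "2"]]
print(to_markdown(header, rows))
```

The sample cells are there to show why each format needs its own escaping rule.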

Document Analysis

Tool                    What it does
extract_metadata        Get title, author, creation date, page count, etc.
get_document_structure  Extract table of contents and bookmarks
analyze_layout          Detect columns, headers, footers
is_scanned_pdf          Check if PDF needs OCR
compare_pdfs            Diff two PDFs by text, structure, or metadata
analyze_pdf_health      Check for corruption, optimization opportunities
analyze_pdf_security    Report encryption, permissions, signatures

Forms

Tool               What it does
extract_form_data  Get form field names and values
fill_form_pdf      Fill form fields from JSON
create_form_pdf    Create new forms with text fields, checkboxes, dropdowns
add_form_fields    Add fields to existing PDFs

Permit Forms (Coordinate-Based)

For scanned PDFs or forms without interactive fields. Draws text at (x, y) coordinates.

Tool                       What it does
fill_permit_form           Fill any PDF by drawing at coordinates (works with scanned forms)
get_field_schema           Get field definitions for validation or UI generation
validate_permit_form_data  Check data against field schema before filling
preview_field_positions    Generate PDF showing field boundaries (debugging)
insert_attachment_pages    Insert image/text pages with "See page X" references

Requires: pip install "mcp-pdf[forms]" (adds the reportlab dependency; the quotes keep shells such as zsh from expanding the brackets)
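
Coordinate-based filling depends on knowing which coordinate system the numbers are in. This section does not say which origin fill_permit_form expects, so the following is only a sketch of one common conversion: a position measured in pixels from the top-left corner of a page scan, mapped to default PDF user space (points, 72 per inch, origin at the bottom-left). FieldPosition and from_top_left_pixels are hypothetical names and not part of mcp-pdf.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0


@dataclass(frozen=True)
class FieldPosition:
    """A text anchor in default PDF user space (points, origin at bottom-left)."""
    x: float
    y: float


def from_top_left_pixels(
    px: float, py: float, dpi: float, page_height_pt: float
) -> FieldPosition:
    """Convert a pixel position measured from the top-left of a scan to PDF points."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    x_pt = px * POINTS_PER_INCH / dpi
    # Flip the vertical axis: scans grow downward, PDF user space grows upward.
    y_pt = page_height_pt - py * POINTS_PER_INCH / dpi
    return FieldPosition(x=x_pt, y=y_pt)


# US Letter is 612 x 792 points; at 300 dpi that is a 2550 x 3300 pixel scan.
pos = from_top_left_pixels(px=300, py=600, dpi=300, page_height_pt=792)
print(pos)
```

A preview tool such as preview_field_positions is the natural way to check whether converted positions land where expected.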

Document Assembly

Tool                    What it does
merge_pdfs              Combine multiple PDFs with bookmark preservation
split_pdf_by_pages      Split by page ranges
split_pdf_by_bookmarks  Split at chapter/section boundaries
reorder_pdf_pages       Rearrange pages in custom order

Annotations

Tool                     What it does
add_sticky_notes         Add comment annotations
add_highlights           Highlight text regions
add_stamps               Add Approved/Draft/Confidential stamps
extract_all_annotations  Export annotations to JSON

How Fallbacks Work

The server tries multiple libraries for each operation:

Text extraction:

  1. PyMuPDF (fastest)
  2. pdfplumber (better for complex layouts)
  3. pypdf (most compatible)

Table extraction:

  1. Camelot (best accuracy, requires Ghostscript)
  2. pdfplumber (no dependencies)
  3. Tabula (requires Java)

If a PDF fails with one library, the next is tried automatically.
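
The fallback order above can be pictured as a simple chain of callables. This is a minimal sketch of the pattern, not the server's actual code: extract_with_fallback, AllExtractorsFailed, and the two stand-in extractors are hypothetical, and treating an empty result as a failure is an assumption.

```python
from typing import Callable, Sequence


class AllExtractorsFailed(Exception):
    """Raised when every extractor in the chain has failed."""


def extract_with_fallback(
    path: str,
    extractors: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, extractor) in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any library failure moves on to the next one
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():  # assumption: empty output counts as a failure too
            return name, text
        errors.append(f"{name}: empty result")
    raise AllExtractorsFailed("; ".join(errors))


# Hypothetical stand-ins for wrappers around the real libraries.
def fast_but_fragile(path: str) -> str:
    raise ValueError("cannot parse xref table")


def slow_but_tolerant(path: str) -> str:
    return f"text extracted from {path}"


chain = [("fast", fast_but_fragile), ("tolerant", slow_but_tolerant)]
used, text = extract_with_fallback("report.pdf", chain)
print(used)
```

Collecting the per-library error messages keeps the final failure informative when nothing in the chain works.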


Token Management

Large PDFs can overflow MCP response limits. The server handles this:

  • Automatic chunking splits large documents into page groups
  • Table row limits prevent huge tables from blowing up responses
  • Summary mode returns structure without full content

# Get first 10 pages
result = await extract_text("huge.pdf", pages="1-10")

# Limit table rows
tables = await extract_tables("data.pdf", max_rows_per_table=50)

# Structure only
tables = await extract_tables("data.pdf", summary_only=True)
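
For intuition, page selection and chunking can be sketched as two small helpers: one expands a 1-based spec like "1-10" into page numbers, the other groups them into fixed-size chunks. parse_pages and chunk_pages are hypothetical names, and the comma syntax and clamping behaviour are assumptions rather than documented server behaviour.

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand a 1-based spec such as "1-3,7" into sorted, de-duplicated page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length instead of failing on overshoot.
        pages.update(range(first, min(last, page_count) + 1))
    return sorted(pages)


def chunk_pages(pages: list[int], chunk_size: int) -> list[list[int]]:
    """Group page numbers into consecutive chunks of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


selected = parse_pages("1-10", page_count=250)
groups = chunk_pages(selected, chunk_size=4)
print(groups)
```

Each chunk can then be returned as its own response, which keeps any single reply bounded regardless of document length.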

URL Processing

PDFs can be fetched directly from HTTPS URLs:

result = await extract_text("https://example.com/report.pdf")

Files are cached locally for subsequent operations.
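
One plausible way to implement this kind of cache is to derive a stable file name from the URL and skip the download when that file already exists. The sketch below shows only the bookkeeping and performs no network access; cache_path_for and needs_download are hypothetical helpers, and the hashing scheme is an assumption, not the server's documented behaviour.

```python
import hashlib
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map an HTTPS URL to a stable local file path inside cache_dir."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"only HTTPS URLs are accepted: {url!r}")
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.pdf"


def needs_download(url: str, cache_dir: Path) -> bool:
    """Return True when no cached copy exists yet for this URL."""
    return not cache_path_for(url, cache_dir).exists()


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    url = "https://example.com/report.pdf"
    target = cache_path_for(url, cache_dir)
    first = needs_download(url, cache_dir)
    target.write_bytes(b"%PDF-1.7\n")  # stand-in for the downloaded bytes
    second = needs_download(url, cache_dir)

print(first, second)
```

Hashing the full URL avoids collisions between files that share a base name on different hosts.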


System Dependencies

Some features require system packages:

Feature         Dependency
OCR             tesseract-ocr
Camelot tables  ghostscript
Tabula tables   default-jre-headless
PDF to images   poppler-utils

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
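
A quick way to see which optional features will work on a given machine is to look for the executables those packages normally install. This is a small sketch using only the standard library; the mapping from feature to executable name (tesseract, gs, java, pdftoppm) reflects typical Debian/Ubuntu packaging and is an assumption about how the underlying libraries locate them.

```python
import shutil

# Feature -> executable usually provided by the package listed above.
# These names are assumptions based on typical Debian/Ubuntu packaging.
REQUIRED_TOOLS = {
    "OCR": "tesseract",
    "Camelot tables": "gs",
    "Tabula tables": "java",
    "PDF to images": "pdftoppm",
}


def missing_tools(tools: dict[str, str]) -> dict[str, str]:
    """Return the subset of features whose executable is not on PATH."""
    return {feature: exe for feature, exe in tools.items() if shutil.which(exe) is None}


for feature, exe in missing_tools(REQUIRED_TOOLS).items():
    print(f"{feature}: '{exe}' not found on PATH")
```

Because of the automatic fallbacks, a missing Ghostscript or Java only removes one table extraction path rather than disabling table extraction entirely.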

Configuration

Optional environment variables:

Variable               Purpose
MCP_PDF_ALLOWED_PATHS  Colon-separated directories for file output
PDF_TEMP_DIR           Temp directory for processing (default: /tmp/mcp-pdf-processing)
TESSDATA_PREFIX        Tesseract language data location
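
To illustrate how a colon-separated allow-list like MCP_PDF_ALLOWED_PATHS can be enforced, here is a minimal sketch that resolves the requested output path and checks that it falls inside one of the listed directories. allowed_dirs and is_output_allowed are hypothetical helpers, the fallback value is purely illustrative, and what the server does when the variable is unset is not covered here.

```python
import os
from pathlib import Path


def allowed_dirs(env_value: str) -> list[Path]:
    """Split a colon-separated value into resolved directory paths."""
    return [Path(p).expanduser().resolve() for p in env_value.split(":") if p.strip()]


def is_output_allowed(target: str, allowed: list[Path]) -> bool:
    """True if target, once resolved, sits inside one of the allowed directories."""
    resolved = Path(target).expanduser().resolve()
    return any(resolved.is_relative_to(base) for base in allowed)


# The fallback value here is illustrative only, not a documented default.
allowed = allowed_dirs(os.environ.get("MCP_PDF_ALLOWED_PATHS", "/tmp/out:/srv/pdf"))
print(is_output_allowed("/tmp/out/../secret.pdf", allowed))
```

Resolving both sides before comparing is what defeats ".." traversal; a plain string prefix check would not.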

Development

# Run tests
uv run pytest

# With coverage
uv run pytest --cov=mcp_pdf

# Format
uv run black src/ tests/

# Lint
uv run ruff check src/ tests/

License

MIT
