Secure FastMCP server for comprehensive PDF processing - text extraction, OCR, table extraction, forms, annotations, and more

📄 MCP PDF

A FastMCP server for PDF processing

46 tools for text extraction, OCR, tables, forms, annotations, and more

Works great with MCP Office Tools

What It Does

MCP PDF extracts content from PDFs using several libraries per task. If one library fails on a document, the server automatically falls back to the next.

Core capabilities:

- Text extraction via PyMuPDF, pdfplumber, or pypdf (auto-fallback)
- Table extraction via Camelot, pdfplumber, or Tabula (auto-fallback)
- OCR for scanned documents via Tesseract
- Form handling: extract, fill, and create PDF forms
- Document assembly: merge, split, reorder pages
- Annotations: sticky notes, highlights, stamps
- Vector graphics: extract to SVG for schematics and technical drawings

Quick Start

# Run directly from PyPI (uvx needs no separate install step)
uvx mcp-pdf

# Or add to Claude Code
claude mcp add pdf-tools uvx mcp-pdf

Development Installation

git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript

# Verify
uv run python examples/verify_installation.py

Tools

Content Extraction

Tool	What it does
`extract_text`	Pull text from PDF pages with automatic chunking for large files
`extract_tables`	Extract tables to JSON, CSV, or Markdown
`extract_images`	Extract embedded images
`extract_links`	Get all hyperlinks with page filtering
`pdf_to_markdown`	Convert PDF to markdown preserving structure
`ocr_pdf`	OCR scanned documents using Tesseract
`extract_vector_graphics`	Export vector graphics to SVG (schematics, charts, drawings)

Document Analysis

Tool	What it does
`extract_metadata`	Get title, author, creation date, page count, etc.
`get_document_structure`	Extract table of contents and bookmarks
`analyze_layout`	Detect columns, headers, footers
`is_scanned_pdf`	Check if PDF needs OCR
`compare_pdfs`	Diff two PDFs by text, structure, or metadata
`analyze_pdf_health`	Check for corruption and optimization opportunities
`analyze_pdf_security`	Report encryption, permissions, signatures

Forms

Tool	What it does
`extract_form_data`	Get form field names and values
`fill_form_pdf`	Fill form fields from JSON
`create_form_pdf`	Create new forms with text fields, checkboxes, dropdowns
`add_form_fields`	Add fields to existing PDFs

Permit Forms (Coordinate-Based)

These tools are for scanned PDFs and forms without interactive fields. They draw text directly onto the page at (x, y) coordinates.

Tool	What it does
`fill_permit_form`	Fill any PDF by drawing at coordinates (works with scanned forms)
`get_field_schema`	Get field definitions for validation or UI generation
`validate_permit_form_data`	Check data against field schema before filling
`preview_field_positions`	Generate PDF showing field boundaries (debugging)
`insert_attachment_pages`	Insert image/text pages with "See page X" references

Requires: `pip install "mcp-pdf[forms]"` (adds the reportlab dependency; the quotes keep shells such as zsh from expanding the brackets)
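To illustrate the validate-before-fill step, here is a minimal sketch of checking data against a coordinate-based field schema. The `FieldSpec` class, its attributes, and `validate_form_data` are hypothetical names invented for this example; the actual schema returned by `get_field_schema` may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical field definition: where to draw and what to accept."""
    name: str
    page: int
    x: float
    y: float
    max_length: int
    required: bool = False


def validate_form_data(data: dict[str, str], schema: list[FieldSpec]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the data is valid."""
    errors: list[str] = []
    known = {spec.name for spec in schema}
    for spec in schema:
        value = data.get(spec.name, "")
        if spec.required and not value:
            errors.append(f"{spec.name}: required field is empty")
        elif len(value) > spec.max_length:
            errors.append(
                f"{spec.name}: {len(value)} characters exceeds limit of {spec.max_length}"
            )
    # Flag keys that do not correspond to any field in the schema.
    for name in data:
        if name not in known:
            errors.append(f"{name}: not in schema")
    return errors


schema = [
    FieldSpec("owner", page=1, x=72, y=700, max_length=10, required=True),
    FieldSpec("parcel", page=1, x=72, y=680, max_length=5),
]
print(validate_form_data({"owner": "Ada", "parcel": "12345"}, schema))
```

Collecting every problem into a list, rather than raising on the first one, lets a caller report all issues at once before anything is drawn onto the PDF.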

Document Assembly

Tool	What it does
`merge_pdfs`	Combine multiple PDFs with bookmark preservation
`split_pdf_by_pages`	Split by page ranges
`split_pdf_by_bookmarks`	Split at chapter/section boundaries
`reorder_pdf_pages`	Rearrange pages in custom order

Annotations

Tool	What it does
`add_sticky_notes`	Add comment annotations
`add_highlights`	Highlight text regions
`add_stamps`	Add Approved/Draft/Confidential stamps
`extract_all_annotations`	Export annotations to JSON

How Fallbacks Work

The server tries multiple libraries for each operation:

Text extraction:

1. PyMuPDF (fastest)
2. pdfplumber (better for complex layouts)
3. pypdf (most compatible)

Table extraction:

1. Camelot (best accuracy, requires Ghostscript)
2. pdfplumber (no dependencies)
3. Tabula (requires Java)

If a PDF fails with one library, the next is tried automatically.
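The fallback behavior described above can be sketched as a simple chain. This is an illustration only, not the server's actual code: `extract_with_fallback`, `AllExtractorsFailed`, and the two stand-in extractors are hypothetical names, and treating empty output as a failure is an assumption of this sketch.

```python
from typing import Callable, Sequence

Extractor = Callable[[str], str]


class AllExtractorsFailed(RuntimeError):
    """Raised when every extractor in the chain has failed."""


def extract_with_fallback(
    path: str, extractors: Sequence[tuple[str, Extractor]]
) -> tuple[str, str]:
    """Try each (name, extractor) in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # a real server would likely catch narrower errors
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        # Assumption: empty output counts as a failure, so the next library is tried.
        errors.append(f"{name}: returned no text")
    raise AllExtractorsFailed("; ".join(errors))


# Hypothetical stand-ins for the real library calls.
def fast_but_fragile(path: str) -> str:
    raise ValueError("cannot parse xref table")


def slow_but_tolerant(path: str) -> str:
    return f"text from {path}"


used, text = extract_with_fallback(
    "report.pdf",
    [("fast", fast_but_fragile), ("tolerant", slow_but_tolerant)],
)
print(used, "->", text)
```

Keeping the error from each failed attempt makes the final exception useful for debugging when no library can read the file.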

Token Management

Large PDFs can overflow MCP response limits. The server handles this:

- Automatic chunking splits large documents into page groups
- Table row limits prevent huge tables from blowing up responses
- Summary mode returns structure without full content

# Get first 10 pages
result = await extract_text("huge.pdf", pages="1-10")

# Limit table rows
tables = await extract_tables("data.pdf", max_rows_per_table=50)

# Structure only
tables = await extract_tables("data.pdf", summary_only=True)
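For readers curious how page chunking and row limits might work, here is a rough sketch using only the standard library. The helper names (`parse_page_range`, `chunk_pages`, `limit_rows`) are hypothetical, and accepting comma-separated ranges such as `"1-3,7"` is an assumption; only the `"1-10"` form appears in the examples above.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "1-3,7" into 1-based page numbers, clipped to the document."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_s, _, end_s = part.partition("-")
        start = int(start_s)
        end = int(end_s) if end_s else start
        pages.extend(p for p in range(start, end + 1) if 1 <= p <= page_count)
    return pages


def chunk_pages(pages: list[int], chunk_size: int) -> list[list[int]]:
    """Split a page list into consecutive groups of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


def limit_rows(rows: list[list[str]], max_rows: int) -> tuple[list[list[str]], bool]:
    """Keep at most max_rows rows and report whether anything was cut off."""
    return rows[:max_rows], len(rows) > max_rows


pages = parse_page_range("1-10", page_count=25)
print(chunk_pages(pages, chunk_size=4))
```

Returning a truncation flag alongside the rows lets the caller tell a complete table from a shortened one.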

URL Processing

PDFs can be fetched directly from HTTPS URLs:

result = await extract_text("https://example.com/report.pdf")

Files are cached locally for subsequent operations.
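One plausible way to implement this kind of cache is to key each file on a hash of its URL. The sketch below is an assumption about the general technique, not the server's implementation: `cache_path_for` and `fetch_cached` are hypothetical names, and the download step is passed in as a function so the example runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map an HTTPS URL to a stable file name inside cache_dir."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"only HTTPS URLs are accepted, got {parsed.scheme!r}")
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.pdf"


def fetch_cached(url: str, cache_dir: Path, download: Callable[[str], bytes]) -> Path:
    """Download the URL once; later calls reuse the file already on disk."""
    target = cache_path_for(url, cache_dir)
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(download(url))
    return target


# Demo with a fake downloader that records how often it is called.
with tempfile.TemporaryDirectory() as tmp:
    calls: list[str] = []

    def fake_download(url: str) -> bytes:
        calls.append(url)
        return b"%PDF-1.7\n"

    first = fetch_cached("https://example.com/report.pdf", Path(tmp), fake_download)
    second = fetch_cached("https://example.com/report.pdf", Path(tmp), fake_download)
    same_file = first == second
    download_count = len(calls)

print(same_file, download_count)
```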

System Dependencies

Some features require system packages:

Feature	Dependency
OCR	`tesseract-ocr`
Camelot tables	`ghostscript`
Tabula tables	`default-jre-headless`
PDF to images	`poppler-utils`

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
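A quick way to see which optional features are usable on a machine is to check for the executables those packages provide. This is a small sketch, separate from the project's own `examples/verify_installation.py`; the mapping assumes the usual executable names on Debian/Ubuntu (`tesseract`, `gs`, `java`, `pdftoppm`), and `check_system_dependencies` is a hypothetical helper name.

```python
import shutil
from typing import Mapping

# Assumed executable names for the packages in the table above.
FEATURE_BINARIES = {
    "OCR": "tesseract",
    "Camelot tables": "gs",
    "Tabula tables": "java",
    "PDF to images": "pdftoppm",
}


def check_system_dependencies(
    binaries: Mapping[str, str] = FEATURE_BINARIES,
) -> dict[str, bool]:
    """Report, per feature, whether its executable can be found on PATH."""
    return {
        feature: shutil.which(binary) is not None
        for feature, binary in binaries.items()
    }


for feature, found in check_system_dependencies().items():
    print(f"{feature:15} {'found' if found else 'missing'}")
```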

Configuration

Optional environment variables:

Variable	Purpose
`MCP_PDF_ALLOWED_PATHS`	Colon-separated directories for file output
`PDF_TEMP_DIR`	Temp directory for processing (default: `/tmp/mcp-pdf-processing`)
`TESSDATA_PREFIX`	Tesseract language data location
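As an illustration of how a colon-separated allow-list like `MCP_PDF_ALLOWED_PATHS` can be enforced, here is a minimal sketch. The function names are hypothetical, and two behaviors are assumptions of this example rather than documented server behavior: an unset variable allows nothing, and paths are resolved before comparison so that `..` segments cannot escape an allowed directory.

```python
import os
from pathlib import Path
from typing import Mapping


def allowed_output_dirs(env: Mapping[str, str] = os.environ) -> list[Path]:
    """Parse MCP_PDF_ALLOWED_PATHS into resolved directories (POSIX-style ':' separator)."""
    raw = env.get("MCP_PDF_ALLOWED_PATHS", "")
    return [Path(part).resolve() for part in raw.split(":") if part]


def is_output_allowed(target: Path, allowed: list[Path]) -> bool:
    """True if target, after resolving '..' and symlinks, lies inside an allowed directory."""
    resolved = target.resolve()
    return any(resolved == base or base in resolved.parents for base in allowed)


dirs = allowed_output_dirs({"MCP_PDF_ALLOWED_PATHS": "/srv/out:/data/pdf"})
print(is_output_allowed(Path("/srv/out/reports/a.pdf"), dirs))
print(is_output_allowed(Path("/srv/out/../etc/passwd"), dirs))
```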

Development

# Run tests
uv run pytest

# With coverage
uv run pytest --cov=mcp_pdf

# Format
uv run black src/ tests/

# Lint
uv run ruff check src/ tests/

License

MIT

Download files

Files for mcp-pdf 2.1.4, uploaded Mar 7, 2026 via uv 0.9.26 (Trusted Publishing not used):

File	Type	Size
mcp_pdf-2.1.4.tar.gz	Source distribution	2.3 MB
mcp_pdf-2.1.4-py3-none-any.whl	Built distribution (Python 3)	186.3 kB

File hashes

File	Algorithm	Hash digest
mcp_pdf-2.1.4.tar.gz	SHA256	`abc7275eed2a9ed76343e5a1b9baae4a8d5a4cd32c430e3a7f02168d87118a5d`
mcp_pdf-2.1.4.tar.gz	MD5	`f994a042224641150b4d48818d0a3c74`
mcp_pdf-2.1.4.tar.gz	BLAKE2b-256	`45cb35e152bfc7e035fe31517fd7aa9a12983500e833d02bfcf1a8b339b8ee68`
mcp_pdf-2.1.4-py3-none-any.whl	SHA256	`e3c0f2f02b872dfc7eabf928c4a64bb5dea5c7478f81f88490986dad565a3658`
mcp_pdf-2.1.4-py3-none-any.whl	MD5	`ece4b1257bf27be8997878c5f7a5133f`
mcp_pdf-2.1.4-py3-none-any.whl	BLAKE2b-256	`4d5be3c9dd526907bfab78cdfa53249d1ded165ae9a2142c21e1382ff82a9aaa`
