Skip to main content

Block-based PDF extraction MCP server optimized for LLM consumption

Project description

PDF MCP for vLLM

vLLM reads PDF files automatically.

License: MIT Python 3.12+ MCP

English | 한국어


The Problem

Feeding PDFs to vLLM is really not easy.

Read as text?
Corrupted text encoding    → garbage
Documents with mixed text and images → Text and images don't match up

Read as image?
Massive token usage     → Especially with many pages? Context explosion

The Solution

Other tools assume PDFs are clean. PDF MCP for vLLM assumes PDFs are messy. PDF MCP for vLLM and vLLM handle it automatically.

┌─────────────────────────────────────────────────┐
│  PDF Input                                       │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│  Corruption Detection                            │
│  • pdfminer.six warnings                         │
│  • Character pattern analysis                    │
│  • Automatic fallback decision                   │
└──────────────────┬──────────────────────────────┘
                   │
         ┌─────────┴─────────┐
         │                   │
    Corrupted?           Clean?
    Image only?
         │                   │
         ▼                   ▼
  ┌──────────┐        ┌──────────┐
  │ Vision   │        │ Text     │
  │ Mode     │        │ Mode     │
  │          │        │          │
  │ Page     │        │ Text     │
  │ Image    │        │ Tables   │
  │ Only     │        │ Images   │
  └─────┬────┘        └────┬─────┘
        │                  │
        └────────┬─────────┘
                 │
                 ▼
        ┌────────────────┐
        │ Ordered Blocks │
        │ • Text         │
        │ • Tables (MD)  │
        │ • Images       │
        │ • Page Images  │
        └────────────────┘
                 │
                 ▼
        ┌────────────────┐
        │ JSON Output    │
        │ Clean          │
        │ Structured     │
        │ LLM-Ready      │
        └────────────────┘

Structured Blocks = Better Understanding

PDF MCP for vLLM preserves reading order with typed blocks:

Page 1: Text → Table → Text → Image → Text
         ↓
[
  {type: "text", content: "Introduction..."},
  {type: "table", content: "| Item | Amount |"},
  {type: "text", content: "Analysis..."},
  {type: "image", content: "base64..."},
  {type: "text", content: "Conclusion..."}
]

vLLM reads naturally, not fighting scrambled content.


vLLM and PDF MCP for vLLM handle it automatically.

100-page PDF requested
↓
PDF MCP for vLLM: "Too large! Try pages 1-10, 11-20, ..."
↓
LLM makes multiple smart requests
↓
All content processed without context overflow
+ Resolution adjustment included (100dpi default)
User: "What if I change the background color of scanned_contract.pdf to red?"

vLLM: Calls with extraction_mode="image_only"

PDF MCP for vLLM:
  - Skips useless text extraction attempt
  - Renders each page as image
  - Sends directly to vision

Result: Fast, accurate, no waste

Smart Corruption Handling

Automatically detects PDFs that can't be read as text
↓
Automatically sends as image
vLLM reads perfectly with vision

Before PDF MCP for vLLM:

{
  "text": "�㍻��㍺�������..."  // 5000 tokens of garbage
}

With PDF MCP for vLLM (Auto Mode):

{
  "content_blocks": [],           // Garbage blocked
  "page_image": "base64...",      // Clean image for vision
  "text_corrupted": true          // LLM knows why
}

Intelligent Image Processing

Extract all images from PDF
↓
Filter: Remove decorative junk (< 28px)
Filter: Remove extreme aspect ratios (> 15:1 ratio lines)
Filter: Remove headers/footers
↓
Crop: Scale down to A4 height (842px default)
DPI limit (100dpi default)
↓
Result: Only meaningful images, LLM-optimized sizes

Before: 50 images including logos, lines, borders After: 5 meaningful content images


See It In Action

python test_server.py

Visual test interface shows:

  • Corrupted text detection in real-time
  • How blocks are ordered
  • Inline image rendering
  • Markdown table rendering
  • Mode switching effects

Test with your own PDFs before deploying to LLM.


Quick Start

Installation

Method 1: PyPI (Easiest)

pip install pdf4vllm-mcp
# With test server: pip install pdf4vllm-mcp[test]

Method 2: Git Clone

git clone https://github.com/PyJudge/pdf4vllm-mcp.git
cd pdf4vllm-mcp
pip install -e .

Test Locally

python test_server.py
# → http://localhost:8000

See it working:

  • Upload a corrupted PDF → Watch it auto-detect and switch to image
  • Upload a clean PDF → See structured text blocks
  • Try all 3 modes → Visual rendering shows the difference

MCP Integration

Easy Install (Recommended)

git clone https://github.com/PyJudge/pdf4vllm-mcp.git
cd pdf4vllm-mcp
pip install -e .

# Cross-platform (Windows/macOS/Linux)
python scripts/install_mcp.py

# macOS/Linux only
./scripts/install_mcp.sh

The script automatically:

  • Detects your OS (Windows/macOS/Linux)
  • Finds Python path
  • Creates Claude Desktop config with correct settings
  • Backs up existing config

Then restart Claude Desktop.

Claude Code (CLI)

Create .mcp.json in your project directory:

{
  "mcpServers": {
    "pdf4vllm": {
      "command": "python",
      "args": ["-m", "src.server"]
    }
  }
}

Manual Install (Claude Desktop)

Configuration: ~/Library/Application Support/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "pdf4vllm": {
      "command": "/path/to/your/python",
      "args": ["/full/path/to/pdf4vllm-mcp/src/server.py"]
    }
  }
}

Complete Example (Conda)

{
  "mcpServers": {
    "pdf4vllm": {
      "command": "/opt/anaconda3/envs/pdfmcp/bin/python",
      "args": ["/Users/username/pdf4vllm-mcp/src/server.py"]
    }
  }
}

Complete Example (Homebrew)

{
  "mcpServers": {
    "pdf4vllm": {
      "command": "/usr/local/bin/python3",
      "args": ["/Users/username/pdf4vllm-mcp/src/server.py"]
    }
  }
}

Important:

  • Use absolute paths for both command and args

Restart Claude Desktop

Completely quit and restart Claude Desktop to load the MCP server.

Troubleshooting

"server disconnected"

  • Wrong Python path or server.py path. Verify both exist

Check logs: ~/Library/Logs/Claude/mcp-server-pdf4vllm.log


Real-World Examples

Example 1: Corrupted Legal Document

User: "Read court_document.pdf"

PDF MCP for vLLM Auto Mode:
  1. Detects: 87% text corruption
  2. Blocks: Garbage text from reaching LLM
  3. Provides: Clean page image
  4. Result: LLM reads with vision, perfect understanding

Tokens saved: ~15,000 (blocked corrupted text)
Accuracy: 100% (vision) vs 0% (garbage text)

Example 2: 200-Page Report

User: "Read annual_report.pdf"

PDF MCP for vLLM:
  "PAGE_LIMIT_EXCEEDED: Requested 200 pages exceeds limit (10).

   Suggested ranges:
   - Pages 1-10 (10 pages, ~5 images)
   - Pages 11-20 (10 pages, ~8 images)
   - Pages 21-30 (10 pages, ~12 images)
   - Pages 31-40 (10 pages, ~6 images)
   - Pages 41-50 (10 pages, ~15 images)"

User: "Read pages 1-10"
PDF MCP for vLLM: Extracts first section

User: "Read pages 11-20"
PDF MCP for vLLM: Extracts next section

Result: No context explosion, systematic reading

Configuration

Create config.json:

{
  "max_pages_per_request": 20,        // Your context size
  "max_images_per_request": 100,      // Your needs
  "max_image_dimension": 1024,        // Higher quality
  "min_image_dimension": 50,          // More aggressive filtering
  "max_aspect_ratio": 10,             // Stricter line filtering
  "page_image_dpi": 150               // Higher DPI for vision
}

Or use environment variables:

export PDF_MAX_PAGES=20
export PDF_PAGE_IMAGE_DPI=150

3 Extraction Modes

Choose based on your PDF:

Auto (Default) - PDF MCP for vLLM Decides

extraction_mode: "auto"  # Smart detection

What it does:

  1. Tries text extraction first
  2. Detects corruption automatically
  3. If corrupted → Blocks garbage text + Adds page image
  4. If clean → Returns text normally

Use when: You don't know PDF quality (most cases)


Text Only - Fast & Lightweight

extraction_mode: "text_only"

What it does:

  • Extract text + tables only
  • Never add page images
  • Minimal tokens

Use when: You KNOW the PDF is clean


Image Only - Vision First

extraction_mode: "image_only"

What it does:

  • Skip text extraction entirely
  • Render pages as images only
  • Direct to LLM vision

Use when: Scanned PDFs, known corrupted text



API

read_pdf

One parameter that matters:

{
  "file_path": "document.pdf",
  "extraction_mode": "auto"  // That's it. Everything else has smart defaults.
}

Advanced options (when you need them):

{
  "file_path": "document.pdf",
  "start_page": 1,               // Default: 1
  "end_page": 10,                // Default: last page
  "extraction_mode": "auto",     // Default: "auto"
  "filter_header_footer": true,  // Default: true
  "crop_images": true,           // Default: true
  "max_image_dimension": 842,    // Default: 842 (A4 height)
  "page_image_dpi": 100          // Default: 100
}

License

MIT

Dependencies:

  • pypdfium2: Apache 2.0 / BSD
  • pikepdf: MPL 2.0
  • pdfplumber: MIT
  • Pillow: HPND
  • pydantic: MIT

PDF MCP for vLLM v1.0

Don't fight PDFs. Let PDF MCP for vLLM handle it.

GitHub · Issues · Discussions

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdf4vllm_mcp-1.0.0.tar.gz (33.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pdf4vllm_mcp-1.0.0-py3-none-any.whl (30.6 kB view details)

Uploaded Python 3

File details

Details for the file pdf4vllm_mcp-1.0.0.tar.gz.

File metadata

  • Download URL: pdf4vllm_mcp-1.0.0.tar.gz
  • Upload date:
  • Size: 33.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pdf4vllm_mcp-1.0.0.tar.gz
Algorithm Hash digest
SHA256 12235e4b419c1828732af25b4641315e9e14b5083536e3a79708e7089c3f8471
MD5 ddb2e55f5ed8c07ea4485f94a94df7fa
BLAKE2b-256 c7ce6bff1c267f68a207ece135daae9c28907949d8fa02f33c878f5ae27d579b

See more details on using hashes here.

File details

Details for the file pdf4vllm_mcp-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: pdf4vllm_mcp-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 30.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pdf4vllm_mcp-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 591ce1609937e55dd33af48df1953a8a563da83cd84b010c41fd1245f06e3813
MD5 702bba0161800d0cab9cee7d0d8ad318
BLAKE2b-256 6da149cdb529ca9daa0ebaac4c5db65f772d753c5336702aa1cca9dac479e370

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page