Block-based PDF extraction MCP server optimized for LLM consumption

pdf4vllm

License: MIT · Python 3.10+

PDF reading MCP server optimized for vision LLMs.

Problem

| Approach | Issue |
|---|---|
| Text extraction | Encoding corruption → garbage output; text and images end up out of order |
| Image conversion | Token explosion (especially with many pages) |

Solution

pdf4vllm assumes PDFs are messy.

  • Auto-detects text corruption → switches to image automatically
  • Preserves reading order (text → table → image blocks in sequence)
  • Page limits prevent context overflow
  • Filters unnecessary images (logos, lines, headers/footers)
PDF Input
    ↓
Corruption Detection (pdfminer.six + pattern analysis)
    ↓
┌─────────────┬─────────────┐
│  Corrupted  │    Clean    │
│  → Image    │  → Text +   │
│    only     │    Tables + │
│             │    Images   │
└─────────────┴─────────────┘
    ↓
Ordered Blocks (JSON)
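
The diagram does not say which patterns count as corruption. As a rough illustration only (this is not pdf4vllm's actual detector), the sketch below scores extracted text by the share of `(cid:N)` placeholders, which pdfminer.six emits for glyphs it cannot map to Unicode, plus replacement and control characters, and then picks a route. The function names and the 0.2 threshold are assumptions made for this example.

```python
import re
import unicodedata

# pdfminer.six writes "(cid:123)" when a glyph has no Unicode mapping.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def corruption_ratio(text: str) -> float:
    """Fraction of characters that look like extraction damage (hypothetical heuristic)."""
    if not text.strip():
        # Empty pages (e.g. scans without a text layer) need separate handling.
        return 0.0
    cid_chars = sum(len(hit) for hit in CID_PATTERN.findall(text))
    remainder = CID_PATTERN.sub("", text)
    bad = 0
    for ch in remainder:
        if ch == "\ufffd":  # Unicode replacement character
            bad += 1
        elif ch not in "\n\r\t" and unicodedata.category(ch) in ("Cc", "Co", "Cn"):
            bad += 1  # control, private-use, or unassigned code point
    return (bad + cid_chars) / len(text)


def choose_route(text: str, threshold: float = 0.2) -> str:
    """Return "image" for likely-corrupted text, otherwise "text"."""
    return "image" if corruption_ratio(text) > threshold else "text"
```

A real detector would likely combine more signals (font encodings, language statistics), so treat the threshold as a tuning knob rather than a recommended value.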

Install

pip install pdf4vllm-mcp
# or run without installing
uvx pdf4vllm-mcp

Claude Desktop Setup

git clone https://github.com/PyJudge/pdf4vllm-mcp.git
cd pdf4vllm-mcp
python scripts/install_mcp.py

Or manually edit the Claude Desktop config (on macOS, ~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "pdf4vllm": {
      "command": "/path/to/python",
      "args": ["/path/to/pdf4vllm-mcp/src/server.py"]
    }
  }
}
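
If the config file already lists other MCP servers, pasting the snippet over it would drop them. The following is a minimal sketch of merging the entry instead. What scripts/install_mcp.py does internally is not documented here, so this is an independent example, and `add_mcp_server` is a hypothetical helper name.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str, args: list[str]) -> dict:
    """Add or update one entry under "mcpServers", keeping all other settings."""
    config = {}
    if config_path.exists():
        # Treat an empty file as an empty config.
        config = json.loads(config_path.read_text(encoding="utf-8") or "{}")
    config.setdefault("mcpServers", {})[name] = {"command": command, "args": args}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it with the placeholder paths from the snippet above replaced by your own, for example `add_mcp_server(path, "pdf4vllm", "/path/to/python", ["/path/to/pdf4vllm-mcp/src/server.py"])`.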

Claude Code Setup

Create .mcp.json in your project:

{
  "mcpServers": {
    "pdf4vllm": {
      "command": "uvx",
      "args": ["pdf4vllm-mcp"]
    }
  }
}

Tools

| Tool | Description |
|---|---|
| list_pdfs | Find PDF files with glob filtering (name_pattern) |
| read_pdf | Extract PDF content as ordered blocks |
| grep_pdf | Search text in PDFs using pdfgrep (requires pdfgrep installed) |

Extraction Modes

| Mode | Description |
|---|---|
| auto (default) | Try text extraction → switch to image if corrupted |
| text_only | Text/tables only, no images |
| image_only | Render pages as images only |

Output Format

{
  "pages": [
    {
      "page_number": 1,
      "content_blocks": [
        {"type": "text", "content": "..."},
        {"type": "table", "content": "| A | B |"},
        {"type": "image", "content": "[IMAGE_0]"}
      ]
    }
  ]
}

When text is corrupted:

{
  "page_number": 2,
  "content_blocks": [],
  "text_corrupted": true,
  "page_image": "[IMAGE_1]"
}
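
To show how a client might consume this structure, here is a minimal sketch that walks the pages in order and falls back to the page image when `text_corrupted` is set. It assumes corrupted pages appear inside the same `pages` array as clean ones, which the examples above suggest but do not state. `flatten_pages` is a hypothetical helper, not part of pdf4vllm.

```python
def flatten_pages(result: dict) -> list[tuple[int, str, str]]:
    """Return (page_number, kind, content) tuples in reading order."""
    items = []
    for page in result.get("pages", []):
        number = page["page_number"]
        if page.get("text_corrupted"):
            # Corrupted pages carry a single full-page image placeholder.
            items.append((number, "page_image", page.get("page_image", "")))
            continue
        for block in page.get("content_blocks", []):
            items.append((number, block["type"], block["content"]))
    return items
```

Each tuple keeps the page number, so a caller can still tell which `[IMAGE_n]` placeholder belongs to which page.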

Configuration

Set options in config.json or through environment variables:

{
  "max_pages_per_request": 10,
  "max_image_dimension": 842,
  "page_image_dpi": 100
}

# Environment variables
export PDF_MAX_PAGES=20
export PDF_PAGE_IMAGE_DPI=150
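
How the two sources combine is not stated. One common arrangement is for environment variables to override file values, sketched below. The precedence, the mapping from variable names to config keys (inferred from the names), and the use of the example values as defaults are all assumptions, and `load_settings` is a hypothetical function.

```python
import json
import os
from pathlib import Path

# Example values from the config above; the real defaults may differ.
DEFAULTS = {"max_pages_per_request": 10, "max_image_dimension": 842, "page_image_dpi": 100}

# Assumed mapping, inferred from the variable names.
ENV_MAP = {"PDF_MAX_PAGES": "max_pages_per_request", "PDF_PAGE_IMAGE_DPI": "page_image_dpi"}


def load_settings(config_path=None, environ=None) -> dict:
    """Merge defaults, then config.json, then environment overrides."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    if config_path is not None and Path(config_path).exists():
        settings.update(json.loads(Path(config_path).read_text(encoding="utf-8")))
    for env_name, key in ENV_MAP.items():
        if env_name in environ:
            settings[key] = int(environ[env_name])
    return settings
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.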

Test Server

pip install "pdf4vllm-mcp[test]"  # quotes keep shells such as zsh from globbing the brackets
python test_server.py
# → http://localhost:8000

License

MIT


GitHub · PyPI
