Block-based PDF extraction MCP server optimized for LLM consumption
pdf4vllm
PDF reading MCP server optimized for vision LLMs.
Problem
| Approach | Issue |
|---|---|
| Text extraction | Encoding corruption → garbage output; text and image order gets scrambled |
| Image conversion | Token explosion (especially with many pages) |
Solution
pdf4vllm assumes PDFs are messy.
- Auto-detects text corruption → falls back to a page image automatically
- Preserves reading order (text → table → image blocks in sequence)
- Page limits prevent context overflow
- Filters unnecessary images (logos, lines, headers/footers)
PDF Input
↓
Corruption Detection (pdfminer.six + pattern analysis)
↓
┌─────────────┬─────────────┐
│ Corrupted │ Clean │
│ → Image │ → Text + │
│ only │ Tables + │
│ │ Images │
└─────────────┴─────────────┘
↓
Ordered Blocks (JSON)
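The corruption check in the diagram above can be pictured with a small heuristic. This is only a sketch under stated assumptions, not pdf4vllm's actual detector: the function names, the 0.3 threshold, and the choice of "suspicious" character classes are all hypothetical, and the real server also relies on pdfminer.six, which is not used here.

```python
import unicodedata


def looks_corrupted(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical heuristic: flag text dominated by suspicious characters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False  # nothing to judge on an empty page
    suspicious = 0
    for ch in visible:
        category = unicodedata.category(ch)
        # U+FFFD (replacement character), control (Cc), private-use (Co),
        # and unassigned (Cn) code points are typical of broken font encodings.
        if ch == "\ufffd" or category in {"Cc", "Co", "Cn"}:
            suspicious += 1
    return suspicious / len(visible) > threshold


def choose_strategy(page_text: str) -> str:
    """Mirror the two branches of the diagram: image for corrupted text, else text."""
    return "image" if looks_corrupted(page_text) else "text"
```

A ratio-based check like this tolerates a few stray glyphs on an otherwise readable page, so one bad character does not force the whole page into the more token-expensive image branch.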
Install
pip install pdf4vllm-mcp
# or run without installing
uvx pdf4vllm-mcp
Claude Desktop Setup
git clone https://github.com/PyJudge/pdf4vllm-mcp.git
cd pdf4vllm-mcp
python scripts/install_mcp.py
Or manually edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "pdf4vllm": {
      "command": "/path/to/python",
      "args": ["/path/to/pdf4vllm-mcp/src/server.py"]
    }
  }
}
Claude Code Setup
Create .mcp.json in your project:
{
  "mcpServers": {
    "pdf4vllm": {
      "command": "uvx",
      "args": ["pdf4vllm-mcp"]
    }
  }
}
Tools
| Tool | Description |
|---|---|
| list_pdfs | Find PDF files with glob filtering (name_pattern) |
| read_pdf | Extract PDF content as ordered blocks |
| grep_pdf | Search text in PDFs using pdfgrep (requires pdfgrep installed) |
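To illustrate what glob filtering by name_pattern means, here is a minimal sketch using only the standard library. It is not the server's implementation: the function name find_pdfs, the recursive search, and the case-insensitive .pdf check are assumptions made for the example.

```python
from fnmatch import fnmatch
from pathlib import Path


def find_pdfs(root: str, name_pattern: str = "*") -> list[str]:
    """Hypothetical helper: list PDFs under root whose file name matches a glob."""
    matches = []
    for path in Path(root).rglob("*"):
        # Keep regular files with a .pdf suffix whose name matches the pattern.
        if path.is_file() and path.suffix.lower() == ".pdf" and fnmatch(path.name, name_pattern):
            matches.append(path.relative_to(root).as_posix())
    return sorted(matches)
```

For example, find_pdfs(".", "invoice_*") would keep invoice_2024.pdf and skip report.pdf.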
Extraction Modes
| Mode | Description |
|---|---|
| auto (default) | Try text extraction → switch to image if corrupted |
| text_only | Text/tables only, no images |
| image_only | Render pages as images only |
Output Format
{
  "pages": [
    {
      "page_number": 1,
      "content_blocks": [
        {"type": "text", "content": "..."},
        {"type": "table", "content": "| A | B |"},
        {"type": "image", "content": "[IMAGE_0]"}
      ]
    }
  ]
}
When text is corrupted:
{
  "page_number": 2,
  "content_blocks": [],
  "text_corrupted": true,
  "page_image": "[IMAGE_1]"
}
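A client consuming this output has to handle both page shapes. The sketch below walks a parsed result and produces one line per block, falling back to the page image when text_corrupted is set. It assumes the corrupted-page object appears as an entry of the same "pages" array; the function name flatten_pages and the line format are illustrative only.

```python
def flatten_pages(result: dict) -> list[str]:
    """Hypothetical consumer: turn a read_pdf-style result into ordered lines."""
    lines = []
    for page in result.get("pages", []):
        number = page["page_number"]
        if page.get("text_corrupted"):
            # Corrupted pages carry no content blocks, only a page image placeholder.
            lines.append(f"[page {number}] page_image: {page['page_image']}")
            continue
        for block in page.get("content_blocks", []):
            # Blocks are emitted in the order given, preserving reading order.
            lines.append(f"[page {number}] {block['type']}: {block['content']}")
    return lines
```

Iterating content_blocks in list order is what keeps text, tables, and image placeholders in their original sequence.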
Configuration
Set options in config.json or through environment variables:
{
  "max_pages_per_request": 10,
  "max_image_dimension": 842,
  "page_image_dpi": 100
}
export PDF_MAX_PAGES=20
export PDF_PAGE_IMAGE_DPI=150
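One plausible way the two sources could combine is sketched below. This is an assumption-heavy illustration, not the server's loader: the README does not say which source wins, so the precedence (environment over file), the mapping of PDF_MAX_PAGES and PDF_PAGE_IMAGE_DPI to the JSON keys, and the use of the example values as fallbacks are all guesses. The name load_config is hypothetical.

```python
import json
import os
from pathlib import Path

# Values copied from the config.json example above; treating them as
# fallbacks is an assumption.
EXAMPLE_VALUES = {
    "max_pages_per_request": 10,
    "max_image_dimension": 842,
    "page_image_dpi": 100,
}

# Assumed mapping from the environment variables shown above to config keys.
ENV_MAP = {
    "PDF_MAX_PAGES": "max_pages_per_request",
    "PDF_PAGE_IMAGE_DPI": "page_image_dpi",
}


def load_config(path: str = "config.json", environ=None) -> dict:
    """Hypothetical loader: example values, then config.json, then environment."""
    environ = os.environ if environ is None else environ
    config = dict(EXAMPLE_VALUES)
    config_file = Path(path)
    if config_file.is_file():
        config.update(json.loads(config_file.read_text(encoding="utf-8")))
    for variable, key in ENV_MAP.items():
        if variable in environ:
            config[key] = int(environ[variable])  # env values arrive as strings
    return config
```

Passing environ explicitly keeps the function easy to test without touching the real process environment.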
Test Server
pip install "pdf4vllm-mcp[test]"
python test_server.py
# → http://localhost:8000
License
MIT