Production-ready MCP server for PDF processing with intelligent caching. Extract text, search, and analyze PDFs with AI agents.
pdf-mcp 📄
A Python implementation of the Model Context Protocol (MCP) server that enables AI agents like Claude to read, search, and extract content from PDF files efficiently.
✨ Features
- 🚀 8 Specialized Tools - Purpose-built tools for different PDF operations
- 💾 SQLite Caching - Persistent cache survives server restarts (essential for STDIO transport)
- 📄 Smart Pagination - Read large PDFs in manageable chunks
- 🔍 Full-Text Search - Find content without loading the entire document
- 🖼️ Image Extraction - Extract images as base64 PNG
- 🌐 URL Support - Read PDFs from HTTP/HTTPS URLs
- ⚡ Fast Subsequent Access - Cached pages load instantly
📦 Installation
pip install pdf-mcp
🚀 Quick Start
Claude Code
claude mcp add pdf-mcp -- pdf-mcp
Or add to ~/.claude.json:
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}
Claude Desktop
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}
Location of config file:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
After updating the config, restart Claude Desktop to load the MCP server.
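If the config file already lists other MCP servers, the entry has to be merged in rather than pasted over the whole file. The following is a minimal sketch of that merge; the helper name `add_pdf_mcp_server` is hypothetical and not part of pdf-mcp.

```python
import json


def add_pdf_mcp_server(config_text: str) -> str:
    """Return config JSON with a pdf-mcp entry added, keeping existing servers.

    Hypothetical helper for illustration only; it is not shipped with pdf-mcp.
    """
    # An empty or missing file is treated as an empty config object.
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    # Same entry as shown in the JSON snippet above.
    servers["pdf-mcp"] = {"command": "pdf-mcp"}
    return json.dumps(config, indent=2)
```

Reading the file, passing its text through this function, and writing the result back leaves any other configured servers untouched.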
🛠️ Tools
1. pdf_info - Get Document Information
Always call this first to understand the document before reading.
"Read the PDF at /path/to/document.pdf"
Returns: page count, metadata, table of contents, file size, estimated tokens.
2. pdf_read_pages - Read Specific Pages
Read pages in chunks to manage context size.
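Page selections mix single pages and ranges, for example `1-10` or `15, 20, 25-30`. As a rough illustration of how such a selection can be expanded into page numbers, here is a minimal sketch; `parse_page_spec` is a hypothetical name, and the server's actual parsing rules may differ.

```python
def parse_page_spec(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "15, 20, 25-30" into sorted 1-based page numbers.

    Hypothetical helper for illustration; not the server's actual parser.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject ranges that are reversed or fall outside the document.
        if start < 1 or end > page_count or start > end:
            raise ValueError(
                f"invalid page range {part!r} for a {page_count}-page document"
            )
        pages.update(range(start, end + 1))
    return sorted(pages)
```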
"Read pages 1-10 of the PDF"
"Read pages 15, 20, and 25-30"
3. pdf_read_all - Read Entire Document
For small documents only (has safety limit).
"Read the entire PDF (it's only 10 pages)"
4. pdf_search - Search Within PDF
Find relevant pages before loading content.
"Search for 'quarterly revenue' in the PDF"
5. pdf_get_toc - Get Table of Contents
"Show me the table of contents"
6. pdf_extract_images - Extract Images
"Extract images from pages 1-5"
7. pdf_cache_stats - View Cache Statistics
"Show PDF cache statistics"
8. pdf_cache_clear - Clear Cache
"Clear expired PDF cache entries"
📋 Example Workflow
For a large document (e.g., 200-page annual report):
User: "Summarize the risk factors in this annual report"
Claude's workflow:
1. pdf_info("report.pdf")
   → Learns: 200 pages, TOC shows "Risk Factors" on page 89
2. pdf_search("report.pdf", "risk factors")
   → Finds relevant pages: 89-110
3. pdf_read_pages("report.pdf", "89-100")
   → Reads first batch
4. pdf_read_pages("report.pdf", "101-110")
   → Reads second batch
5. Synthesizes answer from chunks
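The batching in steps 3 and 4 can be sketched as a small helper that splits a page span into range strings of bounded size. This is an illustration only: `batch_page_ranges` is a hypothetical name, and the batch size of 12 is chosen to reproduce the split shown above, not a documented default.

```python
def batch_page_ranges(first: int, last: int, batch_size: int) -> list[str]:
    """Split pages first..last (inclusive) into "start-end" range strings.

    Hypothetical helper for illustration; not part of pdf-mcp.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if first > last:
        raise ValueError("first must not exceed last")
    ranges: list[str] = []
    start = first
    while start <= last:
        end = min(start + batch_size - 1, last)
        # A one-page batch is written as a single page number.
        ranges.append(f"{start}-{end}" if end > start else str(start))
        start = end + 1
    return ranges


print(batch_page_ranges(89, 110, 12))  # ['89-100', '101-110']
```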
💾 Caching
The server uses SQLite for persistent caching because MCP with STDIO transport spawns a new server process for each conversation, so an in-memory cache would be lost between conversations.
Cache Location
~/.cache/pdf-mcp/cache.db
What's Cached
| Data | Benefit |
|---|---|
| Metadata | Instant document info |
| Page text | Skip re-extraction |
| Images | Skip re-encoding |
| TOC | Fast navigation |
Cache Invalidation
- Automatic when file modification time changes
- Manual via pdf_cache_clear tool
- TTL: 24 hours (configurable)
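To make the invalidation rules concrete, here is a minimal sketch of a page-text cache that treats an entry as stale when the file's modification time changes or the TTL expires. The table layout and the function names (`open_cache`, `put_page`, `get_page`) are assumptions for illustration; the actual schema in cache.db may look different.

```python
import os
import sqlite3
import time

TTL_SECONDS = 24 * 3600  # mirrors the documented 24-hour default


def open_cache(db_path: str) -> sqlite3.Connection:
    """Open (or create) a cache database with a hypothetical page_text table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_text ("
        " file_path TEXT, page INTEGER, mtime REAL, cached_at REAL, text TEXT,"
        " PRIMARY KEY (file_path, page))"
    )
    return conn


def put_page(conn, file_path, page, text, now=None):
    """Store page text together with the file's current modification time."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO page_text VALUES (?, ?, ?, ?, ?)",
        (file_path, page, os.path.getmtime(file_path), now, text),
    )
    conn.commit()


def get_page(conn, file_path, page, now=None):
    """Return cached text, or None if missing, expired, or the file changed."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT mtime, cached_at, text FROM page_text"
        " WHERE file_path = ? AND page = ?",
        (file_path, page),
    ).fetchone()
    if row is None:
        return None
    mtime, cached_at, text = row
    if mtime != os.path.getmtime(file_path):
        return None  # file was modified after caching
    if now - cached_at > TTL_SECONDS:
        return None  # entry is older than the TTL
    return text
```

Because the database lives on disk, a fresh server process can call `get_page` and reuse text extracted in an earlier conversation.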
⚙️ Configuration
Environment variables:
# Cache directory (default: ~/.cache/pdf-mcp)
PDF_MCP_CACHE_DIR=/path/to/cache
# Cache TTL in hours (default: 24)
PDF_MCP_CACHE_TTL=48
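For illustration, the two variables could be resolved with their documented defaults roughly as follows. This is a sketch under assumptions: `load_cache_settings` is a hypothetical name, and the server's own handling (for example of fractional hours or invalid values) is not specified here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def load_cache_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Path, float]:
    """Return (cache_dir, ttl_hours), falling back to the documented defaults.

    Hypothetical helper for illustration; not part of pdf-mcp.
    """
    env = os.environ if env is None else env
    default_dir = Path.home() / ".cache" / "pdf-mcp"
    # Unset or empty values fall back to the defaults.
    cache_dir = Path(env.get("PDF_MCP_CACHE_DIR") or default_dir).expanduser()
    ttl_hours = float(env.get("PDF_MCP_CACHE_TTL") or 24)
    return cache_dir, ttl_hours
```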
🔧 Development
# Clone
git clone https://github.com/jztan/pdf-mcp.git
cd pdf-mcp
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Type checking
mypy src/
# Linting
ruff check src/
📊 Comparison
| Feature | Traditional Approach | pdf-mcp |
|---|---|---|
| Large PDFs | Context overflow | Chunked reading |
| Repeated access | Re-parse every time | SQLite cache |
| Find content | Load everything | Search first |
| Multiple tools | One monolithic tool | 8 specialized tools |
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📄 License
MIT License - see LICENSE file.