A Model Context Protocol server for managing PDF documents with vector search capabilities
# PDF Knowledgebase MCP Server
A Model Context Protocol (MCP) server that enables intelligent document search and retrieval from PDF collections. Built for seamless integration with Claude Desktop, Continue, Cline, and other MCP clients, this server provides semantic search capabilities powered by OpenAI embeddings and ChromaDB vector storage.
## Table of Contents

- 🚀 Quick Start
- 🏗️ Architecture Overview
- 🎯 Parser Selection Guide
- ⚙️ Configuration
- 🖥️ MCP Client Setup
- 📊 Performance & Troubleshooting
- 🔧 Advanced Configuration
- 📚 Appendix
## 🚀 Quick Start

### Step 1: Install the Server

```bash
uvx pdfkb-mcp
```
### Step 2: Configure Your MCP Client

**Claude Desktop (Most Common)** — configuration file locations:

- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
- Linux: `~/.config/Claude/claude_desktop_config.json`
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```
**VS Code (Native MCP)** — create `.vscode/mcp.json` in your workspace:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```
### Step 3: Verify Installation

1. Restart your MCP client completely.
2. Check for PDF KB tools: look for `add_document`, `search_documents`, `list_documents`, and `remove_document`.
3. Test functionality: try adding a PDF and searching for content.
## 🏗️ Architecture Overview

### MCP Integration

```
┌─────────────────┐   ┌──────────────────┐   ┌─────────────────┐
│   MCP Client    │   │    MCP Client    │   │   MCP Client    │
│ (Claude Desktop)│   │(VS Code/Continue)│   │     (Other)     │
└────────┬────────┘   └────────┬─────────┘   └────────┬────────┘
         │                     │                      │
         └─────────────────────┼──────────────────────┘
                               │
                  ┌────────────┴────────────┐
                  │     Model Context       │
                  │     Protocol (MCP)      │
                  │     Standard Layer      │
                  └────────────┬────────────┘
                               │
         ┌─────────────────────┼──────────────────────┐
         │                     │                      │
┌────────┴────────┐   ┌────────┴─────────┐   ┌────────┴────────┐
│  PDF KB Server  │   │    Other MCP     │   │    Other MCP    │
│  (This Server)  │   │      Server      │   │     Server      │
└─────────────────┘   └──────────────────┘   └─────────────────┘
```
### Available Tools & Resources

**Tools** (actions your client can perform):

- `add_document(path, metadata?)` - Add a PDF to the knowledgebase
- `search_documents(query, limit=5, metadata_filter?)` - Semantic search across PDFs
- `list_documents(metadata_filter?)` - List all documents with metadata
- `remove_document(document_id)` - Remove a document from the knowledgebase

**Resources** (data your client can access):

- `pdf://{document_id}` - Full document content as JSON
- `pdf://{document_id}/page/{page_number}` - Specific page content
- `pdf://list` - List of all documents with metadata
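Under the hood, MCP clients invoke these tools as JSON-RPC 2.0 `tools/call` requests sent one message per line over the stdio transport. As a rough illustration of the wire format (the envelope shape comes from the MCP specification, not anything specific to this server), a `search_documents` call might look like:

```python
import json

# JSON-RPC 2.0 envelope for MCP's "tools/call" method.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_documents",
        "arguments": {"query": "quarterly revenue figures", "limit": 5},
    },
}

# The stdio transport sends one JSON message per line.
print(json.dumps(request))
```

Your MCP client builds and sends these messages for you; this is only what crosses the pipe.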
## 🎯 Parser Selection Guide

### Decision Tree

```
Document Type & Priority?
├── 🏃 Speed Priority   → PyMuPDF4LLM (fastest processing, low memory)
├── 📄 Academic Papers  → MinerU (fast with GPU, excellent formulas)
├── 📊 Business Reports → Docling (medium speed, best tables)
├── ⚖️ Balanced Quality → Marker (medium speed, good structure)
└── 🎯 Maximum Accuracy → LLM (slow, vision-based API calls)
```
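The tree above reduces to a simple lookup from priority to `PDF_PARSER` value. A hypothetical helper (the labels and parser names mirror the tree; the function itself is illustrative and not part of the server's API):

```python
# Maps a document priority to the recommended PDF_PARSER value.
PARSER_BY_PRIORITY = {
    "speed": "pymupdf4llm",
    "academic": "mineru",
    "business": "docling",
    "balanced": "marker",
    "accuracy": "llm",
}

def recommend_parser(priority: str) -> str:
    """Return the suggested PDF_PARSER setting, defaulting to marker."""
    return PARSER_BY_PRIORITY.get(priority.lower(), "marker")
```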
### Performance Comparison
| Parser | Processing Speed | Memory | Text Quality | Table Quality | Best For |
|--------|------------------|--------|--------------|---------------|----------|
| **PyMuPDF4LLM** | **Fastest** | Low | Good | Basic | Speed priority |
| **MinerU** | Fast (with GPU) | High | Excellent | Excellent | Scientific papers |
| **Docling** | Medium | Medium | Excellent | **Excellent** | Business documents |
| **Marker** | Medium | Medium | Excellent | Good | **Balanced (default)** |
| **LLM** | Slow | Low | Excellent | Excellent | Maximum accuracy |
*Benchmarks from research studies and technical reports*
## ⚙️ Configuration
### Tier 1: Basic Configurations (80% of users)
**Default (Recommended)**:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "marker"
      },
      "transport": "stdio"
    }
  }
}
```
**Speed Optimized**:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "pymupdf4llm",
        "CHUNK_SIZE": "800"
      },
      "transport": "stdio"
    }
  }
}
```
**Memory Efficient**:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "pymupdf4llm",
        "EMBEDDING_BATCH_SIZE": "50"
      },
      "transport": "stdio"
    }
  }
}
```
### Tier 2: Use Case Specific (15% of users)

**Academic Papers**:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "mineru",
        "CHUNK_SIZE": "1200"
      },
      "transport": "stdio"
    }
  }
}
```
**Business Documents**:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "docling",
        "DOCLING_TABLE_MODE": "ACCURATE",
        "DOCLING_DO_TABLE_STRUCTURE": "true"
      },
      "transport": "stdio"
    }
  }
}
```
**Multi-language Documents**:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "docling",
        "DOCLING_OCR_LANGUAGES": "en,fr,de,es",
        "DOCLING_DO_OCR": "true"
      },
      "transport": "stdio"
    }
  }
}
```
**Maximum Quality**:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDF_PARSER": "llm",
        "LLM_MODEL": "anthropic/claude-3.5-sonnet",
        "EMBEDDING_MODEL": "text-embedding-3-large"
      },
      "transport": "stdio"
    }
  }
}
```
### Essential Environment Variables

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | *required* | OpenAI API key for embeddings |
| `KNOWLEDGEBASE_PATH` | `./pdfs` | Directory containing PDF files |
| `CACHE_DIR` | `./.cache` | Cache directory for processing |
| `PDF_PARSER` | `marker` | Parser: `marker`, `pymupdf4llm`, `mineru`, `docling`, `llm` |
| `CHUNK_SIZE` | `1000` | Target chunk size for the LangChain chunker |
| `EMBEDDING_MODEL` | `text-embedding-3-large` | OpenAI embedding model |
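The defaults above can be mirrored in a small settings reader. A sketch of how a server *might* resolve these variables from the environment (variable names and defaults come from the table; the class itself is illustrative, not the actual code in `src/pdfkb/config.py`):

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    openai_api_key: str
    knowledgebase_path: str = "./pdfs"
    cache_dir: str = "./.cache"
    pdf_parser: str = "marker"
    chunk_size: int = 1000
    embedding_model: str = "text-embedding-3-large"

    @classmethod
    def from_env(cls) -> "Settings":
        # OPENAI_API_KEY has no default: fail fast if it is missing.
        key = os.environ.get("OPENAI_API_KEY")
        if not key:
            raise ValueError("OPENAI_API_KEY is required")
        return cls(
            openai_api_key=key,
            knowledgebase_path=os.environ.get("KNOWLEDGEBASE_PATH", "./pdfs"),
            cache_dir=os.environ.get("CACHE_DIR", "./.cache"),
            pdf_parser=os.environ.get("PDF_PARSER", "marker"),
            chunk_size=int(os.environ.get("CHUNK_SIZE", "1000")),
            embedding_model=os.environ.get(
                "EMBEDDING_MODEL", "text-embedding-3-large"
            ),
        )
```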
## 🖥️ MCP Client Setup

### Claude Desktop

Configuration file location:

- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
- Linux: `~/.config/Claude/claude_desktop_config.json`
**Configuration:**

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "CACHE_DIR": "/Users/yourname/Documents/PDFs/.cache"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```
Verification:
- Restart Claude Desktop completely
- Look for PDF KB tools in the interface
- Test with "Add a document" or "Search documents"
### VS Code with Native MCP Support

**Configuration** (`.vscode/mcp.json` in workspace):

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```
Verification:
- Reload VS Code window
- Check VS Code's MCP server status in Command Palette
- Use MCP tools in Copilot Chat
### VS Code with Continue Extension

**Configuration** (`.continue/config.json`):

```json
{
  "models": [...],
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```
**Verification:**

- Reload the VS Code window
- Check the Continue panel for the server connection
- Use `@pdfkb` in Continue chat
### Generic MCP Client

**Standard Configuration Template:**

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "required",
        "KNOWLEDGEBASE_PATH": "required-absolute-path",
        "PDF_PARSER": "optional-default-marker"
      },
      "transport": "stdio",
      "autoRestart": true,
      "timeout": 30000
    }
  }
}
```
## 📊 Performance & Troubleshooting

### Common Issues
**Server not appearing in MCP client:**

```jsonc
// ❌ Wrong: missing "transport"
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"]
    }
  }
}

// ✅ Correct: include "transport" and restart the client
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "transport": "stdio"
    }
  }
}
```
**Processing too slow:**

```jsonc
// Switch to a faster parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "pymupdf4llm"
      },
      "transport": "stdio"
    }
  }
}
```
**Memory issues:**

```jsonc
// Reduce memory usage
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "EMBEDDING_BATCH_SIZE": "25",
        "CHUNK_SIZE": "500"
      },
      "transport": "stdio"
    }
  }
}
```
**Poor table extraction:**

```jsonc
// Use a table-optimized parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "docling",
        "DOCLING_TABLE_MODE": "ACCURATE"
      },
      "transport": "stdio"
    }
  }
}
```
### Resource Requirements
| Configuration | RAM Usage | Processing Speed | Best For |
|---|---|---|---|
| Speed | 2-4 GB | Fastest | Large collections |
| Balanced | 4-6 GB | Medium | Most users |
| Quality | 6-12 GB | Medium-Fast | Accuracy priority |
| GPU | 8-16 GB | Very Fast | High-volume processing |
## 🔧 Advanced Configuration

### Parser-Specific Options
**MinerU Configuration:**

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "mineru",
        "MINERU_LANG": "en",
        "MINERU_METHOD": "auto",
        "MINERU_VRAM": "16"
      },
      "transport": "stdio"
    }
  }
}
```
**LLM Parser Configuration:**

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDF_PARSER": "llm",
        "LLM_MODEL": "google/gemini-2.5-flash-lite",
        "LLM_CONCURRENCY": "5",
        "LLM_DPI": "150"
      },
      "transport": "stdio"
    }
  }
}
```
### Performance Tuning

**High-Performance Setup:**

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "mineru",
        "KNOWLEDGEBASE_PATH": "/Volumes/FastSSD/Documents/PDFs",
        "CACHE_DIR": "/Volumes/FastSSD/Documents/PDFs/.cache",
        "EMBEDDING_BATCH_SIZE": "200",
        "VECTOR_SEARCH_K": "15",
        "FILE_SCAN_INTERVAL": "30"
      },
      "transport": "stdio"
    }
  }
}
```
### Intelligent Caching

The server uses multi-stage caching:

- **Parsing cache**: stores converted markdown (`src/pdfkb/intelligent_cache.py:139`)
- **Chunking cache**: stores processed chunks
- **Vector cache**: ChromaDB embeddings storage

**Cache invalidation rules:**

- Changing `PDF_PARSER` → full reset (parsing + chunking + embeddings)
- Changing `PDF_CHUNKER` → partial reset (chunking + embeddings)
- Changing `EMBEDDING_MODEL` → minimal reset (embeddings only)
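A common way to implement rules like these is to fingerprint the configuration each pipeline stage depends on and invalidate a stage (plus everything downstream) whenever its fingerprint changes. A sketch of the idea, purely illustrative — the server's actual logic lives in `src/pdfkb/intelligent_cache.py`:

```python
import hashlib
import json

def fingerprint(settings: dict) -> str:
    """Stable hash of the settings that feed one cache stage."""
    blob = json.dumps(settings, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def stages_to_reset(old: dict, new: dict) -> list:
    """Return the cache stages invalidated by a config change.

    Each stage depends on its own settings plus those of every stage
    upstream of it, so a parser change cascades all the way down.
    """
    deps = [
        ("parsing", ["PDF_PARSER"]),
        ("chunking", ["PDF_PARSER", "PDF_CHUNKER"]),
        ("embeddings", ["PDF_PARSER", "PDF_CHUNKER", "EMBEDDING_MODEL"]),
    ]
    return [
        stage
        for stage, keys in deps
        if fingerprint({k: old.get(k) for k in keys})
        != fingerprint({k: new.get(k) for k in keys})
    ]
```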
## 📚 Appendix

### Installation Options

**Primary (Recommended):**

```bash
uvx pdfkb-mcp
```
**With Specific Parser Dependencies:**

```bash
uvx pdfkb-mcp[marker]      # Marker parser
uvx pdfkb-mcp[mineru]      # MinerU parser
uvx pdfkb-mcp[docling]     # Docling parser
uvx pdfkb-mcp[llm]         # LLM parser
uvx pdfkb-mcp[langchain]   # LangChain chunker
```

(In shells like zsh that expand square brackets, quote the spec: `uvx "pdfkb-mcp[marker]"`.)
**Development Installation:**

```bash
git clone https://github.com/juanqui/pdfkb-mcp.git
cd pdfkb-mcp
pip install -e ".[dev]"
```
### Complete Environment Variables Reference

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | *required* | OpenAI API key for embeddings |
| `OPENROUTER_API_KEY` | *optional* | Required for the LLM parser |
| `KNOWLEDGEBASE_PATH` | `./pdfs` | PDF directory path |
| `CACHE_DIR` | `./.cache` | Cache directory |
| `PDF_PARSER` | `marker` | PDF parser selection |
| `PDF_CHUNKER` | `unstructured` | Chunking strategy |
| `CHUNK_SIZE` | `1000` | LangChain chunk size |
| `CHUNK_OVERLAP` | `200` | LangChain chunk overlap |
| `EMBEDDING_MODEL` | `text-embedding-3-large` | OpenAI model |
| `EMBEDDING_BATCH_SIZE` | `100` | Embedding batch size |
| `VECTOR_SEARCH_K` | `5` | Default number of search results |
| `FILE_SCAN_INTERVAL` | `60` | File monitoring interval |
| `LOG_LEVEL` | `INFO` | Logging level |
### Parser Comparison Details
| Feature | PyMuPDF4LLM | Marker | MinerU | Docling | LLM |
|---|---|---|---|---|---|
| Speed | Fastest | Medium | Fast (GPU) | Medium | Slowest |
| Memory | Lowest | Medium | High | Medium | Lowest |
| Tables | Basic | Good | Excellent | Excellent | Excellent |
| Formulas | Basic | Good | Excellent | Good | Excellent |
| Images | Basic | Good | Good | Excellent | Excellent |
| Setup | Simple | Simple | Moderate | Simple | Simple |
| Cost | Free | Free | Free | Free | API costs |
### Chunking Strategies

**LangChain** (`PDF_CHUNKER=langchain`):

- Header-aware splitting with `MarkdownHeaderTextSplitter`
- Configurable via `CHUNK_SIZE` and `CHUNK_OVERLAP`
- Best when you need customizable chunking

**Unstructured** (`PDF_CHUNKER=unstructured`):

- Intelligent semantic chunking with the `unstructured` library
- Zero configuration required
- Best for document structure awareness
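Header-aware splitting keeps each chunk scoped to the markdown headings above it, so search hits carry their section context. A dependency-free sketch of the concept behind `MarkdownHeaderTextSplitter` (this is the idea, not LangChain's implementation):

```python
def split_by_headers(markdown: str) -> list:
    """Split markdown into (header_path, body) chunks at # headings."""
    chunks = []
    path = []   # stack of active headings, e.g. ["Config", "Tier 1"]
    body = []   # lines belonging to the current section

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append((" > ".join(path), text))
        body.clear()

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]               # pop deeper/equal headings
            path.append(line.lstrip("# ").strip())
        else:
            body.append(line)
    flush()
    return chunks
```

In practice the real splitter also enforces `CHUNK_SIZE`/`CHUNK_OVERLAP` limits within each section; this sketch shows only the header-tracking half.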
### Troubleshooting Guide

**API Key Issues:**

- Verify the key format starts with `sk-`
- Check that your account has sufficient credits
- Test connectivity: `curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models`

**Parser Installation Issues:**

- MinerU: `pip install mineru[all]`, then verify with `mineru --version`
- Docling: `pip install docling` for basic features, `pip install pdfkb-mcp[docling-complete]` for all features
- LLM: requires the `OPENROUTER_API_KEY` environment variable

**Performance Optimization:**

- Speed: use the `pymupdf4llm` parser
- Memory: reduce `EMBEDDING_BATCH_SIZE` and `CHUNK_SIZE`
- Quality: use `mineru` (GPU) or `docling` (CPU)
- Tables: use `docling` with `DOCLING_TABLE_MODE=ACCURATE`
For additional support, see implementation details in `src/pdfkb/main.py` and `src/pdfkb/config.py`.