PDF Knowledgebase MCP Server
A Model Context Protocol (MCP) server that enables intelligent document search and retrieval from PDF collections. Built for seamless integration with Claude Desktop, Continue, Cline, and other MCP clients, this server provides advanced search capabilities powered by local or OpenAI embeddings and ChromaDB vector storage.
🆕 New Features:
- Local Embeddings: Run embeddings locally with HuggingFace models - no API costs, full privacy
- Hybrid Search: Combines semantic similarity with keyword matching (BM25) for superior search quality
- Web Interface: Modern web UI for document management and search alongside the traditional MCP protocol
Table of Contents
- 🚀 Quick Start
- 🌐 Web Interface
- 🏗️ Architecture Overview
- 🤖 Local Embeddings
- 🔍 Hybrid Search
- 🎯 Parser Selection Guide
- ⚙️ Configuration
- 🖥️ MCP Client Setup
- 📊 Performance & Troubleshooting
- 🔧 Advanced Configuration
- 📚 Appendix
🚀 Quick Start
Step 1: Configure Your MCP Client
Option A: Local Embeddings w/ Hybrid Search (No API Key Required)
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp[hybrid]"],
      "env": {
        "PDFKB_KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "PDFKB_ENABLE_HYBRID_SEARCH": "true"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```
Option B: OpenAI Embeddings w/ Hybrid Search
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp[hybrid]"],
      "env": {
        "PDFKB_EMBEDDING_PROVIDER": "openai",
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "PDFKB_ENABLE_HYBRID_SEARCH": "true"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```
Step 2: Verify Installation
- Restart your MCP client completely
- Check for PDF KB tools: look for `add_document`, `search_documents`, `list_documents`, `remove_document`
- Test functionality: try adding a PDF and searching for content
🌐 Web Interface
The PDF Knowledgebase includes a modern web interface for easy document management and search. The web interface is enabled by default.
Server Modes
1. Integrated Mode (Default - Both MCP + Web):
```bash
pdfkb-mcp
```
- Runs both the MCP server AND the web interface concurrently
- Web interface available at http://localhost:8080
- Best of both worlds: API integration + web UI

2. MCP Only Mode (Disable Web Interface):
```bash
PDFKB_ENABLE_WEB=false pdfkb-mcp
```
- Runs only the MCP server for integration with Claude Desktop, VS Code, etc.
- Most resource-efficient option
- Uses the same document storage as the web interface
Web Interface Features
- 📄 Document Upload: Drag & drop PDF files or upload via file picker
- 🔍 Semantic Search: Powerful vector-based search with real-time results
- 📊 Document Management: List, preview, and manage your PDF collection
- 📈 Real-time Status: Live processing updates via WebSocket connections
- 🎯 Chunk Explorer: View and navigate document chunks for detailed analysis
- ⚙️ System Metrics: Monitor server performance and resource usage
Quick Web Setup
1. Install and run:
   ```bash
   uvx pdfkb-mcp                    # Install if needed
   PDFKB_ENABLE_WEB=true pdfkb-mcp  # Start integrated server
   ```
2. Open your browser: http://localhost:8080
3. Configure the environment (create a `.env` file):
   ```bash
   PDFKB_OPENAI_API_KEY=sk-proj-abc123def456ghi789...
   PDFKB_KNOWLEDGEBASE_PATH=/path/to/your/pdfs
   PDFKB_WEB_PORT=8080
   PDFKB_WEB_HOST=localhost
   PDFKB_ENABLE_WEB=true
   ```
Web Configuration Options
| Environment Variable | Default | Description |
|---|---|---|
| `PDFKB_ENABLE_WEB` | `true` | Enable/disable web interface |
| `PDFKB_WEB_PORT` | `8080` | Web server port |
| `PDFKB_WEB_HOST` | `localhost` | Web server host |
| `PDFKB_WEB_CORS_ORIGINS` | `http://localhost:3000,http://127.0.0.1:3000` | CORS allowed origins |
Command Line Options
The server supports command line arguments:
```bash
# Customize web server port (web interface enabled by default)
pdfkb-mcp --port 9000

# Use a custom configuration file
pdfkb-mcp --config myconfig.env

# Change the log level
pdfkb-mcp --log-level DEBUG

# Enable the web interface via command line
pdfkb-mcp --enable-web
```
API Documentation
When running with the web interface enabled, comprehensive API documentation is available at:
- Swagger UI: http://localhost:8080/docs
- ReDoc: http://localhost:8080/redoc
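The `/docs` and `/redoc` pages suggest a FastAPI-style server, which typically also serves a machine-readable schema at `/openapi.json` — an assumption here, not a documented endpoint. A minimal Python sketch for listing the REST endpoints of a running server:

```python
# Hypothetical sketch: list REST endpoints from a running pdfkb-mcp web server.
# Assumes a FastAPI-style /openapi.json schema endpoint (not documented above).
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # default PDFKB_WEB_HOST:PDFKB_WEB_PORT

with urllib.request.urlopen(f"{BASE_URL}/openapi.json") as resp:
    schema = json.load(resp)

# Print each path together with its HTTP methods.
for path, methods in schema.get("paths", {}).items():
    print(path, sorted(method.upper() for method in methods))
```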
🏗️ Architecture Overview
MCP Integration
```mermaid
graph TB
    subgraph "MCP Clients"
        C1[Claude Desktop]
        C2[VS Code/Continue]
        C3[Other MCP Clients]
    end

    subgraph "MCP Protocol Layer"
        MCP[Model Context Protocol<br/>Standard Layer]
    end

    subgraph "MCP Servers"
        PDFKB[PDF KB Server<br/>This Server]
        S1[Other MCP<br/>Server]
        S2[Other MCP<br/>Server]
    end

    C1 --> MCP
    C2 --> MCP
    C3 --> MCP
    MCP --> PDFKB
    MCP --> S1
    MCP --> S2

    classDef client fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef protocol fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef server fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef highlight fill:#c8e6c9,stroke:#1b5e20,stroke-width:3px

    class C1,C2,C3 client
    class MCP protocol
    class S1,S2 server
    class PDFKB highlight
```
Internal Architecture
```mermaid
graph LR
    subgraph "Input Layer"
        PDF[PDF Files]
        WEB[Web Interface<br/>Port 8080]
        MCP_IN[MCP Protocol]
    end

    subgraph "Processing Pipeline"
        PARSER[PDF Parser<br/>PyMuPDF/Marker/MinerU]
        CHUNKER[Text Chunker<br/>LangChain/Unstructured]
        EMBED[Embedding Service<br/>Local/OpenAI]
    end

    subgraph "Storage Layer"
        CACHE[Intelligent Cache<br/>Multi-stage]
        VECTOR[Vector Store<br/>ChromaDB]
        TEXT[Text Index<br/>Whoosh BM25]
    end

    subgraph "Search Engine"
        HYBRID[Hybrid Search<br/>RRF Fusion]
    end

    PDF --> PARSER
    WEB --> PARSER
    MCP_IN --> PARSER
    PARSER --> CHUNKER
    CHUNKER --> EMBED
    EMBED --> CACHE
    CACHE --> VECTOR
    CACHE --> TEXT
    VECTOR --> HYBRID
    TEXT --> HYBRID
    HYBRID --> WEB
    HYBRID --> MCP_IN

    classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    classDef process fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    classDef storage fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef search fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px

    class PDF,WEB,MCP_IN input
    class PARSER,CHUNKER,EMBED process
    class CACHE,VECTOR,TEXT storage
    class HYBRID search
```
Available Tools & Resources
Tools (Actions your client can perform):
- `add_document(path, metadata?)` - Add a PDF to the knowledgebase
- `search_documents(query, limit=5, metadata_filter?, search_type?)` - Hybrid search across PDFs (semantic + keyword matching)
- `list_documents(metadata_filter?)` - List all documents with metadata
- `remove_document(document_id)` - Remove a document from the knowledgebase
Resources (Data your client can access):
- `pdf://{document_id}` - Full document content as JSON
- `pdf://{document_id}/page/{page_number}` - Specific page content
- `pdf://list` - List of all documents with metadata
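For programmatic use outside a chat client, these tools can be exercised with the official MCP Python SDK. A minimal sketch — the query text and env values are illustrative; the tool names match the list above:

```python
# Minimal sketch: call pdfkb-mcp tools over stdio using the MCP Python SDK.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(
    command="uvx",
    args=["pdfkb-mcp"],
    env={"PDFKB_KNOWLEDGEBASE_PATH": "/path/to/pdfs"},  # illustrative path
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("tools:", [tool.name for tool in tools.tools])
            # Hybrid search across the knowledgebase.
            result = await session.call_tool(
                "search_documents", {"query": "vector databases", "limit": 3}
            )
            print(result.content)

asyncio.run(main())
```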
🤖 Local Embeddings
The server now supports local embeddings as the default option, eliminating API costs and keeping your data completely private. Local embeddings run on your machine using HuggingFace models optimized for performance.
Features
- Zero API Costs: No OpenAI API charges for embeddings
- Complete Privacy: Your documents never leave your machine
- Hardware Acceleration: Automatic detection and use of Metal (macOS), CUDA (NVIDIA), or CPU
- Smart Caching: LRU cache for frequently embedded texts
- Multiple Model Sizes: Choose based on your hardware capabilities
Quick Start
Local embeddings are enabled by default. No configuration needed for basic usage:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_KNOWLEDGEBASE_PATH": "/path/to/pdfs"
      }
    }
  }
}
```
Supported Models
| Model | Size | Dimensions | Max Context | Best For |
|---|---|---|---|---|
| Qwen/Qwen3-Embedding-0.6B (default) | 1.2GB | 1024 | 32K tokens | Best overall - long docs, fast |
| Qwen/Qwen3-Embedding-4B | 8.0GB | 2560 | 32K tokens | Maximum quality, long context |
| intfloat/multilingual-e5-large-instruct | 0.8GB | 1024 | 512 tokens | Multilingual, instruction-following |
| BAAI/bge-m3 | 2.0GB | 1024 | 8K tokens | Multilingual, balanced |
| jinaai/jina-embeddings-v3 | 1.3GB | 1024 | 8K tokens | Task-specific retrieval |
Configure your preferred model:
```bash
PDFKB_LOCAL_EMBEDDING_MODEL="Qwen/Qwen3-Embedding-0.6B"  # Default
```
Hardware Optimization
The server automatically detects and uses the best available hardware:
- Apple Silicon (M1/M2/M3): Uses Metal Performance Shaders (MPS)
- NVIDIA GPUs: Uses CUDA acceleration
- CPU Fallback: Optimized for multi-core processing
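A minimal sketch of this kind of auto-detection using PyTorch's capability checks — the server's actual selection logic may differ:

```python
# Sketch of embedding-device auto-detection as described above.
# Uses PyTorch's standard capability checks; the server's logic may differ.
import torch

def pick_device() -> str:
    if torch.cuda.is_available():          # NVIDIA GPUs
        return "cuda"
    if torch.backends.mps.is_available():  # Apple Silicon (Metal)
        return "mps"
    return "cpu"                           # multi-core CPU fallback

print(f"Selected embedding device: {pick_device()}")
```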
Force a specific device if needed:
```bash
PDFKB_EMBEDDING_DEVICE="mps"   # Force Metal/MPS
PDFKB_EMBEDDING_DEVICE="cuda"  # Force CUDA
PDFKB_EMBEDDING_DEVICE="cpu"   # Force CPU
```
Configuration Options
```bash
# Embedding provider (local or openai)
PDFKB_EMBEDDING_PROVIDER="local"  # Default

# Model selection (choose from the supported models)
PDFKB_LOCAL_EMBEDDING_MODEL="Qwen/Qwen3-Embedding-0.6B"  # Default
# Other options:
# - "Qwen/Qwen3-Embedding-4B"                 (8GB, 2560 dims, best quality)
# - "intfloat/multilingual-e5-large-instruct" (0.8GB, multilingual)
# - "BAAI/bge-m3"                              (2GB, multilingual, 8K context)
# - "jinaai/jina-embeddings-v3"                (1.3GB, task-specific)

# Performance tuning
PDFKB_LOCAL_EMBEDDING_BATCH_SIZE=32  # Adjust based on memory
PDFKB_EMBEDDING_CACHE_SIZE=10000     # Number of cached embeddings
PDFKB_MAX_SEQUENCE_LENGTH=512        # Maximum text length

# Fallback options
PDFKB_FALLBACK_TO_OPENAI=false       # Use OpenAI if local fails
```
Switching to OpenAI
If you prefer OpenAI embeddings:
```json
{
  "env": {
    "PDFKB_EMBEDDING_PROVIDER": "openai",
    "PDFKB_OPENAI_API_KEY": "sk-proj-...",
    "PDFKB_EMBEDDING_MODEL": "text-embedding-3-large"
  }
}
```
Performance Tips
1. Batch Size: larger batches are faster but use more memory
   - Apple Silicon: 32-64 recommended
   - NVIDIA GPUs: 64-128 recommended
   - CPU: 16-32 recommended
2. Model Selection: choose based on your needs
   - Default (Qwen3-0.6B): best for most users - 32K context, fast, 1.2GB
   - Long documents: use Qwen3-4B for 32K context with higher quality
   - Multilingual: use bge-m3 or multilingual-e5-large-instruct
   - Specific tasks: use jina-embeddings-v3 with task parameters
3. Memory Management: the server automatically handles OOM errors by reducing batch size (see the sketch below)
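A hedged sketch of the OOM fallback mentioned in tip 3, halving the batch size until embedding succeeds — the function names are illustrative, not the server's internals:

```python
# Illustrative sketch: retry embedding with a smaller batch after OOM errors.
def embed_with_backoff(embed_fn, texts, batch_size=32, min_batch=1):
    while batch_size >= min_batch:
        try:
            vectors = []
            for i in range(0, len(texts), batch_size):
                vectors.extend(embed_fn(texts[i : i + batch_size]))
            return vectors
        except RuntimeError as err:  # PyTorch raises RuntimeError on CUDA/MPS OOM
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2  # retry the whole job with a smaller batch
    raise RuntimeError("Embedding failed even at the minimum batch size")
```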
🔍 Hybrid Search
The server now supports Hybrid Search, which combines the strengths of semantic similarity search (vector embeddings) with traditional keyword matching (BM25) for improved search quality.
How It Works
- Dual Indexing: Documents are indexed in both a vector database (ChromaDB) and a full-text search index (Whoosh)
- Parallel Search: Queries execute both semantic and keyword searches simultaneously
- Reciprocal Rank Fusion (RRF): results from both searches are merged using the RRF algorithm for optimal ranking (see the sketch below)
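A minimal sketch of weighted RRF under these assumptions — `k` corresponds to `PDFKB_RRF_K` and the weights to the hybrid weight settings below; the server's exact scoring may differ:

```python
# Sketch of weighted Reciprocal Rank Fusion over two ranked ID lists.
def rrf_fuse(vector_ranked, text_ranked, k=60, vector_weight=0.6, text_weight=0.4):
    scores = {}
    for weight, ranking in ((vector_weight, vector_ranked), (text_weight, text_ranked)):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) to a document's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "a" ranks 1st semantically and 2nd by keywords, so it wins overall.
print(rrf_fuse(["a", "b", "c"], ["b", "a", "d"]))  # ['a', 'b', 'c', 'd']
```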
Benefits
- Better Recall: Finds documents that match exact keywords even if semantically different
- Improved Precision: Combines conceptual understanding with keyword relevance
- Technical Terms: Excellent for technical documentation, code references, and domain-specific terminology
- Balanced Results: Configurable weights let you adjust the balance between semantic and keyword matching
Configuration
Enable hybrid search by setting:
```bash
PDFKB_ENABLE_HYBRID_SEARCH=true  # Enable hybrid search (default: true)
PDFKB_HYBRID_VECTOR_WEIGHT=0.6   # Weight for semantic search (default: 0.6)
PDFKB_HYBRID_TEXT_WEIGHT=0.4     # Weight for keyword search (default: 0.4)
PDFKB_RRF_K=60                   # RRF constant (default: 60)
```
Installation
To use hybrid search, install with the optional dependency:
```bash
pip install "pdfkb-mcp[hybrid]"
```
Or if using uvx, it's included by default when hybrid search is enabled.
🎯 Parser Selection Guide
Decision Tree
```
Document Type & Priority?
├── 🏃 Speed Priority → PyMuPDF4LLM (fastest processing, low memory)
├── 📚 Academic Papers → MinerU (GPU-accelerated, excellent formulas/tables)
├── 📊 Business Reports → Docling (accurate tables, structured output)
├── ⚖️ Balanced Quality → Marker (good multilingual, selective OCR)
└── 🎯 Maximum Accuracy → LLM (slow, API costs, complex layouts)
```
Performance Comparison
| Parser | Processing Speed | Memory | Text Quality | Table Quality | Best For |
|---|---|---|---|---|---|
| PyMuPDF4LLM | Fastest | Low | Good | Basic-Moderate | RAG pipelines, bulk ingestion |
| MinerU | Fast with GPU¹ | ~4GB VRAM² | Excellent | Excellent | Scientific/technical PDFs |
| Docling | 0.9-2.5 pages/s³ | 2.5-6GB⁴ | Excellent | Excellent | Structured documents, tables |
| Marker | ~25 p/s batch⁵ | ~4GB VRAM⁶ | Excellent | Good-Excellent⁷ | Scientific papers, multilingual |
| LLM | Slow⁸ | Variable⁹ | Excellent¹⁰ | Excellent | Complex layouts, high-value docs |
Notes:
¹ >10,000 tokens/s on RTX 4090 with sglang
² Reported for <1B parameter model
³ CPU benchmarks: 0.92-1.34 p/s (native), 1.57-2.45 p/s (pypdfium)
⁴ 2.42-2.56GB (pypdfium), 6.16-6.20GB (native backend)
⁵ Projected on H100 GPU in batch mode
⁶ Benchmark configuration on NVIDIA A6000
⁷ Enhanced with optional LLM mode for table merging
⁸ Order of magnitude slower than traditional parsers
⁹ Depends on token usage and model size
¹⁰ 98.7-100% accuracy when given clean text
⚙️ Configuration
Tier 1: Basic Configurations (80% of users)
Default (Recommended):
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "pymupdf4llm",
        "PDFKB_PDF_CHUNKER": "langchain",
        "PDFKB_EMBEDDING_MODEL": "text-embedding-3-large"
      },
      "transport": "stdio"
    }
  }
}
```
Speed Optimized:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "pymupdf4llm",
        "PDFKB_CHUNK_SIZE": "800"
      },
      "transport": "stdio"
    }
  }
}
```
Memory Efficient:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "pymupdf4llm",
        "PDFKB_EMBEDDING_BATCH_SIZE": "50"
      },
      "transport": "stdio"
    }
  }
}
```
Tier 2: Use Case Specific (15% of users)
Academic Papers:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "mineru",
        "PDFKB_CHUNK_SIZE": "1200"
      },
      "transport": "stdio"
    }
  }
}
```
Business Documents:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "docling",
        "PDFKB_DOCLING_TABLE_MODE": "ACCURATE",
        "PDFKB_DOCLING_DO_TABLE_STRUCTURE": "true"
      },
      "transport": "stdio"
    }
  }
}
```
Multi-language Documents:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "docling",
        "PDFKB_DOCLING_OCR_LANGUAGES": "en,fr,de,es",
        "PDFKB_DOCLING_DO_OCR": "true"
      },
      "transport": "stdio"
    }
  }
}
```
Hybrid Search (NEW - Improved Search Quality):
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_ENABLE_HYBRID_SEARCH": "true",
        "PDFKB_HYBRID_VECTOR_WEIGHT": "0.6",
        "PDFKB_HYBRID_TEXT_WEIGHT": "0.4"
      },
      "transport": "stdio"
    }
  }
}
```
Maximum Quality:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "llm",
        "PDFKB_LLM_MODEL": "anthropic/claude-3.5-sonnet",
        "PDFKB_EMBEDDING_MODEL": "text-embedding-3-large"
      },
      "transport": "stdio"
    }
  }
}
```
Essential Environment Variables
| Variable | Default | Description |
|---|---|---|
| `PDFKB_OPENAI_API_KEY` | required | OpenAI API key for embeddings |
| `PDFKB_KNOWLEDGEBASE_PATH` | `./pdfs` | Directory containing PDF files |
| `PDFKB_CACHE_DIR` | `./.cache` | Cache directory for processing |
| `PDFKB_PDF_PARSER` | `pymupdf4llm` | Parser: `pymupdf4llm` (default), `marker`, `mineru`, `docling`, `llm` |
| `PDFKB_PDF_CHUNKER` | `langchain` | Chunking strategy: `langchain` (default), `unstructured` |
| `PDFKB_CHUNK_SIZE` | `1000` | Target chunk size for the LangChain chunker |
| `PDFKB_ENABLE_WEB` | `true` | Enable/disable web interface |
| `PDFKB_WEB_PORT` | `8080` | Web server port |
| `PDFKB_WEB_HOST` | `localhost` | Web server host |
| `PDFKB_WEB_CORS_ORIGINS` | `http://localhost:3000,http://127.0.0.1:3000` | CORS allowed origins (comma-separated) |
| `PDFKB_EMBEDDING_MODEL` | `text-embedding-3-large` | OpenAI embedding model (use `text-embedding-3-small` for faster processing) |
| `PDFKB_ENABLE_HYBRID_SEARCH` | `true` | Enable hybrid search combining semantic and keyword matching |
| `PDFKB_HYBRID_VECTOR_WEIGHT` | `0.6` | Weight for semantic search (0-1, must sum to 1 with text weight) |
| `PDFKB_HYBRID_TEXT_WEIGHT` | `0.4` | Weight for keyword/BM25 search (0-1, must sum to 1 with vector weight) |
| `PDFKB_RRF_K` | `60` | Reciprocal Rank Fusion constant (higher = less emphasis on rank differences) |
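A tiny sketch of how a few of these variables might be read, using the defaults from the table — illustrative only; the real parsing lives in `src/pdfkb/config.py`:

```python
# Illustrative sketch: read PDFKB_* settings with the documented defaults.
import os

def env(name: str, default: str) -> str:
    return os.environ.get(name, default)

knowledgebase_path = env("PDFKB_KNOWLEDGEBASE_PATH", "./pdfs")
parser = env("PDFKB_PDF_PARSER", "pymupdf4llm")
vector_weight = float(env("PDFKB_HYBRID_VECTOR_WEIGHT", "0.6"))
text_weight = float(env("PDFKB_HYBRID_TEXT_WEIGHT", "0.4"))

# Per the table, the two hybrid weights must sum to 1.
assert abs(vector_weight + text_weight - 1.0) < 1e-9

print(knowledgebase_path, parser, vector_weight, text_weight)
```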
🖥️ MCP Client Setup
Claude Desktop
Configuration File Location:
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
- Linux: `~/.config/Claude/claude_desktop_config.json`
Configuration:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "PDFKB_CACHE_DIR": "/Users/yourname/Documents/PDFs/.cache",
        "PDFKB_EMBEDDING_MODEL": "text-embedding-3-small"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```
Verification:
- Restart Claude Desktop completely
- Look for PDF KB tools in the interface
- Test with "Add a document" or "Search documents"
VS Code with Native MCP Support
Configuration (`.vscode/mcp.json` in workspace):
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```
Verification:
- Reload VS Code window
- Check VS Code's MCP server status in Command Palette
- Use MCP tools in Copilot Chat
VS Code with Continue Extension
Configuration (`.continue/config.json`):
```json
{
  "models": [...],
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```
Verification:
- Reload VS Code window
- Check Continue panel for server connection
- Use `@pdfkb` in Continue chat
Generic MCP Client
Standard Configuration Template:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "required",
        "PDFKB_KNOWLEDGEBASE_PATH": "required-absolute-path",
        "PDFKB_PDF_PARSER": "optional-default-pymupdf4llm"
      },
      "transport": "stdio",
      "autoRestart": true,
      "timeout": 30000
    }
  }
}
```
📊 Performance & Troubleshooting
Common Issues
Server not appearing in MCP client:
```jsonc
// ❌ Wrong: missing transport
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"]
    }
  }
}

// ✅ Correct: include transport and restart the client
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "transport": "stdio"
    }
  }
}
```
Processing too slow:
```jsonc
// Switch to a faster parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_PDF_PARSER": "pymupdf4llm"
      },
      "transport": "stdio"
    }
  }
}
```
Memory issues:
```jsonc
// Reduce memory usage
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_EMBEDDING_BATCH_SIZE": "25",
        "PDFKB_CHUNK_SIZE": "500"
      },
      "transport": "stdio"
    }
  }
}
```
Poor table extraction:
```jsonc
// Use a table-optimized parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_PDF_PARSER": "docling",
        "PDFKB_DOCLING_TABLE_MODE": "ACCURATE"
      },
      "transport": "stdio"
    }
  }
}
```
Resource Requirements
| Configuration | RAM Usage | Processing Speed | Best For |
|---|---|---|---|
| Speed | 2-4 GB | Fastest | Large collections |
| Balanced | 4-6 GB | Medium | Most users |
| Quality | 6-12 GB | Medium-Fast | Accuracy priority |
| GPU | 8-16 GB | Very Fast | High-volume processing |
🔧 Advanced Configuration
Parser-Specific Options
MinerU Configuration:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_PDF_PARSER": "mineru",
        "PDFKB_MINERU_LANG": "en",
        "PDFKB_MINERU_METHOD": "auto",
        "PDFKB_MINERU_VRAM": "16"
      },
      "transport": "stdio"
    }
  }
}
```
LLM Parser Configuration:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "llm",
        "PDFKB_LLM_MODEL": "google/gemini-2.5-flash-lite",
        "PDFKB_LLM_CONCURRENCY": "5",
        "PDFKB_LLM_DPI": "150"
      },
      "transport": "stdio"
    }
  }
}
```
Performance Tuning
High-Performance Setup:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_PDF_PARSER": "mineru",
        "PDFKB_KNOWLEDGEBASE_PATH": "/Volumes/FastSSD/Documents/PDFs",
        "PDFKB_CACHE_DIR": "/Volumes/FastSSD/Documents/PDFs/.cache",
        "PDFKB_EMBEDDING_BATCH_SIZE": "200",
        "PDFKB_VECTOR_SEARCH_K": "15",
        "PDFKB_FILE_SCAN_INTERVAL": "30"
      },
      "transport": "stdio"
    }
  }
}
```
Intelligent Caching
The server uses multi-stage caching:
- Parsing Cache: stores converted markdown (`src/pdfkb/intelligent_cache.py:139`)
- Chunking Cache: stores processed chunks
- Vector Cache: ChromaDB embeddings storage
Cache Invalidation Rules:
- Changing `PDFKB_PDF_PARSER` → full reset (parsing + chunking + embeddings)
- Changing `PDFKB_PDF_CHUNKER` → partial reset (chunking + embeddings)
- Changing `PDFKB_EMBEDDING_MODEL` → minimal reset (embeddings only)
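One way to picture these staged rules is a fingerprint per pipeline stage, where each stage's key includes every upstream setting — an illustrative sketch, not the actual cache implementation:

```python
# Illustrative sketch: staged cache keys that mirror the invalidation rules above.
import hashlib

def fingerprint(*parts: str) -> str:
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]

parser, chunker, embed_model = "pymupdf4llm", "langchain", "text-embedding-3-large"

parse_key = fingerprint(parser)                        # parsing cache
chunk_key = fingerprint(parser, chunker)               # chunking cache
embed_key = fingerprint(parser, chunker, embed_model)  # embedding cache

# Changing the parser alters all three keys (full reset); changing only the
# embedding model alters just embed_key (minimal reset).
print(parse_key, chunk_key, embed_key)
```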
📚 Appendix
Installation Options
Primary (Recommended):
```bash
uvx pdfkb-mcp
```
**Web Interface Included**: All installation methods include the web interface. Use these commands:
- `pdfkb-mcp` - Integrated MCP + web server (default)
- `PDFKB_ENABLE_WEB=false pdfkb-mcp` - MCP server only (web disabled)

With Specific Parser Dependencies:
```bash
uvx pdfkb-mcp[marker]                # Marker parser
uvx pdfkb-mcp[mineru]                # MinerU parser
uvx pdfkb-mcp[docling]               # Docling parser
uvx pdfkb-mcp[llm]                   # LLM parser
uvx pdfkb-mcp[unstructured_chunker]  # Unstructured chunker
uvx pdfkb-mcp[web]                   # Enhanced web features (psutil for metrics)
```

Or via pip/pipx:
```bash
pip install "pdfkb-mcp[web]"              # Enhanced web features
pip install "pdfkb-mcp[marker]"           # Marker parser
pip install "pdfkb-mcp[docling-complete]" # Docling with OCR and full features
```
Development Installation:
```bash
git clone https://github.com/juanqui/pdfkb-mcp.git
cd pdfkb-mcp
pip install -e ".[dev]"
```
Complete Environment Variables Reference
| Variable | Default | Description |
|---|---|---|
| `PDFKB_OPENAI_API_KEY` | required | OpenAI API key for embeddings |
| `PDFKB_OPENROUTER_API_KEY` | optional | Required for LLM parser |
| `PDFKB_KNOWLEDGEBASE_PATH` | `./pdfs` | PDF directory path |
| `PDFKB_CACHE_DIR` | `./.cache` | Cache directory |
| `PDFKB_PDF_PARSER` | `pymupdf4llm` | PDF parser selection |
| `PDFKB_PDF_CHUNKER` | `langchain` | Chunking strategy |
| `PDFKB_CHUNK_SIZE` | `1000` | LangChain chunk size |
| `PDFKB_CHUNK_OVERLAP` | `200` | LangChain chunk overlap |
| `PDFKB_EMBEDDING_MODEL` | `text-embedding-3-large` | OpenAI model |
| `PDFKB_EMBEDDING_BATCH_SIZE` | `100` | Embedding batch size |
| `PDFKB_VECTOR_SEARCH_K` | `5` | Default search results |
| `PDFKB_FILE_SCAN_INTERVAL` | `60` | File monitoring interval |
| `PDFKB_LOG_LEVEL` | `INFO` | Logging level |
| `PDFKB_ENABLE_WEB` | `true` | Enable/disable web interface |
| `PDFKB_WEB_PORT` | `8080` | Web server port |
| `PDFKB_WEB_HOST` | `localhost` | Web server host |
| `PDFKB_WEB_CORS_ORIGINS` | `http://localhost:3000,http://127.0.0.1:3000` | CORS allowed origins (comma-separated) |
Parser Comparison Details
| Feature | PyMuPDF4LLM | Marker | MinerU | Docling | LLM |
|---|---|---|---|---|---|
| Speed | Fastest | Medium | Fast (GPU) | Medium | Slowest |
| Memory | Lowest | Medium | High | Medium | Lowest |
| Tables | Basic | Good | Excellent | Excellent | Excellent |
| Formulas | Basic | Good | Excellent | Good | Excellent |
| Images | Basic | Good | Good | Excellent | Excellent |
| Setup | Simple | Simple | Moderate | Simple | Simple |
| Cost | Free | Free | Free | Free | API costs |
Chunking Strategies
LangChain (`PDFKB_PDF_CHUNKER=langchain`):
- Header-aware splitting with `MarkdownHeaderTextSplitter`
- Configurable via `PDFKB_CHUNK_SIZE` and `PDFKB_CHUNK_OVERLAP`
- Best for customizable chunking (see the sketch after these lists)
- Default; installed with the base package

Unstructured (`PDFKB_PDF_CHUNKER=unstructured`):
- Intelligent semantic chunking with the `unstructured` library
- Zero configuration required
- Install extra: `pip install "pdfkb-mcp[unstructured_chunker]"` to enable
- Best for document structure awareness
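As referenced above, a hedged sketch of the LangChain strategy: header-aware splitting with the real `MarkdownHeaderTextSplitter` API, followed by a size-bounded pass mirroring `PDFKB_CHUNK_SIZE`/`PDFKB_CHUNK_OVERLAP`. How the server wires these together internally is not shown here.

```python
# Sketch: header-aware markdown chunking with LangChain text splitters.
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

markdown = "# Title\n\nIntro text.\n\n## Section\n\nBody text..."

# First pass: split on headers so each chunk carries its section context.
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")]
)
sections = header_splitter.split_text(markdown)  # one Document per header block

# Second pass: enforce size/overlap limits (the documented defaults).
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = size_splitter.split_documents(sections)

for chunk in chunks:
    print(chunk.metadata, chunk.page_content[:60])
```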
First-run notes
- On the first run, the server initializes the caches and vector store and logs the selected components:
- Parser: PyMuPDF4LLM (default)
- Chunker: LangChain (default)
- Embedding Model: text-embedding-3-large (default)
- If you select a parser/chunker that isn’t installed, the server logs a warning with the exact install command and falls back to the default components instead of exiting.
Troubleshooting Guide
API Key Issues:
- Verify the key format starts with `sk-`
- Check that your account has sufficient credits
- Test connectivity: `curl -H "Authorization: Bearer $PDFKB_OPENAI_API_KEY" https://api.openai.com/v1/models`
Parser Installation Issues:
- MinerU: `pip install mineru[all]` and verify `mineru --version`
- Docling: `pip install docling` for basic features, `pip install "pdfkb-mcp[docling-complete]"` for all features
- LLM: requires the `PDFKB_OPENROUTER_API_KEY` environment variable
Performance Optimization:
- Speed: use the `pymupdf4llm` parser (fastest, low memory footprint)
- Memory: reduce `PDFKB_EMBEDDING_BATCH_SIZE` and `PDFKB_CHUNK_SIZE`; use the pypdfium backend for Docling
- Quality: use `mineru` with GPU (>10K tokens/s on RTX 4090) or `marker` for balanced quality
- Tables: use `docling` with `PDFKB_DOCLING_TABLE_MODE=ACCURATE` or `marker` with LLM mode
- Batch Processing: use `marker` on H100 (~25 pages/s) or `mineru` with sglang acceleration
For additional support, see the implementation details in `src/pdfkb/main.py` and `src/pdfkb/config.py`.