PDF Knowledgebase MCP Server
A Model Context Protocol (MCP) server that enables intelligent document search and retrieval from PDF collections. Built for seamless integration with Claude Desktop, Continue, Cline, and other MCP clients, this server provides advanced search capabilities powered by local or OpenAI embeddings and ChromaDB vector storage.
🆕 New Features:
- Local Embeddings: Run embeddings locally with HuggingFace models - no API costs, full privacy
- Hybrid Search: Combines semantic similarity with keyword matching (BM25) for superior search quality
- Web Interface: Modern web UI for document management and search alongside the traditional MCP protocol
Table of Contents
- 🚀 Quick Start
- 🌐 Web Interface
- 🏗️ Architecture Overview
- 🤖 Local Embeddings
- 🔍 Hybrid Search
- 🎯 Parser Selection Guide
- ⚙️ Configuration
- 🖥️ MCP Client Setup
- 📊 Performance & Troubleshooting
- 🔧 Advanced Configuration
- 📚 Appendix
🚀 Quick Start
Step 1: Configure Your MCP Client
Option A: Local Embeddings w/ Hybrid Search (No API Key Required)
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp[hybrid]"],
      "env": {
        "PDFKB_KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "PDFKB_ENABLE_HYBRID_SEARCH": "true"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```
Option B: OpenAI Embeddings w/ Hybrid Search
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp[hybrid]"],
      "env": {
        "PDFKB_EMBEDDING_PROVIDER": "openai",
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "PDFKB_ENABLE_HYBRID_SEARCH": "true"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```
Step 2: Verify Installation
- Restart your MCP client completely
- Check for PDF KB tools: look for `add_document`, `search_documents`, `list_documents`, `remove_document`
- Test functionality: try adding a PDF and searching for content
🌐 Web Interface
The PDF Knowledgebase includes a modern web interface for easy document management and search. The web interface is enabled by default.
Server Modes
1. Integrated Mode (Default - Both MCP + Web):
```bash
pdfkb-mcp
```
- Runs both the MCP server AND the web interface concurrently
- Web interface available at http://localhost:8080
- Best of both worlds: API integration + web UI

2. MCP Only Mode (Disable Web Interface):
```bash
PDFKB_ENABLE_WEB=false pdfkb-mcp
```
- Runs only the MCP server for integration with Claude Desktop, VS Code, etc.
- Most resource-efficient option
- Uses the same document storage as the web interface
Web Interface Features
- 📄 Document Upload: Drag & drop PDF files or upload via file picker
- 🔍 Semantic Search: Powerful vector-based search with real-time results
- 📊 Document Management: List, preview, and manage your PDF collection
- 📈 Real-time Status: Live processing updates via WebSocket connections
- 🎯 Chunk Explorer: View and navigate document chunks for detailed analysis
- ⚙️ System Metrics: Monitor server performance and resource usage
Quick Web Setup
1. Install and run:
   ```bash
   uvx pdfkb-mcp                    # Install if needed
   PDFKB_ENABLE_WEB=true pdfkb-mcp  # Start integrated server
   ```
2. Open your browser: http://localhost:8080
3. Configure the environment (create a `.env` file):
   ```bash
   PDFKB_OPENAI_API_KEY=sk-proj-abc123def456ghi789...
   PDFKB_KNOWLEDGEBASE_PATH=/path/to/your/pdfs
   PDFKB_WEB_PORT=8080
   PDFKB_WEB_HOST=localhost
   PDFKB_ENABLE_WEB=true
   ```
Web Configuration Options
| Environment Variable | Default | Description |
|---|---|---|
| `PDFKB_ENABLE_WEB` | `true` | Enable/disable web interface |
| `PDFKB_WEB_PORT` | `8080` | Web server port |
| `PDFKB_WEB_HOST` | `localhost` | Web server host |
| `PDFKB_WEB_CORS_ORIGINS` | `http://localhost:3000,http://127.0.0.1:3000` | CORS allowed origins |
Command Line Options
The server supports command line arguments:
```bash
# Customize web server port (web interface enabled by default)
pdfkb-mcp --port 9000

# Use a custom configuration file
pdfkb-mcp --config myconfig.env

# Change the log level
pdfkb-mcp --log-level DEBUG

# Enable the web interface via command line
pdfkb-mcp --enable-web
```
API Documentation
When running with the web interface enabled, comprehensive API documentation is available at:
- Swagger UI: http://localhost:8080/docs
- ReDoc: http://localhost:8080/redoc
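The `/docs` and `/redoc` pages suggest a FastAPI-style server, which typically also serves a machine-readable schema at `/openapi.json` — an assumption here, not a documented endpoint. A minimal Python sketch for listing the REST endpoints of a running server:

```python
# Hypothetical sketch: list REST endpoints from a running pdfkb-mcp web server.
# Assumes a FastAPI-style /openapi.json schema endpoint (not documented above).
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # default PDFKB_WEB_HOST:PDFKB_WEB_PORT

with urllib.request.urlopen(f"{BASE_URL}/openapi.json") as resp:
    schema = json.load(resp)

# Print each path together with its HTTP methods.
for path, methods in schema.get("paths", {}).items():
    print(path, sorted(method.upper() for method in methods))
```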
🏗️ Architecture Overview
MCP Integration
```mermaid
graph TB
    subgraph "MCP Clients"
        C1[Claude Desktop]
        C2[VS Code/Continue]
        C3[Other MCP Clients]
    end

    subgraph "MCP Protocol Layer"
        MCP[Model Context Protocol<br/>Standard Layer]
    end

    subgraph "MCP Servers"
        PDFKB[PDF KB Server<br/>This Server]
        S1[Other MCP<br/>Server]
        S2[Other MCP<br/>Server]
    end

    C1 --> MCP
    C2 --> MCP
    C3 --> MCP
    MCP --> PDFKB
    MCP --> S1
    MCP --> S2

    classDef client fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef protocol fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef server fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef highlight fill:#c8e6c9,stroke:#1b5e20,stroke-width:3px

    class C1,C2,C3 client
    class MCP protocol
    class S1,S2 server
    class PDFKB highlight
```
Internal Architecture
```mermaid
graph LR
    subgraph "Input Layer"
        PDF[PDF Files]
        WEB[Web Interface<br/>Port 8080]
        MCP_IN[MCP Protocol]
    end

    subgraph "Processing Pipeline"
        PARSER[PDF Parser<br/>PyMuPDF/Marker/MinerU]
        CHUNKER[Text Chunker<br/>LangChain/Unstructured]
        EMBED[Embedding Service<br/>Local/OpenAI]
    end

    subgraph "Storage Layer"
        CACHE[Intelligent Cache<br/>Multi-stage]
        VECTOR[Vector Store<br/>ChromaDB]
        TEXT[Text Index<br/>Whoosh BM25]
    end

    subgraph "Search Engine"
        HYBRID[Hybrid Search<br/>RRF Fusion]
    end

    PDF --> PARSER
    WEB --> PARSER
    MCP_IN --> PARSER
    PARSER --> CHUNKER
    CHUNKER --> EMBED
    EMBED --> CACHE
    CACHE --> VECTOR
    CACHE --> TEXT
    VECTOR --> HYBRID
    TEXT --> HYBRID
    HYBRID --> WEB
    HYBRID --> MCP_IN

    classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    classDef process fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    classDef storage fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef search fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px

    class PDF,WEB,MCP_IN input
    class PARSER,CHUNKER,EMBED process
    class CACHE,VECTOR,TEXT storage
    class HYBRID search
```
Available Tools & Resources
Tools (Actions your client can perform):
- `add_document(path, metadata?)` - Add a PDF to the knowledgebase
- `search_documents(query, limit=5, metadata_filter?, search_type?)` - Hybrid search across PDFs (semantic + keyword matching)
- `list_documents(metadata_filter?)` - List all documents with metadata
- `remove_document(document_id)` - Remove a document from the knowledgebase
Resources (Data your client can access):
- `pdf://{document_id}` - Full document content as JSON
- `pdf://{document_id}/page/{page_number}` - Specific page content
- `pdf://list` - List of all documents with metadata
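For programmatic use outside a chat client, these tools can be exercised with the official MCP Python SDK. A minimal sketch — the query text and env values are illustrative; the tool names match the list above:

```python
# Minimal sketch: call pdfkb-mcp tools over stdio using the MCP Python SDK.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(
    command="uvx",
    args=["pdfkb-mcp"],
    env={"PDFKB_KNOWLEDGEBASE_PATH": "/path/to/pdfs"},  # illustrative path
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("tools:", [tool.name for tool in tools.tools])
            # Hybrid search across the knowledgebase.
            result = await session.call_tool(
                "search_documents", {"query": "vector databases", "limit": 3}
            )
            print(result.content)

asyncio.run(main())
```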
🤖 Local Embeddings
The server now supports local embeddings as the default option, eliminating API costs and keeping your data completely private. Local embeddings run on your machine using HuggingFace models optimized for performance.
Features
- Zero API Costs: No OpenAI API charges for embeddings
- Complete Privacy: Your documents never leave your machine
- Hardware Acceleration: Automatic detection and use of Metal (macOS), CUDA (NVIDIA), or CPU
- Smart Caching: LRU cache for frequently embedded texts
- Multiple Model Sizes: Choose based on your hardware capabilities
Quick Start
Local embeddings are enabled by default. No configuration needed for basic usage:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_KNOWLEDGEBASE_PATH": "/path/to/pdfs"
      }
    }
  }
}
```
Supported Models
| Model | Size | Dimensions | Max Context | Best For |
|---|---|---|---|---|
| Qwen/Qwen3-Embedding-0.6B (default) | 1.2GB | 1024 | 32K tokens | Best overall - long docs, fast |
| Qwen/Qwen3-Embedding-4B | 8.0GB | 2560 | 32K tokens | Maximum quality, long context |
| intfloat/multilingual-e5-large-instruct | 0.8GB | 1024 | 512 tokens | Multilingual, instruction-following |
| BAAI/bge-m3 | 2.0GB | 1024 | 8K tokens | Multilingual, balanced |
| jinaai/jina-embeddings-v3 | 1.3GB | 1024 | 8K tokens | Task-specific retrieval |
Configure your preferred model:
```bash
PDFKB_LOCAL_EMBEDDING_MODEL="Qwen/Qwen3-Embedding-0.6B"  # Default
```
Hardware Optimization
The server automatically detects and uses the best available hardware:
- Apple Silicon (M1/M2/M3): Uses Metal Performance Shaders (MPS)
- NVIDIA GPUs: Uses CUDA acceleration
- CPU Fallback: Optimized for multi-core processing
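A minimal sketch of this kind of auto-detection using PyTorch's capability checks — the server's actual selection logic may differ:

```python
# Sketch of embedding-device auto-detection as described above.
# Uses PyTorch's standard capability checks; the server's logic may differ.
import torch

def pick_device() -> str:
    if torch.cuda.is_available():          # NVIDIA GPUs
        return "cuda"
    if torch.backends.mps.is_available():  # Apple Silicon (Metal)
        return "mps"
    return "cpu"                           # multi-core CPU fallback

print(f"Selected embedding device: {pick_device()}")
```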
Force a specific device if needed:
```bash
PDFKB_EMBEDDING_DEVICE="mps"   # Force Metal/MPS
PDFKB_EMBEDDING_DEVICE="cuda"  # Force CUDA
PDFKB_EMBEDDING_DEVICE="cpu"   # Force CPU
```
Configuration Options
```bash
# Embedding provider (local or openai)
PDFKB_EMBEDDING_PROVIDER="local"  # Default

# Model selection (choose from the supported models)
PDFKB_LOCAL_EMBEDDING_MODEL="Qwen/Qwen3-Embedding-0.6B"  # Default
# Other options:
# - "Qwen/Qwen3-Embedding-4B"                 (8GB, 2560 dims, best quality)
# - "intfloat/multilingual-e5-large-instruct" (0.8GB, multilingual)
# - "BAAI/bge-m3"                              (2GB, multilingual, 8K context)
# - "jinaai/jina-embeddings-v3"                (1.3GB, task-specific)

# Performance tuning
PDFKB_LOCAL_EMBEDDING_BATCH_SIZE=32  # Adjust based on memory
PDFKB_EMBEDDING_CACHE_SIZE=10000     # Number of cached embeddings
PDFKB_MAX_SEQUENCE_LENGTH=512        # Maximum text length

# Fallback options
PDFKB_FALLBACK_TO_OPENAI=false       # Use OpenAI if local fails
```
Switching to OpenAI
If you prefer OpenAI embeddings:
```json
{
  "env": {
    "PDFKB_EMBEDDING_PROVIDER": "openai",
    "PDFKB_OPENAI_API_KEY": "sk-proj-...",
    "PDFKB_EMBEDDING_MODEL": "text-embedding-3-large"
  }
}
```
Performance Tips
1. Batch Size: larger batches are faster but use more memory
   - Apple Silicon: 32-64 recommended
   - NVIDIA GPUs: 64-128 recommended
   - CPU: 16-32 recommended
2. Model Selection: choose based on your needs
   - Default (Qwen3-0.6B): best for most users - 32K context, fast, 1.2GB
   - Long documents: use Qwen3-4B for 32K context with higher quality
   - Multilingual: use bge-m3 or multilingual-e5-large-instruct
   - Specific tasks: use jina-embeddings-v3 with task parameters
3. Memory Management: the server automatically handles OOM errors by reducing batch size (see the sketch below)
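A hedged sketch of the OOM fallback mentioned in tip 3, halving the batch size until embedding succeeds — the function names are illustrative, not the server's internals:

```python
# Illustrative sketch: retry embedding with a smaller batch after OOM errors.
def embed_with_backoff(embed_fn, texts, batch_size=32, min_batch=1):
    while batch_size >= min_batch:
        try:
            vectors = []
            for i in range(0, len(texts), batch_size):
                vectors.extend(embed_fn(texts[i : i + batch_size]))
            return vectors
        except RuntimeError as err:  # PyTorch raises RuntimeError on CUDA/MPS OOM
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2  # retry the whole job with a smaller batch
    raise RuntimeError("Embedding failed even at the minimum batch size")
```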
🔍 Hybrid Search
The server now supports Hybrid Search, which combines the strengths of semantic similarity search (vector embeddings) with traditional keyword matching (BM25) for improved search quality.
How It Works
- Dual Indexing: Documents are indexed in both a vector database (ChromaDB) and a full-text search index (Whoosh)
- Parallel Search: Queries execute both semantic and keyword searches simultaneously
- Reciprocal Rank Fusion (RRF): results from both searches are merged using the RRF algorithm for optimal ranking (see the sketch below)
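A minimal sketch of weighted RRF under these assumptions — `k` corresponds to `PDFKB_RRF_K` and the weights to the hybrid weight settings below; the server's exact scoring may differ:

```python
# Sketch of weighted Reciprocal Rank Fusion over two ranked ID lists.
def rrf_fuse(vector_ranked, text_ranked, k=60, vector_weight=0.6, text_weight=0.4):
    scores = {}
    for weight, ranking in ((vector_weight, vector_ranked), (text_weight, text_ranked)):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) to a document's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "a" ranks 1st semantically and 2nd by keywords, so it wins overall.
print(rrf_fuse(["a", "b", "c"], ["b", "a", "d"]))  # ['a', 'b', 'c', 'd']
```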
Benefits
- Better Recall: Finds documents that match exact keywords even if semantically different
- Improved Precision: Combines conceptual understanding with keyword relevance
- Technical Terms: Excellent for technical documentation, code references, and domain-specific terminology
- Balanced Results: Configurable weights let you adjust the balance between semantic and keyword matching
Configuration
Enable hybrid search by setting:
```bash
PDFKB_ENABLE_HYBRID_SEARCH=true  # Enable hybrid search (default: true)
PDFKB_HYBRID_VECTOR_WEIGHT=0.6   # Weight for semantic search (default: 0.6)
PDFKB_HYBRID_TEXT_WEIGHT=0.4     # Weight for keyword search (default: 0.4)
PDFKB_RRF_K=60                   # RRF constant (default: 60)
```
Installation
To use hybrid search, install with the optional dependency:
```bash
pip install "pdfkb-mcp[hybrid]"
```
Or if using uvx, it's included by default when hybrid search is enabled.
🎯 Parser Selection Guide
Decision Tree
```
Document Type & Priority?
├── 🏃 Speed Priority → PyMuPDF4LLM (fastest processing, low memory)
├── 📚 Academic Papers → MinerU (GPU-accelerated, excellent formulas/tables)
├── 📊 Business Reports → Docling (accurate tables, structured output)
├── ⚖️ Balanced Quality → Marker (good multilingual, selective OCR)
└── 🎯 Maximum Accuracy → LLM (slow, API costs, complex layouts)
```
Performance Comparison
| Parser | Processing Speed | Memory | Text Quality | Table Quality | Best For |
|---|---|---|---|---|---|
| PyMuPDF4LLM | Fastest | Low | Good | Basic-Moderate | RAG pipelines, bulk ingestion |
| MinerU | Fast with GPU¹ | ~4GB VRAM² | Excellent | Excellent | Scientific/technical PDFs |
| Docling | 0.9-2.5 pages/s³ | 2.5-6GB⁴ | Excellent | Excellent | Structured documents, tables |
| Marker | ~25 p/s batch⁵ | ~4GB VRAM⁶ | Excellent | Good-Excellent⁷ | Scientific papers, multilingual |
| LLM | Slow⁸ | Variable⁹ | Excellent¹⁰ | Excellent | Complex layouts, high-value docs |
Notes:
¹ >10,000 tokens/s on RTX 4090 with sglang
² Reported for <1B parameter model
³ CPU benchmarks: 0.92-1.34 p/s (native), 1.57-2.45 p/s (pypdfium)
⁴ 2.42-2.56GB (pypdfium), 6.16-6.20GB (native backend)
⁵ Projected on H100 GPU in batch mode
⁶ Benchmark configuration on NVIDIA A6000
⁷ Enhanced with optional LLM mode for table merging
⁸ Order of magnitude slower than traditional parsers
⁹ Depends on token usage and model size
¹⁰ 98.7-100% accuracy when given clean text
⚙️ Configuration
Tier 1: Basic Configurations (80% of users)
Default (Recommended):
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "pymupdf4llm",
        "PDFKB_PDF_CHUNKER": "langchain",
        "PDFKB_EMBEDDING_MODEL": "text-embedding-3-large"
      },
      "transport": "stdio"
    }
  }
}
```
Speed Optimized:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "pymupdf4llm",
        "PDFKB_CHUNK_SIZE": "800"
      },
      "transport": "stdio"
    }
  }
}
```
Memory Efficient:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "pymupdf4llm",
        "PDFKB_EMBEDDING_BATCH_SIZE": "50"
      },
      "transport": "stdio"
    }
  }
}
```
Tier 2: Use Case Specific (15% of users)
Academic Papers:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "mineru",
        "PDFKB_CHUNK_SIZE": "1200"
      },
      "transport": "stdio"
    }
  }
}
```
Business Documents:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "docling",
        "PDFKB_DOCLING_TABLE_MODE": "ACCURATE",
        "PDFKB_DOCLING_DO_TABLE_STRUCTURE": "true"
      },
      "transport": "stdio"
    }
  }
}
```
Multi-language Documents:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "docling",
        "PDFKB_DOCLING_OCR_LANGUAGES": "en,fr,de,es",
        "PDFKB_DOCLING_DO_OCR": "true"
      },
      "transport": "stdio"
    }
  }
}
```
Hybrid Search (NEW - Improved Search Quality):
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_ENABLE_HYBRID_SEARCH": "true",
        "PDFKB_HYBRID_VECTOR_WEIGHT": "0.6",
        "PDFKB_HYBRID_TEXT_WEIGHT": "0.4"
      },
      "transport": "stdio"
    }
  }
}
```
Maximum Quality:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "llm",
        "PDFKB_LLM_MODEL": "anthropic/claude-3.5-sonnet",
        "PDFKB_EMBEDDING_MODEL": "text-embedding-3-large"
      },
      "transport": "stdio"
    }
  }
}
```
Essential Environment Variables
| Variable | Default | Description |
|---|---|---|
| `PDFKB_OPENAI_API_KEY` | required | OpenAI API key for embeddings |
| `PDFKB_KNOWLEDGEBASE_PATH` | `./pdfs` | Directory containing PDF files |
| `PDFKB_CACHE_DIR` | `./.cache` | Cache directory for processing |
| `PDFKB_PDF_PARSER` | `pymupdf4llm` | Parser: `pymupdf4llm` (default), `marker`, `mineru`, `docling`, `llm` |
| `PDFKB_PDF_CHUNKER` | `langchain` | Chunking strategy: `langchain` (default), `unstructured` |
| `PDFKB_CHUNK_SIZE` | `1000` | Target chunk size for the LangChain chunker |
| `PDFKB_ENABLE_WEB` | `true` | Enable/disable web interface |
| `PDFKB_WEB_PORT` | `8080` | Web server port |
| `PDFKB_WEB_HOST` | `localhost` | Web server host |
| `PDFKB_WEB_CORS_ORIGINS` | `http://localhost:3000,http://127.0.0.1:3000` | CORS allowed origins (comma-separated) |
| `PDFKB_EMBEDDING_MODEL` | `text-embedding-3-large` | OpenAI embedding model (use `text-embedding-3-small` for faster processing) |
| `PDFKB_ENABLE_HYBRID_SEARCH` | `true` | Enable hybrid search combining semantic and keyword matching |
| `PDFKB_HYBRID_VECTOR_WEIGHT` | `0.6` | Weight for semantic search (0-1, must sum to 1 with text weight) |
| `PDFKB_HYBRID_TEXT_WEIGHT` | `0.4` | Weight for keyword/BM25 search (0-1, must sum to 1 with vector weight) |
| `PDFKB_RRF_K` | `60` | Reciprocal Rank Fusion constant (higher = less emphasis on rank differences) |
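A tiny sketch of how a few of these variables might be read, using the defaults from the table — illustrative only; the real parsing lives in `src/pdfkb/config.py`:

```python
# Illustrative sketch: read PDFKB_* settings with the documented defaults.
import os

def env(name: str, default: str) -> str:
    return os.environ.get(name, default)

knowledgebase_path = env("PDFKB_KNOWLEDGEBASE_PATH", "./pdfs")
parser = env("PDFKB_PDF_PARSER", "pymupdf4llm")
vector_weight = float(env("PDFKB_HYBRID_VECTOR_WEIGHT", "0.6"))
text_weight = float(env("PDFKB_HYBRID_TEXT_WEIGHT", "0.4"))

# Per the table, the two hybrid weights must sum to 1.
assert abs(vector_weight + text_weight - 1.0) < 1e-9

print(knowledgebase_path, parser, vector_weight, text_weight)
```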
🖥️ MCP Client Setup
Claude Desktop
Configuration File Location:
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
- Linux: `~/.config/Claude/claude_desktop_config.json`
Configuration:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "PDFKB_CACHE_DIR": "/Users/yourname/Documents/PDFs/.cache",
        "PDFKB_EMBEDDING_MODEL": "text-embedding-3-small"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```
Verification:
- Restart Claude Desktop completely
- Look for PDF KB tools in the interface
- Test with "Add a document" or "Search documents"
VS Code with Native MCP Support
Configuration (`.vscode/mcp.json` in workspace):
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```
Verification:
- Reload VS Code window
- Check VS Code's MCP server status in Command Palette
- Use MCP tools in Copilot Chat
VS Code with Continue Extension
Configuration (`.continue/config.json`):
```json
{
  "models": [...],
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDFKB_KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```
Verification:
- Reload VS Code window
- Check Continue panel for server connection
- Use `@pdfkb` in Continue chat
Generic MCP Client
Standard Configuration Template:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "required",
        "PDFKB_KNOWLEDGEBASE_PATH": "required-absolute-path",
        "PDFKB_PDF_PARSER": "optional-default-pymupdf4llm"
      },
      "transport": "stdio",
      "autoRestart": true,
      "timeout": 30000
    }
  }
}
```
📊 Performance & Troubleshooting
Common Issues
Server not appearing in MCP client:
```jsonc
// ❌ Wrong: missing transport
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"]
    }
  }
}

// ✅ Correct: include transport and restart the client
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "transport": "stdio"
    }
  }
}
```
Processing too slow:
```jsonc
// Switch to a faster parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_PDF_PARSER": "pymupdf4llm"
      },
      "transport": "stdio"
    }
  }
}
```
Memory issues:
```jsonc
// Reduce memory usage
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_EMBEDDING_BATCH_SIZE": "25",
        "PDFKB_CHUNK_SIZE": "500"
      },
      "transport": "stdio"
    }
  }
}
```
Poor table extraction:
```jsonc
// Use a table-optimized parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_PDF_PARSER": "docling",
        "PDFKB_DOCLING_TABLE_MODE": "ACCURATE"
      },
      "transport": "stdio"
    }
  }
}
```
Resource Requirements
| Configuration | RAM Usage | Processing Speed | Best For |
|---|---|---|---|
| Speed | 2-4 GB | Fastest | Large collections |
| Balanced | 4-6 GB | Medium | Most users |
| Quality | 6-12 GB | Medium-Fast | Accuracy priority |
| GPU | 8-16 GB | Very Fast | High-volume processing |
🔧 Advanced Configuration
Parser-Specific Options
MinerU Configuration:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_PDF_PARSER": "mineru",
        "PDFKB_MINERU_LANG": "en",
        "PDFKB_MINERU_METHOD": "auto",
        "PDFKB_MINERU_VRAM": "16"
      },
      "transport": "stdio"
    }
  }
}
```
LLM Parser Configuration:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDFKB_PDF_PARSER": "llm",
        "PDFKB_LLM_MODEL": "google/gemini-2.5-flash-lite",
        "PDFKB_LLM_CONCURRENCY": "5",
        "PDFKB_LLM_DPI": "150"
      },
      "transport": "stdio"
    }
  }
}
```
Performance Tuning
High-Performance Setup:
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_OPENAI_API_KEY": "sk-key",
        "PDFKB_PDF_PARSER": "mineru",
        "PDFKB_KNOWLEDGEBASE_PATH": "/Volumes/FastSSD/Documents/PDFs",
        "PDFKB_CACHE_DIR": "/Volumes/FastSSD/Documents/PDFs/.cache",
        "PDFKB_EMBEDDING_BATCH_SIZE": "200",
        "PDFKB_VECTOR_SEARCH_K": "15",
        "PDFKB_FILE_SCAN_INTERVAL": "30"
      },
      "transport": "stdio"
    }
  }
}
```
Intelligent Caching
The server uses multi-stage caching:
- Parsing Cache: stores converted markdown (`src/pdfkb/intelligent_cache.py:139`)
- Chunking Cache: stores processed chunks
- Vector Cache: ChromaDB embeddings storage
Cache Invalidation Rules:
- Changing `PDFKB_PDF_PARSER` → full reset (parsing + chunking + embeddings)
- Changing `PDFKB_PDF_CHUNKER` → partial reset (chunking + embeddings)
- Changing `PDFKB_EMBEDDING_MODEL` → minimal reset (embeddings only)
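One way to picture these staged rules is a fingerprint per pipeline stage, where each stage's key includes every upstream setting — an illustrative sketch, not the actual cache implementation:

```python
# Illustrative sketch: staged cache keys that mirror the invalidation rules above.
import hashlib

def fingerprint(*parts: str) -> str:
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]

parser, chunker, embed_model = "pymupdf4llm", "langchain", "text-embedding-3-large"

parse_key = fingerprint(parser)                        # parsing cache
chunk_key = fingerprint(parser, chunker)               # chunking cache
embed_key = fingerprint(parser, chunker, embed_model)  # embedding cache

# Changing the parser alters all three keys (full reset); changing only the
# embedding model alters just embed_key (minimal reset).
print(parse_key, chunk_key, embed_key)
```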
📚 Appendix
Installation Options
Primary (Recommended):
```bash
uvx pdfkb-mcp
```
**Web Interface Included**: All installation methods include the web interface. Use these commands:
- `pdfkb-mcp` - Integrated MCP + web server (default)
- `PDFKB_ENABLE_WEB=false pdfkb-mcp` - MCP server only (web disabled)

With Specific Parser Dependencies:
```bash
uvx pdfkb-mcp[marker]                # Marker parser
uvx pdfkb-mcp[mineru]                # MinerU parser
uvx pdfkb-mcp[docling]               # Docling parser
uvx pdfkb-mcp[llm]                   # LLM parser
uvx pdfkb-mcp[unstructured_chunker]  # Unstructured chunker
uvx pdfkb-mcp[web]                   # Enhanced web features (psutil for metrics)
```

Or via pip/pipx:
```bash
pip install "pdfkb-mcp[web]"              # Enhanced web features
pip install "pdfkb-mcp[marker]"           # Marker parser
pip install "pdfkb-mcp[docling-complete]" # Docling with OCR and full features
```
Development Installation:
```bash
git clone https://github.com/juanqui/pdfkb-mcp.git
cd pdfkb-mcp
pip install -e ".[dev]"
```
Complete Environment Variables Reference
| Variable | Default | Description |
|---|---|---|
| `PDFKB_OPENAI_API_KEY` | required | OpenAI API key for embeddings |
| `PDFKB_OPENROUTER_API_KEY` | optional | Required for LLM parser |
| `PDFKB_KNOWLEDGEBASE_PATH` | `./pdfs` | PDF directory path |
| `PDFKB_CACHE_DIR` | `./.cache` | Cache directory |
| `PDFKB_PDF_PARSER` | `pymupdf4llm` | PDF parser selection |
| `PDFKB_PDF_CHUNKER` | `langchain` | Chunking strategy |
| `PDFKB_CHUNK_SIZE` | `1000` | LangChain chunk size |
| `PDFKB_CHUNK_OVERLAP` | `200` | LangChain chunk overlap |
| `PDFKB_EMBEDDING_MODEL` | `text-embedding-3-large` | OpenAI model |
| `PDFKB_EMBEDDING_BATCH_SIZE` | `100` | Embedding batch size |
| `PDFKB_VECTOR_SEARCH_K` | `5` | Default search results |
| `PDFKB_FILE_SCAN_INTERVAL` | `60` | File monitoring interval |
| `PDFKB_LOG_LEVEL` | `INFO` | Logging level |
| `PDFKB_ENABLE_WEB` | `true` | Enable/disable web interface |
| `PDFKB_WEB_PORT` | `8080` | Web server port |
| `PDFKB_WEB_HOST` | `localhost` | Web server host |
| `PDFKB_WEB_CORS_ORIGINS` | `http://localhost:3000,http://127.0.0.1:3000` | CORS allowed origins (comma-separated) |
Parser Comparison Details
| Feature | PyMuPDF4LLM | Marker | MinerU | Docling | LLM |
|---|---|---|---|---|---|
| Speed | Fastest | Medium | Fast (GPU) | Medium | Slowest |
| Memory | Lowest | Medium | High | Medium | Lowest |
| Tables | Basic | Good | Excellent | Excellent | Excellent |
| Formulas | Basic | Good | Excellent | Good | Excellent |
| Images | Basic | Good | Good | Excellent | Excellent |
| Setup | Simple | Simple | Moderate | Simple | Simple |
| Cost | Free | Free | Free | Free | API costs |
Chunking Strategies
LangChain (`PDFKB_PDF_CHUNKER=langchain`):
- Header-aware splitting with `MarkdownHeaderTextSplitter`
- Configurable via `PDFKB_CHUNK_SIZE` and `PDFKB_CHUNK_OVERLAP`
- Best for customizable chunking (see the sketch after these lists)
- Default; installed with the base package

Unstructured (`PDFKB_PDF_CHUNKER=unstructured`):
- Intelligent semantic chunking with the `unstructured` library
- Zero configuration required
- Install extra: `pip install "pdfkb-mcp[unstructured_chunker]"` to enable
- Best for document structure awareness
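As referenced above, a hedged sketch of the LangChain strategy: header-aware splitting with the real `MarkdownHeaderTextSplitter` API, followed by a size-bounded pass mirroring `PDFKB_CHUNK_SIZE`/`PDFKB_CHUNK_OVERLAP`. How the server wires these together internally is not shown here.

```python
# Sketch: header-aware markdown chunking with LangChain text splitters.
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

markdown = "# Title\n\nIntro text.\n\n## Section\n\nBody text..."

# First pass: split on headers so each chunk carries its section context.
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")]
)
sections = header_splitter.split_text(markdown)  # one Document per header block

# Second pass: enforce size/overlap limits (the documented defaults).
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = size_splitter.split_documents(sections)

for chunk in chunks:
    print(chunk.metadata, chunk.page_content[:60])
```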
First-run notes
- On the first run, the server initializes the caches and vector store and logs the selected components:
- Parser: PyMuPDF4LLM (default)
- Chunker: LangChain (default)
- Embedding Model: text-embedding-3-large (default)
- If you select a parser/chunker that isn’t installed, the server logs a warning with the exact install command and falls back to the default components instead of exiting.
Troubleshooting Guide
API Key Issues:
- Verify the key format starts with `sk-`
- Check that your account has sufficient credits
- Test connectivity: `curl -H "Authorization: Bearer $PDFKB_OPENAI_API_KEY" https://api.openai.com/v1/models`
Parser Installation Issues:
- MinerU: `pip install mineru[all]` and verify `mineru --version`
- Docling: `pip install docling` for basic features, `pip install "pdfkb-mcp[docling-complete]"` for all features
- LLM: requires the `PDFKB_OPENROUTER_API_KEY` environment variable
Performance Optimization:
- Speed: use the `pymupdf4llm` parser (fastest, low memory footprint)
- Memory: reduce `PDFKB_EMBEDDING_BATCH_SIZE` and `PDFKB_CHUNK_SIZE`; use the pypdfium backend for Docling
- Quality: use `mineru` with GPU (>10K tokens/s on RTX 4090) or `marker` for balanced quality
- Tables: use `docling` with `PDFKB_DOCLING_TABLE_MODE=ACCURATE` or `marker` with LLM mode
- Batch Processing: use `marker` on H100 (~25 pages/s) or `mineru` with sglang acceleration
For additional support, see the implementation details in `src/pdfkb/main.py` and `src/pdfkb/config.py`.