Turn any PDF folder into a searchable MCP server
pdf2mcp
██████╗ ██████╗ ███████╗██████╗ ███╗ ███╗ ██████╗██████╗
██╔══██╗██╔══██╗██╔════╝╚════██╗████╗ ████║██╔════╝██╔══██╗
██████╔╝██║ ██║█████╗ █████╔╝██╔████╔██║██║ ██████╔╝
██╔═══╝ ██║ ██║██╔══╝ ██╔═══╝ ██║╚██╔╝██║██║ ██╔═══╝
██║ ██████╔╝██║ ███████╗██║ ╚═╝ ██║╚██████╗██║
╚═╝ ╚═════╝ ╚═╝ ╚══════╝╚═╝ ╚═╝ ╚═════╝╚═╝
Turn any PDF folder into a searchable MCP server with semantic search.
Installation
From PyPI (recommended)
pip install pdf2mcp
Or with uv:
uv tool install pdf2mcp
From source
git clone https://github.com/iSamBa/pdf2mcp.git
uv tool install ./pdf2mcp
To update after pulling new changes:
uv tool install --force ./pdf2mcp
Optional: Tesseract OCR
Tesseract is only needed if you want to extract text from scanned or image-only PDFs. Without it, pdf2mcp works fine for text-based PDFs — image-only pages are simply skipped with a warning.
macOS:
brew install tesseract
Ubuntu / Debian:
sudo apt-get install tesseract-ocr
Windows:
Download the installer from UB-Mannheim/tesseract.
Additional languages: install language packs for non-English PDFs:
# Example: French and German
sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu
# or on macOS
brew install tesseract-lang
Then set PDF2MCP_OCR_LANGUAGE to the appropriate language code (e.g., fra, deu).
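Before ingesting scanned PDFs, it can help to confirm that Tesseract is actually on the PATH. The following is a minimal sketch, not part of pdf2mcp: the function name `ocr_readiness` is hypothetical, and the set of strings it accepts as "true" is an assumption about how a boolean variable such as PDF2MCP_OCR_ENABLED might be parsed.

```python
import os
import shutil


def ocr_readiness(env=None, which=shutil.which):
    """Report whether OCR could run, based on PATH and the PDF2MCP_OCR_* variables.

    Hypothetical helper for illustration; pdf2mcp's own checks may differ.
    """
    env = os.environ if env is None else env
    # Assumption: common truthy spellings are accepted for the boolean flag.
    raw_flag = env.get("PDF2MCP_OCR_ENABLED", "true")
    enabled = raw_flag.strip().lower() in {"1", "true", "yes", "on"}
    tesseract_path = which("tesseract")  # None when the binary is not on PATH
    return {
        "enabled": enabled,
        "language": env.get("PDF2MCP_OCR_LANGUAGE", "eng"),
        "tesseract_path": tesseract_path,
        "ready": enabled and tesseract_path is not None,
    }
```

Calling `ocr_readiness()` with no arguments inspects the current environment. A result with `ready` set to False corresponds to the case where image-only pages would be skipped.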
Verify
pdf2mcp --version
Quick Start
Interactive Setup (recommended)
pdf2mcp init -i ./my-project
The interactive wizard walks you through all configuration in 6 steps:
- Project directory: confirm or change the target path
- OpenAI API key: securely enter your key (masked input) and optional base URL
- Documents directory: where your PDFs live (default: docs)
- Embedding settings: choose model, chunk size, and overlap
- Server settings: name, transport, host, and port
- OCR settings: enable/disable OCR for scanned PDFs
After setup, the wizard optionally offers to ingest any PDFs found in your docs directory and generate ready-to-paste MCP client config snippets.
Manual Setup
# 1. Scaffold a project (creates docs/ and .env template)
pdf2mcp init ./my-project
cd my-project
# 2. Add your PDFs to docs/ and set OPENAI_API_KEY in .env
# 3. Ingest
pdf2mcp ingest
# 4. Start the server
pdf2mcp serve
# 5. Get config snippets for your MCP client
pdf2mcp config
Architecture
pdf2mcp separates server and client concerns:
- Server (pdf2mcp serve): runs independently and handles PDF ingestion, embedding, and search. Configured via PDF2MCP_* environment variables.
- Client (Claude Code, Cursor, VS Code, etc.): connects to a running server over HTTP. Only needs the server URL.
The default transport is streamable-http. The server listens on http://127.0.0.1:8000/mcp and shuts down gracefully on SIGINT/SIGTERM.
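Because clients connect to an already running server, a quick way to debug a failed connection is to check whether anything is listening on the configured host and port. The sketch below uses only the standard library; `is_listening` is a hypothetical helper, and it checks TCP reachability only, not whether the process answering is actually pdf2mcp.

```python
import socket


def is_listening(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

With the defaults shown, `is_listening()` probes the address that the default streamable-http transport binds to.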
OCR / Scanned PDF Support
pdf2mcp automatically detects image-only pages in PDFs and falls back to Tesseract OCR when available:
- Per-page strategy: text pages are extracted via pymupdf4llm; image-only pages are OCR'd via Tesseract.
- Automatic detection: each page is checked for extractable text (via _page_has_text) and image dominance (via _is_image_dominant). Pages without sufficient text are classified as image-only.
- Graceful degradation: if Tesseract is not installed or OCR is disabled, image-only pages are skipped with a warning; text-based pages are still extracted normally.
- Configuration: use the PDF2MCP_OCR_ENABLED, PDF2MCP_OCR_LANGUAGE, and PDF2MCP_OCR_DPI environment variables (see Environment Variables).
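The per-page decision described above can be sketched as a small function. This is an illustration under stated assumptions, not pdf2mcp's implementation: `choose_extractor` and the `min_text_chars` threshold are hypothetical, and the image-dominance check is left out because the text does not say how it combines with the text check.

```python
def choose_extractor(text_chars, ocr_enabled=True, tesseract_available=True,
                     min_text_chars=20):
    """Pick how a single page would be handled.

    text_chars: number of extractable text characters found on the page.
    min_text_chars: assumed cutoff for "sufficient text" (hypothetical value).
    Returns "text", "ocr", or "skip".
    """
    if text_chars >= min_text_chars:
        return "text"   # text page: regular extraction
    if ocr_enabled and tesseract_available:
        return "ocr"    # image-only page: fall back to Tesseract
    return "skip"       # image-only page, OCR unavailable: skipped with a warning


# A three-page document where only the middle page is a scan.
pages = [1800, 0, 950]
plan = [choose_extractor(n, tesseract_available=False) for n in pages]
```

In this example the scanned page is skipped while the two text pages are still extracted, which is the graceful-degradation behavior the list describes.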
Commands
| Command | Description |
|---|---|
| pdf2mcp init [dir] | Scaffold a working directory with docs/ and .env |
| pdf2mcp init -i [dir] | Launch the interactive setup wizard |
| pdf2mcp ingest | Parse PDFs, chunk, embed, and store in vector DB |
| pdf2mcp serve | Start the MCP server (HTTP by default) |
| pdf2mcp config | Print ready-to-paste config for MCP clients |
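To make the chunking step of ingest more concrete, here is a minimal sliding-window sketch. It assumes the input is already a list of tokens and uses the documented defaults of 500 tokens per chunk with 50 tokens of overlap; the function name `chunk_tokens` is hypothetical, and the tokenizer and splitting rules pdf2mcp actually uses are not described here.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Integers stand in for real tokens.
chunks = chunk_tokens(list(range(1200)))
```

Each chunk begins with the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.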
Common Flags
# Override docs directory
pdf2mcp ingest --docs-dir ./my-pdfs
pdf2mcp serve --docs-dir ./my-pdfs
# Force re-ingestion (clears DB and re-ingests all documents)
pdf2mcp ingest --force
# Enable debug logging
pdf2mcp ingest -v
pdf2mcp serve --verbose
# Use stdio transport (for clients that spawn the server)
pdf2mcp serve --transport stdio
# Custom host/port
pdf2mcp serve --host 0.0.0.0 --port 9000
# Custom server name
pdf2mcp serve --name my-docs
# Config for a specific client
pdf2mcp config --client cursor
pdf2mcp config --client claude-desktop --transport stdio
# Interactive setup wizard
pdf2mcp init -i ./my-project
pdf2mcp init --interactive
Client Configuration
pdf2mcp config generates ready-to-paste JSON for all supported clients. The default is HTTP — clients just need the server URL:
{
"mcpServers": {
"pdf-docs": {
"type": "http",
"url": "http://127.0.0.1:8000/mcp"
}
}
}
| Client | Config File | Top-level Key | HTTP Support |
|---|---|---|---|
| Claude Code | .mcp.json | mcpServers | Yes |
| Claude Desktop | claude_desktop_config.json | mcpServers | No (stdio only) |
| Cursor | .cursor/mcp.json | mcpServers | Yes |
| VS Code / Copilot | .vscode/mcp.json | servers | Yes |
Use --transport stdio for clients that need to spawn the server process (e.g., Claude Desktop):
{
"mcpServers": {
"pdf-docs": {
"command": "uv",
"args": ["run", "pdf2mcp", "serve"]
}
}
}
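The difference between clients comes down to the top-level key and the transport entry. The sketch below rebuilds the two JSON shapes shown in this section; it is not how pdf2mcp config is implemented. The `build_config` function and the dictionary keys "claude-code" and "vscode" are hypothetical labels for this example (only cursor and claude-desktop appear as --client values above).

```python
import json

# Config file and top-level key per client, as listed in the client table.
CLIENTS = {
    "claude-code": (".mcp.json", "mcpServers"),
    "claude-desktop": ("claude_desktop_config.json", "mcpServers"),
    "cursor": (".cursor/mcp.json", "mcpServers"),
    "vscode": (".vscode/mcp.json", "servers"),
}


def build_config(client, name="pdf-docs",
                 url="http://127.0.0.1:8000/mcp", transport="http"):
    """Return (config_file, config_dict) for one client."""
    config_file, top_key = CLIENTS[client]
    if transport == "stdio":
        # The client spawns the server itself.
        entry = {"command": "uv", "args": ["run", "pdf2mcp", "serve"]}
    else:
        # The client connects to an already running server.
        entry = {"type": "http", "url": url}
    return config_file, {top_key: {name: entry}}


path, config = build_config("vscode")
print(path)
print(json.dumps(config, indent=2))
```

For VS Code the entry lands under `servers` rather than `mcpServers`, which is the one structural difference among the listed clients.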
Environment Variables
Server settings (PDF2MCP_*)
These configure the server process. MCP clients never need these.
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | OpenAI API key for embeddings |
| PDF2MCP_OPENAI_BASE_URL | https://api.openai.com/v1 | OpenAI API base URL (for Azure, local proxies, or compatible providers) |
| PDF2MCP_DOCS_DIR | docs | Directory containing PDF files |
| PDF2MCP_DATA_DIR | data | Directory for vector database |
| PDF2MCP_EMBEDDING_MODEL | text-embedding-3-small | OpenAI embedding model |
| PDF2MCP_CHUNK_SIZE | 500 | Target chunk size in tokens |
| PDF2MCP_CHUNK_OVERLAP | 50 | Overlap between chunks in tokens |
| PDF2MCP_DEFAULT_NUM_RESULTS | 5 | Default search results count |
| PDF2MCP_SERVER_NAME | pdf-docs | MCP server name |
| PDF2MCP_SERVER_TRANSPORT | streamable-http | Transport protocol |
| PDF2MCP_SERVER_HOST | 127.0.0.1 | Host to bind to |
| PDF2MCP_SERVER_PORT | 8000 | Port to bind to |
| PDF2MCP_OCR_ENABLED | true | Enable OCR for scanned/image-only pages |
| PDF2MCP_OCR_LANGUAGE | eng | Tesseract language code |
| PDF2MCP_OCR_DPI | 300 | DPI for OCR rendering |
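As a rough picture of how these variables resolve, the sketch below merges an environment mapping with the documented defaults and converts the numeric values. It is a simplified illustration, not pdf2mcp's settings code: `resolve_settings` is a hypothetical name, only a subset of the variables is included, and the error raised for a missing key is an assumption.

```python
import os

# Defaults copied from the table; only a subset is shown to keep the sketch short.
DEFAULTS = {
    "PDF2MCP_DOCS_DIR": "docs",
    "PDF2MCP_DATA_DIR": "data",
    "PDF2MCP_EMBEDDING_MODEL": "text-embedding-3-small",
    "PDF2MCP_CHUNK_SIZE": "500",
    "PDF2MCP_CHUNK_OVERLAP": "50",
    "PDF2MCP_SERVER_HOST": "127.0.0.1",
    "PDF2MCP_SERVER_PORT": "8000",
}
INT_KEYS = ("PDF2MCP_CHUNK_SIZE", "PDF2MCP_CHUNK_OVERLAP", "PDF2MCP_SERVER_PORT")


def resolve_settings(env=None):
    """Merge environment values over the defaults and convert numeric fields."""
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        # The only variable without a default.
        raise RuntimeError("OPENAI_API_KEY is required")
    settings = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    for key in INT_KEYS:
        settings[key] = int(settings[key])
    return settings


settings = resolve_settings({"OPENAI_API_KEY": "sk-example", "PDF2MCP_SERVER_PORT": "9000"})
```

Here only the port is overridden, so every other value falls back to its default from the table.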
MCP Tools
The server exposes six tools:
| Tool | Description |
|---|---|
| search_docs(query) | Semantic search across all ingested PDFs |
| search_in_doc(query, filename) | Semantic search scoped to a single document |
| list_docs() | List all ingested documents with chunk counts |
| get_sections(filename) | Get section headings for a specific document |
| read_page(filename, page) | Read the full content of a specific page |
| read_section(filename, section_title) | Read the full content of a named section |
Typical workflow
- list_docs: discover available documents
- get_sections: browse a document's structure
- read_section or read_page: read specific content
- search_docs or search_in_doc: find information by query
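On the wire, each step of this workflow is an MCP `tools/call` request, which is a JSON-RPC 2.0 message carrying the tool name and its arguments. The sketch below only builds those payloads. The helper `tool_call`, the file name manual.pdf, the section title, and the query are all made up for illustration, and a real client (or an MCP SDK) would also perform the initialization handshake and send the messages over the chosen transport.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need unique ids within a session


def tool_call(name, **arguments):
    """Build the JSON-RPC 2.0 payload for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# The typical workflow, with hypothetical argument values.
workflow = [
    tool_call("list_docs"),
    tool_call("get_sections", filename="manual.pdf"),
    tool_call("read_section", filename="manual.pdf", section_title="Installation"),
    tool_call("search_docs", query="how do I reset the device?"),
]

for request in workflow:
    print(json.dumps(request))
```

The argument names (filename, section_title, query) follow the tool signatures in the table above.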
MCP Resources
| Resource URI | Description |
|---|---|
| docs://status | Server status: document count, chunk count, embedding model, and docs directory |
Development
git clone https://github.com/iSamBa/pdf2mcp.git
cd pdf2mcp
uv sync --all-extras
uv run pytest
uv run ruff check src/
uv run mypy src/
License