Turn any PDF folder into a searchable MCP server

pdf2mcp

██████╗ ██████╗ ███████╗██████╗ ███╗   ███╗ ██████╗██████╗
██╔══██╗██╔══██╗██╔════╝╚════██╗████╗ ████║██╔════╝██╔══██╗
██████╔╝██║  ██║█████╗   █████╔╝██╔████╔██║██║     ██████╔╝
██╔═══╝ ██║  ██║██╔══╝  ██╔═══╝ ██║╚██╔╝██║██║     ██╔═══╝
██║     ██████╔╝██║     ███████╗██║ ╚═╝ ██║╚██████╗██║
╚═╝     ╚═════╝ ╚═╝     ╚══════╝╚═╝     ╚═╝ ╚═════╝╚═╝

PyPI License: MIT Python 3.10+

Turn any PDF folder into a searchable MCP server with semantic, hybrid, or keyword search.

Installation

From PyPI (recommended)

pip install pdf2mcp

Or with uv:

uv tool install pdf2mcp

From source

git clone https://github.com/iSamBa/pdf2mcp.git
uv tool install ./pdf2mcp

To update after pulling new changes:

uv tool install --force ./pdf2mcp

Optional: Tesseract OCR

Tesseract is only needed if you want to extract text from scanned or image-only PDFs. Without it, pdf2mcp works fine for text-based PDFs — image-only pages are simply skipped with a warning.

macOS:

brew install tesseract

Ubuntu / Debian:

sudo apt-get install tesseract-ocr

Windows:

Download the installer from UB-Mannheim/tesseract.

Additional languages: install language packs for non-English PDFs:

# Example: French and German
sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu
# or on macOS
brew install tesseract-lang

Then set PDF2MCP_OCR_LANGUAGE to the appropriate language code (e.g., fra, deu).
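Before setting the language code, it can help to confirm that Tesseract actually has that language pack installed. The sketch below is not part of pdf2mcp; it shells out to the standard `tesseract --list-langs` command, and the helper names `parse_langs` and `installed_langs` are made up for this example.

```python
import shutil
import subprocess


def parse_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line, which starts with 'List of available languages'.
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs() -> list[str]:
    """Return installed Tesseract language codes, or [] if Tesseract is missing."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Older Tesseract releases print the list to stderr instead of stdout.
    return parse_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    wanted = "fra"
    print(f"{wanted} installed: {wanted in installed_langs()}")
```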

Verify

pdf2mcp --version

Quick Start

Interactive Setup (recommended)

pdf2mcp init -i ./my-project

The interactive wizard walks you through all configuration in 6 steps:

  1. Project directory — confirm or change the target path
  2. OpenAI API key — securely enter your key (masked input) and optional base URL
  3. Documents directory — where your PDFs live (default: docs)
  4. Embedding settings — choose model, chunk size, and overlap
  5. Server settings — name, transport, host, and port
  6. OCR settings — enable/disable OCR for scanned PDFs

After setup, the wizard offers to ingest any PDFs found in your docs directory and to generate ready-to-paste MCP client config snippets.

Manual Setup

# 1. Scaffold a project (creates docs/ and .env template)
pdf2mcp init ./my-project
cd my-project

# 2. Add your PDFs to docs/ and set OPENAI_API_KEY in .env

# 3. Ingest
pdf2mcp ingest

# 4. Start the server
pdf2mcp serve

# 5. Get config snippets for your MCP client
pdf2mcp config

Architecture

pdf2mcp separates server and client concerns:

  • Server (pdf2mcp serve) — runs independently, handles PDF ingestion, embedding, and search. Configured via PDF2MCP_* environment variables.
  • Client (Claude Code, Cursor, VS Code, etc.) — connects to a running server over HTTP. Only needs the server URL.

The default transport is streamable-http. The server listens on http://127.0.0.1:8000/mcp and shuts down gracefully on SIGINT/SIGTERM.
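As a rough illustration of how a client-side script might derive the server URL from the same host and port settings, here is a minimal sketch. It assumes the `/mcp` path is fixed, as in the default URL above, and that a wildcard bind address should be reached through loopback on the same machine. The function `server_url` is hypothetical, not a pdf2mcp API.

```python
import os
from collections.abc import Mapping


def server_url(env: Mapping[str, str] = os.environ) -> str:
    """Build the HTTP endpoint a local client would connect to."""
    host = env.get("PDF2MCP_SERVER_HOST", "127.0.0.1")
    port = int(env.get("PDF2MCP_SERVER_PORT", "8000"))
    # Assumption: a server bound to all interfaces is reachable locally via loopback.
    if host in ("0.0.0.0", "::"):
        host = "127.0.0.1"
    return f"http://{host}:{port}/mcp"


if __name__ == "__main__":
    print(server_url({}))
```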

OCR / Scanned PDF Support

pdf2mcp automatically detects image-only pages in PDFs and falls back to Tesseract OCR when available:

  • Per-page strategy: text pages are extracted via pymupdf4llm; image-only pages are OCR'd via Tesseract.
  • Automatic detection: each page is checked for extractable text (via _page_has_text) and image dominance (via _is_image_dominant). Pages without sufficient text are classified as image-only.
  • Graceful degradation: if Tesseract is not installed or OCR is disabled, image-only pages are skipped with a warning — text-based pages are still extracted normally.
  • Configuration: use PDF2MCP_OCR_ENABLED, PDF2MCP_OCR_LANGUAGE, and PDF2MCP_OCR_DPI environment variables (see Environment Variables).
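The per-page decision described above can be pictured with a small sketch. This is not the project's code: the internal checks `_page_has_text` and `_is_image_dominant` are not reproduced here, the image-dominance test is left out, and the `min_text_chars` threshold is an invented placeholder.

```python
def choose_strategy(
    text_chars: int,
    ocr_enabled: bool = True,
    tesseract_available: bool = True,
    min_text_chars: int = 50,  # hypothetical threshold for "sufficient text"
) -> str:
    """Return 'text', 'ocr', or 'skip' for a single page."""
    if text_chars >= min_text_chars:
        return "text"  # enough extractable text: use normal extraction
    if ocr_enabled and tesseract_available:
        return "ocr"   # image-only page: fall back to Tesseract
    return "skip"      # image-only page with no OCR: skip with a warning


if __name__ == "__main__":
    for chars in (1200, 3):
        print(chars, choose_strategy(chars, tesseract_available=False))
```

With Tesseract unavailable, the text-rich page is still extracted while the near-empty page is skipped, which mirrors the graceful-degradation behavior listed above.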

Commands

| Command | Description |
| --- | --- |
| pdf2mcp init [dir] | Scaffold a working directory with docs/ and .env |
| pdf2mcp init -i [dir] | Launch the interactive setup wizard |
| pdf2mcp ingest | Parse PDFs, chunk, embed, and store in vector DB |
| pdf2mcp serve | Start the MCP server (HTTP by default) |
| pdf2mcp config | Print ready-to-paste config for MCP clients |
| pdf2mcp stats | Display index statistics (doc count, chunks, DB size) |
| pdf2mcp search <query> | Search the index from the command line |
| pdf2mcp delete <filename> | Delete a document from the index |
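To make the "chunk" step of ingestion more concrete, here is a minimal sliding-window sketch with overlap. It is an illustration only: whitespace-separated words stand in for tokens, the real tokenizer and chunking rules used by pdf2mcp are not shown in this document, and `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 500, overlap: int = 50
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog again".split()
    for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
        print(chunk)
```

Each chunk starts `chunk_size - overlap` tokens after the previous one, so neighboring chunks repeat a little context at their boundary.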

Common Flags

# Override docs directory
pdf2mcp ingest --docs-dir ./my-pdfs
pdf2mcp serve --docs-dir ./my-pdfs

# Force re-ingestion (clears DB and re-ingests all documents)
pdf2mcp ingest --force

# Enable debug logging
pdf2mcp ingest -v
pdf2mcp serve --verbose

# Use stdio transport (for clients that spawn the server)
pdf2mcp serve --transport stdio

# Custom host/port
pdf2mcp serve --host 0.0.0.0 --port 9000

# Custom server name
pdf2mcp serve --name my-docs

# Config for a specific client
pdf2mcp config --client cursor
pdf2mcp config --client claude-desktop --transport stdio

# Interactive setup wizard
pdf2mcp init -i ./my-project
pdf2mcp init --interactive

# View index statistics
pdf2mcp stats

# Search the index from CLI
pdf2mcp search "safety requirements"
pdf2mcp search "torque settings" --filename manual.pdf
pdf2mcp search "installation" -n 10

# Delete a document from the index
pdf2mcp delete old-manual.pdf
pdf2mcp delete old-manual.pdf -y   # skip confirmation

Client Configuration

pdf2mcp config generates ready-to-paste JSON for all supported clients. The default is HTTP — clients just need the server URL:

{
  "mcpServers": {
    "pdf-docs": {
      "type": "http",
      "url": "http://127.0.0.1:8000/mcp"
    }
  }
}

| Client | Config File | Top-level Key | HTTP Support |
| --- | --- | --- | --- |
| Claude Code | .mcp.json | mcpServers | Yes |
| Claude Desktop | claude_desktop_config.json | mcpServers | No (stdio only) |
| Cursor | .cursor/mcp.json | mcpServers | Yes |
| VS Code / Copilot | .vscode/mcp.json | servers | Yes |

Use --transport stdio for clients that need to spawn the server process (e.g., Claude Desktop):

{
  "mcpServers": {
    "pdf-docs": {
      "command": "uv",
      "args": ["run", "pdf2mcp", "serve", "--transport", "stdio"]
    }
  }
}
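The main difference between clients is the config file and the top-level key. A small sketch of that mapping, using the values from the client table above, might look like the following. The client identifiers used as dictionary keys and the `http_config` helper are invented for this example; `pdf2mcp config` remains the supported way to generate these snippets.

```python
import json

# (config file, top-level key, supports HTTP), taken from the client table.
CLIENTS = {
    "claude-code": (".mcp.json", "mcpServers", True),
    "claude-desktop": ("claude_desktop_config.json", "mcpServers", False),
    "cursor": (".cursor/mcp.json", "mcpServers", True),
    "vscode": (".vscode/mcp.json", "servers", True),
}


def http_config(client: str, url: str, name: str = "pdf-docs") -> tuple[str, dict]:
    """Return (config file name, config dict) for an HTTP-capable client."""
    filename, top_key, supports_http = CLIENTS[client]
    if not supports_http:
        raise ValueError(f"{client} is stdio only; use a stdio entry instead")
    return filename, {top_key: {name: {"type": "http", "url": url}}}


if __name__ == "__main__":
    filename, config = http_config("vscode", "http://127.0.0.1:8000/mcp")
    print(filename)
    print(json.dumps(config, indent=2))
```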

Environment Variables

Server settings (PDF2MCP_*)

These variables configure the server process; MCP clients never need them. All use the PDF2MCP_ prefix except OPENAI_API_KEY.

| Variable | Default | Description |
| --- | --- | --- |
| OPENAI_API_KEY | (required) | OpenAI API key for embeddings |
| PDF2MCP_OPENAI_BASE_URL | https://api.openai.com/v1 | OpenAI API base URL (for Azure, local proxies, or compatible providers) |
| PDF2MCP_DOCS_DIR | docs | Directory containing PDF files |
| PDF2MCP_DATA_DIR | data | Directory for vector database |
| PDF2MCP_EMBEDDING_MODEL | text-embedding-3-small | OpenAI embedding model |
| PDF2MCP_CHUNK_SIZE | 500 | Target chunk size in tokens |
| PDF2MCP_CHUNK_OVERLAP | 50 | Overlap between chunks in tokens |
| PDF2MCP_DEFAULT_NUM_RESULTS | 5 | Default search results count |
| PDF2MCP_SERVER_NAME | pdf-docs | MCP server name |
| PDF2MCP_SERVER_TRANSPORT | streamable-http | Transport protocol |
| PDF2MCP_SERVER_HOST | 127.0.0.1 | Host to bind to |
| PDF2MCP_SERVER_PORT | 8000 | Port to bind to |
| PDF2MCP_SEARCH_MODE | semantic | Search mode: semantic, hybrid, or keyword |
| PDF2MCP_OCR_ENABLED | true | Enable OCR for scanned/image-only pages |
| PDF2MCP_OCR_LANGUAGE | eng | Tesseract language code |
| PDF2MCP_OCR_DPI | 300 | DPI for OCR rendering |
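For scripts that need to read the same settings, a minimal loader could look like the sketch below. It only mirrors a few variables and their documented defaults; how pdf2mcp itself parses and validates these values is not described here, so the `Settings` class, the `load_settings` function, and the accepted boolean spellings are assumptions.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass

SEARCH_MODES = {"semantic", "hybrid", "keyword"}


@dataclass(frozen=True)
class Settings:
    docs_dir: str
    chunk_size: int
    chunk_overlap: int
    search_mode: str
    ocr_enabled: bool
    ocr_language: str
    ocr_dpi: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read PDF2MCP_* variables, falling back to the documented defaults."""
    def get(name: str, default: str) -> str:
        return env.get(f"PDF2MCP_{name}", default)

    mode = get("SEARCH_MODE", "semantic")
    if mode not in SEARCH_MODES:
        raise ValueError(f"unknown search mode: {mode!r}")
    return Settings(
        docs_dir=get("DOCS_DIR", "docs"),
        chunk_size=int(get("CHUNK_SIZE", "500")),
        chunk_overlap=int(get("CHUNK_OVERLAP", "50")),
        search_mode=mode,
        # Assumption: these spellings count as true; anything else is false.
        ocr_enabled=get("OCR_ENABLED", "true").strip().lower() in {"1", "true", "yes"},
        ocr_language=get("OCR_LANGUAGE", "eng"),
        ocr_dpi=int(get("OCR_DPI", "300")),
    )


if __name__ == "__main__":
    print(load_settings({"PDF2MCP_SEARCH_MODE": "hybrid"}))
```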

Search Modes

pdf2mcp supports three search modes, controlled by the PDF2MCP_SEARCH_MODE environment variable:

| Mode | Description | When to use |
| --- | --- | --- |
| semantic (default) | Pure vector similarity search | General natural-language queries |
| keyword | Full-text search (no embeddings needed) | Exact terms, acronyms, error codes |
| hybrid | Combines vector + full-text search | Queries that mix natural language with exact terms |

To switch modes, set PDF2MCP_SEARCH_MODE in your .env. Re-ingesting is optional, but it builds the full-text search index up front:

# In .env
PDF2MCP_SEARCH_MODE=hybrid

# Re-ingest to build the FTS index
pdf2mcp ingest --force

Hybrid and keyword modes automatically create a full-text search index. If you switch modes without re-ingesting, the FTS index is created lazily on the first query.
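This README does not say how hybrid mode merges the vector and full-text results. As one common way to combine two ranked lists, here is a sketch of reciprocal rank fusion (RRF); it is offered as an illustration of the idea, not as a description of pdf2mcp's internals. The constant `k = 60` is the value conventionally used with RRF, and the chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; items ranked high in several lists come first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for a stable order.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # hypothetical chunk IDs from vector search
    keyword_hits = ["c", "a", "d"]  # hypothetical chunk IDs from full-text search
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
    # → ['a', 'c', 'b', 'd']
```

Chunks that appear in both lists ("a" and "c") outrank those found by only one search, which is the behavior a hybrid mode is usually after.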

MCP Tools

The server exposes six tools:

| Tool | Description |
| --- | --- |
| search_docs(query) | Search across all ingested PDFs |
| search_in_doc(query, filename) | Search scoped to a single document |
| list_docs() | List all ingested documents with chunk counts |
| get_sections(filename) | Get section headings for a specific document |
| read_page(filename, page) | Read the full content of a specific page |
| read_section(filename, section_title) | Read the full content of a named section |

Typical workflow

  1. list_docs — discover available documents
  2. get_sections — browse a document's structure
  3. read_section or read_page — read specific content
  4. search_docs or search_in_doc — find information by query
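At the protocol level, each step in this workflow is an MCP `tools/call` JSON-RPC request. The sketch below only builds the request bodies; a real client (or an MCP SDK) also performs the initialization handshake and handles transport, which is omitted here. The argument names follow the tool signatures listed above, while the file name and section title are placeholder values.

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC request IDs


def tool_call(name: str, **arguments: object) -> dict:
    """Build an MCP tools/call request body for the given tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


workflow = [
    tool_call("list_docs"),
    tool_call("get_sections", filename="manual.pdf"),
    tool_call("read_section", filename="manual.pdf", section_title="Installation"),
    tool_call("search_docs", query="torque settings"),
]

if __name__ == "__main__":
    for request in workflow:
        print(json.dumps(request))
```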

MCP Prompts

The server provides five prompts that guide LLMs through multi-tool workflows:

| Prompt | Args | Description |
| --- | --- | --- |
| summarize_document | filename | Read all sections and synthesize a summary |
| compare_documents | filename1, filename2 | Side-by-side comparison of two documents |
| extract_key_findings | filename | Extract conclusions, recommendations, and key findings |
| deep_dive | filename, topic | Exhaustive analysis of a specific topic |
| document_overview | filename | Structured table of contents with brief descriptions |

Prompts return step-by-step instructions that reference the existing tools, enabling LLMs to perform complex multi-step document analysis automatically.

MCP Resources

| Resource URI | Description |
| --- | --- |
| docs://status | Server status: document count, chunk count, embedding model, and docs directory |
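Prompts and resources are fetched with the MCP `prompts/get` and `resources/read` methods. As a rough sketch of what those requests look like on the wire, assuming a session has already been initialized, the helpers below build the JSON-RPC bodies. The helper names and the example file name are illustrative only.

```python
import json


def prompt_request(request_id: int, name: str, **arguments: str) -> dict:
    """Build an MCP prompts/get request body."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "prompts/get",
        "params": {"name": name, "arguments": arguments},
    }


def resource_request(request_id: int, uri: str) -> dict:
    """Build an MCP resources/read request body."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "resources/read",
        "params": {"uri": uri},
    }


if __name__ == "__main__":
    print(json.dumps(prompt_request(1, "deep_dive", filename="manual.pdf", topic="safety")))
    print(json.dumps(resource_request(2, "docs://status")))
```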

Development

git clone https://github.com/iSamBa/pdf2mcp.git
cd pdf2mcp
uv sync --all-extras
uv run pytest
uv run ruff check src/
uv run mypy src/

License

MIT
