Turn any PDF folder into a searchable MCP server

pdf2mcp

██████╗ ██████╗ ███████╗██████╗ ███╗   ███╗ ██████╗██████╗
██╔══██╗██╔══██╗██╔════╝╚════██╗████╗ ████║██╔════╝██╔══██╗
██████╔╝██║  ██║█████╗   █████╔╝██╔████╔██║██║     ██████╔╝
██╔═══╝ ██║  ██║██╔══╝  ██╔═══╝ ██║╚██╔╝██║██║     ██╔═══╝
██║     ██████╔╝██║     ███████╗██║ ╚═╝ ██║╚██████╗██║
╚═╝     ╚═════╝ ╚═╝     ╚══════╝╚═╝     ╚═╝ ╚═════╝╚═╝

Turn any PDF folder into a searchable MCP server with semantic search.

Installation

From PyPI (recommended)

pip install pdf2mcp

Or with uv:

uv tool install pdf2mcp

From source

git clone https://github.com/iSamBa/pdf2mcp.git
uv tool install ./pdf2mcp

To update after pulling new changes:

uv tool install --force ./pdf2mcp

Optional: Tesseract OCR

Tesseract is only needed if you want to extract text from scanned or image-only PDFs. Without it, pdf2mcp works fine for text-based PDFs — image-only pages are simply skipped with a warning.

macOS:

brew install tesseract

Ubuntu / Debian:

sudo apt-get install tesseract-ocr

Windows:

Download the installer from UB-Mannheim/tesseract.

Additional languages: install language packs for non-English PDFs:

# Example: French and German
sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu
# or on macOS
brew install tesseract-lang

Then set PDF2MCP_OCR_LANGUAGE to the appropriate language code (e.g., fra, deu).

Verify

pdf2mcp --version

Quick Start

# 1. Scaffold a project (creates docs/ and .env)
pdf2mcp init ./my-project
cd my-project

# 2. Add your PDFs to docs/ and set OPENAI_API_KEY in .env

# 3. Ingest
pdf2mcp ingest

# 4. Start the server
pdf2mcp serve

# 5. Get config snippets for your MCP client
pdf2mcp config

Architecture

pdf2mcp separates server and client concerns:

- Server (pdf2mcp serve): runs independently and handles PDF ingestion, embedding, and search. Configured via PDF2MCP_* environment variables.
- Client (Claude Code, Cursor, VS Code, etc.): connects to a running server over HTTP. Only needs the server URL.

The default transport is streamable-http. The server listens on http://127.0.0.1:8000/mcp and shuts down gracefully on SIGINT/SIGTERM.

OCR / Scanned PDF Support

pdf2mcp automatically detects image-only pages in PDFs and falls back to Tesseract OCR when available:

- Per-page strategy: text pages are extracted via pymupdf4llm; image-only pages are OCR'd via Tesseract.
- Automatic detection: each page is checked for extractable text (via _page_has_text) and image dominance (via _is_image_dominant). Pages without sufficient text are classified as image-only.
- Graceful degradation: if Tesseract is not installed or OCR is disabled, image-only pages are skipped with a warning, while text-based pages are still extracted normally.
- Configuration: use the PDF2MCP_OCR_ENABLED, PDF2MCP_OCR_LANGUAGE, and PDF2MCP_OCR_DPI environment variables (see Environment Variables).
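
The per-page decision described above can be pictured with a small Python sketch. It is only an illustration: the `PageInfo` type, both thresholds, and the way the two checks are combined are assumptions, not pdf2mcp's actual implementation of `_page_has_text` and `_is_image_dominant`.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the values pdf2mcp really uses are not stated here.
MIN_TEXT_CHARS = 20
IMAGE_AREA_RATIO = 0.5


@dataclass
class PageInfo:
    """Minimal stand-in for what a PDF library reports about one page."""
    text: str                # extractable text layer (may be empty)
    image_area_ratio: float  # fraction of the page area covered by images


def page_has_text(page: PageInfo) -> bool:
    """True when the text layer holds enough characters to be useful."""
    return len(page.text.strip()) >= MIN_TEXT_CHARS


def is_image_dominant(page: PageInfo) -> bool:
    """True when images cover most of the page."""
    return page.image_area_ratio >= IMAGE_AREA_RATIO


def choose_strategy(page: PageInfo, ocr_available: bool) -> str:
    """Return "text", "ocr", or "skip" for a single page."""
    if page_has_text(page):
        return "text"   # normal text extraction
    if is_image_dominant(page) and ocr_available:
        return "ocr"    # scanned page, Tesseract can handle it
    return "skip"       # nothing to extract, or OCR is unavailable/disabled
```

In this sketch, `ocr_available` stands for "Tesseract is installed and PDF2MCP_OCR_ENABLED is true"; a scanned page with OCR unavailable falls through to "skip", which corresponds to the warning case above.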

Commands

Command	Description
`pdf2mcp init [dir]`	Scaffold a working directory with `docs/` and `.env`
`pdf2mcp ingest`	Parse PDFs, chunk, embed, and store in vector DB
`pdf2mcp serve`	Start the MCP server (HTTP by default)
`pdf2mcp config`	Print ready-to-paste config for MCP clients
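
To give a feel for the "chunk" step of `pdf2mcp ingest`, here is a minimal Python sketch of fixed-size chunking with overlap, using the documented defaults of 500 tokens per chunk and 50 tokens of overlap. The function name and the use of a pre-split token list are assumptions; the tokenizer and the exact splitting rules pdf2mcp applies are not described in this README.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, a 1000-token document yields windows starting at tokens 0, 450, and 900, so each chunk repeats the last 50 tokens of the previous one.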

Common Flags

# Override docs directory
pdf2mcp ingest --docs-dir ./my-pdfs
pdf2mcp serve --docs-dir ./my-pdfs

# Force re-ingestion (clears DB and re-ingests all documents)
pdf2mcp ingest --force

# Enable debug logging
pdf2mcp ingest -v
pdf2mcp serve --verbose

# Use stdio transport (for clients that spawn the server)
pdf2mcp serve --transport stdio

# Custom host/port
pdf2mcp serve --host 0.0.0.0 --port 9000

# Custom server name
pdf2mcp serve --name my-docs

# Config for a specific client
pdf2mcp config --client cursor
pdf2mcp config --client claude-desktop --transport stdio

Client Configuration

pdf2mcp config generates ready-to-paste JSON for all supported clients. The default is HTTP — clients just need the server URL:

{
  "mcpServers": {
    "pdf-docs": {
      "type": "http",
      "url": "http://127.0.0.1:8000/mcp"
    }
  }
}

Client	Config File	Top-level Key	HTTP Support
Claude Code	`.mcp.json`	`mcpServers`	Yes
Claude Desktop	`claude_desktop_config.json`	`mcpServers`	No (stdio only)
Cursor	`.cursor/mcp.json`	`mcpServers`	Yes
VS Code / Copilot	`.vscode/mcp.json`	`servers`	Yes

Use --transport stdio for clients that need to spawn the server process (e.g., Claude Desktop):

{
  "mcpServers": {
    "pdf-docs": {
      "command": "uv",
      "args": ["run", "pdf2mcp", "serve", "--transport", "stdio"]
    }
  }
}
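
The differences between clients in the table above come down to the top-level key and the transport. The following Python sketch builds such a config by hand. It is not how `pdf2mcp config` is implemented: the function name and the `claude-code` and `vscode` identifiers are hypothetical (only `cursor` and `claude-desktop` appear as `--client` values in this README), and the stdio entry passes `--transport stdio` explicitly because the server default is streamable-http.

```python
import json

# Top-level key per client, taken from the table above.
TOP_LEVEL_KEY = {
    "claude-code": "mcpServers",     # hypothetical identifier
    "claude-desktop": "mcpServers",
    "cursor": "mcpServers",
    "vscode": "servers",             # hypothetical identifier
}
STDIO_ONLY = {"claude-desktop"}


def build_client_config(client: str, name: str = "pdf-docs",
                        url: str = "http://127.0.0.1:8000/mcp",
                        transport: str = "http") -> dict:
    """Return a config dict for one MCP client."""
    if client not in TOP_LEVEL_KEY:
        raise ValueError(f"unknown client: {client}")
    if transport == "http":
        if client in STDIO_ONLY:
            raise ValueError(f"{client} supports stdio only")
        entry = {"type": "http", "url": url}
    elif transport == "stdio":
        # The client spawns the server itself, so it needs a command line.
        entry = {"command": "uv",
                 "args": ["run", "pdf2mcp", "serve", "--transport", "stdio"]}
    else:
        raise ValueError(f"unknown transport: {transport}")
    return {TOP_LEVEL_KEY[client]: {name: entry}}


if __name__ == "__main__":
    print(json.dumps(build_client_config("vscode"), indent=2))
```

For real use, prefer the output of `pdf2mcp config`, which reflects the installed version.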

Environment Variables

Server settings (`PDF2MCP_*`)

These configure the server process; MCP clients never need them. All variables use the PDF2MCP_ prefix except OPENAI_API_KEY.

Variable	Default	Description
`OPENAI_API_KEY`	(required)	OpenAI API key for embeddings
`PDF2MCP_OPENAI_BASE_URL`	`https://api.openai.com/v1`	OpenAI API base URL (for Azure, local proxies, or compatible providers)
`PDF2MCP_DOCS_DIR`	`docs`	Directory containing PDF files
`PDF2MCP_DATA_DIR`	`data`	Directory for vector database
`PDF2MCP_EMBEDDING_MODEL`	`text-embedding-3-small`	OpenAI embedding model
`PDF2MCP_CHUNK_SIZE`	`500`	Target chunk size in tokens
`PDF2MCP_CHUNK_OVERLAP`	`50`	Overlap between chunks in tokens
`PDF2MCP_DEFAULT_NUM_RESULTS`	`5`	Default search results count
`PDF2MCP_SERVER_NAME`	`pdf-docs`	MCP server name
`PDF2MCP_SERVER_TRANSPORT`	`streamable-http`	Transport protocol
`PDF2MCP_SERVER_HOST`	`127.0.0.1`	Host to bind to
`PDF2MCP_SERVER_PORT`	`8000`	Port to bind to
`PDF2MCP_OCR_ENABLED`	`true`	Enable OCR for scanned/image-only pages
`PDF2MCP_OCR_LANGUAGE`	`eng`	Tesseract language code
`PDF2MCP_OCR_DPI`	`300`	DPI for OCR rendering
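
As an illustration of how these variables and their defaults fit together, the Python sketch below reads a subset of them from the environment. The `ServerSettings` class, `load_settings` function, and the accepted boolean spellings are assumptions made for this example; pdf2mcp's own settings loader may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ServerSettings:
    docs_dir: str
    data_dir: str
    chunk_size: int
    chunk_overlap: int
    host: str
    port: int
    ocr_enabled: bool
    ocr_language: str
    ocr_dpi: int

    @property
    def url(self) -> str:
        # The /mcp path is the one documented for the streamable-http transport.
        return f"http://{self.host}:{self.port}/mcp"


def _as_bool(raw: str) -> bool:
    # Accepted spellings are an assumption, not documented behaviour.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> ServerSettings:
    """Read PDF2MCP_* variables, falling back to the defaults in the table."""
    def get(name: str, default: str) -> str:
        return env.get(f"PDF2MCP_{name}", default)

    return ServerSettings(
        docs_dir=get("DOCS_DIR", "docs"),
        data_dir=get("DATA_DIR", "data"),
        chunk_size=int(get("CHUNK_SIZE", "500")),
        chunk_overlap=int(get("CHUNK_OVERLAP", "50")),
        host=get("SERVER_HOST", "127.0.0.1"),
        port=int(get("SERVER_PORT", "8000")),
        ocr_enabled=_as_bool(get("OCR_ENABLED", "true")),
        ocr_language=get("OCR_LANGUAGE", "eng"),
        ocr_dpi=int(get("OCR_DPI", "300")),
    )
```

Passing a plain dict instead of `os.environ` makes the overrides easy to try out, for example `load_settings({"PDF2MCP_SERVER_PORT": "9000"}).url`.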

MCP Tools

The server exposes six tools:

Tool	Description
`search_docs(query)`	Semantic search across all ingested PDFs
`search_in_doc(query, filename)`	Semantic search scoped to a single document
`list_docs()`	List all ingested documents with chunk counts
`get_sections(filename)`	Get section headings for a specific document
`read_page(filename, page)`	Read the full content of a specific page
`read_section(filename, section_title)`	Read the full content of a named section
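
Conceptually, semantic search compares the embedding of the query with the embedding of each stored chunk and returns the closest matches. The Python sketch below shows that idea with cosine similarity over an in-memory list. It is a simplification under stated assumptions: this README does not name the similarity metric or the vector database, and the chunk dictionaries and function names are hypothetical.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[dict], k: int = 5,
          filename: Optional[str] = None) -> list[dict]:
    """Return the k chunks most similar to the query.

    Each chunk is a dict with "filename", "text", and "vector" keys.
    Passing filename restricts the search to one document, mirroring
    the difference between search_docs and search_in_doc.
    """
    candidates = [c for c in chunks if filename is None or c["filename"] == filename]
    ranked = sorted(candidates,
                    key=lambda c: cosine_similarity(query_vec, c["vector"]),
                    reverse=True)
    return ranked[:k]
```

The default `k=5` matches PDF2MCP_DEFAULT_NUM_RESULTS; in the real server the vectors come from the configured OpenAI embedding model rather than being supplied by hand.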

Typical workflow

1. list_docs: discover available documents
2. get_sections: browse a document's structure
3. read_section or read_page: read specific content
4. search_docs or search_in_doc: find information by query
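
On the wire, an MCP client invokes a tool with a JSON-RPC 2.0 request whose method is `tools/call`. The Python sketch below builds the requests for the workflow above so the shape of each call is visible. The helper name, the file name `manual.pdf`, the section title, and the query are made up for illustration; in practice an MCP client or SDK performs the initialization handshake and sends these messages for you.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids


def tool_call(name: str, **arguments) -> dict:
    """Build a JSON-RPC 2.0 "tools/call" request for one MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# The typical workflow, expressed as four requests (example values only).
workflow = [
    tool_call("list_docs"),
    tool_call("get_sections", filename="manual.pdf"),
    tool_call("read_section", filename="manual.pdf", section_title="Installation"),
    tool_call("search_docs", query="how do I reset the device?"),
]

for request in workflow:
    print(json.dumps(request))
```

The argument names (`filename`, `section_title`, `query`) follow the tool signatures listed in the table above.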

MCP Resources

Resource URI	Description
`docs://status`	Server status: document count, chunk count, embedding model, and docs directory

Development

git clone https://github.com/iSamBa/pdf2mcp.git
cd pdf2mcp
uv sync --all-extras
uv run pytest
uv run ruff check src/
uv run mypy src/

License

MIT

Release history

0.6.0: Mar 14, 2026
0.5.0: Mar 13, 2026
0.4.0: Mar 13, 2026 (this version)
0.3.0: Mar 13, 2026
0.2.3: Mar 12, 2026
0.2.2: Mar 12, 2026
