pdf-search-mcp

MCP server for full-text search across PDF document collections. Built for AI agents — index once, search instantly from any MCP client.

- Search entire collections: pre-indexes all PDFs for instant ranked results with snippets, rather than searching one file at a time
- Fully offline: no API keys, no cloud services, just SQLite FTS5 and PyMuPDF
- Page rendering: render pages as PNG for formulas, diagrams, and tables; crop to a region with auto-DPI scaling for detail shots
- Dual renderer: CoreGraphics on macOS (sharper math fonts), PyMuPDF on Linux/Windows
- German-aware: automatic expansion of ß↔ss, ä↔ae, ö↔oe, ü↔ue so both spellings match

Installation

From PyPI

pip install pdf-search-mcp

From source

git clone https://github.com/renvk/pdf-search-mcp.git
cd pdf-search-mcp
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

Requires Python 3.10+. On macOS, pyobjc-framework-Quartz is installed automatically for native CoreGraphics PDF rendering (sharper formula and math font output). On Linux/Windows, PyMuPDF is used as the renderer.

Quick Start

1. Index your PDFs

PDF_SEARCH_DIR=/path/to/your/pdfs python -m pdf_search_mcp.pdf_search index

2. Register with your MCP client

The server runs over stdio. Example for Claude Code:

# local scope (the default; only available to you in the current project)
claude mcp add pdf-search -- pdf-search-mcp

# or user scope (available in all your projects; older Claude Code releases called this "global")
claude mcp add --scope user pdf-search -- pdf-search-mcp

For other MCP clients, add to your MCP config:

{
  "mcpServers": {
    "pdf-search": {
      "command": "pdf-search-mcp"
    }
  }
}

3. Search

Ask your AI agent to search your PDFs — it will use the search, read_page, and read_page_image tools automatically.

Configuration

Environment Variable	Default	Description
`PDF_SEARCH_DIR`	(none)	Path to your PDF directory (required for first index, remembered after)
`PDF_SEARCH_DB`	`~/.local/share/pdf-search-mcp/pdf_index.db`	Path to the SQLite database file
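
As a rough illustration of how these two variables could be resolved, the sketch below reads them from an environment mapping and falls back to the documented default database path. The helper name `resolve_config` is hypothetical and not part of the package, and the "remembered directory" behavior is not modeled here.

```python
import os
from pathlib import Path

# Documented default location of the index database.
DEFAULT_DB = "~/.local/share/pdf-search-mcp/pdf_index.db"


def resolve_config(env=None):
    """Return (pdf_dir, db_path) from an environment mapping.

    Hypothetical helper for illustration only. pdf_dir is None when
    PDF_SEARCH_DIR is unset; the real tool would then fall back to the
    directory remembered from the first index run.
    """
    env = os.environ if env is None else env
    raw_dir = env.get("PDF_SEARCH_DIR")
    pdf_dir = Path(raw_dir).expanduser() if raw_dir else None
    db_path = Path(env.get("PDF_SEARCH_DB", DEFAULT_DB)).expanduser()
    return pdf_dir, db_path
```

Passing a plain dict instead of `os.environ` makes it easy to try different settings without touching the real environment.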

CLI Usage

The pdf_search.py module doubles as a CLI for indexing and direct search:

# Build index (first time — PDF_SEARCH_DIR required)
PDF_SEARCH_DIR=/path/to/pdfs python -m pdf_search_mcp.pdf_search index

# Subsequent syncs (path remembered from first index)
python -m pdf_search_mcp.pdf_search index

# Search from command line
python -m pdf_search_mcp.pdf_search search "query terms"

# Read a specific page
python -m pdf_search_mcp.pdf_search read filename.pdf 5

# Show index statistics
python -m pdf_search_mcp.pdf_search stats

# Rebuild index from scratch (path remembered)
python -m pdf_search_mcp.pdf_search reindex

Search Syntax

Uses SQLite FTS5 query syntax:

Syntax	Example	Description
Terms	`distributed consensus`	Both terms must appear (implicit AND)
Phrase	`"garbage collection"`	Exact phrase match
OR	`mutex OR semaphore`	Either term
NOT	`cache NOT redis`	Exclude term
Prefix	`concur*`	Prefix matching
NEAR	`NEAR(load balancer, 10)`	Terms within 10 tokens of each other
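
The syntax in the table can be tried directly against SQLite. The following is a minimal sketch that uses a throwaway in-memory table named `pages` (an assumption, not the package's real schema) and assumes your Python's `sqlite3` module was built with FTS5, which is common but not guaranteed.

```python
import sqlite3

# Tiny in-memory FTS5 table standing in for the real page index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(content)")
conn.executemany(
    "INSERT INTO pages(content) VALUES (?)",
    [
        ("distributed consensus with a mutex",),
        ("garbage collection and cache tuning with redis",),
        ("a load balancer sits in front of the cache",),
        ("concurrency needs a semaphore",),
    ],
)


def match(query):
    """Return the rowids whose content matches an FTS5 query string."""
    rows = conn.execute(
        "SELECT rowid FROM pages WHERE pages MATCH ? ORDER BY rowid", (query,)
    )
    return [r[0] for r in rows]


for query in (
    "distributed consensus",
    '"garbage collection"',
    "mutex OR semaphore",
    "cache NOT redis",
    "concur*",
    "NEAR(load balancer, 10)",
):
    print(query, "matches rows", match(query))
```

Note that FTS5 only recognizes `AND`, `OR`, `NOT`, and `NEAR` as operators when they are written in upper case.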

Auto-quoting: Terms containing any special character (dots, hyphens, commas, slashes, colons, ...) are automatically quoted (e.g., ISO-27001 becomes "ISO-27001", 1:100 becomes "1:100") because FTS5 treats these as token separators or operators. Query preparation guarantees valid FTS5 syntax — stray quotes are dropped, unbalanced parentheses are repaired, and dangling AND/OR operators are trimmed. The one exception is NOT without a left operand (FTS5's NOT is binary): it is passed through and returns a clear error, because silently searching the excluded term would invert the query's meaning.
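
To make the auto-quoting rule concrete, here is a simplified sketch. The function name `quote_special_terms` is hypothetical, and the code only handles whitespace-separated terms; it is not the package's `prepare_query` and ignores phrases, parentheses, and `NEAR()`.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}
# A "plain" term: word characters only, optionally a trailing * for prefix search.
PLAIN_TERM = re.compile(r"^\w+\*?$")


def quote_special_terms(query):
    """Quote whitespace-separated terms that contain special characters.

    Simplified, hypothetical sketch: operators and plain terms pass
    through unchanged, everything else becomes one FTS5 string.
    """
    out = []
    for term in query.split():
        if term in OPERATORS or PLAIN_TERM.match(term):
            out.append(term)
        else:
            # Drop stray quotes, then wrap the term as one FTS5 string.
            out.append('"' + term.replace('"', "") + '"')
    return " ".join(out)
```

With this sketch, `ISO-27001 Anhang` turns into `"ISO-27001" Anhang`, while `concur*` and `mutex OR semaphore` are left alone.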

German expansion: Umlauts and eszett are automatically expanded to their digraph equivalents and vice versa (ß↔ss, ä↔ae, ö↔oe, ü↔ue). Searching for Größe also finds Groesse, and Weißbuch also finds Weissbuch. Reverse expansion (ss→ß) replaces one position at a time. Expansion also applies inside NEAR() expressions.
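
One way to picture the expansion is as a function that returns every spelling to be OR-ed together. The sketch below is an assumption about the approach, not the package's code; `german_variants` and `expand_for_fts5` are hypothetical names.

```python
# Umlaut/eszett to digraph, per the mapping described above.
TO_DIGRAPH = {"ß": "ss", "ä": "ae", "ö": "oe", "ü": "ue",
              "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def german_variants(term):
    """Return the term plus alternative German spellings, sorted.

    Hypothetical sketch: the forward direction rewrites every special
    character at once; the reverse direction swaps one digraph
    occurrence at a time, as the text describes for ss -> ß.
    """
    variants = {term}

    # Forward: Größe -> Groesse
    forward = term
    for char, digraph in TO_DIGRAPH.items():
        forward = forward.replace(char, digraph)
    variants.add(forward)

    # Reverse: one position per variant, e.g. Weissbuch -> Weißbuch
    for char, digraph in TO_DIGRAPH.items():
        start = term.find(digraph)
        while start != -1:
            variants.add(term[:start] + char + term[start + len(digraph):])
            start = term.find(digraph, start + 1)
    return sorted(variants)


def expand_for_fts5(term):
    """Join the variants into an FTS5 OR group (or return the lone term)."""
    variants = german_variants(term)
    return variants[0] if len(variants) == 1 else "(" + " OR ".join(variants) + ")"
```

Because the reverse direction changes one position per variant, `Groesse` yields `Grösse` and `Groeße` but not `Größe` in this sketch.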

Auto-relaxation: When a multi-term query returns no results (all terms must appear on the same page), the search automatically relaxes: first by dropping the term least represented in the corpus (chosen by uncapped match counts), then by OR-ing all terms. A note in the output explains what was actually searched. Structured queries (explicit AND, OR, NOT, NEAR, parentheses) are not relaxed.
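
The relaxation order described above can be sketched as follows. This is an illustration under assumptions, not the body of `search_with_relaxation`: the `search` and `count` callables are hypothetical stand-ins for the real index, and the note wording is invented.

```python
def search_relaxed(terms, search, count):
    """Try progressively looser queries until something matches.

    Hypothetical sketch. `search(query)` returns a list of hits and
    `count(term)` returns how many pages contain a term.
    Returns (hits, note); note is None if no relaxation was needed.
    """
    hits = search(" ".join(terms))           # 1. strict: implicit AND
    if hits or len(terms) < 2:
        return hits, None

    rarest = min(terms, key=count)           # 2. drop the rarest term
    kept = [t for t in terms if t != rarest]
    hits = search(" ".join(kept))
    if hits:
        return hits, f"No page matched all terms; dropped '{rarest}'."

    hits = search(" OR ".join(terms))        # 3. match any term
    note = "No page matched all terms; searched for any term." if hits else None
    return hits, note
```

Returning the note alongside the hits mirrors the documented behavior of telling the caller what was actually searched.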

MCP Tools

Tool	Parameters	Description
`search`	`query`, `limit=10`	Full-text search with ranked results and snippets (limit range 1-50)
`read_page`	`filename`, `page`, `subfolder=None`	Read the full text of a specific page
`read_page_image`	`filename`, `page`, `dpi=140`, `region=None`, `subfolder=None`	Render a page (or cropped region) as PNG. `region=[x1,y1,x2,y2]` with 0.0–1.0 fractional coords to crop; DPI auto-scales for the cropped area
`stats`	(none)	Show index statistics (file count, pages, DB size, renderer)

When the same filename exists in several subfolders, read_page and read_page_image require the subfolder parameter ("" selects the root folder); an unspecified subfolder returns an error listing the candidates instead of picking one arbitrarily.
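
The disambiguation rule might look roughly like this. The helper `resolve_subfolder` and the shape of `indexed` are assumptions made for the example; the actual tools report the problem as an error message to the MCP client rather than raising Python exceptions.

```python
def resolve_subfolder(filename, indexed, subfolder=None):
    """Pick the subfolder for `filename` or raise if it is ambiguous.

    Hypothetical sketch: `indexed` is an iterable of (subfolder, filename)
    pairs, with "" meaning the root folder.
    """
    candidates = sorted(sub for sub, name in indexed if name == filename)
    if not candidates:
        raise FileNotFoundError(f"{filename} is not in the index")
    if subfolder is not None:
        if subfolder not in candidates:
            raise FileNotFoundError(f"{filename} not found in {subfolder!r}")
        return subfolder
    if len(candidates) > 1:
        raise ValueError(
            f"{filename} exists in several subfolders: {candidates}; "
            "pass subfolder= to choose one"
        )
    return candidates[0]
```

Distinguishing `None` (not specified) from `""` (root folder) is what lets the root copy be selected explicitly.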

Python API

from pdf_search_mcp import (
    search_with_relaxation, search_pdfs, prepare_query,
    read_pdf_page, render_pdf_page, index_pdfs,
)

# Index PDFs
index_pdfs("/path/to/pdfs")

# Search with the full pipeline (auto-quoting, German expansion,
# relaxation) — same behavior as the MCP search tool and the CLI
results, note = search_with_relaxation("ISO-27001 Anhang", limit=5)
for r in results:
    print(f"{r['subfolder']}/{r['file']} p.{r['page']}: {r['snippet']}")

# Low-level: search_pdfs takes a RAW FTS5 MATCH string (no preparation).
# Run user input through prepare_query first.
results = search_pdfs(prepare_query("garbage collection"), limit=5)

# Read full page text
text = read_pdf_page("document.pdf", 42)

# Render full page as PNG
png_path = render_pdf_page("document.pdf", 42)

# Render cropped region (DPI auto-scales to maximize detail)
png_path = render_pdf_page("document.pdf", 42, region=[0.0, 0.5, 1.0, 0.8])

How It Works

Indexing incrementally syncs your PDF directory into a SQLite FTS5 virtual table. On first run, all PDFs are indexed. On subsequent runs, only new, changed (by mtime/size), and deleted files are processed, each committed individually so an interrupted run resumes where it stopped. Only page content is searchable — filenames, subfolders, and page numbers are stored as unindexed metadata so query terms cannot match them. Directories starting with _ are skipped.
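
The change detection behind an incremental sync can be sketched as a comparison between the directory and the previously recorded state. `plan_sync` and the `known` mapping are hypothetical; the real indexer keeps this state in its SQLite database and commits file by file.

```python
from pathlib import Path


def plan_sync(pdf_dir, known):
    """Compare a PDF directory with the previously indexed state.

    Hypothetical sketch: `known` maps relative path -> (mtime, size) as
    recorded at the last index run. Returns (to_index, to_delete).
    """
    root = Path(pdf_dir)
    current = {}
    for path in root.rglob("*.pdf"):
        rel = path.relative_to(root)
        # Skip anything below a directory whose name starts with "_".
        if any(part.startswith("_") for part in rel.parts[:-1]):
            continue
        stat = path.stat()
        current[rel.as_posix()] = (stat.st_mtime, stat.st_size)

    # New or changed files need (re)indexing; vanished files are removed.
    to_index = sorted(p for p, sig in current.items() if known.get(p) != sig)
    to_delete = sorted(p for p in known if p not in current)
    return to_index, to_delete
```

Comparing mtime and size avoids re-reading unchanged PDFs, at the cost of missing edits that preserve both values.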

Upgrading to 0.3.0: the FTS5 schema changed (metadata columns are no longer searchable). Existing indexes are detected and refused with a clear error — run python -m pdf_search_mcp.pdf_search reindex once to rebuild.

Searching runs FTS5 MATCH queries and re-ranks results by combining BM25 relevance with match density: pages where search terms cluster together score higher than pages with the same terms scattered throughout. The density signal blends term concentration (matches per character) and spatial clustering (how tightly grouped the matches are).

Reading re-opens the original PDF file on disk (path resolved via the stored pdf_dir metadata) for full page text or image rendering. Region crops auto-scale DPI to fill a 1568 px long-edge budget, maximizing detail without producing oversized images.
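
The DPI scaling for region crops can be approximated with a little arithmetic. The sketch below assumes the DPI is chosen so that the longer edge of the cropped area lands exactly on the 1568 px budget, with page sizes given in PDF points (72 per inch); the function name `crop_dpi` and the absence of any upper or lower DPI cap are assumptions.

```python
def crop_dpi(page_width_pt, page_height_pt, region, budget_px=1568):
    """Estimate the DPI at which a cropped region fills the pixel budget.

    Hypothetical sketch. `region` is [x1, y1, x2, y2] in 0.0-1.0
    fractional page coordinates; page sizes are in PDF points
    (72 points per inch).
    """
    x1, y1, x2, y2 = region
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        raise ValueError("region must satisfy 0 <= x1 < x2 <= 1 and 0 <= y1 < y2 <= 1")
    crop_w_in = (x2 - x1) * page_width_pt / 72.0
    crop_h_in = (y2 - y1) * page_height_pt / 72.0
    # Scale so the longer edge of the crop uses the whole pixel budget.
    return budget_px / max(crop_w_in, crop_h_in)
```

For an A4 page (595 × 842 pt), the region `[0.0, 0.5, 1.0, 0.8]` works out to roughly 190 DPI under this assumption, while the full page comes to about 134 DPI.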

The database stores the text content only — original PDFs are accessed on disk for read_page and read_page_image. Rendering uses CoreGraphics on macOS and PyMuPDF elsewhere.

License

MIT

Maintainers

renvk

Release history

Version	Release date
0.4.0	Jun 11, 2026
0.3.0 (this version)	Jun 11, 2026
0.2.0	Mar 10, 2026
0.1.0	Mar 5, 2026

Download files

Source Distribution

pdf_search_mcp-0.3.0.tar.gz (47.5 kB)

Uploaded Jun 11, 2026 Source

Built Distribution

pdf_search_mcp-0.3.0-py3-none-any.whl (30.7 kB)

Uploaded Jun 11, 2026 Python 3

File details

Details for the file pdf_search_mcp-0.3.0.tar.gz.

File metadata

Download URL: pdf_search_mcp-0.3.0.tar.gz
Upload date: Jun 11, 2026
Size: 47.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf_search_mcp-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`d3e4a793c7949cf185979abec91c11de34c27e291a046ee7ae0b456ce84e38ef`
MD5	`0bc736ca7f87de7a51989225913b70b0`
BLAKE2b-256	`6c34b9fa5dcdb5b3cc19518d449c3fd2d2a98d256f5d1f6191fdaad3e9e512e0`
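
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch; the helper names are made up for the example and the file path is whatever you saved the download as.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file matches the published SHA-256 digest."""
    return sha256_of(path) == expected_sha256.lower()
```

For example, `verify_download("pdf_search_mcp-0.3.0.tar.gz", "d3e4a793...")` with the full digest from the table should return `True` for an intact download.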

Provenance

The following attestation bundles were made for pdf_search_mcp-0.3.0.tar.gz:

Publisher: publish.yml on renvk/pdf-search-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf_search_mcp-0.3.0.tar.gz
- Subject digest: d3e4a793c7949cf185979abec91c11de34c27e291a046ee7ae0b456ce84e38ef
- Sigstore transparency entry: 1789675101
- Sigstore integration time: Jun 11, 2026
Source repository:
- Permalink: renvk/pdf-search-mcp@bc315707da4e0f3b818e8eaec37cedc35aefe0c3
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/renvk
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@bc315707da4e0f3b818e8eaec37cedc35aefe0c3
- Trigger Event: release

File details

Details for the file pdf_search_mcp-0.3.0-py3-none-any.whl.

File metadata

Download URL: pdf_search_mcp-0.3.0-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 30.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf_search_mcp-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a894753b7e3e6e8e76decfbeba57b0d58fdd5d96010e722d98a1e7e76909486d`
MD5	`0282924ef7263034ad58cddbe8a9e389`
BLAKE2b-256	`38008c7f8589b61d1fce638467bc0b2f05afcfeb43f7f42d582ade2311c93dda`

Provenance

The following attestation bundles were made for pdf_search_mcp-0.3.0-py3-none-any.whl:

Publisher: publish.yml on renvk/pdf-search-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf_search_mcp-0.3.0-py3-none-any.whl
- Subject digest: a894753b7e3e6e8e76decfbeba57b0d58fdd5d96010e722d98a1e7e76909486d
- Sigstore transparency entry: 1789675309
- Sigstore integration time: Jun 11, 2026
Source repository:
- Permalink: renvk/pdf-search-mcp@bc315707da4e0f3b818e8eaec37cedc35aefe0c3
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/renvk
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@bc315707da4e0f3b818e8eaec37cedc35aefe0c3
- Trigger Event: release
