Structure-aware local PDF ingestion and retrieval for AI clients

PDF Context Server

PDF Context (pdf-context on PyPI, import pdf_context) is a local-first library and MCP server that transforms PDF documents into structured, retrievable context for AI applications.

Drop PDFs into a watch folder, and the server ingests them automatically — extracting structure, classifying document type, chunking with awareness of chapters/sections, embedding locally, and exposing retrieval tools that AI clients use to teach, answer questions, or navigate documents sequentially.

Drop in PDFs. Build context once. Query from anywhere.


Install

From PyPI:

pip install pdf-context

From source (development):

git clone https://github.com/yourusername/pdf-context-server.git
cd pdf-context-server
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env

Console entry points:

  • pdf-context — developer CLI (ingest, search, smoke tests)
  • pdf-context-mcp — MCP stdio server for AI clients

Programmatic API

from pdf_context import PdfContext, PdfContextConfig

config = PdfContextConfig(
    pdf_data_dir="/path/to/pdfs",
    storage_dir="/path/to/storage",
)
ctx = PdfContext(config, watch=False)
ctx.ingest("my-book.pdf")
results = ctx.search("virtual memory", document="my-book.pdf", top_k=5)
print(results["chunks"])

Each PdfContext instance is isolated: separate (pdf_dir, storage_dir) pairs get separate SQLite + Chroma indexes.


Features (v1)

  • PDF ingestion with folder watch and background job queue
  • Structure extraction from PDF outlines, heading heuristics, or page-level fallback
  • Auto document classification: textbook, technical_reference, paper, notes
  • Per-type retrieval profiles (chunk size, sequential vs semantic-first)
  • Local embeddings (sentence-transformers default, Ollama optional)
  • ChromaDB vector storage + SQLite metadata
  • Semantic search with structure filters (chapter, section, page range)
  • Sequential navigation for chapter-by-chapter learning
  • MCP stdio server for any compatible AI client
  • Production-grade local reliability: dedup, retries, checkpoints, resume

Architecture

PDF Documents (data/pdfs/)
      │
      ▼
 Folder Watcher ──► Job Queue (SQLite)
      │
      ▼
 Structure Extract + Classify + Parse + Chunk + Embed
      │
      ├──► ChromaDB (vectors + metadata)
      └──► SQLite (documents, structure, chunks, jobs)
      │
      ▼
 MCP Server (stdio)
      ├── NavigationalEngine (sequential / section content)
      └── SemanticEngine (scoped semantic search)
      │
      ▼
 AI Client (Cursor, Claude Desktop, etc.)

Project Structure

pdf-context-server/
├── pdf_context/                # installable package
│   ├── client.py               # PdfContext public API
│   ├── config.py               # PdfContextConfig
│   ├── context.py              # AppContext runtime
│   ├── cli.py                  # pdf-context CLI
│   ├── mcp/                    # MCP factory + stdio entry
│   ├── classification/
│   ├── structure/
│   ├── parsers/
│   ├── chunking.py
│   ├── embeddings.py
│   ├── vector_store.py
│   ├── db/
│   ├── ingest/
│   ├── retrieval/
│   └── skills/                 # bundled agent skills (CLI install)
├── app/                        # deprecated shim (python -m app.main)
├── .cursor/
│   ├── mcp.json                # project MCP config (example)
│   └── skills/pdf-context/
├── data/pdfs/
├── storage/
├── tests/
├── pyproject.toml
├── requirements.txt            # dev convenience (see pyproject.toml)
├── .env.example
└── README.md

Installation (legacy dev clone)

See Install above. requirements.txt mirrors runtime deps; prefer pip install -e ".[dev]".


Quick test (no MCP required)

Drop a PDF in data/pdfs/, then run one command:

pdf-context smoke

That ingests all PDFs, runs a sample search, and prints PASS or FAIL with details.

Other useful commands:

pdf-context status
pdf-context list
pdf-context ingest
pdf-context ingest "my-book.pdf"
pdf-context search "virtual memory" -d "my-book.pdf"
pdf-context --pdf-dir /path/pdfs --storage-dir /path/storage status
pdf-context skill list
pdf-context skill install
pytest

Or with Make: make smoke, make status, make test.

MCP is for daily use in Cursor. The CLI is for verifying everything works without configuring or reloading MCP.


Adding Documents

Place PDFs in your configured PDF folder (default data/pdfs/):

data/pdfs/
├── operating-systems.pdf
├── api-reference.pdf
└── lecture-notes.pdf

Keep PDF and storage folders separate. pdf_data_dir and storage_dir must not be the same path, and neither may live inside the other. If they overlap, the folder watcher can pick up Chroma/SQLite files, or ingest metadata can end up in your PDF tree. Use non-overlapping directories; the defaults (data/pdfs/ and storage/) are fine.

The folder watcher auto-enqueues new or changed PDFs for ingestion.

Optional type override sidecar:

data/pdfs/operating-systems.pdf.meta.json
{ "doc_type": "textbook" }

Valid types: textbook, technical_reference, paper, notes
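
As a minimal sketch of the sidecar convention above, the helper below writes a <pdf name>.meta.json file next to a PDF and rejects types outside the documented list. The function name write_type_override is hypothetical and not part of pdf_context; only the file naming and the doc_type key come from this README.

```python
import json
from pathlib import Path

# Valid types as listed in this README.
VALID_DOC_TYPES = {"textbook", "technical_reference", "paper", "notes"}


def write_type_override(pdf_path: Path, doc_type: str) -> Path:
    """Write '<pdf name>.meta.json' beside the PDF.

    Hypothetical helper for illustration; not part of the pdf_context API.
    """
    if doc_type not in VALID_DOC_TYPES:
        raise ValueError(
            f"unknown doc_type {doc_type!r}; expected one of {sorted(VALID_DOC_TYPES)}"
        )
    # operating-systems.pdf -> operating-systems.pdf.meta.json
    sidecar = pdf_path.with_name(pdf_path.name + ".meta.json")
    sidecar.write_text(json.dumps({"doc_type": doc_type}) + "\n", encoding="utf-8")
    return sidecar
```

Calling write_type_override(Path("data/pdfs/operating-systems.pdf"), "textbook") would produce the sidecar shown above, provided data/pdfs/ exists.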


MCP Setup

Enable pdf-context only in projects where PDFs are your source of truth. Avoid enabling it globally in Cursor user settings if most chats are code or general work: in a project where the server is not connected, the model cannot call PDF tools at all, which keeps unrelated chats free of PDF lookups.

Add to project .cursor/mcp.json (Cursor) or your client's MCP config:

{
  "mcpServers": {
    "pdf-context": {
      "command": "pdf-context-mcp",
      "args": [
        "--pdf-dir", "/absolute/path/to/pdfs",
        "--storage-dir", "/absolute/path/to/storage"
      ]
    }
  }
}

No repo clone required after pip install pdf-context. For local dev, point command at .venv/bin/pdf-context-mcp.

Legacy (deprecated): "command": "python", "args": ["-m", "app.main"]

Use a descriptive server name (pdf-context, pdf-ml-book, pdf-papers) so rules and skills can refer to the right corpus.

Restart or reload MCP after changing config.

Multiple corpora (textbooks vs papers)

Run one MCP process per (pdf folder, storage) pair. Example:

{
  "mcpServers": {
    "pdf-textbooks": {
      "command": "pdf-context-mcp",
      "args": [
        "--pdf-dir", "/Users/me/books",
        "--storage-dir", "/Users/me/.pdf-context/books",
        "--instance-id", "textbooks"
      ]
    },
    "pdf-papers": {
      "command": "pdf-context-mcp",
      "args": [
        "--pdf-dir", "/Users/me/papers",
        "--storage-dir", "/Users/me/.pdf-context/papers",
        "--instance-id", "papers"
      ]
    }
  }
}

Or via environment (PDF_CONTEXT_PDF_DATA_DIR, PDF_CONTEXT_STORAGE_DIR; legacy PDF_DATA_DIR / STORAGE_DIR still work):

"env": {
  "PDF_CONTEXT_PDF_DATA_DIR": "/Users/me/books",
  "PDF_CONTEXT_STORAGE_DIR": "/Users/me/.pdf-context/books"
}
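
Because each corpus entry in the multiple-corpora example follows the same pattern, the config can also be generated rather than hand-edited. The sketch below is one way to do that with the standard library; the helper names are hypothetical, and the "pdf-<instance id>" server naming simply mirrors the example above rather than anything the server requires.

```python
import json


def mcp_server_entry(pdf_dir: str, storage_dir: str, instance_id: str) -> dict:
    """Build one server entry using the flags shown in this README."""
    return {
        "command": "pdf-context-mcp",
        "args": [
            "--pdf-dir", pdf_dir,
            "--storage-dir", storage_dir,
            "--instance-id", instance_id,
        ],
    }


def build_mcp_config(corpora: dict) -> dict:
    """Map {instance_id: (pdf_dir, storage_dir)} to an mcp.json-style dict.

    Hypothetical helper; server names follow the 'pdf-<instance_id>' pattern
    used in the example above.
    """
    return {
        "mcpServers": {
            f"pdf-{instance_id}": mcp_server_entry(pdf_dir, storage_dir, instance_id)
            for instance_id, (pdf_dir, storage_dir) in corpora.items()
        }
    }


if __name__ == "__main__":
    config = build_mcp_config({
        "textbooks": ("/Users/me/books", "/Users/me/.pdf-context/books"),
        "papers": ("/Users/me/papers", "/Users/me/.pdf-context/papers"),
    })
    # Paste the output into .cursor/mcp.json or your client's MCP config.
    print(json.dumps(config, indent=2))
```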

When the AI client should call MCP tools

The model chooses tools from your message, tool descriptions, and project skills—it is not automatic. This project steers that behavior in three layers:

  1. Tool docstrings in pdf_context/mcp/server.py — each tool states use when / do not use when.
  2. Project skill: .cursor/skills/pdf-context/SKILL.md tells Cursor when to use pdf-context vs codebase tools.
  3. Project-scoped MCP — enable the server only where PDFs matter.

User intent                          Expected tools
Fix code / git / tests               None (pdf-context idle)
"What does the book say about X?"    search_pdf_context (+ maybe list_documents)
Chapter walkthrough with cites       list_chapters, get_section_content, get_next_chunks
"Is my PDF indexed?"                 get_ingest_status, list_documents
Casual chat                          None

Phrases that help: "From the indexed PDFs…", "Search [filename] for…", "Don't guess—use pdf-context."

Phrases that skip PDF tools: "In general (no PDF)", "Fix this Python file."

After pulling this repo, reload MCP so clients pick up new tool descriptions.

Install agent skill for any AI client

Bundled skills live in pdf_context/skills/. Install into Cursor, Claude Code, VS Code Copilot, Codex/AGENTS.md, Windsurf, Gemini, or a custom path:

pdf-context skill install
pdf-context skill list
pdf-context skill install -s pdf-context -c claude-code -p .

--client          Writes to
cursor-project    .cursor/skills/pdf-context/SKILL.md
cursor-global     ~/.cursor/skills/pdf-context/SKILL.md
claude-code       CLAUDE.md
vscode-copilot    .github/copilot-instructions.md
codex-agents      AGENTS.md
windsurf          .windsurfrules
gemini            GEMINI.md
custom            path from --output

Markdown targets include marked blocks (<!-- pdf-context-skill:start/end -->) so re-running install can update the section without wiping your file.
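
The marked-block behavior can be pictured with a small sketch. It assumes the markers expand to `<!-- pdf-context-skill:start -->` and `<!-- pdf-context-skill:end -->`, which is a guess from the shorthand above, and upsert_marked_block is a hypothetical name; the installer's actual logic may differ.

```python
# Assumed expansion of the "start/end" shorthand used in this README.
START = "<!-- pdf-context-skill:start -->"
END = "<!-- pdf-context-skill:end -->"


def upsert_marked_block(existing: str, body: str) -> str:
    """Replace the marked block if present, otherwise append it.

    Text outside the markers is returned unchanged.
    """
    block = f"{START}\n{body.strip()}\n{END}"
    start = existing.find(START)
    end = existing.find(END, start + len(START)) if start != -1 else -1
    if start != -1 and end != -1:
        # Swap only the span between (and including) the two markers.
        return existing[:start] + block + existing[end + len(END):]
    # No block yet: append it, separated from prior content by a blank line.
    if not existing or existing.endswith("\n\n"):
        separator = ""
    elif existing.endswith("\n"):
        separator = "\n"
    else:
        separator = "\n\n"
    return existing + separator + block + "\n"
```

Running it twice with the same body returns the same text, which is the property that lets a re-install refresh the skill section without touching the rest of CLAUDE.md or AGENTS.md.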


MCP Tools

Tool                    Purpose
list_documents          Corpus check: what's indexed; call if unsure of scope
get_ingest_status       Queue health; new PDFs; empty search debugging
get_document_profile    Doc type, retrieval profile, per-document guidance
list_structure          Full TOC tree
list_chapters           Flat chapter list (textbooks)
get_section_content     Ordered chunks for a chapter/section
get_next_chunks         Sequential read-ahead from cursor
search_pdf_context      Semantic search with optional structure filters
set_document_type       Override auto-classification (when user asks)
reingest_document       Force re-index (when user asks)

Each tool's MCP description includes when to call it and when to skip it.


Document Types

Type                   Treatment
textbook               Sequential chapter navigation; larger chunks; chapter-scoped search
technical_reference    Semantic-first; section-scoped search; no forced sequential reading
paper                  Section-scoped semantic search (abstract, methods, etc.)
notes                  Weak structure; semantic-only; page-level fallback navigation

Classification is automatic at ingest. Override via .meta.json or set_document_type.


Chapter-by-Chapter Learning Workflow

The AI client holds progress via the cursor returned by navigational tools.

1. get_document_profile("operating-systems.pdf")
2. list_chapters("operating-systems.pdf")
3. get_section_content("operating-systems.pdf", node_id=<chapter_id>, limit=5)
4. [Client teaches / summarizes from returned chunks]
5. search_pdf_context("page faults", document="operating-systems.pdf", chapter_id=<id>)
6. get_next_chunks("operating-systems.pdf", cursor=<last_cursor>, limit=5)

For unstructured notes, use list_structure and semantic search without sequential navigation.
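
The cursor hand-off in steps 3 and 6 can be illustrated with an in-memory stand-in. Everything below is a sketch under assumptions: the real tools run over MCP, the cursor format is not documented here (a stringified offset is used purely for illustration), and CHUNKS, Page, and read_from are made-up names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    chunks: List[str]
    cursor: Optional[str]  # None once the document is exhausted


# Stand-in corpus: ordered chunks for one document (illustrative data only).
CHUNKS = [f"chunk-{i}" for i in range(12)]


def read_from(position: int, limit: int) -> Page:
    """Return up to `limit` chunks from `position`, plus a cursor for the next call."""
    batch = CHUNKS[position:position + limit]
    next_position = position + len(batch)
    cursor = str(next_position) if next_position < len(CHUNKS) else None
    return Page(batch, cursor)


def get_next_chunks(cursor: str, limit: int = 5) -> Page:
    """Stand-in for the MCP tool of the same name: resume where the last call stopped."""
    return read_from(int(cursor), limit)


def read_all(limit: int = 5) -> List[str]:
    """Client-side loop: the client, not the server, carries the cursor between calls."""
    page = read_from(0, limit)
    seen = list(page.chunks)
    while page.cursor is not None:
        page = get_next_chunks(page.cursor, limit)
        seen.extend(page.chunks)
    return seen
```

The point of the design is that the server stays stateless about reading progress: whatever cursor the last call returned is all the client needs to continue.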


Configuration

See .env.example. Key settings:

Variable                                Default                  Description
PDF_CONTEXT_PDF_DATA_DIR                data/pdfs                PDF watch folder (must not overlap storage)
PDF_CONTEXT_STORAGE_DIR                 storage                  SQLite + Chroma (must not overlap PDF folder)
PDF_CONTEXT_EMBEDDING_PROVIDER          sentence_transformers    Embedding provider: sentence_transformers or ollama
PDF_CONTEXT_EMBEDDING_MODEL             all-MiniLM-L6-v2         Local embedding model
PDF_CONTEXT_WATCH_ENABLED               true                     Auto-ingest on folder changes
PDF_CONTEXT_CHECKPOINT_PAGE_INTERVAL    50                       Resume checkpoint during large ingests

Legacy PDF_DATA_DIR / STORAGE_DIR (no prefix) are accepted for one release.

Path layout rule: After resolving to absolute paths, pdf_data_dir and storage_dir must differ and must not be nested (parent/child). Configuration is validated at startup when directories are created; invalid layouts raise a clear error.
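
A minimal sketch of the layout rule, using only pathlib. It is an illustration of the check described above, not the library's own validation code, and validate_layout is a hypothetical name.

```python
from pathlib import Path
from typing import Tuple


def validate_layout(pdf_data_dir: str, storage_dir: str) -> Tuple[Path, Path]:
    """Resolve both paths and reject identical or nested layouts."""
    pdf = Path(pdf_data_dir).expanduser().resolve()
    storage = Path(storage_dir).expanduser().resolve()
    if pdf == storage:
        raise ValueError(f"pdf_data_dir and storage_dir are the same path: {pdf}")
    # Path.parents lists every ancestor, so membership means "nested inside".
    if pdf in storage.parents:
        raise ValueError(f"storage_dir {storage} is inside pdf_data_dir {pdf}")
    if storage in pdf.parents:
        raise ValueError(f"pdf_data_dir {pdf} is inside storage_dir {storage}")
    return pdf, storage
```

With the defaults, validate_layout("data/pdfs", "storage") passes, while validate_layout("data", "data/pdfs") raises because the PDF folder sits inside the storage folder.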

First ingest of a large library (20+ textbooks, ~20k pages) on CPU may take hours. Checkpoints make ingestion resumable if interrupted.
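
To make the resume behavior concrete, here is a sketch of page-interval checkpointing. It assumes a JSON file as the checkpoint store purely for illustration (the project keeps its job state in SQLite), and ingest_pages and process_page are hypothetical names.

```python
import json
from pathlib import Path
from typing import Callable

CHECKPOINT_PAGE_INTERVAL = 50  # default of PDF_CONTEXT_CHECKPOINT_PAGE_INTERVAL


def ingest_pages(
    total_pages: int,
    checkpoint_file: Path,
    process_page: Callable[[int], None],
    interval: int = CHECKPOINT_PAGE_INTERVAL,
) -> None:
    """Process pages in order, saving the next page to process every `interval` pages."""
    start = 0
    if checkpoint_file.exists():
        # A previous run was interrupted: pick up from the last saved position.
        start = json.loads(checkpoint_file.read_text())["next_page"]
    for page in range(start, total_pages):
        process_page(page)
        if (page + 1) % interval == 0:
            checkpoint_file.write_text(json.dumps({"next_page": page + 1}))
    # Finished cleanly, so the checkpoint is no longer needed.
    checkpoint_file.unlink(missing_ok=True)
```

In this scheme, pages processed after the last checkpoint are repeated on resume, so per-page work has to be safe to run twice; a smaller interval trades more checkpoint writes for less repeated work.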


Technology Stack

  • Python 3.11+
  • PyMuPDF — PDF parsing and outline extraction
  • sentence-transformers — local embeddings
  • ChromaDB — vector storage
  • SQLite — metadata, structure, job queue
  • MCP — AI client integration
  • watchdog — folder watching

Development

pip install -e ".[dev]"
pytest
pdf-context --help
pdf-context-mcp --help

Vision

PDF Context Server converts static PDFs into structured, searchable knowledge that AI applications consume on demand — without re-uploading documents every session.

Retrieval, not synthesis: the server returns ranked chunks and structure metadata; your AI client generates answers, lessons, and summaries.
