MCP server enabling AI agents to efficiently search and understand PDF documents

Paper Intelligence MCP Server

A local MCP (Model Context Protocol) server that converts papers and other PDFs to Markdown, indexes them, and exposes text and RAG (retrieval-augmented generation) search to AI agents.

Quick Start

Claude Code CLI:

claude mcp add paper-intelligence -- uvx paper-intelligence

VS Code:

code --add-mcp '{"name":"paper-intelligence","command":"uvx","args":["paper-intelligence"]}'

Features

  • PDF to Markdown: Convert PDFs using Marker with high accuracy
  • Header Indexing: Extract document structure into searchable JSON
  • Semantic Search: RAG-powered search using LlamaIndex + ChromaDB + HuggingFace embeddings
  • Hybrid Search: Combined grep (text/regex) + semantic search
  • GPU Acceleration: MPS (Apple Silicon) and CUDA support
  • Self-contained: Each paper gets its own directory with all data
  • Version Tracking: Metadata records which paper-intelligence version processed each paper

Installation

Option 1: Install from PyPI (Recommended)

# Install with pip
pip install paper-intelligence

# Or run directly with uvx (no install needed)
uvx paper-intelligence

Option 2: Install from GitHub

# Install directly from GitHub (no clone needed)
pip install "paper-intelligence @ git+https://github.com/Strand-AI/paper-intelligence.git"

Option 3: Local Development

git clone https://github.com/Strand-AI/paper-intelligence.git
cd paper-intelligence

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install in development mode
pip install -e ".[dev]"

# Run the server
python -m paper_intelligence.server

MCP Client Configuration

Claude Code CLI

The easiest way to add the server:

claude mcp add paper-intelligence -- uvx paper-intelligence

Verify installation:

claude mcp list

Claude Desktop

Add to your Claude Desktop config:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "paper-intelligence": {
      "command": "uvx",
      "args": ["paper-intelligence"]
    }
  }
}
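
If the config file already lists other servers, the new entry has to be merged in rather than pasted over them. The sketch below shows one way to do that from Python. The helper names add_server and update_config_file are hypothetical and not part of paper-intelligence; only the "mcpServers" layout comes from the snippet above.

```python
import json
from pathlib import Path


def add_server(config: dict, name: str, command: str, args: list[str]) -> dict:
    """Return a copy of config with one more entry under "mcpServers"."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep existing servers
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read the config at path (if present), add the server, write it back."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    updated = add_server(existing, "paper-intelligence", "uvx", ["paper-intelligence"])
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
```

Call update_config_file with the path for your platform from the list above, then restart Claude Desktop.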

VS Code

One-liner install:

code --add-mcp '{"name":"paper-intelligence","command":"uvx","args":["paper-intelligence"]}'

Or add it manually. In User Settings (JSON) the entry looks like this; in .vscode/mcp.json, drop the outer "mcp" wrapper and put "servers" at the top level:

{
  "mcp": {
    "servers": {
      "paper-intelligence": {
        "command": "uvx",
        "args": ["paper-intelligence"]
      }
    }
  }
}

Cursor

  1. Go to Settings → MCP → Add new MCP Server
  2. Select command type
  3. Enter: uvx paper-intelligence

Or add to your Cursor MCP config:

{
  "mcpServers": {
    "paper-intelligence": {
      "command": "uvx",
      "args": ["paper-intelligence"]
    }
  }
}

Windsurf

Add to your Windsurf MCP configuration:

{
  "mcpServers": {
    "paper-intelligence": {
      "command": "uvx",
      "args": ["paper-intelligence"]
    }
  }
}

Output Structure

Processing ~/Downloads/paper.pdf creates ~/Downloads/paper/ next to the PDF:

paper/
├── paper.md        # Converted markdown
├── metadata.json   # Processing version and info
├── index.json      # Header hierarchy (for search context)
├── chroma/         # Embeddings database
└── images/         # Extracted images (if any)
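
A client that wants to know where the files will land can derive the paths from the PDF location. This is a minimal sketch assuming the layout shown above, with the Markdown file named after the PDF stem. The function expected_layout is hypothetical and is not exported by the package.

```python
from pathlib import Path


def expected_layout(pdf_path: str) -> dict[str, Path]:
    """Map each artifact name to the path it should have for this PDF."""
    pdf = Path(pdf_path).expanduser()
    out_dir = pdf.parent / pdf.stem  # ~/Downloads/paper.pdf -> ~/Downloads/paper
    return {
        "output_dir": out_dir,
        "markdown": out_dir / f"{pdf.stem}.md",
        "metadata": out_dir / "metadata.json",
        "index": out_dir / "index.json",
        "chroma": out_dir / "chroma",
        "images": out_dir / "images",
    }


def missing_artifacts(pdf_path: str) -> list[str]:
    """List the artifacts that do not exist yet (images/ is optional)."""
    layout = expected_layout(pdf_path)
    required = ("markdown", "metadata", "index", "chroma")
    return [name for name in required if not layout[name].exists()]
```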

MCP Tools

process_paper

Full pipeline: Convert PDF, index headers, and create embeddings.

process_paper(
    pdf_path="~/Downloads/paper.pdf",
    use_llm=False,      # Set True for enhanced accuracy
    chunk_size=512,
    chunk_overlap=50
)
# Returns: output_dir, markdown_path, images_dir (if images extracted), image_count

convert_pdf

Convert a PDF file to Markdown.

convert_pdf(
    pdf_path="~/Downloads/paper.pdf",
    output_dir=None,  # Defaults to ~/Downloads/paper/
    use_llm=False
)
# Returns: markdown_path, images_dir (if images extracted), image_count

index_markdown

Extract header hierarchy into searchable JSON.

index_markdown(
    markdown_path="~/Downloads/paper/paper.md"
)
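
To make "header hierarchy" concrete, here is a rough sketch of how Markdown headers can be turned into index entries that carry their full path. The entry fields (line, level, title, path) are assumptions made for illustration. The actual index.json schema is not documented here and may differ.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # code-fence marker; headers inside fences are ignored


def index_headers(markdown: str) -> list[dict]:
    """Return one entry per ATX header, with its ancestors joined by ' > '."""
    entries = []
    stack = []  # (level, title) pairs for the headers currently "open"
    in_fence = False
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADER_RE.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        # Close headers at the same or a deeper level before opening this one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        entries.append({
            "line": line_no,
            "level": level,
            "title": title,
            "path": " > ".join(t for _, t in stack),
        })
    return entries
```

The path field is what allows a search hit to be reported as "Methods > Data Collection" instead of a bare line number.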

embed_document

Create embeddings for semantic search.

embed_document(
    markdown_path="~/Downloads/paper/paper.md",
    chunk_size=512,
    chunk_overlap=50
)
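
The chunk_size and chunk_overlap parameters control how the document is cut up before embedding. The sketch below illustrates the idea with a plain sliding window over words. It is a simplification, not the server's splitter: the real pipeline goes through LlamaIndex, which may count tokens and respect sentence boundaries. The function chunk_words is hypothetical.

```python
def chunk_words(words: list[str], chunk_size: int = 512, chunk_overlap: int = 50) -> list[list[str]]:
    """Split words into windows of chunk_size that share chunk_overlap items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

A larger overlap reduces the chance that a sentence relevant to a query is split across two chunks, at the cost of more chunks to embed and store.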

search

Unified search with grep and/or RAG.

search(
    query="transformer attention mechanism",
    paper_dirs=["~/Downloads/paper1", "~/Downloads/paper2"],
    mode="hybrid",  # "grep", "rag", or "hybrid"
    top_k=5
)

get_paper_info

Check processing status of a paper directory.

get_paper_info("~/Downloads/paper")
# Returns: has_markdown, has_index, has_embeddings, has_images,
#          images_dir, image_files, image_count,
#          version info, metadata

Extracted Images

When PDFs contain images (figures, diagrams, etc.), they are automatically extracted to an images/ subdirectory. The agent using this MCP server can:

  1. Check get_paper_info() to see if images exist and get the images_dir path
  2. Access individual image files listed in image_files
  3. Reference images from the converted markdown (images are linked in the .md file)

Version Compatibility

Each processed paper directory includes a metadata.json file tracking:

  • paper_intelligence_version: Version used for processing
  • processed_at: Timestamp of processing
  • source_pdf: Original PDF filename
  • steps_completed: Which processing steps were run

When accessing papers, get_paper_info() checks version compatibility and warns if re-processing might be beneficial.
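
As an illustration of what such a check can look like, the sketch below reads metadata.json and compares the recorded version with a current one. It assumes plain dotted numeric versions such as "0.1.1". The rule get_paper_info() actually applies is not documented here, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def parse_version(text: str) -> tuple[int, ...]:
    """Turn "0.1.1" into (0, 1, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def needs_reprocessing(metadata: dict, current_version: str) -> bool:
    """True if the paper was processed by an older version, or by none recorded."""
    recorded = metadata.get("paper_intelligence_version")
    if recorded is None:
        return True
    return parse_version(recorded) < parse_version(current_version)


def load_metadata(paper_dir: str) -> dict:
    """Read metadata.json from a processed paper directory."""
    path = Path(paper_dir).expanduser() / "metadata.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

Comparing tuples of integers rather than raw strings avoids ranking "0.10.0" below "0.9.3".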

How Search Uses index.json

The index.json file stores the header hierarchy extracted from the markdown. When you search:

  1. Grep search: Uses index.json to provide header context for matches (e.g., "Methods > Data Collection")
  2. RAG search: Returns semantic matches from the embedded chunks

The index enables fast header lookups without re-parsing the markdown on each search.
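
The lookup described above can be sketched as a binary search over header line numbers. This assumes each index entry carries a 1-based "line" and a "path" string and that entries are sorted by line. Those field names are illustrative guesses, not the documented index.json schema, and grep_with_context is a hypothetical helper.

```python
import bisect
import re


def grep_with_context(markdown: str, pattern: str, headers: list[dict]) -> list[dict]:
    """Find lines matching pattern and attach the nearest preceding header path."""
    header_lines = [h["line"] for h in headers]
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        if not regex.search(line):
            continue
        # Index of the last header at or before this line; -1 means none.
        pos = bisect.bisect_right(header_lines, line_no) - 1
        context = headers[pos]["path"] if pos >= 0 else ""
        hits.append({"line": line_no, "text": line.strip(), "context": context})
    return hits
```

Because the header positions are precomputed, each match costs one logarithmic lookup instead of a fresh scan of the document.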

Technical Stack

  • MCP: Official Python SDK with FastMCP
  • PDF Conversion: marker-pdf
  • Embeddings: LlamaIndex + HuggingFace (BAAI/bge-small-en-v1.5)
  • Vector Store: ChromaDB (persistent, local per-paper)
  • GPU: PyTorch with MPS (Apple Silicon) or CUDA support
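
To check which accelerator PyTorch can see on your machine, a small probe like the following is usually enough. It is a sketch of common device-selection logic, not the server's own code, and pick_device is a hypothetical name. It falls back to "cpu" when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available in this environment
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```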

Development

pip install -e ".[dev]"

# Run unit tests (fast)
pytest tests/test_markdown_parser.py

# Run integration tests (slow, requires ML models)
pytest tests/test_integration.py -v

To use your local development version with MCP clients, replace uvx paper-intelligence with:

python -m paper_intelligence.server

Debugging

Use the MCP Inspector to debug the server:

npx @modelcontextprotocol/inspector uvx paper-intelligence

Troubleshooting

Server not starting?

  • Ensure Python 3.11+ is installed
  • Try uvx paper-intelligence directly to see error messages
  • Check that all dependencies installed correctly

Windows encoding issues? Add an "env" block to the paper-intelligence server entry in your config, next to "command" and "args":

"env": {
  "PYTHONIOENCODING": "utf-8"
}
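
To show where the block sits, the sketch below builds a complete "mcpServers" entry with the env setting and prints it as JSON. The helper server_entry is hypothetical, and the output follows the Claude Desktop style layout used earlier. Adjust the outer keys for clients that use a different wrapper.

```python
import json
from typing import Optional


def server_entry(env: Optional[dict] = None) -> dict:
    """Build the server entry; "env" is a sibling of "command" and "args"."""
    entry = {"command": "uvx", "args": ["paper-intelligence"]}
    if env:
        entry["env"] = dict(env)
    return entry


config = {
    "mcpServers": {
        "paper-intelligence": server_entry({"PYTHONIOENCODING": "utf-8"})
    }
}
print(json.dumps(config, indent=2))
```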

Claude Desktop not detecting changes? Claude Desktop only reads configuration on startup. Fully restart the app after config changes.

License

MIT
