Paper Intelligence MCP Server

A local MCP (Model Context Protocol) server for intelligent paper/PDF management with RAG capabilities.

Features

  • PDF to Markdown: Convert PDFs using Marker with high accuracy
  • Header Indexing: Extract document structure into searchable JSON
  • Semantic Search: RAG-powered search using LlamaIndex + ChromaDB + HuggingFace embeddings
  • Hybrid Search: Combined grep (text/regex) + semantic search
  • GPU Acceleration: MPS (Apple Silicon) and CUDA support
  • Self-contained: Each paper gets its own directory with all data
  • Version Tracking: metadata.json records which paper-intelligence version processed each paper

Installation

Option 1: Install from PyPI (Recommended)

# Install with pip
pip install paper-intelligence

# Or run directly with uvx (no install needed)
uvx paper-intelligence

Option 2: Install from GitHub

# Install directly from GitHub (no clone needed)
pip install "paper-intelligence @ git+https://github.com/Strand-AI/paper-intelligence.git"

Option 3: Local Development

git clone https://github.com/Strand-AI/paper-intelligence.git
cd paper-intelligence

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install in development mode
pip install -e ".[dev]"

# Run the server
python -m paper_intelligence.server

MCP Client Configuration

Claude Desktop

Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS or %APPDATA%\Claude\claude_desktop_config.json on Windows; Linux builds typically use ~/.config/Claude/claude_desktop_config.json):

Using uvx (recommended):

{
  "mcpServers": {
    "paper-intelligence": {
      "command": "uvx",
      "args": ["paper-intelligence"]
    }
  }
}

Using local install:

{
  "mcpServers": {
    "paper-intelligence": {
      "command": "/path/to/paper-intelligence/.venv/bin/python",
      "args": ["-m", "paper_intelligence.server"]
    }
  }
}
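
If the config file already lists other MCP servers, the entry has to be merged in rather than pasted over the file. The following is a minimal sketch of that merge, not part of paper-intelligence: `add_server_entry` is a hypothetical helper, and the config path is whatever applies to your client.

```python
import json
from pathlib import Path


def add_server_entry(config_path: Path, name: str, entry: dict) -> dict:
    """Merge one MCP server entry into a JSON config, keeping existing entries.

    Hypothetical helper for illustration; not shipped with paper-intelligence.
    """
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    # Create the "mcpServers" object if missing, then add or overwrite this one entry.
    config.setdefault("mcpServers", {})[name] = entry
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, `add_server_entry(path, "paper-intelligence", {"command": "uvx", "args": ["paper-intelligence"]})` writes the uvx variant shown above and leaves every other key in the file untouched.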

Claude Code

Add to your Claude Code config (~/.claude.json):

Using uvx (recommended):

{
  "mcpServers": {
    "paper-intelligence": {
      "type": "stdio",
      "command": "uvx",
      "args": ["paper-intelligence"]
    }
  }
}

Using local install:

{
  "mcpServers": {
    "paper-intelligence": {
      "type": "stdio",
      "command": "/path/to/paper-intelligence/.venv/bin/python",
      "args": ["-m", "paper_intelligence.server"],
      "cwd": "/path/to/paper-intelligence"
    }
  }
}

Output Structure

Processing ~/Downloads/paper.pdf creates the directory ~/Downloads/paper/:

paper/
├── paper.md        # Converted markdown
├── metadata.json   # Processing version and info
├── index.json      # Header hierarchy (for search context)
├── chroma/         # Embeddings database
└── images/         # Extracted images (if any)
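
The layout above can be inspected with a few lines of standard-library Python. This is a rough sketch under the assumption that the output directory is the PDF path with its extension dropped and that the markdown file is named after the directory, as in the example; `paper_dir_for` and `artifact_status` are hypothetical names, not functions exported by the package.

```python
from pathlib import Path


def paper_dir_for(pdf_path: str) -> Path:
    """Output directory for a PDF: same parent, PDF stem as directory name (assumed)."""
    return Path(pdf_path).expanduser().with_suffix("")


def artifact_status(paper_dir: Path) -> dict:
    """Report which of the documented artifacts exist in a paper directory."""
    images = paper_dir / "images"
    return {
        "has_markdown": (paper_dir / f"{paper_dir.name}.md").is_file(),
        "has_metadata": (paper_dir / "metadata.json").is_file(),
        "has_index": (paper_dir / "index.json").is_file(),
        "has_embeddings": (paper_dir / "chroma").is_dir(),
        # An empty images/ directory counts as "no images".
        "has_images": images.is_dir() and any(images.iterdir()),
    }
```

Inside an MCP session the get_paper_info tool reports this kind of status directly; the sketch is only meant for checking a directory by hand.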

MCP Tools

process_paper

Full pipeline: Convert PDF, index headers, and create embeddings.

process_paper(
    pdf_path="~/Downloads/paper.pdf",
    use_llm=False,      # Set True for enhanced accuracy
    chunk_size=512,
    chunk_overlap=50
)
# Returns: output_dir, markdown_path, images_dir (if images extracted), image_count
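
To make the chunk_size and chunk_overlap parameters concrete, here is a minimal sliding-window sketch. It is an illustration only: the README does not state the unit of chunk_size or which splitter is used, so the token list and the `chunk_tokens` function below are assumptions, not the server's actual chunking code.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, chunk_overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens.

    Illustrative sketch; the real pipeline may split on sentences instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, each window advances by 462 tokens, so a 1000-token document yields three chunks starting at tokens 0, 462, and 924. A larger overlap keeps more shared context between neighbouring chunks at the cost of more embeddings to store.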

convert_pdf

Convert a PDF file to Markdown.

convert_pdf(
    pdf_path="~/Downloads/paper.pdf",
    output_dir=None,  # Defaults to ~/Downloads/paper/
    use_llm=False
)
# Returns: markdown_path, images_dir (if images extracted), image_count

index_markdown

Extract header hierarchy into searchable JSON.

index_markdown(
    markdown_path="~/Downloads/paper/paper.md"
)
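
As a rough picture of what header indexing involves, the sketch below walks a markdown string and records each ATX header with its ancestor path. The entry fields (`line`, `level`, `title`, `path`) are assumptions made for this example; the actual index.json schema is not documented here and may differ.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def build_header_index(markdown: str) -> list[dict]:
    """Collect headers with their hierarchy path, e.g. 'Methods > Data Collection'."""
    index = []
    stack = []  # (level, title) pairs for the currently open ancestor headers
    in_fence = False
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # skip '#' comment lines inside code fences
            continue
        match = None if in_fence else HEADER_RE.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        # Close headers at the same or a deeper level before opening this one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        index.append({
            "line": line_no,
            "level": level,
            "title": title,
            "path": " > ".join(t for _, t in stack),
        })
    return index
```

The result is a plain list of dictionaries, so `json.dumps(build_header_index(text), indent=2)` would give a searchable JSON document in the spirit of index.json.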

embed_document

Create embeddings for semantic search.

embed_document(
    markdown_path="~/Downloads/paper/paper.md",
    chunk_size=512,
    chunk_overlap=50
)

search

Unified search with grep and/or RAG.

search(
    query="transformer attention mechanism",
    paper_dirs=["~/Downloads/paper1", "~/Downloads/paper2"],
    mode="hybrid",  # "grep", "rag", or "hybrid"
    top_k=5
)
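
The top_k parameter limits how many results come back. For the semantic side, the usual idea is to rank chunks by vector similarity to the query and keep the best k. The sketch below shows that ranking with cosine similarity on hand-made vectors; it is a generic illustration, not the server's code, and the similarity metric the vector store is configured with is not stated in this README.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 5) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query as (score, text) pairs."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In the real pipeline the vectors come from the embedding model and the nearest-neighbour lookup is delegated to the vector store; the toy version only shows what "top 5" means.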

get_paper_info

Check processing status of a paper directory.

get_paper_info("~/Downloads/paper")
# Returns: has_markdown, has_index, has_embeddings, has_images,
#          images_dir, image_files, image_count,
#          version info, metadata

Extracted Images

When PDFs contain images (figures, diagrams, etc.), they are automatically extracted to an images/ subdirectory. The agent using this MCP server can:

  1. Check get_paper_info() to see if images exist and get the images_dir path
  2. Access individual image files listed in image_files
  3. Reference images from the converted markdown (images are linked in the .md file)
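
For the third option, image references can be pulled out of the converted markdown with a regular expression. This is a small sketch that assumes the converter emits standard markdown image syntax (`![alt](path)`); `image_links` is a hypothetical helper, not part of the server.

```python
import re

# Matches ![alt](target) and ![alt](target "title"); captures alt text and target.
IMAGE_LINK_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def image_links(markdown: str) -> list[tuple[str, str]]:
    """Return (alt_text, target) pairs for every image link in the markdown."""
    return [(m.group(1), m.group(2)) for m in IMAGE_LINK_RE.finditer(markdown)]
```

The returned targets are whatever the markdown contains, typically paths relative to the .md file, so they can be joined with the paper directory to locate the files on disk.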

Version Compatibility

Each processed paper directory includes a metadata.json file tracking:

  • paper_intelligence_version: Version used for processing
  • processed_at: Timestamp of processing
  • source_pdf: Original PDF filename
  • steps_completed: Which processing steps were run

When accessing papers, get_paper_info() checks version compatibility and warns if re-processing might be beneficial.
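
A version check of this kind can be sketched as follows. The comparison rule here (re-process when the recorded version is older than the installed one, or missing) is an assumption for illustration; the README does not specify how get_paper_info() decides, and `needs_reprocessing` is a hypothetical name.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '0.1.0' into (0, 1, 0). Assumes plain dotted integers, no pre-release tags."""
    return tuple(int(part) for part in version.split("."))


def needs_reprocessing(metadata: dict, current_version: str) -> bool:
    """True if the paper was processed by an older version or has no recorded version."""
    recorded = metadata.get("paper_intelligence_version")
    if recorded is None:
        return True
    # Tuple comparison is numeric per component, so 0.10.0 sorts after 0.9.0.
    return parse_version(recorded) < parse_version(current_version)
```

The `metadata` argument is the parsed content of a paper's metadata.json, for example `json.loads((paper_dir / "metadata.json").read_text())`.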

How Search Uses index.json

The index.json file stores the header hierarchy extracted from the markdown. When you search:

  1. Grep search: Uses index.json to provide header context for matches (e.g., "Methods > Data Collection")
  2. RAG search: Returns semantic matches from the embedded chunks

The index enables fast header lookups without re-parsing the markdown on each search.
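
The header-context lookup described above can be sketched with a binary search over header line numbers. The index format used here, a list of entries sorted by line number and each carrying a `line` and a `path`, is assumed for the example and is not the documented index.json schema.

```python
import re
from bisect import bisect_right


def header_context(index: list[dict], line_no: int) -> str:
    """Path of the last header at or before line_no, or '' if none precedes it."""
    header_lines = [entry["line"] for entry in index]  # must be sorted ascending
    pos = bisect_right(header_lines, line_no) - 1
    return index[pos]["path"] if pos >= 0 else ""


def grep_with_context(markdown: str, index: list[dict], pattern: str) -> list[dict]:
    """Regex-search each line and attach the enclosing header path to every match."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [
        {"line": n, "text": line.strip(), "context": header_context(index, n)}
        for n, line in enumerate(markdown.splitlines(), start=1)
        if rx.search(line)
    ]
```

Because the index is precomputed, each match costs only a lookup rather than a fresh pass over the markdown to find its enclosing section.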

Technical Stack

  • MCP: Official Python SDK with FastMCP
  • PDF Conversion: marker-pdf
  • Embeddings: LlamaIndex + HuggingFace (BAAI/bge-small-en-v1.5)
  • Vector Store: ChromaDB (persistent, local per-paper)
  • GPU: PyTorch with MPS (Apple Silicon) or CUDA support

Development

pip install -e ".[dev]"
pytest

License

MIT
