Paper Intelligence MCP Server
A local MCP (Model Context Protocol) server for intelligent paper/PDF management with RAG capabilities.
Features
- PDF to Markdown: Convert PDFs using Marker with high accuracy
- Header Indexing: Extract document structure into searchable JSON
- Semantic Search: RAG-powered search using LlamaIndex + ChromaDB + HuggingFace embeddings
- Hybrid Search: Combined grep (text/regex) + semantic search
- GPU Acceleration: MPS (Apple Silicon) and CUDA support
- Self-contained: Each paper gets its own directory with all data
- Version Tracking: Metadata tracks which version processed each paper
Installation
Option 1: Install from PyPI (Recommended)
# Install with pip
pip install paper-intelligence
# Or run directly with uvx (no install needed)
uvx paper-intelligence
Option 2: Install from GitHub
# Install directly from GitHub (no clone needed)
pip install "paper-intelligence @ git+https://github.com/Strand-AI/paper-intelligence.git"
Option 3: Local Development
git clone https://github.com/Strand-AI/paper-intelligence.git
cd paper-intelligence
# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate
# Install in development mode
pip install -e ".[dev]"
# Run the server
python -m paper_intelligence.server
MCP Client Configuration
Claude Desktop
Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS, ~/.config/claude/claude_desktop_config.json on Linux, or %APPDATA%\Claude\claude_desktop_config.json on Windows):
Using uvx (recommended):
{
  "mcpServers": {
    "paper-intelligence": {
      "command": "uvx",
      "args": ["paper-intelligence"]
    }
  }
}
Using local install:
{
  "mcpServers": {
    "paper-intelligence": {
      "command": "/path/to/paper-intelligence/.venv/bin/python",
      "args": ["-m", "paper_intelligence.server"]
    }
  }
}
Claude Code
Add to your Claude Code config (~/.claude.json):
Using uvx (recommended):
{
  "mcpServers": {
    "paper-intelligence": {
      "type": "stdio",
      "command": "uvx",
      "args": ["paper-intelligence"]
    }
  }
}
Using local install:
{
  "mcpServers": {
    "paper-intelligence": {
      "type": "stdio",
      "command": "/path/to/paper-intelligence/.venv/bin/python",
      "args": ["-m", "paper_intelligence.server"],
      "cwd": "/path/to/paper-intelligence"
    }
  }
}
Output Structure
Processing ~/Downloads/paper.pdf creates ~/Downloads/paper/:
paper/
├── paper.md # Converted markdown
├── metadata.json # Processing version and info
├── index.json # Header hierarchy (for search context)
├── chroma/ # Embeddings database
└── images/ # Extracted images (if any)
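The layout above can be derived from the PDF path alone. The following is a minimal sketch, assuming the output directory is simply the PDF path with its extension dropped (as the example suggests) and that the markdown file takes the PDF's stem. `expected_layout` is a hypothetical helper for illustration, not part of the package.

```python
from pathlib import Path


def expected_layout(pdf_path: str) -> dict[str, Path]:
    """Map each artifact name to the path where the layout above places it.

    Hypothetical helper: the naming rules are inferred from the example tree.
    """
    pdf = Path(pdf_path).expanduser()
    out_dir = pdf.with_suffix("")  # ~/Downloads/paper.pdf -> ~/Downloads/paper
    return {
        "output_dir": out_dir,
        "markdown": out_dir / f"{pdf.stem}.md",
        "metadata": out_dir / "metadata.json",
        "index": out_dir / "index.json",
        "chroma": out_dir / "chroma",
        "images": out_dir / "images",
    }
```

Checking `expected_layout(path)["markdown"].exists()` before calling a tool is one way for a client script to avoid re-converting a paper that has already been processed.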
MCP Tools
process_paper
Full pipeline: Convert PDF, index headers, and create embeddings.
process_paper(
    pdf_path="~/Downloads/paper.pdf",
    use_llm=False,  # Set True for enhanced accuracy
    chunk_size=512,
    chunk_overlap=50
)
# Returns: output_dir, markdown_path, images_dir (if images extracted), image_count
convert_pdf
Convert a PDF file to Markdown.
convert_pdf(
    pdf_path="~/Downloads/paper.pdf",
    output_dir=None,  # Defaults to ~/Downloads/paper/
    use_llm=False
)
# Returns: markdown_path, images_dir (if images extracted), image_count
index_markdown
Extract header hierarchy into searchable JSON.
index_markdown(
    markdown_path="~/Downloads/paper/paper.md"
)
embed_document
Create embeddings for semantic search.
embed_document(
    markdown_path="~/Downloads/paper/paper.md",
    chunk_size=512,
    chunk_overlap=50
)
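To illustrate what chunk_size and chunk_overlap mean, here is a minimal sliding-window sketch. It assumes whitespace-separated words as a stand-in for tokens; the splitter the server actually uses is not described in this README and may count tokens or respect sentence boundaries. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(
    words: list[str], chunk_size: int = 512, chunk_overlap: int = 50
) -> list[list[str]]:
    """Split `words` into windows of `chunk_size` that share `chunk_overlap` items.

    Illustrative only: words stand in for tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[list[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each chunk starts 462 items after the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk as long as it is shorter than the overlap.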
search
Unified search with grep and/or RAG.
search(
    query="transformer attention mechanism",
    paper_dirs=["~/Downloads/paper1", "~/Downloads/paper2"],
    mode="hybrid",  # "grep", "rag", or "hybrid"
    top_k=5
)
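The grep half of the search can be pictured with the standard library alone. This is a sketch under stated assumptions, not the server's implementation: it assumes each paper directory holds its converted markdown as a `.md` file, treats the query as a case-insensitive regex, and returns the first `top_k` matching lines. `grep_papers` and the shape of its result dictionaries are hypothetical.

```python
import re
from pathlib import Path


def grep_papers(query: str, paper_dirs: list[str], top_k: int = 5) -> list[dict]:
    """Return up to `top_k` lines matching `query` across the given paper directories.

    Hypothetical helper: result keys and case-insensitivity are assumptions.
    """
    pattern = re.compile(query, re.IGNORECASE)
    hits: list[dict] = []
    for paper_dir in paper_dirs:
        directory = Path(paper_dir).expanduser()
        for md_file in sorted(directory.glob("*.md")):
            lines = md_file.read_text(encoding="utf-8").splitlines()
            for line_no, line in enumerate(lines, start=1):
                if pattern.search(line):
                    hits.append(
                        {"paper": directory.name, "line": line_no, "text": line.strip()}
                    )
                    if len(hits) == top_k:
                        return hits
    return hits
```

A multi-word natural-language query such as the one above rarely matches a single line verbatim, which is where the semantic (RAG) half of hybrid mode is useful.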
get_paper_info
Check processing status of a paper directory.
get_paper_info("~/Downloads/paper")
# Returns: has_markdown, has_index, has_embeddings, has_images,
# images_dir, image_files, image_count,
# version info, metadata
Extracted Images
When PDFs contain images (figures, diagrams, etc.), they are automatically extracted to an images/ subdirectory. The agent using this MCP server can:
- Check get_paper_info() to see if images exist and get the images_dir path
- Access individual image files listed in image_files
- Reference images from the converted markdown (images are linked in the .md file)
Version Compatibility
Each processed paper directory includes a metadata.json file tracking:
- paper_intelligence_version: Version used for processing
- processed_at: Timestamp of processing
- source_pdf: Original PDF filename
- steps_completed: Which processing steps were run
When accessing papers, get_paper_info() checks version compatibility and warns if re-processing might be beneficial.
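As an illustration of such a check, the sketch below reads metadata.json and compares the recorded version against the running one. The comparison rule (flag a mismatch in the major or minor component) is an assumption made for this example; the rule get_paper_info() actually applies is not documented here. `needs_reprocessing` is a hypothetical helper.

```python
import json
from pathlib import Path


def needs_reprocessing(paper_dir: str, current_version: str) -> bool:
    """Return True if the paper looks unprocessed or was processed by a different release line.

    Assumption: only the major.minor components matter for compatibility.
    """
    metadata_file = Path(paper_dir).expanduser() / "metadata.json"
    if not metadata_file.exists():
        return True  # no metadata, so the paper has not been processed here

    metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
    recorded = metadata.get("paper_intelligence_version", "")
    # Compare only the first two dotted components, e.g. "0.1" from "0.1.0".
    return recorded.split(".")[:2] != current_version.split(".")[:2]
```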
How Search Uses index.json
The index.json file stores the header hierarchy extracted from the markdown. When you search:
- Grep search: Uses index.json to provide header context for matches (e.g., "Methods > Data Collection")
- RAG search: Returns semantic matches from the embedded chunks
The index enables fast header lookups without re-parsing the markdown on each search.
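A header-context lookup of this kind can be sketched as follows. The schema is an assumption made for illustration (a list of headers, each with `level`, `title`, and `line`, sorted by line number); the real index.json format is not shown in this README, and `header_context` is a hypothetical name.

```python
def header_context(headers: list[dict], line_no: int) -> str:
    """Build a breadcrumb such as "Methods > Data Collection" for a matched line.

    Assumed schema: each header is {"level": int, "title": str, "line": int},
    and the list is sorted by "line".
    """
    stack: list[dict] = []
    for header in headers:
        if header["line"] > line_no:
            break  # remaining headers start after the matched line
        # A new header closes every open header at the same or a deeper level.
        while stack and stack[-1]["level"] >= header["level"]:
            stack.pop()
        stack.append(header)
    return " > ".join(h["title"] for h in stack)
```

Because the headers are already extracted and ordered, the lookup is a single pass over a short list rather than a re-parse of the whole markdown file.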
Technical Stack
- MCP: Official Python SDK with FastMCP
- PDF Conversion: marker-pdf
- Embeddings: LlamaIndex + HuggingFace (BAAI/bge-small-en-v1.5)
- Vector Store: ChromaDB (persistent, local per-paper)
- GPU: PyTorch with MPS (Apple Silicon) or CUDA support
Development
pip install -e ".[dev]"
pytest
License
MIT