Model Context Protocol server for arXiv PDF retrieval and LLM context generation.
paperstack (Model Context Protocol)
Overview
paperstack is a production-grade Model Context Protocol (MCP) server focused on arXiv research retrieval.
It provides:
- arXiv Atom API search by ID/query
- PDF download, validation, and cache
- PDF text extraction (title, abstract, body, references)
- Token-aware context chunking for LLM pipelines
- CLI, API, and autonomous agent integration support
Table of Contents
- Quickstart
- Installation
- Usage
- MCP Server
- Project structure
- Configuration
- Testing
- Troubleshooting
- Contributing
- License
Quickstart
1. Clone repository
git clone https://github.com/Aldrin-Joan/paperstack.git
cd paperstack
2. Set up Python environment (recommended)
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate
3. Install dependencies
pip install -r requirements.txt
4. Run smoke test
python test_smoke.py
Installation
From source:
pip install -e .
From PyPI:
pip install paperstack-mcp
Usage
CLI
paperstack --help
Run server locally:
python -m src.mcp_server
Python API
from paperstack_mcp import entrypoint # import alias for the package
from src.arxiv_client import ArxivClient
from src.pdf_fetcher import PdfFetcher
from src.pdf_parser import PdfParser
from src.context_builder import ContextBuilder
client = ArxivClient()
results = client.search('quantum computing', max_results=3)
pdf_path = PdfFetcher().fetch_paper(results[0].id)
parsed = PdfParser().parse(pdf_path)
context = ContextBuilder().build(parsed)
print(context.summary)
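For orientation, the search step above presumably ends up as a request against the public arXiv Atom API. The sketch below only builds the request URLs with the standard library; `build_search_url` and `build_id_url` are hypothetical helper names, not part of paperstack, and the `all:` field prefix is one of several the arXiv API accepts.

```python
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def build_search_url(query: str, max_results: int = 3, start: int = 0) -> str:
    """Build an arXiv Atom API URL for a free-text query (hypothetical helper)."""
    params = {
        "search_query": f"all:{query}",  # "all:" searches every metadata field
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API_URL}?{urlencode(params)}"


def build_id_url(arxiv_id: str) -> str:
    """Build an arXiv Atom API URL that looks up one paper by ID (hypothetical helper)."""
    return f"{ARXIV_API_URL}?{urlencode({'id_list': arxiv_id})}"


print(build_search_url("quantum computing"))
```

The response to either URL is an Atom XML feed, which a client then parses into result objects.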
Architecture Layers
| Layer | Features |
|---|---|
| Layer 1 — retrieval (both tools have this) | Search · PDF fetch + cache · Text extraction + chunking |
| Layer 2 — intelligence (your opportunity) | Citation graph · Concept extraction · Cross-paper synthesis |
| Layer 3 — dev tooling (highly unique) | Code + dataset links · Implementation diff · Reproducibility audit |
| Layer 4 — research workflows (unique) | Reading lists · Topic tracking + alerts · Agent-ready Q&A |
MCP Server
src/mcp_server/__main__.py starts an MCP tool server exposing:
- arxiv_search (query or ID expand)
- arxiv_fetch_pdf (download + cache)
- arxiv_parse_pdf (extract text and metadata)
- arxiv_build_context (chunk to LLM-friendly context)
- arxiv_citation_graph (author/paper citation network)
- arxiv_extract_contributions (structured contribution extractor)
- arxiv_semantic_index (semantic similarity index builder/query)
- arxiv_compare_papers (paper comparison report)
- arxiv_extract_code_links (discover official GitHub/HuggingFace/Kaggle links from a paper)
- arxiv_reproducibility_score (reproducibility heuristic score with evidence details)
- arxiv_diff_implementations (compare paper method claims against a GitHub implementation)
- arxiv_reading_list (persistent reading list CRUD and filters)
- arxiv_watch_topic (watch query topics and detect new papers)
- arxiv_explain_for_audience (audience-specific explanation synthesis)
Use any MCP-capable client (VS Code MCP extension, custom agent SDK) to connect.
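At the wire level, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below shows roughly what such a message looks like for `arxiv_search`. The argument names `query` and `max_results` are assumptions for illustration; the real input schema is whatever the server reports via `tools/list`. `make_tool_call` is a hypothetical helper.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry an id the response echoes back


def make_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# "query" and "max_results" are assumed argument names, not confirmed by the project.
message = make_tool_call("arxiv_search", {"query": "quantum computing", "max_results": 3})
print(message)
```

In practice a client library handles this framing, along with the `initialize` handshake that must precede any tool call.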
VS Code MCP server setup
In VS Code, add an MCP server entry to your workspace settings (e.g., .vscode/settings.json):
{
  "servers": {
    "arxiv-mcp": {
      "command": "D:/Softwares/Anaconda3/python.exe",
      "args": ["-m", "src.mcp_server"],
      "cwd": "${workspaceFolder}",
      "env": {
        "PYTHONPATH": "${workspaceFolder}",
        "ARXIV_DOWNLOAD_DIR": "${workspaceFolder}/downloads",
        "ARXIV_KEEP_PDFS": "true",
        "CHUNK_SIZE_TOKENS": "800",
        "CHUNK_OVERLAP_TOKENS": "100",
        "ARXIV_RATE_LIMIT_DELAY": "3.0",
        "MAX_RETRIES": "3",
        "HTTP_TIMEOUT": "60"
      }
    }
  }
}
- ARXIV_DOWNLOAD_DIR: local storage for downloaded PDFs.
- ARXIV_KEEP_PDFS: keep cached PDFs after parse.
- CHUNK_SIZE_TOKENS / CHUNK_OVERLAP_TOKENS: controls text-chunking in context builder.
- ARXIV_RATE_LIMIT_DELAY: delay between arXiv API calls.
- MAX_RETRIES, HTTP_TIMEOUT: network robustness.
You can also apply this configuration in other compatible MCP clients, using their server configuration schema.
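To make the effect of CHUNK_SIZE_TOKENS and CHUNK_OVERLAP_TOKENS concrete, here is a minimal sketch of overlapping-window chunking using the sample values 800 and 100. It illustrates the general technique and is not paperstack's actual context builder; `chunk_tokens` is a hypothetical name, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 800, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Whitespace splitting is a stand-in for a real tokenizer.
tokens = ("word " * 2000).split()
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With 2000 tokens, each window advances by 700 tokens, so consecutive chunks repeat the last 100 tokens of the previous one. A larger overlap preserves more context across chunk boundaries at the cost of more total tokens.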
Project structure
- src/ - package source
  - arxiv_client/ - arXiv Atom API logic
  - pdf_fetcher/ - download/cache PDF
  - pdf_parser/ - extract/clean PDF text
  - context_builder/ - tokenization + chunking
  - mcp_server/ - MCP protocol/adapters
- tests/ - pytest suite
- requirements.txt - dependencies
- pyproject.toml - package metadata
Configuration
Environment variables:
- ARXIV_CACHE_DIR (default: ./downloads)
- ARXIV_CACHE_TTL (default: 604800 seconds / 7 days)
- ARXIV_DB_PATH (default: ${ARXIV_DOWNLOAD_DIR}/arxiv_mcp.db): path to the SQLite workflow database
- ARXIV_RATE_LIMIT (default: 1 request/sec)
- S2_API_KEY (optional; Semantic Scholar API key for higher rate limits)
- OLLAMA_BASE_URL (default: http://localhost:11434)
- OLLAMA_MODEL (default: mistral)
- SEMANTIC_INDEX_DIR (default: ${ARXIV_DOWNLOAD_DIR}/semantic_index)
- CITATION_CACHE_TTL (default: 86400 seconds / 24 hours)
- CONTRIBUTION_CACHE_TTL (default: 604800 seconds / 7 days)
- EMBEDDING_MODEL (default: sentence-transformers/all-MiniLM-L6-v2)
- GITHUB_TOKEN (optional; for GitHub API auth, improves 60 -> 5000 req/hour)
- LINK_CACHE_TTL (default: 172800 seconds / 48 hours)
- REPRO_CACHE_TTL (default: 604800 seconds / 7 days)
- DIFF_CACHE_TTL (default: 86400 seconds / 24 hours)
- GITHUB_MAX_FILES (default: 20)
- GITHUB_MAX_FILE_SIZE_KB (default: 50)
Set in shell or via .env before running.
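As an illustration of how such variables are typically consumed, the sketch below reads a few of them with the documented defaults. The `Settings` class and `load_settings` function are hypothetical names for this example; paperstack's own configuration code may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """A hypothetical container for a subset of the documented variables."""
    cache_dir: str
    cache_ttl_seconds: int
    rate_limit_per_sec: float
    github_token: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env` (os.environ by default), applying documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        cache_dir=env.get("ARXIV_CACHE_DIR", "./downloads"),
        cache_ttl_seconds=int(env.get("ARXIV_CACHE_TTL", "604800")),
        rate_limit_per_sec=float(env.get("ARXIV_RATE_LIMIT", "1")),
        github_token=env.get("GITHUB_TOKEN") or None,  # empty string counts as unset
    )


# Passing an empty mapping shows the defaults regardless of the current shell.
print(load_settings({}))
```

Environment values are always strings, so numeric settings need an explicit conversion, and a malformed value such as `ARXIV_CACHE_TTL=7d` would raise a `ValueError` at startup rather than fail later.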
Testing
Run full tests:
pytest -q
Smoke test:
python test_smoke.py
Troubleshooting
- arxiv-mcp command not found: ensure the virtualenv is active and the package is installed.
- PDF download failure: check network access to https://arxiv.org/pdf/.
- Rate-limit errors: lower request frequency or adjust ARXIV_RATE_LIMIT.
- Topic duplicates observed after repeated tests: call DatabaseClient.reset() on the workflow DB, and/or rely on topic_watcher.add, which now enforces dedupe by (query, label).
- Reading list duplicate notes: ReadingListManager.add now avoids re-appending identical note blocks.
- Ollama not available fallback: _passthrough now uses arXiv metadata.abstract for all explanation fields (what_it_is / problem_solved / how_it_works / why_it_matters / key_result).
- Dependency pin check: pip install -r requirements.txt includes protobuf==3.20.3 and urllib3>=2.0.0,<3 to avoid known warning/conflict cases (TensorFlow + ChromaDB MessageFactory and Requests RequestsDependencyWarning).
- Smoke harness summary: scripts/run_all_tools.py prints final status with count of run/passed/failed tools.
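For the rate-limit item above, it can help to see how a delay between calls and a retry budget fit together. The following is a minimal sketch of that pattern and not paperstack's implementation. `Throttle` and `call_with_retries` are hypothetical names, and `max_retries` is assumed here to mean the total number of attempts.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self) -> None:
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)  # pause until the interval has elapsed
                now = self._clock()
        self._last_call = now


def call_with_retries(func, throttle: Throttle, max_retries: int = 3):
    """Call func, throttled, retrying on OSError up to max_retries attempts."""
    last_error = None
    for _ in range(max_retries):
        throttle.wait()
        try:
            return func()
        except OSError as exc:  # stdlib network errors such as URLError derive from OSError
            last_error = exc
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error


# A zero interval keeps this demo instant; a real setup might use 3.0 seconds.
print(call_with_retries(lambda: "ok", Throttle(0.0)))
```

Because every attempt passes through the throttle, retries are spaced by the same interval as normal requests, which avoids hammering the API after a failure.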
Contributing
- Fork repo
- Create feature branch
- Add tests and update README
- Open PR
Make sure the style checks pass (Black, formatting, and lint) before opening a PR.
License
Apache-2.0