paper2md
Precision PDF-to-Markdown converter for research papers.
Features
- Title, author, and abstract extraction from diverse paper formats
- Heading hierarchy detection via font size, weight, and all-caps analysis
- Math rendering with Computer Modern (CM) font-to-LaTeX mapping (~120 symbols)
- Tables detected via line-based layout, output in pipe format
- Figures from raster (xref), vector (drawings), and clustered composites
- References with bracket and alphanumeric key parsing
- OCR fallback for scanned PDFs (PyMuPDF OCR or pytesseract)
- Multi-column support via 1D clustering with adaptive thresholds
- MCP server with tools for PDF conversion, structured extraction, and metadata
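To make the multi-column bullet above more concrete, here is a minimal sketch of 1D clustering on block x-coordinates with a threshold that adapts to the page. It is not paper2md's actual implementation: the function name `cluster_columns`, the `gap_factor` parameter, and the median-width threshold rule are all assumptions chosen for illustration.

```python
from statistics import median


def cluster_columns(blocks, gap_factor=0.5):
    """Group text blocks into columns by clustering their left x-coordinates.

    blocks: list of (x0, x1) horizontal extents, one per text block.
    gap_factor: hypothetical tuning knob; a new column starts when the jump
        between consecutive sorted x0 values exceeds gap_factor * median width.
    Returns a list of columns (left to right), each a list of block indices.
    """
    if not blocks:
        return []

    # Adaptive threshold: scale with the typical block width on this page.
    threshold = gap_factor * median(x1 - x0 for x0, x1 in blocks)

    # Sort block indices by left edge, then split wherever the gap is large.
    order = sorted(range(len(blocks)), key=lambda i: blocks[i][0])
    columns = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if blocks[cur][0] - blocks[prev][0] > threshold:
            columns.append([cur])
        else:
            columns[-1].append(cur)
    return columns


# Five blocks on a two-column page (made-up coordinates, in points).
page = [(50, 290), (52, 288), (310, 550), (50, 291), (312, 548)]
print(cluster_columns(page))  # [[0, 3, 1], [2, 4]]
```

Scaling the threshold by the median block width means the same `gap_factor` can work for both narrow and wide page layouts, which is one plausible reading of "adaptive thresholds".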
Installation
pip install paper2md
With MCP server support:
pip install "paper2md[mcp]"
With OCR support for scanned PDFs:
pip install "paper2md[ocr]"
Usage
CLI
paper2md paper.pdf -d output/
This writes the Markdown file and all extracted figure images to the output directory.
Python API
from paper2md import convert
result = convert("paper.pdf")
print(result.markdown)
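Building on the API call above, the following sketch converts a whole directory of PDFs and writes one Markdown file per paper. It relies only on what is shown here, namely `convert(path)` returning an object with a `.markdown` attribute. The helper name `convert_directory` and its `converter` parameter are hypothetical and not part of paper2md.

```python
from pathlib import Path


def convert_directory(src_dir, out_dir, converter=None):
    """Convert every PDF in src_dir and write one .md file per paper to out_dir.

    converter: callable taking a path string and returning an object with a
        .markdown attribute; defaults to paper2md.convert.
    Returns the list of written Markdown paths.
    """
    if converter is None:
        # Imported lazily so the helper can be tried with a stand-in converter.
        from paper2md import convert as converter

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        result = converter(str(pdf))
        target = out / (pdf.stem + ".md")
        target.write_text(result.markdown, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_directory("papers", "output")` would then write `output/<name>.md` for each `papers/<name>.pdf`. Unlike the CLI, this sketch does not save extracted figure images.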
MCP Server
paper2md exposes three MCP tools: convert_pdf, convert_pdf_structured, and extract_metadata. Configure your MCP client to launch paper2md.mcp_server.
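As a rough starting point for client configuration, the sketch below prints a JSON entry in the `mcpServers` layout that several MCP clients read. Two things are assumptions rather than documented behavior: that the server can be started over stdio with `python -m paper2md.mcp_server` (only the module name is given above), and that your client uses this config layout. The `build_mcp_config` helper and the `"paper2md"` server key are illustrative names.

```python
import json
import sys


def build_mcp_config(python_executable=sys.executable):
    """Return a client config dict that launches the paper2md MCP server.

    Assumption: the server starts via `python -m paper2md.mcp_server`.
    """
    return {
        "mcpServers": {
            "paper2md": {
                "command": python_executable,
                "args": ["-m", "paper2md.mcp_server"],
            }
        }
    }


# Print the fragment so it can be pasted into the client's config file.
print(json.dumps(build_mcp_config(), indent=2))
```

Using `sys.executable` as the default points the client at the interpreter where paper2md is installed, which avoids picking up a different Python from `PATH`.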
Tested Formats
paper2md is tested against papers from the following venues and publishers:
arXiv, NeurIPS, CVPR, ICLR, IEEE, ACM, NAACL, Meta AI, DeepMind, JMLR, Nature, Springer
Requirements
- Python 3.10+
- PyMuPDF 1.24+
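If you want to check an environment against these requirements before installing, a small script along these lines may help. It is only a sketch: `parse_version` and `check_requirements` are illustrative helpers, and the version parsing is deliberately simple (leading digits of each dot-separated part).

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def parse_version(text):
    """Turn '1.24.10' into (1, 24, 10), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_requirements():
    """Return a list of human-readable problems; empty if all requirements are met."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ required")
    try:
        if parse_version(version("PyMuPDF")) < (1, 24):
            problems.append("PyMuPDF 1.24+ required")
    except PackageNotFoundError:
        problems.append("PyMuPDF is not installed")
    return problems


print(check_requirements() or "environment looks compatible")
```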
License
MIT
Download files
File details
Details for the file paper2md-0.1.0.tar.gz.
File metadata
- Download URL: paper2md-0.1.0.tar.gz
- Upload date:
- Size: 90.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a38bd1018db8acb638d6b4329a99b9de02991ac8358aa57844e3b2b49506e29 |
| MD5 | 57850cedefe4a389b1aba4cf356e1975 |
| BLAKE2b-256 | 10093ab6bf1bdb6bb57265b0cf3bcc80587a48384a8f5fa45d52abc8477266cd |
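A downloaded copy of paper2md-0.1.0.tar.gz can be checked against the SHA256 digest listed above. The sketch below uses only the standard library. The helper names `sha256_of` and `matches_published_digest` are illustrative, and the path you pass in is assumed to be wherever you saved the file.

```python
import hashlib

# SHA256 digest listed for paper2md-0.1.0.tar.gz
EXPECTED_SHA256 = "3a38bd1018db8acb638d6b4329a99b9de02991ac8358aa57844e3b2b49506e29"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 equals the expected digest (case-insensitive)."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_digest("paper2md-0.1.0.tar.gz")` should return `True` for an intact download.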
File details
Details for the file paper2md-0.1.0-py3-none-any.whl.
File metadata
- Download URL: paper2md-0.1.0-py3-none-any.whl
- Upload date:
- Size: 64.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | db2ce64a0a0992c838f4a5d68e7fbd46f6a85c754f236afc922d13793812ecdd |
| MD5 | 8b70ecb987880b4eb8bbb5074a5b1c82 |
| BLAKE2b-256 | 3b632b5e81e0a8ed768667c7a2866a9da8f9053858ef665401a865318870d6a6 |