paper2md

Precision PDF-to-Markdown converter for research papers.

Features

  • Title, author, and abstract extraction from diverse paper formats
  • Heading hierarchy detection via font size, font weight, and all-caps analysis
  • Math rendering with a Computer Modern (CM) font-to-LaTeX mapping (~120 symbols)
  • Tables detected via line-based layout analysis, output as Markdown pipe tables
  • Figures from raster (xref), vector (drawings), and clustered composites
  • References with bracket and alphanumeric key parsing
  • OCR fallback for scanned PDFs (PyMuPDF OCR or pytesseract)
  • Multi-column support via 1D clustering with adaptive thresholds
  • MCP server with tools for PDF conversion, structured extraction, and metadata
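
To illustrate the multi-column idea from the list above, here is a minimal sketch of 1D clustering over the left edges of text blocks. It is not paper2md's actual implementation: the function name `cluster_columns`, the `min_gap_ratio` parameter, and the thresholding rule (the larger of a fraction of the page width and three times the median gap) are all assumptions made for illustration.

```python
from statistics import median


def cluster_columns(x_lefts, page_width, min_gap_ratio=0.1):
    """Group left x-coordinates of text blocks into column clusters.

    Hypothetical helper: splits the sorted coordinates wherever the gap
    between neighbours exceeds an adaptive threshold.
    """
    xs = sorted(x_lefts)
    if not xs:
        return []

    # Gaps between consecutive sorted coordinates.
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    positive = [g for g in gaps if g > 0]
    typical_gap = median(positive) if positive else 0.0

    # Assumed adaptive rule: a gap counts as a column break only if it is
    # large relative to both the page width and the typical jitter.
    threshold = max(min_gap_ratio * page_width, 3 * typical_gap)

    clusters = [[xs[0]]]
    for prev, cur in zip(xs, xs[1:]):
        if cur - prev > threshold:
            clusters.append([cur])      # start a new column
        else:
            clusters[-1].append(cur)    # same column as the previous block
    return clusters


# Left edges from a two-column US Letter page (612 pt wide).
columns = cluster_columns([72, 72.5, 73, 72, 315, 316, 315.5], page_width=612)
print(len(columns))
```

Blocks whose left edges fall into the same cluster would then be read top to bottom before moving to the next cluster, which is one plausible way to recover reading order on two-column papers.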

Installation

pip install paper2md

With MCP server support:

pip install "paper2md[mcp]"

With OCR support for scanned PDFs:

pip install "paper2md[ocr]"

Usage

CLI

paper2md paper.pdf -d output/

This writes the Markdown file and all extracted figure images to the output directory.

Python API

from paper2md import convert

result = convert("paper.pdf")
print(result.markdown)
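
A slightly longer sketch, building on the snippet above, saves the Markdown to disk. Only `convert` and `result.markdown` come from this README; the helper `convert_to_file` and its `convert_fn` parameter are hypothetical, added so the function can be exercised without a real PDF.

```python
from pathlib import Path


def convert_to_file(pdf_path, out_dir, convert_fn=None):
    """Convert one PDF and write the Markdown next to a matching file name.

    Hypothetical wrapper: `convert_fn` defaults to paper2md.convert and is
    only a parameter so a stand-in can be injected for testing.
    """
    if convert_fn is None:
        # Imported lazily so the helper can be defined without paper2md installed.
        from paper2md import convert as convert_fn

    result = convert_fn(str(pdf_path))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # paper.pdf -> <out_dir>/paper.md
    out_path = out_dir / (Path(pdf_path).stem + ".md")
    out_path.write_text(result.markdown, encoding="utf-8")
    return out_path
```

Calling `convert_to_file("paper.pdf", "output")` would write `output/paper.md`. Unlike the CLI, this sketch does not write figure images, since the README does not document how the result object exposes them.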

MCP Server

paper2md exposes three MCP tools: convert_pdf, convert_pdf_structured, and extract_metadata. Configure your MCP client to launch paper2md.mcp_server.
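
As a rough illustration of the client configuration mentioned above, the sketch below prints a JSON entry in the `mcpServers` layout that several MCP clients accept. Launching the server with `python -m paper2md.mcp_server` is an assumption based on the module name given here; the package may provide a different entry point, and your client may expect a different config shape. The function `mcp_client_config` is hypothetical.

```python
import json


def mcp_client_config(python_executable="python"):
    """Build a client config entry for the paper2md MCP server.

    Assumption: the server module can be started with `python -m`.
    """
    return {
        "mcpServers": {
            "paper2md": {
                "command": python_executable,
                "args": ["-m", "paper2md.mcp_server"],
            }
        }
    }


# Paste the printed JSON into your MCP client's configuration file.
print(json.dumps(mcp_client_config(), indent=2))
```

If paper2md is installed in a virtual environment, passing that environment's interpreter path as `python_executable` lets the client start the server with the right dependencies.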

Tested Formats

paper2md is tested against papers from the following venues and publishers:

arXiv, NeurIPS, CVPR, ICLR, IEEE, ACM, NAACL, Meta AI, DeepMind, JMLR, Nature, Springer

Requirements

  • Python 3.10+
  • PyMuPDF 1.24+

License

MIT
