# pdf2md
Convert academic PDF papers to clean, readable markdown with linked citations, embedded figures, and structured metadata for RAG systems.
## Contents

- Quick Start — install and convert a paper
- Depth Levels — control how much processing is applied
- Direct CLI Usage — convert PDFs locally
- Service Mode — Docker microservice for remote/homelab use
- Claude Code Integration — MCP server + `/convert-paper` command
- Processing Pipeline — what happens at each stage
- Local AI Setup — run with LM Studio or Ollama
- Installation — extras and requirements
- Batch Processing — convert many papers at once
## Quick Start

```bash
# Install
pip install paper-to-md

# Pre-download Docling ML models (~500MB, one-time)
pdf2md download-models

# Convert a paper (Docling + postprocess + LLM retouch)
pdf2md convert paper.pdf ./output

# Fast conversion (no AI)
pdf2md convert paper.pdf ./output -d low

# Full pipeline with local LLM
pdf2md convert paper.pdf ./output -d high --local
```
## Depth Levels

pdf2md uses a depth-based system to control how much processing is applied:

| Depth | What happens | Speed |
|---|---|---|
| `low` | Docling extraction + rule-based postprocessing (citations, figures, sections, cleanup) | Fast, no AI |
| `medium` | + LLM retouch (author formatting, lettered section detection) | Moderate |
| `high` | + VLM figure descriptions + code/equation enrichments | Slow |
## Direct CLI Usage

### `pdf2md convert` — Main Conversion

```bash
uv run pdf2md convert paper.pdf ./output [OPTIONS]
```

| Option | Description |
|---|---|
| `-d, --depth` | Analysis depth: `low`, `medium` (default), `high` |
| `-l, --local` | Use local LLM/VLM instead of cloud (Claude) |
| `-p, --provider` | LLM provider: `lm_studio` (default), `ollama` |
| `-m, --model` | Override LLM/VLM model name |
| `--keep-raw` | Save raw Docling extraction alongside processed output |
| `--raw` | Skip all processing, output only raw extraction |
| `--images-scale N` | Image resolution multiplier (default: 2.0) |
| `--min-image-width` | Minimum image width in pixels, filters logos (default: 200) |
| `--min-image-height` | Minimum image height in pixels (default: 150) |
| `--min-image-area` | Minimum image area in pixels (default: 40000) |
Output:

```
output/paper/
├── paper.md             # Final processed markdown
├── paper_raw.md         # Raw Docling output (if --keep-raw)
├── img/
│   ├── figure1.png
│   ├── figure2.png
│   └── ...
├── enrichments.json     # All metadata (depth=high only)
├── figures.json         # Figure metadata
├── equations.json       # Equations with LaTeX
└── code_blocks.json     # Code with language detection
```
### `pdf2md retouch` — LLM Cleanup Only

Run LLM-based cleanup on an existing markdown file:

```bash
uv run pdf2md retouch paper.md [OPTIONS]
```

| Option | Description |
|---|---|
| `-l, --local` | Use local LLM instead of cloud (Claude) |
| `-p, --provider` | LLM provider: `lm_studio`, `ollama` |
| `-m, --model` | Override LLM model name |
| `-i, --images` | Path to images directory (default: `./img`) |
| `-v, --verbose` | Show detailed LLM progress |

The retouch step fixes:

- Author formatting — extracts and formats author names, affiliations, emails
- Lettered section headers — classifies `A. Background` as a header vs. `A. We conducted...` as a sentence
### `pdf2md postprocess` — Rule-Based Fixes Only

```bash
uv run pdf2md postprocess paper.md [OPTIONS]
```

| Option | Description |
|---|---|
| `-i, --images` | Path to images directory (default: `./img`) |
| `-o, --output` | Output path (default: overwrite input file) |
### `pdf2md enrich` — Extract RAG Metadata

```bash
uv run pdf2md enrich paper.pdf ./output [OPTIONS]
```

| Option | Description |
|---|---|
| `--describe` | Generate VLM descriptions for figures |
| `-l, --local` | Use local VLM instead of cloud |
| `-p, --provider` | VLM provider: `lm_studio`, `ollama` |
| `-m, --model` | Override VLM model |
| `--images-scale N` | Image resolution multiplier (default: 2.0) |
## Service Mode

Run pdf2md as a Docker microservice for remote or homelab use. The service exposes an HTTP API with Ed25519 signature authentication and asynchronous job processing via Redis/arq.

### Docker Deployment

```bash
# Start all services (API, worker, PostgreSQL, Redis)
docker compose up -d --build

# Run database migrations
docker compose exec api alembic upgrade head

# Check logs
docker compose logs -f worker
```
### API Endpoints

All endpoints require Ed25519 signature authentication (see Auth Setup).

| Method | Endpoint | Description |
|---|---|---|
| POST | `/submit_paper` | Upload a PDF and enqueue conversion. Returns `job_id`. |
| GET | `/status/{job_id}` | Check job status, progress, and errors. |
| GET | `/retrieve/{job_id}` | Download completed results as `tar.gz`. |
Submit example:

```bash
curl -X POST http://your-server:8000/submit_paper \
  -F "file=@paper.pdf" \
  -F "depth=medium" \
  -H "Authorization: Signature <base64-sig>" \
  -H "X-Timestamp: $(date +%s)" \
  -H "X-Client-Id: <your-uuid>"
```
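After submitting, a client typically polls `/status/{job_id}` until the job finishes, then downloads from `/retrieve/{job_id}`. A minimal polling sketch in Python; the terminal state names (`completed`, `failed`) and the `status` field are assumptions about the response shape:

```python
import json
import time
import urllib.request


def wait_for_job(base_url, job_id, headers, poll_seconds=5, timeout=600,
                 fetch=None):
    """Poll /status/{job_id} until the job reaches a terminal state.

    `fetch` is injectable for testing; by default it performs a signed
    HTTP GET against the service using the supplied auth headers.
    """
    if fetch is None:
        def fetch(url):
            req = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch(f"{base_url}/status/{job_id}")
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```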
### Auth Setup

The service uses Ed25519 keypairs for authentication. Each client has a UUID and a public key stored in the database; requests are signed with the corresponding private key.

Signature format: `METHOD\nPATH\nTIMESTAMP`, signed with the client's Ed25519 private key.

Required headers:

```
Authorization: Signature <base64-signature>
X-Timestamp: <unix-epoch>
X-Client-Id: <client-uuid>
```

Timestamps must be within 5 minutes of server time (configurable via `PDF2MD_SERVICE_AUTH_TIMESTAMP_TOLERANCE_SECONDS`).
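The header assembly can be sketched in Python. Only the message format and header names come from the spec above; the `sign` callable stands in for an actual Ed25519 signing primitive (e.g. `Ed25519PrivateKey.sign` from the `cryptography` package):

```python
import base64
import time


def build_auth_headers(method: str, path: str, client_id: str, sign,
                       timestamp=None) -> dict:
    """Build the signed headers the pdf2md service expects.

    `sign` takes the canonical message bytes and returns raw Ed25519
    signature bytes.
    """
    ts = str(int(time.time()) if timestamp is None else timestamp)
    # Canonical message: METHOD\nPATH\nTIMESTAMP
    message = f"{method}\n{path}\n{ts}".encode()
    signature = base64.b64encode(sign(message)).decode()
    return {
        "Authorization": f"Signature {signature}",
        "X-Timestamp": ts,
        "X-Client-Id": client_id,
    }
```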
### Service Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PDF2MD_SERVICE_DATABASE_URL` | `postgresql+asyncpg://...` | PostgreSQL connection string |
| `PDF2MD_SERVICE_REDIS_URL` | `redis://localhost:6379` | Redis connection string |
| `PDF2MD_SERVICE_DATA_DIR` | `/data` | Root data directory |
| `PDF2MD_SERVICE_UPLOAD_DIR` | `/data/uploads` | PDF upload storage |
| `PDF2MD_SERVICE_AUTH_TIMESTAMP_TOLERANCE_SECONDS` | `300` | Signature freshness window |
| `PDF2MD_SERVICE_WORKER_MAX_JOBS` | `1` | Concurrent conversion jobs |
## Claude Code Integration

### MCP Server

The `mcp/server.py` script exposes the service API as MCP tools for Claude Code. It loads credentials from a `.env` file in the repo root.

Register the server:

```bash
claude mcp add --scope user pdf2md-service -- uv run /path/to/paper-to-md/mcp/server.py
```

Required `.env` variables (not committed — see `.env.example`):

```
PDF2MD_SERVICE_URL=http://your-server:8000
PDF2MD_CLIENT_ID=00000000-0000-0000-0000-000000000001
PDF2MD_PRIVATE_KEY=<base64-ed25519-private-key>
```
Tools provided:

| Tool | Description |
|---|---|
| `pdf2md_submit` | Upload a PDF and start conversion. Returns job ID. |
| `pdf2md_status` | Poll job status and progress. |
| `pdf2md_retrieve` | Download and extract completed results. |
### `/convert-paper` Command

A project-level slash command in `.claude/commands/convert-paper.md` that orchestrates the full conversion workflow:

```
/convert-paper path/to/paper.pdf
```

This submits the PDF, polls for completion, downloads the results, and reports the extracted files. The command is auto-discovered by Claude Code when working in this repo.
## Processing Pipeline

### 1. Docling Extraction

Uses Docling (ML-based) to extract:

- Text with structure (headings, paragraphs, lists)
- Tables with formatting
- Figures as images
- Equations

### 2. Deterministic Post-Processing

Applied at all depth levels (including `low`):

Citations:

- `[7]` → `[[7]](#ref-7)` (clickable links)
- `[11]-[14]` → expanded to four individual linked citations
- Anchors added to reference entries for link targets

Sections:

- `Abstract -Text here` → `## Abstract\n\nText here`
- Hierarchical section numbering → proper markdown headers

Figures:

- Embeds figure images above line-start captions
- Each figure embedded exactly once

Bibliography:

- Adds `<a id="ref-N"></a>` anchors to reference entries
- Ensures proper spacing between entries

Cleanup:

- Fixes ligatures (ﬁ→fi, ﬂ→fl)
- Removes GLYPH artifacts from OCR
- Fixes hyphenated word breaks across lines
- Merges split paragraphs
- Removes OCR garbage near figure embeds
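The citation-linking rules above can be approximated with two regex passes; a simplified sketch, not the package's actual implementation:

```python
import re


def link_citations(text: str) -> str:
    """Turn bracketed citation numbers into anchor links, expanding ranges.

    [11]-[14] becomes four individual citations; each [N] then becomes
    [[N]](#ref-N), which renders as a clickable link to the bibliography.
    """
    # Expand ranges like [11]-[14] into [11][12][13][14] first.
    def expand(m):
        lo, hi = int(m.group(1)), int(m.group(2))
        return "".join(f"[{n}]" for n in range(lo, hi + 1))

    text = re.sub(r"\[(\d+)\]-\[(\d+)\]", expand, text)
    # Then link every bare [N] not already followed by a link target.
    return re.sub(r"\[(\d+)\](?!\()", r"[[\1]](#ref-\1)", text)
```

The real pipeline also inserts matching `<a id="ref-N"></a>` anchors into the bibliography so these links resolve.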
### 3. LLM Retouch (medium, high depth)

Uses an LLM to fix issues that need judgment:

- Author formatting — extracts names, affiliations, emails into a structured `## Authors` section
- Lettered sections — classifies `A. Background` as a header vs. `A. We conducted...` as a sentence
### 4. VLM + Enrichments (high depth)

Extracts structured data for RAG:

| File | Contents |
|---|---|
| `figures.json` | Caption, classification, VLM description, page number |
| `equations.json` | LaTeX representation, surrounding context |
| `code_blocks.json` | Code text, detected language |
| `enrichments.json` | All of the above combined |
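For RAG ingestion, these JSON files can be flattened into text chunks. A sketch that assumes each `figures.json` entry carries `caption`, `description`, and `page` fields; the actual schema may differ:

```python
import json
from pathlib import Path


def figure_chunks(output_dir: str) -> list:
    """Turn figures.json entries into retrieval-ready text chunks."""
    entries = json.loads(Path(output_dir, "figures.json").read_text())
    chunks = []
    for fig in entries:
        # Combine caption and VLM description into one searchable string.
        parts = [fig.get("caption", ""), fig.get("description", "")]
        text = " ".join(p for p in parts if p)
        if text:
            chunks.append(f"[figure, p.{fig.get('page', '?')}] {text}")
    return chunks
```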
## Local AI Setup

pdf2md supports running entirely locally using LM Studio or Ollama:

```bash
# Using LM Studio (default local provider)
export LM_STUDIO_HOST=http://localhost:1234/v1
uv run pdf2md convert paper.pdf ./output --local

# Using Ollama
export OLLAMA_HOST=http://localhost:11434
uv run pdf2md convert paper.pdf ./output --local --provider ollama

# Override model
uv run pdf2md convert paper.pdf ./output --local --model qwen3-8b

# VLM on a separate node
export PDF2MD_VLM_HOST=http://192.168.1.100:1234/v1
uv run pdf2md convert paper.pdf ./output -d high --local
```
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PDF2MD_TEXT_MODEL` | `qwen3-4b` | Text LLM for retouch |
| `PDF2MD_VLM_MODEL` | `qwen3-vl-4b` | VLM for figure descriptions |
| `PDF2MD_PROVIDER` | `lm_studio` | Default provider |
| `LM_STUDIO_HOST` | `http://localhost:1234/v1` | LM Studio endpoint |
| `PDF2MD_VLM_HOST` | `http://localhost:1234/v1` | VLM endpoint (can differ from text) |
| `OLLAMA_HOST` | `http://localhost:11434` | Ollama endpoint |
## Installation

```bash
# Standard install — includes Docling, Claude Agent SDK, and LiteLLM
pip install paper-to-md

# Pre-download Docling ML models (~500MB, one-time)
pdf2md download-models

# Docker microservice dependencies
pip install paper-to-md[service]

# Development (pytest + ruff)
pip install paper-to-md[dev]
```

### Requirements

- Python 3.10-3.12
- `uv` recommended for dependency management
## Batch Processing

```bash
# Convert all PDFs in a directory
uv run python scripts/batch_convert.py papers/ output/

# Fast batch (no AI)
uv run python scripts/batch_convert.py papers/ output/ --depth low

# Full batch with local LLM
uv run python scripts/batch_convert.py papers/ output/ --depth high --local

# Dry run to see what would be processed
uv run python scripts/batch_convert.py papers/ output/ --dry-run
```
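The batch workflow amounts to looping over PDFs and invoking the CLI once per file. A sketch of that loop; the real `scripts/batch_convert.py` may be implemented differently:

```python
import subprocess
from pathlib import Path


def batch_convert(papers_dir, output_dir, depth="medium", local=False,
                  dry_run=False, run=subprocess.run):
    """Convert every PDF under papers_dir by shelling out to pdf2md.

    `run` is injectable for testing; by default it executes the command.
    Returns the number of PDFs found.
    """
    pdfs = sorted(Path(papers_dir).glob("*.pdf"))
    for pdf in pdfs:
        cmd = ["pdf2md", "convert", str(pdf), str(output_dir), "-d", depth]
        if local:
            cmd.append("--local")
        if dry_run:
            print("would run:", " ".join(cmd))
            continue
        run(cmd, check=True)
    return len(pdfs)
```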
## License

MIT