
Convert academic PDF papers to clean markdown using Docling + AI cleanup

pdf2md

Convert academic PDF papers to clean, readable markdown with linked citations, embedded figures, and structured metadata for RAG systems.

Quick Start

# Install
uv tool install paper-to-md

# Pre-download Docling ML models (~500MB, one-time)
pdf2md download-models

# Convert a paper — uses medium depth by default (Docling + postprocess + LLM retouch)
pdf2md convert paper.pdf

# Output goes to ./paper/paper.md (same directory as the PDF)
# Or specify an output directory explicitly:
pdf2md convert paper.pdf ./output

Depth Levels

pdf2md uses a depth-based system to control how much processing is applied. The default is medium.

| Depth | Default? | What happens | AI required? |
|-------|----------|--------------|--------------|
| low | | Docling extraction + rule-based postprocessing (citations, figures, sections, cleanup) | No |
| medium | yes | Everything in low + LLM retouch via Claude Agent SDK (author formatting, lettered section headers, figure relocation, paragraph merging) | Yes (Claude API or --local) |
| high | | Everything in medium + VLM figure descriptions + code/equation enrichments | Yes (Claude API or --local) |

# Fast, no AI needed
pdf2md convert paper.pdf -d low

# Default — includes agentic LLM cleanup (Claude)
pdf2md convert paper.pdf

# Full pipeline — adds VLM figure descriptions and RAG metadata
pdf2md convert paper.pdf -d high

# Any depth with a local LLM instead of Claude
pdf2md convert paper.pdf --local
pdf2md convert paper.pdf -d high --local

Direct CLI Usage

pdf2md convert — Main Conversion

pdf2md convert paper.pdf [output_dir] [OPTIONS]

If output_dir is omitted, output goes to the same directory as the PDF.

| Option | Description |
|--------|-------------|
| -d, --depth | Analysis depth: low, medium (default), high |
| -l, --local | Use local LLM/VLM instead of cloud (Claude) |
| -p, --provider | LLM provider: lm_studio (default), ollama |
| -m, --model | Override LLM/VLM model name |
| --keep-raw | Save raw Docling extraction alongside processed output |
| --raw | Skip all processing, output only raw extraction |
| --images-scale N | Image resolution multiplier (default: 2.0) |
| --min-image-width | Minimum image width in pixels, filters logos (default: 200) |
| --min-image-height | Minimum image height in pixels (default: 150) |
| --min-image-area | Minimum image area in pixels (default: 40000) |

Output:

output/paper/
├── paper.md              # Final processed markdown
├── paper_raw.md          # Raw Docling output (if --keep-raw)
├── img/
│   ├── figure1.png
│   ├── figure2.png
│   └── ...
├── enrichments.json      # All metadata (depth=high only)
├── figures.json          # Figure metadata
├── equations.json        # Equations with LaTeX
└── code_blocks.json      # Code with language detection

pdf2md retouch — LLM Cleanup Only

Run LLM-based cleanup on an existing markdown file:

uv run pdf2md retouch paper.md [OPTIONS]

| Option | Description |
|--------|-------------|
| -l, --local | Use local LLM instead of cloud (Claude) |
| -p, --provider | LLM provider: lm_studio, ollama |
| -m, --model | Override LLM model name |
| -i, --images | Path to images directory (default: ./img) |
| -v, --verbose | Show detailed LLM progress |

The retouch step fixes:

  • Author formatting — Extracts and formats author names, affiliations, emails
  • Lettered section headers — Classifies A. Background as header vs A. We conducted... as sentence

pdf2md postprocess — Rule-Based Fixes Only

uv run pdf2md postprocess paper.md [OPTIONS]

| Option | Description |
|--------|-------------|
| -i, --images | Path to images directory (default: ./img) |
| -o, --output | Output path (default: overwrite input file) |

pdf2md enrich — Extract RAG Metadata

uv run pdf2md enrich paper.pdf ./output [OPTIONS]

| Option | Description |
|--------|-------------|
| --describe | Generate VLM descriptions for figures |
| -l, --local | Use local VLM instead of cloud |
| -p, --provider | VLM provider: lm_studio, ollama |
| -m, --model | Override VLM model |
| --images-scale N | Image resolution multiplier (default: 2.0) |

Service Mode

Run pdf2md as a Docker microservice for remote or homelab use. The service provides an HTTP API with Ed25519 signature authentication and async job processing via Redis/arq.

Docker Deployment

# Start all services (API, worker, PostgreSQL, Redis)
docker compose up -d --build

# Run database migrations
docker compose exec api alembic upgrade head

# Check logs
docker compose logs -f worker

API Endpoints

All endpoints require Ed25519 signature authentication (see Auth Setup).

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /submit_paper | Upload a PDF and enqueue conversion. Returns job_id. |
| GET | /status/{job_id} | Check job status, progress, and errors. |
| GET | /retrieve/{job_id} | Download completed results as tar.gz. |

Submit example:

curl -X POST http://your-server:8000/submit_paper \
  -F "file=@paper.pdf" \
  -F "depth=medium" \
  -H "Authorization: Signature <base64-sig>" \
  -H "X-Timestamp: $(date +%s)" \
  -H "X-Client-Id: <your-uuid>"

Auth Setup

The service uses Ed25519 keypairs for authentication. Each client has a UUID and a public key stored in the database; requests are signed with the corresponding private key.

Signature format: METHOD\nPATH\nTIMESTAMP signed with the client's Ed25519 private key.

Headers required:

  • Authorization: Signature <base64-signature>
  • X-Timestamp: <unix-epoch>
  • X-Client-Id: <client-uuid>

Timestamps must be within 5 minutes of server time (configurable via PDF2MD_SERVICE_AUTH_TIMESTAMP_TOLERANCE_SECONDS).
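As a sketch, a client can assemble these headers in a few lines of Python. The `sign` callable below is a placeholder for any Ed25519 implementation (e.g. `Ed25519PrivateKey.sign` from the `cryptography` package); the header names and canonical message format match the description above.

```python
import base64
import time


def build_auth_headers(method: str, path: str, client_id: str, sign) -> dict:
    """Build pdf2md service auth headers.

    `sign` is any callable taking bytes and returning a raw Ed25519
    signature (supply one from cryptography, pynacl, etc.).
    """
    timestamp = str(int(time.time()))
    # Canonical message: METHOD\nPATH\nTIMESTAMP
    message = f"{method}\n{path}\n{timestamp}".encode()
    signature = base64.b64encode(sign(message)).decode()
    return {
        "Authorization": f"Signature {signature}",
        "X-Timestamp": timestamp,
        "X-Client-Id": client_id,
    }
```

The server recomputes the same canonical string and verifies the signature against the client's stored public key, rejecting requests whose timestamp falls outside the tolerance window.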

Service Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| PDF2MD_SERVICE_DATABASE_URL | postgresql+asyncpg://... | PostgreSQL connection string |
| PDF2MD_SERVICE_REDIS_URL | redis://localhost:6379 | Redis connection string |
| PDF2MD_SERVICE_DATA_DIR | /data | Root data directory |
| PDF2MD_SERVICE_UPLOAD_DIR | /data/uploads | PDF upload storage |
| PDF2MD_SERVICE_AUTH_TIMESTAMP_TOLERANCE_SECONDS | 300 | Signature freshness window |
| PDF2MD_SERVICE_WORKER_MAX_JOBS | 1 | Concurrent conversion jobs |

Claude Code Integration

MCP Server

The mcp/server.py script exposes the service API as MCP tools for Claude Code. It loads credentials from a .env file in the repo root.

Register the server:

claude mcp add --scope user pdf2md-service -- uv run /path/to/paper-to-md/mcp/server.py

Required .env variables (not committed — see .env.example):

PDF2MD_SERVICE_URL=http://your-server:8000
PDF2MD_CLIENT_ID=00000000-0000-0000-0000-000000000001
PDF2MD_PRIVATE_KEY=<base64-ed25519-private-key>

Tools provided:

| Tool | Description |
|------|-------------|
| pdf2md_submit | Upload a PDF and start conversion. Returns job ID. |
| pdf2md_status | Poll job status and progress. |
| pdf2md_retrieve | Download and extract completed results. |

/convert-paper Command

A project-level slash command in .claude/commands/convert-paper.md that orchestrates the full conversion workflow.

/convert-paper path/to/paper.pdf

This submits the PDF, polls for completion, downloads results, and reports extracted files. Auto-discovered by Claude Code when working in this repo.

Processing Pipeline

1. Docling Extraction

Uses Docling (ML-based) to extract:

  • Text with structure (headings, paragraphs, lists)
  • Tables with formatting
  • Figures as images
  • Equations

2. Deterministic Post-Processing

Applied at all depth levels (including low):

Citations:

  • [7] → [[7]](#ref-7) (clickable links)
  • [11]-[14] → expanded to four individual linked citations
  • Anchors added to reference entries for link targets
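The citation rewriting above can be sketched as two substitutions; this is a simplified illustration, not the project's actual implementation, and it ignores spacing and edge cases:

```python
import re


def link_citations(text: str) -> str:
    """Expand [11]-[14] into individual citations, then turn each
    [N] into a clickable link to its #ref-N anchor."""
    def expand(m):
        lo, hi = int(m.group(1)), int(m.group(2))
        return "".join(f"[{n}]" for n in range(lo, hi + 1))

    # First expand ranges, then link every remaining [N]
    text = re.sub(r"\[(\d+)\]-\[(\d+)\]", expand, text)
    return re.sub(r"\[(\d+)\]", r"[[\1]](#ref-\1)", text)
```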

Sections:

  • Abstract -Text here → ## Abstract\n\nText here
  • Hierarchical section numbering → proper markdown headers

Figures:

  • Embeds ![Figure N](./img/figureN.png) above line-start captions
  • Each figure embedded exactly once
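The embedding rule can be sketched as a single pass over line-start captions; the caption pattern and image path layout here are assumptions for illustration:

```python
import re


def embed_figures(markdown: str) -> str:
    """Insert an image embed above each line-start 'Figure N' caption,
    at most once per figure number."""
    seen = set()

    def repl(m):
        n = m.group(1)
        if n in seen:  # embed each figure exactly once
            return m.group(0)
        seen.add(n)
        return f"![Figure {n}](./img/figure{n}.png)\n\n{m.group(0)}"

    return re.sub(r"(?m)^Figure (\d+)[.:]", repl, markdown)
```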

Bibliography:

  • Adds <a id="ref-N"></a> anchors to reference entries
  • Ensures proper spacing between entries

Cleanup:

  • Fixes ligatures (ﬁ→fi, ﬂ→fl)
  • Removes GLYPH artifacts from OCR
  • Fixes hyphenated word breaks across lines
  • Merges split paragraphs
  • Removes OCR garbage near figure embeds
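A simplified sketch of the first three cleanup rules (the `GLYPH<...>` pattern is an assumption about how Docling renders unmapped OCR glyphs):

```python
import re


def clean_text(text: str) -> str:
    """Rule-based cleanup: ligatures, GLYPH artifacts, hyphen breaks."""
    text = text.replace("\ufb01", "fi").replace("\ufb02", "fl")  # ligatures
    text = re.sub(r"GLYPH<[^>]*>", "", text)  # OCR glyph artifacts
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)  # rejoin hyphenated breaks
    return text
```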

3. LLM Retouch (medium, high depth)

Uses LLM to fix issues that need judgment:

  • Author formatting — Extracts names, affiliations, emails into structured ## Authors section
  • Lettered sections — Classifies A. Background as header vs A. We conducted... as sentence

4. VLM + Enrichments (high depth)

Extracts structured data for RAG:

| File | Contents |
|------|----------|
| figures.json | Caption, classification, VLM description, page number |
| equations.json | LaTeX representation, surrounding context |
| code_blocks.json | Code text, detected language |
| enrichments.json | All of the above combined |
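As one illustration of consuming these files in a RAG pipeline, the sketch below flattens figures.json into indexable text chunks. The field names (page, caption, description) are assumptions; inspect your own figures.json for the exact schema.

```python
import json
from pathlib import Path


def figure_chunks(output_dir: str) -> list[str]:
    """Turn figures.json entries into plain-text chunks for indexing."""
    figures = json.loads(Path(output_dir, "figures.json").read_text())
    return [
        f"Figure (p. {f.get('page', '?')}): {f.get('caption', '')} "
        f"{f.get('description', '')}".strip()
        for f in figures
    ]
```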

Local AI Setup

pdf2md supports running entirely locally using LM Studio or Ollama:

# Using LM Studio (default local provider)
export LM_STUDIO_HOST=http://localhost:1234/v1
uv run pdf2md convert paper.pdf ./output --local

# Using Ollama
export OLLAMA_HOST=http://localhost:11434
uv run pdf2md convert paper.pdf ./output --local --provider ollama

# Override model
uv run pdf2md convert paper.pdf ./output --local --model qwen3-8b

# VLM on a separate node
export PDF2MD_VLM_HOST=http://192.168.1.100:1234/v1
uv run pdf2md convert paper.pdf ./output -d high --local

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| PDF2MD_TEXT_MODEL | qwen3-4b | Text LLM for retouch |
| PDF2MD_VLM_MODEL | qwen3-vl-4b | VLM for figure descriptions |
| PDF2MD_PROVIDER | lm_studio | Default provider |
| LM_STUDIO_HOST | http://localhost:1234/v1 | LM Studio endpoint |
| PDF2MD_VLM_HOST | http://localhost:1234/v1 | VLM endpoint (can differ from text) |
| OLLAMA_HOST | http://localhost:11434 | Ollama endpoint |

Installation

# Install as a standalone tool (recommended)
uv tool install paper-to-md

# Pre-download Docling ML models (~500MB, one-time)
pdf2md download-models

Alternative install methods:

# Install into a project
uv add paper-to-md

# pip works too
pip install paper-to-md

# Docker microservice dependencies (quote extras so the shell doesn't glob them)
uv tool install "paper-to-md[service]"

# Development (pytest + ruff)
uv pip install "paper-to-md[dev]"

Requirements

  • Python 3.10-3.12
  • uv recommended for installation and dependency management

Batch Processing

# Convert all PDFs in a directory
uv run python scripts/batch_convert.py papers/ output/

# Fast batch (no AI)
uv run python scripts/batch_convert.py papers/ output/ --depth low

# Full batch with local LLM
uv run python scripts/batch_convert.py papers/ output/ --depth high --local

# Dry run to see what would be processed
uv run python scripts/batch_convert.py papers/ output/ --dry-run

License

MIT
