
pdf2md

Convert academic PDF papers to clean, readable markdown with linked citations, embedded figures, and structured metadata for RAG systems.

Quick Start

# Install
pip install paper-to-md

# Pre-download Docling ML models (~500MB, one-time)
pdf2md download-models

# Convert a paper (Docling + postprocess + LLM retouch)
pdf2md convert paper.pdf ./output

# Fast conversion (no AI)
pdf2md convert paper.pdf ./output -d low

# Full pipeline with local LLM
pdf2md convert paper.pdf ./output -d high --local

Depth Levels

pdf2md uses a depth-based system to control how much processing is applied:

| Depth | What happens | Speed |
|---|---|---|
| low | Docling extraction + rule-based postprocessing (citations, figures, sections, cleanup) | Fast, no AI |
| medium | + LLM retouch (author formatting, lettered section detection) | Moderate |
| high | + VLM figure descriptions + code/equation enrichments | Slow |

Direct CLI Usage

pdf2md convert — Main Conversion

uv run pdf2md convert paper.pdf ./output [OPTIONS]
| Option | Description |
|---|---|
| -d, --depth | Analysis depth: low, medium (default), high |
| -l, --local | Use local LLM/VLM instead of cloud (Claude) |
| -p, --provider | LLM provider: lm_studio (default), ollama |
| -m, --model | Override LLM/VLM model name |
| --keep-raw | Save raw Docling extraction alongside processed output |
| --raw | Skip all processing, output only raw extraction |
| --images-scale N | Image resolution multiplier (default: 2.0) |
| --min-image-width | Minimum image width in pixels, filters logos (default: 200) |
| --min-image-height | Minimum image height in pixels (default: 150) |
| --min-image-area | Minimum image area in pixels (default: 40000) |

Output:

output/paper/
├── paper.md              # Final processed markdown
├── paper_raw.md          # Raw Docling output (if --keep-raw)
├── img/
│   ├── figure1.png
│   ├── figure2.png
│   └── ...
├── enrichments.json      # All metadata (depth=high only)
├── figures.json          # Figure metadata
├── equations.json        # Equations with LaTeX
└── code_blocks.json      # Code with language detection

pdf2md retouch — LLM Cleanup Only

Run LLM-based cleanup on an existing markdown file:

uv run pdf2md retouch paper.md [OPTIONS]
| Option | Description |
|---|---|
| -l, --local | Use local LLM instead of cloud (Claude) |
| -p, --provider | LLM provider: lm_studio, ollama |
| -m, --model | Override LLM model name |
| -i, --images | Path to images directory (default: ./img) |
| -v, --verbose | Show detailed LLM progress |

The retouch step fixes:

  • Author formatting — Extracts and formats author names, affiliations, emails
  • Lettered section headers — Distinguishes headers like A. Background from sentences like A. We conducted...

pdf2md postprocess — Rule-Based Fixes Only

uv run pdf2md postprocess paper.md [OPTIONS]
| Option | Description |
|---|---|
| -i, --images | Path to images directory (default: ./img) |
| -o, --output | Output path (default: overwrite input file) |

pdf2md enrich — Extract RAG Metadata

uv run pdf2md enrich paper.pdf ./output [OPTIONS]
| Option | Description |
|---|---|
| --describe | Generate VLM descriptions for figures |
| -l, --local | Use local VLM instead of cloud |
| -p, --provider | VLM provider: lm_studio, ollama |
| -m, --model | Override VLM model |
| --images-scale N | Image resolution multiplier (default: 2.0) |

Service Mode

Run pdf2md as a Docker microservice for remote or homelab use. The service provides an HTTP API with Ed25519 signature authentication and async job processing via Redis/arq.

Docker Deployment

# Start all services (API, worker, PostgreSQL, Redis)
docker compose up -d --build

# Run database migrations
docker compose exec api alembic upgrade head

# Check logs
docker compose logs -f worker

API Endpoints

All endpoints require Ed25519 signature authentication (see Auth Setup).

| Method | Endpoint | Description |
|---|---|---|
| POST | /submit_paper | Upload a PDF and enqueue conversion. Returns job_id. |
| GET | /status/{job_id} | Check job status, progress, and errors. |
| GET | /retrieve/{job_id} | Download completed results as tar.gz. |

Submit example:

curl -X POST http://your-server:8000/submit_paper \
  -F "file=@paper.pdf" \
  -F "depth=medium" \
  -H "Authorization: Signature <base64-sig>" \
  -H "X-Timestamp: $(date +%s)" \
  -H "X-Client-Id: <your-uuid>"

Auth Setup

The service uses Ed25519 keypairs for authentication. Each client has a UUID and a public key stored in the database; requests are signed with the corresponding private key.

Signature format: METHOD\nPATH\nTIMESTAMP signed with the client's Ed25519 private key.

Headers required:

  • Authorization: Signature <base64-signature>
  • X-Timestamp: <unix-epoch>
  • X-Client-Id: <client-uuid>

Timestamps must be within 5 minutes of server time (configurable via PDF2MD_SERVICE_AUTH_TIMESTAMP_TOLERANCE_SECONDS).
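
The signing scheme above can be sketched in Python. This is an illustrative client helper, not code from the package: the canonical message layout is taken from the description above, and the `sign` callable stands in for an Ed25519 signer (for example, the `sign` method of an `Ed25519PrivateKey` from the `cryptography` library):

```python
import base64
import time
from typing import Callable, Optional


def canonical_message(method: str, path: str, timestamp: str) -> bytes:
    """Bytes the service expects to be signed: METHOD, PATH, TIMESTAMP joined by newlines."""
    return f"{method}\n{path}\n{timestamp}".encode()


def signed_headers(
    sign: Callable[[bytes], bytes],
    client_id: str,
    method: str,
    path: str,
    timestamp: Optional[str] = None,
) -> dict:
    """Build the three auth headers for a pdf2md service request.

    `sign` must produce an Ed25519 signature over the canonical message,
    e.g. Ed25519PrivateKey.sign from the cryptography library.
    """
    ts = timestamp or str(int(time.time()))
    signature = base64.b64encode(sign(canonical_message(method, path, ts))).decode()
    return {
        "Authorization": f"Signature {signature}",
        "X-Timestamp": ts,
        "X-Client-Id": client_id,
    }
```

Pass the returned dict as the headers of the HTTP request (mirroring the curl example above); the server rejects signatures whose timestamp falls outside the tolerance window.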

Service Environment Variables

| Variable | Default | Description |
|---|---|---|
| PDF2MD_SERVICE_DATABASE_URL | postgresql+asyncpg://... | PostgreSQL connection string |
| PDF2MD_SERVICE_REDIS_URL | redis://localhost:6379 | Redis connection string |
| PDF2MD_SERVICE_DATA_DIR | /data | Root data directory |
| PDF2MD_SERVICE_UPLOAD_DIR | /data/uploads | PDF upload storage |
| PDF2MD_SERVICE_AUTH_TIMESTAMP_TOLERANCE_SECONDS | 300 | Signature freshness window |
| PDF2MD_SERVICE_WORKER_MAX_JOBS | 1 | Concurrent conversion jobs |

Claude Code Integration

MCP Server

The mcp/server.py script exposes the service API as MCP tools for Claude Code. It loads credentials from a .env file in the repo root.

Register the server:

claude mcp add --scope user pdf2md-service -- uv run /path/to/paper-to-md/mcp/server.py

Required .env variables (not committed — see .env.example):

PDF2MD_SERVICE_URL=http://your-server:8000
PDF2MD_CLIENT_ID=00000000-0000-0000-0000-000000000001
PDF2MD_PRIVATE_KEY=<base64-ed25519-private-key>

Tools provided:

| Tool | Description |
|---|---|
| pdf2md_submit | Upload a PDF and start conversion. Returns job ID. |
| pdf2md_status | Poll job status and progress. |
| pdf2md_retrieve | Download and extract completed results. |

/convert-paper Command

A project-level slash command in .claude/commands/convert-paper.md that orchestrates the full conversion workflow.

/convert-paper path/to/paper.pdf

This submits the PDF, polls for completion, downloads results, and reports extracted files. Auto-discovered by Claude Code when working in this repo.

Processing Pipeline

1. Docling Extraction

Uses Docling (ML-based) to extract:

  • Text with structure (headings, paragraphs, lists)
  • Tables with formatting
  • Figures as images
  • Equations

2. Deterministic Post-Processing

Applied at all depth levels (including low):

Citations:

  • [7] → [[7]](#ref-7) (clickable links)
  • [11]-[14] → expanded to four individual linked citations
  • Anchors added to reference entries for link targets
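
As a rough illustration (not the package's actual implementation), the range expansion and linking could look like:

```python
import re


def link_citations(text: str) -> str:
    """Expand ranges like [11]-[14] into individual citations, then turn
    each bare [N] into a clickable link to its #ref-N anchor."""

    def expand(m: re.Match) -> str:
        lo, hi = int(m.group(1)), int(m.group(2))
        return ", ".join(f"[{n}]" for n in range(lo, hi + 1))

    text = re.sub(r"\[(\d+)\]-\[(\d+)\]", expand, text)
    # Link bare [N] markers; skip ones already inside a markdown link.
    return re.sub(r"\[(\d+)\](?![\(\]])", r"[[\1]](#ref-\1)", text)
```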

Sections:

  • Abstract -Text here → ## Abstract\n\nText here
  • Hierarchical section numbering → proper markdown headers
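
A sketch of the header promotion, assuming a fixed list of section labels (illustrative only; per the bullets above, the real rules also handle hierarchical section numbering):

```python
import re

# Illustrative label list; the package's actual rule set is broader.
SECTION_LABELS = ("Abstract", "Introduction", "Conclusion", "References")


def promote_sections(text: str) -> str:
    """Turn inline labels like 'Abstract -Text here' into markdown headers."""
    pattern = r"^(%s) -\s*" % "|".join(SECTION_LABELS)
    return re.sub(pattern, r"## \1\n\n", text, flags=re.MULTILINE)
```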

Figures:

  • Embeds ![Figure N](./img/figureN.png) above line-start captions
  • Each figure embedded exactly once

Bibliography:

  • Adds <a id="ref-N"></a> anchors to reference entries
  • Ensures proper spacing between entries

Cleanup:

  • Fixes ligatures (ﬁ→fi, ﬂ→fl)
  • Removes GLYPH artifacts from OCR
  • Fixes hyphenated word breaks across lines
  • Merges split paragraphs
  • Removes OCR garbage near figure embeds
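
A minimal sketch of two of these cleanup rules, ligature normalization and hyphenated line-break joining; the package's real rules are more extensive:

```python
import re

# Common PDF ligature glyphs and their ASCII expansions.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}


def clean_text(text: str) -> str:
    for glyph, ascii_form in LIGATURES.items():
        text = text.replace(glyph, ascii_form)
    # Join words broken by a hyphen at a line end, e.g. "implemen-\ntation".
    # Only join when the continuation starts lowercase, so hyphenated
    # compounds that merely wrap are mostly left alone.
    return re.sub(r"(\w+)-\n([a-z]\w*)", r"\1\2", text)
```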

3. LLM Retouch (medium, high depth)

Uses LLM to fix issues that need judgment:

  • Author formatting — Extracts names, affiliations, emails into structured ## Authors section
  • Lettered sections — Distinguishes headers like A. Background from sentences like A. We conducted...

4. VLM + Enrichments (high depth)

Extracts structured data for RAG:

| File | Contents |
|---|---|
| figures.json | Caption, classification, VLM description, page number |
| equations.json | LaTeX representation, surrounding context |
| code_blocks.json | Code text, detected language |
| enrichments.json | All of the above combined |
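
For RAG ingestion, these files can be read straight into text chunks. A sketch, assuming the field names implied by the table above (caption, description, page); verify them against a figures.json from your own run:

```python
import json
from pathlib import Path


def figure_chunks(output_dir: str) -> list:
    """One retrievable text chunk per figure: caption plus VLM description.

    The field names used here are assumptions, not a documented schema.
    """
    records = json.loads(Path(output_dir, "figures.json").read_text())
    return [
        f"{fig.get('caption', '')}\n{fig.get('description', '')} (page {fig.get('page', '?')})"
        for fig in records
    ]
```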

Local AI Setup

pdf2md supports running entirely locally using LM Studio or Ollama:

# Using LM Studio (default local provider)
export LM_STUDIO_HOST=http://localhost:1234/v1
uv run pdf2md convert paper.pdf ./output --local

# Using Ollama
export OLLAMA_HOST=http://localhost:11434
uv run pdf2md convert paper.pdf ./output --local --provider ollama

# Override model
uv run pdf2md convert paper.pdf ./output --local --model qwen3-8b

# VLM on a separate node
export PDF2MD_VLM_HOST=http://192.168.1.100:1234/v1
uv run pdf2md convert paper.pdf ./output -d high --local

Environment Variables

| Variable | Default | Description |
|---|---|---|
| PDF2MD_TEXT_MODEL | qwen3-4b | Text LLM for retouch |
| PDF2MD_VLM_MODEL | qwen3-vl-4b | VLM for figure descriptions |
| PDF2MD_PROVIDER | lm_studio | Default provider |
| LM_STUDIO_HOST | http://localhost:1234/v1 | LM Studio endpoint |
| PDF2MD_VLM_HOST | http://localhost:1234/v1 | VLM endpoint (can differ from text) |
| OLLAMA_HOST | http://localhost:11434 | Ollama endpoint |

Installation

# Standard install — includes Docling, Claude Agent SDK, and LiteLLM
pip install paper-to-md

# Pre-download Docling ML models (~500MB, one-time)
pdf2md download-models

# Docker microservice dependencies
pip install paper-to-md[service]

# Development (pytest + ruff)
pip install paper-to-md[dev]

Requirements

  • Python 3.10-3.12
  • uv recommended for dependency management

Batch Processing

# Convert all PDFs in a directory
uv run python scripts/batch_convert.py papers/ output/

# Fast batch (no AI)
uv run python scripts/batch_convert.py papers/ output/ --depth low

# Full batch with local LLM
uv run python scripts/batch_convert.py papers/ output/ --depth high --local

# Dry run to see what would be processed
uv run python scripts/batch_convert.py papers/ output/ --dry-run

License

MIT

Download files

Source distribution: paper_to_md-0.2.0.tar.gz (259.8 kB)
Built distribution: paper_to_md-0.2.0-py3-none-any.whl (52.7 kB, Python 3)

Both files were uploaded via Trusted Publishing (twine/6.1.0, CPython/3.13.7), with attestations from publish.yml on JaimeCernuda/paper-to-md. Attestation values reflect the state when the release was signed and may no longer be current.

File hashes:

| File | Algorithm | Hash digest |
|---|---|---|
| paper_to_md-0.2.0.tar.gz | SHA256 | 346426d5888361dc7d55164fafe853dd932610d1dfc8fea0664e9cf3879098af |
| paper_to_md-0.2.0.tar.gz | MD5 | 27b148bb404cc0d0f322d62dcab4d72b |
| paper_to_md-0.2.0.tar.gz | BLAKE2b-256 | 84b81af6243efc1e05408c46c56933995d71e1700c34cd8ab053f6fa41f669e6 |
| paper_to_md-0.2.0-py3-none-any.whl | SHA256 | ed7f62a854d66a183465a0a19e437d519655a8b2a4088178723f7332374984dc |
| paper_to_md-0.2.0-py3-none-any.whl | MD5 | 11f8c7b01b28a723ae66f0ccece3be73 |
| paper_to_md-0.2.0-py3-none-any.whl | BLAKE2b-256 | 98db96a132b46643d51ea08769101ffa995d63d3ddd21d5b3aad6f57cf5b946c |
