
pdf2md

Convert academic PDF papers to clean, readable markdown with linked citations, embedded figures, and structured metadata for RAG systems.

Quick Start

# Install
pip install paper-to-md

# Pre-download Docling ML models (~500MB, one-time)
pdf2md download-models

# Convert a paper (Docling + postprocess + LLM retouch)
pdf2md convert paper.pdf ./output

# Fast conversion (no AI)
pdf2md convert paper.pdf ./output -d low

# Full pipeline with local LLM
pdf2md convert paper.pdf ./output -d high --local

Depth Levels

pdf2md uses a depth-based system to control how much processing is applied:

| Depth | What happens | Speed |
|---|---|---|
| low | Docling extraction + rule-based postprocessing (citations, figures, sections, cleanup) | Fast, no AI |
| medium | + LLM retouch (author formatting, lettered section detection) | Moderate |
| high | + VLM figure descriptions + code/equation enrichments | Slow |

Direct CLI Usage

pdf2md convert — Main Conversion

uv run pdf2md convert paper.pdf ./output [OPTIONS]
| Option | Description |
|---|---|
| -d, --depth | Analysis depth: low, medium (default), high |
| -l, --local | Use local LLM/VLM instead of cloud (Claude) |
| -p, --provider | LLM provider: lm_studio (default), ollama |
| -m, --model | Override LLM/VLM model name |
| --keep-raw | Save raw Docling extraction alongside processed output |
| --raw | Skip all processing, output only raw extraction |
| --images-scale N | Image resolution multiplier (default: 2.0) |
| --min-image-width | Minimum image width in pixels, filters logos (default: 200) |
| --min-image-height | Minimum image height in pixels (default: 150) |
| --min-image-area | Minimum image area in pixels (default: 40000) |

Output:

output/paper/
├── paper.md              # Final processed markdown
├── paper_raw.md          # Raw Docling output (if --keep-raw)
├── img/
│   ├── figure1.png
│   ├── figure2.png
│   └── ...
├── enrichments.json      # All metadata (depth=high only)
├── figures.json          # Figure metadata
├── equations.json        # Equations with LaTeX
└── code_blocks.json      # Code with language detection

pdf2md retouch — LLM Cleanup Only

Run LLM-based cleanup on an existing markdown file:

uv run pdf2md retouch paper.md [OPTIONS]
| Option | Description |
|---|---|
| -l, --local | Use local LLM instead of cloud (Claude) |
| -p, --provider | LLM provider: lm_studio, ollama |
| -m, --model | Override LLM model name |
| -i, --images | Path to images directory (default: ./img) |
| -v, --verbose | Show detailed LLM progress |

The retouch step fixes:

  • Author formatting — Extracts and formats author names, affiliations, emails
  • Lettered section headers — Distinguishes headers like A. Background from sentences like A. We conducted...

pdf2md postprocess — Rule-Based Fixes Only

uv run pdf2md postprocess paper.md [OPTIONS]
| Option | Description |
|---|---|
| -i, --images | Path to images directory (default: ./img) |
| -o, --output | Output path (default: overwrite input file) |

pdf2md enrich — Extract RAG Metadata

uv run pdf2md enrich paper.pdf ./output [OPTIONS]
| Option | Description |
|---|---|
| --describe | Generate VLM descriptions for figures |
| -l, --local | Use local VLM instead of cloud |
| -p, --provider | VLM provider: lm_studio, ollama |
| -m, --model | Override VLM model |
| --images-scale N | Image resolution multiplier (default: 2.0) |

Service Mode

Run pdf2md as a Docker microservice for remote or homelab use. The service provides an HTTP API with Ed25519 signature authentication and async job processing via Redis/arq.

Docker Deployment

# Start all services (API, worker, PostgreSQL, Redis)
docker compose up -d --build

# Run database migrations
docker compose exec api alembic upgrade head

# Check logs
docker compose logs -f worker

API Endpoints

All endpoints require Ed25519 signature authentication (see Auth Setup).

| Method | Endpoint | Description |
|---|---|---|
| POST | /submit_paper | Upload a PDF and enqueue conversion. Returns job_id. |
| GET | /status/{job_id} | Check job status, progress, and errors. |
| GET | /retrieve/{job_id} | Download completed results as tar.gz. |

Submit example:

curl -X POST http://your-server:8000/submit_paper \
  -F "file=@paper.pdf" \
  -F "depth=medium" \
  -H "Authorization: Signature <base64-sig>" \
  -H "X-Timestamp: $(date +%s)" \
  -H "X-Client-Id: <your-uuid>"

Auth Setup

The service uses Ed25519 keypairs for authentication. Each client has a UUID and a public key stored in the database; requests are signed with the corresponding private key.

Signature format: METHOD\nPATH\nTIMESTAMP signed with the client's Ed25519 private key.

Headers required:

  • Authorization: Signature <base64-signature>
  • X-Timestamp: <unix-epoch>
  • X-Client-Id: <client-uuid>

Timestamps must be within 5 minutes of server time (configurable via PDF2MD_SERVICE_AUTH_TIMESTAMP_TOLERANCE_SECONDS).
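
The signing scheme above can be sketched in Python. This is an illustrative client helper, not code from the package: the canonical message layout is taken from the description above, and the `sign` callable stands in for an Ed25519 signer (for example, the `sign` method of an `Ed25519PrivateKey` from the `cryptography` library):

```python
import base64
import time
from typing import Callable, Optional


def canonical_message(method: str, path: str, timestamp: str) -> bytes:
    """Bytes the service expects to be signed: METHOD, PATH, TIMESTAMP joined by newlines."""
    return f"{method}\n{path}\n{timestamp}".encode()


def signed_headers(
    sign: Callable[[bytes], bytes],
    client_id: str,
    method: str,
    path: str,
    timestamp: Optional[str] = None,
) -> dict:
    """Build the three auth headers for a pdf2md service request.

    `sign` must produce an Ed25519 signature over the canonical message,
    e.g. Ed25519PrivateKey.sign from the cryptography library.
    """
    ts = timestamp or str(int(time.time()))
    signature = base64.b64encode(sign(canonical_message(method, path, ts))).decode()
    return {
        "Authorization": f"Signature {signature}",
        "X-Timestamp": ts,
        "X-Client-Id": client_id,
    }
```

Pass the returned dict as the headers of the HTTP request (mirroring the curl example above); the server rejects signatures whose timestamp falls outside the tolerance window.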

Service Environment Variables

| Variable | Default | Description |
|---|---|---|
| PDF2MD_SERVICE_DATABASE_URL | postgresql+asyncpg://... | PostgreSQL connection string |
| PDF2MD_SERVICE_REDIS_URL | redis://localhost:6379 | Redis connection string |
| PDF2MD_SERVICE_DATA_DIR | /data | Root data directory |
| PDF2MD_SERVICE_UPLOAD_DIR | /data/uploads | PDF upload storage |
| PDF2MD_SERVICE_AUTH_TIMESTAMP_TOLERANCE_SECONDS | 300 | Signature freshness window |
| PDF2MD_SERVICE_WORKER_MAX_JOBS | 1 | Concurrent conversion jobs |

Claude Code Integration

MCP Server

The mcp/server.py script exposes the service API as MCP tools for Claude Code. It loads credentials from a .env file in the repo root.

Register the server:

claude mcp add --scope user pdf2md-service -- uv run /path/to/paper-to-md/mcp/server.py

Required .env variables (not committed — see .env.example):

PDF2MD_SERVICE_URL=http://your-server:8000
PDF2MD_CLIENT_ID=00000000-0000-0000-0000-000000000001
PDF2MD_PRIVATE_KEY=<base64-ed25519-private-key>

Tools provided:

| Tool | Description |
|---|---|
| pdf2md_submit | Upload a PDF and start conversion. Returns job ID. |
| pdf2md_status | Poll job status and progress. |
| pdf2md_retrieve | Download and extract completed results. |

/convert-paper Command

A project-level slash command in .claude/commands/convert-paper.md that orchestrates the full conversion workflow.

/convert-paper path/to/paper.pdf

This submits the PDF, polls for completion, downloads results, and reports extracted files. Auto-discovered by Claude Code when working in this repo.

Processing Pipeline

1. Docling Extraction

Uses Docling (ML-based) to extract:

  • Text with structure (headings, paragraphs, lists)
  • Tables with formatting
  • Figures as images
  • Equations

2. Deterministic Post-Processing

Applied at all depth levels (including low):

Citations:

  • [7] → [[7]](#ref-7) (clickable links)
  • [11]-[14] → expanded to four individual linked citations
  • Anchors added to reference entries for link targets
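
As a rough illustration (not the package's actual implementation), the range expansion and linking could look like:

```python
import re


def link_citations(text: str) -> str:
    """Expand ranges like [11]-[14] into individual citations, then turn
    each bare [N] into a clickable link to its #ref-N anchor."""

    def expand(m: re.Match) -> str:
        lo, hi = int(m.group(1)), int(m.group(2))
        return ", ".join(f"[{n}]" for n in range(lo, hi + 1))

    text = re.sub(r"\[(\d+)\]-\[(\d+)\]", expand, text)
    # Link bare [N] markers; skip ones already inside a markdown link.
    return re.sub(r"\[(\d+)\](?![\(\]])", r"[[\1]](#ref-\1)", text)
```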

Sections:

  • Abstract -Text here → ## Abstract\n\nText here
  • Hierarchical section numbering → proper markdown headers
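
A sketch of the header promotion, assuming a fixed list of section labels (illustrative only; per the bullets above, the real rules also handle hierarchical section numbering):

```python
import re

# Illustrative label list; the package's actual rule set is broader.
SECTION_LABELS = ("Abstract", "Introduction", "Conclusion", "References")


def promote_sections(text: str) -> str:
    """Turn inline labels like 'Abstract -Text here' into markdown headers."""
    pattern = r"^(%s) -\s*" % "|".join(SECTION_LABELS)
    return re.sub(pattern, r"## \1\n\n", text, flags=re.MULTILINE)
```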

Figures:

  • Embeds ![Figure N](./img/figureN.png) above line-start captions
  • Each figure embedded exactly once

Bibliography:

  • Adds <a id="ref-N"></a> anchors to reference entries
  • Ensures proper spacing between entries

Cleanup:

  • Fixes ligatures (ﬁ→fi, ﬂ→fl)
  • Removes GLYPH artifacts from OCR
  • Fixes hyphenated word breaks across lines
  • Merges split paragraphs
  • Removes OCR garbage near figure embeds
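
A minimal sketch of two of these cleanup rules, ligature normalization and hyphenated line-break joining; the package's real rules are more extensive:

```python
import re

# Common PDF ligature glyphs and their ASCII expansions.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}


def clean_text(text: str) -> str:
    for glyph, ascii_form in LIGATURES.items():
        text = text.replace(glyph, ascii_form)
    # Join words broken by a hyphen at a line end, e.g. "implemen-\ntation".
    # Only join when the continuation starts lowercase, so hyphenated
    # compounds that merely wrap are mostly left alone.
    return re.sub(r"(\w+)-\n([a-z]\w*)", r"\1\2", text)
```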

3. LLM Retouch (medium, high depth)

Uses LLM to fix issues that need judgment:

  • Author formatting — Extracts names, affiliations, emails into structured ## Authors section
  • Lettered sections — Distinguishes headers like A. Background from sentences like A. We conducted...

4. VLM + Enrichments (high depth)

Extracts structured data for RAG:

| File | Contents |
|---|---|
| figures.json | Caption, classification, VLM description, page number |
| equations.json | LaTeX representation, surrounding context |
| code_blocks.json | Code text, detected language |
| enrichments.json | All of the above combined |
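
For RAG ingestion, these files can be read straight into text chunks. A sketch, assuming the field names implied by the table above (caption, description, page); verify them against a figures.json from your own run:

```python
import json
from pathlib import Path


def figure_chunks(output_dir: str) -> list:
    """One retrievable text chunk per figure: caption plus VLM description.

    The field names used here are assumptions, not a documented schema.
    """
    records = json.loads(Path(output_dir, "figures.json").read_text())
    return [
        f"{fig.get('caption', '')}\n{fig.get('description', '')} (page {fig.get('page', '?')})"
        for fig in records
    ]
```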

Local AI Setup

pdf2md supports running entirely locally using LM Studio or Ollama:

# Using LM Studio (default local provider)
export LM_STUDIO_HOST=http://localhost:1234/v1
uv run pdf2md convert paper.pdf ./output --local

# Using Ollama
export OLLAMA_HOST=http://localhost:11434
uv run pdf2md convert paper.pdf ./output --local --provider ollama

# Override model
uv run pdf2md convert paper.pdf ./output --local --model qwen3-8b

# VLM on a separate node
export PDF2MD_VLM_HOST=http://192.168.1.100:1234/v1
uv run pdf2md convert paper.pdf ./output -d high --local

Environment Variables

| Variable | Default | Description |
|---|---|---|
| PDF2MD_TEXT_MODEL | qwen3-4b | Text LLM for retouch |
| PDF2MD_VLM_MODEL | qwen3-vl-4b | VLM for figure descriptions |
| PDF2MD_PROVIDER | lm_studio | Default provider |
| LM_STUDIO_HOST | http://localhost:1234/v1 | LM Studio endpoint |
| PDF2MD_VLM_HOST | http://localhost:1234/v1 | VLM endpoint (can differ from text) |
| OLLAMA_HOST | http://localhost:11434 | Ollama endpoint |

Installation

# Standard install — includes Docling, Claude Agent SDK, and LiteLLM
pip install paper-to-md

# Pre-download Docling ML models (~500MB, one-time)
pdf2md download-models

# Docker microservice dependencies
pip install paper-to-md[service]

# Development (pytest + ruff)
pip install paper-to-md[dev]

Requirements

  • Python 3.10-3.12
  • uv recommended for dependency management

Batch Processing

# Convert all PDFs in a directory
uv run python scripts/batch_convert.py papers/ output/

# Fast batch (no AI)
uv run python scripts/batch_convert.py papers/ output/ --depth low

# Full batch with local LLM
uv run python scripts/batch_convert.py papers/ output/ --depth high --local

# Dry run to see what would be processed
uv run python scripts/batch_convert.py papers/ output/ --dry-run

License

MIT

Download files

Source distribution: paper_to_md-0.2.0.tar.gz (259.8 kB)
Built distribution: paper_to_md-0.2.0-py3-none-any.whl (52.7 kB, Python 3)

Both files were uploaded via Trusted Publishing (twine/6.1.0, CPython/3.13.7), with attestations from publish.yml on JaimeCernuda/paper-to-md. Attestation values reflect the state when the release was signed and may no longer be current.

File hashes:

| File | Algorithm | Hash digest |
|---|---|---|
| paper_to_md-0.2.0.tar.gz | SHA256 | 346426d5888361dc7d55164fafe853dd932610d1dfc8fea0664e9cf3879098af |
| paper_to_md-0.2.0.tar.gz | MD5 | 27b148bb404cc0d0f322d62dcab4d72b |
| paper_to_md-0.2.0.tar.gz | BLAKE2b-256 | 84b81af6243efc1e05408c46c56933995d71e1700c34cd8ab053f6fa41f669e6 |
| paper_to_md-0.2.0-py3-none-any.whl | SHA256 | ed7f62a854d66a183465a0a19e437d519655a8b2a4088178723f7332374984dc |
| paper_to_md-0.2.0-py3-none-any.whl | MD5 | 11f8c7b01b28a723ae66f0ccece3be73 |
| paper_to_md-0.2.0-py3-none-any.whl | BLAKE2b-256 | 98db96a132b46643d51ea08769101ffa995d63d3ddd21d5b3aad6f57cf5b946c |
