pdf-to-markdown

Convert PDF to Markdown with image preservation. Solves markitdown's image loss problem by combining markitdown + PyMuPDF.

Why?

markitdown (133k+ ⭐) is excellent for converting documents to Markdown, but it drops every image embedded in a PDF. This tool fixes that by merging markitdown's text extraction with PyMuPDF's lossless image extraction.

| Metric | markitdown only | PyMuPDF only | This tool |
|---|---|---|---|
| Text quality | 95% | 90% | 95% |
| Image preservation | ❌ 0% | ✅ 99% | ✅ 99% |
| Table support | ✅ 85% | ⚠️ 60% | ✅ 85% |
| Speed | ⚡ Fast | ⚡ Fast | ⚡ Fast |

Quick Start

# Install
pip install pymupdf "markitdown[all]"

# Convert (auto-detect strategy)
python -m pdf_to_markdown document.pdf

# Specify output directory
python -m pdf_to_markdown document.pdf -o output/

# Batch convert
python -m pdf_to_markdown *.pdf -o output/

# Extract images only
python -m pdf_to_markdown document.pdf --images-only

# Force strategy
python -m pdf_to_markdown document.pdf --strategy merge    # markitdown + PyMuPDF
python -m pdf_to_markdown document.pdf --strategy pymupdf  # pure PyMuPDF
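
The same commands can be driven from a Python script, which may be convenient for batch jobs. The sketch below only uses the `-o` and `--strategy` flags shown above; the helper names `build_command` and `convert_all` are hypothetical, and it assumes (this README does not say so) that the CLI exits with a non-zero status on failure.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def build_command(
    pdf: Path,
    out_dir: Optional[Path] = None,
    strategy: Optional[str] = None,
) -> List[str]:
    """Build the argument list for one conversion (hypothetical helper)."""
    cmd = [sys.executable, "-m", "pdf_to_markdown", str(pdf)]
    if out_dir is not None:
        cmd += ["-o", str(out_dir)]
    if strategy is not None:
        cmd += ["--strategy", strategy]
    return cmd


def convert_all(folder: Path, out_dir: Path) -> List[Path]:
    """Convert every PDF in `folder`; return the PDFs whose run failed.

    Assumption: a non-zero exit status signals a failed conversion.
    """
    failed = []
    for pdf in sorted(folder.glob("*.pdf")):
        result = subprocess.run(build_command(pdf, out_dir))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling `convert_all(Path("pdfs"), Path("output"))` would then run one conversion per file and report the ones that need another look.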

How It Works

Input PDF
    │
    ▼
┌─────────────────────────────────────┐
│  Step 1: Auto-detect PDF type       │
│  - Page count, image count, scanned │
│  - Select best strategy             │
└─────────────────────────────────────┘
    │
    ├─ Text-only PDF ──→ pymupdf (fastest)
    │
    └─ Mixed content ──→ markitdown + PyMuPDF merge
    │
    ▼
┌─────────────────────────────────────┐
│  Step 2: Parallel extraction        │
│  - markitdown for text/structure    │
│  - PyMuPDF for images (parallel)    │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Step 3: Smart merge                │
│  - Images inserted at correct pos   │
│  - Footer/header pattern matching   │
│  - Quality report generation        │
└─────────────────────────────────────┘
    │
    ▼
Output: document_with_images.md + images/
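
Steps 2 and 3 can be illustrated with a minimal sketch. It is not this project's actual implementation: `extract_images` uses PyMuPDF's `Page.get_images` and `Document.extract_image`, while `merge_pages` is a deliberately simple stand-in for the smart merge that appends each page's images after that page's text. It assumes the text is already available as one string per page, which is a simplification, and the file naming scheme is hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def extract_images(pdf_path: str, image_dir: str) -> Dict[int, List[str]]:
    """Save every embedded image; return {page_index: [relative paths]}."""
    import pymupdf  # lazy import; older PyMuPDF releases expose it as `fitz`

    out = Path(image_dir)
    out.mkdir(parents=True, exist_ok=True)
    by_page: Dict[int, List[str]] = {}
    with pymupdf.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            for n, img in enumerate(page.get_images(full=True), start=1):
                info = doc.extract_image(img[0])  # img[0] is the image xref
                # Hypothetical naming scheme: page number, then image number.
                name = f"page{page_index + 1:03d}_img{n:02d}.{info['ext']}"
                (out / name).write_bytes(info["image"])
                by_page.setdefault(page_index, []).append(f"{out.name}/{name}")
    return by_page


def merge_pages(page_texts: List[str], images_by_page: Dict[int, List[str]]) -> str:
    """Append each page's image references after that page's text.

    Assumption: `page_texts` holds one Markdown string per PDF page.
    """
    parts = []
    for page_index, text in enumerate(page_texts):
        parts.append(text.rstrip())
        for rel_path in images_by_page.get(page_index, []):
            parts.append(f"![]({rel_path})")
    return "\n\n".join(parts) + "\n"
```

Placing images at the end of their page is the crudest possible positioning. A closer match to "inserted at correct pos" would need the image bounding boxes on each page, which this sketch ignores.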

Strategies

| Strategy | Best For | How |
|---|---|---|
| auto (default) | Everything | Detects PDF type, picks best strategy |
| merge | Mixed content PDFs | markitdown text + PyMuPDF images |
| pymupdf | Text-only PDFs | Pure PyMuPDF extraction |
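
As a rough illustration of what `auto` might do, the sketch below maps a few PDF statistics to a strategy name. The `PdfStats` fields mirror the inputs named in the diagram (page count, image count, scanned), but the class, the function, and the decision rule itself are assumptions, not the tool's real logic.

```python
from dataclasses import dataclass


@dataclass
class PdfStats:
    """Hypothetical summary of a PDF, gathered before conversion."""
    page_count: int
    image_count: int
    scanned: bool  # assumed flag: True when pages carry no text layer


def choose_strategy(stats: PdfStats) -> str:
    """Pick 'pymupdf' for text-only PDFs and 'merge' for everything else.

    Assumed rule: a PDF counts as text-only when it has no embedded
    images and is not a scan. OCR for scanned PDFs is out of scope here.
    """
    if stats.image_count == 0 and not stats.scanned:
        return "pymupdf"
    return "merge"
```

Under this rule, the 645-image manual from the benchmark would be routed to `merge`, and a plain text report with no images to `pymupdf`.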

Quality Check

After conversion, validate the output:

python -m pdf_to_markdown.quality_check output.md
python -m pdf_to_markdown.quality_check output.md --verbose  # detailed report
python -m pdf_to_markdown.quality_check output.md --json     # machine-readable

Output:

📊 Basic Info:
  File size: 1,234 KB
  Total lines: 18,533
📝 Structure:
  Headings: 320
  Tables: 4,400 rows
🖼️ Images:
  References: 645
Quality Score: 95/100 (✅ Excellent)
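
The counts in such a report can be approximated with a few regular expressions. This is a sketch of the idea rather than the `quality_check` module's code: `markdown_stats` is a hypothetical name, and the scoring formula behind the 95/100 figure is not documented here, so no score is computed.

```python
import re
from typing import Dict

# ATX headings such as "# Title" or "### Section".
HEADING = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)
# Any line that starts and ends with a pipe (separator rows included).
TABLE_ROW = re.compile(r"^[ \t]*\|.*\|[ \t]*$", re.MULTILINE)
# Inline image references such as ![alt](images/x.png).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]+\)")


def markdown_stats(text: str) -> Dict[str, int]:
    """Count lines, headings, table rows, and image references."""
    return {
        "lines": len(text.splitlines()),
        "headings": len(HEADING.findall(text)),
        "table_rows": len(TABLE_ROW.findall(text)),
        "image_refs": len(IMAGE_REF.findall(text)),
    }
```

Comparing `image_refs` against the number of files in `images/` is one cheap way to spot references that were dropped during the merge.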

Benchmark

Tested on: 550-page Chinese technical manual with 645 images

| Tool | Time | Images | Text Accuracy | Overall |
|---|---|---|---|---|
| markitdown only | 15s | ❌ 0 | 95% | ⭐⭐⭐ |
| PyMuPDF only | 8s | ✅ 645 | 90% | ⭐⭐⭐⭐ |
| This tool | 20s | ✅ 645 | 95% | ⭐⭐⭐⭐⭐ |
| marker-pdf | 120s | ✅ 620 | 98% | ⭐⭐⭐⭐⭐ |

Installation

# Minimal (recommended)
pip install pymupdf "markitdown[all]"

# For scanned PDFs (OCR)
pip install marker-pdf

# From source
git clone https://github.com/Leomeie/pdf-to-markdown.git
cd pdf-to-markdown
pip install -e .

Tool Comparison

See docs/tool-comparison.md for detailed comparison of PDF-to-Markdown tools.

Related Projects

  • markitdown - Microsoft's document converter (133k+ ⭐)
  • PyMuPDF - Fast PDF library (5k+ ⭐)
  • marker - High-quality PDF converter (20k+ ⭐)

License

MIT

Contributing

Contributions welcome! Please open an issue or PR.
