PDF to Markdown Converter

A library and CLI for converting PDF documents to clean, well-structured Markdown using LLM-assisted processing. It uses Anthropic and OpenAI models to extract text, tables, and complex layouts.

Features

  • Vision Mode: Enhanced extraction using multimodal AI for complex layouts, tables, charts, and diagrams
  • Multi-Provider Support: Use Anthropic (Claude) or OpenAI (GPT) models
  • Smart Conversion: Intelligently converts PDF content to clean markdown with proper formatting
  • Large File Support: Automatically chunks large PDFs for optimal processing
  • Batch Processing: Convert entire folders of PDFs with preserved directory structure
  • Table Preservation: Accurately converts tables to markdown format with vision-enhanced detection
  • Structure Detection: Automatically generates appropriate heading hierarchy
  • Dual Interface: Use as both a CLI tool and a Python library

Quick Start

# 1. Install with uv (recommended - faster)
uv tool install pdf-to-md-llm

# 2. Set your API key
export ANTHROPIC_API_KEY='your-api-key-here'

# 3. Convert a PDF
pdf-to-md-llm convert document.pdf --vision

Installation

Using uv (Recommended)

uv is a fast Python package installer:

# Install the package as a tool
uv tool install pdf-to-md-llm

# Or run directly without installing
uvx pdf-to-md-llm convert document.pdf

Using pip (Alternative)

pip install pdf-to-md-llm

Configuration

Set your API key for at least one provider:

# For Anthropic (Claude) - recommended
export ANTHROPIC_API_KEY='your-anthropic-api-key-here'

# For OpenAI (GPT)
export OPENAI_API_KEY='your-openai-api-key-here'

Or create a .env.local file:

ANTHROPIC_API_KEY=your-anthropic-api-key-here
OPENAI_API_KEY=your-openai-api-key-here
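Before running a conversion, it can help to confirm which provider keys are actually visible to the process. The following is a minimal sketch using only the standard library. The helper name `configured_providers` is hypothetical and not part of the package; only the two environment variable names come from the configuration above.

```python
import os
from typing import List, Mapping, Optional

# Environment variable names taken from the configuration section above.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def configured_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [
        name
        for name, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    found = configured_providers()
    print("Configured providers:", ", ".join(found) or "none")
```

This only checks that a variable is present; it does not verify that the key is valid with the provider.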

Default Models (Optimized for Cost/Quality)

The tool uses cost-effective models by default:

  • Anthropic: claude-3-5-haiku-20241022 ($0.80 input / $4 output per million tokens)
  • OpenAI: gpt-4o-mini ($0.15 input / $0.60 output per million tokens)

These defaults provide excellent quality for most PDF conversion tasks at a significantly lower cost than the premium models. For complex documents requiring maximum accuracy, you can override them with a premium model:

# Use more powerful Anthropic model for complex documents
pdf-to-md-llm convert complex-doc.pdf --model claude-sonnet-4-20250514 --vision

# Use OpenAI's flagship model
pdf-to-md-llm convert complex-doc.pdf --provider openai --model gpt-4o --vision
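To put the per-million-token prices of the default models in perspective, the arithmetic can be sketched as below. The prices are the ones listed above; the token counts are a hypothetical workload chosen for illustration, and actual usage depends on the document and the mode.

```python
# Prices in USD per million tokens, copied from the default models listed above.
PRICES = {
    "claude-3-5-haiku-20241022": (0.80, 4.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a conversion from its token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 1M input tokens and 250k output tokens.
    for name in PRICES:
        print(f"{name}: ${estimate_cost(name, 1_000_000, 250_000):.2f}")
```

Vision mode sends page images as well as text, so its input token counts will generally be higher than in standard mode for the same document.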

To list the models available from your configured providers, see List Available Models.

Usage Examples

Basic Conversion

# Simple document conversion
pdf-to-md-llm convert document.pdf

# Specify output filename
pdf-to-md-llm convert document.pdf output.md

Scenario 1: Academic Papers with Tables

For research papers, technical documents, or any PDF with complex tables:

# Vision mode provides superior table extraction
pdf-to-md-llm convert research-paper.pdf --vision

Scenario 2: Large Documents (500+ pages)

For textbooks, manuals, or other large documents, use smaller chunks to keep memory use and request size down:

# Reduce chunk size for memory efficiency
pdf-to-md-llm convert textbook.pdf --vision --vision-pages-per-chunk 4

Scenario 3: Documents with Charts and Diagrams

For PDFs containing visual elements like charts, graphs, or diagrams:

# Vision mode analyzes images and describes visual content
pdf-to-md-llm convert annual-report.pdf --vision --vision-dpi 200

# Use vision-only mode to rely solely on image analysis (no extracted text)
# Useful for PDFs where text extraction is unreliable or when you want pure visual analysis
pdf-to-md-llm convert diagram-heavy.pdf --vision-only --vision-dpi 200

Scenario 4: Using OpenAI GPT Models

Switch to OpenAI for different model capabilities:

# Use GPT-4o for conversion
pdf-to-md-llm convert document.pdf --provider openai --model gpt-4o --vision

# Use GPT-4o-mini for cost savings
pdf-to-md-llm convert document.pdf --provider openai --model gpt-4o-mini

Scenario 5: Batch Processing Multiple Documents

Convert entire folders of PDFs:

# Convert all PDFs in a folder (single-threaded)
pdf-to-md-llm batch ./research-papers

# With custom output folder and vision mode
pdf-to-md-llm batch ./input-pdfs ./output-markdown --vision

# Skip files that already have .md output (useful for resuming interrupted batches)
pdf-to-md-llm batch ./pdfs --skip-existing --vision

# Batch with OpenAI
pdf-to-md-llm batch ./pdfs --provider openai --vision

# Use multithreading for faster batch conversion (2 threads)
pdf-to-md-llm batch ./pdfs --threads 2 --vision

# Use 4 threads for even faster processing
pdf-to-md-llm batch ./pdfs --threads 4 --vision

# Higher parallelism (be mindful of API rate limits)
pdf-to-md-llm batch ./large-batch --threads 8

# Combine skip-existing with multithreading for efficient resumption
pdf-to-md-llm batch ./large-batch --skip-existing --threads 4 --vision

Multithreading Benefits:

  • Reduces total conversion time for large batches by processing several PDFs at once
  • Efficiently utilizes multi-core processors
  • Thread count can be adjusted based on system resources and API rate limits
  • Default is single-threaded (1 thread) to avoid rate limit issues
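The general pattern behind a thread-count option can be sketched with the standard library's thread pool. This is an illustration under stated assumptions, not the package's implementation: `convert_many` and `fake_convert` are hypothetical names, and the per-file conversion is passed in as a callable so the sketch runs without any API access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def convert_many(
    paths: Iterable[str],
    convert_one: Callable[[str], str],
    threads: int = 1,
) -> Dict[str, str]:
    """Run convert_one over paths with a bounded pool of worker threads.

    Returns a mapping of path -> result. A failure in one file is recorded
    as an "ERROR: ..." string instead of aborting the whole batch.
    """
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, threads)) as pool:
        futures = {pool.submit(convert_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if a single file fails
                results[path] = f"ERROR: {exc}"
    return results


if __name__ == "__main__":
    # Stand-in for a real conversion call; it only builds an output name.
    def fake_convert(path: str) -> str:
        return path.rsplit(".", 1)[0] + ".md"

    print(convert_many(["a.pdf", "b.pdf", "c.pdf"], fake_convert, threads=2))
```

Because each worker issues its own API requests, the number of threads is effectively the number of concurrent requests, which is why higher counts run into rate limits sooner.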

Scenario 6: Simple Text Documents

For PDFs with simple text layout (no tables or complex formatting), standard mode is faster and more cost-effective:

# Standard mode (no vision) - faster and cheaper
pdf-to-md-llm convert simple-doc.pdf

# Adjust chunk size for standard mode
pdf-to-md-llm convert simple-doc.pdf --pages-per-chunk 10

Getting Help

# Check the installed version
pdf-to-md-llm --version

# Show all available options
pdf-to-md-llm --help

# Show help for specific commands
pdf-to-md-llm convert --help
pdf-to-md-llm batch --help
pdf-to-md-llm models --help

List Available Models

Check which AI models are available from your configured providers:

# List all available models from all configured providers
pdf-to-md-llm models

# List models from a specific provider
pdf-to-md-llm models --provider anthropic
pdf-to-md-llm models --provider openai

The models command will:

  • Show available models from providers that have API keys configured
  • Display the default model for each provider
  • Only query providers with valid API keys in your environment

Using as a Python Library

First, add the package to your project:

# Using uv (recommended)
uv add pdf-to-md-llm

# Or using pip
pip install pdf-to-md-llm

Then import and use in your Python code:

from pdf_to_md_llm import convert_pdf_to_markdown, batch_convert

# Convert with vision mode (recommended for complex layouts)
markdown_content = convert_pdf_to_markdown(
    pdf_path="document.pdf",
    output_path="output.md",  # Optional
    provider="anthropic",  # 'anthropic' or 'openai'
    use_vision=True,  # Enable vision mode
    pages_per_chunk=8,  # Pages per chunk (vision default: 8)
    verbose=True  # Show progress
)

# Convert with vision-only mode (no extracted text, just images)
markdown_content = convert_pdf_to_markdown(
    pdf_path="scanned-document.pdf",
    provider="anthropic",
    vision_only=True,  # Only use images, skip extracted text
    vision_dpi=200,  # Higher DPI for better quality
    verbose=True
)

# Use OpenAI with custom model
markdown_content = convert_pdf_to_markdown(
    pdf_path="document.pdf",
    provider="openai",
    model="gpt-4o",
    use_vision=True,
    api_key="your-openai-key"  # Optional if env var set
)

# Batch convert all PDFs in a folder
batch_convert(
    input_folder="./pdfs",
    output_folder="./markdown",  # Optional
    provider="anthropic",
    use_vision=True,
    verbose=True
)

# Batch convert with multithreading for faster processing
batch_convert(
    input_folder="./pdfs",
    output_folder="./markdown",
    provider="anthropic",
    use_vision=True,
    threads=4,  # Use 4 threads for parallel processing
    verbose=True
)

# Batch convert with skip_existing to resume interrupted batches
batch_convert(
    input_folder="./pdfs",
    output_folder="./markdown",
    provider="anthropic",
    use_vision=True,
    skip_existing=True,  # Skip files that already have .md output
    threads=4,
    verbose=True
)

Advanced Library Usage

from pdf_to_md_llm import extract_text_from_pdf, extract_pages_with_vision, chunk_pages

# Extract text only (standard mode)
pages = extract_text_from_pdf("document.pdf")
print(f"Found {len(pages)} pages")

# Extract with vision data (text + images)
vision_pages = extract_pages_with_vision("document.pdf", dpi=150)
for page in vision_pages:
    print(f"Page {page['page_num']}: has_tables={page['has_tables']}, has_images={page['has_images']}")

# Create custom chunks
chunks = chunk_pages(pages, pages_per_chunk=5)
print(f"Created {len(chunks)} chunks")

How It Works

Standard Mode

  1. Text Extraction: Extracts text from PDF using PyMuPDF
  2. Chunking: Breaks content into manageable chunks (default: 5 pages per chunk)
  3. LLM Processing: Sends each chunk to your chosen AI provider for intelligent markdown conversion
  4. Reassembly: Combines all chunks into a single, formatted markdown document
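The chunk, convert, and reassemble flow in the steps above can be sketched as follows. This is a simplified stand-in, not the package's code: `chunk_pages_sketch` and `convert_pages` are hypothetical names, and the LLM call is replaced by an arbitrary callable so the example runs offline.

```python
from typing import Callable, List


def chunk_pages_sketch(pages: List[str], pages_per_chunk: int = 5) -> List[str]:
    """Join consecutive pages into chunks of at most pages_per_chunk pages."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        "\n\n".join(pages[start:start + pages_per_chunk])
        for start in range(0, len(pages), pages_per_chunk)
    ]


def convert_pages(
    pages: List[str],
    to_markdown: Callable[[str], str],
    pages_per_chunk: int = 5,
) -> str:
    """Chunk the pages, convert each chunk, and reassemble the results."""
    chunks = chunk_pages_sketch(pages, pages_per_chunk)
    converted = [to_markdown(chunk).strip() for chunk in chunks]
    return "\n\n".join(part for part in converted if part) + "\n"


if __name__ == "__main__":
    pages = [f"Text of page {n}" for n in range(1, 13)]  # 12 fake pages
    # str.upper stands in for the LLM call so the sketch runs offline.
    print(convert_pages(pages, str.upper, pages_per_chunk=5))
```

Each chunk is converted independently, so a heading or table that straddles a chunk boundary is seen in two halves. That is one reason chunk size affects output quality.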

Vision Mode (Recommended)

  1. Multimodal Extraction: Extracts both text and renders page images from PDF
  2. Smart Chunking: Groups pages into larger chunks (default: 8 pages) for better context
  3. Visual Analysis: AI analyzes both text and images for superior layout understanding
  4. Enhanced Accuracy: Better detection of tables, charts, diagrams, and complex layouts
  5. Reassembly: Combines chunks with intelligent deduplication of headers/footers
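How the tool deduplicates headers and footers during reassembly is not documented here. One simple heuristic, shown purely as an assumption-labeled sketch, is to drop lines that recur verbatim in most chunks. The function name `drop_repeated_lines` and the `min_share` threshold are hypothetical.

```python
from collections import Counter
from typing import List


def drop_repeated_lines(chunks: List[str], min_share: float = 0.6) -> List[str]:
    """Remove lines that recur in at least min_share of the chunks.

    Running headers and footers tend to repeat verbatim across chunks,
    while body text does not. Blank lines are never counted or dropped.
    """
    if len(chunks) < 2:
        return list(chunks)

    seen_in: Counter = Counter()
    for chunk in chunks:
        # Count each distinct non-blank line once per chunk.
        seen_in.update({line.strip() for line in chunk.splitlines() if line.strip()})

    threshold = min_share * len(chunks)
    repeated = {line for line, count in seen_in.items() if count >= threshold}

    return [
        "\n".join(
            line for line in chunk.splitlines() if line.strip() not in repeated
        )
        for chunk in chunks
    ]


if __name__ == "__main__":
    sample = [
        "ACME Annual Report\nIntroduction text",
        "ACME Annual Report\nFinancial results",
        "ACME Annual Report\nOutlook",
    ]
    print(drop_repeated_lines(sample))
```

A heuristic like this would also strip legitimately repeated lines, such as identical table separator rows, so a real implementation would need to exempt Markdown structure.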

When to use Vision Mode:

  • Documents with tables or complex layouts
  • PDFs containing charts, diagrams, or visual elements
  • Academic papers or technical documentation
  • Any document where layout matters

Vision-Only Mode:

Use the --vision-only flag to send only page images to the AI, without extracted text. This mode:

  • Relies completely on visual analysis of page images
  • Useful when PDF text extraction produces garbled or unreliable text
  • Better for image-heavy documents, scanned PDFs, or when layout is critical
  • Still uses chunking (controlled by --vision-pages-per-chunk)
  • Automatically enables --vision mode

Performance Tips

Choosing Between Standard and Vision Mode

Use Vision Mode when:

  • PDF contains tables, charts, or diagrams
  • Layout and formatting are important
  • You need accurate table extraction
  • Document has complex multi-column layouts

Use Vision-Only Mode when:

  • Text extraction produces garbled or unreliable output
  • Working with scanned PDFs or images embedded in PDFs
  • Visual layout is more important than extracted text
  • You want pure AI visual analysis without text hints

Use Standard Mode when:

  • The document is simple and text-only
  • Speed and cost are priorities
  • The document has a straightforward single-column layout

Chunk Size Optimization

Larger chunks (8-10 pages):

  • Better context for the AI model
  • More efficient API usage
  • Better for documents with consistent formatting
  • Default for vision mode

Smaller chunks (3-5 pages):

  • Better for very large documents (500+ pages)
  • Reduces memory usage
  • Helpful when hitting API token limits
  • Default for standard mode
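Since each chunk corresponds to one API call, the trade-off above can be made concrete by counting requests for a given chunk size. A small sketch, where `api_calls_needed` is a hypothetical helper name:

```python
import math


def api_calls_needed(total_pages: int, pages_per_chunk: int) -> int:
    """Number of chunks, and therefore requests, for a document."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return math.ceil(total_pages / pages_per_chunk)


if __name__ == "__main__":
    # A 500-page document at several chunk sizes.
    for size in (3, 5, 8, 10):
        print(f"{size} pages per chunk -> {api_calls_needed(500, size)} requests")
```

Smaller chunks mean more, smaller requests; larger chunks mean fewer requests that each carry more tokens.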

Vision Mode Settings

DPI Settings:

  • Default (150 DPI): Good balance of quality and performance
  • High quality (200-300 DPI): For small text or detailed diagrams
  • Lower (100 DPI): Faster processing, suitable for simple layouts
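The effect of the DPI setting on image size follows from the fact that PDF page dimensions are measured in points, at 72 points per inch. The sketch below computes rendered pixel dimensions for a US Letter page; `rendered_size` is a hypothetical helper, and the exact image size the tool produces may differ.

```python
from typing import Tuple


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI.

    PDF page sizes are measured in points, and one point is 1/72 inch.
    """
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


if __name__ == "__main__":
    # US Letter is 612 x 792 points (8.5 x 11 inches).
    for dpi in (100, 150, 200, 300):
        width, height = rendered_size(612, 792, dpi)
        print(f"{dpi} DPI: {width} x {height} px ({width * height / 1e6:.1f} MP)")
```

Pixel count grows with the square of the DPI, so going from 150 to 300 DPI quadruples the image data per page.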

Adjusting chunk size in vision mode:

# Smaller chunks for very large documents
pdf-to-md-llm convert large.pdf --vision --vision-pages-per-chunk 4

# Larger chunks for better context
pdf-to-md-llm convert doc.pdf --vision --vision-pages-per-chunk 12

# Vision-only mode with custom chunk size
pdf-to-md-llm convert scanned.pdf --vision-only --vision-pages-per-chunk 6

Troubleshooting

API Key Errors

Error: ValueError: API key not found

Solution:

  • Verify your API key is set in environment variables
  • Check the key name matches your provider (ANTHROPIC_API_KEY or OPENAI_API_KEY)
  • Ensure the key is valid and not expired

Rate Limiting

Error: API rate limit exceeded

Solution:

  • Reduce chunk size to make smaller API requests
  • Add delays between batch conversions
  • Upgrade your API plan for higher limits
  • Switch providers if one is experiencing issues
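When calling the library from your own code, one common way to add delays is to retry a failed call with exponential backoff. The wrapper below is a generic sketch, not a feature of the package: `with_backoff` is a hypothetical helper, and it retries on any exception for simplicity, whereas real code should catch only the provider's rate-limit error.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Waits grow 1x, 2x, 4x, ... of base_delay, with up to 25% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
    raise ValueError("max_attempts must be at least 1")
```

A conversion could then be wrapped as `with_backoff(lambda: convert_pdf_to_markdown(pdf_path="document.pdf"))`, using the function documented in the API Reference.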

Large File Issues

Error: Memory errors or timeouts on large PDFs

Solution:

  • Use smaller chunk sizes: --vision-pages-per-chunk 3
  • Process in batches by splitting the PDF first
  • Use standard mode instead of vision for simple documents
  • Increase available system memory

Vision Mode Memory Issues

Error: Out of memory when using vision mode

Solution:

  • Reduce DPI: --vision-dpi 100
  • Use smaller chunks: --vision-pages-per-chunk 4
  • Process fewer pages at once
  • Close other applications to free memory

Poor Quality Output

Problem: Markdown output has formatting issues

Solution:

  • Try vision mode for better layout detection: --vision
  • Increase DPI for better image quality: --vision-dpi 200
  • Try vision-only mode if extracted text is garbled: --vision-only
  • Try different models: --provider openai --model gpt-4o
  • Adjust chunk size for better context

API Reference

Main Functions

convert_pdf_to_markdown()

def convert_pdf_to_markdown(
    pdf_path: str,
    output_path: Optional[str] = None,
    pages_per_chunk: int = 5,
    provider: str = "anthropic",
    api_key: Optional[str] = None,
    model: Optional[str] = None,
    max_tokens: int = 4000,
    verbose: bool = True,
    use_vision: bool = False,
    vision_dpi: int = 150,
    vision_only: bool = False
) -> str

Convert a single PDF to markdown.

Parameters:

  • pdf_path: Path to the PDF file
  • output_path: Optional output file path (defaults to PDF name with .md extension)
  • pages_per_chunk: Number of pages per API call (default: 5 for standard, 8 for vision)
  • provider: AI provider - 'anthropic' or 'openai' (default: 'anthropic')
  • api_key: API key (defaults to provider-specific environment variable)
  • model: Model to use (optional, uses provider defaults)
  • max_tokens: Maximum tokens per API call (default: 4000)
  • verbose: Print progress messages (default: True)
  • use_vision: Enable vision mode for better extraction (default: False)
  • vision_dpi: DPI for page images in vision mode (default: 150)
  • vision_only: Use only images without extracted text (default: False; setting it to True automatically enables use_vision)

Returns: The complete markdown content as a string

Raises: ValueError if API key is missing or provider is invalid

batch_convert()

def batch_convert(
    input_folder: str,
    output_folder: Optional[str] = None,
    pages_per_chunk: int = 5,
    provider: str = "anthropic",
    api_key: Optional[str] = None,
    model: Optional[str] = None,
    max_tokens: int = 4000,
    verbose: bool = True,
    use_vision: bool = False,
    vision_dpi: int = 150,
    vision_only: bool = False,
    threads: int = 1,
    skip_existing: bool = False
) -> None

Convert all PDFs in a folder to markdown.

Parameters:

  • input_folder: Folder containing PDF files
  • output_folder: Optional output folder (defaults to input folder)
  • vision_only: Use only images without extracted text (default: False; setting it to True automatically enables use_vision)
  • threads: Number of threads for parallel processing (default: 1 for single-threaded)
  • skip_existing: Skip files that already have corresponding .md files in output directory (default: False)
  • All other parameters are the same as for convert_pdf_to_markdown()
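The combination of a mirrored output folder and skip_existing can be approximated with pathlib. This sketch assumes that each output file keeps the PDF's relative path with an .md extension; the helper name `pending_conversions` is hypothetical, and the package's own matching rules may differ.

```python
from pathlib import Path
from typing import List, Tuple


def pending_conversions(
    input_folder: str, output_folder: str
) -> List[Tuple[Path, Path]]:
    """Pair each PDF with its mirrored .md path, skipping existing outputs."""
    src_root, dst_root = Path(input_folder), Path(output_folder)
    pending = []
    for pdf in sorted(src_root.rglob("*.pdf")):
        # Mirror the PDF's relative location under the output folder.
        target = dst_root / pdf.relative_to(src_root).with_suffix(".md")
        if not target.exists():
            pending.append((pdf, target))
    return pending


if __name__ == "__main__":
    for pdf, target in pending_conversions("./pdfs", "./markdown"):
        print(f"{pdf} -> {target}")
```

Checking only for the file's existence means a partially written output from an interrupted run would be skipped too, so it may be worth deleting the last file touched before resuming.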

Note on Multithreading:

  • Single-threaded (threads=1): Default, sequential processing
  • Multithreaded (threads>1): Parallel processing for faster batch conversion
  • Be mindful of API rate limits when using higher thread counts
  • Progress output is simplified in multithreaded mode for clarity

extract_text_from_pdf()

def extract_text_from_pdf(pdf_path: str) -> List[str]

Extract raw text from PDF (standard mode).

Returns: List of strings, one per page

extract_pages_with_vision()

def extract_pages_with_vision(pdf_path: str, dpi: int = 150) -> List[Dict[str, Any]]

Extract text and images from PDF pages for vision processing.

Returns: List of dicts with keys: page_num, text, image_base64, has_images, has_tables

chunk_pages()

def chunk_pages(pages: List[str], pages_per_chunk: int) -> List[str]

Combine pages into chunks for processing.

Returns: List of combined page chunks

Output Format

Converted markdown files include:

  • Document title header
  • Clean heading hierarchy
  • Properly formatted tables
  • Organized lists
  • No page numbers or PDF artifacts (these are removed during conversion)
  • Conversion metadata footer

Requirements

  • Python 3.9 or higher
  • API key for at least one provider:
    • Anthropic API key (for Claude models)
    • OpenAI API key (for GPT models)

Dependencies

All dependencies are automatically installed:

  • anthropic: Claude API client (for Anthropic provider)
  • openai: OpenAI API client (for OpenAI provider)
  • pymupdf: PDF text and image extraction
  • python-dotenv: Environment variable management
  • click: CLI framework

License

This project is open source and available under the MIT License.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for development setup, testing, and contribution guidelines.

For bug reports and feature requests, please open an issue on GitHub.
