PDF to Markdown Converter
Library and CLI to convert PDF documents to clean, well-structured Markdown using LLM-assisted processing, leveraging Anthropic and OpenAI models for intelligent extraction of text, tables, and complex layouts.
Features
- Vision Mode: Enhanced extraction using multimodal AI for complex layouts, tables, charts, and diagrams
- Multi-Provider Support: Use Anthropic (Claude) or OpenAI (GPT) models
- Smart Conversion: Intelligently converts PDF content to clean markdown with proper formatting
- Large File Support: Automatically chunks large PDFs for optimal processing
- Batch Processing: Convert entire folders of PDFs with preserved directory structure
- Table Preservation: Accurately converts tables to markdown format with vision-enhanced detection
- Structure Detection: Automatically generates appropriate heading hierarchy
- Dual Interface: Use as both a CLI tool and a Python library
Quick Start
# 1. Install with uv (recommended - faster)
uv tool install pdf-to-md-llm
# 2. Set your API key
export ANTHROPIC_API_KEY='your-api-key-here'
# 3. Convert a PDF
pdf-to-md-llm convert document.pdf --vision
Installation
Using uv (Recommended)
uv is a fast Python package installer:
# Install the package as a tool
uv tool install pdf-to-md-llm
# Or run directly without installing
uvx pdf-to-md-llm convert document.pdf
Using pip (Alternative)
pip install pdf-to-md-llm
Configuration
Set your API key for at least one provider:
# For Anthropic (Claude) - recommended
export ANTHROPIC_API_KEY='your-anthropic-api-key-here'
# For OpenAI (GPT)
export OPENAI_API_KEY='your-openai-api-key-here'
Or create a .env.local file:
ANTHROPIC_API_KEY=your-anthropic-api-key-here
OPENAI_API_KEY=your-openai-api-key-here
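As a quick pre-flight check before converting, the sketch below reports which provider has a key available. `pick_provider` is a hypothetical helper and not part of pdf-to-md-llm; only the two environment variable names come from this README. Preferring Anthropic when both keys are set is an assumption that mirrors the documented default provider.

```python
import os
from typing import Mapping, Optional

# Environment variable names as documented above.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def pick_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first provider whose API key is set and non-empty.

    Hypothetical helper for illustration; not part of pdf-to-md-llm.
    """
    env = os.environ if env is None else env
    for provider, var in PROVIDER_ENV_VARS.items():
        if env.get(var, "").strip():
            return provider
    raise ValueError("No API key found; set ANTHROPIC_API_KEY or OPENAI_API_KEY")
```

Calling `pick_provider()` with no arguments inspects the current process environment; passing a dict makes the check easy to test.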
Default Models (Optimized for Cost/Quality)
The tool uses cost-effective models by default:
- Anthropic: claude-3-5-haiku-20241022 ($0.80 input / $4 output per million tokens)
- OpenAI: gpt-4o-mini ($0.15 input / $0.60 output per million tokens)
These defaults provide excellent quality for most PDF conversion tasks at significantly lower cost. For complex documents requiring maximum accuracy, you can override with premium models:
# Use more powerful Anthropic model for complex documents
pdf-to-md-llm convert complex-doc.pdf --model claude-sonnet-4-20250514 --vision
# Use OpenAI's flagship model
pdf-to-md-llm convert complex-doc.pdf --provider openai --model gpt-4o --vision
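To get a rough feel for what the default models cost, the per-million-token prices quoted above can be turned into a small estimate. This is a sketch only: `estimate_cost` is a hypothetical helper, the token counts in the example are made up, and provider pricing may have changed since this README was written.

```python
# USD per million tokens (input, output), as quoted in this README.
# Provider pricing changes over time; treat these as illustrative.
PRICES_PER_MILLION = {
    "claude-3-5-haiku-20241022": (0.80, 4.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one conversion (hypothetical helper)."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Example with made-up token counts: 200,000 input and 50,000 output tokens.
for model in PRICES_PER_MILLION:
    print(f"{model}: ${estimate_cost(model, 200_000, 50_000):.2f}")
```

With these example counts the estimate comes to about $0.36 for the Anthropic default and $0.06 for the OpenAI default. Vision mode also sends page images, so real input token counts will depend on DPI and page count.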
To see all available models from your configured providers, see List Available Models.
Usage Examples
Basic Conversion
# Simple document conversion
pdf-to-md-llm convert document.pdf
# Specify output filename
pdf-to-md-llm convert document.pdf output.md
Scenario 1: Academic Papers with Tables
For research papers, technical documents, or any PDF with complex tables:
# Vision mode provides superior table extraction
pdf-to-md-llm convert research-paper.pdf --vision
Scenario 2: Large Documents (500+ pages)
For textbooks, manuals, or large documents, use smaller chunks for better processing:
# Reduce chunk size for memory efficiency
pdf-to-md-llm convert textbook.pdf --vision --vision-pages-per-chunk 4
Scenario 3: Documents with Charts and Diagrams
For PDFs containing visual elements like charts, graphs, or diagrams:
# Vision mode analyzes images and describes visual content
pdf-to-md-llm convert annual-report.pdf --vision --vision-dpi 200
# Use vision-only mode to rely solely on image analysis (no extracted text)
# Useful for PDFs where text extraction is unreliable or when you want pure visual analysis
pdf-to-md-llm convert diagram-heavy.pdf --vision-only --vision-dpi 200
Scenario 4: Using OpenAI GPT Models
Switch to OpenAI for different model capabilities:
# Use GPT-4o for conversion
pdf-to-md-llm convert document.pdf --provider openai --model gpt-4o --vision
# Use GPT-4o-mini for cost savings
pdf-to-md-llm convert document.pdf --provider openai --model gpt-4o-mini
Scenario 5: Batch Processing Multiple Documents
Convert entire folders of PDFs:
# Convert all PDFs in a folder (single-threaded)
pdf-to-md-llm batch ./research-papers
# With custom output folder and vision mode
pdf-to-md-llm batch ./input-pdfs ./output-markdown --vision
# Skip files that already have .md output (useful for resuming interrupted batches)
pdf-to-md-llm batch ./pdfs --skip-existing --vision
# Batch with OpenAI
pdf-to-md-llm batch ./pdfs --provider openai --vision
# Use multithreading for faster batch conversion (2 threads)
pdf-to-md-llm batch ./pdfs --threads 2 --vision
# Use 4 threads for even faster processing
pdf-to-md-llm batch ./pdfs --threads 4 --vision
# Maximum parallelization (be mindful of API rate limits)
pdf-to-md-llm batch ./large-batch --threads 8
# Combine skip-existing with multithreading for efficient resumption
pdf-to-md-llm batch ./large-batch --skip-existing --threads 4 --vision
Multithreading Benefits:
- Can substantially reduce total conversion time for large batches by converting several PDFs in parallel
- Efficiently utilizes multi-core processors
- Thread count can be adjusted based on system resources and API rate limits
- Default is single-threaded (1 thread) to avoid rate limit issues
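The general pattern behind a `--threads` option can be sketched with the standard library's thread pool. This is not the package's actual implementation: `convert_many` and its `convert_one` argument are hypothetical names, with `convert_one` standing in for whatever converts a single PDF and returns its Markdown. Threads help here presumably because much of the time is spent waiting on API responses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def convert_many(
    pdf_paths: Iterable[str],
    convert_one: Callable[[str], str],
    threads: int = 1,
) -> Dict[str, str]:
    """Run convert_one over pdf_paths with a bounded pool of worker threads.

    Failures are recorded per file so one bad PDF does not stop the batch.
    """
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, threads)) as pool:
        # Map each future back to its path so results can be labelled.
        futures = {pool.submit(convert_one, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; report the failure
                results[path] = f"ERROR: {exc}"
    return results
```

Because every worker issues its own API requests, the thread count multiplies the request rate, which is why a low default is the safer choice when rate limits are tight.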
Scenario 6: Simple Text Documents
For PDFs with simple text layout (no tables or complex formatting), standard mode is faster and more cost-effective:
# Standard mode (no vision) - faster and cheaper
pdf-to-md-llm convert simple-doc.pdf
# Adjust chunk size for standard mode
pdf-to-md-llm convert simple-doc.pdf --pages-per-chunk 10
Getting Help
# Check the installed version
pdf-to-md-llm --version
# Show all available options
pdf-to-md-llm --help
# Show help for specific commands
pdf-to-md-llm convert --help
pdf-to-md-llm batch --help
pdf-to-md-llm models --help
List Available Models
Check which AI models are available from your configured providers:
# List all available models from all configured providers
pdf-to-md-llm models
# List models from a specific provider
pdf-to-md-llm models --provider anthropic
pdf-to-md-llm models --provider openai
The models command will:
- Show available models from providers that have API keys configured
- Display the default model for each provider
- Only query providers with valid API keys in your environment
Using as a Python Library
First, add the package to your project:
# Using uv (recommended)
uv add pdf-to-md-llm
# Or using pip
pip install pdf-to-md-llm
Then import and use in your Python code:
from pdf_to_md_llm import convert_pdf_to_markdown, batch_convert
# Convert with vision mode (recommended for complex layouts)
markdown_content = convert_pdf_to_markdown(
    pdf_path="document.pdf",
    output_path="output.md",  # Optional
    provider="anthropic",  # 'anthropic' or 'openai'
    use_vision=True,  # Enable vision mode
    pages_per_chunk=8,  # Pages per chunk (vision default: 8)
    verbose=True  # Show progress
)
# Convert with vision-only mode (no extracted text, just images)
markdown_content = convert_pdf_to_markdown(
    pdf_path="scanned-document.pdf",
    provider="anthropic",
    vision_only=True,  # Only use images, skip extracted text
    vision_dpi=200,  # Higher DPI for better quality
    verbose=True
)
# Use OpenAI with custom model
markdown_content = convert_pdf_to_markdown(
    pdf_path="document.pdf",
    provider="openai",
    model="gpt-4o",
    use_vision=True,
    api_key="your-openai-key"  # Optional if env var set
)
# Batch convert all PDFs in a folder
batch_convert(
    input_folder="./pdfs",
    output_folder="./markdown",  # Optional
    provider="anthropic",
    use_vision=True,
    verbose=True
)
# Batch convert with multithreading for faster processing
batch_convert(
    input_folder="./pdfs",
    output_folder="./markdown",
    provider="anthropic",
    use_vision=True,
    threads=4,  # Use 4 threads for parallel processing
    verbose=True
)
# Batch convert with skip_existing to resume interrupted batches
batch_convert(
    input_folder="./pdfs",
    output_folder="./markdown",
    provider="anthropic",
    use_vision=True,
    skip_existing=True,  # Skip files that already have .md output
    threads=4,
    verbose=True
)
Advanced Library Usage
from pdf_to_md_llm import extract_text_from_pdf, extract_pages_with_vision, chunk_pages
# Extract text only (standard mode)
pages = extract_text_from_pdf("document.pdf")
print(f"Found {len(pages)} pages")
# Extract with vision data (text + images)
vision_pages = extract_pages_with_vision("document.pdf", dpi=150)
for page in vision_pages:
    print(f"Page {page['page_num']}: has_tables={page['has_tables']}, has_images={page['has_images']}")
# Create custom chunks
chunks = chunk_pages(pages, pages_per_chunk=5)
print(f"Created {len(chunks)} chunks")
How It Works
Standard Mode
- Text Extraction: Extracts text from PDF using PyMuPDF
- Chunking: Breaks content into manageable chunks (default: 5 pages per chunk)
- LLM Processing: Sends each chunk to your chosen AI provider for intelligent markdown conversion
- Reassembly: Combines all chunks into a single, formatted markdown document
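The four steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the package's code: `convert_pages` is a hypothetical name, `pages` stands in for the output of the text-extraction step, and `convert_chunk` stands in for the LLM call so the flow can be exercised without an API key.

```python
from typing import Callable, List


def convert_pages(
    pages: List[str],
    convert_chunk: Callable[[str], str],
    pages_per_chunk: int = 5,
) -> str:
    """Chunk page texts, convert each chunk, and join the results.

    pages and convert_chunk are stand-ins for the extraction step and
    the LLM call; both are assumptions of this sketch.
    """
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    markdown_parts = []
    for start in range(0, len(pages), pages_per_chunk):
        # Chunking: join the text of up to pages_per_chunk consecutive pages.
        chunk_text = "\n\n".join(pages[start:start + pages_per_chunk])
        # LLM processing: one call per chunk.
        markdown_parts.append(convert_chunk(chunk_text).strip())
    # Reassembly: concatenate the converted chunks in page order.
    return "\n\n".join(markdown_parts) + "\n"
```

Passing a trivial function such as `lambda text: text` as `convert_chunk` is enough to see how a 12-page document turns into three chunks at the default size.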
Vision Mode (Recommended)
- Multimodal Extraction: Extracts both text and renders page images from PDF
- Smart Chunking: Groups pages into larger chunks (default: 8 pages) for better context
- Visual Analysis: AI analyzes both text and images for superior layout understanding
- Enhanced Accuracy: Better detection of tables, charts, diagrams, and complex layouts
- Reassembly: Combines chunks with intelligent deduplication of headers/footers
When to use Vision Mode:
- Documents with tables or complex layouts
- PDFs containing charts, diagrams, or visual elements
- Academic papers or technical documentation
- Any document where layout matters
Vision-Only Mode:
Use the --vision-only flag to send only page images to the AI without extracted text. This mode:
- Relies completely on visual analysis of page images
- Useful when PDF text extraction produces garbled or unreliable text
- Better for image-heavy documents, scanned PDFs, or when layout is critical
- Still uses chunking (controlled by --vision-pages-per-chunk)
- Automatically enables --vision mode
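To make the difference between vision mode and vision-only mode concrete, the sketch below builds the content blocks for one chunk in the shape the Anthropic Messages API accepts for base64 images. How the package actually builds its requests is not documented here, so `build_vision_content` is a hypothetical helper; the page keys are the ones listed for extract_pages_with_vision() in the API Reference, and PNG encoding of `image_base64` is an assumption.

```python
from typing import Any, Dict, List


def build_vision_content(
    pages: List[Dict[str, Any]], vision_only: bool = False
) -> List[Dict[str, Any]]:
    """Build Anthropic-style content blocks for one chunk of pages.

    Each page dict is assumed to carry page_num, text and image_base64;
    the images are assumed to be PNG.
    """
    content: List[Dict[str, Any]] = []
    for page in pages:
        # Every page contributes its rendered image.
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": page["image_base64"],
            },
        })
        if not vision_only:
            # Vision mode also sends the extracted text as a hint.
            content.append({
                "type": "text",
                "text": f"Extracted text for page {page['page_num']}:\n{page['text']}",
            })
    return content
```

In vision-only mode the list contains image blocks alone, which is why that mode suits PDFs whose extracted text is garbled.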
Performance Tips
Choosing Between Standard and Vision Mode
Use Vision Mode when:
- PDF contains tables, charts, or diagrams
- Layout and formatting are important
- You need accurate table extraction
- Document has complex multi-column layouts
Use Vision-Only Mode when:
- Text extraction produces garbled or unreliable output
- Working with scanned PDFs or images embedded in PDFs
- Visual layout is more important than extracted text
- You want pure AI visual analysis without text hints
Use Standard Mode when:
- Simple text-only documents
- Speed and cost are priorities
- Document has straightforward single-column layout
Chunk Size Optimization
Larger chunks (8-10 pages):
- Better context for the AI model
- More efficient API usage
- Better for documents with consistent formatting
- Default for vision mode
Smaller chunks (3-5 pages):
- Better for very large documents (500+ pages)
- Reduces memory usage
- Helpful when hitting API token limits
- Default for standard mode
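The trade-off is easy to quantify: since each chunk is one API call, the number of calls is the page count divided by the chunk size, rounded up. `chunk_count` below is a hypothetical helper for that arithmetic.

```python
import math


def chunk_count(total_pages: int, pages_per_chunk: int) -> int:
    """Number of chunks, and therefore API calls, for a document."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return math.ceil(total_pages / pages_per_chunk)


# A 500-page document at several chunk sizes.
for size in (3, 5, 8, 10):
    print(f"{size} pages/chunk -> {chunk_count(500, size)} chunks")
```

For 500 pages this gives 167, 100, 63, and 50 chunks respectively: smaller chunks mean smaller individual requests but more of them.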
Vision Mode Settings
DPI Settings:
- Default (150 DPI): Good balance of quality and performance
- High quality (200-300 DPI): For small text or detailed diagrams
- Lower (100 DPI): Faster processing, suitable for simple layouts
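The cost of a higher DPI follows from simple arithmetic: PDF page sizes are measured in points (72 per inch), so pixel dimensions scale linearly with DPI and the pixel count scales with its square. `rendered_size` is a hypothetical helper for this calculation; the exact dimensions the package's renderer produces may differ by a pixel due to rounding.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF page sizes are measured in points


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (100, 150, 200, 300):
    w, h = rendered_size(612, 792, dpi)
    print(f"{dpi} DPI: {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

A Letter page is 1275 x 1650 px at the default 150 DPI and 1700 x 2200 px at 200 DPI, roughly 1.8 times as many pixels, which is where the extra memory use and processing time come from.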
Adjusting chunk size in vision mode:
# Smaller chunks for very large documents
pdf-to-md-llm convert large.pdf --vision --vision-pages-per-chunk 4
# Larger chunks for better context
pdf-to-md-llm convert doc.pdf --vision --vision-pages-per-chunk 12
# Vision-only mode with custom chunk size
pdf-to-md-llm convert scanned.pdf --vision-only --vision-pages-per-chunk 6
Troubleshooting
API Key Errors
Error: ValueError: API key not found
Solution:
- Verify your API key is set in environment variables
- Check the key name matches your provider (ANTHROPIC_API_KEY or OPENAI_API_KEY)
- Ensure the key is valid and not expired
Rate Limiting
Error: API rate limit exceeded
Solution:
- Reduce chunk size to make smaller API requests
- Add delays between batch conversions
- Upgrade your API plan for higher limits
- Switch providers if one is experiencing issues
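When calling the library from your own code, one common way to add delays is to retry with exponential backoff. The sketch below is generic and not part of pdf-to-md-llm: `with_backoff` is a hypothetical helper, and `is_rate_limit` is a caller-supplied predicate because how a rate-limit error is surfaced depends on the provider SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    is_rate_limit: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() on rate-limit errors, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_rate_limit(exc):
                raise
            # Exponential backoff with a little jitter: ~1s, ~2s, ~4s, ...
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

A conversion could then be wrapped as `with_backoff(lambda: convert_pdf_to_markdown("document.pdf"), is_rate_limit=...)`, with the predicate matched to the errors you actually observe.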
Large File Issues
Error: Memory errors or timeouts on large PDFs
Solution:
- Use smaller chunk sizes: --vision-pages-per-chunk 3
- Process in batches by splitting the PDF first
- Use standard mode instead of vision for simple documents
- Increase available system memory
Vision Mode Memory Issues
Error: Out of memory when using vision mode
Solution:
- Reduce DPI: --vision-dpi 100
- Use smaller chunks: --vision-pages-per-chunk 4
- Process fewer pages at once
- Close other applications to free memory
Poor Quality Output
Problem: Markdown output has formatting issues
Solution:
- Try vision mode for better layout detection: --vision
- Increase DPI for better image quality: --vision-dpi 200
- Try vision-only mode if extracted text is garbled: --vision-only
- Try different models: --provider openai --model gpt-4o
- Adjust chunk size for better context
API Reference
Main Functions
convert_pdf_to_markdown()
def convert_pdf_to_markdown(
    pdf_path: str,
    output_path: Optional[str] = None,
    pages_per_chunk: int = 5,
    provider: str = "anthropic",
    api_key: Optional[str] = None,
    model: Optional[str] = None,
    max_tokens: int = 4000,
    verbose: bool = True,
    use_vision: bool = False,
    vision_dpi: int = 150,
    vision_only: bool = False
) -> str
Convert a single PDF to markdown.
Parameters:
- pdf_path: Path to the PDF file
- output_path: Optional output file path (defaults to PDF name with .md extension)
- pages_per_chunk: Number of pages per API call (default: 5 for standard, 8 for vision)
- provider: AI provider - 'anthropic' or 'openai' (default: 'anthropic')
- api_key: API key (defaults to provider-specific environment variable)
- model: Model to use (optional, uses provider defaults)
- max_tokens: Maximum tokens per API call (default: 4000)
- verbose: Print progress messages (default: True)
- use_vision: Enable vision mode for better extraction (default: False)
- vision_dpi: DPI for page images in vision mode (default: 150)
- vision_only: Use only images without extracted text (default: False, automatically enables use_vision)
Returns: The complete markdown content as a string
Raises: ValueError if API key is missing or provider is invalid
batch_convert()
def batch_convert(
    input_folder: str,
    output_folder: Optional[str] = None,
    pages_per_chunk: int = 5,
    provider: str = "anthropic",
    api_key: Optional[str] = None,
    model: Optional[str] = None,
    max_tokens: int = 4000,
    verbose: bool = True,
    use_vision: bool = False,
    vision_dpi: int = 150,
    vision_only: bool = False,
    threads: int = 1,
    skip_existing: bool = False
) -> None
Convert all PDFs in a folder to markdown.
Parameters:
- input_folder: Folder containing PDF files
- output_folder: Optional output folder (defaults to input folder)
- vision_only: Use only images without extracted text (default: False, automatically enables use_vision)
- threads: Number of threads for parallel processing (default: 1 for single-threaded)
- skip_existing: Skip files that already have corresponding .md files in output directory (default: False)
- All other parameters same as convert_pdf_to_markdown()
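The interplay of input_folder, output_folder, and skip_existing can be illustrated with a small planning function. This is a sketch of the documented behavior, not the package's code: `plan_batch` is a hypothetical name, and recursive discovery of `*.pdf` files with lowercase extensions is an assumption based on the "preserved directory structure" feature.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def plan_batch(
    input_folder: str,
    output_folder: Optional[str] = None,
    skip_existing: bool = False,
) -> List[Tuple[Path, Path]]:
    """Pair every PDF under input_folder with its Markdown output path.

    Subdirectories are mirrored under output_folder, which defaults to the
    input folder. With skip_existing, PDFs whose .md file already exists
    are left out of the plan.
    """
    src_root = Path(input_folder)
    dst_root = Path(output_folder) if output_folder else src_root
    plan = []
    for pdf in sorted(src_root.rglob("*.pdf")):
        # Keep the path relative to the input root, swapping .pdf for .md.
        target = dst_root / pdf.relative_to(src_root).with_suffix(".md")
        if skip_existing and target.exists():
            continue
        plan.append((pdf, target))
    return plan
```

Running the plan twice with `skip_existing=True` after an interrupted batch yields only the PDFs that still lack output, which is what makes resumption cheap.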
Note on Multithreading:
- Single-threaded (threads=1): Default, sequential processing
- Multithreaded (threads>1): Parallel processing for faster batch conversion
- Be mindful of API rate limits when using higher thread counts
- Progress output is simplified in multithreaded mode for clarity
extract_text_from_pdf()
def extract_text_from_pdf(pdf_path: str) -> List[str]
Extract raw text from PDF (standard mode).
Returns: List of strings, one per page
extract_pages_with_vision()
def extract_pages_with_vision(pdf_path: str, dpi: int = 150) -> List[Dict[str, Any]]
Extract text and images from PDF pages for vision processing.
Returns: List of dicts with keys: page_num, text, image_base64, has_images, has_tables
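The has_tables and has_images flags can drive a simple choice between modes. `suggest_mode` below is a hypothetical helper, not part of the package; it only assumes the dict keys listed above and applies the guidance from Performance Tips.

```python
from typing import Any, Dict, List


def suggest_mode(pages: List[Dict[str, Any]]) -> str:
    """Suggest 'vision' if any page reports tables or images, else 'standard'."""
    for page in pages:
        if page.get("has_tables") or page.get("has_images"):
            return "vision"
    return "standard"
```

Feeding it the result of `extract_pages_with_vision("document.pdf")` gives a quick hint before committing to the slower, more expensive mode.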
chunk_pages()
def chunk_pages(pages: List[str], pages_per_chunk: int) -> List[str]
Combine pages into chunks for processing.
Returns: List of combined page chunks
Output Format
Converted markdown files include:
- Document title header
- Clean heading hierarchy
- Properly formatted tables
- Organized lists
- Removed page numbers and PDF artifacts
- Conversion metadata footer
Requirements
- Python 3.9 or higher
- API key for at least one provider:
- Anthropic API key (for Claude models)
- OpenAI API key (for GPT models)
Dependencies
All dependencies are automatically installed:
- anthropic: Claude API client (for Anthropic provider)
- openai: OpenAI API client (for OpenAI provider)
- pymupdf: PDF text and image extraction
- python-dotenv: Environment variable management
- click: CLI framework
License
This project is open source and available under the MIT License.
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for development setup, testing, and contribution guidelines.
For bug reports and feature requests, please open an issue on GitHub.