PDF to Markdown conversion with multiple backend support
pdfsmith
A unified interface to 19+ PDF parsing libraries including frontier LLMs. Pick the right tool for the job, or let pdfsmith choose for you.
Why pdfsmith?
- One API, many backends - Switch between parsers without changing your code
- Auto-selection - Automatically uses the best available parser
- Lightweight core - Install only the backends you need
- Battle-tested - Wrappers refined through extensive benchmarking
Installation
# Core package (no backends)
pip install pdfsmith
# With lightweight backends (quotes keep shells such as zsh from expanding the brackets)
pip install "pdfsmith[light]"
# Recommended stack (good balance of quality and speed)
pip install "pdfsmith[recommended]"
# All open-source backends
pip install "pdfsmith[all]"
# Frontier LLMs (GPT, Claude, Gemini)
pip install "pdfsmith[frontier]"
# Commercial cloud APIs
pip install "pdfsmith[commercial]"
# Specific backend
pip install "pdfsmith[docling]"
Quick Start
from pdfsmith import parse
# Auto-select best available backend
markdown = parse("document.pdf")
# Use a specific backend
markdown = parse("document.pdf", backend="docling")
# Check available backends
from pdfsmith import available_backends
for backend in available_backends():
    print(f"{backend.name}: {backend.description}")
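When a preferred backend is not installed or fails on a particular file, one option is to try several backends in order. The sketch below is illustrative only: `parse_with_fallback` is a hypothetical helper, not part of pdfsmith, and it receives the parse function as an argument so that it relies only on the `parse(path, backend=...)` call shown above. The exception types pdfsmith raises are not documented here, so the sketch catches `Exception` broadly.

```python
from typing import Callable, Iterable


def parse_with_fallback(
    parse_fn: Callable[..., str],
    path: str,
    backends: Iterable[str],
) -> tuple[str, str]:
    """Try each backend in order; return (backend_name, markdown) for the first success.

    parse_fn is expected to behave like pdfsmith.parse(path, backend=name).
    Which exceptions pdfsmith raises is an assumption, so any Exception
    is treated as "try the next backend".
    """
    errors: list[str] = []
    for name in backends:
        try:
            return name, parse_fn(path, backend=name)
        except Exception as exc:  # broad on purpose: error types are unknown
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

With pdfsmith installed, this could be called as `parse_with_fallback(parse, "document.pdf", ["docling", "pymupdf4llm", "pypdf"])`.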
CLI Usage
# Parse PDF to stdout
pdfsmith parse document.pdf
# Parse to file
pdfsmith parse document.pdf -o output.md
# Use specific backend
pdfsmith parse document.pdf -b docling
# List available backends
pdfsmith backends
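To convert a whole directory, the CLI can be driven from a short script. This is a sketch under assumptions: `build_command` and `convert_directory` are hypothetical helpers, only the `parse`, `-o`, and `-b` options shown above are used, and the CLI is assumed to exit with a non-zero status on failure.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_command(pdf: Path, out_dir: Path, backend: Optional[str] = None) -> list[str]:
    """Build a `pdfsmith parse` command that writes <out_dir>/<stem>.md."""
    cmd = ["pdfsmith", "parse", str(pdf), "-o", str(out_dir / f"{pdf.stem}.md")]
    if backend:
        cmd += ["-b", backend]
    return cmd


def convert_directory(src: Path, out_dir: Path, backend: Optional[str] = None) -> None:
    """Run the CLI once per PDF in src (non-recursive)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(src.glob("*.pdf")):
        # check=True raises CalledProcessError if the CLI exits non-zero (assumed behavior)
        subprocess.run(build_command(pdf, out_dir, backend), check=True)
```

For example, `convert_directory(Path("pdfs"), Path("markdown"), backend="docling")` would write one `.md` file per PDF found in `pdfs`.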
Available Backends
Open Source
| Backend | Weight | Best For |
|---|---|---|
| `docling` | heavy | Highest quality, complex documents |
| `marker` | heavy | Academic papers, LaTeX content |
| `pymupdf4llm` | medium | Good balance of speed and quality |
| `kreuzberg` | medium | Fast extraction with OCR |
| `unstructured` | medium | Versatile document processing |
| `pdfplumber` | light | Tables and structured data |
| `pymupdf` | light | Fast general-purpose extraction |
| `pypdf` | light | Lightweight, pure Python |
| `pdfminer` | light | Mature, handles encodings well |
| `pypdfium2` | light | Chrome's PDF engine |
| `extractous` | medium | Rust-based extraction |
Commercial Cloud APIs
| Backend | Provider | Cost | Best For |
|---|---|---|---|
| `aws_textract` | AWS | $1.50/1k pages | High-accuracy OCR |
| `azure_document_intelligence` | Azure | $1.50/1k pages | Enterprise documents |
| `google_document_ai` | Google Cloud | $1.50/1k pages | Multi-language support |
| `databricks` | Databricks | ~$3/1k pages | SQL-based workflows |
| `llamaparse` | LlamaIndex | $0.003/page | Cost-effective API |
Frontier LLMs
| Backend | Model | Cost | Best For |
|---|---|---|---|
| `anthropic` | Claude Sonnet 4.5 | ~$0.04/page | High accuracy |
| `openai` | GPT-4o | ~$0.02/page | General purpose |
| `gemini` | Gemini 2.0 Flash | ~$0.001/page | Budget LLM option |
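The per-page prices in the two tables above can be turned into a rough budget estimate before a large job. The sketch below simply multiplies those listed prices by a page count; `COST_PER_PAGE`, `estimate_cost`, and `cheapest_first` are hypothetical names, not part of pdfsmith, and actual provider pricing may differ from these approximate figures.

```python
# Per-page prices in USD, copied from the tables above (approximate where marked "~").
COST_PER_PAGE = {
    "aws_textract": 1.50 / 1000,
    "azure_document_intelligence": 1.50 / 1000,
    "google_document_ai": 1.50 / 1000,
    "databricks": 3.00 / 1000,
    "llamaparse": 0.003,
    "anthropic": 0.04,
    "openai": 0.02,
    "gemini": 0.001,
}


def estimate_cost(backend: str, pages: int) -> float:
    """Return the estimated cost in USD for parsing `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return COST_PER_PAGE[backend] * pages


def cheapest_first(pages: int) -> list[tuple[str, float]]:
    """List (backend, estimated cost) pairs sorted from cheapest to most expensive."""
    costs = [(name, estimate_cost(name, pages)) for name in COST_PER_PAGE]
    return sorted(costs, key=lambda item: item[1])
```

Calling `cheapest_first(500)` gives a quick ranking for a 500-page job; it says nothing about output quality, which the "Best For" columns address.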
Note: Frontier LLM backends require API keys set via environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY).
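A quick pre-flight check can report which keys are missing before any request is made. In this sketch the backend-to-variable mapping is an assumption inferred from the order of the names in the note above, and `missing_api_keys` is a hypothetical helper rather than a pdfsmith function.

```python
import os
from typing import Mapping, Optional

# Assumed backend -> variable mapping, inferred from the note above.
REQUIRED_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def missing_api_keys(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return {backend: variable} for every frontier backend whose key is unset or empty."""
    env = os.environ if env is None else env
    return {b: var for b, var in REQUIRED_KEYS.items() if not env.get(var)}
```

Called with no arguments, it inspects the current process environment; passing a mapping makes it easy to test.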
Choosing a Backend
- Best quality: `anthropic` or `openai` - Frontier LLM accuracy (highest cost)
- Best value: `llamaparse` - Near-LLM quality at 10x lower cost
- Structure preservation: `docling` - Deep learning, GPU recommended
- Academic papers: `marker` - Optimized for LaTeX/equations
- Tables: `pdfplumber` - Excellent table detection
- Speed: `pymupdf` or `kreuzberg` - Fast extraction
- Minimal dependencies: `pypdf` - Pure Python, no binaries
- Budget LLM: `gemini` with gemini-2.0-flash - Very low cost LLM option
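These recommendations can be encoded as a preference order and matched against whatever is installed. The sketch below is one possible approach, not pdfsmith's own auto-selection logic: `PREFERENCES` and `pick_backend` are hypothetical, and the fallback entries after the first name in each list are illustrative choices.

```python
from typing import Iterable, Optional

# Hypothetical preference orders loosely based on the recommendations above.
PREFERENCES = {
    "quality": ["anthropic", "openai", "docling"],
    "tables": ["pdfplumber", "docling"],
    "speed": ["pymupdf", "kreuzberg"],
    "minimal": ["pypdf"],
}


def pick_backend(goal: str, installed: Iterable[str]) -> Optional[str]:
    """Return the first preferred backend for `goal` that is installed, else None."""
    installed_set = set(installed)
    for name in PREFERENCES.get(goal, []):
        if name in installed_set:
            return name
    return None
```

Combined with the Quick Start API, the installed names could come from `(b.name for b in available_backends())`.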
System Dependencies
Some backends require system packages for OCR functionality:
Tesseract OCR (for kreuzberg and unstructured with OCR):
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Windows
# Download from https://github.com/UB-Mannheim/tesseract/wiki
Without tesseract, these backends will still work for text-based PDFs but cannot extract text from scanned/image PDFs.
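Whether Tesseract is reachable can be checked from Python before parsing scanned documents. This sketch uses only the standard library and looks for a `tesseract` executable on `PATH`; `tesseract_available` is a hypothetical helper, and how pdfsmith's backends locate Tesseract internally is not covered here.

```python
import shutil


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


if not tesseract_available():
    print("tesseract not found: scanned/image PDFs will not be OCR'd")
```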
Async Support
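import asyncio
from pdfsmith import parse_async
# Async parsing (uses backend's native async if available)
# asyncio.run drives the coroutine from synchronous code; inside an async function, use `await parse_async(...)` directly
markdown = asyncio.run(parse_async("document.pdf"))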
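Async parsing is most useful when several documents are processed at once. The sketch below shows one way to do that with `asyncio.gather` and a semaphore that caps concurrency. `parse_many` is a hypothetical helper; it accepts the async parse function as an argument and assumes it can be awaited with just a path, as `parse_async` is above.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def parse_many(
    parse_fn: Callable[[str], Awaitable[str]],
    paths: Sequence[str],
    max_concurrent: int = 4,
) -> list[str]:
    """Parse several PDFs concurrently, returning markdown in the same order as paths."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(path: str) -> str:
        async with semaphore:  # cap the number of in-flight parses
            return await parse_fn(path)

    return await asyncio.gather(*(worker(p) for p in paths))
```

With pdfsmith installed, `asyncio.run(parse_many(parse_async, ["a.pdf", "b.pdf"]))` would return the two Markdown strings in input order. The cap matters mostly for cloud and LLM backends, where unbounded concurrency can run into provider rate limits.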
Benchmarks
pdfsmith's backend wrappers were developed and refined through the pdf-bench benchmarking project, which evaluates parser performance across diverse document types.
License
MIT
Contributing
Contributions welcome! Please read our contributing guidelines before submitting PRs.