pdfsmith

PDF to Markdown conversion with multiple backend support

Python 3.10+ | License: MIT

A unified interface to 19+ PDF parsing backends, including open-source libraries, commercial cloud APIs, and frontier LLMs. Pick the right tool for the job, or let pdfsmith choose for you.

Why pdfsmith?

  • One API, many backends - Switch between parsers without changing your code
  • Auto-selection - Automatically uses the best available parser
  • Lightweight core - Install only the backends you need
  • Battle-tested - Wrappers refined through extensive benchmarking

Installation

# Core package (no backends)
pip install pdfsmith

# Extras are quoted so that shells such as zsh do not expand the brackets

# With lightweight backends
pip install "pdfsmith[light]"

# Recommended stack (good balance of quality and speed)
pip install "pdfsmith[recommended]"

# All open-source backends
pip install "pdfsmith[all]"

# Frontier LLMs (GPT, Claude, Gemini)
pip install "pdfsmith[frontier]"

# Commercial cloud APIs
pip install "pdfsmith[commercial]"

# Specific backend
pip install "pdfsmith[docling]"

Quick Start

from pdfsmith import parse

# Auto-select best available backend
markdown = parse("document.pdf")

# Use a specific backend
markdown = parse("document.pdf", backend="docling")

# Check available backends
from pdfsmith import available_backends
for backend in available_backends():
    print(f"{backend.name}: {backend.description}")
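For converting a whole folder, the single-file call above can be wrapped in a small loop. The following is a minimal sketch, not part of pdfsmith: `convert_directory` and its `parse_fn` parameter are hypothetical names, and it assumes the parser takes a path string and returns the Markdown as a `str`, as the Quick Start suggests.

```python
from pathlib import Path
from typing import Callable


def convert_directory(
    src_dir: Path,
    out_dir: Path,
    parse_fn: Callable[[str], str],
) -> list[Path]:
    """Convert every *.pdf in src_dir to a .md file in out_dir.

    parse_fn is any callable that takes a PDF path and returns Markdown
    text. It is passed in so the helper can be tried out with a stub,
    without any backend installed.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Sort for a deterministic processing order.
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        markdown = parse_fn(str(pdf_path))
        target = out_dir / (pdf_path.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

With pdfsmith installed, passing its `parse` function as `parse_fn` (for example `convert_directory(Path("pdfs"), Path("markdown"), parse)`) should convert each file with the auto-selected backend.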

CLI Usage

# Parse PDF to stdout
pdfsmith parse document.pdf

# Parse to file
pdfsmith parse document.pdf -o output.md

# Use specific backend
pdfsmith parse document.pdf -b docling

# List available backends
pdfsmith backends

Available Backends

Open Source

Backend        Weight   Best For
docling        heavy    Highest quality, complex documents
marker         heavy    Academic papers, LaTeX content
pymupdf4llm    medium   Good balance of speed and quality
kreuzberg      medium   Fast extraction with OCR
unstructured   medium   Versatile document processing
pdfplumber     light    Tables and structured data
pymupdf        light    Fast general-purpose extraction
pypdf          light    Lightweight, pure Python
pdfminer       light    Mature, handles encodings well
pypdfium2      light    Chrome's PDF engine
extractous     medium   Rust-based extraction

Commercial Cloud APIs

Backend                       Provider       Cost             Best For
aws_textract                  AWS            $1.50/1k pages   High-accuracy OCR
azure_document_intelligence   Azure          $1.50/1k pages   Enterprise documents
google_document_ai            Google Cloud   $1.50/1k pages   Multi-language support
databricks                    Databricks     ~$3/1k pages     SQL-based workflows
llamaparse                    LlamaIndex     $0.003/page      Cost-effective API

Frontier LLMs

Backend     Model               Cost           Best For
anthropic   Claude Sonnet 4.5   ~$0.04/page    High accuracy
openai      GPT-4o              ~$0.02/page    General purpose
gemini      Gemini 2.0 Flash    ~$0.001/page   Budget LLM option

Note: Frontier LLM backends require API keys set via environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY).
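Before calling a frontier backend it can help to check which keys are actually set. The sketch below is illustrative only: `missing_api_keys` is a hypothetical helper, and the pairing of each backend with its variable is an assumption inferred from the order in which the note lists them.

```python
import os

# Assumed pairing of backend name to environment variable, inferred from
# the order in which the note above lists the variables.
REQUIRED_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def missing_api_keys(environ=None) -> dict[str, str]:
    """Return {backend: variable} for each backend whose key is unset or empty."""
    env = os.environ if environ is None else environ
    return {
        backend: var
        for backend, var in REQUIRED_ENV_VARS.items()
        if not env.get(var)
    }


if __name__ == "__main__":
    for backend, var in missing_api_keys().items():
        print(f"{backend}: set {var} to enable this backend")
```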

Choosing a Backend

  • Best quality: anthropic or openai - Frontier LLM accuracy (highest cost)
  • Best value: llamaparse - Near-LLM quality at roughly 10x lower cost than the anthropic and openai backends
  • Structure preservation: docling - Deep learning, GPU recommended
  • Academic papers: marker - Optimized for LaTeX/equations
  • Tables: pdfplumber - Excellent table detection
  • Speed: pymupdf or kreuzberg - Fast extraction
  • Minimal dependencies: pypdf - Pure Python, no binaries
  • Budget LLM: gemini with gemini-2.0-flash - Very low cost LLM option
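To compare the paid options for a given workload, the per-page prices from the tables above can be turned into a rough estimate. This is a back-of-the-envelope sketch: the figures are copied from this page and may be out of date, and `estimate_cost` and `cheapest` are hypothetical helpers, not pdfsmith functions.

```python
# Approximate USD price per page, copied from the tables above.
# "$1.50/1k pages" becomes 0.0015 per page; "~" figures are rough.
COST_PER_PAGE = {
    "aws_textract": 0.0015,
    "azure_document_intelligence": 0.0015,
    "google_document_ai": 0.0015,
    "databricks": 0.003,
    "llamaparse": 0.003,
    "anthropic": 0.04,
    "openai": 0.02,
    "gemini": 0.001,
}


def estimate_cost(backend: str, pages: int) -> float:
    """Estimated USD cost of parsing `pages` pages with a paid backend.

    Raises KeyError for backends without a listed price.
    """
    return COST_PER_PAGE[backend] * pages


def cheapest(backends: list[str], pages: int) -> str:
    """Name of the backend with the lowest estimated cost for this job."""
    return min(backends, key=lambda name: estimate_cost(name, pages))


if __name__ == "__main__":
    pages = 1000
    for name in ("llamaparse", "openai", "anthropic"):
        print(f"{name}: ${estimate_cost(name, pages):.2f} for {pages} pages")
```

On these figures llamaparse costs about one seventh as much as openai and about one thirteenth as much as anthropic, which is where the "10x" rule of thumb comes from.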

System Dependencies

Some backends require system packages for OCR functionality:

Tesseract OCR (for kreuzberg and unstructured with OCR):

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
# Download from https://github.com/UB-Mannheim/tesseract/wiki

Without Tesseract, these backends still handle text-based PDFs but cannot extract text from scanned or image-only PDFs.
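A quick way to see whether the OCR dependency is present is to look for the executable on PATH. This is a minimal sketch using only the standard library; `binary_on_path` is a hypothetical helper, and finding the binary here does not guarantee that a given backend will locate or use it.

```python
import shutil


def binary_on_path(name: str) -> bool:
    """True if an executable called `name` can be found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if binary_on_path("tesseract"):
        print("tesseract found: OCR for scanned PDFs should be possible")
    else:
        print("tesseract not found: only text-based PDFs will yield text")
```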

Async Support

import asyncio
from pdfsmith import parse_async

async def main() -> str:
    # Async parsing (uses backend's native async if available)
    return await parse_async("document.pdf")

markdown = asyncio.run(main())
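When several documents need converting, an async parser can be fanned out with `asyncio.gather`, with a semaphore to cap the number of parses in flight. The sketch below is an assumption-laden illustration: `parse_many`, `parse_fn`, and `max_concurrent` are hypothetical names, and the parser is assumed to be a coroutine function that takes a path and returns Markdown text.

```python
import asyncio
from typing import Awaitable, Callable


async def parse_many(
    paths: list[str],
    parse_fn: Callable[[str], Awaitable[str]],
    max_concurrent: int = 4,
) -> list[str]:
    """Parse several PDFs concurrently, at most max_concurrent at a time.

    Results come back in the same order as `paths`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def parse_one(path: str) -> str:
        # The semaphore limits how many parses run at once.
        async with semaphore:
            return await parse_fn(path)

    return await asyncio.gather(*(parse_one(p) for p in paths))
```

With pdfsmith installed, `asyncio.run(parse_many(["a.pdf", "b.pdf"], parse_async))` would be the intended call. A cap is worth keeping for the cloud and LLM backends, where unbounded concurrency can run into provider rate limits.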

Benchmarks

pdfsmith's backend wrappers were developed and refined through the pdf-bench benchmarking project, which evaluates parser performance across diverse document types.

License

MIT

Contributing

Contributions welcome! Please read our contributing guidelines before submitting PRs.
