pdfsmith

PDF to Markdown conversion with multiple backend support

A unified interface to 19+ PDF parsing backends, from open-source libraries to commercial cloud APIs and frontier LLMs. Pick the right tool for the job, or let pdfsmith choose for you.

Why pdfsmith?

One API, many backends - Switch between parsers without changing your code
Auto-selection - Automatically uses the best available parser
Lightweight core - Install only the backends you need
Battle-tested - Wrappers refined through extensive benchmarking

Installation

# Core package (no backends)
pip install pdfsmith

# Extras are quoted so that shells such as zsh do not treat the brackets as a glob pattern

# With lightweight backends
pip install "pdfsmith[light]"

# Recommended stack (good balance of quality and speed)
pip install "pdfsmith[recommended]"

# All open-source backends
pip install "pdfsmith[all]"

# Frontier LLMs (GPT, Claude, Gemini)
pip install "pdfsmith[frontier]"

# Commercial cloud APIs
pip install "pdfsmith[commercial]"

# Specific backend
pip install "pdfsmith[docling]"
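
After installing, it can be useful to confirm which version ended up in the active environment. The following is a minimal sketch using only the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper for illustration, not part of pdfsmith.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report whether the core package is present in this environment.
found = installed_version("pdfsmith")
print(f"pdfsmith: {found or 'not installed'}")
```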

Quick Start

from pdfsmith import parse

# Auto-select best available backend
markdown = parse("document.pdf")

# Use a specific backend
markdown = parse("document.pdf", backend="docling")

# Check available backends
from pdfsmith import available_backends
for backend in available_backends():
    print(f"{backend.name}: {backend.description}")
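
Because every backend is reached through the same `parse(path, backend=...)` call, trying several backends in order is straightforward. The sketch below is one possible approach, not pdfsmith's own auto-selection logic: `parse_with_fallback` is a hypothetical helper, the parse function is passed in as an argument, and catching a broad `Exception` is an assumption because the exceptions pdfsmith raises are not documented here.

```python
from typing import Callable, Iterable


def parse_with_fallback(
    path: str,
    backends: Iterable[str],
    parse_fn: Callable[..., str],
) -> str:
    """Try each backend in order and return the first successful result."""
    errors = []
    for name in backends:
        try:
            return parse_fn(path, backend=name)
        except Exception as exc:  # assumption: exact exception types are not documented
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Usage (requires pdfsmith and at least one of the listed backends):
# from pdfsmith import parse
# markdown = parse_with_fallback("document.pdf", ["docling", "pymupdf4llm", "pypdf"], parse)
```

Passing the parse function in keeps the helper easy to test with a stub and independent of which backends are installed.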

CLI Usage

# Parse PDF to stdout
pdfsmith parse document.pdf

# Parse to file
pdfsmith parse document.pdf -o output.md

# Use specific backend
pdfsmith parse document.pdf -b docling

# List available backends
pdfsmith backends
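
The CLI can also be driven from a script. The sketch below builds the argument list from the flags shown above and runs it with `subprocess`; `build_parse_command` and `run_parse` are hypothetical helpers, and combining `-o` and `-b` in a single invocation is an assumption, since the examples above only show them separately.

```python
import subprocess
from typing import List, Optional


def build_parse_command(
    pdf_path: str,
    output: Optional[str] = None,
    backend: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `pdfsmith parse` using the flags shown above."""
    cmd = ["pdfsmith", "parse", pdf_path]
    if output is not None:
        cmd += ["-o", output]
    if backend is not None:
        cmd += ["-b", backend]
    return cmd


def run_parse(
    pdf_path: str,
    output: Optional[str] = None,
    backend: Optional[str] = None,
) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(
        build_parse_command(pdf_path, output, backend),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```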

Available Backends

Open Source

| Backend | Weight | Best For |
| --- | --- | --- |
| `docling` | heavy | Highest quality, complex documents |
| `marker` | heavy | Academic papers, LaTeX content |
| `pymupdf4llm` | medium | Good balance of speed and quality |
| `kreuzberg` | medium | Fast extraction with OCR |
| `unstructured` | medium | Versatile document processing |
| `pdfplumber` | light | Tables and structured data |
| `pymupdf` | light | Fast general-purpose extraction |
| `pypdf` | light | Lightweight, pure Python |
| `pdfminer` | light | Mature, handles encodings well |
| `pypdfium2` | light | Chrome's PDF engine |
| `extractous` | medium | Rust-based extraction |

Commercial Cloud APIs

| Backend | Provider | Cost | Best For |
| --- | --- | --- | --- |
| `aws_textract` | AWS | $1.50/1k pages | High-accuracy OCR |
| `azure_document_intelligence` | Azure | $1.50/1k pages | Enterprise documents |
| `google_document_ai` | Google Cloud | $1.50/1k pages | Multi-language support |
| `databricks` | Databricks | ~$3/1k pages | SQL-based workflows |
| `llamaparse` | LlamaIndex | $0.003/page ($3/1k pages) | Cost-effective API |

Frontier LLMs

| Backend | Model | Cost | Best For |
| --- | --- | --- | --- |
| `anthropic` | Claude Sonnet 4.5 | ~$0.04/page | High accuracy |
| `openai` | GPT-4o | ~$0.02/page | General purpose |
| `gemini` | Gemini 2.0 Flash | ~$0.001/page | Budget LLM option |
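
The per-page and per-1k-page prices in the tables above can be put on a common scale to compare paid backends for a given workload. The sketch below copies the listed figures into a dictionary and multiplies them out; `PRICE_PER_1K_PAGES` and `estimate_cost` are hypothetical names, and the results are only as accurate as the approximate prices quoted above.

```python
# Approximate prices in USD per 1,000 pages, copied from the tables above.
# Entries marked "~" in the tables are rough figures, so treat results as estimates.
PRICE_PER_1K_PAGES = {
    "aws_textract": 1.50,
    "azure_document_intelligence": 1.50,
    "google_document_ai": 1.50,
    "databricks": 3.00,
    "llamaparse": 3.00,   # $0.003/page
    "anthropic": 40.00,   # ~$0.04/page
    "openai": 20.00,      # ~$0.02/page
    "gemini": 1.00,       # ~$0.001/page
}


def estimate_cost(backend: str, pages: int) -> float:
    """Estimate the cost in USD of parsing `pages` pages with a paid backend."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_1K_PAGES[backend] * pages / 1000


# Compare a 500-page job across a few backends.
for name in ("gemini", "llamaparse", "openai", "anthropic"):
    print(f"{name}: ${estimate_cost(name, 500):.2f} for 500 pages")
```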

Note: Frontier LLM backends require API keys set via environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY).
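
A quick pre-flight check can catch a missing key before any request is made. The sketch below is illustrative only: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and pairing each backend with a variable is an assumption inferred from the provider names in the note above.

```python
import os
from typing import Dict, Iterable, List, Mapping

# Assumed pairing of frontier backends to the environment variables named above.
REQUIRED_KEYS: Dict[str, str] = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def missing_keys(
    backends: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the variables that are unset or empty for the given backends."""
    missing = []
    for name in backends:
        var = REQUIRED_KEYS.get(name)  # backends without a known key are skipped
        if var is not None and not env.get(var):
            missing.append(var)
    return missing


problems = missing_keys(["anthropic", "openai", "gemini"])
if problems:
    print("Missing API keys: " + ", ".join(problems))
```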

Choosing a Backend

Best quality: anthropic or openai - Frontier LLM accuracy (highest cost)
Best value: llamaparse - Near-LLM quality at roughly 10x lower per-page cost than the frontier LLM backends
Structure preservation: docling - Deep learning, GPU recommended
Academic papers: marker - Optimized for LaTeX/equations
Tables: pdfplumber - Excellent table detection
Speed: pymupdf or kreuzberg - Fast extraction
Minimal dependencies: pypdf - Pure Python, no binaries
Budget LLM: gemini with gemini-2.0-flash - Very low cost LLM option
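
The guidance above can be written down as a small lookup so that a script picks the first recommended backend that is actually installed. This is a sketch under assumptions: `RECOMMENDED` and `pick_backend` are hypothetical names, and the priority keys are invented labels for the categories listed above.

```python
from typing import Dict, List, Optional, Set

# Priorities mapped to backend names, in the order given in the list above.
RECOMMENDED: Dict[str, List[str]] = {
    "quality": ["anthropic", "openai"],
    "value": ["llamaparse"],
    "structure": ["docling"],
    "academic": ["marker"],
    "tables": ["pdfplumber"],
    "speed": ["pymupdf", "kreuzberg"],
    "minimal": ["pypdf"],
    "budget_llm": ["gemini"],
}


def pick_backend(priority: str, installed: Set[str]) -> Optional[str]:
    """Return the first recommended backend for `priority` that is installed, else None."""
    for name in RECOMMENDED.get(priority, []):
        if name in installed:
            return name
    return None


# Usage (requires pdfsmith):
# from pdfsmith import available_backends, parse
# installed = {b.name for b in available_backends()}
# backend = pick_backend("tables", installed)
# markdown = parse("document.pdf", backend=backend) if backend else parse("document.pdf")
```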

System Dependencies

Some backends require system packages for OCR functionality:

Tesseract OCR (for kreuzberg and unstructured with OCR):

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
# Download from https://github.com/UB-Mannheim/tesseract/wiki

Without Tesseract, these backends still work for text-based PDFs but cannot extract text from scanned or image-only PDFs.
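
To find out ahead of time whether OCR will be available, a script can look for the `tesseract` executable on the `PATH`. This is a minimal sketch using the standard library; `find_tesseract` is a hypothetical helper, and it assumes the backends locate Tesseract through the `PATH` as well.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the path to the tesseract executable on PATH, or None if absent."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract not found: scanned/image PDFs cannot be OCR'd")
else:
    print(f"tesseract found at {path}")
```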

Async Support

import asyncio
from pdfsmith import parse_async

async def main():
    # Async parsing (uses backend's native async if available)
    return await parse_async("document.pdf")

markdown = asyncio.run(main())
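
An async entry point is most useful when several documents are parsed at once. The sketch below shows one way to do that with `asyncio.gather` and a semaphore that caps concurrency; `parse_many` and `max_concurrency` are hypothetical names, and the parse coroutine is passed in so the helper does not depend on a particular backend.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def parse_many(
    paths: List[str],
    parse_fn: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> Dict[str, str]:
    """Parse several PDFs concurrently, limiting how many run at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> str:
        async with semaphore:
            return await parse_fn(path)

    # gather preserves input order, so results line up with paths.
    results = await asyncio.gather(*(worker(p) for p in paths))
    return dict(zip(paths, results))


# Usage (requires pdfsmith):
# from pdfsmith import parse_async
# documents = asyncio.run(parse_many(["a.pdf", "b.pdf"], parse_async))
```

Capping concurrency matters mainly for the cloud and LLM backends, where too many simultaneous requests may run into provider rate limits.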

Benchmarks

pdfsmith's backend wrappers were developed and refined through the pdf-bench benchmarking project, which evaluates parser performance across diverse document types.

License

MIT

Contributing

Contributions welcome! Please read our contributing guidelines before submitting PRs.

Version 0.2.0, released Dec 2, 2025

