Unified document processing with AI-powered OCR
doc2mark
Turn any document into clean Markdown -- in one line.
Features
- Converts PDFs, DOCX/XLSX/PPTX, images, HTML, CSV/JSON, and more
- AI-powered OCR via OpenAI, Google Gemini (Vertex AI), or Tesseract
- Preserves complex tables (merged cells, rowspan/colspan)
- One unified API + CLI for single files or entire directories
- Batch processing with parallel execution
- Per-call token usage tracking for OpenAI and Vertex AI providers
Install
# Core (no OCR)
pip install doc2mark
# With OpenAI OCR (quotes keep shells such as zsh from expanding the brackets)
pip install "doc2mark[ocr]"
# With Google Gemini / Vertex AI OCR
pip install "doc2mark[vertex_ai]"
# Everything
pip install "doc2mark[all]"
Quick start
from doc2mark import UnifiedDocumentLoader
loader = UnifiedDocumentLoader()
result = loader.load("document.pdf")
print(result.content)
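A common next step is writing the converted Markdown to disk. The sketch below relies only on what the quick start shows, namely that the loaded result exposes its Markdown as `.content`. The helper name `save_markdown` and the `converted` default directory are illustrative assumptions, not part of doc2mark.

```python
from pathlib import Path


def save_markdown(result, source_path, output_dir="converted"):
    """Write result.content to <output_dir>/<source stem>.md and return the path.

    `result` is any object with a string `.content` attribute, such as the
    value returned by loader.load(...) in the quick start.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = out_dir / (Path(source_path).stem + ".md")
    target.write_text(result.content, encoding="utf-8")
    return target
```

With the quick start's objects, `save_markdown(result, "document.pdf")` would write `converted/document.md`.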
OCR providers
doc2mark supports three OCR providers. Pass ocr_provider to UnifiedDocumentLoader to choose one.
OpenAI (default)
Uses GPT-4.1 vision. Requires an API key.
export OPENAI_API_KEY=sk-...
loader = UnifiedDocumentLoader(ocr_provider="openai")
result = loader.load(
    "scanned_doc.pdf",
    extract_images=True,
    ocr_images=True,
)
Customize the model or use an OpenAI-compatible endpoint:
loader = UnifiedDocumentLoader(
    ocr_provider="openai",
    model="gpt-4o-mini",                   # cheaper model
    base_url="http://localhost:11434/v1",  # self-hosted / Ollama
    api_key="any-string",
)
Google Gemini / Vertex AI
Uses Gemini models via Google Cloud. Authenticates with Application Default Credentials.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
loader = UnifiedDocumentLoader(
    ocr_provider="vertex_ai",
    project="my-gcp-project",  # or set GOOGLE_CLOUD_PROJECT
)
result = loader.load("scan.pdf", extract_images=True, ocr_images=True)
Override model and region:
loader = UnifiedDocumentLoader(
    ocr_provider="vertex_ai",
    project="my-gcp-project",
    model="gemini-2.0-flash",  # default: gemini-3.1-flash-lite-preview
    location="us-central1",    # default: global
)
Tesseract (offline)
Local OCR, no API key needed. Requires Tesseract installed on your system.
from doc2mark.ocr.base import OCRConfig
loader = UnifiedDocumentLoader(
    ocr_provider="tesseract",
    ocr_config=OCRConfig(language="chinese"),  # optional language hint
)
result = loader.load("scan.png", extract_images=True, ocr_images=True)
Provider comparison
| Provider | Requires | Best for | Install extra |
|---|---|---|---|
| openai | OPENAI_API_KEY | Highest accuracy, complex layouts | pip install doc2mark[ocr] |
| vertex_ai | GCP service account | Google Cloud workflows, Gemini models | pip install doc2mark[vertex_ai] |
| tesseract | Tesseract binary | Offline / air-gapped environments | pip install doc2mark[ocr] |
Supported formats
| Category | Formats |
|---|---|
| Office | DOCX, XLSX, PPTX |
| PDF | PDF (text + scanned) |
| Images | PNG, JPG, WEBP, TIFF, BMP, GIF, HEIC, HEIF, AVIF |
| Text / Data | TXT, CSV, TSV, JSON, JSONL |
| Markup | HTML, XML, Markdown |
| Legacy | DOC, XLS, PPT, RTF, PPS (requires LibreOffice) |
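When pre-sorting a mixed folder, it can help to map a file name to the category in the table above, for example to flag files that need LibreOffice. This is a hedged sketch: `FORMAT_CATEGORIES` and `format_category` are hypothetical names, the extension sets simply lowercase the format names from the table, and `md` for Markdown is an assumption. Variants such as `.jpeg`, `.tif`, or `.htm` are not listed in the table and are therefore not included here.

```python
from pathlib import Path

# Lowercased format names from the table; "md" for Markdown is assumed.
FORMAT_CATEGORIES = {
    "Office": {"docx", "xlsx", "pptx"},
    "PDF": {"pdf"},
    "Images": {"png", "jpg", "webp", "tiff", "bmp", "gif", "heic", "heif", "avif"},
    "Text / Data": {"txt", "csv", "tsv", "json", "jsonl"},
    "Markup": {"html", "xml", "md"},
    "Legacy": {"doc", "xls", "ppt", "rtf", "pps"},
}


def format_category(path):
    """Return the table category for a file name, or None if it is not listed."""
    ext = Path(path).suffix.lower().lstrip(".")
    for category, extensions in FORMAT_CATEGORIES.items():
        if ext in extensions:
            return category
    return None
```

For instance, `format_category("slides.pps")` returns `"Legacy"`, which signals that LibreOffice is required.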
Common recipes
Single file
from doc2mark import load
# Text-only extraction (no OCR)
md = load("report.pdf").content
# With OCR for embedded images
md = load("report.pdf", extract_images=True, ocr_images=True).content
Batch processing
from doc2mark import UnifiedDocumentLoader
loader = UnifiedDocumentLoader(ocr_provider="openai")
loader.batch_process(
    input_dir="documents/",
    output_dir="converted/",
    extract_images=True,
    ocr_images=True,
    save_files=True,
    show_progress=True,
)
Process specific files
from doc2mark import batch_process_files
results = batch_process_files(
    ["invoice.pdf", "contract.docx", "receipt.png"],
    output_dir="output/",
    extract_images=True,
    ocr_images=True,
)
OCR prompt templates
doc2mark includes specialized prompts for different content types:
loader = UnifiedDocumentLoader(
    ocr_provider="openai",
    prompt_template="table_focused",  # optimized for tables
)
Available templates: default, table_focused, document_focused, multilingual, form_focused, receipt_focused, handwriting_focused, code_focused.
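If the template name comes from user input or a config file, checking it against the list above before constructing the loader gives a clearer error. This is a small sketch under that assumption; `PROMPT_TEMPLATES` and `check_prompt_template` are hypothetical names, and how doc2mark itself reacts to an unknown template is not covered here.

```python
import difflib

# Template names copied from the list above.
PROMPT_TEMPLATES = (
    "default",
    "table_focused",
    "document_focused",
    "multilingual",
    "form_focused",
    "receipt_focused",
    "handwriting_focused",
    "code_focused",
)


def check_prompt_template(name):
    """Return name unchanged if it is a listed template, else raise ValueError."""
    if name in PROMPT_TEMPLATES:
        return name
    # Suggest the closest listed name, if any is reasonably similar.
    close = difflib.get_close_matches(name, PROMPT_TEMPLATES, n=1)
    hint = f" Did you mean {close[0]!r}?" if close else ""
    raise ValueError(f"Unknown prompt template {name!r}.{hint}")
```

The validated value can then be passed as `prompt_template=` to `UnifiedDocumentLoader`.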
Table output styles
Control how complex tables (with merged cells) are rendered:
loader = UnifiedDocumentLoader(
    table_style="minimal_html",     # clean HTML with rowspan/colspan (default)
    # table_style="markdown_grid",  # markdown with merge annotations
    # table_style="styled_html",    # full HTML with inline styles
)
Token usage tracking
When using OpenAI or Vertex AI, each OCR result includes token usage in its metadata:
from doc2mark import UnifiedDocumentLoader
loader = UnifiedDocumentLoader(ocr_provider="openai")
result = loader.load("scan.pdf", extract_images=True, ocr_images=True)
# Token usage per OCR call is in result metadata
usage = result.metadata.get("token_usage", {})
print(usage)
# {"input_tokens": 1234, "output_tokens": 567, "total_tokens": 1801}
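To estimate cost across several documents, the per-result numbers can be summed. The sketch below assumes each result carries a `metadata` dict with an optional `token_usage` entry shaped like the example above; results without usage data (for example, from Tesseract or text-only extraction) count as zero. The function name `total_token_usage` is hypothetical.

```python
USAGE_KEYS = ("input_tokens", "output_tokens", "total_tokens")


def total_token_usage(results):
    """Sum token_usage counters over results; missing entries count as 0."""
    totals = {key: 0 for key in USAGE_KEYS}
    for result in results:
        usage = result.metadata.get("token_usage") or {}
        for key in USAGE_KEYS:
            totals[key] += int(usage.get(key, 0))
    return totals
```

Calling `total_token_usage([loader.load(p, extract_images=True, ocr_images=True) for p in paths])` would give one combined dictionary for the whole run.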
CLI
# Single file to stdout
doc2mark report.pdf
# Save to file
doc2mark report.pdf -o report.md
# Batch convert a directory
doc2mark documents/ -o converted/ -r
# With OpenAI OCR
doc2mark scan.pdf --ocr openai --ocr-images
# With Tesseract OCR
doc2mark scan.pdf --ocr tesseract --ocr-images
# Disable OCR entirely
doc2mark report.pdf --ocr none --no-ocr-images
# JSON output
doc2mark report.pdf --format json
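When calling the CLI from a script, building the argument list in one place avoids shell-quoting mistakes. This sketch uses only the flags shown in the examples above (`-o`, `-r`, `--ocr`, `--ocr-images`, `--format`); the function name `doc2mark_argv` and its parameter names are hypothetical, and it only assembles the arguments without running anything.

```python
def doc2mark_argv(source, output=None, recursive=False, ocr=None,
                  ocr_images=False, output_format=None):
    """Build an argument list for the doc2mark CLI from keyword options."""
    argv = ["doc2mark", str(source)]
    if output is not None:
        argv += ["-o", str(output)]      # file or directory to write to
    if recursive:
        argv.append("-r")                # as in the batch example above
    if ocr is not None:
        argv += ["--ocr", ocr]           # "openai", "tesseract", or "none"
    if ocr_images:
        argv.append("--ocr-images")
    if output_format is not None:
        argv += ["--format", output_format]
    return argv
```

The list can be handed to `subprocess.run(argv, check=True)` from the standard library, which passes each argument to the program without shell interpretation.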
License
MIT -- see LICENSE.