Unified document processing with AI-powered OCR

doc2mark

Turn any document into clean Markdown -- in one line.

Features

  • Converts PDFs, DOCX/XLSX/PPTX, images, HTML, CSV/JSON, and more
  • AI-powered OCR via OpenAI, Google Gemini (Vertex AI), or Tesseract
  • Preserves complex tables (merged cells, rowspan/colspan)
  • One unified API + CLI for single files or entire directories
  • Batch processing with parallel execution
  • Per-call token usage tracking for OpenAI and Vertex AI providers

Install

# Core (no OCR)
pip install doc2mark

# With OpenAI OCR
pip install doc2mark[ocr]

# With Google Gemini / Vertex AI OCR
pip install doc2mark[vertex_ai]

# Everything
pip install doc2mark[all]

Quick start

from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader()
result = loader.load("document.pdf")
print(result.content)

OCR providers

doc2mark supports three OCR providers. Pass ocr_provider to UnifiedDocumentLoader to choose one.

OpenAI (default)

Uses GPT-4.1 vision. Requires an API key.

# Shell: make the key available to doc2mark
export OPENAI_API_KEY=sk-...

# Python
loader = UnifiedDocumentLoader(ocr_provider="openai")

result = loader.load(
    "scanned_doc.pdf",
    extract_images=True,
    ocr_images=True,
)

Customize the model or use an OpenAI-compatible endpoint:

loader = UnifiedDocumentLoader(
    ocr_provider="openai",
    model="gpt-4o-mini",                     # cheaper model
    base_url="http://localhost:11434/v1",     # self-hosted / Ollama
    api_key="any-string",
)

Google Gemini / Vertex AI

Uses Gemini models via Google Cloud. Authenticates with Application Default Credentials.

# Shell: point Application Default Credentials at a service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# Python
loader = UnifiedDocumentLoader(
    ocr_provider="vertex_ai",
    project="my-gcp-project",          # or set GOOGLE_CLOUD_PROJECT
)

result = loader.load("scan.pdf", extract_images=True, ocr_images=True)

Override model and region:

loader = UnifiedDocumentLoader(
    ocr_provider="vertex_ai",
    project="my-gcp-project",
    model="gemini-2.0-flash",          # default: gemini-3.1-flash-lite-preview
    location="us-central1",            # default: global
)

Tesseract (offline)

Local OCR, no API key needed. Requires Tesseract installed on your system.

from doc2mark import UnifiedDocumentLoader
from doc2mark.ocr.base import OCRConfig

loader = UnifiedDocumentLoader(
    ocr_provider="tesseract",
    ocr_config=OCRConfig(language="chinese"),   # optional language hint
)

result = loader.load("scan.png", extract_images=True, ocr_images=True)

Provider comparison

| Provider | Requires | Best for | Install extra |
| --- | --- | --- | --- |
| openai | OPENAI_API_KEY | Highest accuracy, complex layouts | pip install doc2mark[ocr] |
| vertex_ai | GCP service account | Google Cloud workflows, Gemini models | pip install doc2mark[vertex_ai] |
| tesseract | Tesseract binary | Offline / air-gapped environments | pip install doc2mark[ocr] |
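
The "Requires" column above can be turned into a small runtime check. The following is a minimal sketch, not part of doc2mark: `pick_ocr_provider` is a hypothetical helper, and treating the presence of `GOOGLE_APPLICATION_CREDENTIALS` as proof that Vertex AI is usable is an assumption (Application Default Credentials can also come from other sources).

```python
import os
import shutil
from typing import Mapping, Optional


def pick_ocr_provider(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return an ocr_provider name whose prerequisite appears to be available.

    Hypothetical helper: the order (openai, vertex_ai, tesseract) simply
    follows the comparison table and is not a doc2mark rule.
    """
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "vertex_ai"
    # shutil.which returns None when the binary is not on PATH.
    if shutil.which("tesseract"):
        return "tesseract"
    return None
```

The returned string could then be passed as `ocr_provider` to `UnifiedDocumentLoader`; a return value of `None` means no prerequisite was detected and OCR should probably be skipped.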

Supported formats

| Category | Formats |
| --- | --- |
| Office | DOCX, XLSX, PPTX |
| PDF | PDF (text + scanned) |
| Images | PNG, JPG, WEBP, TIFF, BMP, GIF, HEIC, HEIF, AVIF |
| Text / Data | TXT, CSV, TSV, JSON, JSONL |
| Markup | HTML, XML, Markdown |
| Legacy | DOC, XLS, PPT, RTF, PPS (requires LibreOffice) |
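
To pre-sort files before conversion, the table above can be mirrored as a lookup. This is an illustrative sketch only: `FORMAT_CATEGORIES`, `categorize`, and `needs_libreoffice` are hypothetical names, the mapping of "Markdown" to the `md` extension is an assumption, and doc2mark may accept extension variants (such as `jpeg` or `htm`) that are not listed here.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported-formats table; "md" is an assumed
# extension for the "Markdown" entry.
FORMAT_CATEGORIES = {
    "Office": {"docx", "xlsx", "pptx"},
    "PDF": {"pdf"},
    "Images": {"png", "jpg", "webp", "tiff", "bmp", "gif", "heic", "heif", "avif"},
    "Text / Data": {"txt", "csv", "tsv", "json", "jsonl"},
    "Markup": {"html", "xml", "md"},
    "Legacy": {"doc", "xls", "ppt", "rtf", "pps"},
}


def categorize(path: str) -> Optional[str]:
    """Return the table category for a file path, or None if it is not listed."""
    ext = Path(path).suffix.lower().lstrip(".")
    for category, extensions in FORMAT_CATEGORIES.items():
        if ext in extensions:
            return category
    return None


def needs_libreoffice(path: str) -> bool:
    """Legacy formats are the ones the table marks as requiring LibreOffice."""
    return categorize(path) == "Legacy"
```

A caller could use `needs_libreoffice` to warn early when a legacy file is queued on a machine without LibreOffice.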

Common recipes

Single file

from doc2mark import load

# Text-only extraction (no OCR)
md = load("report.pdf").content

# With OCR for embedded images
md = load("report.pdf", extract_images=True, ocr_images=True).content

Batch processing

from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(ocr_provider="openai")

loader.batch_process(
    input_dir="documents/",
    output_dir="converted/",
    extract_images=True,
    ocr_images=True,
    save_files=True,
    show_progress=True,
)

Process specific files

from doc2mark import batch_process_files

results = batch_process_files(
    ["invoice.pdf", "contract.docx", "receipt.png"],
    output_dir="output/",
    extract_images=True,
    ocr_images=True,
)
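
When the file list is not known in advance, it can be built from a directory tree first. The sketch below uses only the standard library; `collect_files` is a hypothetical helper, not a doc2mark function, and it assumes that `batch_process_files` accepts a list of path strings as in the example above.

```python
from pathlib import Path
from typing import Iterable, List


def collect_files(root: str, extensions: Iterable[str]) -> List[str]:
    """Recursively collect files under root whose extension is in extensions.

    Extensions are matched case-insensitively, with or without a leading dot.
    The result is sorted so repeated runs process files in the same order.
    """
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower().lstrip(".") in wanted
    )
```

For example, `collect_files("documents/", ["pdf", "docx", "png"])` would produce a list suitable for the first argument of `batch_process_files`.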

OCR prompt templates

doc2mark includes specialized prompts for different content types:

loader = UnifiedDocumentLoader(
    ocr_provider="openai",
    prompt_template="table_focused",    # optimized for tables
)

Available templates: default, table_focused, document_focused, multilingual, form_focused, receipt_focused, handwriting_focused, code_focused.
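
Because the template is passed as a plain string, a typo may only surface once an OCR call is made. A minimal guard, assuming the list above is complete for the installed version (`check_prompt_template` is a hypothetical helper, not part of doc2mark):

```python
# Template names copied from the list above; newer doc2mark releases may add more.
KNOWN_PROMPT_TEMPLATES = frozenset({
    "default",
    "table_focused",
    "document_focused",
    "multilingual",
    "form_focused",
    "receipt_focused",
    "handwriting_focused",
    "code_focused",
})


def check_prompt_template(name: str) -> str:
    """Return name unchanged if it is a listed template, else raise ValueError."""
    if name not in KNOWN_PROMPT_TEMPLATES:
        raise ValueError(
            f"unknown prompt_template {name!r}; "
            f"expected one of {sorted(KNOWN_PROMPT_TEMPLATES)}"
        )
    return name
```

The checked value can then be passed along, for example `prompt_template=check_prompt_template("table_focused")`.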

Table output styles

Control how complex tables (with merged cells) are rendered:

loader = UnifiedDocumentLoader(
    table_style="minimal_html",     # clean HTML with rowspan/colspan (default)
    # table_style="markdown_grid",  # markdown with merge annotations
    # table_style="styled_html",    # full HTML with inline styles
)

Token usage tracking

When using OpenAI or Vertex AI, each OCR result includes token usage in its metadata:

from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(ocr_provider="openai")
result = loader.load("scan.pdf", extract_images=True, ocr_images=True)

# Token usage per OCR call is in result metadata
usage = result.metadata.get("token_usage", {})
print(usage)
# {"input_tokens": 1234, "output_tokens": 567, "total_tokens": 1801}
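
For batch jobs it can be useful to total the usage across documents. The following sketch assumes each result's metadata is a dict shaped like the example above, with an optional `token_usage` entry; `sum_token_usage` is a hypothetical helper, not a doc2mark function.

```python
from typing import Dict, Iterable, Mapping

TOKEN_KEYS = ("input_tokens", "output_tokens", "total_tokens")


def sum_token_usage(metadatas: Iterable[Mapping]) -> Dict[str, int]:
    """Add up token_usage counters over several metadata dicts.

    Entries without token_usage (for example, documents that needed no OCR)
    contribute zero.
    """
    totals = {key: 0 for key in TOKEN_KEYS}
    for metadata in metadatas:
        usage = metadata.get("token_usage") or {}
        for key in TOKEN_KEYS:
            totals[key] += int(usage.get(key, 0))
    return totals
```

With a list of results from `loader.load(...)`, the call would look like `sum_token_usage(r.metadata for r in results)`.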

CLI

# Single file to stdout
doc2mark report.pdf

# Save to file
doc2mark report.pdf -o report.md

# Batch convert a directory
doc2mark documents/ -o converted/ -r

# With OpenAI OCR
doc2mark scan.pdf --ocr openai --ocr-images

# With Tesseract OCR
doc2mark scan.pdf --ocr tesseract --ocr-images

# Disable OCR entirely
doc2mark report.pdf --ocr none --no-ocr-images

# JSON output
doc2mark report.pdf --format json
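
To drive the CLI from a Python script, the argument list can be assembled programmatically. This sketch uses only the flags shown above; `build_doc2mark_command` is a hypothetical helper, and it assumes these flags can be combined freely, which the examples do not confirm.

```python
from typing import List, Optional


def build_doc2mark_command(
    source: str,
    output: Optional[str] = None,
    recursive: bool = False,
    ocr: Optional[str] = None,
    ocr_images: bool = False,
    output_format: Optional[str] = None,
) -> List[str]:
    """Build an argv list for the doc2mark CLI from the flags shown above."""
    cmd = ["doc2mark", source]
    if output:
        cmd += ["-o", output]
    if recursive:
        cmd.append("-r")
    if ocr:
        cmd += ["--ocr", ocr]
    if ocr_images:
        cmd.append("--ocr-images")
    if output_format:
        cmd += ["--format", output_format]
    return cmd
```

The list can be handed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with file names that contain spaces.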

License

MIT -- see LICENSE.
