emx-mistral-ocr-cli

CLI tool for converting PDF documents to Markdown or HTML using Mistral OCR.

Features

  • PDF -> Markdown (default) or HTML output
  • Automatic output format detection from --out extension (.html/.htm -> HTML)
  • Optional page selection via --pages (1-12, 2,5,10-12, ...)
  • Optional local PDF slicing before upload (--slice-pdf) to help with very large PDFs (e.g. >1000 pages)
  • Optional extracted image export
  • HTML mode with embedded HTML tables and built-in CSS styling
  • Local chapter index analysis before OCR (--analyze-index)
  • Retry handling for temporary Mistral API errors
  • Safe output behavior (no overwrite without --force)
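The retry handling listed above is not documented in detail here. As a rough illustration of the general technique only, the following sketch retries a call with exponential backoff. The names `TransientError` and `call_with_retry`, the attempt count, and the delays are all assumptions made for this example and are not taken from the tool's source.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a temporary failure (e.g. a rate limit or server error)."""


def call_with_retry(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.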

Requirements

  • Python 3.10+
  • A valid Mistral API key in environment variable MISTRAL_API_KEY

Installation

Install via pip:

pip install emx-mistral-ocr-cli

Install from source (repo checkout):

pip install -r requirements.txt

Optional (editable install with console script):

pip install -e .

Development / Run from Source

If you want to run directly from a git checkout (without installing the package from PyPI), install dependencies and execute the script:

pip install -r requirements.txt
python mistral_ocr_cli.py <input.pdf> [options]

Setup

Set your API key:

Linux/macOS (bash/zsh):

export MISTRAL_API_KEY="your_key_here"

PowerShell:

$env:MISTRAL_API_KEY="your_key_here"

Windows cmd.exe:

set MISTRAL_API_KEY=your_key_here
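If you wrap the tool in your own Python script, it can help to check the variable before starting a long run. This is a minimal sketch; the helper name `get_api_key` and the error message are hypothetical, and only the variable name `MISTRAL_API_KEY` comes from this README.

```python
import os


def get_api_key(env=os.environ):
    """Return the API key from the environment, or fail early with a clear message."""
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("MISTRAL_API_KEY is not set")
    return key
```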

Usage

emx-mistral-ocr-cli <input.pdf> [options]

Show help:

emx-mistral-ocr-cli -h

Common Examples

Default Markdown output:

emx-mistral-ocr-cli doc.pdf

Write Markdown to a specific file:

emx-mistral-ocr-cli doc.pdf --out result.md

HTML output (auto-selected by extension):

emx-mistral-ocr-cli doc.pdf --out result.html

Explicit HTML output:

emx-mistral-ocr-cli doc.pdf --output-format html --out result.html
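The two HTML examples above suggest how the format is chosen. The sketch below shows one plausible way such a rule could work. It is an illustration, not the tool's actual code: the function name `detect_output_format` is hypothetical, and both the case-insensitive extension match and the precedence of an explicit format over the extension are assumptions.

```python
from pathlib import Path

# Extensions that select HTML output, as listed under Features.
HTML_SUFFIXES = {".html", ".htm"}


def detect_output_format(out_path=None, explicit=None):
    """Pick "html" or "markdown" from an explicit choice or the output extension."""
    if explicit is not None:
        return explicit  # assumption: an explicit format wins over the extension
    if out_path is not None and Path(out_path).suffix.lower() in HTML_SUFFIXES:
        return "html"
    return "markdown"  # documented default
```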

Process only selected pages:

emx-mistral-ocr-cli doc.pdf --pages "1-20"

Slice selected pages locally before upload:

emx-mistral-ocr-cli doc.pdf --pages "1150-1200" --slice-pdf --out result.html --force

Disable images entirely:

emx-mistral-ocr-cli doc.pdf --no-images

Export images to custom directory:

emx-mistral-ocr-cli doc.pdf --images-dir extracted_images
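When --images-dir is not given, the Options section states that the default is `<out_stem>_images`. A small sketch of how that name can be derived from an output path follows. The function name `default_images_dir` is hypothetical, and placing the directory next to the output file is an assumption.

```python
from pathlib import Path


def default_images_dir(out_path):
    """Derive "<out_stem>_images" from an output path such as result.html."""
    out = Path(out_path)
    # Assumption: the image directory sits in the same folder as the output file.
    return out.with_name(f"{out.stem}_images")
```

For example, `default_images_dir("result.html")` yields the path `result_images`.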

Analyze chapter index locally (no OCR call):

emx-mistral-ocr-cli doc.pdf --analyze-index

Analyze chapter index and write it to file:

emx-mistral-ocr-cli doc.pdf --analyze-index --chapter-index-out index.tsv --force
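Several examples above pass --force because existing outputs are not overwritten by default. The following sketch illustrates that kind of guard in general terms. The function name `check_writable` and the error type are assumptions for this example and do not describe the tool's internals.

```python
from pathlib import Path


def check_writable(out_path, force=False):
    """Refuse to reuse an existing output path unless force is set."""
    path = Path(out_path)
    if path.exists() and not force:
        raise FileExistsError(f"{path} exists; pass --force to overwrite")
    return path
```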

Options

  • --out <path>: Output file path
  • --output-format {markdown,html}: Output format (default: markdown)
  • --force: Overwrite existing outputs
  • --pages "<spec>": 1-based page selection, e.g. 1-12, 2,5,10-12
  • --slice-pdf: Build a temporary sliced PDF locally before upload (requires --pages). Useful when Mistral rejects very large PDFs (e.g. >1000 pages) and you want to process them in chunks.
  • --images-dir <dir>: Directory for extracted images (default: <out_stem>_images)
  • --no-images: Disable image extraction/export
  • --image-limit <n>: Maximum number of images to extract
  • --image-min-size <px>: Minimum image width/height
  • --no-header-footer: Disable header/footer extraction
  • --chapter-index-out <file>: Write local chapter index output
  • --analyze-index: Run local chapter index analysis and exit (no OCR call)
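To make the --pages syntax concrete, here is a minimal sketch that expands a spec such as 2,5,10-12 into 1-based page numbers. The function name `parse_page_spec` is hypothetical, and the handling of duplicates and invalid ranges is an assumption; the tool's own parser may behave differently.

```python
def parse_page_spec(spec):
    """Expand a spec such as "2,5,10-12" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

With this sketch, `parse_page_spec("2,5,10-12")` returns `[2, 5, 10, 11, 12]`.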

Notes

  • In HTML mode, OCR tables are requested as HTML and embedded into the final HTML document. HTML is generally more expressive than Markdown for complex layouts (e.g. tables with colspan/rowspan, which standard Markdown tables do not support).
  • For large PDFs, --slice-pdf can still take time (PDF parsing/writing), but it reduces upload size and processed content and can avoid API errors for extremely large documents (e.g. >1000 pages).
  • --analyze-index is useful to discover chapter boundaries and page numbers so you can select specific chapters via --pages.
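Processing a very large PDF in chunks, as described above, can be scripted. The sketch below builds one command line per page range using only the flags documented in this README. The function name `chunk_commands`, the chunk size of 500 pages, and the `part_<start>-<end>.md` output naming are assumptions for illustration; the total page count must be supplied by the caller.

```python
def chunk_commands(pdf, total_pages, chunk_size=500):
    """Build one emx-mistral-ocr-cli argument list per page chunk."""
    commands = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        commands.append([
            "emx-mistral-ocr-cli", pdf,
            "--pages", f"{start}-{end}",
            "--slice-pdf",  # slice locally so only this chunk is uploaded
            "--out", f"part_{start}-{end}.md",
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to process the chunks one after another.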
