Maintainers

timo.stripf

emx-mistral-ocr-cli

CLI tool for converting PDF documents to Markdown or HTML using Mistral OCR.

Features

- PDF -> Markdown (default) or HTML output
- Automatic output format detection from the --out extension (.html/.htm -> HTML)
- Optional page selection via --pages (e.g. 1-12 or 2,5,10-12)
- Optional local PDF slicing before upload (--slice-pdf) to help with very large PDFs (e.g. >1000 pages)
- Optional export of extracted images
- HTML mode with embedded HTML tables and built-in CSS styling
- Local chapter index analysis before OCR (--analyze-index)
- Retry handling for temporary Mistral API errors
- Safe output behavior (no overwrite without --force)
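
The retry handling listed above can be pictured as exponential backoff around the API call. The following is only a sketch of that general technique, not the tool's actual code: TemporaryAPIError, call_with_retries, the attempt count and the delays are all hypothetical names and values chosen for illustration.

```python
import random
import time


class TemporaryAPIError(Exception):
    """Hypothetical stand-in for a transient failure (timeout, rate limit, ...)."""


def call_with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on TemporaryAPIError wait with exponential backoff and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TemporaryAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles per attempt (1s, 2s, 4s, ...) plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
```

Passing sleep as a parameter keeps the helper easy to test without real waiting.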

Requirements

Python 3.10+
A valid Mistral API key in environment variable MISTRAL_API_KEY

Install dependencies:

pip install -r requirements.txt

Setup

Set your API key:

Linux/macOS (bash/zsh):

export MISTRAL_API_KEY="your_key_here"

Windows PowerShell / PowerShell:

$env:MISTRAL_API_KEY="your_key_here"

Windows cmd.exe:

set MISTRAL_API_KEY=your_key_here
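
To confirm from Python that the key is visible to child processes, a small check like the one below can help. It is an illustrative sketch only; get_api_key is a hypothetical helper and the error message is made up, not output of the tool.

```python
import os
import sys


def get_api_key(env=os.environ):
    """Return the key from MISTRAL_API_KEY, or exit with a readable message."""
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        # Hypothetical message; the real CLI may word this differently.
        sys.exit("MISTRAL_API_KEY is not set; see the Setup section.")
    return key
```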

Usage

python mistral_ocr_cli.py <input.pdf> [options]

Show help:

python mistral_ocr_cli.py -h

Common Examples

Default Markdown output:

python mistral_ocr_cli.py doc.pdf

Write Markdown to a specific file:

python mistral_ocr_cli.py doc.pdf --out result.md

HTML output (auto-selected by extension):

python mistral_ocr_cli.py doc.pdf --out result.html

Explicit HTML output:

python mistral_ocr_cli.py doc.pdf --output-format html --out result.html
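
The two HTML examples above differ only in how the format is chosen. A minimal sketch of such a selection rule follows; detect_output_format is a hypothetical function, and the assumption that an explicit --output-format wins over the extension is mine, not something the README states.

```python
from pathlib import Path


def detect_output_format(out_path=None, explicit=None):
    """Pick "markdown" or "html"; an explicit choice is ASSUMED to take priority."""
    if explicit is not None:
        return explicit
    # .html and .htm select HTML; anything else falls back to Markdown.
    if out_path is not None and Path(out_path).suffix.lower() in {".html", ".htm"}:
        return "html"
    return "markdown"
```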

Process only selected pages:

python mistral_ocr_cli.py doc.pdf --pages "1-20"
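
A page spec combines single pages and inclusive ranges, all 1-based. The sketch below shows one way such a spec could be expanded; parse_pages is a hypothetical helper, and its handling of whitespace, duplicates and invalid input is an assumption rather than the tool's documented behavior.

```python
def parse_pages(spec):
    """Expand a spec like "2,5,10-12" into sorted, unique 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))  # ranges are inclusive
    return sorted(pages)
```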

Slice selected pages locally before upload:

python mistral_ocr_cli.py doc.pdf --pages "1150-1200" --slice-pdf --out result.html --force
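
For a document that is too large to process in one call, the same command can be repeated over consecutive page ranges. The sketch below only generates the --pages specs and prints the matching commands; chunk_page_specs, the chunk size of 500, the 1200-page total and the part file names are all assumptions for illustration.

```python
def chunk_page_specs(total_pages, chunk_size=500):
    """Yield --pages specs such as "1-500", "501-1000" that cover the document."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A one-page chunk is written as a single page number.
        yield str(start) if start == end else f"{start}-{end}"


# Print one command per chunk for a hypothetical 1200-page doc.pdf.
for index, spec in enumerate(chunk_page_specs(1200), start=1):
    print(f'python mistral_ocr_cli.py doc.pdf --pages "{spec}" --slice-pdf --out part{index}.md')
```

Combining the resulting part files into one document is left to the user.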

Disable images entirely:

python mistral_ocr_cli.py doc.pdf --no-images

Export images to custom directory:

python mistral_ocr_cli.py doc.pdf --images-dir extracted_images

Analyze chapter index locally (no OCR call):

python mistral_ocr_cli.py doc.pdf --analyze-index

Analyze chapter index and write it to file:

python mistral_ocr_cli.py doc.pdf --analyze-index --chapter-index-out index.tsv --force
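
Several examples above pass --force because existing outputs are not overwritten without it. A minimal sketch of that kind of guard is shown here; write_output is a hypothetical helper, and using exclusive-create mode is one possible technique, not necessarily what the tool does.

```python
from pathlib import Path


def write_output(path, text, force=False):
    """Write text to path; refuse to replace an existing file unless force is set."""
    target = Path(path)
    mode = "w" if force else "x"  # "x" raises FileExistsError if the file exists
    with target.open(mode, encoding="utf-8") as handle:
        handle.write(text)
    return target
```

Opening with "x" avoids a separate existence check, so there is no window between checking and writing.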

Options

- --out <path>: Output file path
- --output-format {markdown,html}: Output format (default: markdown)
- --force: Overwrite existing outputs
- --pages "<spec>": 1-based page selection, e.g. 1-12 or 2,5,10-12
- --slice-pdf: Build a temporary sliced PDF locally before upload (requires --pages). Useful when Mistral rejects very large PDFs (e.g. >1000 pages) and you want to process them in chunks.
- --images-dir <dir>: Directory for extracted images (default: <out_stem>_images)
- --no-images: Disable image extraction/export
- --image-limit <n>: Maximum number of images to extract
- --image-min-size <px>: Minimum image width/height in pixels
- --no-header-footer: Disable header/footer extraction
- --chapter-index-out <file>: Write the local chapter index to this file
- --analyze-index: Run the local chapter index analysis and exit
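
The default for --images-dir is derived from the output file name. The sketch below illustrates the <out_stem>_images pattern; default_images_dir is a hypothetical helper, and placing the directory next to the output file is an assumption the README does not confirm.

```python
from pathlib import Path


def default_images_dir(out_path):
    """Return "<out_stem>_images", ASSUMED here to sit next to the output file."""
    out = Path(out_path)
    return out.with_name(f"{out.stem}_images")


print(default_images_dir("result.html"))
```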

Notes

- In HTML mode, OCR tables are requested as HTML and embedded into the final HTML document. HTML is generally more expressive than Markdown for complex layouts (e.g. tables with colspan/rowspan, which standard Markdown tables do not support).
- For large PDFs, --slice-pdf itself takes time because the PDF has to be parsed and rewritten locally, but it reduces the upload size and the amount of content processed, and it can avoid API errors for extremely large documents (e.g. >1000 pages).
- --analyze-index helps you discover chapter boundaries and page numbers, so that you can then select specific chapters via --pages.
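
Once chapter start pages are known, turning them into --pages specs is simple arithmetic. The README does not describe the format of the chapter index, so the sketch below assumes a plain list of (title, first_page) pairs; chapter_page_specs and the sample chapters are hypothetical.

```python
def chapter_page_specs(chapters, total_pages):
    """Map each chapter title to a --pages spec, given (title, first_page) pairs."""
    ordered = sorted(chapters, key=lambda item: item[1])
    specs = {}
    for position, (title, first_page) in enumerate(ordered):
        # A chapter ends one page before the next one starts (or at the last page).
        if position + 1 < len(ordered):
            last_page = ordered[position + 1][1] - 1
        else:
            last_page = total_pages
        last_page = max(last_page, first_page)
        if first_page == last_page:
            specs[title] = str(first_page)
        else:
            specs[title] = f"{first_page}-{last_page}"
    return specs


# Hypothetical chapter index: titles and 1-based starting pages.
chapters = [("Introduction", 1), ("Methods", 15), ("Results", 42)]
print(chapter_page_specs(chapters, total_pages=60))
```

Each resulting value can be passed to --pages as-is, for example --pages "15-41" for the second chapter in this made-up index.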

Release history

0.1.3: Feb 20, 2026
0.1.2: Feb 20, 2026
0.1.1: Feb 20, 2026 (this version)

Download files

Source Distribution

emx_mistral_ocr_cli-0.1.1.tar.gz (11.8 kB)

Uploaded Feb 20, 2026 Source

Built Distribution

emx_mistral_ocr_cli-0.1.1-py3-none-any.whl (11.5 kB)

Uploaded Feb 20, 2026 Python 3

File details

Details for the file emx_mistral_ocr_cli-0.1.1.tar.gz.

File metadata

Download URL: emx_mistral_ocr_cli-0.1.1.tar.gz
Upload date: Feb 20, 2026
Size: 11.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for emx_mistral_ocr_cli-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`d350ca94cd4441e65b357a398081a2c111dd96732c31ef327c0780f3513f1e59`
MD5	`99bed03319aaa9995d4f8063c473300b`
BLAKE2b-256	`ceab6bfcf2b03bd2637f78d6dc5b83eb77b68674727c26a2d385126b45088e22`

Provenance

The following attestation bundles were made for emx_mistral_ocr_cli-0.1.1.tar.gz:

Publisher: release.yml on emmtrix/emx-mistral-ocr-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: emx_mistral_ocr_cli-0.1.1.tar.gz
- Subject digest: d350ca94cd4441e65b357a398081a2c111dd96732c31ef327c0780f3513f1e59
- Sigstore transparency entry: 973157916
- Sigstore integration time: Feb 20, 2026
Source repository:
- Permalink: emmtrix/emx-mistral-ocr-cli@1416ec14b1c9cd2a960160a54f9016836a9527ac
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/emmtrix
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1416ec14b1c9cd2a960160a54f9016836a9527ac
- Trigger Event: release

File details

Details for the file emx_mistral_ocr_cli-0.1.1-py3-none-any.whl.

File metadata

Download URL: emx_mistral_ocr_cli-0.1.1-py3-none-any.whl
Upload date: Feb 20, 2026
Size: 11.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for emx_mistral_ocr_cli-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8c7fe5e4a8c7dc5f1599b66d857986fb82125f39973eba1d55e489521c2233cd`
MD5	`880635e36285a61f8a6ca69fffbd0ff5`
BLAKE2b-256	`f9d3c45141519860567a112f97c771221abb948581e614c52e8acb9f3be1418f`

Provenance

The following attestation bundles were made for emx_mistral_ocr_cli-0.1.1-py3-none-any.whl:

Publisher: release.yml on emmtrix/emx-mistral-ocr-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: emx_mistral_ocr_cli-0.1.1-py3-none-any.whl
- Subject digest: 8c7fe5e4a8c7dc5f1599b66d857986fb82125f39973eba1d55e489521c2233cd
- Sigstore transparency entry: 973157919
- Sigstore integration time: Feb 20, 2026
Source repository:
- Permalink: emmtrix/emx-mistral-ocr-cli@1416ec14b1c9cd2a960160a54f9016836a9527ac
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/emmtrix
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1416ec14b1c9cd2a960160a54f9016836a9527ac
- Trigger Event: release
