emx-mistral-ocr-cli
CLI tool for converting PDF documents to Markdown or HTML using Mistral OCR.
Features
- PDF -> Markdown (default) or HTML output
- Automatic output format detection from the --out extension (.html/.htm -> HTML)
- Optional page selection via --pages (e.g. 1-12 or 2,5,10-12)
- Optional local PDF slicing before upload (--slice-pdf) to help with very large PDFs (e.g. >1000 pages)
- Optional extracted image export
- HTML mode with embedded HTML tables and built-in CSS styling
- Local chapter index analysis before OCR (--analyze-index)
- Retry handling for temporary Mistral API errors
- Safe output behavior (no overwrite without --force)
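The retry handling listed above is not documented in detail. Purely as an illustration of the general idea, retrying a temporary failure with exponential backoff could look like the sketch below. `TransientError`, `call_with_retries`, and the delay values are hypothetical names and numbers chosen for this example; they are not taken from the tool's source.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a temporary API failure (e.g. a rate limit)."""


def call_with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on TransientError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable only so that the waiting can be skipped or recorded in tests.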
Requirements
- Python 3.10+
- A valid Mistral API key in the environment variable MISTRAL_API_KEY
Installation
Install via pip:
pip install emx-mistral-ocr-cli
Install from source (repo checkout):
pip install -r requirements.txt
Optional (editable install with console script):
pip install -e .
Development / Run from Source
If you want to run directly from a git checkout (without installing the package from PyPI), install dependencies and execute the script:
pip install -r requirements.txt
python mistral_ocr_cli.py <input.pdf> [options]
Setup
Set your API key:
Linux/macOS (bash/zsh):
export MISTRAL_API_KEY="your_key_here"
Windows PowerShell / PowerShell:
$env:MISTRAL_API_KEY="your_key_here"
Windows cmd.exe:
set MISTRAL_API_KEY=your_key_here
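When the tool is driven from a script, it can help to confirm up front that the key is visible to the process. A minimal Python sketch, assuming you only want a readable error before starting a long run; the function name `require_api_key` is made up for this example and is not part of the tool:

```python
import os
import sys


def require_api_key(env=os.environ):
    """Return the value of MISTRAL_API_KEY or exit with a readable message."""
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        sys.exit("MISTRAL_API_KEY is not set; see the Setup section.")
    return key
```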
Usage
emx-mistral-ocr-cli <input.pdf> [options]
Show help:
emx-mistral-ocr-cli -h
Common Examples
Default Markdown output:
emx-mistral-ocr-cli doc.pdf
Write Markdown to a specific file:
emx-mistral-ocr-cli doc.pdf --out result.md
HTML output (auto-selected by extension):
emx-mistral-ocr-cli doc.pdf --out result.html
Explicit HTML output:
emx-mistral-ocr-cli doc.pdf --output-format html --out result.html
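The extension rule behind the two HTML examples above can be sketched in a few lines of Python. This is an illustration rather than the tool's actual code: the function name is hypothetical, and both the case-insensitive match and the precedence of an explicit format over the extension are assumptions.

```python
from pathlib import Path


def detect_output_format(out_path, explicit=None):
    """Pick 'html' or 'markdown' following the rule described in this README."""
    if explicit:  # assumed: an explicit --output-format wins over the extension
        return explicit
    suffix = Path(out_path).suffix.lower()  # assumed: matching ignores case
    return "html" if suffix in {".html", ".htm"} else "markdown"
```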
Process only selected pages:
emx-mistral-ocr-cli doc.pdf --pages "1-20"
Slice selected pages locally before upload:
emx-mistral-ocr-cli doc.pdf --pages "1150-1200" --slice-pdf --out result.html --force
Disable images entirely:
emx-mistral-ocr-cli doc.pdf --no-images
Export images to custom directory:
emx-mistral-ocr-cli doc.pdf --images-dir extracted_images
Analyze chapter index locally (no OCR call):
emx-mistral-ocr-cli doc.pdf --analyze-index
Analyze chapter index and write it to file:
emx-mistral-ocr-cli doc.pdf --analyze-index --chapter-index-out index.tsv --force
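For scripted use, the commands above can be assembled and launched from Python. The sketch below uses only options documented in this README; the helper names `build_command` and `run_ocr` are hypothetical, and it assumes the tool signals failure with a non-zero exit status.

```python
import subprocess


def build_command(pdf, out=None, pages=None, slice_pdf=False, force=False):
    """Assemble an emx-mistral-ocr-cli argument list from documented options."""
    cmd = ["emx-mistral-ocr-cli", str(pdf)]
    if out:
        cmd += ["--out", str(out)]
    if pages:
        cmd += ["--pages", pages]
    if slice_pdf:
        cmd.append("--slice-pdf")
    if force:
        cmd.append("--force")
    return cmd


def run_ocr(pdf, **options):
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(pdf, **options), check=True)
```

Passing the arguments as a list avoids shell quoting issues with page specs and file names that contain spaces.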
Options
- --out <path>: Output file path
- --output-format {markdown,html}: Output format (default: markdown)
- --force: Overwrite existing outputs
- --pages "<spec>": 1-based page selection, e.g. 1-12 or 2,5,10-12
- --slice-pdf: Build a temporary sliced PDF locally before upload (requires --pages). Useful when Mistral rejects very large PDFs (e.g. >1000 pages) and you want to process them in chunks.
- --images-dir <dir>: Directory for extracted images (default: <out_stem>_images)
- --no-images: Disable image extraction/export
- --image-limit <n>: Maximum number of images to extract
- --image-min-size <px>: Minimum image width/height
- --no-header-footer: Disable header/footer extraction
- --chapter-index-out <file>: Write local chapter index output
- --analyze-index: Run local chapter index analysis and exit
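To make the --pages syntax concrete, here is a small Python sketch that expands a spec into individual 1-based page numbers. It illustrates the documented format only; the function name is hypothetical, and how the tool itself treats whitespace, duplicates, or malformed ranges is not documented, so that behavior is an assumption here.

```python
def parse_pages(spec):
    """Expand a 1-based spec such as '2,5,10-12' into a sorted list of pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # assumed: empty items such as a trailing comma are ignored
        start, sep, end = part.partition("-")
        first = int(start)
        last = int(end) if sep else first  # int() raises ValueError on bad input
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)
```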
Notes
- In HTML mode, OCR tables are requested as HTML and embedded into the final HTML document. HTML is generally more expressive than Markdown for complex layouts (e.g. tables with colspan/rowspan, which standard Markdown tables do not support).
- For large PDFs, --slice-pdf can still take time (PDF parsing/writing), but it reduces upload size and processed content and can avoid API errors for extremely large documents (e.g. >1000 pages).
- --analyze-index is useful to discover chapter boundaries and page numbers so you can select specific chapters via --pages.
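Processing a very large PDF in chunks amounts to generating consecutive --pages ranges and running the tool once per range. A minimal Python sketch of that bookkeeping follows; the function name, the chunk size of 500, and the part<N>.md output names are arbitrary choices for this example, not recommendations from the tool.

```python
def chunk_specs(total_pages, chunk_size):
    """Return --pages specs such as '1-500', '501-1000' covering all pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    specs = []
    for first in range(1, total_pages + 1, chunk_size):
        last = min(first + chunk_size - 1, total_pages)
        specs.append(f"{first}-{last}")
    return specs


# Print one command per chunk for a hypothetical 1200-page document.
for i, spec in enumerate(chunk_specs(1200, 500), start=1):
    print(f'emx-mistral-ocr-cli doc.pdf --pages "{spec}" --slice-pdf --out part{i}.md')
```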