CLI tool for converting PDF documents to Markdown or HTML using Mistral OCR.
emx-mistral-ocr-cli
Features
- PDF -> Markdown (default) or HTML output
- Automatic output format detection from the --out extension (.html/.htm -> HTML)
- Optional page selection via --pages ("1-12", "2,5,10-12", ...)
- Optional local PDF slicing before upload (--slice-pdf) to help with very large PDFs (e.g. >1000 pages)
- Optional extracted image export
- HTML mode with embedded HTML tables and built-in CSS styling
- Local chapter index analysis before OCR (--analyze-index)
- Retry handling for temporary Mistral API errors
- Safe output behavior (no overwrite without --force)
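The retry policy the tool uses for temporary API errors is not documented here. As a rough illustration of the general idea, the following sketch retries a call with exponential backoff. TransientAPIError and call_with_retry are hypothetical names chosen for this example, and the attempt count and delays are assumptions, not the tool's actual settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical stand-in for a temporary API failure (e.g. a rate limit)."""


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TransientAPIError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay * 1, 2, 4, ... seconds between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the backoff can be exercised in tests without actually waiting.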
Requirements
- Python 3.10+
- A valid Mistral API key in the MISTRAL_API_KEY environment variable
Install dependencies:
pip install -r requirements.txt
Setup
Set your API key:
Linux/macOS (bash/zsh):
export MISTRAL_API_KEY="your_key_here"
Windows PowerShell / PowerShell:
$env:MISTRAL_API_KEY="your_key_here"
Windows cmd.exe:
set MISTRAL_API_KEY=your_key_here
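To confirm that the variable is visible to Python in the current shell, a quick check along these lines can help. get_api_key is a hypothetical helper written for this example and is not part of the CLI.

```python
import os
import sys


def get_api_key(env=os.environ) -> str:
    """Return the Mistral API key, or exit with a clear message if it is unset."""
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        sys.exit("MISTRAL_API_KEY is not set; see the Setup section.")
    return key
```

Calling get_api_key() in a Python session started from the same shell exits with the message above if the key was not exported.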
Usage
python mistral_ocr_cli.py <input.pdf> [options]
Show help:
python mistral_ocr_cli.py -h
Common Examples
Default Markdown output:
python mistral_ocr_cli.py doc.pdf
Write Markdown to a specific file:
python mistral_ocr_cli.py doc.pdf --out result.md
HTML output (auto-selected by extension):
python mistral_ocr_cli.py doc.pdf --out result.html
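The extension-based selection can be pictured with a small sketch. detect_output_format is a hypothetical function for illustration only. That an explicit --output-format takes precedence over the extension, and that the extension match is case-insensitive, are assumptions rather than documented behavior.

```python
from pathlib import Path
from typing import Optional

HTML_SUFFIXES = {".html", ".htm"}


def detect_output_format(out_path: Optional[str], explicit: Optional[str] = None) -> str:
    """Pick 'html' or 'markdown' from an explicit choice or the output extension."""
    if explicit is not None:
        return explicit  # assumption: an explicit --output-format wins
    if out_path and Path(out_path).suffix.lower() in HTML_SUFFIXES:
        return "html"
    return "markdown"  # documented default
```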
Explicit HTML output:
python mistral_ocr_cli.py doc.pdf --output-format html --out result.html
Process only selected pages:
python mistral_ocr_cli.py doc.pdf --pages "1-20"
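For readers wondering how a page spec expands, here is a minimal sketch of a parser for the 1-based comma and range syntax. parse_page_spec is a hypothetical helper; the tool's own validation rules (for example, how it treats duplicates or whitespace) may differ.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a 1-based spec such as "2,5,10-12" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec: {spec!r}")
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)  # raises ValueError on non-numeric input
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

In this sketch, overlapping entries are merged, so each page appears once in the result.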
Slice selected pages locally before upload:
python mistral_ocr_cli.py doc.pdf --pages "1150-1200" --slice-pdf --out result.html --force
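When a whole document is too large for one request, the same flags can be applied chunk by chunk. The sketch below generates one command per page range. chunk_page_specs is a hypothetical helper, and the chunk size of 500, the page count of 1200, and the part<N>.md output names are illustrative assumptions.

```python
def chunk_page_specs(total_pages: int, chunk_size: int) -> list[str]:
    """Split pages 1..total_pages into consecutive "start-end" specs."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    specs = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        specs.append(f"{start}-{end}")
    return specs


# Print one command per chunk for a 1200-page document.
for i, spec in enumerate(chunk_page_specs(1200, 500), start=1):
    print(f'python mistral_ocr_cli.py doc.pdf --pages "{spec}" --slice-pdf --out part{i}.md')
```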
Disable images entirely:
python mistral_ocr_cli.py doc.pdf --no-images
Export images to custom directory:
python mistral_ocr_cli.py doc.pdf --images-dir extracted_images
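When --images-dir is omitted, the Options section gives the default as <out_stem>_images. A sketch of how such a name can be derived from the output path follows. default_images_dir is a hypothetical helper, and placing the directory next to the output file is an assumption.

```python
from pathlib import Path


def default_images_dir(out_path: str) -> Path:
    """Derive "<out_stem>_images" alongside the output file (assumed location)."""
    out = Path(out_path)
    return out.with_name(f"{out.stem}_images")
```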
Analyze chapter index locally (no OCR call):
python mistral_ocr_cli.py doc.pdf --analyze-index
Analyze chapter index and write it to file:
python mistral_ocr_cli.py doc.pdf --analyze-index --chapter-index-out index.tsv --force
Options
- --out <path>: Output file path
- --output-format {markdown,html}: Output format (default: markdown)
- --force: Overwrite existing outputs
- --pages "<spec>": 1-based page selection, e.g. "1-12" or "2,5,10-12"
- --slice-pdf: Build a temporary sliced PDF locally before upload (requires --pages). Useful when Mistral rejects very large PDFs (e.g. >1000 pages) and you want to process them in chunks.
- --images-dir <dir>: Directory for extracted images (default: <out_stem>_images)
- --no-images: Disable image extraction/export
- --image-limit <n>: Maximum number of images to extract
- --image-min-size <px>: Minimum image width/height
- --no-header-footer: Disable header/footer extraction
- --chapter-index-out <file>: Write local chapter index output
- --analyze-index: Run local chapter index analysis and exit
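The no-overwrite behavior tied to --force can be sketched as a simple pre-write check. check_writable is a hypothetical helper; the exception type and message used by the actual tool are not documented here.

```python
from pathlib import Path


def check_writable(path: str, force: bool = False) -> Path:
    """Return the path if it may be written; refuse to clobber an existing file."""
    target = Path(path)
    if target.exists() and not force:
        raise FileExistsError(f"{target} already exists; pass --force to overwrite")
    return target
```

Running the check before any OCR request is made would avoid spending API calls on a run that cannot save its result, though whether the tool orders things this way is an assumption.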
Notes
- In HTML mode, OCR tables are requested as HTML and embedded into the final HTML document. HTML is generally more expressive than Markdown for complex layouts (e.g. tables with colspan/rowspan, which standard Markdown tables do not support).
- For large PDFs, --slice-pdf can still take time (PDF parsing/writing), but it reduces upload size and processed content and can avoid API errors for extremely large documents (e.g. >1000 pages).
- --analyze-index is useful to discover chapter boundaries and page numbers so you can select specific chapters via --pages.