Batch OCR for PDFs with heading restoration and visual content integration

Mistocr

PDF OCR is a critical bottleneck in AI pipelines. It is often mentioned in passing, as if it were a trivial step, but in practice it is far from trivial. Poorly converted PDFs mean garbage in, garbage out for downstream AI systems such as RAG.

When Mistral AI released their state-of-the-art OCR model in March 2025, it opened new possibilities for large-scale document processing. While alternatives like datalab.to and docling.ai offer viable solutions, Mistral OCR delivers exceptional accuracy at a compelling price point.

mistocr emerged from months of real-world usage across projects requiring large-scale processing of niche-domain PDFs. It addresses two fundamental challenges that raw OCR output leaves unsolved, and it keeps the OCR step itself inexpensive:

Heading hierarchy restoration: Even state-of-the-art OCR sometimes produces inconsistent heading levels in large documents, because getting the hierarchy right is a complex task. mistocr uses LLM-based analysis to restore proper document structure, which is essential for downstream AI tasks.
Visual content integration: Charts, figures and diagrams are automatically classified and described, then integrated into the markdown. This makes visual information searchable and accessible for downstream applications.
Cost-efficient batch processing: The OCR step exclusively uses Mistral’s batch API, cutting costs by 50% ($0.50 vs $1.00 per 1000 pages) while eliminating the boilerplate code typically required.
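
As a rough illustration of the pricing quoted above, the sketch below estimates OCR cost for a corpus. The two per-1000-page rates are the ones stated in this README; the helper name `ocr_cost_usd` and the example page count are made up for illustration, and actual pricing may change.

```python
# Rates quoted in this README (USD per 1000 pages); check current pricing before relying on them.
BATCH_RATE = 0.50
STANDARD_RATE = 1.00


def ocr_cost_usd(pages: int, rate_per_1000: float) -> float:
    """Estimated OCR cost in USD for `pages` pages (hypothetical helper)."""
    return pages / 1000 * rate_per_1000


pages = 20_000  # example corpus size, chosen arbitrarily
batch = ocr_cost_usd(pages, BATCH_RATE)
standard = ocr_cost_usd(pages, STANDARD_RATE)
print(f"batch: ${batch:.2f}, standard: ${standard:.2f}, saved: ${standard - batch:.2f}")
# prints: batch: $10.00, standard: $20.00, saved: $10.00
```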

In short: Complete PDF OCR with heading hierarchy fixes and image descriptions for RAG and LLM pipelines.

Note: Want to see mistocr in action? This tutorial demonstrates real-world PDF processing and shows how clean markdown enables structure-aware navigation through long documents, letting you find exactly what you need, fast.

Get Started

Install the latest release from PyPI:

$ pip install mistocr

Set your API keys:

import os
os.environ['MISTRAL_API_KEY'] = 'your-key-here'
os.environ['ANTHROPIC_API_KEY'] = 'your-key-here'  # for refine features (see Advanced Usage for other LLMs)
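
Because a batch job can run for a while before the refine steps start, it may help to check for missing keys up front. The following is a minimal sketch, assuming only the two variable names shown above; `missing_keys` is a hypothetical helper, not part of mistocr.

```python
import os

# Keys named in this README: MISTRAL_API_KEY for OCR, ANTHROPIC_API_KEY for the default refine steps.
REQUIRED_KEYS = ("MISTRAL_API_KEY", "ANTHROPIC_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
```

If you use a different LLM provider for the refine steps, adjust `REQUIRED_KEYS` to match that provider's variable name.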

Complete Pipeline

Single File Processing

Process a single PDF with OCR (using Mistral’s batch API for cost efficiency), heading fixes, and image descriptions:

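from mistocr.pipeline import pdf_to_md
# Top-level await works in Jupyter/IPython; in a plain script, use asyncio.run(pdf_to_md(...)) instead.
await pdf_to_md('files/test/resnet.pdf', 'files/test/md_test')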

Step 1/3: Running OCR on files/test/resnet.pdf...
Mistral batch job status: QUEUED
Mistral batch job status: RUNNING
Mistral batch job status: RUNNING
Step 2/3: Fixing heading hierarchy...
Step 3/3: Adding image descriptions...
Describing 7 images...
Saved descriptions to ocr_temp/resnet/img_descriptions.json
Adding descriptions to 12 pages...
Done! Enriched pages saved to files/test/md_test
Done!

This will (as indicated by the output):

OCR the PDF using Mistral’s batch API
Fix heading hierarchy inconsistencies
Describe images (charts, diagrams) and add those descriptions into the markdown
Save everything to files/test/md_test

The output structure will be:

files/test/md_test/
├── img/
│   ├── img-0.jpeg
│   ├── img-1.jpeg
│   └── ...
├── page_1.md
├── page_2.md
└── ...

Each page’s markdown will include inline image descriptions:

```markdown
![Figure 1](img/img-0.jpeg)
AI-generated image description:
___
A residual learning block...
___
```
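
Downstream code sometimes needs the descriptions on their own, for example to index them separately. The sketch below pulls them out of a page with a regular expression. It assumes the layout is exactly the one shown above (image link, label line, text between two `___` rules); `DESC_RE` and `image_descriptions` are hypothetical names, not part of mistocr.

```python
import re

# Matches the pattern shown above: an image link, a label line, then text between two `___` rules.
DESC_RE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]\((?P<path>[^)]+)\)\s*\n"
    r"AI-generated image description:\s*\n"
    r"___\s*\n(?P<desc>.*?)\n___",
    re.DOTALL,
)


def image_descriptions(markdown: str) -> dict:
    """Map each image path to its description text (hypothetical helper)."""
    return {m["path"]: m["desc"].strip() for m in DESC_RE.finditer(markdown)}


page = """![Figure 1](img/img-0.jpeg)
AI-generated image description:
___
A residual learning block...
___
"""
print(image_descriptions(page))
# prints: {'img/img-0.jpeg': 'A residual learning block...'}
```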

To read the fully processed document, use the read_pgs function:

from mistocr.pipeline import read_pgs
md = read_pgs('files/test/md_test')
print(md[:500])

# Deep Residual Learning for Image Recognition  ... page 1

Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com


## Abstract ... page 1

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins

By default, read_pgs() joins all pages. Pass join=False to get a list of individual pages instead.
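
To make the joined-versus-list behaviour concrete, here is an illustrative stand-in that reads `page_<n>.md` files in numeric order. It is not mistocr's `read_pgs`: the name `read_pages`, the blank-line separator, and the demo folder are assumptions for this sketch.

```python
import re
import tempfile
from pathlib import Path

PAGE_RE = re.compile(r"page_(\d+)\.md$")


def read_pages(folder, join=True):
    """Read page_<n>.md files in numeric order (illustrative; not mistocr's read_pgs)."""
    numbered = []
    for path in Path(folder).glob("page_*.md"):
        match = PAGE_RE.match(path.name)
        if match:  # skip files such as page_notes.md
            numbered.append((int(match.group(1)), path))
    # Sorting by the integer keeps page_10 after page_2, unlike a plain string sort.
    pages = [p.read_text(encoding="utf-8") for _, p in sorted(numbered)]
    return "\n\n".join(pages) if join else pages


# Demo on a throwaway folder so the sketch runs without any OCR output.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 10):
        (Path(tmp) / f"page_{n}.md").write_text(f"# Page {n}", encoding="utf-8")
    demo_pages = read_pages(tmp, join=False)
    demo_joined = read_pages(tmp)

print(demo_pages)
# prints: ['# Page 1', '# Page 2', '# Page 10']
```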

Advanced Usage

Batch OCR for entire folders:

from mistocr.core import ocr_pdf

# OCR all PDFs in a folder using Mistral's batch API
output_dirs = ocr_pdf('path/to/pdf_folder', dst='output_folder')

Custom models and prompts for heading fixes:

from mistocr.refine import fix_hdgs

# Use a different model or custom prompt
fix_hdgs('ocr_output/doc1', 
         model='gpt-4o',
         prompt=your_custom_prompt)

Custom image description with rate limiting:

from mistocr.refine import add_img_descs

# Control API usage and customize descriptions
await add_img_descs('ocr_output/doc1',
                    model='claude-opus-4',
                    semaphore=5,  # More concurrent requests
                    delay=0.5)    # Shorter delay between calls
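
For readers unfamiliar with the pattern, the sketch below shows one common way a concurrency limit and a per-call delay can be combined in asyncio. It is a general illustration, not mistocr's implementation: `run_limited`, `fake_describe`, and the parameter values are hypothetical, and the fake worker stands in for a real API call.

```python
import asyncio


async def run_limited(items, worker, max_concurrent=2, delay=1.0):
    """Run worker(item) for every item with at most max_concurrent calls in flight.

    Each task sleeps `delay` seconds after its call before freeing its slot.
    Illustrative only; not mistocr's implementation.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        async with sem:
            result = await worker(item)
            await asyncio.sleep(delay)
            return result

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(i) for i in items))


# Demo with a fake worker that records peak concurrency instead of calling an API.
active = 0
peak = 0


async def fake_describe(name):
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)
    active -= 1
    return f"description of {name}"


names = [f"img-{i}" for i in range(6)]
results = asyncio.run(run_limited(names, fake_describe, max_concurrent=2, delay=0.01))
print(peak, results[0])
# prints: 2 description of img-0
```

Raising the limit or shortening the delay trades lower wall-clock time for a higher chance of hitting provider rate limits. In a notebook, replace `asyncio.run(...)` with a plain `await`.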

For complete control over each pipeline step, see the core, refine, and pipeline module documentation.

Known Limitations & Future Work

mistocr is under active development. Current limitations include:

No timeout on batch jobs: Jobs poll indefinitely until completion. If a job stalls, manual intervention is required.
Limited error handling: When batch jobs fail, error reporting and recovery options are minimal.
Progress monitoring: Currently limited to periodic status prints. Future versions will support callbacks or streaming updates for better real-time monitoring.
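
Until a timeout is built in, callers who poll a job themselves can bound the wait. The following is a minimal sketch of polling with a deadline, assuming a caller-supplied `get_status` function. `QUEUED` and `RUNNING` appear in the log output earlier in this README; every other name here (`wait_for_job`, `PENDING`, the `SUCCESS` status in the demo) is an assumption and not part of mistocr.

```python
import time

# QUEUED and RUNNING appear in the log output above; terminal status names are assumptions.
PENDING = {"QUEUED", "RUNNING"}


def wait_for_job(get_status, timeout=3600.0, interval=10.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it leaves PENDING, or raise TimeoutError after `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status not in PENDING:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout} s")
        sleep(interval)


# Demo with a scripted status sequence instead of a real batch job.
statuses = iter(["QUEUED", "RUNNING", "RUNNING", "SUCCESS"])
final = wait_for_job(lambda: next(statuses), timeout=5, interval=0.01)
print(final)
# prints: SUCCESS
```

Using a monotonic clock for the deadline keeps the timeout unaffected by system clock adjustments.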

Contributions are welcome! If you encounter issues or have ideas for improvements, please open an issue or discussion on GitHub.

Developer Guide

If you are new to using nbdev, here are some useful pointers to get you started.

Install mistocr in Development mode

# make sure mistocr package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to mistocr
$ nbdev_prepare

Documentation

Documentation is hosted on this GitHub repository’s pages. Package-manager-specific guidelines are also available on conda and PyPI.

Release history

Version	Release date
0.5.2	Feb 12, 2026
0.5.1	Feb 12, 2026
0.5.0	Feb 6, 2026
0.4.3	Feb 3, 2026
0.4.2	Feb 2, 2026
0.4.1	Dec 16, 2025
0.4.0	Dec 16, 2025 (this version)
0.3.2	Dec 16, 2025
0.3.1	Dec 15, 2025
0.3.0	Dec 14, 2025
0.2.12	Dec 10, 2025
0.2.11	Dec 10, 2025
0.2.10	Dec 10, 2025
0.2.9	Dec 9, 2025
0.2.8	Dec 9, 2025
0.2.7	Dec 9, 2025
0.2.6	Nov 26, 2025
0.2.5	Nov 23, 2025
0.2.4	Nov 23, 2025
0.2.3	Nov 23, 2025
0.2.2	Nov 22, 2025
0.2.1	Nov 21, 2025
0.2.0	Nov 21, 2025
0.1.6	Nov 19, 2025
0.1.5	Nov 17, 2025
0.1.4	Nov 17, 2025
0.1.3	Nov 17, 2025
0.1.2	Nov 17, 2025
0.1.1	Nov 17, 2025
0.1.0	Nov 17, 2025
0.0.4	Nov 14, 2025
0.0.3	Nov 14, 2025

Download files

Source Distribution

mistocr-0.4.0.tar.gz (22.0 kB view details)

Uploaded Dec 16, 2025 Source

Built Distribution

mistocr-0.4.0-py3-none-any.whl (19.1 kB view details)

Uploaded Dec 16, 2025 Python 3

File details

Details for the file mistocr-0.4.0.tar.gz.

File metadata

Download URL: mistocr-0.4.0.tar.gz
Upload date: Dec 16, 2025
Size: 22.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for mistocr-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`6efe77703d6eaabb961097c8f9ffd8aadcfc7b3f108ec1dd3d7e6bb6b076b43a`
MD5	`d4aecb1ad0d2bc69152675ef496acb5b`
BLAKE2b-256	`3d91640b906dfc76713c38ebce0a4df43c3a2cf6dd50c4ff175eaaa20dbaf922`

File details

Details for the file mistocr-0.4.0-py3-none-any.whl.

File metadata

Download URL: mistocr-0.4.0-py3-none-any.whl
Upload date: Dec 16, 2025
Size: 19.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for mistocr-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fda34c47861697b02578270c919a1fd1cfb12e39ac84acdc0644281832e78517`
MD5	`5b380c769856577f20655e777074bbbb`
BLAKE2b-256	`4ce682ac3b694d72fee28d47b6cc7089f93e5f9f7383ddc4b9f42dd8c6057782`
