Batch OCR for PDFs with heading restoration and visual content integration
Mistocr
PDF OCR is a critical bottleneck in AI pipelines. It’s often mentioned in passing, as if it were a trivial step, but in practice it is far from trivial. Poorly converted PDFs mean garbage in, garbage out for downstream AI systems (RAG, …).
mistocr is powered by Mistral OCR 3, Mistral AI’s latest OCR model (December 2025), which delivers exceptional accuracy at a compelling price point.
mistocr emerged from months of real-world usage across projects requiring large-scale processing of niche-domain PDFs. It addresses two fundamental challenges that raw OCR output leaves unsolved, and it keeps processing costs down:
- Heading hierarchy restoration: Even state-of-the-art OCR sometimes produces inconsistent heading levels in large documents, because getting them right is a complex task. mistocr uses LLM-based analysis to restore proper document structure, which is essential for downstream AI tasks.
- Visual content integration: Charts, figures and diagrams are automatically classified and described, then integrated into the markdown. This makes visual information searchable and accessible for downstream applications.
- Cost-efficient batch processing: When processing multiple PDFs, mistocr automatically uses Mistral’s batch API, cutting costs by 50% ($1 vs $2 per 1000 pages). Single PDFs are processed immediately without batch queue delays.
In short: Complete PDF OCR with heading hierarchy fixes and image descriptions for RAG and LLM pipelines.
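To make the batch saving concrete, here is a small back-of-the-envelope sketch. It only uses the two prices quoted above ($1 vs $2 per 1000 pages). The function name, the `PRICE_PER_1000_PAGES` table and the corpus size are illustrative assumptions, not part of mistocr, and actual pricing may change.

```python
# Prices quoted in the text above (USD per 1000 pages); adjust if pricing changes.
PRICE_PER_1000_PAGES = {"batch": 1.0, "standard": 2.0}


def estimate_ocr_cost(n_pages: int, batch: bool = True) -> float:
    """Return the estimated OCR cost in USD for n_pages pages."""
    if n_pages < 0:
        raise ValueError("n_pages must be non-negative")
    rate = PRICE_PER_1000_PAGES["batch" if batch else "standard"]
    return n_pages / 1000 * rate


pages = 12_500  # hypothetical corpus size
print(f"batch:    ${estimate_ocr_cost(pages, batch=True):.2f}")   # batch:    $12.50
print(f"standard: ${estimate_ocr_cost(pages, batch=False):.2f}")  # standard: $25.00
```

This covers the OCR step only; the LLM calls used for heading fixes and image descriptions are billed separately by their provider.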
[!NOTE]
Want to see mistocr in action? This tutorial demonstrates real-world PDF processing and shows how clean markdown enables structure-aware navigation through long documents—letting you find exactly what you need, fast.
Get Started
Install the latest release from PyPI:
$ pip install mistocr
Set your API keys:
import os
os.environ['MISTRAL_API_KEY'] = 'your-key-here'
os.environ['ANTHROPIC_API_KEY'] = 'your-key-here' # for refine features (see Advanced Usage for other LLMs)
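If you would rather set the keys in your shell than hardcode them, a quick check before starting a long run can save a failed job. The sketch below is a generic helper, not a mistocr function. It assumes only the two variable names shown above, and `missing_api_keys` is a hypothetical name.

```python
import os

# Key names taken from the setup snippet above.
REQUIRED_KEYS = ("MISTRAL_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]


missing = missing_api_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
```

If you only run OCR and skip the refine features, the Anthropic key may not be needed, so you could trim `REQUIRED_KEYS` accordingly.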
Complete Pipeline
Single File Processing
Process a single PDF with OCR (using Mistral’s batch API for cost efficiency), heading fixes, and image descriptions:
from mistocr.pipeline import pdf_to_md
await pdf_to_md('files/test/resnet.pdf', 'files/test/md_test')
mistocr.pipeline - INFO - Step 1/3: Running OCR on files/test/resnet.pdf...
mistocr.core - INFO - Waiting for batch job 4ec899ca-ada8-4fa7-8894-0191ff6ac4e5 (initial status: QUEUED)
mistocr.core - DEBUG - Job 4ec899ca-ada8-4fa7-8894-0191ff6ac4e5 status: QUEUED (elapsed: 0s)
mistocr.core - DEBUG - Job 4ec899ca-ada8-4fa7-8894-0191ff6ac4e5 status: RUNNING (elapsed: 2s)
mistocr.core - DEBUG - Job 4ec899ca-ada8-4fa7-8894-0191ff6ac4e5 status: RUNNING (elapsed: 2s)
mistocr.core - DEBUG - Job 4ec899ca-ada8-4fa7-8894-0191ff6ac4e5 status: RUNNING (elapsed: 2s)
mistocr.core - INFO - Job 4ec899ca-ada8-4fa7-8894-0191ff6ac4e5 completed with status: SUCCESS
mistocr.pipeline - INFO - Step 2/3: Fixing heading hierarchy...
mistocr.pipeline - INFO - Step 3/3: Adding image descriptions...
Describing 12 images...
mistocr.pipeline - INFO - Done!
Saved descriptions to /tmp/tmp62c7_ac1/resnet/img_descriptions.json
Adding descriptions to 12 pages...
Done! Enriched pages saved to files/test/md_test
This will:
- OCR the PDF using Mistral’s batch API
- Fix heading hierarchy inconsistencies
- Describe images (charts, diagrams) and add those descriptions to the markdown
- Save everything to files/test/md_test
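The bare `await` in the call above works in a notebook or any other environment with a running event loop. In a plain Python script, top-level `await` is not available, so the call needs to be driven by `asyncio.run`. The wrapper below is a minimal sketch: `run_sync` is a hypothetical helper name, and it assumes `pdf_to_md` is a regular coroutine function, which the `await` usage suggests but does not guarantee.

```python
import asyncio


def run_sync(async_fn, *args, **kwargs):
    """Run an async function to completion from synchronous code."""
    return asyncio.run(async_fn(*args, **kwargs))


# Intended use (needs mistocr installed and API keys set):
#   from mistocr.pipeline import pdf_to_md
#   run_sync(pdf_to_md, 'files/test/resnet.pdf', 'files/test/md_test')
```

Note that `asyncio.run` raises an error if it is called while another event loop is already running, so keep using plain `await` inside notebooks.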
The output structure will be:
files/test/md_test/
├── img/
│ ├── img-0.jpeg
│ ├── img-1.jpeg
│ └── ...
├── page_1.md
├── page_2.md
└── ...
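If you want to walk the page files yourself, note that a plain alphabetical sort puts `page_10.md` before `page_2.md`. The sketch below sorts by page number instead. It assumes only the `page_N.md` naming shown in the tree above, and `list_pages` is a hypothetical helper, not part of mistocr.

```python
import re
from pathlib import Path


def list_pages(out_dir):
    """Return page_N.md files under out_dir, sorted by page number N."""
    pages = []
    for p in Path(out_dir).glob("page_*.md"):
        m = re.fullmatch(r"page_(\d+)\.md", p.name)
        if m:  # skip anything that does not follow the page_N.md pattern
            pages.append((int(m.group(1)), p))
    return [p for _, p in sorted(pages)]
```

For example, `list_pages('files/test/md_test')` would return the page paths in reading order, ignoring the `img/` folder.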
Each page’s markdown will include inline image descriptions:
```markdown

AI-generated image description:
___
A residual learning block...
___
```
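Because the descriptions follow a fixed textual pattern, they can be pulled back out of a page, for instance to index them separately. The following is a rough sketch that assumes the layout is exactly as in the sample above (a label line, a `___` line, the description, a closing `___` line). The regular expression and the `extract_img_descs` name are illustrative, not part of mistocr.

```python
import re

# Assumes the layout shown in the sample above:
# label line, "___" line, description text, closing "___" line.
DESC_RE = re.compile(
    r"AI-generated image description:[ \t]*\n___[ \t]*\n(.*?)\n___",
    re.DOTALL,
)


def extract_img_descs(page_md):
    """Return the image descriptions found in one page of markdown."""
    return [d.strip() for d in DESC_RE.findall(page_md)]


sample = """Some text.

AI-generated image description:
___
A residual learning block...
___

More text.
"""
print(extract_img_descs(sample))  # ['A residual learning block...']
```

A description that itself contains a line starting with `___` would be cut short by this pattern, so treat it as a starting point only.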
To read the fully processed document, use the read_pgs function:
from mistocr.pipeline import read_pgs
md = read_pgs('files/test/md_test')
print(md[:500])
# Deep Residual Learning for Image Recognition ... page 1
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
## Abstract ... page 1
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins
By default, read_pgs() joins all pages. Pass join=False to get a list of individual pages instead.
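In the printed sample, each heading ends with a ` ... page N` suffix, which makes it easy to build a quick outline of a long document. The sketch below assumes that suffix format holds for every heading. `outline` and `HEADING_RE` are hypothetical names, and the sample string is shortened from the output above.

```python
import re

# Assumes headings carry a " ... page N" suffix, as in the printed sample above.
HEADING_RE = re.compile(
    r"^(#{1,6})[ \t]+(.*?)[ \t]+\.\.\.[ \t]+page[ \t]+(\d+)[ \t]*$",
    re.MULTILINE,
)


def outline(md):
    """Return (level, title, page) tuples for each annotated heading in md."""
    return [(len(h), title, int(page)) for h, title, page in HEADING_RE.findall(md)]


sample = """# Deep Residual Learning for Image Recognition ... page 1
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun
## Abstract ... page 1
Deeper neural networks are more difficult to train.
"""
for level, title, page in outline(sample):
    print("  " * (level - 1) + f"{title} (p. {page})")
```

With the real pipeline output you would pass the string returned by `read_pgs` instead of `sample`.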
Advanced Usage
Batch OCR for entire folders:
from mistocr.core import ocr_pdf
# OCR all PDFs in a folder using Mistral's batch API
output_dirs = ocr_pdf('path/to/pdf_folder', dst='output_folder')
Custom models and prompts for heading fixes:
from mistocr.refine import fix_hdgs
# Use a different model or custom prompt
fix_hdgs('ocr_output/doc1',
         model='gpt-4o',
         prompt=your_custom_prompt)
Custom image description with rate limiting:
from mistocr.refine import add_img_descs
# Control API usage and customize descriptions
await add_img_descs('ocr_output/doc1',
                    model='claude-opus-4',
                    semaphore=5,  # More concurrent requests
                    delay=0.5)    # Shorter delay between calls
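For readers unfamiliar with the `semaphore` and `delay` knobs, the sketch below shows the general pattern such parameters usually control: a semaphore caps how many requests are in flight, and a short sleep spaces them out. This is a generic asyncio illustration, not mistocr's actual implementation. `run_limited` and `fake_describe` are hypothetical names, and the defaults are arbitrary.

```python
import asyncio


async def run_limited(jobs, limit=2, delay=1.0):
    """Run zero-argument coroutine functions, at most `limit` at a time.

    Each worker sleeps `delay` seconds after its job, before freeing its slot.
    """
    sem = asyncio.Semaphore(limit)

    async def worker(job):
        async with sem:
            result = await job()
            await asyncio.sleep(delay)
            return result

    # gather preserves the order of `jobs` in the returned list
    return await asyncio.gather(*(worker(j) for j in jobs))


async def fake_describe(i):
    """Stand-in for a real API call."""
    await asyncio.sleep(0.01)
    return f"description {i}"


jobs = [lambda i=i: fake_describe(i) for i in range(5)]
print(asyncio.run(run_limited(jobs, limit=2, delay=0.0)))
```

Raising the limit or lowering the delay speeds things up at the cost of a higher chance of hitting the provider's rate limits.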
For complete control over each pipeline step, see the core, refine, and pipeline module documentation.
Developer Guide
If you are new to using nbdev, here are some useful pointers to get you started.
Install mistocr in Development mode
# make sure mistocr package is installed in development mode
$ pip install -e .
# make changes under nbs/ directory
# ...
# compile to have changes apply to mistocr
$ nbdev_prepare
Documentation
Documentation can be found hosted on this GitHub repository’s pages. Additionally you can find package manager specific guidelines on conda and pypi respectively.
Download files
File details
Details for the file mistocr-0.5.1.tar.gz.
File metadata
- Download URL: mistocr-0.5.1.tar.gz
- Upload date:
- Size: 22.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b07940e2d29200431a38684a66a8cb319e7ad400a58b142b6ecb7f62fca6bf4e |
| MD5 | 23def6a925736a4178a38a0a95f86461 |
| BLAKE2b-256 | e64374b9e5190b6def2238e6240331f63a2b17a6699096e8f768a50d4296f953 |
File details
Details for the file mistocr-0.5.1-py3-none-any.whl.
File metadata
- Download URL: mistocr-0.5.1-py3-none-any.whl
- Upload date:
- Size: 19.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cb39b6ac02811b906c3b535162fe16da511fa089aac30a79c5bfa1e888076758 |
| MD5 | cde3887da9b923079e662df4920496bb |
| BLAKE2b-256 | d62b29dad4b3e3aeab8e3bce8a7ed44db627f834ff23cf1dcda3042c75db49bd |