Simple batch OCR for PDFs using Mistral's state-of-the-art vision model
mistocr
Why mistocr?
Performance: Mistral’s OCR delivers state-of-the-art accuracy on complex documents including tables, charts, and multi-column layouts.
Scale: Process entire folders of PDFs in a single batch job. Upload once, process asynchronously, and retrieve results when ready - perfect for large document sets.
Cost savings: Batch OCR mode reduces costs from $1/1000 pages to $0.50/1000 pages - a 50% reduction compared to synchronous processing.
Simplicity: A single ocr() function handles everything -
uploading, batch submission, polling for completion, and saving results
as markdown with extracted images. Process one PDF or an entire folder
with the same simple interface.
Organized output: Each PDF is automatically saved to its own folder
with pages as separate markdown files and images in an img subfolder,
making results easy to navigate and process further.
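The cost figures quoted above can be turned into a quick estimate. This is a minimal sketch that only does the arithmetic from the prices stated in this README; the `estimate_cost` helper and the 2,500-page document set are hypothetical and not part of mistocr, and current prices should be checked against Mistral's own pricing.

```python
# Per-1000-page prices as quoted in this README (assumed still current).
SYNC_USD_PER_1000 = 1.00
BATCH_USD_PER_1000 = 0.50


def estimate_cost(n_pages: int, usd_per_1000: float) -> float:
    """Return the estimated OCR cost in USD for n_pages (hypothetical helper)."""
    return n_pages / 1000 * usd_per_1000


n_pages = 2500  # hypothetical document set
sync_cost = estimate_cost(n_pages, SYNC_USD_PER_1000)
batch_cost = estimate_cost(n_pages, BATCH_USD_PER_1000)
print(f"sync: ${sync_cost:.2f}, batch: ${batch_cost:.2f}, saved: ${sync_cost - batch_cost:.2f}")
# prints "sync: $2.50, batch: $1.25, saved: $1.25"
```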
Installation
Install latest from the GitHub repository:
$ pip install git+https://github.com/franckalbinet/mistocr.git
or from PyPI:
$ pip install mistocr
How to use
from mistocr.core import ocr
- Process a single PDF:
fname = 'files/test/attention-is-all-you-need.pdf'
result = ocr(fname)
files/test/md/attention-is-all-you-need:
img/ page_11.md page_14.md page_3.md page_6.md page_9.md
page_1.md page_12.md page_15.md page_4.md page_7.md
page_10.md page_13.md page_2.md page_5.md page_8.md
files/test/md/attention-is-all-you-need/img:
img-0.jpeg img-1.jpeg img-2.jpeg img-3.jpeg img-4.jpeg
- Or process an entire folder:
results = ocr('files/test')
files/test/md:
attention-is-all-you-need/ resnet/
files/test/md/attention-is-all-you-need:
img/ page_11.md page_14.md page_3.md page_6.md page_9.md
page_1.md page_12.md page_15.md page_4.md page_7.md
page_10.md page_13.md page_2.md page_5.md page_8.md
files/test/md/attention-is-all-you-need/img:
img-0.jpeg img-1.jpeg img-2.jpeg img-3.jpeg img-4.jpeg
files/test/md/resnet:
img/ page_10.md page_12.md page_3.md page_5.md page_7.md page_9.md
page_1.md page_11.md page_2.md page_4.md page_6.md page_8.md
files/test/md/resnet/img:
img-0.jpeg img-2.jpeg img-4.jpeg img-6.jpeg
img-1.jpeg img-3.jpeg img-5.jpeg
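The listings above sort alphabetically, so page_10.md comes before page_2.md. If you want a single markdown document per PDF, the pages need to be read in numeric order. The following is a minimal sketch using only the standard library; `page_number` and `merge_pages` are hypothetical helpers, not part of mistocr, and they assume the `page_N.md` naming shown in the listings.

```python
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Extract N from a file named page_N.md."""
    match = re.fullmatch(r"page_(\d+)", path.stem)
    if match is None:
        raise ValueError(f"unexpected page file name: {path.name}")
    return int(match.group(1))


def merge_pages(doc_dir) -> str:
    """Concatenate the page_N.md files of one document folder in numeric page order."""
    pages = sorted(Path(doc_dir).glob("page_*.md"), key=page_number)
    # Separate pages with a blank line so markdown blocks do not run together.
    return "\n\n".join(p.read_text(encoding="utf-8") for p in pages)
```

For example, `merge_pages('files/test/md/resnet')` would return the twelve resnet pages as one string. Image links inside the pages are left untouched, so the merged text should be saved next to the img subfolder if those links are relative.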
- Customize the output:
results = ocr('files/test', out_dir='output', inc_img=False, poll_interval=5)
Parameters:
- path: A single PDF file or a folder containing multiple PDFs
- out_dir: Directory name for saving markdown output (default: 'md')
- inc_img: Include extracted images in the output (default: True)
- key: Your Mistral API key (uses the MISTRAL_API_KEY environment variable if not provided)
- poll_interval: Seconds between batch job status checks (default: 2)
Returns: List of paths to the generated markdown files
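To illustrate what poll_interval controls, here is a generic polling loop. It is a sketch under stated assumptions, not mistocr's actual implementation: `wait_for_job`, the `get_status` callable, the status names, and the `max_checks` limit are all hypothetical.

```python
import time
from typing import Callable, Tuple


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval: float = 2,
    done_states: Tuple[str, ...] = ("SUCCESS", "FAILED"),  # hypothetical status names
    max_checks: int = 1000,
) -> str:
    """Call get_status() every poll_interval seconds until it returns a terminal state."""
    for _ in range(max_checks):
        status = get_status()
        if status in done_states:
            return status
        # Wait before asking again; a larger interval means fewer status requests.
        time.sleep(poll_interval)
    raise TimeoutError(f"job did not finish after {max_checks} status checks")
```

A larger interval, such as the poll_interval=5 used in the example above, trades a slightly later result for fewer status requests, which can be a reasonable choice for long-running batch jobs.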
Developer Guide
If you are new to using nbdev, here are some useful pointers to get you
started.
Install mistocr in Development mode
# make sure mistocr package is installed in development mode
$ pip install -e .
# make changes under nbs/ directory
# ...
# compile to have changes apply to mistocr
$ nbdev_prepare
Documentation
Documentation is hosted on this GitHub repository’s pages. Package-manager-specific guidelines are also available on conda and PyPI.