Simple batch OCR for PDFs using Mistral's state-of-the-art vision model

mistocr

Why mistocr?

Performance: Mistral’s OCR delivers state-of-the-art accuracy on complex documents including tables, charts, and multi-column layouts.

Scale: Process entire folders of PDFs in a single batch job. Upload once, process asynchronously, and retrieve results when ready - perfect for large document sets.

Cost savings: Batch OCR mode reduces costs from $1/1000 pages to $0.50/1000 pages - a 50% reduction compared to synchronous processing.
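As a rough illustration of the arithmetic, the sketch below compares both rates for a document set. The prices are the ones quoted above and may change; `ocr_cost` and the page count are hypothetical and not part of mistocr.

```python
# Prices quoted above, in USD per 1,000 pages (assumed current; they may change).
SYNC_PRICE_PER_1000 = 1.00
BATCH_PRICE_PER_1000 = 0.50


def ocr_cost(pages: int, price_per_1000: float) -> float:
    """Return the cost in USD of processing `pages` pages at the given rate."""
    return pages * price_per_1000 / 1000


pages = 25_000  # hypothetical size of a document set
sync_cost = ocr_cost(pages, SYNC_PRICE_PER_1000)
batch_cost = ocr_cost(pages, BATCH_PRICE_PER_1000)

print(f"synchronous: ${sync_cost:.2f}, batch: ${batch_cost:.2f}, "
      f"saved: ${sync_cost - batch_cost:.2f}")
# prints: synchronous: $25.00, batch: $12.50, saved: $12.50
```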

Simplicity: A single ocr() function handles everything - uploading, batch submission, polling for completion, and saving results as markdown with extracted images. Process one PDF or an entire folder with the same simple interface.
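The polling step can be pictured with a generic loop like the one below. This is a minimal sketch of the pattern, not mistocr's actual implementation: `wait_until_done`, the `get_status` callable, the `timeout` argument, and the status names are all assumptions made for illustration.

```python
import time


def wait_until_done(get_status, poll_interval=2, timeout=600, sleep=time.sleep):
    """Call `get_status()` every `poll_interval` seconds until a terminal state.

    `get_status` is a hypothetical callable returning the job status as a string.
    """
    # Illustrative status names; a real batch API may use different ones.
    terminal = {"SUCCESS", "FAILED", "CANCELLED"}
    waited = 0
    while True:
        status = get_status()
        if status in terminal:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still {status!r} after {waited} seconds")
        sleep(poll_interval)
        waited += poll_interval
```

A larger `poll_interval` means fewer status requests at the cost of noticing completion a little later, which is usually a fine trade-off for long batch jobs.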

Organized output: Each PDF is automatically saved to its own folder with pages as separate markdown files and images in an img subfolder, making results easy to navigate and process further.

Installation

Install the latest version from the GitHub repository:

$ pip install git+https://github.com/franckalbinet/mistocr.git

or from PyPI:

$ pip install mistocr

How to use

from mistocr.core import ocr

  • Process a single PDF:

fname = 'files/test/attention-is-all-you-need.pdf'
result = ocr(fname)

files/test/md/attention-is-all-you-need:
img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

files/test/md/attention-is-all-you-need/img:
img-0.jpeg  img-1.jpeg  img-2.jpeg  img-3.jpeg  img-4.jpeg

  • Or process an entire folder:

results = ocr('files/test')

files/test/md:
attention-is-all-you-need/  resnet/

files/test/md/attention-is-all-you-need:
img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

files/test/md/attention-is-all-you-need/img:
img-0.jpeg  img-1.jpeg  img-2.jpeg  img-3.jpeg  img-4.jpeg

files/test/md/resnet:
img/       page_10.md  page_12.md  page_3.md  page_5.md  page_7.md  page_9.md
page_1.md  page_11.md  page_2.md   page_4.md  page_6.md  page_8.md

files/test/md/resnet/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg
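
In the listings above, plain alphabetical order puts page_10.md before page_2.md. When post-processing the output, it can help to sort pages by their numeric suffix instead. The following is a minimal sketch using only the standard library; `pages_in_order` and `join_pages` are hypothetical helpers, not part of mistocr, and they assume the page_N.md naming shown above.

```python
import re
from pathlib import Path

# Matches the page file names shown in the listings, e.g. "page_12.md".
PAGE_RE = re.compile(r"page_(\d+)\.md")


def pages_in_order(doc_dir):
    """Return the page_N.md files in `doc_dir`, sorted by page number N."""
    numbered = []
    for path in Path(doc_dir).iterdir():
        match = PAGE_RE.fullmatch(path.name)
        if match:  # skips the img/ subfolder and any unrelated files
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered, key=lambda item: item[0])]


def join_pages(doc_dir):
    """Concatenate all pages of one document into a single markdown string."""
    return "\n\n".join(
        path.read_text(encoding="utf-8") for path in pages_in_order(doc_dir)
    )
```

For example, `join_pages('files/test/md/resnet')` would return the whole document as one string, with pages in reading order.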

  • Customize the output:

results = ocr('files/test', out_dir='output', inc_img=False, poll_interval=5)

Parameters:

  • path: A single PDF file or folder containing multiple PDFs
  • out_dir: Directory name for saving markdown output (default: 'md')
  • inc_img: Include extracted images in the output (default: True)
  • key: Your Mistral API key (falls back to the MISTRAL_API_KEY environment variable if not provided)
  • poll_interval: Seconds between batch job status checks (default: 2)

Returns: List of paths to the generated markdown files
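
Because the key falls back to the MISTRAL_API_KEY environment variable, a script may want to fail early when neither source is available. The sketch below mirrors that lookup order as a pre-flight check; `resolve_api_key` is a hypothetical helper written for illustration, not a function provided by mistocr.

```python
import os


def resolve_api_key(key=None, env_var="MISTRAL_API_KEY"):
    """Return the explicit key if given, otherwise the environment variable.

    Raises RuntimeError when neither is available, so a script can stop
    before it starts uploading anything.
    """
    resolved = key or os.environ.get(env_var)
    if not resolved:
        raise RuntimeError(f"No API key given and {env_var} is not set")
    return resolved
```

The returned value could then be passed as the key argument, for example `ocr('files/test', key=resolve_api_key())`.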

Developer Guide

If you are new to nbdev, here are some useful pointers to get you started.

Install mistocr in Development mode

# make sure mistocr package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to mistocr
$ nbdev_prepare

Documentation

Documentation is hosted on this GitHub repository’s pages. Package-manager-specific guidelines are also available on conda and PyPI.
