A Python package for performing OCR and document indexing on legacy documents using the Mistral OCR API.

README for docin OCR Tool

Overview

docin is a lightweight OCR (Optical Character Recognition) tool powered by the Mistral API. It extracts text and images from PDF and image files and converts them into clean, structured Markdown for easy reading, indexing, or further processing.


Features

  • Automatically detects PDF or image input
  • Performs OCR using the Mistral API
  • Exports results as Markdown (.md)
  • Optionally includes extracted images
  • Displays real-time progress for multi-page documents
  • Prevents accidental overwriting of output files

Requirements

  • Python 3.8+
  • A valid Mistral API key

Installation

pip install docin

Usage

from ocr import MistOcr

# Initialize with your Mistral API key
ocr = MistOcr(api_key='your_mistral_api_key')

# Run OCR on a PDF or image file
ocr.doc_to_md(
    filename='path/to/document.pdf',
    output_filename='output/result.md',
    include_image=False,     # Include embedded or saved images (optional)
    return_response=False,   # Return OCR response (optional)
)
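To avoid hard-coding the key shown in the snippet above, one option is to read it from the environment before constructing MistOcr. This is only a sketch: the helper name load_api_key and the variable name MISTRAL_API_KEY are assumptions for illustration, not something docin defines or reads on its own.

```python
import os


def load_api_key(env_var: str = "MISTRAL_API_KEY") -> str:
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key to the API.
        raise RuntimeError(f"Set the {env_var} environment variable to your Mistral API key")
    return key
```

The result can then be passed as api_key when creating the MistOcr instance.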

Output

  • Saves extracted text in a Markdown (.md) file
  • Creates an images/ folder in the same directory for any extracted images
  • Displays progress during export
  • Returns an OCR response object when return_response=True

Supported File Types

  • PDF (.pdf)
  • Image formats: .jpg, .jpeg, .png, .bmp, .tiff
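
When batch-processing a folder, it can help to filter files before calling doc_to_md. The sketch below checks a path against the extensions listed above. The names SUPPORTED_EXTENSIONS and is_supported are hypothetical, and matching case-insensitively is an assumption, since the README does not say how docin itself compares extensions.

```python
from pathlib import Path

# Extensions copied from the list above; this set is illustrative, not part of docin.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".bmp", ".tiff"}


def is_supported(filename: str) -> bool:
    """Return True if the file's extension appears in the supported list."""
    # Assumption: extensions are compared case-insensitively.
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```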

Error Handling

  • Raises ValueError for unsupported file types
  • Prompts before overwriting existing files
  • Logs warnings for missing or invalid image data
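
A caller that processes many files may prefer to skip unsupported inputs rather than stop. The following is a minimal sketch, assuming only what is documented above: doc_to_md accepts the four keyword arguments from the Usage section and raises ValueError for unsupported file types. The wrapper name convert_or_none is hypothetical.

```python
from typing import Any, Optional


def convert_or_none(ocr: Any, filename: str, output_filename: str) -> Optional[Any]:
    """Run doc_to_md and return its response, or None if the file type is rejected."""
    try:
        return ocr.doc_to_md(
            filename=filename,
            output_filename=output_filename,
            include_image=False,
            return_response=True,
        )
    except ValueError as exc:
        # Documented behaviour: ValueError signals an unsupported file type.
        print(f"Skipping {filename}: {exc}")
        return None
```

Here ocr is expected to be a MistOcr instance created as in the Usage section.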

Notes

  • For best accuracy, use high-resolution images (≥300 DPI)
  • Supports multi-page PDFs and large documents
  • Extracted images are named using their unique IDs and saved in the images/ directory

Example Output

Markdown file:

# Page 1
Extracted text...

# Page 2
More extracted text...

Images folder:

images/
 ├── image_1.png
 └── image_2.jpg
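
For indexing, the Markdown output can be split back into pages. This sketch assumes every page begins with a heading of the form "# Page N", exactly as in the example Markdown file above; real output may differ, and the function name split_pages is hypothetical.

```python
import re
from typing import Dict

# Matches headings such as "# Page 12" on a line of their own.
PAGE_HEADING = re.compile(r"^# Page (\d+)[ \t]*$", re.MULTILINE)


def split_pages(markdown: str) -> Dict[int, str]:
    """Map each page number to the text found under its '# Page N' heading."""
    pages: Dict[int, str] = {}
    matches = list(PAGE_HEADING.finditer(markdown))
    for i, match in enumerate(matches):
        # A page's text runs until the next page heading or the end of the file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        pages[int(match.group(1))] = markdown[match.end():end].strip()
    return pages
```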

Known Issue

  • Typo in _load_image method: 'base66' should be 'base64'.
    Replace base66.b64encode with base64.b64encode for correct encoding.
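
For reference, this is what a correct call to the standard-library base64 module looks like. The body of _load_image is not shown in this README, so the helper below is a hypothetical stand-in rather than the package's actual code.

```python
import base64


def encode_image_bytes(data: bytes) -> str:
    """Return the Base64 text form of raw image bytes."""
    # b64encode returns bytes; decode to str so it can be embedded in text or JSON.
    return base64.b64encode(data).decode("ascii")
```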

Author

EnServ docin development team: Chukwudi Asibe, Ime Inyang, Oluwasey Akinbosola

License

MIT

Version

1.0.0
