A Python package for performing OCR and document indexing on legacy documents using the Mistral OCR API.

Project description

README for docin OCR Tool

Overview

docin is a lightweight OCR (Optical Character Recognition) tool powered by the Mistral API. It extracts text and images from PDF and image files and converts them into structured Markdown for reading, indexing, or further processing.

Features

Automatically detects PDF or image input
Performs OCR using the Mistral API
Exports results as Markdown (.md)
Optionally includes extracted images
Displays real-time progress for multi-page documents
Prevents accidental overwriting of output files

Requirements

Python 3.8+
A valid Mistral API key

Installation

pip install docin

Usage

from ocr import MistOcr

# Initialize with your Mistral API key
ocr = MistOcr(api_key='your_mistral_api_key')

# Run OCR on a PDF or image file
ocr.doc_to_md(
    filename='path/to/document.pdf',
    output_filename='output/result.md',
    include_image=False,        # Include embedded or saved images (optional)
    return_response=False      # Return OCR response (optional)
)

Output

Saves extracted text in a Markdown (.md) file
Creates an images/ folder in the same directory for any extracted images
Displays progress during export
Returns an OCR response object when return_response=True

Supported File Types

PDF (.pdf)
Image formats: .jpg, .jpeg, .png, .bmp, .tiff
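
The automatic PDF/image detection could be based on the file extension. The following is a minimal sketch under that assumption, not docin's actual implementation; the helper name detect_input_type and the two constants are hypothetical, and only the extension list comes from this README.

```python
from pathlib import Path

# Extensions taken from the list above; the helper itself is hypothetical.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff"}


def detect_input_type(filename: str) -> str:
    """Return "pdf" or "image" based on the (case-insensitive) extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    # Mirrors the documented behaviour: unsupported types raise ValueError.
    raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
```

Checking the extension before calling the API avoids spending a request on a file that cannot be processed.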

Error Handling

Raises ValueError for unsupported file types
Prompts before overwriting existing files
Logs warnings for missing or invalid image data
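
The overwrite prompt can be illustrated with a small helper. This is a sketch only: confirm_overwrite and its ask parameter are hypothetical names, and docin's real prompt wording and accepted answers may differ.

```python
from pathlib import Path
from typing import Callable


def confirm_overwrite(path: str, ask: Callable[[str], str] = input) -> bool:
    """Return True if it is safe to write to `path`.

    A missing file is always safe. For an existing file the user is asked,
    and only an explicit "y" or "yes" counts as consent.
    """
    target = Path(path)
    if not target.exists():
        return True
    answer = ask(f"{target} already exists. Overwrite? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```

Passing the prompt function as a parameter keeps the helper usable in scripts and tests, where a callable can stand in for interactive input.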

Notes

For best accuracy, use high-resolution images (≥300 DPI)
Supports multi-page PDFs and large documents
Extracted images are named using their unique IDs and saved in the images/ directory
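
As an illustration of how extracted images might be written to images/ under their IDs, here is a hedged sketch. It assumes each image arrives as an ID plus a base64 string, optionally prefixed with a data URI header; the function save_image and that input shape are assumptions, not docin's documented internals.

```python
import base64
import binascii
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def save_image(image_id: str, image_base64: Optional[str], output_dir: str) -> Optional[Path]:
    """Decode one image and save it as <output_dir>/images/<image_id>.

    Returns the written path, or None (after logging a warning) when the
    image data is missing or is not valid base64.
    """
    if not image_base64:
        logger.warning("No image data for %s; skipping", image_id)
        return None

    # Drop an optional "data:image/png;base64," style prefix.
    _, _, payload = image_base64.rpartition(",")
    try:
        data = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        logger.warning("Invalid image data for %s; skipping", image_id)
        return None

    images_dir = Path(output_dir) / "images"
    images_dir.mkdir(parents=True, exist_ok=True)
    # Use only the final path component so an ID cannot escape images/.
    target = images_dir / Path(image_id).name
    target.write_bytes(data)
    return target
```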

Example Output

Markdown file:

# Page 1
Extracted text...

# Page 2
More extracted text...
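
The page layout shown above can be reproduced with a few lines of Python. This is a sketch of the layout only, assuming the text of each page is available as a plain string; pages_to_markdown is a hypothetical helper, not part of docin's API.

```python
from typing import Iterable


def pages_to_markdown(pages: Iterable[str]) -> str:
    """Join per-page text into one Markdown string with "# Page N" headings."""
    sections = []
    for number, text in enumerate(pages, start=1):
        sections.append(f"# Page {number}\n{text.strip()}\n")
    # A blank line separates consecutive pages.
    return "\n".join(sections)
```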

Images folder:

images/
 ├── image_1.png
 ├── image_2.jpg

Known Issue

The _load_image method contains a typo: 'base66' should be 'base64'.
Replace base66.b64encode with base64.b64encode so that images are encoded correctly.
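
For reference, the corrected call uses the standard-library base64 module. The sketch below shows one way a local image could be encoded before being sent for OCR. The data URI format and the helper name load_image_as_data_uri are assumptions for illustration; the actual body of _load_image is not shown in this README.

```python
import base64
import mimetypes
from pathlib import Path


def load_image_as_data_uri(path: str) -> str:
    """Read an image file and return it as a base64 data URI."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        # Fall back to a generic type when the extension is unknown.
        mime_type = "application/octet-stream"
    # base64.b64encode (not "base66") returns bytes; decode to get a str.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```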

Author

EnServ docin development team: Chukwudi Asibe, Ime Inyang, Oluwasey Akinbosola

License

MIT

Version

1.0.0

Project details

Release history

0.1.2 (Oct 13, 2025)
0.1.1 (Oct 6, 2025, this version)
0.1.0 (Oct 6, 2025)

Download files

Source Distribution

docin-0.1.1.tar.gz (5.8 kB)

Uploaded Oct 6, 2025 Source

Built Distribution

docin-0.1.1-py3-none-any.whl (6.0 kB)

Uploaded Oct 6, 2025 Python 3

File details

Details for the file docin-0.1.1.tar.gz.

File metadata

Download URL: docin-0.1.1.tar.gz
Upload date: Oct 6, 2025
Size: 5.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for docin-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`ef8a6f3fabc15b3ee6dac67531c14e32cf09e95d9ff76072f905048f1cca21ec`
MD5	`9704bc6a3694e2621840465bf7de6ae6`
BLAKE2b-256	`0b21829f9599e2f5d014dfd235297bdfdc6d53effead17d26882085175988245`
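
A downloaded file can be checked against the published SHA256 digest with Python's hashlib. The snippet is a generic sketch; the helper name sha256_of and the local file path in the usage note are assumptions.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, sha256_of("docin-0.1.1.tar.gz") should equal the SHA256 value listed above if the download is intact. The same check applies to the wheel file and its own digest.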

File details

Details for the file docin-0.1.1-py3-none-any.whl.

File metadata

Download URL: docin-0.1.1-py3-none-any.whl
Upload date: Oct 6, 2025
Size: 6.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for docin-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`907c8001c2ee46f98c7ab9b16ffb8a6a1fb2617cdc1308b95e12c1327b84520f`
MD5	`e06c88c5bee65339caf5a2d66c75c724`
BLAKE2b-256	`42cafbd75444dc6375b2038131c611c69602e99628a7dea7418c97300139422f`
