A Python package for performing OCR and document indexing on legacy documents using the Mistral OCR API.
README for docin OCR Tool
Overview
docin is a lightweight OCR (Optical Character Recognition) tool powered by the Mistral API. It extracts text and images from PDF and image files, converting them into clean, structured Markdown format for easy reading, indexing, or further processing.
Features
- Automatically detects PDF or image input
- Performs OCR using the Mistral API
- Exports results as Markdown (.md)
- Optionally includes extracted images
- Displays real-time progress for multi-page documents
- Prevents accidental overwriting of output files
Requirements
- Python 3.8+
- A valid Mistral API key
Installation
pip install docin
Usage
from ocr import MistOcr
# Initialize with your Mistral API key
ocr = MistOcr(api_key='your_mistral_api_key')
# Run OCR on a PDF or image file
ocr.doc_to_md(
    filename='path/to/document.pdf',
    output_filename='output/result.md',
    include_image=False,    # Include embedded or saved images (optional)
    return_response=False,  # Return OCR response (optional)
)
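For a folder of scans, the single call above can be wrapped in a loop. The following is a minimal sketch, not part of docin: `convert_folder` and `SUPPORTED` are hypothetical names, and the only docin call it relies on is `doc_to_md` with the four keyword arguments shown in the usage example.

```python
from pathlib import Path

# Extensions taken from the "Supported File Types" list in this README.
SUPPORTED = {".pdf", ".jpg", ".jpeg", ".png", ".bmp", ".tiff"}


def convert_folder(ocr, src_dir, out_dir):
    """Call ocr.doc_to_md once per supported file in src_dir.

    `ocr` is any object exposing doc_to_md(filename=..., output_filename=...,
    include_image=..., return_response=...), as in the usage example.
    Returns the list of Markdown paths that were requested.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    requested = []
    for src in sorted(Path(src_dir).iterdir()):
        if not src.is_file() or src.suffix.lower() not in SUPPORTED:
            continue  # skip sub-folders and unsupported files
        target = out_dir / f"{src.stem}.md"
        ocr.doc_to_md(
            filename=str(src),
            output_filename=str(target),
            include_image=False,
            return_response=False,
        )
        requested.append(target)
    return requested
```

Passing the `ocr` object created above, as in `convert_folder(ocr, 'scans', 'output')`, would request one Markdown file per input. Two inputs that share a stem (for example `a.pdf` and `a.png`) map to the same output name, so expect the overwrite prompt in that case.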
Output
- Saves extracted text in a Markdown (.md) file
- Creates an images/ folder in the same directory for any extracted images
- Displays progress during export
- Returns an OCR response object when return_response=True
Supported File Types
- PDF (.pdf)
- Image formats: .jpg, .jpeg, .png, .bmp, .tiff
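Input detection over the file types listed above can be pictured as a simple extension check. This is an illustrative sketch only: `detect_input_kind` and `IMAGE_EXTENSIONS` are hypothetical names, and docin's own detection logic may differ.

```python
from pathlib import Path

# Image extensions copied from the list of supported file types.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff"}


def detect_input_kind(filename):
    """Return "pdf" or "image" based on the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    # Anything else is rejected, mirroring the documented ValueError.
    raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
```

Checking the extension before any API call is made means an unsupported file fails fast and costs no request.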
Error Handling
- Raises ValueError for unsupported file types
- Prompts before overwriting existing files
- Logs warnings for missing or invalid image data
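The overwrite prompt can be approximated as below. This is a sketch under assumptions, not docin's code: `confirm_overwrite` is a hypothetical helper, and the prompt wording and accepted answers are invented for illustration.

```python
from pathlib import Path


def confirm_overwrite(path, ask=input):
    """Return True if it is safe to write to `path`.

    A missing file is always safe. For an existing file the user is asked,
    and only an explicit "y" or "yes" counts as consent.
    """
    if not Path(path).exists():
        return True
    answer = ask(f"{path} already exists. Overwrite? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```

The `ask` parameter defaults to the built-in `input` and can be swapped for a function that always answers, which is convenient in scripts and tests where no terminal is attached.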
Notes
- For best accuracy, use high-resolution images (≥300 DPI)
- Supports multi-page PDFs and large documents
- Extracted images are named using their unique IDs and saved in the images/ directory
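Saving an extracted image under its ID might look like the following sketch. It assumes each image arrives as an ID plus a base64 string, optionally wrapped in a data URI; that shape is an assumption about the OCR response, and `save_image` is a hypothetical helper rather than a docin function.

```python
import base64
from pathlib import Path


def save_image(image_id, image_base64, out_dir="images"):
    """Decode a base64 string and write it to out_dir/<image_id>.

    Accepts raw base64 or a data URI ("data:image/png;base64,...").
    """
    if image_base64.startswith("data:"):
        # Keep only the payload after the first comma of the data URI.
        image_base64 = image_base64.split(",", 1)[1]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Path(...).name drops any directory part so an ID cannot escape out_dir.
    target = out_dir / Path(image_id).name
    target.write_bytes(base64.b64decode(image_base64))
    return target
```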
Example Output
Markdown file:
# Page 1
Extracted text...
# Page 2
More extracted text...
Images folder:
images/
├── image_1.png
└── image_2.jpg
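The page layout of the Markdown file shown above can be reproduced with a few lines. This is a sketch of the output shape only: `pages_to_markdown` is a hypothetical helper, and it assumes the per-page text is already available as a list of strings.

```python
def pages_to_markdown(page_texts):
    """Join per-page text under "# Page N" headings, numbered from 1."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"# Page {number}\n{text.strip()}")
    return "\n".join(sections) + "\n"
```

Called with `["Extracted text...", "More extracted text..."]`, it yields the same four lines as the Markdown example.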
Known Issue
- Typo in the _load_image method: 'base66' should be 'base64'.
  Replace base66.b64encode with base64.b64encode for correct encoding.
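For reference, a correct use of `base64.b64encode` when loading a local image looks roughly like this. The function below is a hypothetical stand-in, not the actual `_load_image` body, and the data-URI return format is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def load_image_as_data_uri(path):
    """Read an image file and return it as a base64 data URI string."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognised image file: {path}")
    # b64encode returns bytes, so decode to get a str for the URI.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```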
Author
EnServ docin development team: Chukwudi Asibe, Ime Inyang, Oluwasey Akinbosola
License
MIT
Version
1.0.0
File details
Details for the file docin-0.1.1.tar.gz.
File metadata
- Download URL: docin-0.1.1.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ef8a6f3fabc15b3ee6dac67531c14e32cf09e95d9ff76072f905048f1cca21ec |
| MD5 | 9704bc6a3694e2621840465bf7de6ae6 |
| BLAKE2b-256 | 0b21829f9599e2f5d014dfd235297bdfdc6d53effead17d26882085175988245 |
File details
Details for the file docin-0.1.1-py3-none-any.whl.
File metadata
- Download URL: docin-0.1.1-py3-none-any.whl
- Upload date:
- Size: 6.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 907c8001c2ee46f98c7ab9b16ffb8a6a1fb2617cdc1308b95e12c1327b84520f |
| MD5 | e06c88c5bee65339caf5a2d66c75c724 |
| BLAKE2b-256 | 42cafbd75444dc6375b2038131c611c69602e99628a7dea7418c97300139422f |