
A Python package for performing OCR and document indexing on legacy documents using the Mistral OCR API.


README for docin OCR Tool

Overview

docin is a lightweight document processing toolkit that combines OCR (Optical Character Recognition) and intelligent document analysis. It features two main components:

  1. OCR Engine: Powered by the Mistral API, it extracts text and images from PDF and image files, converting them into clean, structured Markdown format.

  2. LangQuery: An intelligent document analysis tool that uses LangChain and spaCy to extract structured information through natural language queries and visualize named entities.


Features

OCR Capabilities

  • Automatically detects PDF or image input
  • Performs OCR using the Mistral API
  • Exports results as Markdown (.md)
  • Optionally includes extracted images
  • Displays real-time progress for multi-page documents
  • Prevents accidental overwriting of output files

Document Analysis

  • Extract structured information using natural language queries
  • Visualize named entities with interactive highlighting
  • Customize prompts and examples for specific use cases
  • Return results in JSON format for easy processing

Requirements

  • Python 3.8+
  • A valid Mistral API key
  • A LangChain-compatible language model (for document analysis)
  • spaCy with 'en_core_web_sm' model (auto-installed)

Installation

pip install docin

The spaCy model 'en_core_web_sm' will be automatically downloaded during installation.


Usage

OCR Processing

from ocr import MistOcr

# Initialize with your Mistral API key
ocr = MistOcr(api_key='your_mistral_api_key')

# Run OCR on a PDF or image file
ocr.doc_to_md(
    filename='path/to/document.pdf',
    output_filename='output/result.md',
    include_image=False,     # Also save extracted images (optional)
    return_response=False    # Also return the raw OCR response (optional)
)
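When return_response=True, doc_to_md also returns the raw OCR response. Its exact shape depends on the Mistral client, so as an illustration only, here is a hypothetical post-processing helper that joins per-page Markdown into the "# Page n" layout shown under Example Output below; pages_to_markdown is an assumption for this sketch, not part of the docin API:

```python
# Hypothetical helper: join per-page Markdown strings into one document,
# using the "# Page n" heading style shown in the Example Output section.
def pages_to_markdown(page_markdowns):
    sections = []
    for i, text in enumerate(page_markdowns, start=1):
        sections.append(f"# Page {i}\n{text.strip()}")
    return "\n\n".join(sections)

doc = pages_to_markdown(["Extracted text...", "More extracted text..."])
print(doc)
```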

Document Analysis with LangQuery

from ocr.query import LangQuery

# Initialize with your LangChain model
query = LangQuery(llm=your_llm)

# Load and analyze a document
query.load_document("Your document text here")

# Extract information using natural language
response = query.query_document("Find all company names and locations")

# Visualize results (in Jupyter/IPython)
query.render(response)

You can customize the analysis by setting different examples:

# Set custom examples for specific entity types
query.set_examples("""
{
    'companies': ['Example Corp', 'Tech Inc'],
    'dates': ['2023-01-01'],
    'locations': ['New York']
}
""")
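Note that the examples string above is a Python-style dict literal (single quotes), not strict JSON. If you want to validate such a string before passing it to set_examples, the standard library's ast.literal_eval handles it; this is a sanity-check sketch, not part of the LangQuery API:

```python
import ast

examples = """
{
    'companies': ['Example Corp', 'Tech Inc'],
    'dates': ['2023-01-01'],
    'locations': ['New York']
}
"""

# literal_eval safely parses Python literals (dicts, lists, strings)
# without executing arbitrary code, unlike eval().
parsed = ast.literal_eval(examples.strip())
print(sorted(parsed))  # → ['companies', 'dates', 'locations']
```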

Output

OCR Output

  • Saves extracted text in a Markdown (.md) file
  • Creates an images/ folder in the same directory for any extracted images
  • Displays progress during export
  • Returns an OCR response object when return_response=True

LangQuery Output

  • Returns structured JSON with extracted entities
  • Provides interactive entity highlighting in Jupyter/IPython
  • Supports customizable response formats through examples

Supported File Types

  • PDF (.pdf)
  • Image formats: .jpg, .jpeg, .png, .bmp, .tiff
  • LangQuery works with any text content
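The README does not document how docin tells PDFs and images apart, but a minimal extension-based check (the usual approach, matching the ValueError behavior described under Error Handling) might look like this sketch; detect_input_type is a hypothetical helper, not a docin function:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff"}

def detect_input_type(filename):
    """Classify an input file as 'pdf' or 'image' by its extension."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        return "pdf"
    if ext in IMAGE_EXTS:
        return "image"
    raise ValueError(f"Unsupported file type: {ext}")

print(detect_input_type("scan.PDF"))   # → pdf
print(detect_input_type("page.jpeg"))  # → image
```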

Error Handling

  • Raises ValueError for unsupported file types
  • Prompts before overwriting existing files
  • Logs warnings for missing or invalid image data
  • Validates document loading for LangQuery
  • Ensures proper model initialization
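The overwrite prompt is handled inside docin itself, but if you would rather avoid an interactive prompt in scripts, a common pattern is to pick a fresh output name up front. This is a generic standard-library sketch, not docin behavior:

```python
from pathlib import Path

def unique_path(path):
    """Return the path unchanged if it is free, else append a numeric suffix."""
    p = Path(path)
    if not p.exists():
        return p
    for i in range(1, 1000):
        candidate = p.with_name(f"{p.stem}_{i}{p.suffix}")
        if not candidate.exists():
            return candidate
    raise FileExistsError(f"No free name found for {path}")

out = unique_path("output/result.md")  # e.g. output/result.md, or result_1.md if taken
```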

Notes

  • For OCR: Use high-resolution images (≥300 DPI) for best accuracy
  • Supports multi-page PDFs and large documents
  • Extracted images are saved with unique IDs in the images/ directory
  • LangQuery works best with well-structured text content
  • Entity visualization requires Jupyter/IPython environment

Example Output

OCR Output

Markdown file:

# Page 1
Extracted text...

# Page 2
More extracted text...

Images folder:

images/
 ├── image_1.png
 ├── image_2.jpg

LangQuery Output

JSON Response:

{
    "companies": ["Acme Corp", "TechStart Inc"],
    "locations": ["New York", "San Francisco"],
    "dates": ["2023-01-15"]
}
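A response like the one above can be consumed directly with the json module; the flattening step below is only an illustration of downstream processing, not part of the LangQuery API:

```python
import json

response = '''{
    "companies": ["Acme Corp", "TechStart Inc"],
    "locations": ["New York", "San Francisco"],
    "dates": ["2023-01-15"]
}'''

data = json.loads(response)

# Flatten into (entity_type, value) pairs for downstream use.
pairs = [(etype, value) for etype, values in data.items() for value in values]
print(len(pairs))  # → 5
```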

Visual Output:

  • Interactive highlighting of found entities in the document text
  • Color-coded entity types for easy identification

Author

Ime Inyang

License

MIT

Version

0.1.2

Project details


Download files

Source Distribution

docin-0.1.2.tar.gz (9.3 kB)

Built Distribution

docin-0.1.2-py3-none-any.whl (10.3 kB)

File details

Details for the file docin-0.1.2.tar.gz.

File metadata

  • Size: 9.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

  • SHA256: 5dc215701cf304c9dd08c4d3c24b9be595521a61ff21b9e9131b67c72f85f512
  • MD5: 07647cc1c22b1a5f4a257e6d2a184199
  • BLAKE2b-256: e7854160100218760200bf735512f87e7b759244620d0367b80ae4943d942d17

File details

Details for the file docin-0.1.2-py3-none-any.whl.

File metadata

  • Size: 10.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

  • SHA256: 2982a0502e69a6241cdf579b9c6eca5776dcbba78fc2dd041c8cff92eaa30369
  • MD5: 6564bdbf7588469d9b0053fdc1eeeb22
  • BLAKE2b-256: 7f0cae7ff449daad0bd373736d4faabbda2616696a725a3c228718211e9bd797
