A library for parsing PDF documents using Mistral's OCR API

Mistral OCR

A lightweight Python library that parses PDF documents with Mistral's OCR API. It extracts text while preserving document structure and converts images into structured markdown sections with detailed descriptions.

Features

PDF document parsing with Mistral OCR
Text extraction with preserved formatting
Image extraction with detailed descriptions
Structured markdown output
Support for complex document layouts
Batch processing capabilities
Structured OCR with schema validation
Language detection and topic extraction

Installation

From PyPI

```
pip install mistral-ocr-parser
```

From Source

```
git clone https://github.com/raviraina/mistral-ocr-parser.git
cd mistral-ocr-parser
pip install -e .
```

Quick Start

Set up your API key

Create a .env file in your project directory:

```
MISTRAL_API_KEY=your_api_key_here
```

Or set it as an environment variable:

```
export MISTRAL_API_KEY=your_api_key_here
```
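
As a minimal sketch of how an application might check for this key before calling the library, the helper below looks up `MISTRAL_API_KEY` in the environment and fails early with a clear message when it is missing. The `get_api_key` name is hypothetical and not part of `mistral_ocr`; the helper itself does not read `.env` files.

```python
import os


def get_api_key(env=None, name="MISTRAL_API_KEY"):
    """Return the API key from the environment, or raise if it is unset.

    Hypothetical helper for illustration; not part of mistral_ocr.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return key
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.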

Basic Usage

```
from mistral_ocr import parse_pdf

# Parse a PDF file
result = parse_pdf("path/to/your/document.pdf")

# Save the result to a markdown file (UTF-8, since OCR text may contain non-ASCII characters)
with open("output.md", "w", encoding="utf-8") as f:
    f.write(result)
```
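
When converting several documents by hand, it can help to derive the output name from the input name. The sketch below assumes only that `parse_pdf` takes a path and returns the markdown as a string, as in the example above. `pdf_to_markdown_file` is a hypothetical helper, and the parse function is passed in as an argument so the sketch can be tried without an API key.

```python
from pathlib import Path


def pdf_to_markdown_file(pdf_path, parse_fn, output_dir="."):
    """Parse one PDF with `parse_fn` and write the markdown into `output_dir`.

    Hypothetical helper: `parse_fn` is expected to behave like
    `mistral_ocr.parse_pdf`, i.e. take a path and return a string.
    """
    pdf_path = Path(pdf_path)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (pdf_path.stem + ".md")
    # Write UTF-8 explicitly because OCR text may contain non-ASCII characters.
    out_path.write_text(parse_fn(str(pdf_path)), encoding="utf-8")
    return out_path
```

For example, `pdf_to_markdown_file("report.pdf", parse_pdf, "outputs")` would write `outputs/report.md`.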

Command Line Interface

```
# Process a single PDF
mistral-ocr --input document.pdf --output result.md

# Process multiple PDFs
mistral-ocr-batch --input-dir pdfs/ --output-dir outputs/
```
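
To drive the single-file command from a Python script, one option is the standard `subprocess` module. The sketch below uses only the `--input` and `--output` flags shown above. `build_ocr_command` and `run_ocr_cli` are hypothetical names, and running the command assumes the `mistral-ocr` executable is on your `PATH`.

```python
import subprocess


def build_ocr_command(input_pdf, output_md):
    """Build the argument list for the `mistral-ocr` command shown above."""
    return ["mistral-ocr", "--input", str(input_pdf), "--output", str(output_md)]


def run_ocr_cli(input_pdf, output_md):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(input_pdf, output_md), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.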

Structured OCR

```
from mistral_ocr import structured_ocr
from mistralai import Mistral

# Initialize Mistral client
client = Mistral(api_key="your_api_key_here")

# Process an image with structured OCR
result = structured_ocr("path/to/your/image.png", client)

# Access structured data
print(f"File name: {result['file_name']}")
print(f"Topics: {result['topics']}")
print(f"Languages: {result['languages']}")
print(f"OCR Contents: {result['ocr_contents']}")
```

Running Examples

The repository includes example scripts that demonstrate how to use the library. You can run these examples using the run_examples.py script:

```
# List all available examples
python run_examples.py --list

# Run a specific example
python run_examples.py simple_example
```

Available examples:

simple_example: Demonstrates basic PDF parsing
batch_processing: Shows how to process multiple PDFs in batch
image_example: Demonstrates processing an image with structured OCR

Example Output

PDF Parsing

The output is a markdown file that preserves the document structure and includes detailed descriptions of images:

```
# Document Title

## Section 1

This is the text content of section 1.

![Image Description](image_placeholder.png)
*Image Description: A graph showing the relationship between X and Y variables. The graph has a positive slope indicating a direct correlation.*

**Image Metadata:**
- Type: Graph
- Dimensions: 500x300
- Content: Statistical data visualization
- Key Elements: X-axis (Time), Y-axis (Value), Trend line

## Section 2

This is the text content of section 2.
```
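
Because the output is plain markdown, it can be post-processed with ordinary text tools. As a sketch, assuming image descriptions always appear as an italic line starting with `*Image Description:` as in the sample above (the real output may vary), the hypothetical `extract_image_descriptions` helper collects them:

```python
import re

# Matches lines like: *Image Description: some text*
_DESCRIPTION_RE = re.compile(r"^\*Image Description:\s*(.+?)\*\s*$", re.MULTILINE)


def extract_image_descriptions(markdown_text):
    """Return the text of every '*Image Description: ...*' line."""
    return _DESCRIPTION_RE.findall(markdown_text)
```

The image link line itself starts with `!` and is therefore not matched; only the italic description line is.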

Structured OCR

The structured OCR output is a JSON object with the following structure:

```
{
    "file_name": "receipt.png",
    "topics": ["receipt", "transaction", "purchase"],
    "languages": ["English"],
    "ocr_contents": {
        "store": {
            "name": "GROCERY STORE",
            "address": "123 Main Street, Anytown, USA",
            "phone": "555-123-4567"
        },
        "date": "2023-05-15",
        "time": "14:30",
        "items": [
            {
                "name": "Milk",
                "quantity": 1,
                "price": 3.99
            },
            {
                "name": "Bread",
                "quantity": 2,
                "price": 2.49
            }
        ],
        "subtotal": 8.97,
        "tax": 0.72,
        "total": 9.69,
        "payment": {
            "method": "Credit Card",
            "card": "VISA ****1234"
        }
    }
}
```
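
A structured result like this lends itself to simple consistency checks. The sketch below assumes the receipt layout shown above (an `items` list with `quantity` and `price`, plus `subtotal`, `tax`, and `total`); other documents will produce different keys, and `check_receipt_totals` is a hypothetical helper rather than part of the library.

```python
from decimal import Decimal


def check_receipt_totals(ocr_contents):
    """Return True if items sum to the subtotal and subtotal + tax equals total."""

    def money(value):
        # Go through str() so floats like 3.99 become exact decimals.
        return Decimal(str(value))

    items_sum = sum(
        (money(item["price"]) * item["quantity"] for item in ocr_contents["items"]),
        Decimal("0"),
    )
    subtotal = money(ocr_contents["subtotal"])
    return (
        items_sum == subtotal
        and subtotal + money(ocr_contents["tax"]) == money(ocr_contents["total"])
    )
```

`Decimal` is used instead of `float` so that cent amounts compare exactly.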

Advanced Usage

Batch Processing

```
from mistral_ocr import batch_process_pdfs

# Process multiple PDF files
output_files = batch_process_pdfs(
    input_dir="pdfs/",
    output_dir="outputs/",
    file_pattern="*.pdf"
)
```
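
To preview which files a batch run would touch, the pattern matching can be reproduced with `pathlib`. This is only a sketch of the idea, assuming each matching PDF maps to a `.md` file with the same stem in the output directory. It is not taken from the library's implementation, and `plan_batch` is a hypothetical name.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir, file_pattern="*.pdf"):
    """Return sorted (input_path, output_path) pairs without processing anything."""
    out_dir = Path(output_dir)
    return [
        (pdf, out_dir / (pdf.stem + ".md"))
        for pdf in sorted(Path(input_dir).glob(file_pattern))
    ]
```

Printing the pairs from `plan_batch("pdfs/", "outputs/")` before a real run is a cheap way to catch a wrong pattern or directory.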

Custom API Key

```
from mistral_ocr import MistralOCRParser

# Initialize the parser with a custom API key
parser = MistralOCRParser(api_key="your_api_key_here")

# Parse a PDF file
result = parser.parse_pdf("document.pdf", "output.md")
```

Image Processing

```
from mistral_ocr import process_image_ocr
from mistralai import Mistral

# Initialize Mistral client
client = Mistral(api_key="your_api_key_here")

# Process an image with OCR
ocr_result = process_image_ocr("image.png", client)

# Extract markdown content
markdown_content = ocr_result.pages[0].markdown
```
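
The example reads only the first page. For results with more than one page, a sketch that joins the markdown of every page could look like this, assuming each entry in `pages` exposes a `markdown` string as above (`join_pages_markdown` is a hypothetical helper):

```python
def join_pages_markdown(ocr_result, separator="\n\n"):
    """Concatenate the markdown of all pages in an OCR response."""
    return separator.join(page.markdown for page in ocr_result.pages)
```

An empty `pages` list yields an empty string, which avoids the `IndexError` that `pages[0]` would raise in that case.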

Development

Project Structure

```
mistral-ocr/
├── .github/workflows/        # GitHub CI/CD workflows
├── examples/                 # Example scripts
├── mistral_ocr/              # Main package directory
│   ├── __init__.py           # Package initialization
│   ├── parser.py             # Core parser functionality
│   ├── image.py              # Image processing functionality
│   └── utils.py              # Utility functions
├── tests/                    # Test directory
├── run_examples.py           # Script to run examples
└── run_tests.py              # Script to run tests
```

Setup

Clone the repository
Install development dependencies:
```
pip install -e ".[dev]"
```

Running Tests

You can run the test suite using the run_tests.py script:

```
# Run all tests
python run_tests.py

# Run tests with verbose output
python run_tests.py --verbose

# Run tests with coverage report
python run_tests.py --coverage

# Run a specific test file
python run_tests.py --file tests/test_parser.py
```

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

This project uses the Mistral OCR API for document processing. For more information about Mistral's OCR capabilities, visit Mistral AI's documentation.

Version 0.1.0, released Mar 7, 2025

Download files

Source Distribution

mistral_ocr_parser-0.1.0.tar.gz (15.0 kB)

Uploaded Mar 7, 2025 Source

Built Distribution

mistral_ocr_parser-0.1.0-py3-none-any.whl (11.8 kB)

Uploaded Mar 7, 2025 Python 3

File details

Details for the file mistral_ocr_parser-0.1.0.tar.gz.

File metadata

Download URL: mistral_ocr_parser-0.1.0.tar.gz
Upload date: Mar 7, 2025
Size: 15.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for mistral_ocr_parser-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a96cfe39331486ee098479a6181a80532f37d09639a6c77654bb72ec8d70aedc`
MD5	`1a2029acb9e5904f6461ed8a0622e232`
BLAKE2b-256	`efe47a5f1f724f4ffc39e5ee5a7a9fe91c3fe03da762f880ef50bb388a71a0bb`
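
To check a downloaded archive against the SHA256 digest listed above, Python's standard `hashlib` module is enough. A minimal sketch (`sha256_of_file` is a hypothetical helper name):

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The download is intact if `sha256_of_file("mistral_ocr_parser-0.1.0.tar.gz")` equals the SHA256 value in the table.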

File details

Details for the file mistral_ocr_parser-0.1.0-py3-none-any.whl.

File metadata

Download URL: mistral_ocr_parser-0.1.0-py3-none-any.whl
Upload date: Mar 7, 2025
Size: 11.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for mistral_ocr_parser-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bf2b60a7f83547c775d3259edf73e8175604de8a8e72f61e91d005a8fee76b30`
MD5	`e47f0fc578d6f864a719bf041e81ca65`
BLAKE2b-256	`15f3e4747817503a5ab0a1b2ba836da5032f676447733e3becfc122e06381a2a`
