A small OCR package

MiniOCR

Python 3.8+ | License: MIT

A Python package that performs Optical Character Recognition (OCR) on images, PDF documents, and PowerPoint presentations using OpenAI's Vision API, and returns the extracted text as Markdown.

Features

  • 🖼️ Multi-format support: Process images (PNG, JPG, JPEG, GIF, BMP, TIFF, WebP), PDF files, and PPTX presentations
  • Parallel processing: Concurrent processing of multiple pages/slides for improved performance
  • 🌐 Cross-platform: Works on Windows, macOS, and Linux
  • 📄 Visual OCR for PPTX: Converts PowerPoint slides to images for accurate visual content extraction
  • 🔄 Async support: Built with asyncio for efficient processing
  • 📝 Markdown output: Converts documents to clean, structured markdown format

Installation

Prerequisites

For PDF processing, you'll need to install Poppler:

macOS:

brew install poppler

Ubuntu/Debian:

sudo apt-get install poppler-utils

Windows: Download and install from Poppler for Windows

For PowerPoint (.pptx) processing, you'll need to install LibreOffice:

macOS:

brew install --cask libreoffice

Ubuntu/Debian:

sudo apt-get install libreoffice

Windows: Download and install from LibreOffice official website
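
Before running MiniOCR on PDF or PPTX files, it can help to confirm that both tools are reachable. The following is a minimal sketch, not part of MiniOCR: the helper name find_tools is hypothetical, and the executable names are assumptions (pdftoppm ships with Poppler; LibreOffice is usually exposed as soffice, or as libreoffice on some Linux distributions). How MiniOCR itself locates these tools is not documented here, so a "not found" result only means the executable is not on your PATH.

```python
import shutil

# Executable names are assumptions, not taken from MiniOCR:
# `pdftoppm` is a Poppler utility; LibreOffice is usually `soffice`
# (or `libreoffice` on some Linux distributions).
REQUIRED_TOOLS = {
    "poppler": ("pdftoppm",),
    "libreoffice": ("soffice", "libreoffice"),
}


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to the first matching executable path, or None."""
    found = {}
    for name, candidates in tools.items():
        paths = (shutil.which(candidate) for candidate in candidates)
        found[name] = next((path for path in paths if path), None)
    return found


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```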

Install MiniOCR

pip install miniocr

Or install from source:

git clone https://github.com/w95/miniocr.git
cd miniocr
pip install -e .

Quick Start

Setup

First, you'll need an OpenAI API key. Set it as an environment variable:

export OPENAI_API_KEY="your-api-key-here"

Or pass it directly when initializing the class.
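
If you want a clear error message when neither source provides a key, you could resolve the key yourself before creating the instance. This is a sketch under stated assumptions: resolve_api_key is a hypothetical helper, not part of MiniOCR, and it simply prefers an explicit key over the environment variable.

```python
import os


def resolve_api_key(explicit_key=None, env_var="OPENAI_API_KEY"):
    """Return the explicit key if given, otherwise the environment value.

    Hypothetical helper (not part of MiniOCR): raises RuntimeError when
    no key can be found in either place.
    """
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found: pass one explicitly or set {env_var}"
        )
    return key
```

The result can then be passed to the constructor, for example MiniOCR(api_key=resolve_api_key()).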

Basic Usage

import asyncio
from miniocr import MiniOCR

async def main():
    # Initialize with API key (or use environment variable)
    ocr = MiniOCR(api_key="your-api-key-here")
    
    # Process an image
    result = await ocr.ocr("path/to/image.jpg")
    print(result["content"])
    
    # Process a PDF
    result = await ocr.ocr("path/to/document.pdf")
    print(f"Processed {result['pages']} pages")
    print(result["content"])
    
    # Process a PowerPoint presentation
    result = await ocr.ocr("path/to/presentation.pptx")
    print(result["content"])

if __name__ == "__main__":
    asyncio.run(main())

Advanced Usage

import asyncio
from miniocr import MiniOCR

async def advanced_example():
    ocr = MiniOCR()
    
    # Process with custom settings
    result = await ocr.ocr(
        file_path="document.pdf",
        model="gpt-4o",  # Use a different OpenAI model than the default
        concurrency=10,  # Process up to 10 pages simultaneously
        output_dir="./output",  # Save the markdown output to this directory
        cleanup=True  # Clean up temporary files
    )
    
    print(f"File: {result['file_name']}")
    print(f"Pages processed: {result['pages']}")
    print(f"Content length: {len(result['content'])} characters")

asyncio.run(advanced_example())

Processing URLs

import asyncio
from miniocr import MiniOCR

async def process_url():
    ocr = MiniOCR()
    
    # Process a file from URL
    result = await ocr.ocr("https://example.com/document.pdf")
    print(result["content"])

asyncio.run(process_url())

API Reference

MiniOCR Class

__init__(api_key: str = None)

Initialize the MiniOCR instance.

Parameters:

  • api_key (str, optional): OpenAI API key. If omitted, the OPENAI_API_KEY environment variable is used.

async ocr(file_path, model="gpt-4o-mini", concurrency=5, output_dir=None, cleanup=True)

Process a file and extract text using OCR.

Parameters:

  • file_path (str): Path or URL to the file to process
  • model (str): OpenAI model to use (default: "gpt-4o-mini")
  • concurrency (int): Number of concurrent API requests (default: 5)
  • output_dir (str, optional): Directory to save markdown output
  • cleanup (bool): Whether to clean up temporary files (default: True)

Returns:

  • dict: Dictionary containing:
    • content (str): Extracted text in markdown format
    • pages (int): Number of pages/slides processed
    • file_name (str): Name of the processed file

Supported file types:

  • Images: .png, .jpg, .jpeg, .gif, .bmp, .tiff, .webp
  • Documents: .pdf
  • Presentations: .pptx
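
To reject unsupported inputs before making any API call, the extension list above can be checked up front. This is a minimal sketch: is_supported and SUPPORTED_EXTENSIONS are hypothetical names, not part of MiniOCR, and the check looks only at the file extension (MiniOCR's own validation may differ).

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions copied from the "Supported file types" list above.
SUPPORTED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp",  # images
    ".pdf",   # documents
    ".pptx",  # presentations
}


def is_supported(path_or_url):
    """Return True if the path or URL ends in a listed extension."""
    parsed = urlparse(path_or_url)
    # For URLs, inspect only the path component so query strings are ignored.
    if parsed.scheme in ("http", "https"):
        target = parsed.path
    else:
        target = path_or_url
    return PurePosixPath(target).suffix.lower() in SUPPORTED_EXTENSIONS
```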

Configuration

Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key (required unless api_key is passed to the MiniOCR constructor)

Model Options

MiniOCR supports various OpenAI models:

  • gpt-4o-mini (default, cost-effective)
  • gpt-4o (higher accuracy)
  • gpt-4-turbo

Output Format

MiniOCR converts documents to clean markdown with the following features:

  • Tables: Converted to HTML format for better structure
  • Checkboxes: Represented as ☐ (unchecked) and ☑ (checked)
  • Special elements: Logos, watermarks, and page numbers are wrapped in brackets
  • Charts and infographics: Interpreted and converted to markdown tables when applicable
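
Because these conventions are plain text, the returned content can be inspected with ordinary string tools. The sketch below counts checkboxes and HTML tables in a markdown string; summarize_markdown is a hypothetical helper, and the sample string is made up for illustration rather than real MiniOCR output.

```python
import re


def summarize_markdown(content):
    """Count checkbox symbols and HTML tables in a markdown string."""
    return {
        "unchecked": content.count("☐"),
        "checked": content.count("☑"),
        # Count opening <table> tags, with or without attributes.
        "html_tables": len(re.findall(r"<table\b", content, flags=re.IGNORECASE)),
    }


# Illustrative input only; not actual MiniOCR output.
sample = "☑ Reviewed\n☐ Signed\n<table><tr><td>Total</td></tr></table>"
summary = summarize_markdown(sample)
print(summary)
```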

Error Handling

import asyncio
from miniocr import MiniOCR

async def handle_errors():
    ocr = MiniOCR()
    
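    try:
        result = await ocr.ocr("nonexistent.pdf")
        print(result["content"])
    except ValueError as e:
        # Raised when the file type is not supported
        print(f"Unsupported file type: {e}")
    except Exception as e:
        # Any other failure during processing
        print(f"Processing error: {e}")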

asyncio.run(handle_errors())
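
Transient failures such as network errors are often worth retrying. The following is a generic sketch of retry with exponential backoff, not a MiniOCR feature: retry_async and its parameters are hypothetical, and which exception types MiniOCR raises for transient errors is an assumption you would need to verify.

```python
import asyncio


async def retry_async(make_call, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Await make_call() up to `attempts` times, doubling the delay each time.

    make_call must be a zero-argument callable returning a fresh awaitable,
    because a coroutine object cannot be awaited twice.
    """
    for attempt in range(attempts):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)
```

With a MiniOCR instance named ocr, a call could look like await retry_async(lambda: ocr.ocr("document.pdf")). Narrow retry_on to the exceptions you consider transient, since retrying a ValueError for an unsupported file type will not help.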

Performance Tips

  1. Adjust concurrency: Increase the concurrency parameter (default: 5) to process multi-page documents faster; lower it if you run into rate limits
  2. Use appropriate models: gpt-4o-mini for cost-effectiveness, gpt-4o for higher accuracy
  3. Process in batches: For large numbers of files, process them in batches to avoid rate limits
  4. Local processing: Keep files local when possible to avoid download overhead
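
Batching across files, as suggested above, can be sketched with an asyncio.Semaphore that caps how many files are in flight at once. This is an illustration under stated assumptions: process_all is a hypothetical helper, not part of MiniOCR, and worker stands for any coroutine function that takes a path, such as the ocr method of a MiniOCR instance.

```python
import asyncio


async def process_all(paths, worker, max_files_in_flight=3):
    """Run worker(path) for every path, at most max_files_in_flight at once.

    Returns a dict mapping each path to its result, or to the exception
    raised for that path, so one bad file does not abort the batch.
    """
    semaphore = asyncio.Semaphore(max_files_in_flight)

    async def guarded(path):
        async with semaphore:
            try:
                return path, await worker(path)
            except Exception as exc:  # record the failure and keep going
                return path, exc

    pairs = await asyncio.gather(*(guarded(path) for path in paths))
    return dict(pairs)
```

With a MiniOCR instance named ocr, this could be called as await process_all(["a.pdf", "b.pdf"], ocr.ocr). Note that the total number of simultaneous API requests is then roughly max_files_in_flight multiplied by the per-file concurrency value, so keep the product within your rate limits.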

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Testing

Run the test suite:

pytest tests/

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

v0.0.1

  • Initial release
  • Support for images, PDF, and PPTX files
  • Async processing with concurrency control
  • Cross-platform compatibility

Support

If you encounter any issues or have questions, please open an issue on GitHub.

Acknowledgments
