A small OCR package
Project description
MiniOCR
A powerful and easy-to-use Python package for performing Optical Character Recognition (OCR) on images, PDF documents, and PowerPoint presentations using OpenAI's Vision API.
Features
- 🖼️ Multi-format support: Process images (PNG, JPG, JPEG, GIF, BMP, TIFF, WebP), PDF files, and PPTX presentations
- ⚡ Parallel processing: Concurrent processing of multiple pages/slides for improved performance
- 🌐 Cross-platform: Works on Windows, macOS, and Linux
- 📄 Visual OCR for PPTX: Converts PowerPoint slides to images for accurate visual content extraction
- 🔄 Async support: Built with asyncio for efficient processing
- 📝 Markdown output: Converts documents to clean, structured markdown format
Installation
System Requirements
- Python: 3.8 or higher
- Operating System: Windows, macOS, or Linux
Dependencies
Python Dependencies (Auto-installed)
MiniOCR automatically installs these Python packages:
- openai - OpenAI API client for Vision processing
- aiohttp - Async HTTP client for file downloads
- aiofiles - Async file operations
- pdf2image - PDF to image conversion
- python-pptx - PowerPoint file parsing
- Pillow - Image processing and manipulation
System Dependencies (Manual Installation Required)
For PDF Processing - Poppler
macOS:
brew install poppler
Ubuntu/Debian:
sudo apt-get install poppler-utils
Windows: Download and install from Poppler for Windows
For PPTX Visual Processing - LibreOffice
macOS:
brew install --cask libreoffice
Ubuntu/Debian:
sudo apt-get install libreoffice
Windows: Download and install from LibreOffice official website
Note: LibreOffice is required for high-quality PPTX processing with visual content extraction. Without it, MiniOCR will fall back to text-only extraction. On Windows, MiniOCR automatically detects LibreOffice in common installation paths.
Install MiniOCR
From PyPI (Recommended)
pip install miniocr
From Source
git clone https://github.com/w95/miniocr.git
cd miniocr
pip install -e .
Verify Installation
from miniocr import MiniOCR, __version__
print(f"MiniOCR v{__version__} installed successfully!")
Quick Start
Setup
First, you'll need an OpenAI API key. Set it as an environment variable:
export OPENAI_API_KEY="your-api-key-here"
Or pass it directly when initializing the class.
Basic Usage
import asyncio
from miniocr import MiniOCR
async def main():
# Initialize with API key (or use environment variable)
ocr = MiniOCR(api_key="your-api-key-here")
# Process an image
result = await ocr.ocr("path/to/image.jpg")
print(result["content"])
# Process a PDF
result = await ocr.ocr("path/to/document.pdf")
print(f"Processed {result['pages']} pages")
print(result["content"])
# Process a PowerPoint presentation
result = await ocr.ocr("path/to/presentation.pptx")
print(result["content"])
if __name__ == "__main__":
asyncio.run(main())
Advanced Usage
import asyncio
from miniocr import MiniOCR
async def advanced_example():
ocr = MiniOCR()
# Process with custom settings
result = await ocr.ocr(
file_path="document.pdf",
model="gpt-4o", # Use different OpenAI model
concurrency=10, # Process up to 10 pages simultaneously
output_dir="./output", # Save markdown to file
cleanup=True # Clean up temporary files
)
print(f"File: {result['file_name']}")
print(f"Pages processed: {result['pages']}")
print(f"Content length: {len(result['content'])} characters")
asyncio.run(advanced_example())
Processing URLs
import asyncio
from miniocr import MiniOCR
async def process_url():
ocr = MiniOCR()
# Process a file from URL
result = await ocr.ocr("https://example.com/document.pdf")
print(result["content"])
asyncio.run(process_url())
API Reference
MiniOCR Class
__init__(api_key: str = None)
Initialize the MiniOCR instance.
Parameters:
api_key(str, optional): OpenAI API key. If not provided, will useOPENAI_API_KEYenvironment variable.
async ocr(file_path, model="gpt-4o-mini", concurrency=5, output_dir=None, cleanup=True)
Process a file and extract text using OCR.
Parameters:
file_path(str): Path or URL to the file to processmodel(str): OpenAI model to use (default: "gpt-4o-mini")concurrency(int): Number of concurrent API requests (default: 5)output_dir(str, optional): Directory to save markdown outputcleanup(bool): Whether to clean up temporary files (default: True)
Returns:
dict: Dictionary containing:content(str): Extracted text in markdown formatpages(int): Number of pages/slides processedfile_name(str): Name of the processed file
Supported file types:
- Images:
.png,.jpg,.jpeg,.gif,.bmp,.tiff,.webp - Documents:
.pdf - Presentations:
.pptx
Configuration
Required Configuration
OpenAI API Key
You must provide an OpenAI API key to use MiniOCR. You can set it in two ways:
Option 1: Environment Variable (Recommended)
export OPENAI_API_KEY="your-api-key-here"
Option 2: Pass directly to class
from miniocr import MiniOCR
ocr = MiniOCR(api_key="your-api-key-here")
Optional Dependencies Behavior
Without Poppler (PDF Processing)
- Effect: PDF processing will fail
- Error:
pdf2imagewill raise an exception - Solution: Install Poppler following the instructions above
Without LibreOffice (PPTX Processing)
- Effect: Falls back to text-only extraction from PPTX files
- Warning: Visual content (charts, images, formatting) will be lost
- Solution: Install LibreOffice for full visual processing capabilities
Model Options
MiniOCR supports various OpenAI models:
gpt-4o-mini(default, cost-effective)gpt-4o(higher accuracy)gpt-4-turbo
Output Format
MiniOCR converts documents to clean markdown with the following features:
- Tables: Converted to HTML format for better structure
- Checkboxes: Represented as ☐ (unchecked) and ☑ (checked)
- Special elements: Logos, watermarks, and page numbers are wrapped in brackets
- Charts and infographics: Interpreted and converted to markdown tables when applicable
Error Handling
import asyncio
from miniocr import MiniOCR
async def handle_errors():
ocr = MiniOCR()
try:
result = await ocr.ocr("nonexistent.pdf")
except ValueError as e:
print(f"Unsupported file type: {e}")
except Exception as e:
print(f"Processing error: {e}")
asyncio.run(handle_errors())
Troubleshooting
Common Issues
"No module named 'pdf2image'" or PDF processing fails
Solution: Install Poppler system dependency
# macOS
brew install poppler
# Ubuntu/Debian
sudo apt-get install poppler-utils
"LibreOffice conversion failed" for PPTX files
Solution: Install LibreOffice
# macOS
brew install --cask libreoffice
# Ubuntu/Debian
sudo apt-get install libreoffice
"Invalid API key" or OpenAI authentication errors
Solution: Verify your OpenAI API key
# Check if environment variable is set
echo $OPENAI_API_KEY
# Or test with a simple script
python -c "from openai import OpenAI; client = OpenAI(); print('API key is valid')"
PPTX files show only text content, missing charts/images
Cause: LibreOffice not installed, falling back to text-only extraction
Solution: Install LibreOffice for full visual processing capabilities
"soffice command not found" errors
Cause: LibreOffice not in system PATH
Solutions:
- macOS: Ensure LibreOffice is installed via Homebrew:
brew install --cask libreoffice - Linux: Install via package manager:
sudo apt-get install libreoffice - Windows: Add LibreOffice to PATH or reinstall
Rate limiting or quota exceeded errors
Cause: Too many requests to OpenAI API
Solutions:
- Reduce
concurrencyparameter (try 1-3 for free tier) - Add delays between processing batches
- Upgrade your OpenAI plan for higher rate limits
Dependencies Verification
Check Python Dependencies
# Verify MiniOCR installation
from miniocr import MiniOCR, __version__
print(f"MiniOCR v{__version__} installed")
# Check key dependencies
import openai, aiohttp, pdf2image, pptx
print("All Python dependencies available")
Check System Dependencies
# Test Poppler installation
pdftoppm -h
# Test LibreOffice installation
soffice --version
Performance Tips
- Adjust concurrency: Increase
concurrencyparameter for faster processing of multi-page documents - Use appropriate models:
gpt-4o-minifor cost-effectiveness,gpt-4ofor higher accuracy - Process in batches: For large numbers of files, process them in batches to avoid rate limits
- Local processing: Keep files local when possible to avoid download overhead
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
Testing
Run the test suite:
pytest tests/
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
v0.0.4
- Enhanced Windows Support: Improved LibreOffice detection on Windows systems
- Cross-platform Compatibility: Automatically finds LibreOffice in common installation paths
- Robust Error Handling: Better timeout and fallback mechanisms for PPTX processing
- Improved Reliability: More resilient LibreOffice executable detection across platforms
v0.0.3
- Enhanced PPTX Processing: Now uses LibreOffice for PDF conversion and OpenAI Vision API
- Visual Content Extraction: Captures charts, images, and formatting from PowerPoint slides
- Improved Accuracy: Better OCR results for complex PPTX layouts
- Fallback Support: Graceful degradation to text extraction if LibreOffice unavailable
- Updated Documentation: Comprehensive dependency and troubleshooting information
v0.0.2
- PyPI Publication: Package published to Python Package Index
- Improved Package Structure: Better organization and imports
- Enhanced README: Complete documentation with examples
- Testing Infrastructure: Comprehensive test suite
v0.0.1
- Initial Release: Basic OCR functionality
- Multi-format Support: Images, PDF, and PPTX files
- Async Processing: Concurrent processing with configurable limits
- Cross-platform Compatibility: Windows, macOS, and Linux support
Support
If you encounter any issues or have questions, please open an issue on GitHub.
Acknowledgments
- Built with OpenAI's Vision API
- Uses pdf2image for PDF processing
- Uses python-pptx for PowerPoint processing
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file miniocr-0.0.4.tar.gz.
File metadata
- Download URL: miniocr-0.0.4.tar.gz
- Upload date:
- Size: 13.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ec5ad39ae9bf930e982703a6f86cd53acde1979d8f18bc0dc104720b54a6d06f
|
|
| MD5 |
c5a33bf3194e36771e0c03249aab308a
|
|
| BLAKE2b-256 |
8df302b685c7e466bc65457bc89dc10b18c8825a76ea2d740ea49303d938b485
|
File details
Details for the file miniocr-0.0.4-py3-none-any.whl.
File metadata
- Download URL: miniocr-0.0.4-py3-none-any.whl
- Upload date:
- Size: 9.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5ed1eded2d1e7fec9b8054de685fe0ecafe2ae8b14c2c4e84469879ab451923d
|
|
| MD5 |
16d26273cb4196b76ec3711b811bb8cf
|
|
| BLAKE2b-256 |
5ede2b6b7cffba8c6d9f91cb017497b9ebb516c5fdd71510c3068d951a2f668b
|