Invoice OCR System - Convert invoices between PDF, JSON, XML, HTML formats using OCR
InvOCR - Invoice OCR & Conversion System
Universal document processing system with OCR capabilities for invoices, receipts, and financial documents
Features
Document Processing
- PDF → Images (PNG/JPG) with configurable DPI
- Image → JSON using OCR (Tesseract + EasyOCR)
- PDF → JSON (direct text extraction with OCR fallback)
- JSON → XML (EU Invoice standard format)
- JSON → HTML (responsive templates)
- HTML → PDF (professional output)
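The "direct text + OCR fallback" strategy for PDF → JSON can be sketched roughly as follows. This is a minimal illustration, not InvOCR's actual implementation: the function name, the `min_chars` threshold, and the two injected callables are all hypothetical stand-ins for a real PDF text extractor and a real OCR call.

```python
from typing import Callable


def extract_text_with_fallback(
    pdf_path: str,
    read_text_layer: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 20,  # assumed threshold for "the text layer is usable"
) -> tuple[str, str]:
    """Return (text, source), where source is 'direct' or 'ocr'."""
    text = read_text_layer(pdf_path).strip()
    if len(text) >= min_chars:
        return text, "direct"
    # Text layer missing or too short (e.g. a scanned PDF): fall back to OCR.
    return run_ocr(pdf_path).strip(), "ocr"


# Usage with stand-in extractors; replace them with real PDF and OCR calls.
scanned = extract_text_with_fallback(
    "scan.pdf",
    read_text_layer=lambda path: "",
    run_ocr=lambda path: "Invoice No. 2024/001 Total: 123.00 EUR",
)
print(scanned)
```

Trying the text layer first avoids OCR cost and recognition errors for digitally generated PDFs, while scanned documents still get processed.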
Multi-language Support
- English, Polish, German, French, Spanish, Italian
- Auto-detection of document language
- Custom language combinations
Document Types
- Invoices (commercial invoices)
- Receipts (retail receipts)
- Payment confirmations
- Financial documents
- Custom business documents
Interfaces
- CLI - Command line interface
- REST API - Web API with OpenAPI docs
- Docker - Containerized deployment
- Batch processing - Multiple files
Project Structure
invocr/
├── invocr/                     # Main package
│   ├── core/                   # Core processing modules
│   │   ├── ocr.py              # OCR engine (Tesseract + EasyOCR)
│   │   ├── converter.py        # Universal format converter
│   │   ├── extractor.py        # Data extraction logic
│   │   └── validator.py        # Data validation
│   │
│   ├── formats/                # Format-specific handlers
│   │   ├── pdf.py              # PDF operations
│   │   ├── image.py            # Image processing
│   │   ├── json_handler.py     # JSON operations
│   │   ├── xml_handler.py      # EU XML format
│   │   └── html_handler.py     # HTML generation
│   │
│   ├── api/                    # REST API
│   │   ├── main.py             # FastAPI application
│   │   ├── routes.py           # API endpoints
│   │   └── models.py           # Pydantic models
│   │
│   ├── cli/                    # Command line interface
│   │   └── commands.py         # CLI commands
│   │
│   └── utils/                  # Utilities
│       ├── config.py           # Configuration
│       ├── logger.py           # Logging setup
│       └── helpers.py          # Helper functions
│
├── tests/                      # Test suite
├── scripts/                    # Installation scripts
├── docs/                       # Documentation
├── Dockerfile                  # Docker configuration
├── docker-compose.yml          # Docker Compose
├── pyproject.toml              # Poetry configuration
└── README.md                   # This file
Quick Start
Option 1: Auto Installation (Recommended)
# Clone repository
git clone https://github.com/your-username/invocr.git
cd invocr
# Run installation script
chmod +x scripts/install.sh
./scripts/install.sh
Option 2: Manual Installation
# Install system dependencies (Ubuntu/Debian)
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-pol poppler-utils \
libpango-1.0-0 libharfbuzz0b python3-dev build-essential
# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -
# Install Python dependencies
poetry install
# Setup environment
cp .env.example .env
Option 3: Docker
# Using Docker Compose (easiest)
docker-compose up
# Or build manually
docker build -t invocr .
docker run -p 8000:8000 invocr
Usage Examples
CLI Commands
# Convert PDF to JSON
invocr convert invoice.pdf invoice.json
# Convert with specific languages
invocr convert -l en,pl,de document.pdf output.json
# PDF to images
invocr pdf2img document.pdf ./images/ --format png --dpi 300
# Image to JSON (OCR)
invocr img2json scan.png data.json --doc-type invoice
# JSON to EU XML format
invocr json2xml data.json invoice.xml
# Batch processing
invocr batch ./input_files/ ./output/ --format json --parallel 4
# Full pipeline: PDF → IMG → JSON → XML → HTML → PDF
invocr pipeline document.pdf ./results/
# Start API server
invocr serve --host 0.0.0.0 --port 8000
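What `invocr batch ... --parallel 4` does conceptually can be sketched with a thread pool. This is only an illustration under assumptions: `batch_convert` and `convert_one` are hypothetical names, and whether InvOCR uses threads or processes internally is not stated in this README.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def batch_convert(
    input_dir: Path,
    output_dir: Path,
    convert_one: Callable[[Path, Path], None],
    workers: int = 4,
    suffix: str = ".json",
) -> list[Path]:
    """Convert every PDF in input_dir concurrently; return the output paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(input_dir.glob("*.pdf"))
    targets = [output_dir / (src.stem + suffix) for src in sources]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all tasks and re-raises the first worker exception.
        list(pool.map(convert_one, sources, targets))
    return targets


def fake_convert(src: Path, dst: Path) -> None:
    """Stand-in for a real conversion: writes a tiny JSON file."""
    dst.write_text('{"source": "%s"}' % src.name, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    inp, out = Path(tmp, "in"), Path(tmp, "out")
    inp.mkdir()
    for name in ("a.pdf", "b.pdf"):
        (inp / name).write_bytes(b"%PDF-1.4")
    results = batch_convert(inp, out, fake_convert, workers=2)
    names = [p.name for p in results]
    created = all(p.exists() for p in results)
print(names, created)
```

Threads are a reasonable fit when the heavy work happens in external programs or native libraries; for pure-Python CPU-bound conversion a process pool may scale better.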
REST API
# Start server
invocr serve
# Convert file
curl -X POST "http://localhost:8000/convert" \
-F "file=@invoice.pdf" \
-F "target_format=json" \
-F "languages=en,pl"
# Check job status
curl "http://localhost:8000/status/{job_id}"
# Download result
curl "http://localhost:8000/download/{job_id}" -o result.json
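The convert → status → download flow implies polling the status endpoint until the job finishes. A minimal sketch of such a polling loop follows. The response shape is an assumption: the `"status"` key and the values `"completed"` and `"failed"` are hypothetical, so check the OpenAPI docs of a running server for the real field names. The HTTP call is injected as `get_status` so the loop can be shown without a server.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[str], dict],
    job_id: str,
    interval: float = 1.0,
    timeout: float = 60.0,
) -> dict:
    """Poll get_status(job_id) until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        # Assumed field name and terminal states; not confirmed by the API docs.
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
        time.sleep(interval)


# Stand-in for an HTTP GET to /status/{job_id}.
responses = iter([{"status": "processing"}, {"status": "completed"}])
final = wait_for_job(lambda job_id: next(responses), "demo-job", interval=0.01)
print(final)
```

In real use, `get_status` would wrap a GET request to `http://localhost:8000/status/{job_id}` and return the parsed JSON body.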
Python API
from invocr import create_converter
# Create converter instance
converter = create_converter(languages=['en', 'pl', 'de'])
# Convert PDF to JSON
result = converter.pdf_to_json('invoice.pdf')
print(result)
# Convert image to JSON with OCR
data = converter.image_to_json('scan.png', document_type='invoice')
# Convert JSON to EU XML
xml_content = converter.json_to_xml(data, format='eu_invoice')
# Full conversion pipeline
result = converter.convert('input.pdf', 'output.json', 'auto', 'json')
API Documentation
When running the API server, visit:
- Interactive docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI JSON: http://localhost:8000/openapi.json
Key Endpoints
- POST /convert - Convert single file
- POST /convert/pdf2img - PDF to images
- POST /convert/img2json - Image OCR to JSON
- POST /batch/convert - Batch processing
- GET /status/{job_id} - Job status
- GET /download/{job_id} - Download result
- GET /health - Health check
- GET /info - System information
Configuration
Environment Variables
Key configuration options in .env:
# OCR Settings
DEFAULT_OCR_ENGINE=auto # tesseract, easyocr, auto
DEFAULT_LANGUAGES=en,pl,de,fr,es # Supported languages
OCR_CONFIDENCE_THRESHOLD=0.3 # Minimum confidence
# Processing
MAX_FILE_SIZE=52428800 # 50MB limit
PARALLEL_WORKERS=4 # Concurrent processing
MAX_PAGES_PER_PDF=10 # Page limit
# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
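To illustrate how these settings map to typed values, here is a small sketch that parses `.env`-style text and converts a few of the options listed above. The `parse_env`, `load_settings`, and `OcrSettings` names are hypothetical, and InvOCR's own `utils/config.py` may work differently.

```python
from dataclasses import dataclass


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        # Drop full-line and inline comments (values containing '#' are not supported).
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            values[key.strip()] = value.strip()
    return values


@dataclass
class OcrSettings:
    engine: str
    languages: list[str]
    confidence_threshold: float
    max_file_size: int


def load_settings(env: dict[str, str]) -> OcrSettings:
    """Convert raw strings into typed settings, with defaults from this README."""
    return OcrSettings(
        engine=env.get("DEFAULT_OCR_ENGINE", "auto"),
        languages=env.get("DEFAULT_LANGUAGES", "en").split(","),
        confidence_threshold=float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.3")),
        max_file_size=int(env.get("MAX_FILE_SIZE", "52428800")),
    )


sample = """
DEFAULT_OCR_ENGINE=auto # tesseract, easyocr, auto
DEFAULT_LANGUAGES=en,pl,de,fr,es # Supported languages
OCR_CONFIDENCE_THRESHOLD=0.3 # Minimum confidence
MAX_FILE_SIZE=52428800 # 50MB limit
"""
settings = load_settings(parse_env(sample))
print(settings)
```

Note that `MAX_FILE_SIZE=52428800` is exactly 50 × 1024 × 1024 bytes.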
Supported Languages
| Code | Language | Tesseract | EasyOCR |
|---|---|---|---|
| en | English | ✓ | ✓ |
| pl | Polish | ✓ | ✓ |
| de | German | ✓ | ✓ |
| fr | French | ✓ | ✓ |
| es | Spanish | ✓ | ✓ |
| it | Italian | ✓ | ✓ |
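Tesseract itself identifies languages by three-letter traineddata names (for example `eng`, `pol`) joined with `+`, while EasyOCR generally accepts the two-letter codes as given. The sketch below shows one way the two-letter codes from the table could be translated for Tesseract; the helper name is hypothetical, and whether InvOCR performs this mapping in exactly this way is an assumption.

```python
# Two-letter codes from the table mapped to Tesseract traineddata names.
TESSERACT_CODES = {
    "en": "eng",
    "pl": "pol",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
}


def to_tesseract_lang(codes: str) -> str:
    """Convert 'en,pl,de' into Tesseract's 'eng+pol+deu' form."""
    parts = [c.strip().lower() for c in codes.split(",") if c.strip()]
    unknown = [c for c in parts if c not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"Unsupported language code(s): {', '.join(unknown)}")
    return "+".join(TESSERACT_CODES[c] for c in parts)


print(to_tesseract_lang("en,pl,de"))  # eng+pol+deu
```

The resulting string is what would be passed to Tesseract's `-l` option; each language also needs its traineddata package installed (for example `tesseract-ocr-pol`).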
Supported Formats
Input Formats
- PDF (.pdf)
- Images (.png, .jpg, .jpeg, .tiff, .bmp)
- JSON (.json)
- XML (.xml)
- HTML (.html)
Output Formats
- JSON - Structured data
- XML - EU Invoice standard
- HTML - Responsive templates
- PDF - Professional documents
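As a rough illustration of the JSON → XML step, the sketch below serializes a nested dictionary with the standard library. It is a generic conversion only: the field names are hypothetical, and the output does not follow the EU invoice schema that InvOCR's `xml_handler.py` targets.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(tag: str, value) -> ET.Element:
    """Recursively convert dicts, lists and scalars into XML elements."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(dict_to_xml(key, child))
    elif isinstance(value, list):
        for item in value:
            element.append(dict_to_xml("item", item))
    else:
        element.text = str(value)
    return element


invoice = {  # hypothetical field names, for illustration only
    "number": "FV/2024/001",
    "seller": {"name": "ACME Sp. z o.o."},
    "items": [{"description": "Service", "total": 100.0}],
}
xml_text = ET.tostring(dict_to_xml("invoice", invoice), encoding="unicode")
print(xml_text)
```

A standards-compliant export would additionally need fixed element names, namespaces, and schema validation.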
Testing
# Run all tests
poetry run pytest
# Run with coverage
poetry run pytest --cov=invocr
# Run specific test file
poetry run pytest tests/test_ocr.py
# Run API tests
poetry run pytest tests/test_api.py
Deployment
Production with Docker
# docker-compose.prod.yml
version: '3.8'
services:
  invocr:
    image: invocr:latest
    ports:
      - "80:8000"
    environment:
      - ENVIRONMENT=production
      - WORKERS=4
    volumes:
      - ./data:/app/data
Kubernetes
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invocr
spec:
  replicas: 3
  selector:
    matchLabels:
      app: invocr
  template:
    metadata:
      labels:
        app: invocr
    spec:
      containers:
        - name: invocr
          image: invocr:latest
          ports:
            - containerPort: 8000
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make changes
- Add tests
- Run tests (poetry run pytest)
- Commit changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Development Setup
# Install development dependencies
poetry install --with dev
# Install pre-commit hooks
poetry run pre-commit install
# Run linting
poetry run black invocr/
poetry run isort invocr/
poetry run flake8 invocr/
# Run type checking
poetry run mypy invocr/
Performance
Benchmarks
| Operation | Time | Memory |
|---|---|---|
| PDF → JSON (1 page) | ~2-3s | ~50MB |
| Image OCR → JSON | ~1-2s | ~30MB |
| JSON → XML | ~0.1s | ~10MB |
| JSON → HTML | ~0.2s | ~15MB |
| HTML → PDF | ~1-2s | ~40MB |
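Figures like these depend heavily on hardware and document content. A minimal sketch for taking comparable measurements on your own machine is shown below; the `measure` helper is hypothetical, and `tracemalloc` only sees Python-level allocations, so memory used by native OCR libraries or subprocesses is not included.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Return (result, seconds, peak_bytes) for a single call of func."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Stand-in workload; replace with a real conversion call.
result, seconds, peak = measure(sorted, range(100_000, 0, -1))
print(f"{seconds:.3f}s, peak {peak / 1_000_000:.1f} MB")
```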
Optimization Tips
- Use --parallel for batch processing
- Set IMAGE_ENHANCEMENT=false for faster OCR
- Use the tesseract engine for better performance
- Configure MAX_PAGES_PER_PDF for large documents
Security
- File upload validation
- Size limits enforced
- Input sanitization
- No execution of uploaded content
- Rate limiting available
- CORS configuration
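The first few points (upload validation, size limits, input sanitization) can be sketched as a single check. This is an illustrative example rather than InvOCR's actual validation code: the function name is hypothetical, and the allowed extensions and the 50 MB limit are taken from the Supported Formats and Configuration sections of this README.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".json", ".xml", ".html",
}
MAX_FILE_SIZE = 52_428_800  # 50 MB, the MAX_FILE_SIZE default from .env


def validate_upload(filename: str, size: int) -> str:
    """Return a sanitized base name, or raise ValueError if the upload is rejected."""
    # Keep only the final path component so "../../x.pdf" cannot leave the upload dir.
    safe_name = Path(filename.replace("\\", "/")).name
    if not safe_name:
        raise ValueError("Empty file name")
    if Path(safe_name).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {safe_name}")
    if size <= 0 or size > MAX_FILE_SIZE:
        raise ValueError(f"File size {size} is outside the allowed range")
    return safe_name


print(validate_upload("../uploads/Invoice.PDF", 1024))  # Invoice.PDF
```

Extension checks are only a first filter; inspecting the file's actual content type is a sensible additional step for production use.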
Requirements
System Requirements
- Python: 3.9+
- Memory: 1GB+ RAM
- Storage: 500MB+ free space
- OS: Linux, macOS, Windows (Docker)
Dependencies
- Tesseract OCR: Text recognition
- EasyOCR: Neural OCR engine
- WeasyPrint: HTML to PDF conversion
- FastAPI: Web framework
- Pydantic: Data validation
Troubleshooting
Common Issues
OCR not working:
# Check Tesseract installation
tesseract --version
# Install missing languages
sudo apt install tesseract-ocr-pol
WeasyPrint errors:
# Install system dependencies
sudo apt install libpango-1.0-0 libharfbuzz0b
Import errors:
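# Reinstall dependencies so the environment matches the lock file
# (poetry install has no --force option; on Poetry 2.x you can also run: poetry sync)
poetry install --sync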
Permission errors:
# Fix file permissions
chmod -R 755 uploads/ output/
Support
- Email: support@invocr.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Wiki: Project Wiki
License
This project is licensed under the Apache License - see the LICENSE file for details.
Acknowledgments
- Tesseract OCR - OCR engine
- EasyOCR - Neural OCR
- FastAPI - Web framework
- WeasyPrint - HTML/CSS to PDF
- Poetry - Dependency management
Made with ❤️ for the open source community
Star this repository if you find it useful!
---
scripts/setup_env.py
#!/usr/bin/env python3
"""
Environment setup script for InvOCR
Configures environment variables and validates setup
"""

import os
import shutil
import sys
from pathlib import Path


def setup_environment():
    """Setup environment variables and directories"""
    print("Setting up InvOCR environment...")

    # Project root (this script lives in scripts/)
    project_root = Path(__file__).parent.parent
    os.chdir(project_root)

    # Create directories
    directories = ["uploads", "output", "temp", "logs", "static"]
    for directory in directories:
        Path(directory).mkdir(exist_ok=True)
        print(f"Created directory: {directory}")

    # Setup environment file
    env_file = Path(".env")
    env_example = Path(".env.example")
    if not env_file.exists() and env_example.exists():
        shutil.copy(env_example, env_file)
        print("✅ Created .env file from template")
    elif not env_file.exists():
        # Create basic .env file
        create_basic_env_file(env_file)
        print("✅ Created basic .env file")

    # Validate setup
    validate_setup()
    print("Environment setup completed!")


def create_basic_env_file(env_path: Path):
    """Create basic environment file"""
    content = """# InvOCR Environment Configuration
ENVIRONMENT=development
DEBUG=true
LOG_LEVEL=INFO

# Server
HOST=0.0.0.0
PORT=8000

# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
LOGS_DIR=./logs

# OCR
DEFAULT_LANGUAGES=en,pl,de,fr,es,it
OCR_CONFIDENCE_THRESHOLD=0.3

# Processing
MAX_FILE_SIZE=52428800
PARALLEL_WORKERS=4

# Security
SECRET_KEY=change-me-in-production
"""
    with open(env_path, 'w', encoding='utf-8') as f:
        f.write(content)


def validate_setup():
    """Validate environment setup"""
    print("Validating setup...")

    # Check Python version
    if sys.version_info < (3, 9):
        print("❌ Python 3.9+ required")
        return False
    print(f"✅ Python {sys.version_info.major}.{sys.version_info.minor}")

    # Check directories
    required_dirs = ["uploads", "output", "temp", "logs"]
    for directory in required_dirs:
        if Path(directory).exists():
            print(f"✅ Directory exists: {directory}")
        else:
            print(f"❌ Missing directory: {directory}")

    # Check environment file
    if Path(".env").exists():
        print("✅ Environment file exists")
    else:
        print("❌ Missing .env file")

    # Try importing invocr
    try:
        import invocr  # noqa: F401
        print("✅ InvOCR package importable")
    except ImportError as e:
        print(f"❌ Cannot import InvOCR: {e}")
        print("Run: poetry install")

    return True


if __name__ == "__main__":
    setup_environment()
---
docs/api.md
InvOCR API Documentation
Overview
The InvOCR REST API provides endpoints for document conversion and OCR processing.
Base URL
http://localhost:8000
Authentication
No authentication is currently required for local development.
Endpoints
Health Check
GET /health
Returns system health status.
System Information
GET /info
Returns supported formats, languages, and features.
Convert File
POST /convert
Convert uploaded file to specified format.
Parameters:
- file (file): Input file
- target_format (string): Output format (json, xml, html, pdf)
- languages (string): Comma-separated language codes
- async_processing (boolean): Process in background
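A client might assemble the non-file form fields for `POST /convert` along these lines. This is a sketch under assumptions: `build_convert_form` is a hypothetical helper, and sending the boolean as the string `"true"` or `"false"` is an assumed encoding that should be checked against the interactive docs.

```python
SUPPORTED_TARGETS = ("json", "xml", "html", "pdf")


def build_convert_form(
    target_format: str,
    languages=("en",),
    async_processing: bool = False,
) -> dict:
    """Build the text form fields that accompany the uploaded file."""
    if target_format not in SUPPORTED_TARGETS:
        raise ValueError(f"target_format must be one of {SUPPORTED_TARGETS}")
    return {
        "target_format": target_format,
        "languages": ",".join(languages),  # comma-separated language codes
        "async_processing": "true" if async_processing else "false",  # assumed encoding
    }


form = build_convert_form("json", languages=("en", "pl"), async_processing=True)
print(form)
```

The resulting dictionary can be passed as the form data of a multipart request, with the document itself attached under the `file` field.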
Check Job Status
GET /status/{job_id}
Get conversion job status.
Download Result
GET /download/{job_id}
Download conversion result.
Example Usage
# Convert PDF to JSON
curl -X POST "http://localhost:8000/convert" \
-F "file=@invoice.pdf" \
-F "target_format=json"
# Check status
curl "http://localhost:8000/status/job-id"
# Download result
curl "http://localhost:8000/download/job-id" -o result.json
---
docs/cli.md
InvOCR CLI Documentation
Installation
poetry install
Usage
Basic Commands
# Show help
invocr --help
# Convert single file
invocr convert input.pdf output.json
# Convert with specific languages
invocr convert -l en,pl,de document.pdf output.json
# Convert PDF to images
invocr pdf2img document.pdf ./images/
# Image to JSON (OCR)
invocr img2json scan.png data.json
# JSON to XML
invocr json2xml data.json invoice.xml
# Batch processing
invocr batch ./pdfs/ ./output/ --format json
# Full pipeline
invocr pipeline document.pdf ./results/
# Start API server
invocr serve
Advanced Options
# Batch processing with parallelization
invocr batch ./input/ ./output/ --parallel 8 --format xml
# Custom OCR languages
invocr img2json scan.png data.json --languages en,pl,de,fr
# Custom templates
invocr convert data.json invoice.html --template classic
# API server with custom host/port
invocr serve --host 0.0.0.0 --port 9000
Examples
# Convert invoice PDF to JSON
invocr convert invoice.pdf invoice.json
# Process receipt image
invocr img2json receipt.jpg receipt.json --doc-type receipt
# Generate EU standard XML
invocr json2xml invoice.json eu_invoice.xml
# Create HTML invoice
invocr json2html invoice.json invoice.html --template modern
File details
Details for the file invocr-1.0.2.tar.gz.
File metadata
- Download URL: invocr-1.0.2.tar.gz
- Upload date:
- Size: 63.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.11.12 Linux/6.14.9-300.fc42.x86_64
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d5bec2ad2eb2435e12035a96d690b7fb1dc4fd8bfc50e45869048bc22686738b |
| MD5 | 38e64e716b0d5bb91ca5d4c6e4720314 |
| BLAKE2b-256 | baa41d7963bb1cab606043fd413dfee66c1102a1d0b1bcb9594882840463a9d5 |
File details
Details for the file invocr-1.0.2-py3-none-any.whl.
File metadata
- Download URL: invocr-1.0.2-py3-none-any.whl
- Upload date:
- Size: 69.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.11.12 Linux/6.14.9-300.fc42.x86_64
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2f378d82a1453c7e272829eb60143b9a163d82096c3c0ba1af10cedcf427de05 |
| MD5 | d5037cb662516b7aa93fb4aa35734af4 |
| BLAKE2b-256 | 0f264995ce01d48a74ff0d8bee54c3d9932b4490c89c1fd8b7160d7aeeb44fd0 |