Invoice OCR System - Convert invoices between PDF, JSON, XML, HTML formats using OCR

InvOCR - Invoice OCR & Conversion System

🔍 Universal document processing system with OCR capabilities for invoices, receipts, and financial documents

🚀 Features

📄 Document Processing

PDF → Images (PNG/JPG) with configurable DPI
Image → JSON using OCR (Tesseract + EasyOCR)
PDF → JSON (direct text + OCR fallback)
JSON → XML (EU Invoice standard format)
JSON → HTML (responsive templates)
HTML → PDF (professional output)
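The "direct text + OCR fallback" step in the list above can be pictured roughly as follows. This is a minimal sketch, not InvOCR's actual implementation: the function name `pdf_to_text`, the `min_chars` threshold, and the two injected callables are all hypothetical and stand in for whatever the package really uses.

```python
from typing import Callable


def pdf_to_text(
    pdf_path: str,
    extract_text: Callable[[str], str],  # hypothetical: reads the embedded text layer
    run_ocr: Callable[[str], str],       # hypothetical: rasterizes pages and runs OCR
    min_chars: int = 20,                 # assumed threshold for "has usable text"
) -> tuple[str, str]:
    """Return (text, method), preferring the PDF's embedded text layer."""
    text = extract_text(pdf_path).strip()
    if len(text) >= min_chars:
        return text, "direct"
    # Little or no embedded text: the PDF is probably a scan, so fall back to OCR.
    return run_ocr(pdf_path).strip(), "ocr"
```

Passing the two extraction steps in as callables keeps the decision logic testable without Tesseract or a PDF library installed.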

🌍 Multi-language Support

English, Polish, German, French, Spanish, Italian
Auto-detection of document language
Custom language combinations

📋 Document Types

✅ Invoices (commercial invoices)
✅ Receipts (retail receipts)
✅ Payment confirmations
✅ Financial documents
✅ Custom business documents

🔧 Interfaces

CLI - Command line interface
REST API - Web API with OpenAPI docs
Docker - Containerized deployment
Batch processing - Multiple files

🏗️ Project Structure

invocr/
├── 📁 invocr/                 # Main package
│   ├── 📁 core/               # Core processing modules
│   │   ├── ocr.py            # OCR engine (Tesseract + EasyOCR)
│   │   ├── converter.py      # Universal format converter
│   │   ├── extractor.py      # Data extraction logic
│   │   └── validator.py      # Data validation
│   │
│   ├── 📁 formats/            # Format-specific handlers
│   │   ├── pdf.py           # PDF operations
│   │   ├── image.py         # Image processing
│   │   ├── json_handler.py  # JSON operations
│   │   ├── xml_handler.py   # EU XML format
│   │   └── html_handler.py  # HTML generation
│   │
│   ├── 📁 api/               # REST API
│   │   ├── main.py          # FastAPI application
│   │   ├── routes.py        # API endpoints
│   │   └── models.py        # Pydantic models
│   │
│   ├── 📁 cli/               # Command line interface
│   │   └── commands.py      # CLI commands
│   │
│   └── 📁 utils/             # Utilities
│       ├── config.py        # Configuration
│       ├── logger.py        # Logging setup
│       └── helpers.py       # Helper functions
│
├── 📁 tests/                 # Test suite
├── 📁 scripts/               # Installation scripts
├── 📁 docs/                  # Documentation
├── 🐳 Dockerfile             # Docker configuration
├── 🐳 docker-compose.yml     # Docker Compose
├── 📋 pyproject.toml         # Poetry configuration
└── 📖 README.md              # This file

⚡ Quick Start

Option 1: Auto Installation (Recommended)

# Clone repository
git clone https://github.com/your-username/invocr.git
cd invocr

# Run installation script
chmod +x scripts/install.sh
./scripts/install.sh

Option 2: Manual Installation

# Install system dependencies (Ubuntu/Debian)
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-pol poppler-utils \
    libpango-1.0-0 libharfbuzz0b python3-dev build-essential

# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Install Python dependencies
poetry install

# Setup environment
cp .env.example .env

Option 3: Docker

# Using Docker Compose (easiest)
docker-compose up

# Or build manually
docker build -t invocr .
docker run -p 8000:8000 invocr

📚 Usage Examples

CLI Commands

# Convert PDF to JSON
invocr convert invoice.pdf invoice.json

# Convert with specific languages
invocr convert -l en,pl,de document.pdf output.json

# PDF to images
invocr pdf2img document.pdf ./images/ --format png --dpi 300

# Image to JSON (OCR)
invocr img2json scan.png data.json --doc-type invoice

# JSON to EU XML format
invocr json2xml data.json invoice.xml

# Batch processing
invocr batch ./input_files/ ./output/ --format json --parallel 4

# Full pipeline: PDF → IMG → JSON → XML → HTML → PDF
invocr pipeline document.pdf ./results/

# Start API server
invocr serve --host 0.0.0.0 --port 8000
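For readers who want to see what `--parallel` style batch processing amounts to, here is a small sketch using only the standard library. It is an illustration under assumptions, not the code behind `invocr batch`: `batch_convert` and the `convert_one` callable are hypothetical names, and the real command may use processes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Optional


def batch_convert(
    input_dir,
    output_dir,
    convert_one: Callable[[Path, Path], None],  # hypothetical single-file converter
    workers: int = 4,
    suffix: str = ".json",
) -> Dict[str, Optional[str]]:
    """Convert every file in input_dir; map each file name to None or an error message."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(p for p in in_dir.iterdir() if p.is_file())

    def job(src: Path):
        dst = out_dir / (src.stem + suffix)
        try:
            convert_one(src, dst)
            return src.name, None
        except Exception as exc:  # one bad file should not stop the batch
            return src.name, str(exc)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(job, sources))
```

Collecting errors per file instead of raising lets the caller report failures after the whole batch has run.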

REST API

# Start server
invocr serve

# Convert file
curl -X POST "http://localhost:8000/convert" \
  -F "file=@invoice.pdf" \
  -F "target_format=json" \
  -F "languages=en,pl"

# Check job status
curl "http://localhost:8000/status/{job_id}"

# Download result
curl "http://localhost:8000/download/{job_id}" -o result.json
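The convert, status, download sequence above implies polling until the job finishes. A possible client-side sketch follows. The response shape is an assumption: this README does not show the JSON returned by `/status/{job_id}`, so the `"status"` field and the `"completed"` / `"failed"` values are placeholders to adjust against the live OpenAPI docs. The helper names are hypothetical as well.

```python
import json
import time
import urllib.request
from typing import Callable


def http_fetch_status(job_id: str, base_url: str = "http://localhost:8000") -> dict:
    """GET /status/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/status/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(job_id)
        # Assumed field name and values; check them against the real API.
        if payload.get("status") in ("completed", "failed"):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

In practice this would be called as `wait_for_job(http_fetch_status, job_id)` once `/convert` has returned a job identifier.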

Python API

from invocr import create_converter

# Create converter instance
converter = create_converter(languages=['en', 'pl', 'de'])

# Convert PDF to JSON
result = converter.pdf_to_json('invoice.pdf')
print(result)

# Convert image to JSON with OCR
data = converter.image_to_json('scan.png', document_type='invoice')

# Convert JSON to EU XML
xml_content = converter.json_to_xml(data, format='eu_invoice')

# Full conversion pipeline
result = converter.convert('input.pdf', 'output.json', 'auto', 'json')

🌐 API Documentation

When running the API server, visit:

Interactive docs: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc
OpenAPI JSON: http://localhost:8000/openapi.json

Key Endpoints

POST /convert - Convert single file
POST /convert/pdf2img - PDF to images
POST /convert/img2json - Image OCR to JSON
POST /batch/convert - Batch processing
GET /status/{job_id} - Job status
GET /download/{job_id} - Download result
GET /health - Health check
GET /info - System information

🔧 Configuration

Environment Variables

Key configuration options in .env:

# OCR Settings
DEFAULT_OCR_ENGINE=auto          # tesseract, easyocr, auto
DEFAULT_LANGUAGES=en,pl,de,fr,es # Supported languages
OCR_CONFIDENCE_THRESHOLD=0.3     # Minimum confidence

# Processing
MAX_FILE_SIZE=52428800          # 50MB limit
PARALLEL_WORKERS=4              # Concurrent processing
MAX_PAGES_PER_PDF=10           # Page limit

# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
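To make the file format above concrete, here is a minimal sketch of reading such `KEY=value  # comment` lines in Python. It is an illustration only: how InvOCR itself loads `.env` is not shown in this README, and `parse_env` is a hypothetical helper. It treats everything after `#` as a comment, which is a simplification that would break values containing `#`.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, ignoring blank lines and '#' comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and inline comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    sample = "MAX_FILE_SIZE=52428800          # 50MB limit\nPARALLEL_WORKERS=4\n"
    config = parse_env(sample)
    # 52428800 bytes is exactly 50 * 1024 * 1024.
    print(int(config["MAX_FILE_SIZE"]) // (1024 * 1024), "MiB")
```

All values come back as strings, so numeric settings such as `MAX_FILE_SIZE` and `PARALLEL_WORKERS` need an explicit `int()` conversion.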

Supported Languages

Code	Language	Tesseract	EasyOCR
`en`	English	✅	✅
`pl`	Polish	✅	✅
`de`	German	✅	✅
`fr`	French	✅	✅
`es`	Spanish	✅	✅
`it`	Italian	✅	✅
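The two-letter codes in the table are not what Tesseract itself expects: its traineddata files use three-letter codes (`eng`, `pol`, `deu`, `fra`, `spa`, `ita`), joined with `+` in the `-l` argument. The sketch below shows one way such a translation could look. Whether InvOCR performs the mapping this way is an assumption, and `tesseract_lang_arg` is a hypothetical name.

```python
# Two-letter codes from the table mapped to Tesseract traineddata names.
TESSERACT_CODES = {
    "en": "eng",
    "pl": "pol",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
}


def tesseract_lang_arg(codes: list[str]) -> str:
    """Build the value for Tesseract's -l option, e.g. 'eng+pol'."""
    unknown = [code for code in codes if code not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"unsupported language code(s): {', '.join(unknown)}")
    return "+".join(TESSERACT_CODES[code] for code in codes)
```

For example, `-l en,pl,de` on the InvOCR command line would correspond to `tesseract ... -l eng+pol+deu` under this mapping.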

📊 Supported Formats

Input Formats

PDF (.pdf)
Images (.png, .jpg, .jpeg, .tiff, .bmp)
JSON (.json)
XML (.xml)
HTML (.html)

Output Formats

JSON - Structured data
XML - EU Invoice standard
HTML - Responsive templates
PDF - Professional documents

🧪 Testing

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=invocr

# Run specific test file
poetry run pytest tests/test_ocr.py

# Run API tests
poetry run pytest tests/test_api.py

🚀 Deployment

Production with Docker

# docker-compose.prod.yml
version: '3.8'
services:
  invocr:
    image: invocr:latest
    ports:
      - "80:8000"
    environment:
      - ENVIRONMENT=production
      - WORKERS=4
    volumes:
      - ./data:/app/data

Kubernetes

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invocr
spec:
  replicas: 3
  selector:
    matchLabels:
      app: invocr
  template:
    metadata:
      labels:
        app: invocr
    spec:
      containers:
      - name: invocr
        image: invocr:latest
        ports:
        - containerPort: 8000

🤝 Contributing

Fork the repository
Create feature branch (git checkout -b feature/amazing-feature)
Make changes
Add tests
Run tests (poetry run pytest)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open Pull Request

Development Setup

# Install development dependencies
poetry install --with dev

# Install pre-commit hooks
poetry run pre-commit install

# Run linting
poetry run black invocr/
poetry run isort invocr/
poetry run flake8 invocr/

# Run type checking
poetry run mypy invocr/

📈 Performance

Benchmarks

Operation	Time	Memory
PDF → JSON (1 page)	~2-3s	~50MB
Image OCR → JSON	~1-2s	~30MB
JSON → XML	~0.1s	~10MB
JSON → HTML	~0.2s	~15MB
HTML → PDF	~1-2s	~40MB

Optimization Tips

Use --parallel for batch processing
Set IMAGE_ENHANCEMENT=false for faster OCR
Use the tesseract engine (DEFAULT_OCR_ENGINE=tesseract) for faster processing
Configure MAX_PAGES_PER_PDF for large documents

🔒 Security

File upload validation
Size limits enforced
Input sanitization
No execution of uploaded content
Rate limiting available
CORS configuration
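The first two items, upload validation and size limits, can be sketched as below. This is an illustrative check under assumptions rather than the project's actual validator: the extension set is taken from the Input Formats list, the limit from `MAX_FILE_SIZE` in the configuration section, and the function name `validate_upload` is hypothetical.

```python
from pathlib import Path

# Extensions from the "Input Formats" list; limit from MAX_FILE_SIZE (50 MB).
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".json", ".xml", ".html"}
MAX_FILE_SIZE = 52_428_800


def validate_upload(filename: str, size_bytes: int, max_size: int = MAX_FILE_SIZE) -> list[str]:
    """Return a list of problems; an empty list means the upload is acceptable."""
    problems = []
    if "/" in filename or "\\" in filename:
        problems.append("filename must not contain path separators")
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported file extension")
    if size_bytes <= 0:
        problems.append("file is empty")
    elif size_bytes > max_size:
        problems.append(f"file exceeds the {max_size} byte limit")
    return problems
```

An extension check alone does not prove the content type; a stricter validator would also inspect the file's leading bytes.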

📋 Requirements

System Requirements

Python: 3.9+
Memory: 1GB+ RAM
Storage: 500MB+ free space
OS: Linux, macOS, Windows (via Docker)

Dependencies

Tesseract OCR: Text recognition
EasyOCR: Neural OCR engine
WeasyPrint: HTML to PDF conversion
FastAPI: Web framework
Pydantic: Data validation

🐛 Troubleshooting

Common Issues

OCR not working:

# Check Tesseract installation
tesseract --version

# Install missing languages
sudo apt install tesseract-ocr-pol
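The same check can be scripted. The sketch below runs `tesseract --list-langs` and reports which language packs are missing. It is a hypothetical helper, not part of InvOCR, and it assumes the usual output layout of a header line followed by one language code per line.

```python
import shutil
import subprocess
from typing import Optional


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    return {
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.startswith("List of available languages")
    }


def installed_tesseract_languages() -> Optional[set[str]]:
    """Return the installed language codes, or None if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some versions print the list to stderr, so both streams are scanned.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)


def missing_languages(installed: set[str], wanted: list[str]) -> list[str]:
    """Return the wanted codes that are not installed, in the order given."""
    return [code for code in wanted if code not in installed]
```

On Debian and Ubuntu, a missing code such as `deu` is typically provided by the `tesseract-ocr-deu` package.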

WeasyPrint errors:

# Install system dependencies
sudo apt install libpango-1.0-0 libharfbuzz0b

Import errors:

# Reinstall dependencies (poetry install has no --force option)
poetry install

Permission errors:

# Fix file permissions
chmod -R 755 uploads/ output/

📞 Support

📧 Email: support@invocr.com
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions
📚 Wiki: Project Wiki

📄 License

This project is licensed under the Apache License - see the LICENSE file for details.

🙏 Acknowledgments

Tesseract OCR - OCR engine
EasyOCR - Neural OCR
FastAPI - Web framework
WeasyPrint - HTML/CSS to PDF
Poetry - Dependency management

Made with ❤️ for the open source community

⭐ Star this repository if you find it useful!

---

scripts/setup_env.py

#!/usr/bin/env python3
"""
Environment setup script for InvOCR
Configures environment variables and validates setup
"""

import os
import shutil
import sys
from pathlib import Path


def setup_environment():
    """Setup environment variables and directories"""
    print("🔧 Setting up InvOCR environment...")

    # Project root
    project_root = Path(__file__).parent.parent
    os.chdir(project_root)

    # Create directories
    directories = ["uploads", "output", "temp", "logs", "static"]
    for directory in directories:
        Path(directory).mkdir(exist_ok=True)
        print(f"📁 Created directory: {directory}")

    # Setup environment file
    env_file = Path(".env")
    env_example = Path(".env.example")

    if not env_file.exists() and env_example.exists():
        shutil.copy(env_example, env_file)
        print("✅ Created .env file from template")
    elif not env_file.exists():
        # Create basic .env file
        create_basic_env_file(env_file)
        print("✅ Created basic .env file")

    # Validate setup
    validate_setup()

    print("🎉 Environment setup completed!")


def create_basic_env_file(env_path: Path):
    """Create basic environment file"""
    content = """# InvOCR Environment Configuration
ENVIRONMENT=development
DEBUG=true
LOG_LEVEL=INFO

# Server
HOST=0.0.0.0
PORT=8000

# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
LOGS_DIR=./logs

# OCR
DEFAULT_LANGUAGES=en,pl,de,fr,es,it
OCR_CONFIDENCE_THRESHOLD=0.3

# Processing
MAX_FILE_SIZE=52428800
PARALLEL_WORKERS=4

# Security
SECRET_KEY=change-me-in-production
"""

    with open(env_path, 'w', encoding='utf-8') as f:
        f.write(content)


def validate_setup():
    """Validate environment setup"""
    print("🔍 Validating setup...")

    # Check Python version
    if sys.version_info < (3, 9):
        print("❌ Python 3.9+ required")
        return False

    print(f"✅ Python {sys.version_info.major}.{sys.version_info.minor}")

    # Check directories
    required_dirs = ["uploads", "output", "temp", "logs"]
    for directory in required_dirs:
        if Path(directory).exists():
            print(f"✅ Directory exists: {directory}")
        else:
            print(f"❌ Missing directory: {directory}")

    # Check environment file
    if Path(".env").exists():
        print("✅ Environment file exists")
    else:
        print("❌ Missing .env file")

    # Try importing invocr
    try:
        import invocr
        print("✅ InvOCR package importable")
    except ImportError as e:
        print(f"❌ Cannot import InvOCR: {e}")
        print("Run: poetry install")

    return True


if __name__ == "__main__":
    setup_environment()

---

docs/api.md

InvOCR API Documentation

Overview

The InvOCR REST API provides endpoints for document conversion and OCR processing.

Base URL

http://localhost:8000

Authentication

No authentication is currently required for local development.

Endpoints

Health Check

GET /health

Returns system health status.

System Information

GET /info

Returns supported formats, languages, and features.

Convert File

POST /convert

Convert uploaded file to specified format.

Parameters:

file (file): Input file
target_format (string): Output format (json, xml, html, pdf)
languages (string): Comma-separated language codes
async_processing (boolean): Process in background
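A client can check these parameters before sending the request. The helper below is a sketch only: `build_convert_fields` is a hypothetical name, and encoding the boolean as the strings `"true"` / `"false"` is an assumption about how the server parses form fields. The allowed target formats are the four listed above.

```python
from typing import Optional

VALID_TARGETS = ("json", "xml", "html", "pdf")


def build_convert_fields(
    target_format: str,
    languages: Optional[list[str]] = None,
    async_processing: bool = False,
) -> dict[str, str]:
    """Build the non-file form fields for POST /convert."""
    if target_format not in VALID_TARGETS:
        raise ValueError(f"target_format must be one of {', '.join(VALID_TARGETS)}")
    fields = {"target_format": target_format}
    if languages:
        # The API expects one comma-separated string, e.g. "en,pl".
        fields["languages"] = ",".join(languages)
    # Assumed boolean encoding; adjust if the server expects something else.
    fields["async_processing"] = "true" if async_processing else "false"
    return fields
```

The resulting dictionary is sent alongside the `file` part of a multipart form, mirroring the `-F` options in the curl example.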

Check Job Status

GET /status/{job_id}

Get conversion job status.

Download Result

GET /download/{job_id}

Download conversion result.

Example Usage

# Convert PDF to JSON
curl -X POST "http://localhost:8000/convert" \
  -F "file=@invoice.pdf" \
  -F "target_format=json"

# Check status (replace {job_id} with the ID of the conversion job)
curl "http://localhost:8000/status/{job_id}"

# Download result
curl "http://localhost:8000/download/{job_id}" -o result.json

---

docs/cli.md

InvOCR CLI Documentation

Installation

poetry install

Usage

Basic Commands

# Show help
invocr --help

# Convert single file
invocr convert input.pdf output.json

# Convert with specific languages
invocr convert -l en,pl,de document.pdf output.json

# Convert PDF to images
invocr pdf2img document.pdf ./images/

# Image to JSON (OCR)
invocr img2json scan.png data.json

# JSON to XML
invocr json2xml data.json invoice.xml

# Batch processing
invocr batch ./pdfs/ ./output/ --format json

# Full pipeline
invocr pipeline document.pdf ./results/

# Start API server
invocr serve

Advanced Options

# Batch processing with parallelization
invocr batch ./input/ ./output/ --parallel 8 --format xml

# Custom OCR languages
invocr img2json scan.png data.json --languages en,pl,de,fr

# Custom templates
invocr convert data.json invoice.html --template classic

# API server with custom host/port
invocr serve --host 0.0.0.0 --port 9000

Examples

# Convert invoice PDF to JSON
invocr convert invoice.pdf invoice.json

# Process receipt image
invocr img2json receipt.jpg receipt.json --doc-type receipt

# Generate EU standard XML
invocr json2xml invoice.json eu_invoice.xml

# Create HTML invoice
invocr json2html invoice.json invoice.html --template modern
