Invoice OCR System - Convert invoices between PDF, JSON, XML, HTML formats using OCR

InvOCR - Invoice OCR & Conversion System

🔍 Universal document processing system with OCR capabilities for invoices, receipts, and financial documents

Python 3.9+ FastAPI Docker License: Apache

🚀 Features

📄 Document Processing

  • PDF → Images (PNG/JPG) with configurable DPI
  • Image → JSON using advanced OCR (Tesseract + EasyOCR)
  • PDF → JSON (direct text + OCR fallback)
  • JSON → XML (EU Invoice standard format)
  • JSON → HTML (responsive templates)
  • HTML → PDF (professional output)
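
The "direct text + OCR fallback" step for PDF → JSON can be sketched as follows. This is a minimal illustration, not InvOCR's actual implementation: the function name `extract_text_with_fallback`, the injected callables, and the `min_chars` threshold are all assumptions made for the example.

```python
from typing import Callable


def extract_text_with_fallback(
    pdf_path: str,
    extract_text: Callable[[str], str],  # hypothetical: reads the embedded text layer
    run_ocr: Callable[[str], str],       # hypothetical: renders pages and runs OCR
    min_chars: int = 20,                 # assumed threshold, not an InvOCR setting
) -> tuple[str, str]:
    """Return (text, method), where method is "direct" or "ocr"."""
    text = extract_text(pdf_path).strip()
    # A digital PDF usually carries enough embedded text to skip OCR entirely.
    if len(text) >= min_chars:
        return text, "direct"
    # Scanned PDFs have little or no text layer, so fall back to OCR.
    return run_ocr(pdf_path).strip(), "ocr"


if __name__ == "__main__":
    # Stand-in callables simulate a scanned PDF with an empty text layer.
    text, method = extract_text_with_fallback(
        "scan.pdf", lambda path: "", lambda path: "Invoice FV/2024/001"
    )
    print(method, text)
```

Passing the two extractors in as callables keeps the decision logic testable without Tesseract or a PDF library installed.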

🌍 Multi-language Support

  • English, Polish, German, French, Spanish, Italian
  • Auto-detection of document language
  • Custom language combinations

📋 Document Types

  • ✅ Invoices (commercial invoices)
  • ✅ Receipts (retail receipts)
  • ✅ Payment confirmations
  • ✅ Financial documents
  • ✅ Custom business documents

🔧 Interfaces

  • CLI - Command line interface
  • REST API - Web API with OpenAPI docs
  • Docker - Containerized deployment
  • Batch processing - Multiple files

🏗️ Project Structure

invocr/
├── 📁 invocr/                 # Main package
│   ├── 📁 core/               # Core processing modules
│   │   ├── ocr.py            # OCR engine (Tesseract + EasyOCR)
│   │   ├── converter.py      # Universal format converter
│   │   ├── extractor.py      # Data extraction logic
│   │   └── validator.py      # Data validation
│   │
│   ├── 📁 formats/            # Format-specific handlers
│   │   ├── pdf.py           # PDF operations
│   │   ├── image.py         # Image processing
│   │   ├── json_handler.py  # JSON operations
│   │   ├── xml_handler.py   # EU XML format
│   │   └── html_handler.py  # HTML generation
│   │
│   ├── 📁 api/               # REST API
│   │   ├── main.py          # FastAPI application
│   │   ├── routes.py        # API endpoints
│   │   └── models.py        # Pydantic models
│   │
│   ├── 📁 cli/               # Command line interface
│   │   └── commands.py      # CLI commands
│   │
│   └── 📁 utils/             # Utilities
│       ├── config.py        # Configuration
│       ├── logger.py        # Logging setup
│       └── helpers.py       # Helper functions
│
├── 📁 tests/                 # Test suite
├── 📁 scripts/               # Installation scripts
├── 📁 docs/                  # Documentation
├── 🐳 Dockerfile             # Docker configuration
├── 🐳 docker-compose.yml     # Docker Compose
├── 📋 pyproject.toml         # Poetry configuration
└── 📖 README.md              # This file

⚡ Quick Start

Option 1: Auto Installation (Recommended)

# Clone repository
git clone https://github.com/your-username/invocr.git
cd invocr

# Run installation script
chmod +x scripts/install.sh
./scripts/install.sh

Option 2: Manual Installation

# Install system dependencies (Ubuntu/Debian)
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-pol poppler-utils \
    libpango-1.0-0 libharfbuzz0b python3-dev build-essential

# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Install Python dependencies
poetry install

# Setup environment
cp .env.example .env

Option 3: Docker

# Using Docker Compose (easiest)
docker-compose up

# Or build manually
docker build -t invocr .
docker run -p 8000:8000 invocr

📚 Usage Examples

CLI Commands

# Convert PDF to JSON
invocr convert invoice.pdf invoice.json

# Convert with specific languages
invocr convert -l en,pl,de document.pdf output.json

# PDF to images
invocr pdf2img document.pdf ./images/ --format png --dpi 300

# Image to JSON (OCR)
invocr img2json scan.png data.json --doc-type invoice

# JSON to EU XML format
invocr json2xml data.json invoice.xml

# Batch processing
invocr batch ./input_files/ ./output/ --format json --parallel 4

# Full pipeline: PDF → IMG → JSON → XML → HTML → PDF
invocr pipeline document.pdf ./results/

# Start API server
invocr serve --host 0.0.0.0 --port 8000

REST API

# Start server
invocr serve

# Convert file
curl -X POST "http://localhost:8000/convert" \
  -F "file=@invoice.pdf" \
  -F "target_format=json" \
  -F "languages=en,pl"

# Check job status
curl "http://localhost:8000/status/{job_id}"

# Download result
curl "http://localhost:8000/download/{job_id}" -o result.json

Python API

from invocr import create_converter

# Create converter instance
converter = create_converter(languages=['en', 'pl', 'de'])

# Convert PDF to JSON
result = converter.pdf_to_json('invoice.pdf')
print(result)

# Convert image to JSON with OCR
data = converter.image_to_json('scan.png', document_type='invoice')

# Convert JSON to EU XML
xml_content = converter.json_to_xml(data, format='eu_invoice')

# Full conversion pipeline
result = converter.convert('input.pdf', 'output.json', 'auto', 'json')

🌐 API Documentation

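When the API server is running, FastAPI serves interactive documentation, by default at http://localhost:8000/docs (Swagger UI) and http://localhost:8000/redoc (ReDoc), assuming the application keeps FastAPI's default documentation paths.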

Key Endpoints

  • POST /convert - Convert single file
  • POST /convert/pdf2img - PDF to images
  • POST /convert/img2json - Image OCR to JSON
  • POST /batch/convert - Batch processing
  • GET /status/{job_id} - Job status
  • GET /download/{job_id} - Download result
  • GET /health - Health check
  • GET /info - System information

🔧 Configuration

Environment Variables

Key configuration options in .env:

# OCR Settings
DEFAULT_OCR_ENGINE=auto          # tesseract, easyocr, auto
DEFAULT_LANGUAGES=en,pl,de,fr,es # Supported languages
OCR_CONFIDENCE_THRESHOLD=0.3     # Minimum confidence

# Processing
MAX_FILE_SIZE=52428800          # 50MB limit
PARALLEL_WORKERS=4              # Concurrent processing
MAX_PAGES_PER_PDF=10           # Page limit

# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
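
As an illustration of how these variables might be read and type-checked, here is a small sketch using only the standard library. The `Settings` dataclass and `load_settings` function are hypothetical names for this example and are not taken from InvOCR's `utils/config.py`; the defaults simply mirror the values listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    ocr_engine: str
    languages: tuple
    confidence_threshold: float
    max_file_size: int
    parallel_workers: int
    max_pages_per_pdf: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    engine = env.get("DEFAULT_OCR_ENGINE", "auto")
    if engine not in {"tesseract", "easyocr", "auto"}:
        raise ValueError(f"Unknown OCR engine: {engine!r}")
    raw_languages = env.get("DEFAULT_LANGUAGES", "en,pl,de,fr,es")
    languages = tuple(c.strip() for c in raw_languages.split(",") if c.strip())
    return Settings(
        ocr_engine=engine,
        languages=languages,
        confidence_threshold=float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.3")),
        # 50 MB expressed in bytes: 50 * 1024 * 1024 = 52428800
        max_file_size=int(env.get("MAX_FILE_SIZE", str(50 * 1024 * 1024))),
        parallel_workers=int(env.get("PARALLEL_WORKERS", "4")),
        max_pages_per_pdf=int(env.get("MAX_PAGES_PER_PDF", "10")),
    )


if __name__ == "__main__":
    print(load_settings({"PARALLEL_WORKERS": "8"}))
```

Accepting a plain mapping instead of reading `os.environ` directly makes the parsing easy to test with a dictionary.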

Supported Languages

Code  Language  Tesseract  EasyOCR
en    English   ✅         ✅
pl    Polish    ✅         ✅
de    German    ✅         ✅
fr    French    ✅         ✅
es    Spanish   ✅         ✅
it    Italian   ✅         ✅
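
Tesseract names its language data with three-letter codes (for example `pol`, which matches the `tesseract-ocr-pol` package) and accepts several languages joined with `+`. The sketch below maps the two-letter codes from the table to that form. Whether InvOCR performs this mapping internally is an assumption; `to_tesseract_lang` is a hypothetical helper written for this example.

```python
# Two-letter codes from the table mapped to Tesseract's traineddata names.
TESSERACT_CODES = {
    "en": "eng",
    "pl": "pol",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
}


def to_tesseract_lang(languages: str) -> str:
    """Convert "en,pl,de" into the "eng+pol+deu" form used by `tesseract -l`."""
    codes = []
    for raw in languages.split(","):
        code = raw.strip().lower()
        if not code:
            continue  # tolerate stray commas such as "en,,pl"
        if code not in TESSERACT_CODES:
            raise ValueError(f"Unsupported language code: {code!r}")
        mapped = TESSERACT_CODES[code]
        if mapped not in codes:  # drop duplicates, keep first-seen order
            codes.append(mapped)
    if not codes:
        raise ValueError("No language codes given")
    return "+".join(codes)


if __name__ == "__main__":
    print(to_tesseract_lang("en,pl,de"))
```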

📊 Supported Formats

Input Formats

  • PDF (.pdf)
  • Images (.png, .jpg, .jpeg, .tiff, .bmp)
  • JSON (.json)
  • XML (.xml)
  • HTML (.html)

Output Formats

  • JSON - Structured data
  • XML - EU Invoice standard
  • HTML - Responsive templates
  • PDF - Professional documents
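
To give a feel for the JSON → XML step, here is a generic dictionary-to-XML sketch built on the standard library. It does not produce the EU Invoice schema: the element names (`invoice`, `seller`, `item`) and the `dict_to_xml` helper are illustrative assumptions, and the real mapping in `xml_handler.py` may look quite different.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(tag: str, value) -> ET.Element:
    """Recursively turn dicts, lists and scalars into nested XML elements.

    Keys are used as tag names unchanged, so they must already be valid
    XML names (ElementTree does not validate them).
    """
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(dict_to_xml(key, child))
    elif isinstance(value, list):
        for item in value:
            element.append(dict_to_xml("item", item))
    elif value is not None:
        element.text = str(value)  # special characters are escaped on output
    return element


# Hypothetical invoice data; field names are not InvOCR's JSON schema.
invoice = {
    "number": "FV/2024/001",
    "seller": {"name": "ACME & Co"},
    "items": [{"description": "Widget", "quantity": 2}],
}
xml_text = ET.tostring(dict_to_xml("invoice", invoice), encoding="unicode")

if __name__ == "__main__":
    print(xml_text)
```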

🧪 Testing

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=invocr

# Run specific test file
poetry run pytest tests/test_ocr.py

# Run API tests
poetry run pytest tests/test_api.py

🚀 Deployment

Production with Docker

# docker-compose.prod.yml
version: '3.8'
services:
  invocr:
    image: invocr:latest
    ports:
      - "80:8000"
    environment:
      - ENVIRONMENT=production
      - WORKERS=4
    volumes:
      - ./data:/app/data

Kubernetes

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invocr
spec:
  replicas: 3
  selector:
    matchLabels:
      app: invocr
  template:
    metadata:
      labels:
        app: invocr
    spec:
      containers:
      - name: invocr
        image: invocr:latest
        ports:
        - containerPort: 8000

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Make changes
  4. Add tests
  5. Run tests (poetry run pytest)
  6. Commit changes (git commit -m 'Add amazing feature')
  7. Push to branch (git push origin feature/amazing-feature)
  8. Open Pull Request

Development Setup

# Install development dependencies
poetry install --with dev

# Install pre-commit hooks
poetry run pre-commit install

# Run linting
poetry run black invocr/
poetry run isort invocr/
poetry run flake8 invocr/

# Run type checking
poetry run mypy invocr/

📈 Performance

Benchmarks

Operation             Time    Memory
PDF → JSON (1 page)   ~2-3s   ~50MB
Image OCR → JSON      ~1-2s   ~30MB
JSON → XML            ~0.1s   ~10MB
JSON → HTML           ~0.2s   ~15MB
HTML → PDF            ~1-2s   ~40MB

Optimization Tips

  • Use --parallel for batch processing
  • Set IMAGE_ENHANCEMENT=false to skip image preprocessing and speed up OCR
  • Set DEFAULT_OCR_ENGINE=tesseract to use the faster Tesseract engine
  • Configure MAX_PAGES_PER_PDF for large documents
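
The idea behind `--parallel` can be sketched with a worker pool from the standard library. This is an illustration only: `batch_convert` and the `convert_one` callable are hypothetical, and nothing here confirms whether InvOCR itself uses threads or processes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def batch_convert(
    paths: Iterable[str],
    convert_one: Callable[[str], object],  # hypothetical per-file converter
    workers: int = 4,
) -> dict:
    """Run convert_one over all paths and collect per-file outcomes.

    Returns {path: ("ok", result)} or {path: ("error", message)} so that a
    single bad file does not abort the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {path: pool.submit(convert_one, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = ("ok", future.result())
            except Exception as exc:
                results[path] = ("error", str(exc))
    return results


if __name__ == "__main__":
    def fake_convert(path: str) -> str:
        if not path.endswith(".pdf"):
            raise ValueError("unsupported")
        return path.replace(".pdf", ".json")

    print(batch_convert(["a.pdf", "b.txt"], fake_convert, workers=2))
```

Threads help when the per-file work runs in native code or a subprocess; for pure-Python CPU-bound work, `ProcessPoolExecutor` is the usual alternative.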

🔒 Security

  • File upload validation
  • Size limits enforced
  • Input sanitization
  • No execution of uploaded content
  • Rate limiting available
  • CORS configuration
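
Upload validation and size limits of the kind listed above could look roughly like this. The sketch assumes the extension list from the Supported Formats section and the 50 MB `MAX_FILE_SIZE` default; `validate_upload` is a hypothetical helper, not the check InvOCR actually runs.

```python
from pathlib import PurePosixPath

# Extensions taken from the "Input Formats" list; the limit mirrors MAX_FILE_SIZE.
ALLOWED_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".json", ".xml", ".html",
}
MAX_FILE_SIZE = 52_428_800  # 50 MB


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return a sanitized file name or raise ValueError."""
    # Keep only the final path component so "../../x.pdf" cannot escape the
    # upload directory; backslashes are normalized for Windows-style paths.
    safe_name = PurePosixPath(filename.replace("\\", "/")).name
    if not safe_name or safe_name in {".", ".."}:
        raise ValueError("Missing file name")
    extension = PurePosixPath(safe_name).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {extension or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("Empty upload")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError("File exceeds the size limit")
    return safe_name


if __name__ == "__main__":
    print(validate_upload("../../etc/invoice.PDF", 1024))
```

An extension check is only a first filter; inspecting the file content is a stronger test and is left out here for brevity.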

📋 Requirements

System Requirements

  • Python: 3.9+
  • Memory: 1GB+ RAM
  • Storage: 500MB+ free space
  • OS: Linux, macOS, Windows (Docker)

Dependencies

  • Tesseract OCR: Text recognition
  • EasyOCR: Neural OCR engine
  • WeasyPrint: HTML to PDF conversion
  • FastAPI: Web framework
  • Pydantic: Data validation

๐Ÿ› Troubleshooting

Common Issues

OCR not working:

# Check Tesseract installation
tesseract --version

# Install missing languages
sudo apt install tesseract-ocr-pol

WeasyPrint errors:

# Install system dependencies
sudo apt install libpango-1.0-0 libharfbuzz0b

Import errors:

# Reinstall dependencies into a fresh virtual environment
poetry env remove --all
poetry install

Permission errors:

# Fix file permissions
chmod -R 755 uploads/ output/

📞 Support

📄 License

This project is licensed under the Apache License - see the LICENSE file for details.

🙏 Acknowledgments


Made with ❤️ for the open source community

⭐ Star this repository if you find it useful!

---

scripts/setup_env.py

#!/usr/bin/env python3
"""
Environment setup script for InvOCR
Configures environment variables and validates setup
"""

import os
import shutil
import sys
from pathlib import Path


def setup_environment():
    """Setup environment variables and directories"""
    print("🔧 Setting up InvOCR environment...")

    # Project root (this script lives in scripts/)
    project_root = Path(__file__).parent.parent
    os.chdir(project_root)

    # Create directories
    directories = ["uploads", "output", "temp", "logs", "static"]
    for directory in directories:
        Path(directory).mkdir(exist_ok=True)
        print(f"📁 Created directory: {directory}")

    # Setup environment file
    env_file = Path(".env")
    env_example = Path(".env.example")

    if not env_file.exists() and env_example.exists():
        shutil.copy(env_example, env_file)
        print("✅ Created .env file from template")
    elif not env_file.exists():
        # Create basic .env file
        create_basic_env_file(env_file)
        print("✅ Created basic .env file")

    # Validate setup
    validate_setup()

    print("🎉 Environment setup completed!")


def create_basic_env_file(env_path: Path):
    """Create basic environment file"""
    content = """# InvOCR Environment Configuration
ENVIRONMENT=development
DEBUG=true
LOG_LEVEL=INFO

# Server
HOST=0.0.0.0
PORT=8000

# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
LOGS_DIR=./logs

# OCR
DEFAULT_LANGUAGES=en,pl,de,fr,es,it
OCR_CONFIDENCE_THRESHOLD=0.3

# Processing
MAX_FILE_SIZE=52428800
PARALLEL_WORKERS=4

# Security
SECRET_KEY=change-me-in-production
"""

    with open(env_path, 'w', encoding='utf-8') as f:
        f.write(content)


def validate_setup():
    """Validate environment setup"""
    print("🔍 Validating setup...")

    # Check Python version
    if sys.version_info < (3, 9):
        print("❌ Python 3.9+ required")
        return False

    print(f"✅ Python {sys.version_info.major}.{sys.version_info.minor}")

    # Check directories
    required_dirs = ["uploads", "output", "temp", "logs"]
    for directory in required_dirs:
        if Path(directory).exists():
            print(f"✅ Directory exists: {directory}")
        else:
            print(f"❌ Missing directory: {directory}")

    # Check environment file
    if Path(".env").exists():
        print("✅ Environment file exists")
    else:
        print("❌ Missing .env file")

    # Try importing invocr
    try:
        import invocr
        print("✅ InvOCR package importable")
    except ImportError as e:
        print(f"❌ Cannot import InvOCR: {e}")
        print("Run: poetry install")

    return True


if __name__ == "__main__":
    setup_environment()

---

docs/api.md

InvOCR API Documentation

Overview

The InvOCR REST API provides endpoints for document conversion and OCR processing.

Base URL

http://localhost:8000

Authentication

Currently, no authentication is required for local development.

Endpoints

Health Check

GET /health

Returns system health status.

System Information

GET /info

Returns supported formats, languages, and features.

Convert File

POST /convert

Convert uploaded file to specified format.

Parameters:

  • file (file): Input file
  • target_format (string): Output format (json, xml, html, pdf)
  • languages (string): Comma-separated language codes
  • async_processing (boolean): Process in background

Check Job Status

GET /status/{job_id}

Get conversion job status.

Download Result

GET /download/{job_id}

Download conversion result.

Example Usage

# Convert PDF to JSON
curl -X POST "http://localhost:8000/convert" \
  -F "file=@invoice.pdf" \
  -F "target_format=json"

# Check status
curl "http://localhost:8000/status/job-id"

# Download result
curl "http://localhost:8000/download/job-id" -o result.json
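
The submit, poll, download flow above can also be scripted. The sketch below covers the polling step only and rests on assumptions that this documentation does not confirm: that `GET /status/{job_id}` returns JSON with a `status` field, and that `"completed"` and `"failed"` are its terminal values. `wait_for_job` and `http_status_getter` are hypothetical helpers.

```python
import json
import time
import urllib.request
from typing import Callable


def wait_for_job(
    get_status: Callable[[], dict],
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll get_status() until the job reaches a terminal state or times out."""
    deadline = clock() + timeout
    while True:
        payload = get_status()
        # Assumed terminal values; adjust to what the API actually returns.
        if payload.get("status") in {"completed", "failed"}:
            return payload
        if clock() >= deadline:
            raise TimeoutError("Job did not finish in time")
        sleep(interval)


def http_status_getter(base_url: str, job_id: str) -> Callable[[], dict]:
    """Build a get_status callable that queries GET /status/{job_id}."""
    def get_status() -> dict:
        with urllib.request.urlopen(f"{base_url}/status/{job_id}") as response:
            return json.load(response)
    return get_status


if __name__ == "__main__":
    # Simulated responses stand in for a running server.
    responses = iter([{"status": "processing"}, {"status": "completed"}])
    print(wait_for_job(lambda: next(responses), interval=0.0))
```

Against a live server, `wait_for_job(http_status_getter("http://localhost:8000", job_id))` would replace the simulated responses.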

---

docs/cli.md

InvOCR CLI Documentation

Installation

poetry install

Usage

Basic Commands

# Show help
invocr --help

# Convert single file
invocr convert input.pdf output.json

# Convert with specific languages
invocr convert -l en,pl,de document.pdf output.json

# Convert PDF to images
invocr pdf2img document.pdf ./images/

# Image to JSON (OCR)
invocr img2json scan.png data.json

# JSON to XML
invocr json2xml data.json invoice.xml

# Batch processing
invocr batch ./pdfs/ ./output/ --format json

# Full pipeline
invocr pipeline document.pdf ./results/

# Start API server
invocr serve

Advanced Options

# Batch processing with parallelization
invocr batch ./input/ ./output/ --parallel 8 --format xml

# Custom OCR languages
invocr img2json scan.png data.json --languages en,pl,de,fr

# Custom templates
invocr convert data.json invoice.html --template classic

# API server with custom host/port
invocr serve --host 0.0.0.0 --port 9000

Examples

# Convert invoice PDF to JSON
invocr convert invoice.pdf invoice.json

# Process receipt image
invocr img2json receipt.jpg receipt.json --doc-type receipt

# Generate EU standard XML
invocr json2xml invoice.json eu_invoice.xml

# Create HTML invoice
invocr json2html invoice.json invoice.html --template modern
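
For readers curious what a JSON → HTML step involves, here is a deliberately small sketch using the standard library. It is not the `modern` or `classic` template: the markup, the `PAGE` template, the field names, and `render_invoice_html` are all assumptions made for illustration.

```python
from html import escape
from string import Template

# Minimal page skeleton; the real templates are presumably richer.
PAGE = Template(
    "<!DOCTYPE html>\n"
    "<html><body>\n"
    "<h1>Invoice $number</h1>\n"
    "<table>\n$rows\n</table>\n"
    "</body></html>\n"
)


def render_invoice_html(invoice: dict) -> str:
    """Render an invoice dict to HTML, escaping every user-supplied value."""
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(
            escape(str(item["description"])), escape(str(item["quantity"]))
        )
        for item in invoice.get("items", [])
    )
    return PAGE.substitute(number=escape(str(invoice["number"])), rows=rows)


if __name__ == "__main__":
    sample = {
        "number": "FV/2024/001",
        "items": [{"description": "Tea & cake", "quantity": 2}],
    }
    print(render_invoice_html(sample))
```

Escaping matters here because OCR output is untrusted text and may contain characters such as `<` or `&`.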
