Invoice OCR System - Convert invoices between PDF, JSON, XML, HTML formats using OCR

InvOCR - Invoice OCR & Conversion System

🔍 Universal document processing system with OCR capabilities for invoices, receipts, and financial documents

Python 3.9+ FastAPI Docker License: Apache

🚀 Features

📄 Document Processing

  • PDF → Images (PNG/JPG) with configurable DPI
  • Image → JSON using advanced OCR (Tesseract + EasyOCR)
  • PDF → JSON (direct text + OCR fallback)
  • JSON → XML (EU Invoice standard format)
  • JSON → HTML (responsive templates)
  • HTML → PDF (professional output)
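
The "direct text + OCR fallback" idea for PDF → JSON can be sketched as follows. This is only an illustration of the control flow, not InvOCR's actual implementation: `read_text_layer`, `run_ocr`, and the `min_chars` cutoff are hypothetical names introduced here, and the two callables stand in for whatever PDF and OCR backends are in use.

```python
from typing import Callable


def extract_text(
    pdf_path: str,
    read_text_layer: Callable[[str], str],  # hypothetical: returns embedded PDF text
    run_ocr: Callable[[str], str],          # hypothetical: rasterizes pages and runs OCR
    min_chars: int = 20,                    # assumed cutoff for "has a usable text layer"
) -> tuple[str, str]:
    """Return (text, source), preferring the embedded text layer over OCR."""
    text = read_text_layer(pdf_path).strip()
    if len(text) >= min_chars:
        return text, "text"
    # Too little embedded text: treat the PDF as a scan and fall back to OCR.
    return run_ocr(pdf_path).strip(), "ocr"


# Stub backends make the two branches visible without any real PDF.
digital = extract_text("a.pdf", lambda p: "Invoice FV/2024/001, total 123.00 EUR", lambda p: "")
scanned = extract_text("b.pdf", lambda p: "", lambda p: "Invoice 42")
```

Trying the text layer first is cheaper and avoids recognition errors on born-digital PDFs; OCR only runs when the page is effectively an image.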

🌍 Multi-language Support

  • English, Polish, German, French, Spanish, Italian
  • Auto-detection of document language
  • Custom language combinations

📋 Document Types

  • ✅ Invoices (commercial invoices)
  • ✅ Receipts (retail receipts)
  • ✅ Payment confirmations
  • ✅ Financial documents
  • ✅ Custom business documents

🔧 Interfaces

  • CLI - Command line interface
  • REST API - Web API with OpenAPI docs
  • Docker - Containerized deployment
  • Batch processing - Multiple files

🏗️ Project Structure

invocr/
├── 📁 invocr/                 # Main package
│   ├── 📁 core/               # Core processing modules
│   │   ├── ocr.py            # OCR engine (Tesseract + EasyOCR)
│   │   ├── converter.py      # Universal format converter
│   │   ├── extractor.py      # Data extraction logic
│   │   └── validator.py      # Data validation
│   │
│   ├── 📁 formats/            # Format-specific handlers
│   │   ├── pdf.py           # PDF operations
│   │   ├── image.py         # Image processing
│   │   ├── json_handler.py  # JSON operations
│   │   ├── xml_handler.py   # EU XML format
│   │   └── html_handler.py  # HTML generation
│   │
│   ├── 📁 api/               # REST API
│   │   ├── main.py          # FastAPI application
│   │   ├── routes.py        # API endpoints
│   │   └── models.py        # Pydantic models
│   │
│   ├── 📁 cli/               # Command line interface
│   │   └── commands.py      # CLI commands
│   │
│   └── 📁 utils/             # Utilities
│       ├── config.py        # Configuration
│       ├── logger.py        # Logging setup
│       └── helpers.py       # Helper functions
│
├── 📁 tests/                 # Test suite
├── 📁 scripts/               # Installation scripts
├── 📁 docs/                  # Documentation
├── 🐳 Dockerfile             # Docker configuration
├── 🐳 docker-compose.yml     # Docker Compose
├── 📋 pyproject.toml         # Poetry configuration
└── 📖 README.md              # This file

⚡ Quick Start

Option 1: Auto Installation (Recommended)

# Clone repository
git clone https://github.com/your-username/invocr.git
cd invocr

# Run installation script
chmod +x scripts/install.sh
./scripts/install.sh

Option 2: Manual Installation

# Install system dependencies (Ubuntu/Debian)
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-pol poppler-utils \
    libpango-1.0-0 libharfbuzz0b python3-dev build-essential

# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Install Python dependencies
poetry install

# Setup environment
cp .env.example .env

Option 3: Docker

# Using Docker Compose (easiest)
docker-compose up

# Or build manually
docker build -t invocr .
docker run -p 8000:8000 invocr

📚 Usage Examples

CLI Commands

# Convert PDF to JSON
invocr convert invoice.pdf invoice.json

# Convert with specific languages
invocr convert -l en,pl,de document.pdf output.json

# PDF to images
invocr pdf2img document.pdf ./images/ --format png --dpi 300

# Image to JSON (OCR)
invocr img2json scan.png data.json --doc-type invoice

# JSON to EU XML format
invocr json2xml data.json invoice.xml

# Batch processing
invocr batch ./input_files/ ./output/ --format json --parallel 4

# Full pipeline: PDF → IMG → JSON → XML → HTML → PDF
invocr pipeline document.pdf ./results/

# Start API server
invocr serve --host 0.0.0.0 --port 8000
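
To give a feel for what `invocr batch ... --parallel 4` implies, here is a minimal worker-pool sketch. It is an assumption about the general approach, not the CLI's real code: `batch_convert` and the `convert_one` callable are hypothetical, and a thread pool is used only for simplicity (CPU-bound OCR may be better served by a process pool).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def batch_convert(
    inputs: Iterable[Path],
    output_dir: Path,
    convert_one: Callable[[Path, Path], None],  # hypothetical single-file converter
    parallel: int = 4,
) -> dict[str, str]:
    """Convert every input with up to `parallel` workers; return a per-file status."""
    output_dir.mkdir(parents=True, exist_ok=True)

    def run(src: Path) -> tuple[str, str]:
        dst = output_dir / (src.stem + ".json")
        try:
            convert_one(src, dst)
            return src.name, "ok"
        except Exception as exc:  # one unreadable file should not abort the batch
            return src.name, f"failed: {exc}"

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return dict(pool.map(run, inputs))
```

Collecting a status per file, rather than raising on the first error, lets a large batch finish and report which documents need attention.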

REST API

# Start server
invocr serve

# Convert file
curl -X POST "http://localhost:8000/convert" \
  -F "file=@invoice.pdf" \
  -F "target_format=json" \
  -F "languages=en,pl"

# Check job status
curl "http://localhost:8000/status/{job_id}"

# Download result
curl "http://localhost:8000/download/{job_id}" -o result.json
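
The convert, status, and download calls above suggest an asynchronous job flow. A small polling helper, using only the standard library, might look like the sketch below. The response schema is not documented here, so the `"status"` field and the terminal values `"completed"` and `"failed"` are assumptions; adjust them to whatever the server actually returns.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def fetch_status(job_id: str) -> dict:
    """GET /status/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/status/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    get_status: Callable[[str], dict] = fetch_status,
    interval: float = 1.0,
    timeout: float = 60.0,
) -> dict:
    """Poll until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        # ASSUMPTION: the body carries a "status" field with these terminal values.
        if status.get("status") in {"completed", "failed"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout}s")
        time.sleep(interval)
```

Passing `get_status` as a parameter keeps the loop testable without a running server.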

Python API

from invocr import create_converter

# Create converter instance
converter = create_converter(languages=['en', 'pl', 'de'])

# Convert PDF to JSON
result = converter.pdf_to_json('invoice.pdf')
print(result)

# Convert image to JSON with OCR
data = converter.image_to_json('scan.png', document_type='invoice')

# Convert JSON to EU XML
xml_content = converter.json_to_xml(data, format='eu_invoice')

# Full conversion pipeline
result = converter.convert('input.pdf', 'output.json', 'auto', 'json')
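
For intuition about the JSON → XML step, the following standard-library sketch turns a nested dictionary into XML. It is not the `json_to_xml` implementation and does not follow the EU invoice schema: the helper `dict_to_xml`, the element names, and the sample invoice are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(tag: str, value) -> ET.Element:
    """Map dicts to child elements, lists to repeated <item> elements, scalars to text."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(dict_to_xml(key, child))  # keys must be valid XML names
    elif isinstance(value, list):
        for child in value:
            elem.append(dict_to_xml("item", child))
    else:
        elem.text = "" if value is None else str(value)
    return elem


# Hypothetical invoice data, only to exercise the function.
invoice = {
    "number": "FV/2024/001",
    "seller": {"name": "ACME"},
    "items": [{"description": "Service", "total": 100.0}],
}
xml_text = ET.tostring(dict_to_xml("invoice", invoice), encoding="unicode")
```

A real standards-compliant export would additionally need the namespaces, element order, and mandatory fields defined by the target schema.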

🌐 API Documentation

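When running the API server, visit the interactive documentation. FastAPI serves Swagger UI at /docs and ReDoc at /redoc by default (for example, http://localhost:8000/docs).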

Key Endpoints

  • POST /convert - Convert single file
  • POST /convert/pdf2img - PDF to images
  • POST /convert/img2json - Image OCR to JSON
  • POST /batch/convert - Batch processing
  • GET /status/{job_id} - Job status
  • GET /download/{job_id} - Download result
  • GET /health - Health check
  • GET /info - System information

🔧 Configuration

Environment Variables

Key configuration options in .env:

# OCR Settings
DEFAULT_OCR_ENGINE=auto          # tesseract, easyocr, auto
DEFAULT_LANGUAGES=en,pl,de,fr,es # Supported languages
OCR_CONFIDENCE_THRESHOLD=0.3     # Minimum confidence

# Processing
MAX_FILE_SIZE=52428800          # 50MB limit
PARALLEL_WORKERS=4              # Concurrent processing
MAX_PAGES_PER_PDF=10           # Page limit

# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
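
As a sketch of how these variables could be consumed from Python, the snippet below parses them into a typed object. The `Settings` class and `load_settings` function are hypothetical (InvOCR's own `utils/config.py` may differ), and using the values shown above as fallbacks is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    ocr_engine: str
    languages: list[str]
    confidence_threshold: float
    max_file_size: int
    parallel_workers: int
    max_pages_per_pdf: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables; fallbacks mirror the example .env (assumed defaults)."""
    raw_langs = env.get("DEFAULT_LANGUAGES", "en,pl,de,fr,es")
    return Settings(
        ocr_engine=env.get("DEFAULT_OCR_ENGINE", "auto"),
        languages=[code.strip() for code in raw_langs.split(",") if code.strip()],
        confidence_threshold=float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.3")),
        max_file_size=int(env.get("MAX_FILE_SIZE", "52428800")),  # 50 * 1024 * 1024 bytes
        parallel_workers=int(env.get("PARALLEL_WORKERS", "4")),
        max_pages_per_pdf=int(env.get("MAX_PAGES_PER_PDF", "10")),
    )
```

Converting to `int` and `float` at load time surfaces a malformed value as an immediate `ValueError` instead of a failure deep inside processing.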

Supported Languages

Code   Language   Tesseract   EasyOCR
en     English    ✅          ✅
pl     Polish     ✅          ✅
de     German     ✅          ✅
fr     French     ✅          ✅
es     Spanish    ✅          ✅
it     Italian    ✅          ✅
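
Tesseract identifies languages by three-letter codes (for example `eng`, `pol`) and accepts several joined with `+`, so the two-letter codes in the table need translating before they reach the engine. The helper below is a hypothetical sketch of that mapping, not a function exported by InvOCR.

```python
# Two-letter codes from the table above mapped to Tesseract traineddata names.
TESSERACT_CODES = {
    "en": "eng",
    "pl": "pol",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
}


def tesseract_lang(languages: str) -> str:
    """Turn 'en,pl,de' into the 'eng+pol+deu' form expected by `tesseract -l`."""
    codes = [code.strip().lower() for code in languages.split(",") if code.strip()]
    unknown = [code for code in codes if code not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"unsupported language code(s): {', '.join(unknown)}")
    return "+".join(TESSERACT_CODES[code] for code in codes)


lang_arg = tesseract_lang("en,pl,de")
```

Each code in the result still requires the matching language pack to be installed, such as `tesseract-ocr-pol` for Polish.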

📊 Supported Formats

Input Formats

  • PDF (.pdf)
  • Images (.png, .jpg, .jpeg, .tiff, .bmp)
  • JSON (.json)
  • XML (.xml)
  • HTML (.html)

Output Formats

  • JSON - Structured data
  • XML - EU Invoice standard
  • HTML - Responsive templates
  • PDF - Professional documents

🧪 Testing

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=invocr

# Run specific test file
poetry run pytest tests/test_ocr.py

# Run API tests
poetry run pytest tests/test_api.py

🚀 Deployment

Production with Docker

# docker-compose.prod.yml
version: '3.8'
services:
  invocr:
    image: invocr:latest
    ports:
      - "80:8000"
    environment:
      - ENVIRONMENT=production
      - WORKERS=4
    volumes:
      - ./data:/app/data

Kubernetes

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invocr
spec:
  replicas: 3
  selector:
    matchLabels:
      app: invocr
  template:
    metadata:
      labels:
        app: invocr
    spec:
      containers:
      - name: invocr
        image: invocr:latest
        ports:
        - containerPort: 8000

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Make changes
  4. Add tests
  5. Run tests (poetry run pytest)
  6. Commit changes (git commit -m 'Add amazing feature')
  7. Push to branch (git push origin feature/amazing-feature)
  8. Open Pull Request

Development Setup

# Install development dependencies
poetry install --with dev

# Install pre-commit hooks
poetry run pre-commit install

# Run linting
poetry run black invocr/
poetry run isort invocr/
poetry run flake8 invocr/

# Run type checking
poetry run mypy invocr/

📈 Performance

Benchmarks

Operation              Time     Memory
PDF → JSON (1 page)    ~2-3s    ~50MB
Image OCR → JSON       ~1-2s    ~30MB
JSON → XML             ~0.1s    ~10MB
JSON → HTML            ~0.2s    ~15MB
HTML → PDF             ~1-2s    ~40MB
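
These figures depend on hardware and document content, so it can be useful to measure your own workload. The helper below is a generic standard-library sketch, not part of InvOCR; `measure` is a hypothetical name. Note that `tracemalloc` only sees Python-level allocations, so memory used inside native libraries such as Tesseract is not counted.

```python
import time
import tracemalloc
from typing import Any, Callable


def measure(fn: Callable[[], Any]) -> tuple[Any, float, int]:
    """Run fn once; return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn()
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Stand-in workload; replace the lambda with a call such as a PDF conversion.
result, seconds, peak_bytes = measure(lambda: sum(range(1000)))
```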

Optimization Tips

  • Use --parallel for batch processing
  • Set IMAGE_ENHANCEMENT=false for faster OCR
  • Use the tesseract engine (DEFAULT_OCR_ENGINE=tesseract) for faster processing
  • Set MAX_PAGES_PER_PDF to limit how many pages of a large document are processed

🔒 Security

  • File upload validation
  • Size limits enforced
  • Input sanitization
  • No execution of uploaded content
  • Rate limiting available
  • CORS configuration
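
Upload validation and size limits of the kind listed above can be illustrated with a short check. This is a sketch under stated assumptions rather than InvOCR's actual code: `validate_upload` is a hypothetical function, the extension list is taken from the Input Formats section, and the limit reuses the `MAX_FILE_SIZE` value from the configuration example.

```python
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".json", ".xml", ".html"}
MAX_FILE_SIZE = 52_428_800  # 50 MB, as in the example .env


def validate_upload(filename: str, size: int) -> str:
    """Return a safe base name for the upload, or raise ValueError."""
    # Keep only the last path component so "../" segments cannot escape the upload dir.
    name = PurePosixPath(filename.replace("\\", "/")).name
    if not name or name in {".", ".."}:
        raise ValueError("missing file name")
    if PurePosixPath(name).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {name}")
    if size <= 0 or size > MAX_FILE_SIZE:
        raise ValueError(f"file size {size} outside allowed range")
    return name


safe_name = validate_upload("../uploads/Invoice.PDF", 1024)
```

An extension check is only a first filter; inspecting the file's actual content before processing is a sensible second step.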

📋 Requirements

System Requirements

  • Python: 3.9+
  • Memory: 1GB+ RAM
  • Storage: 500MB+ free space
  • OS: Linux, macOS, Windows (Docker)

Dependencies

  • Tesseract OCR: Text recognition
  • EasyOCR: Neural OCR engine
  • WeasyPrint: HTML to PDF conversion
  • FastAPI: Web framework
  • Pydantic: Data validation

🐛 Troubleshooting

Common Issues

OCR not working:

# Check Tesseract installation
tesseract --version

# Install missing languages
sudo apt install tesseract-ocr-pol

WeasyPrint errors:

# Install system dependencies
sudo apt install libpango-1.0-0 libharfbuzz0b

Import errors:

# Recreate the virtual environment and reinstall dependencies
poetry env remove --all
poetry install

Permission errors:

# Fix file permissions
chmod -R 755 uploads/ output/

📞 Support

📄 License

This project is licensed under the Apache License - see the LICENSE file for details.

🙏 Acknowledgments

Made with ❤️ for the open source community

⭐ Star this repository if you find it useful!
