Invoice OCR System - Convert invoices between PDF, JSON, XML, HTML formats using OCR

InvOCR - Invoice OCR & Conversion System

🔍 Universal document processing system with OCR capabilities for invoices, receipts, and financial documents

🚀 Features

📄 Document Processing

PDF → Images (PNG/JPG) with configurable DPI
Image → JSON using advanced OCR (Tesseract + EasyOCR)
PDF → JSON (direct text + OCR fallback)
JSON → XML (EU Invoice standard format)
JSON → HTML (responsive templates)
HTML → PDF (professional output)
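
The "direct text + OCR fallback" step can be pictured roughly as follows. This is a minimal sketch, not InvOCR's actual implementation: `pdf_page_to_text`, the `extract_text` and `run_ocr` callables, and the `min_chars` threshold are all hypothetical names chosen for illustration.

```python
from typing import Callable, Tuple


def pdf_page_to_text(
    page: bytes,
    extract_text: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
    min_chars: int = 20,
) -> Tuple[str, str]:
    """Return (text, source), where source is "text-layer" or "ocr".

    The embedded text layer is tried first; OCR is used only when the
    text layer is missing or too short to be useful (assumed heuristic).
    """
    text = extract_text(page).strip()
    if len(text) >= min_chars:
        return text, "text-layer"
    return run_ocr(page).strip(), "ocr"


# Stub extractors standing in for a real PDF parser and OCR engine.
digital = pdf_page_to_text(
    b"%PDF",
    lambda p: "Invoice No. 2025/06/001 Total 123.00 EUR",
    lambda p: "",
)
scanned = pdf_page_to_text(b"%PDF", lambda p: "", lambda p: "Invoice No. 7 ")
print(digital[1], scanned[1])
```

Passing the two extractors in as callables keeps the fallback decision separate from any particular PDF or OCR library.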

🌍 Multi-language Support

English, Polish, German, French, Spanish, Italian
Auto-detection of document language
Custom language combinations

📋 Document Types

✅ Invoices (commercial invoices)
✅ Receipts (retail receipts)
✅ Payment confirmations
✅ Financial documents
✅ Custom business documents

🔧 Interfaces

CLI - Command line interface
REST API - Web API with OpenAPI docs
Docker - Containerized deployment
Batch processing - Multiple files

🏗️ Project Structure

invocr/
├── 📁 invocr/                 # Main package
│   ├── 📁 core/               # Core processing modules
│   │   ├── ocr.py            # OCR engine (Tesseract + EasyOCR)
│   │   ├── converter.py      # Universal format converter
│   │   ├── extractor.py      # Data extraction logic
│   │   └── validator.py      # Data validation
│   │
│   ├── 📁 formats/            # Format-specific handlers
│   │   ├── pdf.py           # PDF operations
│   │   ├── image.py         # Image processing
│   │   ├── json_handler.py  # JSON operations
│   │   ├── xml_handler.py   # EU XML format
│   │   └── html_handler.py  # HTML generation
│   │
│   ├── 📁 api/               # REST API
│   │   ├── main.py          # FastAPI application
│   │   ├── routes.py        # API endpoints
│   │   └── models.py        # Pydantic models
│   │
│   ├── 📁 cli/               # Command line interface
│   │   └── commands.py      # CLI commands
│   │
│   └── 📁 utils/             # Utilities
│       ├── config.py        # Configuration
│       ├── logger.py        # Logging setup
│       └── helpers.py       # Helper functions
│
├── 📁 tests/                 # Test suite
├── 📁 scripts/               # Installation scripts
├── 📁 docs/                  # Documentation
├── 🐳 Dockerfile             # Docker configuration
├── 🐳 docker-compose.yml     # Docker Compose
├── 📋 pyproject.toml         # Poetry configuration
└── 📖 README.md              # This file

⚡ Quick Start

Option 1: Auto Installation (Recommended)

# Clone repository
git clone https://github.com/your-username/invocr.git
cd invocr

# Run installation script
chmod +x scripts/install.sh
./scripts/install.sh

Option 2: Manual Installation

# Install system dependencies (Ubuntu/Debian)
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-pol poppler-utils \
    libpango-1.0-0 libharfbuzz0b python3-dev build-essential

# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Install Python dependencies
poetry install

# Setup environment
cp .env.example .env

Option 3: Docker

# Using Docker Compose (easiest)
docker-compose up

# Or build manually
docker build -t invocr .
docker run -p 8000:8000 invocr

📚 Usage Examples

CLI Commands

# Convert PDF to JSON
invocr convert invoice.pdf invoice.json

# Convert with specific languages
invocr convert -l en,pl,de document.pdf output.json

# PDF to images
invocr pdf2img document.pdf ./images/ --format png --dpi 300

# Image to JSON (OCR)
invocr img2json scan.png data.json --doc-type invoice

# JSON to EU XML format
invocr json2xml data.json invoice.xml

# Batch processing
invocr batch ./input_files/ ./output/ --format json --parallel 4

# Full pipeline: PDF → IMG → JSON → XML → HTML → PDF
invocr pipeline document.pdf ./results/

# Start API server
invocr serve --host 0.0.0.0 --port 8000
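
For readers who want the effect of `--parallel 4` from their own code, the idea can be sketched with the standard library. The function names here (`batch_convert`, `fake_convert`) are hypothetical and the stub stands in for a real conversion call; how the CLI schedules work internally is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def batch_convert(
    files: Iterable[str],
    convert_one: Callable[[Path], str],
    workers: int = 4,
) -> Dict[str, Tuple[str, str]]:
    """Convert files concurrently; one failure does not stop the batch."""
    results: Dict[str, Tuple[str, str]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert_one, Path(f)): str(f) for f in files}
        for future, name in futures.items():
            try:
                results[name] = ("ok", future.result())
            except Exception as exc:  # record the error and keep going
                results[name] = ("error", str(exc))
    return results


def fake_convert(path: Path) -> str:
    """Stub converter: accepts only PDFs and returns the output file name."""
    if path.suffix != ".pdf":
        raise ValueError(f"unsupported: {path.suffix}")
    return path.with_suffix(".json").name


results = batch_convert(["a.pdf", "b.pdf", "notes.txt"], fake_convert)
print(results)
```

Threads are used only to keep the sketch short; CPU-bound OCR work may be better served by a process pool.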

REST API

# Start server
invocr serve

# Convert file
curl -X POST "http://localhost:8000/convert" \
  -F "file=@invoice.pdf" \
  -F "target_format=json" \
  -F "languages=en,pl"

# Check job status
curl "http://localhost:8000/status/{job_id}"

# Download result
curl "http://localhost:8000/download/{job_id}" -o result.json
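
The submit, poll, download flow above can also be driven from Python. The sketch below covers only the polling step and makes assumptions that the README does not confirm: that `/status/{job_id}` returns a JSON object with a `status` field, and that `completed` and `failed` are its terminal values. Check the interactive docs of a running server for the real schema.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict


def http_fetch_status(job_id: str, base_url: str = "http://localhost:8000") -> Dict[str, Any]:
    """Fetch the job status as a dict (response shape is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/status/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], Dict[str, Any]],
    interval: float = 1.0,
    timeout: float = 60.0,
) -> Dict[str, Any]:
    """Poll until the job reaches an (assumed) terminal state or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout}s")
        time.sleep(interval)


# Offline demonstration with canned responses instead of a live server.
responses = iter([{"status": "processing"}, {"status": "processing"}, {"status": "completed"}])
final = wait_for_job("demo", lambda _job_id: next(responses), interval=0.0)
print(final)
```

Against a live server, `wait_for_job(job_id, http_fetch_status)` would replace the canned responses.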

Python API

from invocr import create_converter

# Create converter instance
converter = create_converter(languages=['en', 'pl', 'de'])

# Convert PDF to JSON
result = converter.pdf_to_json('invoice.pdf')
print(result)

# Convert image to JSON with OCR
data = converter.image_to_json('scan.png', document_type='invoice')

# Convert JSON to EU XML
xml_content = converter.json_to_xml(data, format='eu_invoice')

# Full conversion pipeline
result = converter.convert('input.pdf', 'output.json', 'auto', 'json')

🌐 API Documentation

When running the API server, visit:

Interactive docs: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc
OpenAPI JSON: http://localhost:8000/openapi.json

Key Endpoints

POST /convert - Convert single file
POST /convert/pdf2img - PDF to images
POST /convert/img2json - Image OCR to JSON
POST /batch/convert - Batch processing
GET /status/{job_id} - Job status
GET /download/{job_id} - Download result
GET /health - Health check
GET /info - System information

🔧 Configuration

Environment Variables

Key configuration options in .env:

# OCR Settings
DEFAULT_OCR_ENGINE=auto          # tesseract, easyocr, auto
DEFAULT_LANGUAGES=en,pl,de,fr,es # Languages used when none are specified
OCR_CONFIDENCE_THRESHOLD=0.3     # Minimum confidence

# Processing
MAX_FILE_SIZE=52428800          # 50MB limit
PARALLEL_WORKERS=4              # Concurrent processing
MAX_PAGES_PER_PDF=10           # Page limit

# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
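
To show how these variables fit together, here is a hedged sketch of loading and validating them. The `Settings` class and `load_settings` function are hypothetical and are not InvOCR's own configuration code (which lives in `utils/config.py`); the 0 to 1 range for the confidence threshold is also an assumption. Note that 52428800 bytes is exactly 50 × 1024 × 1024.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Settings:
    ocr_engine: str
    languages: Tuple[str, ...]
    confidence_threshold: float
    max_file_size: int
    parallel_workers: int
    max_pages_per_pdf: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (defaults to os.environ), using the
    values shown in the .env example as fallbacks."""
    env = os.environ if env is None else env

    engine = env.get("DEFAULT_OCR_ENGINE", "auto")
    if engine not in {"auto", "tesseract", "easyocr"}:
        raise ValueError(f"unknown OCR engine: {engine}")

    threshold = float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.3"))
    if not 0.0 <= threshold <= 1.0:  # assumed valid range
        raise ValueError("OCR_CONFIDENCE_THRESHOLD must be between 0 and 1")

    raw_languages = env.get("DEFAULT_LANGUAGES", "en,pl,de,fr,es")
    languages = tuple(code.strip() for code in raw_languages.split(",") if code.strip())

    return Settings(
        ocr_engine=engine,
        languages=languages,
        confidence_threshold=threshold,
        max_file_size=int(env.get("MAX_FILE_SIZE", str(50 * 1024 * 1024))),
        parallel_workers=int(env.get("PARALLEL_WORKERS", "4")),
        max_pages_per_pdf=int(env.get("MAX_PAGES_PER_PDF", "10")),
    )


settings = load_settings({"DEFAULT_LANGUAGES": "en, pl"})
print(settings)
```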

Supported Languages

Code	Language	Tesseract	EasyOCR
`en`	English	✅	✅
`pl`	Polish	✅	✅
`de`	German	✅	✅
`fr`	French	✅	✅
`es`	Spanish	✅	✅
`it`	Italian	✅	✅
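
Tesseract itself names its language data with three-letter codes (`eng`, `pol`, `deu`, `fra`, `spa`, `ita`) and combines them with `+`. The helper below is a hypothetical illustration of mapping the two-letter codes from the table to that form; whether InvOCR performs the mapping this way internally is not stated here.

```python
from typing import Iterable

# Two-letter codes from the table above mapped to Tesseract language names.
TESSERACT_CODES = {
    "en": "eng",
    "pl": "pol",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
}


def to_tesseract_lang(codes: Iterable[str]) -> str:
    """Build a Tesseract language string such as "eng+pol"."""
    codes = list(codes)
    unknown = [code for code in codes if code not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    return "+".join(TESSERACT_CODES[code] for code in codes)


lang_arg = to_tesseract_lang(["en", "pl", "de"])
print(lang_arg)  # eng+pol+deu
```

The resulting string is the form accepted by Tesseract's `-l` option, provided the matching language packs (for example `tesseract-ocr-pol`) are installed.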

📊 Supported Formats

Input Formats

PDF (.pdf)
Images (.png, .jpg, .jpeg, .tiff, .bmp)
JSON (.json)
XML (.xml)
HTML (.html)

Output Formats

JSON - Structured data
XML - EU Invoice standard
HTML - Responsive templates
PDF - Professional documents
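
The two lists above imply a simple lookup from file extension to input type. A minimal sketch, with hypothetical names (`INPUT_FORMATS`, `detect_input_format`) that are not part of the InvOCR API:

```python
from pathlib import Path

# Extensions from "Input Formats", grouped by the kind of handler needed.
INPUT_FORMATS = {
    ".pdf": "pdf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".tiff": "image",
    ".bmp": "image",
    ".json": "json",
    ".xml": "xml",
    ".html": "html",
}
OUTPUT_FORMATS = {"json", "xml", "html", "pdf"}


def detect_input_format(path: str) -> str:
    """Map a file name to its input type, ignoring extension case."""
    suffix = Path(path).suffix.lower()
    if suffix not in INPUT_FORMATS:
        raise ValueError(f"unsupported input extension: {suffix or '(none)'}")
    return INPUT_FORMATS[suffix]


kind = detect_input_format("Scan.JPEG")
print(kind)  # image
```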

🧪 Testing

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=invocr

# Run specific test file
poetry run pytest tests/test_ocr.py

# Run API tests
poetry run pytest tests/test_api.py

🚀 Deployment

Production with Docker

# docker-compose.prod.yml
version: '3.8'
services:
  invocr:
    image: invocr:latest
    ports:
      - "80:8000"
    environment:
      - ENVIRONMENT=production
      - WORKERS=4
    volumes:
      - ./data:/app/data

Kubernetes

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invocr
spec:
  replicas: 3
  selector:
    matchLabels:
      app: invocr
  template:
    metadata:
      labels:
        app: invocr
    spec:
      containers:
      - name: invocr
        image: invocr:latest
        ports:
        - containerPort: 8000

🤝 Contributing

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Make changes
4. Add tests
5. Run tests (poetry run pytest)
6. Commit changes (git commit -m 'Add amazing feature')
7. Push to the branch (git push origin feature/amazing-feature)
8. Open a Pull Request

Development Setup

# Install development dependencies
poetry install --with dev

# Install pre-commit hooks
poetry run pre-commit install

# Run linting
poetry run black invocr/
poetry run isort invocr/
poetry run flake8 invocr/

# Run type checking
poetry run mypy invocr/

📈 Performance

Benchmarks

Operation	Time	Memory
PDF → JSON (1 page)	~2-3s	~50MB
Image OCR → JSON	~1-2s	~30MB
JSON → XML	~0.1s	~10MB
JSON → HTML	~0.2s	~15MB
HTML → PDF	~1-2s	~40MB
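
To collect comparable numbers on your own hardware, a small timing helper is enough. This sketch is an assumption about method, not how the figures above were produced: `measure` is a hypothetical name, and `tracemalloc` counts only Python-level allocations, so memory used by native libraries such as Tesseract will not show up in its peak.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float, float]:
    """Run func and return (result, seconds elapsed, peak Python memory in MB)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak_bytes / (1024 * 1024)


# Stand-in workload; replace the lambda with a real conversion call.
result, seconds, peak_mb = measure(lambda: sum(range(100_000)))
print(f"{seconds:.4f}s, peak {peak_mb:.2f} MB")
```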

Optimization Tips

Pass --parallel N to invocr batch to process several files concurrently
Set IMAGE_ENHANCEMENT=false to skip image enhancement and speed up OCR
Set DEFAULT_OCR_ENGINE=tesseract for faster OCR
Lower MAX_PAGES_PER_PDF to cap processing time on large documents

🔒 Security

File upload validation
Size limits enforced
Input sanitization
No execution of uploaded content
Rate limiting available
CORS configuration
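
As an illustration of the first three items, an upload check might look like the sketch below. It is a hypothetical example, not InvOCR's validation code: `validate_upload` and the constants are names chosen here, with the size limit and extensions taken from the configuration and input-format sections.

```python
import posixpath

MAX_FILE_SIZE = 52_428_800  # 50 MB, matching MAX_FILE_SIZE in .env
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".json", ".xml", ".html"}


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return a sanitized file name or raise ValueError."""
    # Keep only the last path component so names like "../../x.pdf"
    # cannot point outside the upload directory.
    safe_name = posixpath.basename(filename.replace("\\", "/"))
    if safe_name in {"", ".", ".."}:
        raise ValueError("missing file name")

    extension = posixpath.splitext(safe_name)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"file type not allowed: {extension or '(none)'}")

    if size_bytes <= 0 or size_bytes > MAX_FILE_SIZE:
        raise ValueError("file is empty or exceeds the size limit")
    return safe_name


safe = validate_upload("../../invoice.PDF", 1024)
print(safe)  # invoice.PDF
```

Extension checks are only a first filter; inspecting the file content is a sensible additional step for untrusted uploads.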

📋 Requirements

System Requirements

Python: 3.9+
Memory: 1GB+ RAM
Storage: 500MB+ free space
OS: Linux, macOS, Windows (via Docker)

Dependencies

Tesseract OCR: Text recognition
EasyOCR: Neural OCR engine
WeasyPrint: HTML to PDF conversion
FastAPI: Web framework
Pydantic: Data validation

🐛 Troubleshooting

Common Issues

OCR not working:

# Check Tesseract installation
tesseract --version

# Install missing languages
sudo apt install tesseract-ocr-pol

WeasyPrint errors:

# Install system dependencies
sudo apt install libpango-1.0-0 libharfbuzz0b

Import errors:

# Recreate the virtual environment and reinstall dependencies
poetry env remove --all
poetry install

Permission errors:

# Fix file permissions
chmod -R 755 uploads/ output/

📞 Support

📧 Email: support@invocr.com
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions
📚 Wiki: Project Wiki

📄 License

This project is licensed under the Apache License - see the LICENSE file for details.

🙏 Acknowledgments

Tesseract OCR - OCR engine
EasyOCR - Neural OCR
FastAPI - Web framework
WeasyPrint - HTML/CSS to PDF
Poetry - Dependency management

Made with ❤️ for the open source community

⭐ Star this repository if you find it useful!

Release history

1.0.16

Jun 18, 2025

1.0.15

Jun 17, 2025

1.0.14

Jun 17, 2025

1.0.13

Jun 17, 2025

1.0.3

Jun 15, 2025

1.0.2

Jun 15, 2025

1.0.1

Jun 15, 2025

This version

1.0.0

Jun 15, 2025

Download files

Source Distribution

invocr-1.0.0.tar.gz (41.6 kB)

Uploaded Jun 15, 2025 Source

Built Distribution

invocr-1.0.0-py3-none-any.whl (45.4 kB)

Uploaded Jun 15, 2025 Python 3

File details

Details for the file invocr-1.0.0.tar.gz.

File metadata

Download URL: invocr-1.0.0.tar.gz
Upload date: Jun 15, 2025
Size: 41.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.11.12 Linux/6.14.9-300.fc42.x86_64

File hashes

Hashes for invocr-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`c390c4e3cd441366302c10e23552a1ad74cd26666902a2136cf6763973177c6f`
MD5	`b9d91716055dc36ac7bedc4ec36df80c`
BLAKE2b-256	`32cff02b40cec8d9ac4f05c789e3f691a5478de49756a9d3150ba12047572255`

File details

Details for the file invocr-1.0.0-py3-none-any.whl.

File metadata

Download URL: invocr-1.0.0-py3-none-any.whl
Upload date: Jun 15, 2025
Size: 45.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.11.12 Linux/6.14.9-300.fc42.x86_64

File hashes

Hashes for invocr-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a3d7d9943851fa2c52d5a1c46b272ffb20b9279ba8161db8b01776fa76eaf95a`
MD5	`6b912989cc8b78167dba544a8fe54602`
BLAKE2b-256	`a1a2951c0f9d2d1338dfe8e09c52369705b06dbc7c87f7d3ac4261bf4533b21b`
