Modular Python pipeline that converts raw receipt documents (images or PDFs) into structured, analytics-ready JSON.

Open Receipt Extractor

Python 3.12+ | License: MIT

Open Receipt Extractor is an open-source, modular Python library that converts raw receipt documents — images or PDFs, including poor-quality photos (wrinkled, skewed, shadowed, faded) — into structured, analytics-ready JSON. It supports bilingual extraction in English and French, covering common Canadian tax regimes (GST/HST/TVQ/QST) and international variations (VAT, sales tax).

The library is designed to be consumed by other developers. You bring the bytes; Open Receipt Extractor handles the extraction. How those bytes arrive — uploaded via a FastAPI endpoint, received as an email attachment, pulled from a cloud bucket, or read from a folder — is entirely up to your application. The library starts at raw bytes and returns a validated Receipt object; what you do with the result is your call.

The pipeline is also designed to be pluggable: OCR engines, preprocessing strategies, and output backends are all interchangeable without modifying core parsing logic.
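
One way to picture this pluggability is a structural interface that any OCR engine can satisfy. The sketch below is illustrative only: `OcrToken`, `OcrEngine`, `FakeOcr`, and `extract_text` are hypothetical names, not the library's actual interfaces.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class OcrToken:
    """One recognized word with its bounding box and confidence (hypothetical shape)."""
    text: str
    box: tuple[int, int, int, int]  # x0, y0, x1, y1 in pixels
    confidence: float


class OcrEngine(Protocol):
    """Any object with a matching recognize() method counts as an engine."""
    def recognize(self, image: bytes) -> list[OcrToken]: ...


class FakeOcr:
    """Stand-in engine for tests; it returns a fixed token list."""
    def recognize(self, image: bytes) -> list[OcrToken]:
        return [
            OcrToken("TOTAL", (10, 200, 80, 220), 0.98),
            OcrToken("47.83", (300, 200, 360, 220), 0.95),
        ]


def extract_text(engine: OcrEngine, image: bytes) -> str:
    """Downstream code depends only on the protocol, not on a concrete engine."""
    return " ".join(token.text for token in engine.recognize(image))


print(extract_text(FakeOcr(), b"fake image bytes"))
```

Swapping engines then means passing a different object to the same function; the parsing code that consumes the tokens does not change.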


What It Produces

For every receipt processed, the pipeline emits:

| Output | Format | Description |
| --- | --- | --- |
| Receipt JSON | Receipt Pydantic model → JSON | Merchant, transaction, amounts, taxes, line items, confidence score |
| Tabular export | receipts + receipt_items rows | Flat rows ready for data warehouse ingestion |
| Debug artifacts | Images, OCR JSON, parse trace | Stored to a configurable backend for audit and continuous improvement |
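
As a rough illustration of the tabular export, a nested receipt can be split into one parent row and one row per line item. The dictionary layout and column names below are assumptions made for the example; the library's real export schema may differ.

```python
from decimal import Decimal

# Hypothetical receipt payload; the field names are illustrative only.
receipt = {
    "receipt_id": "r-001",
    "merchant": {"name": "GROCERY WORLD"},
    "amounts": {"total": Decimal("47.83")},
    "items": [
        {"description": "Milk 2L", "quantity": 1, "line_total": Decimal("5.49")},
        {"description": "Bread", "quantity": 2, "line_total": Decimal("7.98")},
    ],
}


def to_rows(receipt: dict) -> tuple[dict, list[dict]]:
    """Split one nested receipt into a parent row and child item rows."""
    receipt_row = {
        "receipt_id": receipt["receipt_id"],
        "merchant_name": receipt["merchant"]["name"],
        "total": receipt["amounts"]["total"],
    }
    # Each item row carries the parent key so the two tables can be joined.
    item_rows = [
        {"receipt_id": receipt["receipt_id"], "line_no": i, **item}
        for i, item in enumerate(receipt["items"], start=1)
    ]
    return receipt_row, item_rows


receipt_row, item_rows = to_rows(receipt)
print(receipt_row)
for row in item_rows:
    print(row)
```

Keeping `receipt_id` on every item row is what makes the flat rows usable in a warehouse without the original nesting.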

Architecture Overview

The pipeline runs through seven sequential stages, each encapsulated in its own module:

Input bytes (image or PDF)
        │
        ▼
 1. Document Normalization    ── Detect format; decode bytes into PageImage[]
        │
        ▼
 2. Image Preprocessing       ── Generate up to 6 enhanced variants per page
        │
        ▼
 3. OCR (Pluggable)           ── Extract text + bounding boxes + confidence
        │
        ▼
 4. Layout Reconstruction     ── Group tokens into lines/blocks; detect right-aligned amounts
        │
        ▼
 5. Receipt Parsing           ── Extract merchant, date, totals, taxes, line items, payment
        │
        ▼
 6. Validation & Scoring      ── Cross-check math; compute parse_confidence; flag for review
        │
        ▼
 7. Structured Output         ── Emit validated JSON; persist artifacts
        │
        ▼
     Receipt JSON

For full architectural detail, see ARCHITECTURE.md and the design document (docs/receipt-processing-design.md).
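
The math cross-check in stage 6 can be sketched with a Quebec receipt, where GST (5%) and QST (9.975%) are both applied to the subtotal. This is a minimal sketch under stated assumptions: the helper names, the one-cent tolerance, and the review rule are hypothetical and not taken from the library.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
GST_RATE = Decimal("0.05")      # federal GST
QST_RATE = Decimal("0.09975")   # Quebec QST (TVQ)


def round_cents(value: Decimal) -> Decimal:
    """Round a monetary amount to two decimal places."""
    return value.quantize(CENT, rounding=ROUND_HALF_UP)


def totals_agree(subtotal: Decimal, taxes: list[Decimal], total: Decimal,
                 tolerance: Decimal = CENT) -> bool:
    """True when subtotal + taxes matches the printed total within tolerance."""
    return abs(subtotal + sum(taxes, Decimal("0")) - total) <= tolerance


subtotal = Decimal("41.60")
gst = round_cents(subtotal * GST_RATE)
qst = round_cents(subtotal * QST_RATE)
ok = totals_agree(subtotal, [gst, qst], Decimal("47.83"))

# Hypothetical review rule: flag when the math fails or confidence is low.
parse_confidence = 0.92
needs_review = (not ok) or parse_confidence < 0.50
print(gst, qst, ok, needs_review)  # prints: 2.08 4.15 True False
```

Using `Decimal` rather than `float` avoids binary rounding drift, which matters when the check is an equality within one cent.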


Quick Start

Installation

Install the core package:

pip install open-receipt-extractor

Install with PDF support (recommended):

pip install "open-receipt-extractor[pdf]"

Install with the EasyOCR adapter (the primary, recommended OCR engine):

pip install "open-receipt-extractor[pdf,easyocr]"

Note: EasyOCR downloads model weights on first use (~200 MB). No OS-level binaries required.

Basic Usage

from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter

# Build processor with default settings and EasyOCR
settings = Settings()
ocr = EasyOcrAdapter(settings)
processor = ReceiptProcessor(config=settings, ocr_adapter=ocr)

# Process from bytes
with open("receipt.jpg", "rb") as f:
    receipt = processor.process_bytes(f.read(), filename="receipt.jpg")

# Access structured fields
print(receipt.merchant.name)               # "GROCERY WORLD"
print(receipt.transaction.datetime)        # 2024-03-15T14:32:00
print(receipt.amounts.total)               # Decimal('47.83')
print(receipt.quality.parse_confidence)    # 0.92
print(receipt.quality.needs_review)        # False

# Serialize to JSON
from receipt_processor.output.json_serializer import serialize_receipt
json_output = serialize_receipt(receipt)

Process from a DocumentHandle

from receipt_processor.core.types import DocumentHandle

class FileHandle:
    def __init__(self, path: str) -> None:
        self._path = path

    def get_bytes(self) -> bytes:
        with open(self._path, "rb") as f:
            return f.read()

    def get_metadata(self) -> dict:
        return {"filename": self._path}

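# FileHandle is passed where a DocumentHandle is expected: it supplies
# get_bytes() and get_metadata(). `processor` is the one built in Basic Usage.
handle: DocumentHandle = FileHandle("receipt.pdf")
receipt = processor.process(handle)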

Integration Examples

Because Open Receipt Extractor is a library, ingestion is always your concern. Here are two minimal patterns showing how to connect the library to different delivery channels.

FastAPI file upload

from fastapi import FastAPI, UploadFile
from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter

app = FastAPI()
settings = Settings()
processor = ReceiptProcessor(config=settings, ocr_adapter=EasyOcrAdapter(settings))

@app.post("/receipts/extract")
async def extract_receipt(file: UploadFile):
    data = await file.read()
    # aprocess_bytes is non-blocking — safe to use inside async endpoint handlers
    receipt = await processor.aprocess_bytes(data, filename=file.filename)
    return receipt.model_dump()
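
For context, one common way to offer a non-blocking entry point over synchronous work is to run it in a worker thread. The sketch below shows that general pattern with a stand-in function; it is not a description of how `aprocess_bytes` is implemented, and `process_bytes_sync` and `aprocess` are hypothetical names.

```python
import asyncio
import time


def process_bytes_sync(data: bytes) -> dict:
    """Hypothetical blocking call standing in for a synchronous pipeline."""
    time.sleep(0.05)  # simulate OCR work
    return {"size": len(data)}


async def aprocess(data: bytes) -> dict:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(process_bytes_sync, data)


async def main() -> list[dict]:
    # The two calls overlap instead of running back to back.
    return await asyncio.gather(aprocess(b"abc"), aprocess(b"defgh"))


results = asyncio.run(main())
print(results)  # prints: [{'size': 3}, {'size': 5}]
```

Threads help most when the blocking work releases the GIL or waits on I/O; heavily CPU-bound work may be better served by a process pool.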

Email attachment (imaplib)

import email
import imaplib
from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter

settings = Settings()
processor = ReceiptProcessor(config=settings, ocr_adapter=EasyOcrAdapter(settings))

with imaplib.IMAP4_SSL("imap.example.com") as imap:
    imap.login("user@example.com", "password")  # placeholder; load real credentials from a secret store
    imap.select("INBOX")
    _, ids = imap.search(None, "UNSEEN")
    for num in ids[0].split():
        _, data = imap.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(data[0][1])
        for part in msg.walk():
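            # Accept image attachments and PDFs, the two input types the pipeline handles
            if part.get_content_maintype() == "image" or part.get_content_type() == "application/pdf":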
                receipt = processor.process_bytes(
                    part.get_payload(decode=True),
                    filename=part.get_filename() or "receipt",
                )
                print(receipt.amounts.total)
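
The attachment filtering in the loop above can be factored into a small helper and exercised without a mail server. This sketch uses only the standard library; `iter_receipt_attachments` is a hypothetical helper name, and the message is built in memory purely for demonstration.

```python
from email.message import EmailMessage
from typing import Iterator


def iter_receipt_attachments(msg: EmailMessage) -> Iterator[tuple[str, bytes]]:
    """Yield (filename, payload) for image and PDF parts of a message."""
    for part in msg.walk():
        is_image = part.get_content_maintype() == "image"
        is_pdf = part.get_content_type() == "application/pdf"
        if not (is_image or is_pdf):
            continue
        payload = part.get_payload(decode=True)
        if payload:  # skip empty or undecodable parts
            yield part.get_filename() or "receipt", payload


# Build a small message in memory so the sketch runs offline.
msg = EmailMessage()
msg["Subject"] = "Receipts"
msg.set_content("See attached.")
msg.add_attachment(b"\xff\xd8fake-jpeg", maintype="image", subtype="jpeg",
                   filename="lunch.jpg")
msg.add_attachment(b"%PDF-1.4 fake", maintype="application", subtype="pdf",
                   filename="hotel.pdf")

attachments = list(iter_receipt_attachments(msg))
print([name for name, _ in attachments])  # prints: ['lunch.jpg', 'hotel.pdf']
```

Each yielded pair can be handed to `processor.process_bytes(payload, filename=name)` as in the example above.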

Configuration

YAML file

Create receipt_processor_config.yaml in your working directory or pass an explicit path:

preprocessing:
  enabled_variants: ["V0", "V1", "V2", "V3", "V4", "V5"]
  fast_mode_variants: ["V0", "V1"]
  pdf_render_dpi: 200
  max_image_size: [4096, 4096]

ocr:
  engine: "easyocr"
  languages: ["en", "fr"]
  detect_orientation: true

parsing:
  region_hint: "CA"        # Canadian tax regime defaults
  currency_hint: "CAD"

validation:
  needs_review_threshold: 0.50

pipeline:
  mode: "balanced"          # fast | balanced | accurate

artifacts:
  store_original: true
  store_ocr_json: true
  store_parse_trace: false
  storage_backend: "local"
  local_path: "./artifacts"

Then load the settings from that file:

settings = Settings.from_file("receipt_processor_config.yaml")
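
Values such as the pipeline mode and the review threshold have obvious valid ranges, so it can help to check them as soon as the file is read. The sketch below shows that idea on a nested mapping shaped like the YAML above; `PipelineOptions` and `from_mapping` are hypothetical and say nothing about how `Settings` validates its input.

```python
from dataclasses import dataclass

ALLOWED_MODES = ("fast", "balanced", "accurate")


@dataclass(frozen=True)
class PipelineOptions:
    """Hypothetical subset of the config, checked on construction."""
    mode: str = "balanced"
    needs_review_threshold: float = 0.50
    pdf_render_dpi: int = 200

    def __post_init__(self) -> None:
        if self.mode not in ALLOWED_MODES:
            raise ValueError(f"mode must be one of {ALLOWED_MODES}, got {self.mode!r}")
        if not 0.0 <= self.needs_review_threshold <= 1.0:
            raise ValueError("needs_review_threshold must be between 0 and 1")
        if self.pdf_render_dpi <= 0:
            raise ValueError("pdf_render_dpi must be positive")


def from_mapping(raw: dict) -> PipelineOptions:
    """Build options from a nested mapping, falling back to defaults."""
    return PipelineOptions(
        mode=raw.get("pipeline", {}).get("mode", "balanced"),
        needs_review_threshold=raw.get("validation", {}).get("needs_review_threshold", 0.50),
        pdf_render_dpi=raw.get("preprocessing", {}).get("pdf_render_dpi", 200),
    )


options = from_mapping({"pipeline": {"mode": "accurate"}})
print(options)

try:
    from_mapping({"pipeline": {"mode": "turbo"}})
except ValueError as exc:
    print("rejected:", exc)
```

Failing at load time turns a typo in the config into an immediate error instead of a surprising result several stages later.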

Environment variable overrides

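Every configuration key can be overridden with an environment variable named RECEIPT_PROCESSOR_<SECTION>__<KEY>: the RECEIPT_PROCESSOR_ prefix, followed by the section and key in upper case, joined by a double underscore. This is useful for container deployments: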

export RECEIPT_PROCESSOR_OCR__ENGINE=easyocr
export RECEIPT_PROCESSOR_PARSING__REGION_HINT=CA
export RECEIPT_PROCESSOR_PIPELINE__MODE=accurate
export RECEIPT_PROCESSOR_ARTIFACTS__STORAGE_BACKEND=none

Then build the settings from the environment:

settings = Settings.from_env()
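
The naming scheme in the exports above maps naturally onto the nested YAML sections. As a sketch of that mapping, assuming the double underscore separates section from key and names are lower-cased (the helper `env_to_nested` is hypothetical, and values are left as strings):

```python
PREFIX = "RECEIPT_PROCESSOR_"
DELIMITER = "__"


def env_to_nested(environ: dict[str, str]) -> dict:
    """Turn PREFIX + SECTION__KEY=value pairs into {"section": {"key": value}}."""
    result: dict = {}
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue  # unrelated variable
        path = name[len(PREFIX):].lower().split(DELIMITER)
        node = result
        for segment in path[:-1]:
            node = node.setdefault(segment, {})
        node[path[-1]] = value
    return result


example_env = {
    "RECEIPT_PROCESSOR_OCR__ENGINE": "easyocr",
    "RECEIPT_PROCESSOR_PIPELINE__MODE": "accurate",
    "PATH": "/usr/bin",  # ignored: no prefix
}
overrides = env_to_nested(example_env)
print(overrides)  # prints: {'ocr': {'engine': 'easyocr'}, 'pipeline': {'mode': 'accurate'}}
```

In a real application the same function could be called with `dict(os.environ)`; a fixed dictionary is used here so the result is reproducible.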

Documentation

| Document | Description |
| --- | --- |
| docs/receipt-processing-design.md | Full pipeline design: data model, algorithm detail, extensibility points |
| docs/project-deliverables.md | Master deliverable list with IDs, owners, priorities, and acceptance criteria |
| ARCHITECTURE.md | Module boundaries, interface contracts, data-flow diagram, design decisions |
| CONTRIBUTING.md | Development setup, code style, testing guide, how to add an OCR adapter |

Contributing

Contributions are welcome. Please read CONTRIBUTING.md for the development setup guide, code style requirements, and how to add new features such as OCR adapters or preprocessing variants.


License

MIT License — Copyright © Open Receipt Extractor Contributors
