Modular Python pipeline that converts raw receipt documents (images or PDFs) into structured, analytics-ready JSON.

Open Receipt Extractor

Open Receipt Extractor is an open-source, modular Python library that converts raw receipt documents — images or PDFs, including poor-quality photos (wrinkled, skewed, shadowed, faded) — into structured, analytics-ready JSON. It supports bilingual extraction in English and French, covering common Canadian tax regimes (GST/HST/TVQ/QST) and international variations (VAT, sales tax).

The library is designed to be consumed by other developers. You bring the bytes; Open Receipt Extractor handles the extraction. How those bytes arrive — uploaded via a FastAPI endpoint, received as an email attachment, pulled from a cloud bucket, or read from a folder — is entirely up to your application. The library starts at raw bytes and returns a validated Receipt object; what you do with the result is your call.

The pipeline is also designed to be pluggable: OCR engines, preprocessing strategies, and output backends are all interchangeable without modifying core parsing logic.
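One common way to get this kind of pluggability in Python is structural typing with `typing.Protocol`. The sketch below is illustrative only: `OcrToken`, `OcrEngine`, `FakeOcrEngine`, and `extract_text` are hypothetical names, not the library's actual interfaces (see CONTRIBUTING.md for the real adapter contract).

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class OcrToken:
    """One recognized text fragment (hypothetical shape, for illustration)."""
    text: str
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    confidence: float                # 0.0 to 1.0


class OcrEngine(Protocol):
    """Anything with a matching recognize() method counts as an engine."""
    def recognize(self, image_bytes: bytes) -> list[OcrToken]: ...


class FakeOcrEngine:
    """Stand-in engine that returns canned tokens; handy for unit tests."""
    def recognize(self, image_bytes: bytes) -> list[OcrToken]:
        return [OcrToken("TOTAL", (10, 200, 80, 220), 0.98),
                OcrToken("47.83", (300, 200, 360, 220), 0.95)]


def extract_text(engine: OcrEngine, image_bytes: bytes) -> str:
    """Core logic depends only on the protocol, never on a concrete engine."""
    return " ".join(token.text for token in engine.recognize(image_bytes))


print(extract_text(FakeOcrEngine(), b"raw image bytes"))
```

Because `extract_text` only knows about the protocol, swapping in a different engine requires no change to the calling code.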

What It Produces

For every receipt processed, the pipeline emits:

Output	Format	Description
Receipt JSON	`Receipt` Pydantic model → JSON	Merchant, transaction, amounts, taxes, line items, confidence score
Tabular export	`receipts` + `receipt_items` rows	Flat rows ready for data warehouse ingestion
Debug artifacts	Images, OCR JSON, parse trace	Stored to a configurable backend for audit and continuous improvement
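To make the relationship between the nested JSON and the two tabular outputs concrete, here is a minimal sketch of flattening one receipt into a `receipts` row and several `receipt_items` rows. The function name `flatten_receipt`, the `receipt_id` and `line_no` columns, and the item keys are assumptions for illustration; the library's own export may use different column names.

```python
def flatten_receipt(receipt_id: str, receipt: dict) -> tuple[dict, list[dict]]:
    """Split one nested receipt dict into a header row and item rows."""
    header = {
        "receipt_id": receipt_id,
        "merchant_name": receipt["merchant"]["name"],
        "total": receipt["amounts"]["total"],
        "parse_confidence": receipt["quality"]["parse_confidence"],
    }
    # Each item row carries the receipt_id so the two tables can be joined.
    items = [
        {"receipt_id": receipt_id, "line_no": n, **item}
        for n, item in enumerate(receipt.get("line_items", []), start=1)
    ]
    return header, items


sample = {
    "merchant": {"name": "GROCERY WORLD"},
    "amounts": {"total": "47.83"},
    "quality": {"parse_confidence": 0.92},
    "line_items": [  # item keys here are hypothetical
        {"description": "MILK 2L", "amount": "5.49"},
        {"description": "BREAD", "amount": "3.29"},
    ],
}
receipt_row, item_rows = flatten_receipt("r-001", sample)
print(receipt_row)
print(item_rows)
```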

Architecture Overview

The pipeline runs through seven sequential stages, each encapsulated in its own module:

Input bytes (image or PDF)
        │
        ▼
 1. Document Normalization    ── Detect format; decode bytes into PageImage[]
        │
        ▼
 2. Image Preprocessing       ── Generate up to 6 enhanced variants per page
        │
        ▼
 3. OCR (Pluggable)           ── Extract text + bounding boxes + confidence
        │
        ▼
 4. Layout Reconstruction     ── Group tokens into lines/blocks; detect right-aligned amounts
        │
        ▼
 5. Receipt Parsing           ── Extract merchant, date, totals, taxes, line items, payment
        │
        ▼
 6. Validation & Scoring      ── Cross-check math; compute parse_confidence; flag for review
        │
        ▼
 7. Structured Output         ── Emit validated JSON; persist artifacts
        │
        ▼
     Receipt JSON

For full architectural detail, see ARCHITECTURE.md and the design document.
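As a rough illustration of what stage 6 does, the sketch below cross-checks that subtotal plus taxes equals the printed total and then decides whether a receipt should be flagged. The function names, the one-cent tolerance, and the exact flagging rule are assumptions; only the 0.50 default mirrors the `needs_review_threshold` value shown in the configuration section.

```python
from decimal import Decimal


def check_totals(subtotal: Decimal, taxes: list[Decimal], total: Decimal,
                 tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True when subtotal plus all taxes matches the printed total."""
    return abs(subtotal + sum(taxes, Decimal("0")) - total) <= tolerance


def needs_review(parse_confidence: float, math_ok: bool,
                 threshold: float = 0.50) -> bool:
    """Flag a receipt when confidence is low or the arithmetic disagrees."""
    return parse_confidence < threshold or not math_ok


# Example: 41.60 subtotal with 5% GST (2.08) and 9.975% QST (4.15)
math_ok = check_totals(Decimal("41.60"), [Decimal("2.08"), Decimal("4.15")],
                       Decimal("47.83"))
print(math_ok)
print(needs_review(0.92, math_ok))
```

Using `Decimal` rather than `float` keeps the comparison exact to the cent, which matters when a one-cent mismatch is the signal being tested.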

Quick Start

Installation

Install the core package:

pip install open-receipt-extractor

Install with PDF support (recommended):

pip install "open-receipt-extractor[pdf]"

Install with EasyOCR adapter (primary, recommended):

pip install "open-receipt-extractor[pdf,easyocr]"

Note: EasyOCR downloads model weights on first use (~200 MB). No OS-level binaries required.

Basic Usage

from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter

# Build processor with default settings and EasyOCR
settings = Settings()
ocr = EasyOcrAdapter(settings)
processor = ReceiptProcessor(config=settings, ocr_adapter=ocr)

# Process from bytes
with open("receipt.jpg", "rb") as f:
    receipt = processor.process_bytes(f.read(), filename="receipt.jpg")

# Access structured fields
print(receipt.merchant.name)               # "GROCERY WORLD"
print(receipt.transaction.datetime)        # 2024-03-15T14:32:00
print(receipt.amounts.total)               # Decimal('47.83')
print(receipt.quality.parse_confidence)    # 0.92
print(receipt.quality.needs_review)        # False

# Serialize to JSON
from receipt_processor.output.json_serializer import serialize_receipt
json_output = serialize_receipt(receipt)
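If you assemble your own JSON from receipt fields instead of calling `serialize_receipt`, note that the standard `json` module cannot encode `Decimal` or `datetime` values directly. The following is a minimal standard-library sketch, not the library's serializer; `to_jsonable` is a hypothetical helper name.

```python
import json
from datetime import datetime
from decimal import Decimal


def to_jsonable(value):
    """Fallback encoder for types the json module cannot handle natively."""
    if isinstance(value, Decimal):
        return str(value)          # keep exact cents; floats could drift
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


record = {"total": Decimal("47.83"), "datetime": datetime(2024, 3, 15, 14, 32)}
encoded = json.dumps(record, default=to_jsonable)
print(encoded)
```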

Process from a `DocumentHandle`

from receipt_processor.core.types import DocumentHandle

class FileHandle:
    def __init__(self, path: str) -> None:
        self._path = path

    def get_bytes(self) -> bytes:
        with open(self._path, "rb") as f:
            return f.read()

    def get_metadata(self) -> dict:
        return {"filename": self._path}

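# FileHandle is meant to satisfy DocumentHandle: it provides the same two methods
handle: DocumentHandle = FileHandle("receipt.pdf")
receipt = processor.process(handle)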
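The same two-method shape works for sources other than files. Below is a hedged sketch of a handle that wraps bytes already held in memory, such as an object downloaded from a cloud bucket. `HandleLike` is a local stand-in that mirrors the `get_bytes` and `get_metadata` methods of `FileHandle` above; it is not the library's `DocumentHandle`, and `BytesHandle` is a hypothetical name.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HandleLike(Protocol):
    """Local stand-in mirroring the two methods used by FileHandle above."""
    def get_bytes(self) -> bytes: ...
    def get_metadata(self) -> dict: ...


class BytesHandle:
    """Wraps bytes already held in memory, e.g. a downloaded bucket object."""
    def __init__(self, data: bytes, filename: str) -> None:
        self._data = data
        self._filename = filename

    def get_bytes(self) -> bytes:
        return self._data

    def get_metadata(self) -> dict:
        return {"filename": self._filename}


handle = BytesHandle(b"%PDF-1.7 ...", "receipt.pdf")
print(isinstance(handle, HandleLike))
print(handle.get_metadata())
```

Assuming the processor accepts any object with these two methods, an instance like `handle` could be passed to `processor.process(...)` in the same way as `FileHandle`.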

Integration Examples

Because Open Receipt Extractor is a library, ingestion is always your concern. Here are two minimal patterns showing how to connect the library to different delivery channels.

FastAPI file upload

from fastapi import FastAPI, UploadFile
from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter

app = FastAPI()
settings = Settings()
processor = ReceiptProcessor(config=settings, ocr_adapter=EasyOcrAdapter(settings))

@app.post("/receipts/extract")
async def extract_receipt(file: UploadFile):
    data = await file.read()
    # aprocess_bytes is non-blocking — safe to use inside async endpoint handlers
    receipt = await processor.aprocess_bytes(data, filename=file.filename)
    return receipt.model_dump()

Email attachment (imaplib)

import email
import imaplib
from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter

settings = Settings()
processor = ReceiptProcessor(config=settings, ocr_adapter=EasyOcrAdapter(settings))

with imaplib.IMAP4_SSL("imap.example.com") as imap:
    imap.login("user@example.com", "password")
    imap.select("INBOX")
    _, ids = imap.search(None, "UNSEEN")
    for num in ids[0].split():
        _, data = imap.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(data[0][1])
        for part in msg.walk():
            if part.get_content_maintype() == "image":
                receipt = processor.process_bytes(
                    part.get_payload(decode=True),
                    filename=part.get_filename() or "receipt",
                )
                print(receipt.amounts.total)
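The loop above only picks up `image/*` parts, so receipts sent as PDF attachments would be skipped. A small helper along the following lines could collect both kinds before handing the bytes to `process_bytes`. This is a standard-library sketch; `iter_receipt_attachments` is a hypothetical name, and the in-memory message exists only to exercise it.

```python
from email.message import EmailMessage
from typing import Iterator


def iter_receipt_attachments(msg: EmailMessage) -> Iterator[tuple[str, bytes]]:
    """Yield (filename, payload) for each image or PDF part of a message."""
    for part in msg.walk():
        is_image = part.get_content_maintype() == "image"
        is_pdf = part.get_content_type() == "application/pdf"
        if not (is_image or is_pdf):
            continue
        payload = part.get_payload(decode=True)
        if payload:  # skip empty or undecodable parts
            yield part.get_filename() or "receipt", payload


# Build a small message in memory to exercise the helper
msg = EmailMessage()
msg["Subject"] = "Receipts"
msg.set_content("See attached.")
msg.add_attachment(b"%PDF-1.7 fake", maintype="application", subtype="pdf",
                   filename="march.pdf")
msg.add_attachment(b"\x89PNG fake", maintype="image", subtype="png",
                   filename="lunch.png")

for name, data in iter_receipt_attachments(msg):
    print(name, len(data))
```

The helper uses only `walk()`, `get_content_type()`, `get_filename()`, and `get_payload(decode=True)`, so it should also work on the message objects returned by `email.message_from_bytes`.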

Configuration

YAML file

Create receipt_processor_config.yaml in your working directory or pass an explicit path:

preprocessing:
  enabled_variants: ["V0", "V1", "V2", "V3", "V4", "V5"]
  fast_mode_variants: ["V0", "V1"]
  pdf_render_dpi: 200
  max_image_size: [4096, 4096]

ocr:
  engine: "easyocr"
  languages: ["en", "fr"]
  detect_orientation: true

parsing:
  region_hint: "CA"        # Canadian tax regime defaults
  currency_hint: "CAD"

validation:
  needs_review_threshold: 0.50

pipeline:
  mode: "balanced"          # fast | balanced | accurate

artifacts:
  store_original: true
  store_ocr_json: true
  store_parse_trace: false
  storage_backend: "local"
  local_path: "./artifacts"

settings = Settings.from_file("receipt_processor_config.yaml")

Environment variable overrides

Every configuration key can be overridden with an environment variable named RECEIPT_PROCESSOR_<SECTION>__<KEY>, where a double underscore separates the section from the key. This is useful for container deployments:

export RECEIPT_PROCESSOR_OCR__ENGINE=easyocr
export RECEIPT_PROCESSOR_PARSING__REGION_HINT=CA
export RECEIPT_PROCESSOR_PIPELINE__MODE=accurate
export RECEIPT_PROCESSOR_ARTIFACTS__STORAGE_BACKEND=none

settings = Settings.from_env()
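To illustrate the naming convention, the sketch below maps prefixed variables with a double-underscore separator onto a nested dictionary that mirrors the YAML sections. It shows the convention only and is not how `Settings.from_env()` is implemented; `env_to_nested` is a hypothetical helper, and values are left as strings without type coercion.

```python
def env_to_nested(environ: dict[str, str],
                  prefix: str = "RECEIPT_PROCESSOR_") -> dict:
    """Turn PREFIX_SECTION__KEY=value pairs into {"section": {"key": value}}."""
    nested: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        # Everything before the last "__" is a section path; the rest is the key.
        *sections, key = name[len(prefix):].lower().split("__")
        node = nested
        for section in sections:
            node = node.setdefault(section, {})
        node[key] = value
    return nested


overrides = env_to_nested({
    "RECEIPT_PROCESSOR_OCR__ENGINE": "easyocr",
    "RECEIPT_PROCESSOR_PIPELINE__MODE": "accurate",
    "PATH": "/usr/bin",  # ignored: no matching prefix
})
print(overrides)
```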

Documentation

Document	Description
docs/receipt-processing-design.md	Full pipeline design: data model, algorithm detail, extensibility points
docs/project-deliverables.md	Master deliverable list with IDs, owners, priorities, and acceptance criteria
ARCHITECTURE.md	Module boundaries, interface contracts, data-flow diagram, design decisions
CONTRIBUTING.md	Development setup, code style, testing guide, how to add an OCR adapter

Contributing

Contributions are welcome. Please read CONTRIBUTING.md for the development setup guide, code style requirements, and how to add new features such as OCR adapters or preprocessing variants.

License

Project details

This version

0.1.0

Mar 2, 2026

Download files

Source Distribution

open_receipt_extractor-0.1.0.tar.gz (226.9 kB)

Uploaded Mar 2, 2026 Source

Built Distribution

open_receipt_extractor-0.1.0-py3-none-any.whl (109.4 kB)

Uploaded Mar 2, 2026 Python 3

File details

Details for the file open_receipt_extractor-0.1.0.tar.gz.

File metadata

Download URL: open_receipt_extractor-0.1.0.tar.gz
Upload date: Mar 2, 2026
Size: 226.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for open_receipt_extractor-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`37dbe2080a733b59abcd64a1c1d7ab40e5eb0aff33527a333f7f20f77579d9ec`
MD5	`0d675f8daf7dca53988bd0f56d086386`
BLAKE2b-256	`39a41c1dd9d9ab406cbee00a9ce48a1934c80f3f5f467990f3dc6f416d6745bc`

Provenance

The following attestation bundles were made for open_receipt_extractor-0.1.0.tar.gz:

Publisher: publish.yml on malekatwiz/open-receipt-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_receipt_extractor-0.1.0.tar.gz
- Subject digest: 37dbe2080a733b59abcd64a1c1d7ab40e5eb0aff33527a333f7f20f77579d9ec
- Sigstore transparency entry: 1011614343
- Sigstore integration time: Mar 2, 2026
Source repository:
- Permalink: malekatwiz/open-receipt-extractor@ed5e51948d3fdc9a1dcd5501bab7600f7dd6476d
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/malekatwiz
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ed5e51948d3fdc9a1dcd5501bab7600f7dd6476d
- Trigger Event: push

File details

Details for the file open_receipt_extractor-0.1.0-py3-none-any.whl.

File metadata

Download URL: open_receipt_extractor-0.1.0-py3-none-any.whl
Upload date: Mar 2, 2026
Size: 109.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for open_receipt_extractor-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`27fc3d74bce6631e22487821161f996e8b0409a18a398d2ae3aba75c368f5510`
MD5	`1877a15c3c6d7135e59c6215380ae51f`
BLAKE2b-256	`5937fee4505940deb3724333a2003a4b6c82b6ea7e52cae336eddc1a236ef4ee`

Provenance

The following attestation bundles were made for open_receipt_extractor-0.1.0-py3-none-any.whl:

Publisher: publish.yml on malekatwiz/open-receipt-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_receipt_extractor-0.1.0-py3-none-any.whl
- Subject digest: 27fc3d74bce6631e22487821161f996e8b0409a18a398d2ae3aba75c368f5510
- Sigstore transparency entry: 1011614418
- Sigstore integration time: Mar 2, 2026
Source repository:
- Permalink: malekatwiz/open-receipt-extractor@ed5e51948d3fdc9a1dcd5501bab7600f7dd6476d
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/malekatwiz
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ed5e51948d3fdc9a1dcd5501bab7600f7dd6476d
- Trigger Event: push
