Modular Python pipeline that converts raw receipt documents (images or PDFs) into structured, analytics-ready JSON.
Open Receipt Extractor
Open Receipt Extractor is an open-source, modular Python library that converts raw receipt documents — images or PDFs, including poor-quality photos (wrinkled, skewed, shadowed, faded) — into structured, analytics-ready JSON. It supports bilingual extraction in English and French, covering common Canadian tax regimes (GST/HST/TVQ/QST) and international variations (VAT, sales tax).
The library is designed to be consumed by other developers. You bring the bytes; Open Receipt Extractor handles the extraction. How those bytes arrive — uploaded via a FastAPI endpoint, received as an email attachment, pulled from a cloud bucket, or read from a folder — is entirely up to your application. The library starts at raw bytes and returns a validated Receipt object; what you do with the result is your call.
The pipeline is also designed to be pluggable: OCR engines, preprocessing strategies, and output backends are all interchangeable without modifying core parsing logic.
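One way to picture this kind of pluggability is an interface that the parsing code depends on, with each OCR engine supplied as an interchangeable implementation. The sketch below is illustrative only: `OcrToken`, `OcrEngine`, `FakeOcr`, and `extract_text` are hypothetical names, not the library's actual adapter contract (see CONTRIBUTING.md for that).

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class OcrToken:
    """Hypothetical OCR result: text, bounding box, and confidence."""
    text: str
    bbox: tuple[int, int, int, int]  # x0, y0, x1, y1 in pixels
    confidence: float


class OcrEngine(Protocol):
    """Hypothetical interface; the library's real adapter contract may differ."""

    def recognize(self, image: bytes) -> list[OcrToken]: ...


class FakeOcr:
    """Stand-in engine that returns canned tokens, handy in unit tests."""

    def recognize(self, image: bytes) -> list[OcrToken]:
        return [
            OcrToken("TOTAL", (0, 0, 50, 10), 0.99),
            OcrToken("47.83", (200, 0, 250, 10), 0.97),
        ]


def extract_text(engine: OcrEngine, image: bytes, min_confidence: float = 0.5) -> str:
    """Downstream code depends only on the interface, never on a concrete engine."""
    tokens = engine.recognize(image)
    return " ".join(t.text for t in tokens if t.confidence >= min_confidence)


text = extract_text(FakeOcr(), b"")
```

Swapping engines then means passing a different object that offers the same `recognize` method; nothing in `extract_text` changes.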
What It Produces
For every receipt processed, the pipeline emits:
| Output | Format | Description |
|---|---|---|
| Receipt JSON | Receipt Pydantic model → JSON | Merchant, transaction, amounts, taxes, line items, confidence score |
| Tabular export | receipts + receipt_items rows | Flat rows ready for data warehouse ingestion |
| Debug artifacts | Images, OCR JSON, parse trace | Stored to a configurable backend for audit and continuous improvement |
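To illustrate the tabular export idea, the sketch below splits one nested receipt into a parent row and a list of child item rows. It works on a plain dict, and every key (`receipt_id`, `line_items`, `description`, `amount`, and so on) is an assumption made for the example rather than the library's actual schema or export function.

```python
from decimal import Decimal


def to_rows(receipt: dict) -> tuple[dict, list[dict]]:
    """Split one nested receipt dict into a parent row and child item rows."""
    receipt_row = {
        "receipt_id": receipt["receipt_id"],
        "merchant_name": receipt["merchant"]["name"],
        "total": receipt["amounts"]["total"],
    }
    item_rows = [
        {
            "receipt_id": receipt["receipt_id"],  # foreign key back to the parent row
            "line_no": line_no,
            "description": item["description"],
            "amount": item["amount"],
        }
        for line_no, item in enumerate(receipt["line_items"], start=1)
    ]
    return receipt_row, item_rows


# Made-up sample data, shaped only for this illustration.
sample = {
    "receipt_id": "r-001",
    "merchant": {"name": "GROCERY WORLD"},
    "amounts": {"total": Decimal("47.83")},
    "line_items": [
        {"description": "MILK 2L", "amount": Decimal("5.49")},
        {"description": "BREAD", "amount": Decimal("3.99")},
    ],
}
receipt_row, item_rows = to_rows(sample)
```

Keeping `receipt_id` on every item row is what lets a warehouse join `receipt_items` back to `receipts`.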
Architecture Overview
The pipeline runs through seven sequential stages, each encapsulated in its own module:
Input bytes (image or PDF)
        │
        ▼
1. Document Normalization   ── Detect format; decode bytes into PageImage[]
        │
        ▼
2. Image Preprocessing      ── Generate up to 6 enhanced variants per page
        │
        ▼
3. OCR (Pluggable)          ── Extract text + bounding boxes + confidence
        │
        ▼
4. Layout Reconstruction    ── Group tokens into lines/blocks; detect right-aligned amounts
        │
        ▼
5. Receipt Parsing          ── Extract merchant, date, totals, taxes, line items, payment
        │
        ▼
6. Validation & Scoring     ── Cross-check math; compute parse_confidence; flag for review
        │
        ▼
7. Structured Output        ── Emit validated JSON; persist artifacts
        │
        ▼
Receipt JSON
For full architectural detail, see ARCHITECTURE.md and the design document.
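As a rough picture of what stage 6 means by cross-checking math and flagging for review, the sketch below checks that subtotal plus taxes equals the total within a small rounding tolerance, then compares a confidence score against a threshold. Both function names, the tolerance value, and the direction of the comparison are assumptions for illustration; the library's real scoring rules are described in the design document.

```python
from decimal import Decimal


def totals_are_consistent(
    subtotal: Decimal,
    taxes: list[Decimal],
    total: Decimal,
    tolerance: Decimal = Decimal("0.02"),  # assumed allowance for rounding
) -> bool:
    """Return True when subtotal + taxes matches the printed total."""
    expected = subtotal + sum(taxes, Decimal("0"))
    return abs(expected - total) <= tolerance


def needs_review(parse_confidence: float, threshold: float = 0.50) -> bool:
    """Assumed rule: flag the receipt when confidence falls below the threshold."""
    return parse_confidence < threshold


# Made-up figures: 41.60 subtotal, 2.08 and 4.15 in taxes, 47.83 total.
consistent = totals_are_consistent(
    Decimal("41.60"), [Decimal("2.08"), Decimal("4.15")], Decimal("47.83")
)
flagged = needs_review(0.92)
```

Using `Decimal` rather than `float` keeps cent-level comparisons exact, which matters when the tolerance is only a couple of cents.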
Quick Start
Installation
Install the core package:
pip install open-receipt-extractor
Install with PDF support (recommended):
pip install "open-receipt-extractor[pdf]"
Install with EasyOCR adapter (primary, recommended):
pip install "open-receipt-extractor[pdf,easyocr]"
Note: EasyOCR downloads model weights on first use (~200 MB). No OS-level binaries required.
Basic Usage
from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter
# Build processor with default settings and EasyOCR
settings = Settings()
ocr = EasyOcrAdapter(settings)
processor = ReceiptProcessor(config=settings, ocr_adapter=ocr)
# Process from bytes
with open("receipt.jpg", "rb") as f:
    receipt = processor.process_bytes(f.read(), filename="receipt.jpg")
# Access structured fields
print(receipt.merchant.name) # "GROCERY WORLD"
print(receipt.transaction.datetime) # 2024-03-15T14:32:00
print(receipt.amounts.total) # Decimal('47.83')
print(receipt.quality.parse_confidence) # 0.92
print(receipt.quality.needs_review) # False
# Serialize to JSON
from receipt_processor.output.json_serializer import serialize_receipt
json_output = serialize_receipt(receipt)
Process from a DocumentHandle
from receipt_processor.core.types import DocumentHandle
class FileHandle:
    def __init__(self, path: str) -> None:
        self._path = path

    def get_bytes(self) -> bytes:
        with open(self._path, "rb") as f:
            return f.read()

    def get_metadata(self) -> dict:
        return {"filename": self._path}

receipt = processor.process(FileHandle("receipt.pdf"))
Integration Examples
Because Open Receipt Extractor is a library, ingestion is always your concern. Here are two minimal patterns showing how to connect the library to different delivery channels.
FastAPI file upload
from fastapi import FastAPI, UploadFile
from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter
app = FastAPI()
settings = Settings()
processor = ReceiptProcessor(config=settings, ocr_adapter=EasyOcrAdapter(settings))
@app.post("/receipts/extract")
async def extract_receipt(file: UploadFile):
    data = await file.read()
    # aprocess_bytes is non-blocking, so it is safe to use inside async endpoint handlers
    receipt = await processor.aprocess_bytes(data, filename=file.filename)
    return receipt.model_dump()
Email attachment (imaplib)
import email
import imaplib
from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter
settings = Settings()
processor = ReceiptProcessor(config=settings, ocr_adapter=EasyOcrAdapter(settings))
with imaplib.IMAP4_SSL("imap.example.com") as imap:
    imap.login("user@example.com", "password")
    imap.select("INBOX")
    _, ids = imap.search(None, "UNSEEN")
    for num in ids[0].split():
        _, data = imap.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(data[0][1])
        for part in msg.walk():
            if part.get_content_maintype() == "image":
                receipt = processor.process_bytes(
                    part.get_payload(decode=True),
                    filename=part.get_filename() or "receipt",
                )
                print(receipt.amounts.total)
Configuration
YAML file
Create receipt_processor_config.yaml in your working directory or pass an explicit path:
preprocessing:
  enabled_variants: ["V0", "V1", "V2", "V3", "V4", "V5"]
  fast_mode_variants: ["V0", "V1"]
  pdf_render_dpi: 200
  max_image_size: [4096, 4096]
ocr:
  engine: "easyocr"
  languages: ["en", "fr"]
  detect_orientation: true
parsing:
  region_hint: "CA"  # Canadian tax regime defaults
  currency_hint: "CAD"
validation:
  needs_review_threshold: 0.50
pipeline:
  mode: "balanced"  # fast | balanced | accurate
artifacts:
  store_original: true
  store_ocr_json: true
  store_parse_trace: false
  storage_backend: "local"
  local_path: "./artifacts"
settings = Settings.from_file("receipt_processor_config.yaml")
Environment variable overrides
Every configuration key can be overridden with a prefixed environment variable, which is useful for container deployments:
export RECEIPT_PROCESSOR_OCR__ENGINE=easyocr
export RECEIPT_PROCESSOR_PARSING__REGION_HINT=CA
export RECEIPT_PROCESSOR_PIPELINE__MODE=accurate
export RECEIPT_PROCESSOR_ARTIFACTS__STORAGE_BACKEND=none
settings = Settings.from_env()
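The variable names above suggest a convention: the `RECEIPT_PROCESSOR_` prefix, then the section name, a double underscore, and the key. The sketch below shows how names of that shape could map onto nested configuration keys. `env_to_nested` is a hypothetical helper written for this illustration; how `Settings.from_env()` actually parses the environment, including any type conversion, is not shown here.

```python
def env_to_nested(environ: dict[str, str], prefix: str = "RECEIPT_PROCESSOR_") -> dict:
    """Turn PREFIX_SECTION__KEY=value pairs into {"section": {"key": "value"}}."""
    nested: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # ignore unrelated variables such as PATH
        keys = name[len(prefix):].lower().split("__")
        node = nested
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # descend, creating sections as needed
        node[keys[-1]] = value  # values stay strings in this sketch
    return nested


overrides = env_to_nested(
    {
        "RECEIPT_PROCESSOR_OCR__ENGINE": "easyocr",
        "RECEIPT_PROCESSOR_PIPELINE__MODE": "accurate",
        "PATH": "/usr/bin",
    }
)
```

In a real process the first argument would be `os.environ`; passing a plain dict keeps the example easy to test.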
Documentation
| Document | Description |
|---|---|
| docs/receipt-processing-design.md | Full pipeline design: data model, algorithm detail, extensibility points |
| docs/project-deliverables.md | Master deliverable list with IDs, owners, priorities, and acceptance criteria |
| ARCHITECTURE.md | Module boundaries, interface contracts, data-flow diagram, design decisions |
| CONTRIBUTING.md | Development setup, code style, testing guide, how to add an OCR adapter |
Contributing
Contributions are welcome. Please read CONTRIBUTING.md for the development setup guide, code style requirements, and how to add new features such as OCR adapters or preprocessing variants.
License
MIT License — Copyright © Open Receipt Extractor Contributors