Onflow AWB OCR

Python SDK for extracting receiver information from AWB and shipping label files.

The package first tries to read the text layer of a PDF. For scanned PDFs and image files it falls back to OCR, provided the OCR dependencies are installed.
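
The text-layer-first, OCR-second order can be pictured as a small fallback function. This is only a sketch of the idea, not the package's internals: `extract_text_with_fallback`, `read_text_layer`, and `run_ocr` are hypothetical names, and the two readers are passed in as plain callables.

```python
from typing import Callable, Optional


def extract_text_with_fallback(
    data: bytes,
    read_text_layer: Callable[[bytes], Optional[str]],
    run_ocr: Optional[Callable[[bytes], Optional[str]]] = None,
) -> Optional[str]:
    """Hypothetical helper: prefer the text layer, fall back to OCR."""
    text = read_text_layer(data)
    if text and text.strip():
        # A usable text layer exists, so OCR is never invoked.
        return text
    if run_ocr is None:
        # OCR dependencies are not available: nothing else to try.
        return None
    ocr_text = run_ocr(data)
    return ocr_text if ocr_text and ocr_text.strip() else None
```

Passing `run_ocr=None` models an installation without the optional OCR stack: a scanned file then simply yields no text.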

Requirements

  • Python 3.8+
  • PyMuPDF for PDF text-layer extraction
  • Optional OCR stack for scanned files and images:
    • Tesseract OCR
    • Vietnamese Tesseract language data
    • Poppler pdftoppm

Installation

Install from PyPI:

pip install onflow-awb-ocr

Install with OCR dependencies:

pip install "onflow-awb-ocr[ocr]"

On Ubuntu, install the native OCR tools:

sudo apt install -y tesseract-ocr tesseract-ocr-vie poppler-utils

For local development:

pip install -e ".[ocr]"
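
Before relying on the OCR fallback, it may help to confirm that the native tools are on PATH. The sketch below uses only the standard library; `missing_ocr_tools` is a hypothetical helper, and the executable names `tesseract` and `pdftoppm` are assumed to match what the Ubuntu packages above install.

```python
import shutil

# Assumed executable names for Tesseract OCR and Poppler's pdftoppm.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_ocr_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_ocr_tools()
    if missing:
        print("Missing native OCR tools:", ", ".join(missing))
    else:
        print("All native OCR tools found.")
```

This only checks for the executables. Whether the Vietnamese language data is installed can be checked separately with `tesseract --list-langs`.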

Usage

from onflow_awb_ocr import OnflowAwbOcr

ocr = OnflowAwbOcr(lang="vie+eng")
result = ocr.extract("label.pdf")

print(result)

Example result:

{
    "name": "Nguyen Van A",
    "address": "123 Nguyen Trai\nQuan 1, TP. Ho Chi Minh",
    "strategy": "shopee",
}

If no receiver can be detected, extract() returns None.
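
Because the result may be None, callers should check it before reading any keys. A minimal sketch, assuming the result is a plain dict shaped like the example above; `format_receiver` is a hypothetical helper, not part of the package.

```python
from typing import Optional


def format_receiver(result: Optional[dict]) -> str:
    """Render an extract() result as one line, tolerating None."""
    if result is None:
        return "receiver not detected"
    # The address may span several lines; join them with commas.
    address = ", ".join(
        line.strip() for line in result["address"].splitlines() if line.strip()
    )
    return f"{result['name']} | {address} (strategy: {result['strategy']})"


example = {
    "name": "Nguyen Van A",
    "address": "123 Nguyen Trai\nQuan 1, TP. Ho Chi Minh",
    "strategy": "shopee",
}
print(format_receiver(example))
print(format_receiver(None))
```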

Supported Inputs

extract() accepts:

  • Local file path as str
  • Local file path as pathlib.Path
  • HTTP/HTTPS URL
  • bytes
  • bytearray
  • Binary file-like object

Examples:

from pathlib import Path

from onflow_awb_ocr import OnflowAwbOcr

ocr = OnflowAwbOcr()

# Local path (pathlib.Path or str) and remote URL
from_path = ocr.extract(Path("label.pdf"))
from_url = ocr.extract("https://example.com/label.pdf")

# Binary file-like object
with open("label.pdf", "rb") as file:
    from_file = ocr.extract(file)

# Raw bytes read from an image file
with open("label.png", "rb") as file:
    from_bytes = ocr.extract(file.read())
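
The accepted input kinds can be told apart with ordinary type checks. The sketch below shows one way such a dispatch might look. It is an illustration only and does not reproduce the package's input.py; `classify_input` and its return labels are hypothetical.

```python
import io
from pathlib import Path
from urllib.parse import urlparse


def classify_input(source) -> str:
    """Label a source as 'bytes', 'path', 'url', or 'stream'."""
    if isinstance(source, (bytes, bytearray)):
        return "bytes"
    if isinstance(source, Path):
        return "path"
    if isinstance(source, str):
        # Only http and https strings are treated as URLs; anything
        # else is assumed to be a local file path.
        scheme = urlparse(source).scheme.lower()
        return "url" if scheme in ("http", "https") else "path"
    if hasattr(source, "read"):
        return "stream"
    raise TypeError(f"Unsupported input type: {type(source).__name__}")


print(classify_input("https://example.com/label.pdf"))
print(classify_input(Path("label.pdf")))
print(classify_input(io.BytesIO(b"%PDF-")))
```

Checking bytes and bytearray first keeps raw content from ever being mistaken for a path or URL string.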

Compatibility

The old ReceiverExtractor class name is still available as an alias:

from onflow_awb_ocr import ReceiverExtractor

ocr = ReceiverExtractor()
result = ocr.extract("label.pdf")

Package Structure

  • extractor.py: public OnflowAwbOcr class
  • input.py: input preparation for paths, URLs, bytes, and binary streams
  • text_layer.py: PDF text-layer extraction strategies
  • ocr.py: OCR fallback for scanned PDFs and images
  • postprocess.py: address cleanup
  • types.py, constants.py, utils.py: shared types, constants, and helpers

Publishing

GitHub Actions builds and publishes the package to PyPI on every push to main.

The repository must define this GitHub secret:

PYPI_API_TOKEN

PyPI does not allow replacing an existing version. If a commit on main does not bump project.version in pyproject.toml, the publish step skips the upload because that distribution already exists, so no new release is published.
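
To see in advance whether a push would be skipped, the local version can be compared against the versions already on PyPI. The sketch below is an assumption-laden illustration and not part of the project's workflow: `fetch_released_versions` and `needs_version_bump` are hypothetical helpers, and the lookup uses PyPI's public JSON endpoint.

```python
import json
from urllib.request import urlopen


def fetch_released_versions(project: str, index_url: str = "https://pypi.org/pypi") -> set:
    """Return the set of version strings already released for `project`."""
    with urlopen(f"{index_url}/{project}/json") as response:
        payload = json.load(response)
    # The "releases" mapping is keyed by version string.
    return set(payload["releases"])


def needs_version_bump(local_version: str, released_versions) -> bool:
    """True when `local_version` is already published and cannot be reused."""
    return local_version in released_versions


if __name__ == "__main__":
    released = fetch_released_versions("onflow-awb-ocr")
    print(needs_version_bump("0.1.0", released))
```

The local version is passed in as a string here. Reading it from pyproject.toml is left out to keep the sketch independent of the Python version in use.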

Download files

Source Distribution

onflow_awb_ocr-0.1.0.tar.gz (9.8 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13
  • SHA256: 6cad6d1d820e62c4a38dcfbc52c7e5d0b059ed60d74c26781c74ee5218ddcefd
  • MD5: cfe522617bb71ea4ece347b617dd5ad0
  • BLAKE2b-256: ca57593e46204b247ec3e6f09eb49318b034fc4bf49e9af5989ebea1b4e2d269

Built Distribution

onflow_awb_ocr-0.1.0-py3-none-any.whl (11.5 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13
  • SHA256: 024def625b6a8c9e3232b97a6e6bfa64f2ac07a916ab54085ad5e986824c9c07
  • MD5: a6e21774ebeca1859c2b410851f10687
  • BLAKE2b-256: a5cacd83b3ee47f97ec5d7e9637d9931cc008aea6b16d38f5461af4d3e133b2b
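
A downloaded file can be checked against the SHA256 digests listed above with the standard library alone. This is a minimal sketch: `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the file path in the usage lines is assumed to be wherever the archive was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True when the file's SHA256 equals the published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the source distribution was saved to the working directory:
# matches_published_hash(
#     "onflow_awb_ocr-0.1.0.tar.gz",
#     "6cad6d1d820e62c4a38dcfbc52c7e5d0b059ed60d74c26781c74ee5218ddcefd",
# )
```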
