Onflow AWB OCR
Python SDK for extracting receiver information from AWB and shipping label files.
The package supports PDF files with a text layer first, then falls back to OCR for scanned PDFs and image files when OCR dependencies are installed.
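The text-layer-first behaviour described above can be pictured with a minimal sketch. This is not the package's internal code: `extract_with_fallback`, `text_layer`, and `ocr` are hypothetical names used only to illustrate the order in which strategies are tried.

```python
from typing import Callable, Dict, Optional

# Hypothetical result shape, modelled on the example result in this README.
Receiver = Dict[str, str]
Strategy = Callable[[bytes], Optional[Receiver]]


def extract_with_fallback(
    data: bytes,
    text_layer: Strategy,
    ocr: Optional[Strategy] = None,
) -> Optional[Receiver]:
    """Try the cheap text-layer strategy first, then OCR if it is available."""
    result = text_layer(data)
    if result is not None:
        return result
    if ocr is None:
        # Assumption: without the OCR dependencies there is nothing left to try.
        return None
    return ocr(data)


# Stub strategies standing in for real extraction logic.
def no_text_layer(data: bytes) -> Optional[Receiver]:
    return None  # pretend the PDF is a scan with no text layer


def fake_ocr(data: bytes) -> Optional[Receiver]:
    return {"name": "Nguyen Van A", "address": "123 Nguyen Trai", "strategy": "ocr"}


print(extract_with_fallback(b"%PDF-", no_text_layer, fake_ocr))
```

Passing `ocr=None` models an installation without the optional OCR stack, in which case a scanned file simply yields `None`.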
Requirements
- Python 3.8+
- PyMuPDF for PDF text-layer extraction
- Optional OCR stack for scanned files and images:
  - Tesseract OCR
  - Vietnamese Tesseract language data
  - Poppler (provides pdftoppm)
Installation
Install from PyPI:
pip install onflow-awb-ocr
Install with OCR dependencies:
pip install "onflow-awb-ocr[ocr]"
On Ubuntu, install the native OCR tools:
sudo apt install -y tesseract-ocr tesseract-ocr-vie poppler-utils
For local development:
pip install -e ".[ocr]"
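Before relying on the OCR fallback, it can help to confirm that the native tools are reachable. The sketch below uses only the standard library. The helper name `missing_ocr_tools` is hypothetical, and the executable names `tesseract` and `pdftoppm` are assumed from the Ubuntu packages listed above.

```python
import shutil
from typing import List, Sequence

# Executable names assumed from the tesseract-ocr and poppler-utils packages.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_ocr_tools(tools: Sequence[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_ocr_tools()
if missing:
    print("Missing native OCR tools:", ", ".join(missing))
else:
    print("All native OCR tools found.")
```

This only checks that the executables exist. It does not verify that the Vietnamese language data is installed.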
Usage
from onflow_awb_ocr import OnflowAwbOcr
ocr = OnflowAwbOcr(lang="vie+eng")
result = ocr.extract("label.pdf")
print(result)
Example result:
{
    "name": "Nguyen Van A",
    "address": "123 Nguyen Trai\nQuan 1, TP. Ho Chi Minh",
    "strategy": "shopee",
}
If no receiver can be detected, extract() returns None.
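Because extract() may return None, callers should check the result before reading its keys. The helper below is a minimal sketch, not part of the package: `format_receiver` is a hypothetical name, and it assumes the result is a dict with the three keys shown in the example result.

```python
from typing import Dict, Optional


def format_receiver(result: Optional[Dict[str, str]]) -> str:
    """Render an extract() result as one line, handling the None case."""
    if result is None:
        return "No receiver detected"
    # The address may span several lines; join them for single-line display.
    address = ", ".join(
        line.strip() for line in result["address"].splitlines() if line.strip()
    )
    return f'{result["name"]}: {address} [{result["strategy"]}]'


example = {
    "name": "Nguyen Van A",
    "address": "123 Nguyen Trai\nQuan 1, TP. Ho Chi Minh",
    "strategy": "shopee",
}
print(format_receiver(example))
print(format_receiver(None))
```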
Supported Inputs
extract() accepts:
- Local file path as str
- Local file path as pathlib.Path
- HTTP/HTTPS URL
- bytes
- bytearray
- Binary file-like object
Examples:
from pathlib import Path
from onflow_awb_ocr import OnflowAwbOcr
ocr = OnflowAwbOcr()
from_path = ocr.extract(Path("label.pdf"))
from_url = ocr.extract("https://example.com/label.pdf")
with open("label.pdf", "rb") as file:
    from_file = ocr.extract(file)
with open("label.png", "rb") as file:
    from_bytes = ocr.extract(file.read())
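One way to picture how a single extract() entry point can accept all of these input kinds is a type-based dispatch. The sketch below is an illustration only and does not reflect the package's actual input.py: `classify_input` and its return labels are hypothetical.

```python
import io
from pathlib import Path


def classify_input(source) -> str:
    """Label an input as 'bytes', 'path', 'url', or 'stream'."""
    if isinstance(source, (bytes, bytearray)):
        return "bytes"
    if isinstance(source, Path):
        return "path"
    if isinstance(source, str):
        # Assumption: any string that is not an HTTP/HTTPS URL is a local path.
        if source.lower().startswith(("http://", "https://")):
            return "url"
        return "path"
    if hasattr(source, "read"):
        # Binary file-like objects such as open(..., "rb") or io.BytesIO.
        return "stream"
    raise TypeError(f"Unsupported input type: {type(source).__name__}")


for candidate in ("label.pdf", Path("label.pdf"), "https://example.com/label.pdf",
                  b"%PDF-", bytearray(b"%PDF-"), io.BytesIO(b"%PDF-")):
    print(type(candidate).__name__, "->", classify_input(candidate))
```

Checking bytes and Path before str keeps the URL test limited to plain strings, and the final `read` check covers any binary stream without naming concrete classes.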
Compatibility
The old ReceiverExtractor class name is still available as an alias:
from onflow_awb_ocr import ReceiverExtractor
ocr = ReceiverExtractor()
result = ocr.extract("label.pdf")
Package Structure
- extractor.py: public OnflowAwbOcr class
- input.py: input preparation for paths, URLs, bytes, and binary streams
- text_layer.py: PDF text-layer extraction strategies
- ocr.py: OCR fallback for scanned PDFs and images
- postprocess.py: address cleanup
- types.py, constants.py, utils.py: shared types, constants, and helpers
Publishing
GitHub Actions builds and publishes the package to PyPI on every push to main.
The repository must define this GitHub secret:
PYPI_API_TOKEN
PyPI does not allow replacing an existing version. If a commit on main does not
bump project.version in pyproject.toml, the publish step skips uploading the
distribution that already exists, so no new release is published.