Extract recipient address from AWB/shipping label PDF using Claude AI
AWB Extractor
Python SDK for extracting receiver, shipment, carrier, and e-commerce platform information from Vietnamese AWB/shipping label PDF files using Claude AI.
Features
- Extract from PDF bytes, local PDF files, or PDF URLs
- Batch extraction from multiple URLs
- Optional PDF-to-JPEG optimization before sending to Claude
- Cost estimation helper for PDF and optimized image modes
- Optional default HTTP headers for protected AWB URLs
- Typed AWBResult dataclass output
- Custom exceptions for API key, PDF download, and JSON parsing failures
Requirements
- Python 3.9+
- Anthropic API key
- pymupdf for optimized PDF-to-JPEG extraction
Installation
Install from PyPI:
pip install awb-extractor
For local development:
pip install -e ".[dev]"
Usage
from awb_extractor import AWBExtractor
extractor = AWBExtractor(api_key="sk-ant-...")
result = extractor.from_file("label.pdf")
print(result.recipient_name)
print(result.carrier)
print(result.platform)
print(result.to_dict())
Example result:
{
    "tracking_number": "NHSVC972103440",
    "recipient_name": "Nguyen Van A",
    "recipient_phone": "(+84)03******37",
    "recipient_address": "237 Nguyen Trai",
    "recipient_ward": "Phuong Ben Thanh",
    "recipient_district": "Quan 1",
    "recipient_province": "TP. Ho Chi Minh",
    "sender_name": "Onflow",
    "sender_address": "TP. Ho Chi Minh",
    "cod": "0",
    "weight": "0.700 KG",
    "order_id": "584425059595159079",
    "carrier": "GHN",
    "platform": "Shopee",
}
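Because the address is split across several fields, a small helper can join them into one line. The sketch below works on a plain dictionary shaped like the example result; the function name `full_address` is hypothetical and not part of the package.

```python
def full_address(record: dict) -> str:
    """Join the recipient address parts, skipping missing or empty values."""
    keys = (
        "recipient_address",
        "recipient_ward",
        "recipient_district",
        "recipient_province",
    )
    return ", ".join(record[key] for key in keys if record.get(key))


# Hand-copied subset of the example result shown above.
example = {
    "recipient_address": "237 Nguyen Trai",
    "recipient_ward": "Phuong Ben Thanh",
    "recipient_district": "Quan 1",
    "recipient_province": "TP. Ho Chi Minh",
}
print(full_address(example))
```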
By default, AWBExtractor converts the first PDF page to JPEG before sending it
to Claude:
extractor = AWBExtractor(
    api_key="sk-ant-...",
    optimize=True,
    dpi=200,
)
If you want to send the original PDF document directly, set optimize=False:
extractor = AWBExtractor(api_key="sk-ant-...", optimize=False)
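For the default optimize=True path, the conversion of the first page to JPEG can be pictured roughly as follows. This is a sketch under assumptions, not the package's actual code: it assumes PyMuPDF is installed and importable as `fitz`, and the helper name `first_page_to_jpeg` is made up for illustration.

```python
def first_page_to_jpeg(pdf_bytes: bytes, dpi: int = 200) -> bytes:
    """Render the first page of a PDF to JPEG bytes (illustrative only)."""
    # Imported lazily so PyMuPDF is only required when this helper is called.
    import fitz  # PyMuPDF

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        page = doc.load_page(0)         # first page only
        pix = page.get_pixmap(dpi=dpi)  # rasterize at the requested resolution
        return pix.tobytes("jpeg")      # encode the pixmap as JPEG
```

A higher dpi gives a sharper image at the price of a larger payload, so the value is a trade-off between legibility and cost.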
Supported Inputs
PDF bytes
from awb_extractor import AWBExtractor
extractor = AWBExtractor(api_key="sk-ant-...")
with open("label.pdf", "rb") as file:
    result = extractor.from_bytes(file.read())
Local PDF file
from awb_extractor import AWBExtractor
extractor = AWBExtractor(api_key="sk-ant-...")
result = extractor.from_file("label.pdf")
PDF URL
from awb_extractor import AWBExtractor
extractor = AWBExtractor(
    api_key="sk-ant-...",
    http_headers={"Authorization": "Bearer token"},
)
result = extractor.from_url("https://example.com/awb.pdf")
You can pass request-specific headers with extra_headers:
result = extractor.from_url(
    "https://example.com/awb.pdf",
    extra_headers={"X-Request-ID": "request-123"},
)
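One way to think about how the default headers and the request-specific headers combine is a plain dictionary merge. The sketch below is an illustration only: the helper `merge_headers` is hypothetical, and the rule that a request-specific header overrides a default header of the same name is an assumption, not documented behavior.

```python
from typing import Dict, Optional


def merge_headers(
    default_headers: Optional[Dict[str, str]] = None,
    extra_headers: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Combine default and per-request headers into one dictionary."""
    # Assumption: later (per-request) values win on a name clash.
    # Note that this merge is case-sensitive on header names.
    return {**(default_headers or {}), **(extra_headers or {})}


merged = merge_headers(
    {"Authorization": "Bearer token"},
    {"X-Request-ID": "request-123"},
)
print(merged)
```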
Multiple URLs
from_urls() returns a list of dictionaries with url, data, and error.
Failed URLs do not stop the whole batch.
from awb_extractor import AWBExtractor
extractor = AWBExtractor(api_key="sk-ant-...")
results = extractor.from_urls([
    "https://example.com/good.pdf",
    "https://example.com/bad.pdf",
])
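Since each batch item carries url, data, and error, a caller will usually want to separate successes from failures. The sketch below does that on a hand-written sample in the documented shape. The helper `split_batch` is hypothetical, and it assumes that a failed item has a non-empty error while a successful item has an empty or None error; the exact type of data is not specified here.

```python
from typing import Any, Dict, List, Tuple

BatchItem = Dict[str, Any]


def split_batch(results: List[BatchItem]) -> Tuple[List[BatchItem], List[BatchItem]]:
    """Separate successful items from failed ones."""
    # Assumption: a failed item is one whose "error" value is truthy.
    succeeded = [item for item in results if not item.get("error")]
    failed = [item for item in results if item.get("error")]
    return succeeded, failed


# Hand-written sample; real values would come from extractor.from_urls().
sample = [
    {"url": "https://example.com/good.pdf", "data": {"carrier": "GHN"}, "error": None},
    {"url": "https://example.com/bad.pdf", "data": None, "error": "download failed"},
]
ok, bad = split_batch(sample)
for item in bad:
    print(f"{item['url']}: {item['error']}")
```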
Estimate Cost
estimate_cost() estimates token usage and cost before calling the API.
from pathlib import Path
from awb_extractor import estimate_cost
pdf_bytes = Path("label.pdf").read_bytes()
cost = estimate_cost(pdf_bytes, optimize=True, dpi=200)
print(cost)
Example output:
{
    "mode": "image/jpeg",
    "input_tokens": 800,
    "output_tokens": 150,
    "cost_usd": 0.00155,
    "awb_per_10_usd": 6451,
}
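The numbers in the example output are consistent with simple per-token arithmetic. The sketch below reproduces them under assumed prices of 1 USD per million input tokens and 5 USD per million output tokens; these rates are chosen only because they match the sample, and the package may compute its estimate differently. The function name `token_cost_usd` is hypothetical.

```python
def token_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_mtok: float,
    output_usd_per_mtok: float,
) -> float:
    """Cost of one request given per-million-token prices."""
    total = input_tokens * input_usd_per_mtok + output_tokens * output_usd_per_mtok
    return total / 1_000_000


# Assumed prices, not taken from the package or from any price list.
cost = token_cost_usd(800, 150, input_usd_per_mtok=1.0, output_usd_per_mtok=5.0)

# Number of whole labels that fit into a 10 USD budget at this cost.
awb_per_10_usd = int(10 / cost)
print(cost, awb_per_10_usd)
```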
Result Fields
AWBResult includes:
- tracking_number
- recipient_name
- recipient_phone
- recipient_address
- recipient_ward
- recipient_district
- recipient_province
- sender_name
- sender_address
- cod
- weight
- order_id
- carrier
- platform
Use to_dict() or to_json() to serialize the result.
Empty strings returned by Claude are normalized to None.
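The normalization and serialization described above can be illustrated with a small stand-in dataclass. `LabelResult` below is hypothetical and carries only three of the fields; it is not the package's AWBResult, and the real implementation may differ.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class LabelResult:
    """Hypothetical stand-in with a few of the AWBResult fields."""

    tracking_number: Optional[str] = None
    recipient_name: Optional[str] = None
    carrier: Optional[str] = None

    def __post_init__(self) -> None:
        # Replace empty strings with None so missing values look the same.
        for field in fields(self):
            if getattr(self, field.name) == "":
                setattr(self, field.name, None)

    def to_dict(self) -> dict:
        return asdict(self)

    def to_json(self) -> str:
        # ensure_ascii=False keeps Vietnamese characters readable.
        return json.dumps(self.to_dict(), ensure_ascii=False)


result = LabelResult(tracking_number="NHSVC972103440", recipient_name="", carrier="GHN")
print(result.to_dict())
```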
Exceptions
- APIKeyError: missing API key
- PDFDownloadError: PDF URL download failed
- ExtractionError: Claude response could not be parsed as JSON
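To make the JSON parsing failure concrete, the sketch below shows one way a model response could be turned into a dictionary, with parse errors converted into a dedicated exception. The `ExtractionError` class here is a local stand-in defined for the example, and `parse_model_json` is a hypothetical helper; neither is taken from the package source.

```python
import json


class ExtractionError(Exception):
    """Local stand-in for the package's ExtractionError."""


def parse_model_json(text: str) -> dict:
    """Parse a model response that is expected to be a JSON object."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ExtractionError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ExtractionError("expected a JSON object")
    return payload


try:
    parse_model_json("not json at all")
except ExtractionError as error:
    print(f"extraction failed: {error}")
```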
Package Structure
- awb_extractor/extractor.py: public AWBExtractor class
- awb_extractor/models.py: AWBResult dataclass
- awb_extractor/exceptions.py: package exceptions
Development
Install dependencies and run tests:
python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/python -m pytest -q
Publishing
GitHub Actions builds and publishes the package to PyPI on every push to main.
The repository must define this GitHub secret:
PYPI_API_TOKEN
PyPI does not allow replacing an existing version. If a commit on main does not
bump project.version in pyproject.toml, the publish step skips the upload
because that distribution already exists on PyPI.