Extract recipient address from AWB/shipping label PDF using Claude AI

AWB Extractor

Python SDK that uses Claude AI to extract recipient, shipment, carrier, and e-commerce platform information from Vietnamese AWB (air waybill) and shipping label PDF files.

Features

  • Extract from PDF bytes, local PDF files, or PDF URLs
  • Batch extraction from multiple URLs
  • Optional PDF-to-JPEG optimization before sending to Claude
  • Cost estimation helper for PDF and optimized image modes
  • Optional default HTTP headers for protected AWB URLs
  • Typed AWBResult dataclass output
  • Custom exceptions for API key, PDF download, and JSON parsing failures

Requirements

  • Python 3.9+
  • Anthropic API key
  • pymupdf for optimized PDF-to-JPEG extraction

Installation

Install from PyPI:

pip install awb-extractor

For local development:

pip install -e ".[dev]"

Usage

from awb_extractor import AWBExtractor

extractor = AWBExtractor(api_key="sk-ant-...")
result = extractor.from_file("label.pdf")

print(result.recipient_name)
print(result.carrier)
print(result.platform)
print(result.to_dict())

Example result:

{
    "tracking_number": "NHSVC972103440",
    "recipient_name": "Nguyen Van A",
    "recipient_phone": "(+84)03******37",
    "recipient_address": "237 Nguyen Trai",
    "recipient_ward": "Phuong Ben Thanh",
    "recipient_district": "Quan 1",
    "recipient_province": "TP. Ho Chi Minh",
    "sender_name": "Onflow",
    "sender_address": "TP. Ho Chi Minh",
    "cod": "0",
    "weight": "0.700 KG",
    "order_id": "584425059595159079",
    "carrier": "GHN",
    "platform": "Shopee",
}

By default, AWBExtractor converts the first PDF page to JPEG before sending it to Claude:

extractor = AWBExtractor(
    api_key="sk-ant-...",
    optimize=True,
    dpi=200,
)

If you want to send the original PDF document directly, set optimize=False:

extractor = AWBExtractor(api_key="sk-ant-...", optimize=False)
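
The examples on this page pass a placeholder key inline. In practice the key is usually kept out of source code. The following is a minimal sketch that reads it from an environment variable; the variable name ANTHROPIC_API_KEY and the helper load_api_key are assumptions for illustration and are not part of awb_extractor.

```python
import os


def load_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is unset or blank.

    Hypothetical helper; not part of awb_extractor.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before creating the extractor")
    return key
```

The returned value can then be passed as AWBExtractor(api_key=load_api_key()).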

Supported Inputs

PDF bytes

from awb_extractor import AWBExtractor

extractor = AWBExtractor(api_key="sk-ant-...")

with open("label.pdf", "rb") as file:
    result = extractor.from_bytes(file.read())

Local PDF file

from awb_extractor import AWBExtractor

extractor = AWBExtractor(api_key="sk-ant-...")
result = extractor.from_file("label.pdf")

PDF URL

from awb_extractor import AWBExtractor

extractor = AWBExtractor(
    api_key="sk-ant-...",
    http_headers={"Authorization": "Bearer token"},
)

result = extractor.from_url("https://example.com/awb.pdf")

You can pass request-specific headers with extra_headers:

result = extractor.from_url(
    "https://example.com/awb.pdf",
    extra_headers={"X-Request-ID": "request-123"},
)

Multiple URLs

from_urls() returns a list of dictionaries with url, data, and error. Failed URLs do not stop the whole batch.

from awb_extractor import AWBExtractor

extractor = AWBExtractor(api_key="sk-ant-...")
results = extractor.from_urls([
    "https://example.com/good.pdf",
    "https://example.com/bad.pdf",
])
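
Because every entry carries its own error, a batch is usually split into successes and failures after the call. The sketch below does that on hand-written sample data shaped like the documented url/data/error entries. It assumes a failed entry has a truthy error value and a successful one does not; the exact values the package puts in data and error are not stated here, and split_batch is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple

BatchEntry = Dict[str, Any]


def split_batch(results: List[BatchEntry]) -> Tuple[List[BatchEntry], List[BatchEntry]]:
    """Partition from_urls()-style entries into (succeeded, failed).

    Assumption: an entry failed when its "error" value is truthy.
    """
    succeeded = [entry for entry in results if not entry.get("error")]
    failed = [entry for entry in results if entry.get("error")]
    return succeeded, failed


# Hand-written sample; real entries come from extractor.from_urls([...]).
sample = [
    {"url": "https://example.com/good.pdf", "data": {"carrier": "GHN"}, "error": None},
    {"url": "https://example.com/bad.pdf", "data": None, "error": "download failed"},
]

succeeded, failed = split_batch(sample)
for entry in failed:
    print(f"{entry['url']}: {entry['error']}")
```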

Estimate Cost

estimate_cost() estimates token usage and cost before calling the API.

from pathlib import Path
from awb_extractor import estimate_cost

pdf_bytes = Path("label.pdf").read_bytes()
cost = estimate_cost(pdf_bytes, optimize=True, dpi=200)

print(cost)

Example output:

{
    "mode": "image/jpeg",
    "input_tokens": 800,
    "output_tokens": 150,
    "cost_usd": 0.00155,
    "awb_per_10_usd": 6451,
}
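
The figures in this example are internally consistent and can be checked by hand. The sketch below reproduces them under assumed prices of 1 USD per million input tokens and 5 USD per million output tokens. Those rates are an assumption chosen because they match the example output; they are not confirmed by the package, and estimate_cost() may compute its result differently.

```python
import math

# Assumed prices in USD per million tokens (not confirmed by the package).
INPUT_USD_PER_MTOK = 1.0
OUTPUT_USD_PER_MTOK = 5.0


def cost_from_tokens(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request under the assumed per-token prices."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


cost_usd = cost_from_tokens(800, 150)       # (800 * 1 + 150 * 5) / 1e6
awb_per_10_usd = math.floor(10 / cost_usd)  # whole labels that fit in 10 USD

print(round(cost_usd, 5), awb_per_10_usd)   # 0.00155 6451
```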

Result Fields

AWBResult includes:

  • tracking_number
  • recipient_name
  • recipient_phone
  • recipient_address
  • recipient_ward
  • recipient_district
  • recipient_province
  • sender_name
  • sender_address
  • cod
  • weight
  • order_id
  • carrier
  • platform

Use to_dict() or to_json() to serialize the result.

Empty strings returned by Claude are normalized to None.
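
Since missing values arrive as None rather than as empty strings, code that combines fields should skip them explicitly. As a minimal sketch, the hypothetical helper full_address below joins the four recipient address fields from a to_dict()-style dictionary; it is not part of the package.

```python
from typing import Any, Dict, Optional

ADDRESS_FIELDS = (
    "recipient_address",
    "recipient_ward",
    "recipient_district",
    "recipient_province",
)


def full_address(record: Dict[str, Any]) -> Optional[str]:
    """Join the recipient address parts that are present, skipping None values."""
    parts = [record.get(name) for name in ADDRESS_FIELDS]
    present = [part for part in parts if part is not None]
    return ", ".join(present) if present else None


record = {
    "recipient_address": "237 Nguyen Trai",
    "recipient_ward": None,  # field missing on the label
    "recipient_district": "Quan 1",
    "recipient_province": "TP. Ho Chi Minh",
}
print(full_address(record))  # 237 Nguyen Trai, Quan 1, TP. Ho Chi Minh
```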

Exceptions

  • APIKeyError: missing API key
  • PDFDownloadError: PDF URL download failed
  • ExtractionError: Claude response could not be parsed as JSON
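
One way to use these exceptions is to treat download and parsing failures as recoverable per-document errors while letting a missing API key stop the program. The sketch below shows that pattern with a generic wrapper and a stand-in exception class so that it runs without the package installed; try_extract and FakeDownloadError are hypothetical names.

```python
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def try_extract(
    extract: Callable[[], T],
    recoverable: Tuple[Type[Exception], ...],
) -> Tuple[Optional[T], Optional[str]]:
    """Run extract(); return (result, None), or (None, message) on a recoverable error."""
    try:
        return extract(), None
    except recoverable as exc:
        return None, f"{type(exc).__name__}: {exc}"


# Stand-in for a package exception so the sketch runs on its own.
class FakeDownloadError(Exception):
    pass


def failing_call() -> str:
    raise FakeDownloadError("HTTP 404")


result, error = try_extract(failing_call, (FakeDownloadError,))
print(result, error)  # None FakeDownloadError: HTTP 404
```

With the package installed, the callable would wrap a call such as extractor.from_url(url) and the tuple would hold PDFDownloadError and ExtractionError. Importing them from awb_extractor.exceptions is inferred from the package structure listed on this page and is an assumption.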

Package Structure

  • awb_extractor/extractor.py: public AWBExtractor class
  • awb_extractor/models.py: AWBResult dataclass
  • awb_extractor/exceptions.py: package exceptions

Development

Install dependencies and run tests:

python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/python -m pytest -q

Publishing

GitHub Actions builds and publishes the package to PyPI on every push to main.

The repository must define this GitHub secret:

PYPI_API_TOKEN

PyPI does not allow an existing version to be replaced. If a commit on main does not bump project.version in pyproject.toml, the publish step skips the upload because that distribution already exists.

Download files

Source Distribution

awb_extractor-0.1.5.tar.gz (7.8 kB view details)

Uploaded Source

Built Distribution

awb_extractor-0.1.5-py3-none-any.whl (7.2 kB view details)

Uploaded Python 3

File details

Details for the file awb_extractor-0.1.5.tar.gz.

File metadata

  • Download URL: awb_extractor-0.1.5.tar.gz
  • Upload date:
  • Size: 7.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for awb_extractor-0.1.5.tar.gz
Algorithm Hash digest
SHA256 d6ab4dbda83aefdeb809a842ffe0e2bc93290e04406539d6aa1903e0cde5c43c
MD5 d6b95121c7b1b7d822c0516a034807e0
BLAKE2b-256 073dd51754dc8eff7c96d188d29e6c3fd81e5651aea676b0eb30da4f6cb66ea2

File details

Details for the file awb_extractor-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: awb_extractor-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 7.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for awb_extractor-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 bdb0dd05d484194a6c22c80f82934d4d953592ffea804235f06f7b671da1569a
MD5 bee068dcb03a0f8aef5ecf3914224edd
BLAKE2b-256 699d8168a0185383bbaa26c1b76f5f119228f7f921d9b6494e16056f3fd04dd6
