po_extractor

Template-aware extractor for scanned Purchase Order / Sales Order PDFs and images (Tally / GST style).

po_extractor turns scanned POs (Tally-style forms, vendor sales orders) into clean, schema-aligned JSON. It does not train a model from scratch. It pairs production OCR with geometry-driven layout analysis, persistent YAML templates, deterministic field/table extractors, validators, and an optional Claude Opus 4.7 stage for narrowly-scoped normalization.

Every emitted value carries evidence: the page, bounding box, raw OCR text, and confidence. New formats are remembered as YAML templates that grow via a correction loop.

Highlights

  • OCR-pluggable — BaseOCREngine abstraction. Default: rapidocr-onnxruntime (Windows-friendly, pure ONNX). Optional: PaddleOCR via pip install po_extractor[paddle]. Mock engine for tests.
  • Template registry — YAML-defined anchors, label aliases, table-header aliases, field rules, validation rules. Match score is deterministic and inspectable.
  • Table reconstruction — column inference from header bboxes, row clustering by y-gap, multi-line description coalescing, tax sub-row folding, multi-page table stitching.
  • Validation — GSTIN regex + checksum, mobile numbers, HSN code length, dates (DMY default), Indian numeric formats (1,23,456.78). Cross-row math: qty * rate ≈ amount, sum(items) ≈ totals.
  • LLM (optional) — Claude Opus 4.7 (claude-opus-4-7) used for label mapping and template drafting. Every LLM-emitted value is grounded against OCR before it lands in the result.
  • Correction loop — apply-correction writes confirmed aliases back into the matched template, so the next document of the same layout extracts cleanly without re-asking.
  • Pydantic v2 throughout — every output is a typed model with stable JSON.
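The GSTIN checksum mentioned under Validation is the standard Luhn mod-36 check digit computed over the first 14 characters. The library's own validator isn't reproduced here; this is a self-contained sketch of that public algorithm (the sample GSTIN is a commonly used valid example):

```python
# Sketch of the standard GSTIN check-digit algorithm (Luhn mod 36).
# Not po_extractor's internal code; shown to illustrate "regex + checksum".
import re

_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
_GSTIN_RE = re.compile(r"^\d{2}[A-Z]{5}\d{4}[A-Z][1-9A-Z]Z[0-9A-Z]$")

def gstin_is_valid(gstin: str) -> bool:
    gstin = gstin.strip().upper()
    if not _GSTIN_RE.match(gstin):
        return False
    total = 0
    for i, ch in enumerate(gstin[:14]):
        product = _CHARS.index(ch) * (1 if i % 2 == 0 else 2)
        total += product // 36 + product % 36          # digit-sum in base 36
    check = _CHARS[(36 - total % 36) % 36]
    return gstin[14] == check

print(gstin_is_valid("27AAPFU0939F1ZV"))  # True
```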

Install

Once published to PyPI:

pip install po-extractor[rapid]              # default OCR backend
pip install po-extractor[rapid,llm]          # + Anthropic SDK for Claude Opus 4.7
pip install po-extractor[paddle]             # PaddleOCR (heavier; may need toolchain on Windows)

From source (development install):

git clone <repo-url>
cd po-extractor
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .[dev,llm]

Build a wheel locally:

pip install build
python -m build .                            # produces dist/po_extractor-0.1.0-py3-none-any.whl

Quick start

Python API (one-liner)

from po_extractor import extract

result = extract("invoice.pdf")              # any path, bytes, or file-like
print(result["po_number"])                   # subscript access — returns the value
print(result["buyer_gstin"])
print(result.values_dict())                  # {"po_number": "...", "items": [...], ...}
print(result.to_json(indent=2))              # full JSON with values + rich form

Provenance (bbox, confidence, raw OCR text, label seen on page) when you need it:

ev = result.evidence("buyer_gstin")
print(ev.value, ev.bbox, ev.confidence, ev.label_seen, ev.source)
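Evidence also makes review queues straightforward. The helper below is illustrative (not part of po_extractor); it only assumes the evidence() API shown above:

```python
# Illustrative helper (not part of po_extractor): collect fields whose OCR
# confidence falls below a review threshold, using the evidence() API above.
def fields_needing_review(result, field_names, threshold=0.85):
    flagged = []
    for name in field_names:
        ev = result.evidence(name)
        if ev is not None and ev.confidence < threshold:
            flagged.append((name, ev.value, ev.confidence))
    return flagged
```

Flagged fields are natural candidates for the corrections.json / apply-correction loop described below.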

Other input forms — bytes, file-like, str, Path:

data = open("invoice.pdf", "rb").read()
result = extract(data)                       # bytes
result = extract(open("invoice.pdf", "rb"))  # file-like
import io; result = extract(io.BytesIO(data))

Need fine-grained control? Use the class:

from po_extractor import POExtractor

extractor = POExtractor(
    ocr_engine_name="rapid",                 # "rapid" | "paddle" | "mock"
    use_llm=False,                           # disable LLM stages
    llm_only=False,                          # set True to skip templates entirely
)
result = extractor.extract("invoice.pdf")

CLI

The CLI is verb-based. Both po-extract (console script) and python -m po_extractor work:

po-extract extract invoice.pdf --out result.json
po-extract extract invoice.pdf --llm-only            # skip templates, use Claude/Ollama
po-extract list-templates
po-extract match invoice.pdf                         # show match score breakdown
po-extract apply-correction --result result.json --correction corrections.json
po-extract learn-template invoice.pdf --format-name "Acme Sales Order"
po-extract validate result.json                      # re-run validators on existing JSON

python -m po_extractor extract invoice.pdf           # equivalent to po-extract extract
po-extract --help                                    # full verb list

Configuration

Settings come from environment variables (see .env.example) or programmatic Settings overrides:

Var                        Default                 Meaning
PO_EXTRACTOR_OCR_ENGINE    rapid                   rapid, paddle, or mock
PO_EXTRACTOR_DPI           300                     PDF rasterization DPI
PO_EXTRACTOR_LLM_PROVIDER  auto                    auto / claude / ollama / none
ANTHROPIC_API_KEY          (unset)                 Used when provider is claude (or auto, if Ollama isn't reachable)
PO_EXTRACTOR_OLLAMA_HOST   http://localhost:11434  Ollama server URL
PO_EXTRACTOR_OLLAMA_MODEL  qwen2.5:7b-instruct     Local model name
PO_EXTRACTOR_REQUIRE_LLM   false                   Hard-fail if the LLM is unavailable
PO_EXTRACTOR_ALLOW_DRAFTS  false                   Load draft templates from store/drafts/
PO_EXTRACTOR_LOG_LEVEL     INFO                    Logger level
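Overrides must be in the environment before Settings is instantiated; settings libraries typically read the environment at construction time. A minimal sketch using values from the table:

```python
import os

# Set overrides before constructing the extractor so Settings picks them up.
os.environ["PO_EXTRACTOR_OCR_ENGINE"] = "mock"    # rapid | paddle | mock
os.environ["PO_EXTRACTOR_LLM_PROVIDER"] = "none"  # disable LLM stages
os.environ["PO_EXTRACTOR_DPI"] = "200"            # lower DPI = faster, coarser OCR
```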

Using a local LLM via Ollama

Both learn-template and the optional in-pipeline label-mapping stage will pick up Ollama automatically when no Anthropic key is set.

# 1) Install Ollama: https://ollama.com/download
# 2) Pull a model good enough for label mapping / template drafting:
ollama pull qwen2.5:7b-instruct          # recommended default (~4.7 GB)
# alternatives: llama3.1:8b, mistral:7b-instruct, qwen2.5:14b-instruct
# 3) Make sure the server is running:
ollama serve                              # usually runs as a service already

# 4) Use po_extractor exactly as you would with Claude:
po-extract learn-template "samples/SO-0005-2026.pdf" --format-name "vendor_xyz_so"

To force one provider regardless of detection:

$env:PO_EXTRACTOR_LLM_PROVIDER = "ollama"      # or "claude" / "none"
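Auto-detection only selects Ollama when the server answers. This standalone probe (not the library's actual detection code) checks reachability via Ollama's /api/tags model-listing endpoint:

```python
import urllib.request

def ollama_reachable(host: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(host + "/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, timeout, etc.
        return False
```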

Adding a template

Two routes:

  1. By hand — copy po_extractor/templates/store/_default.yaml, edit, drop in store/.
  2. Auto-draft from a sample — po-extract learn-template data/new_vendor.pdf --format-name "new_vendor_po". Requires an LLM provider (ANTHROPIC_API_KEY, or a running Ollama server as described above). Writes a draft into store/drafts/.

Templates carry: format_id, format_name, anchors[], label_aliases{}, table_headers[], field_rules[], validation_rules[]. See tstanes_po_v1.yaml for a fully-worked example.
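As an illustration of that shape: the top-level keys below are the real ones listed above, but every value (and the nesting under each key) is hypothetical; check tstanes_po_v1.yaml for the authoritative structure.

```yaml
# Hypothetical template sketch: top-level keys from the docs, values invented.
format_id: acme_so_v1
format_name: "Acme Sales Order"
anchors:
  - "SALES ORDER"              # text expected somewhere on the page
  - "GSTIN"
label_aliases:
  po_number: ["Order No", "SO No."]
  buyer_gstin: ["GSTIN/UIN", "Buyer GSTIN"]
table_headers:
  - ["Description of Goods", "Description"]
  - ["HSN/SAC", "HSN"]
field_rules: []
validation_rules: []
```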

Calibrating a template against a real document

The starter templates ship with draft: true because they were derived from anchor lists, not real samples. To calibrate:

# 1) Extract — likely produces some warnings or missing fields
po-extract extract data\real_tstanes.pdf --out result.json --allow-drafts

# 2) Hand-write a corrections.json with the right values + the labels you saw on the document
# (see docs/corrections-format.md for the schema)

# 3) Apply the correction — adds new aliases / region hints to the matched template
po-extract apply-correction --result result.json --correction corrections.json

# 4) Re-extract — the template now knows the new aliases
po-extract extract data\real_tstanes.pdf --out result2.json --allow-drafts

Once a draft has at least three confirmed aliases per required field via corrections, it is automatically promoted (draft: false) so it participates in normal matching.

Output schema

Every extraction produces an ExtractionResult (Pydantic model). Top-level shape:

{
  "document_type": "purchase_order",
  "source_file": "...",
  "page_count": 1,
  "detected_format_id": "tstanes_po_v1",
  "extraction_status": "success | needs_review | needs_template_review",
  "confidence": 0.0,
  "header": { "po_number": { "value": "...", "raw_value": "...", "label_seen": "...", "page": 1, "bbox": [...], "confidence": 0.0 }, ... },
  "parties": { ... },
  "items": [ { "row_index": 0, "cells": { ... }, "taxes": { ... } }, ... ],
  "terms": { ... },
  "totals": { ... },
  "handwritten_notes": [],
  "unmapped_text": [],
  "validation": { "status": "passed | warning | failed", "issues": [] },
  "raw_ocr": { "pages": [ ... ] },
  "diagnostics": { ... }
}
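The cross-row arithmetic from the Validation highlights can be replayed over this JSON. The sketch below assumes simplified numeric cells; real cells are richer evidence objects (value, bbox, confidence), so a production version would unwrap them first:

```python
# Re-run the qty * rate ≈ amount check on an extraction result dict.
# Assumes simplified numeric cells, not the full evidence objects.
def rows_failing_math(result: dict, rel_tol: float = 0.01) -> list:
    bad = []
    for item in result.get("items", []):
        cells = item.get("cells", {})
        try:
            qty, rate, amount = (float(cells[k]) for k in ("qty", "rate", "amount"))
        except (KeyError, TypeError, ValueError):
            continue  # row lacks the three numeric cells; nothing to check
        if abs(qty * rate - amount) > rel_tol * max(abs(amount), 1.0):
            bad.append(item.get("row_index"))
    return bad
```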

Testing

pytest -q

Tests run with the MockOCREngine reading canned JSON fixtures under data/fixtures/. No real OCR install or sample PDFs required for CI.

License

MIT.
