po_extractor

Template-aware extractor for scanned Purchase Order / Sales Order PDFs and images (Tally / GST style).

po_extractor turns scanned POs (Tally-style forms, vendor sales orders) into clean, schema-aligned JSON. It does not train a model from scratch. It pairs production OCR with geometry-driven layout analysis, persistent YAML templates, deterministic field/table extractors, validators, and an optional Claude Opus 4.7 stage for narrowly-scoped normalization.

Every emitted value carries evidence: the page, bounding box, raw OCR text, and confidence. New formats are remembered as YAML templates that grow via a correction loop.

Highlights

  • OCR-pluggable — BaseOCREngine abstraction. Default: rapidocr-onnxruntime (Windows-friendly, pure ONNX). Optional: PaddleOCR via pip install po_extractor[paddle]. Mock engine for tests.
  • Template registry — YAML-defined anchors, label aliases, table-header aliases, field rules, validation rules. Match score is deterministic and inspectable.
  • Table reconstruction — column inference from header bboxes, row clustering by y-gap, multi-line description coalescing, tax sub-row folding, multi-page table stitching.
  • Validation — GSTIN regex + checksum, mobile numbers, HSN code length, dates (DMY default), Indian numeric formats (1,23,456.78). Cross-row math: qty * rate ≈ amount, sum(items) ≈ totals.
  • LLM (optional) — Claude Opus 4.7 (claude-opus-4-7) used for label mapping and template drafting. Every LLM-emitted value is grounded against OCR before it lands in the result.
  • Correction loop — apply-correction writes confirmed aliases back into the matched template, so the next document of the same layout extracts cleanly without re-asking.
  • Pydantic v2 throughout — every output is a typed model with stable JSON.
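The GSTIN checksum mentioned under Validation is the standard Luhn mod-36 check digit computed over the first 14 characters. The library's own validator isn't reproduced here; this is a self-contained sketch of that public algorithm (the sample GSTIN is a commonly used valid example):

```python
# Sketch of the standard GSTIN check-digit algorithm (Luhn mod 36).
# Not po_extractor's internal code; shown to illustrate "regex + checksum".
import re

_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
_GSTIN_RE = re.compile(r"^\d{2}[A-Z]{5}\d{4}[A-Z][1-9A-Z]Z[0-9A-Z]$")

def gstin_is_valid(gstin: str) -> bool:
    gstin = gstin.strip().upper()
    if not _GSTIN_RE.match(gstin):
        return False
    total = 0
    for i, ch in enumerate(gstin[:14]):
        product = _CHARS.index(ch) * (1 if i % 2 == 0 else 2)
        total += product // 36 + product % 36          # digit-sum in base 36
    check = _CHARS[(36 - total % 36) % 36]
    return gstin[14] == check

print(gstin_is_valid("27AAPFU0939F1ZV"))  # True
```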

Install

Once published to PyPI:

pip install po-extractor[rapid]              # default OCR backend
pip install po-extractor[rapid,llm]          # + Anthropic SDK for Claude Opus 4.7
pip install po-extractor[paddle]             # PaddleOCR (heavier; may need toolchain on Windows)

From source (development install):

git clone <repo-url>
cd po-extractor
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .[dev,llm]

Build a wheel locally:

pip install build
python -m build .                            # produces dist/po_extractor-0.1.0-py3-none-any.whl

Quick start

Python API (one-liner)

from po_extractor import extract

result = extract("invoice.pdf")              # any path, bytes, or file-like
print(result["po_number"])                   # subscript access — returns the value
print(result["buyer_gstin"])
print(result.values_dict())                  # {"po_number": "...", "items": [...], ...}
print(result.to_json(indent=2))              # full JSON with values + rich form

Provenance (bbox, confidence, raw OCR text, label seen on page) when you need it:

ev = result.evidence("buyer_gstin")
print(ev.value, ev.bbox, ev.confidence, ev.label_seen, ev.source)
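Evidence also makes review queues straightforward. The helper below is illustrative (not part of po_extractor); it only assumes the evidence() API shown above:

```python
# Illustrative helper (not part of po_extractor): collect fields whose OCR
# confidence falls below a review threshold, using the evidence() API above.
def fields_needing_review(result, field_names, threshold=0.85):
    flagged = []
    for name in field_names:
        ev = result.evidence(name)
        if ev is not None and ev.confidence < threshold:
            flagged.append((name, ev.value, ev.confidence))
    return flagged
```

Flagged fields are natural candidates for the corrections.json / apply-correction loop described below.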

Other input forms — bytes, file-like, str, Path:

data = open("invoice.pdf", "rb").read()
result = extract(data)                       # bytes
result = extract(open("invoice.pdf", "rb"))  # file-like
import io; result = extract(io.BytesIO(data))

Need fine-grained control? Use the class:

from po_extractor import POExtractor

extractor = POExtractor(
    ocr_engine_name="rapid",                 # "rapid" | "paddle" | "mock"
    use_llm=False,                           # disable LLM stages
    llm_only=False,                          # set True to skip templates entirely
)
result = extractor.extract("invoice.pdf")

CLI

The CLI is verb-based. Both po-extract (console script) and python -m po_extractor work:

po-extract extract invoice.pdf --out result.json
po-extract extract invoice.pdf --llm-only            # skip templates, use Claude/Ollama
po-extract list-templates
po-extract match invoice.pdf                         # show match score breakdown
po-extract apply-correction --result result.json --correction corrections.json
po-extract learn-template invoice.pdf --format-name "Acme Sales Order"
po-extract validate result.json                      # re-run validators on existing JSON

python -m po_extractor extract invoice.pdf           # equivalent to po-extract extract
po-extract --help                                    # full verb list

Configuration

Settings come from environment variables (see .env.example) or programmatic Settings overrides:

Var                        Default                 Meaning
PO_EXTRACTOR_OCR_ENGINE    rapid                   rapid, paddle, or mock
PO_EXTRACTOR_DPI           300                     PDF rasterization DPI
PO_EXTRACTOR_LLM_PROVIDER  auto                    auto / claude / ollama / none
ANTHROPIC_API_KEY          (unset)                 Used when provider is claude (or auto, if Ollama isn't reachable)
PO_EXTRACTOR_OLLAMA_HOST   http://localhost:11434  Ollama server URL
PO_EXTRACTOR_OLLAMA_MODEL  qwen2.5:7b-instruct     Local model name
PO_EXTRACTOR_REQUIRE_LLM   false                   Hard-fail if the LLM is unavailable
PO_EXTRACTOR_ALLOW_DRAFTS  false                   Load draft templates from store/drafts/
PO_EXTRACTOR_LOG_LEVEL     INFO                    Logger level
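Overrides must be in the environment before Settings is instantiated; settings libraries typically read the environment at construction time. A minimal sketch using values from the table:

```python
import os

# Set overrides before constructing the extractor so Settings picks them up.
os.environ["PO_EXTRACTOR_OCR_ENGINE"] = "mock"    # rapid | paddle | mock
os.environ["PO_EXTRACTOR_LLM_PROVIDER"] = "none"  # disable LLM stages
os.environ["PO_EXTRACTOR_DPI"] = "200"            # lower DPI = faster, coarser OCR
```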

Using a local LLM via Ollama

Both learn-template and the optional in-pipeline label-mapping stage will pick up Ollama automatically when no Anthropic key is set.

# 1) Install Ollama: https://ollama.com/download
# 2) Pull a model good enough for label mapping / template drafting:
ollama pull qwen2.5:7b-instruct          # recommended default (~4.7 GB)
# alternatives: llama3.1:8b, mistral:7b-instruct, qwen2.5:14b-instruct
# 3) Make sure the server is running:
ollama serve                              # usually runs as a service already

# 4) Use po_extractor exactly as you would with Claude:
po-extract learn-template "samples/SO-0005-2026.pdf" --format-name "vendor_xyz_so"

To force one provider regardless of detection:

$env:PO_EXTRACTOR_LLM_PROVIDER = "ollama"      # or "claude" / "none"
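Auto-detection only selects Ollama when the server answers. This standalone probe (not the library's actual detection code) checks reachability via Ollama's /api/tags model-listing endpoint:

```python
import urllib.request

def ollama_reachable(host: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(host + "/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, timeout, etc.
        return False
```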

Adding a template

Two routes:

  1. By hand — copy po_extractor/templates/store/_default.yaml, edit, drop in store/.
  2. Auto-draft from a sample — po-extract learn-template data/new_vendor.pdf --format-name "new_vendor_po". Requires an LLM provider (ANTHROPIC_API_KEY, or a running Ollama server as described above). Writes a draft into store/drafts/.

Templates carry: format_id, format_name, anchors[], label_aliases{}, table_headers[], field_rules[], validation_rules[]. See tstanes_po_v1.yaml for a fully-worked example.
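As an illustration of that shape: the top-level keys below are the real ones listed above, but every value (and the nesting under each key) is hypothetical; check tstanes_po_v1.yaml for the authoritative structure.

```yaml
# Hypothetical template sketch: top-level keys from the docs, values invented.
format_id: acme_so_v1
format_name: "Acme Sales Order"
anchors:
  - "SALES ORDER"              # text expected somewhere on the page
  - "GSTIN"
label_aliases:
  po_number: ["Order No", "SO No."]
  buyer_gstin: ["GSTIN/UIN", "Buyer GSTIN"]
table_headers:
  - ["Description of Goods", "Description"]
  - ["HSN/SAC", "HSN"]
field_rules: []
validation_rules: []
```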

Calibrating a template against a real document

The starter templates ship with draft: true because they were derived from anchor lists, not real samples. To calibrate:

# 1) Extract — likely produces some warnings or missing fields
po-extract extract data\real_tstanes.pdf --out result.json --allow-drafts

# 2) Hand-write a corrections.json with the right values + the labels you saw on the document
# (see docs/corrections-format.md for the schema)

# 3) Apply the correction — adds new aliases / region hints to the matched template
po-extract apply-correction --result result.json --correction corrections.json

# 4) Re-extract — the template now knows the new aliases
po-extract extract data\real_tstanes.pdf --out result2.json --allow-drafts

Once a draft has at least three confirmed aliases per required field via corrections, it is automatically promoted (draft: false) so it participates in normal matching.

Output schema

Every extraction produces an ExtractionResult (Pydantic model). Top-level shape:

{
  "document_type": "purchase_order",
  "source_file": "...",
  "page_count": 1,
  "detected_format_id": "tstanes_po_v1",
  "extraction_status": "success | needs_review | needs_template_review",
  "confidence": 0.0,
  "header": { "po_number": { "value": "...", "raw_value": "...", "label_seen": "...", "page": 1, "bbox": [...], "confidence": 0.0 }, ... },
  "parties": { ... },
  "items": [ { "row_index": 0, "cells": { ... }, "taxes": { ... } }, ... ],
  "terms": { ... },
  "totals": { ... },
  "handwritten_notes": [],
  "unmapped_text": [],
  "validation": { "status": "passed | warning | failed", "issues": [] },
  "raw_ocr": { "pages": [ ... ] },
  "diagnostics": { ... }
}
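The cross-row arithmetic from the Validation highlights can be replayed over this JSON. The sketch below assumes simplified numeric cells; real cells are richer evidence objects (value, bbox, confidence), so a production version would unwrap them first:

```python
# Re-run the qty * rate ≈ amount check on an extraction result dict.
# Assumes simplified numeric cells, not the full evidence objects.
def rows_failing_math(result: dict, rel_tol: float = 0.01) -> list:
    bad = []
    for item in result.get("items", []):
        cells = item.get("cells", {})
        try:
            qty, rate, amount = (float(cells[k]) for k in ("qty", "rate", "amount"))
        except (KeyError, TypeError, ValueError):
            continue  # row lacks the three numeric cells; nothing to check
        if abs(qty * rate - amount) > rel_tol * max(abs(amount), 1.0):
            bad.append(item.get("row_index"))
    return bad
```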

Testing

pytest -q

Tests run with the MockOCREngine reading canned JSON fixtures under data/fixtures/. No real OCR install or sample PDFs required for CI.

License

MIT.
