# po_extractor
Template-aware extractor for scanned Purchase Order PDFs and images.
po_extractor turns scanned POs (Tally-style forms, vendor sales orders) into clean, schema-aligned JSON. It does not train a model from scratch. It pairs production OCR with geometry-driven layout analysis, persistent YAML templates, deterministic field/table extractors, validators, and an optional Claude Opus 4.7 stage for narrowly-scoped normalization.
Every emitted value carries evidence: the page, bounding box, raw OCR text, and confidence. New formats are remembered as YAML templates that grow via a correction loop.
## Highlights
- OCR-pluggable — `BaseOCREngine` abstraction. Default: `rapidocr-onnxruntime` (Windows-friendly, pure ONNX). Optional: PaddleOCR via `pip install po_extractor[paddle]`. Mock engine for tests.
- Template registry — YAML-defined anchors, label aliases, table-header aliases, field rules, validation rules. Match score is deterministic and inspectable.
- Table reconstruction — column inference from header bboxes, row clustering by y-gap, multi-line description coalescing, tax sub-row folding, multi-page table stitching.
- Validation — GSTIN regex + checksum, mobile, HSN length, dates (DMY default), Indian numerics (`1,23,456.78` and `₹` handled). Cross-row math: `qty * rate ≈ amount`, `sum(items) ≈ totals`.
- LLM (optional) — Claude Opus 4.7 (`claude-opus-4-7`) used for label mapping and template drafting. Every LLM-emitted value is grounded against OCR before it lands in the result.
- Correction loop — `apply-correction` writes confirmed aliases back into the matched template, so the next document of the same layout extracts cleanly without re-asking.
- Pydantic v2 throughout — every output is a typed model with stable JSON.
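The GSTIN checksum and Indian digit-grouping checks above follow publicly documented rules. A standalone sketch of both (not the library's actual implementation, just the standard algorithms the validators are described as using):

```python
from decimal import Decimal

CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def gstin_check_digit(gstin14: str) -> str:
    """Standard GSTIN mod-36 checksum over the first 14 characters."""
    total = 0
    for i, ch in enumerate(gstin14):
        product = CHARSET.index(ch) * (2 if i % 2 else 1)
        total += product // 36 + product % 36   # sum of quotient and remainder
    return CHARSET[(36 - total % 36) % 36]

def is_valid_gstin(gstin: str) -> bool:
    return (
        len(gstin) == 15
        and all(c in CHARSET for c in gstin)
        and gstin_check_digit(gstin[:14]) == gstin[14]
    )

def parse_indian_amount(text: str) -> Decimal:
    """Parse Indian-grouped numerics like '₹1,23,456.78' into a Decimal."""
    return Decimal(text.replace("₹", "").replace(",", "").strip())

print(is_valid_gstin("27AAPFU0939F1ZV"))    # True
print(parse_indian_amount("₹1,23,456.78"))  # 123456.78
```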
## Install
Once published to PyPI:
```shell
pip install po-extractor[rapid]        # default OCR backend
pip install po-extractor[rapid,llm]    # + Anthropic SDK for Claude Opus 4.7
pip install po-extractor[paddle]       # PaddleOCR (heavier; may need toolchain on Windows)
```
From source (development install):
```powershell
git clone <repo-url>
cd po-extractor
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .[dev,llm]
```
Build a wheel locally:
```shell
pip install build
python -m build .   # produces dist/po_extractor-0.1.0-py3-none-any.whl
```
## Quick start

### Python API (one-liner)
```python
from po_extractor import extract

result = extract("invoice.pdf")    # any path, bytes, or file-like
print(result["po_number"])         # subscript access — returns the value
print(result["buyer_gstin"])
print(result.values_dict())        # {"po_number": "...", "items": [...], ...}
print(result.to_json(indent=2))    # full JSON with values + rich form
```
Provenance (bbox, confidence, raw OCR text, label seen on page) when you need it:
```python
ev = result.evidence("buyer_gstin")
print(ev.value, ev.bbox, ev.confidence, ev.label_seen, ev.source)
```
Other input forms — bytes, file-like, str, Path:
```python
data = open("invoice.pdf", "rb").read()
result = extract(data)                         # bytes
result = extract(open("invoice.pdf", "rb"))    # file-like

import io
result = extract(io.BytesIO(data))
```
Need fine-grained control? Use the class:
```python
from po_extractor import POExtractor

extractor = POExtractor(
    ocr_engine_name="rapid",   # "rapid" | "paddle" | "mock"
    use_llm=False,             # disable LLM stages
    llm_only=False,            # set True to skip templates entirely
)
result = extractor.extract("invoice.pdf")
```
### CLI

Every command takes a verb. Both `po-extract` (console script) and `python -m po_extractor` work:

```shell
po-extract extract invoice.pdf --out result.json
po-extract extract invoice.pdf --llm-only    # skip templates, use Claude/Ollama
po-extract list-templates
po-extract match invoice.pdf                 # show match score breakdown
po-extract apply-correction --result result.json --correction corrections.json
po-extract learn-template invoice.pdf --format-name "Acme Sales Order"
po-extract validate result.json              # re-run validators on existing JSON

python -m po_extractor extract invoice.pdf   # equivalent to po-extract extract
po-extract --help                            # full verb list
```
## Configuration

Settings come from environment variables (see `.env.example`) or programmatic `Settings` overrides:
| Var | Default | Meaning |
|---|---|---|
| `PO_EXTRACTOR_OCR_ENGINE` | `rapid` | `rapid`, `paddle`, or `mock` |
| `PO_EXTRACTOR_DPI` | `300` | PDF rasterization DPI |
| `PO_EXTRACTOR_LLM_PROVIDER` | `auto` | `auto` / `claude` / `ollama` / `none` |
| `ANTHROPIC_API_KEY` | (unset) | Used when provider is `claude` (or `auto` if Ollama isn't reachable) |
| `PO_EXTRACTOR_OLLAMA_HOST` | `http://localhost:11434` | Ollama server URL |
| `PO_EXTRACTOR_OLLAMA_MODEL` | `qwen2.5:7b-instruct` | Local model name |
| `PO_EXTRACTOR_REQUIRE_LLM` | `false` | Hard-fail if LLM unavailable |
| `PO_EXTRACTOR_ALLOW_DRAFTS` | `false` | Load draft templates from `store/drafts/` |
| `PO_EXTRACTOR_LOG_LEVEL` | `INFO` | Logger level |
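Assuming `Settings` reads the environment when the extractor is constructed, the variables can also be set programmatically before first use. A sketch for a CI run with no OCR models or API keys available:

```python
import os

# Select the deterministic mock OCR backend and disable all LLM stages,
# so extraction runs offline with no models or keys installed.
os.environ["PO_EXTRACTOR_OCR_ENGINE"] = "mock"
os.environ["PO_EXTRACTOR_LLM_PROVIDER"] = "none"
os.environ["PO_EXTRACTOR_LOG_LEVEL"] = "DEBUG"
```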
## Using a local LLM via Ollama

Both `learn-template` and the optional in-pipeline label-mapping stage pick up Ollama automatically when no Anthropic key is set.
```shell
# 1) Install Ollama: https://ollama.com/download

# 2) Pull a model good enough for label mapping / template drafting:
ollama pull qwen2.5:7b-instruct   # recommended default (~4.7 GB)
# alternatives: llama3.1:8b, mistral:7b-instruct, qwen2.5:14b-instruct

# 3) Make sure the server is running:
ollama serve   # usually runs as a service already

# 4) Use po_extractor exactly as you would with Claude:
po-extract learn-template "samples/SO-0005-2026.pdf" --format-name "vendor_xyz_so"
```
To force one provider regardless of detection:
```powershell
$env:PO_EXTRACTOR_LLM_PROVIDER = "ollama"   # or "claude" / "none"
```
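On bash/zsh the equivalent is:

```shell
export PO_EXTRACTOR_LLM_PROVIDER=ollama   # or "claude" / "none"
```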
## Adding a template
Two routes:
- By hand — copy `po_extractor/templates/store/_default.yaml`, edit, drop in `store/`.
- Auto-draft from a sample — `po-extract learn-template data/new_vendor.pdf --format-name "new_vendor_po"`. Requires an LLM provider (`ANTHROPIC_API_KEY`, or a reachable Ollama server). Writes a draft into `store/drafts/`.
Templates carry: `format_id`, `format_name`, `anchors[]`, `label_aliases{}`, `table_headers[]`, `field_rules[]`, `validation_rules[]`. See `tstanes_po_v1.yaml` for a fully worked example.
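A hand-written template might look roughly like this. The top-level keys are the ones listed above; the nested field contents are illustrative only — check `_default.yaml` in the store for the authoritative schema:

```yaml
format_id: acme_po_v1
format_name: "Acme Purchase Order"
draft: true
anchors:
  - "PURCHASE ORDER"          # text expected somewhere on page 1
label_aliases:
  po_number: ["PO No.", "Order No.", "Voucher No."]
  buyer_gstin: ["GSTIN/UIN", "GST No."]
table_headers:
  - ["Sl", "Description of Goods", "HSN/SAC", "Qty", "Rate", "Amount"]
field_rules:
  - field: po_number
    pattern: "[A-Z0-9/-]+"
validation_rules:
  - "qty * rate ≈ amount"
```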
## Calibrating a template against a real document

The starter templates ship with `draft: true` because they were derived from anchor lists, not real samples. To calibrate:
```shell
# 1) Extract — likely produces some warnings or missing fields
po-extract extract data\real_tstanes.pdf --out result.json --allow-drafts

# 2) Hand-write a corrections.json with the right values + the labels you saw on the document
#    (see docs/corrections-format.md for the schema)

# 3) Apply the correction — adds new aliases / region hints to the matched template
po-extract apply-correction --result result.json --correction corrections.json

# 4) Re-extract — the template now knows the new aliases
po-extract extract data\real_tstanes.pdf --out result2.json --allow-drafts
```
Once a draft has at least three confirmed aliases per required field via corrections, it is automatically promoted (`draft: false`) so it participates in normal matching.
## Output schema

Every extraction produces an `ExtractionResult` (Pydantic model). Top-level shape:
```json
{
  "document_type": "purchase_order",
  "source_file": "...",
  "page_count": 1,
  "detected_format_id": "tstanes_po_v1",
  "extraction_status": "success | needs_review | needs_template_review",
  "confidence": 0.0,
  "header": { "po_number": { "value": "...", "raw_value": "...", "label_seen": "...", "page": 1, "bbox": [...], "confidence": 0.0 }, ... },
  "parties": { ... },
  "items": [ { "row_index": 0, "cells": { ... }, "taxes": { ... } }, ... ],
  "terms": { ... },
  "totals": { ... },
  "handwritten_notes": [],
  "unmapped_text": [],
  "validation": { "status": "passed | warning | failed", "issues": [] },
  "raw_ocr": { "pages": [ ... ] },
  "diagnostics": { ... }
}
```
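Because the top-level keys are stable, a consumer can triage results without knowing the full schema. A minimal sketch using only `extraction_status` and `validation.issues` from the shape above (the `sample` dict is fabricated for illustration):

```python
def triage(result: dict) -> str:
    """Route an extraction result using the documented top-level keys."""
    if result["extraction_status"] == "success" and not result["validation"]["issues"]:
        return "accept"
    return "review"   # validators or the template matcher flagged something

# Works on any dict loaded from a result.json produced by
# `po-extract extract ... --out result.json`:
sample = {
    "extraction_status": "needs_review",
    "validation": {"status": "warning", "issues": [{"field": "buyer_gstin"}]},
}
print(triage(sample))   # review
```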
## Testing

```shell
pytest -q
```

Tests run with the `MockOCREngine` reading canned JSON fixtures under `data/fixtures/`. No real OCR install or sample PDFs required for CI.
## License
MIT.