Utilities for Veryfi Data Annotations Engineer test
veryfi_test
Veryfi’s Data Annotations Engineer Test
Contents
- Overview
- Processing Pipeline
- Quick Start
- Credentials Setup
- CLI Usage
- Structured Field Extraction
- Extraction Assumptions
Overview
- What it does – Provides two CLIs (veryfi-ocr and veryfi-extract) that turn Switch-branded invoices into structured JSON. The first uploads PDFs to Veryfi and stores the raw OCR payload; the second parses that payload into canonical invoice fields.
- Why it exists – Automates the Switch use case end to end so annotation engineers can validate layouts, run regression suites, and feed downstream systems without writing bespoke scripts.
- Scope – Focused on Switch invoices only. Non-Switch layouts are explicitly skipped (layout mismatch). The pipeline expects OCR manifests (JSON) and produces structured fields plus itemized line items.
- Further reading – See README_approach.md for architecture/trade-offs and README_extractor.md for a deep dive into the parsing logic.
Processing Pipeline
documents.json
↓
veryfi-ocr
↓
Raw Veryfi OCR JSON (outputs-ocr/)
↓
veryfi-extract
↓
Structured invoice fields (outputs-extracted/)
Quick Start
git clone https://github.com/omazapa/veryfi_test
cd veryfi_test
pip install -e .
cp .env.example .env
# edit .env with your Veryfi credentials
# drop your Switch PDFs under ./data/ or adjust the manifest paths accordingly
veryfi-ocr examples/documents.json
veryfi-extract outputs-ocr/
Credentials Setup
Store your Veryfi keys in environment variables so they never end up in Git:
cp .env.example .env
Edit .env and fill in the values:
- VERYFI_API_URL (defaults to https://api.veryfi.com/ if omitted)
- VERYFI_CLIENT_ID
- VERYFI_CLIENT_SECRET
- VERYFI_USERNAME
- VERYFI_API_KEY
The application can then load them via veryfi_test.config.load_credentials, which prefers the actual environment over the .env file. Keep .env local; .gitignore already excludes it.
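The precedence rule above (real environment first, .env file second) can be illustrated with a small standard-library sketch. This is not the package's load_credentials; the function names parse_env_file and load_credentials_sketch, the KEY=VALUE parsing rules, and the error raised on missing keys are all assumptions made for illustration.

```python
import os
from pathlib import Path

# Variable names and the default URL come from the README;
# everything else in this sketch is hypothetical.
CREDENTIAL_KEYS = (
    "VERYFI_API_URL",
    "VERYFI_CLIENT_ID",
    "VERYFI_CLIENT_SECRET",
    "VERYFI_USERNAME",
    "VERYFI_API_KEY",
)
DEFAULT_API_URL = "https://api.veryfi.com/"


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are ignored."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_credentials_sketch(env_file=".env", environ=None):
    """Hypothetical stand-in: the real environment wins over the .env file."""
    environ = os.environ if environ is None else environ
    file_values = parse_env_file(env_file)
    merged = {}
    for key in CREDENTIAL_KEYS:
        value = environ.get(key) or file_values.get(key)
        if value:
            merged[key] = value
    # Only the API URL has a documented default.
    merged.setdefault("VERYFI_API_URL", DEFAULT_API_URL)
    missing = [key for key in CREDENTIAL_KEYS if key not in merged]
    if missing:
        raise KeyError(f"missing Veryfi credentials: {', '.join(missing)}")
    return merged
```

Passing environ explicitly keeps the sketch easy to test without touching the process environment.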
CLI Usage
Install the project (editable mode recommended during development):
pip install -e .
After your credentials are set, describe the documents in a JSON manifest:
[
{"path": "invoices/jan.pdf", "categories": ["Food", "Hotel"]},
{"path": "receipts/mar.jpg", "categories": ["Receipts"]}
]
Then run the CLI against that manifest:
veryfi-ocr documents.json
Example output when using examples/documents.json:
$ veryfi-ocr examples/documents.json
Processed data/synth-switch_v5-14.pdf (document id: 385142953) -> outputs-ocr/synth-switch_v5-14.json
Processed data/synth-switch_v5-4.pdf (document id: 385142969) -> outputs-ocr/synth-switch_v5-4.json
Processed data/synth-switch_v5-68.pdf (document id: 385142983) -> outputs-ocr/synth-switch_v5-68.json
Processed data/synth-switch_v5-7.pdf (document id: 385142997) -> outputs-ocr/synth-switch_v5-7.json
Processed data/synth-switch_v5-79.pdf (document id: 385143009) -> outputs-ocr/synth-switch_v5-79.json
Options:
- --output-ocr-dir processed/ stores JSON responses under a custom directory (default: ./outputs-ocr).
- --env-file /custom/path/.env points to another dotenv file if needed.
Each manifest entry must include a path and can optionally define categories/topics (string or list). You may also wrap the list in an object with a documents key. Every processed document generates a JSON file whose name matches the original input (e.g., invoice.pdf → invoice.json) and stores the Veryfi response payload. You can also invoke the CLI without installing by running python -m veryfi_test.ocr_cli documents.json.
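The manifest rules above can be sketched as a small normalization step. The helper names normalize_manifest and output_path_for are hypothetical and are not taken from the package; the sketch only handles the categories key, and how the real CLI treats topics or invalid entries is not shown here.

```python
import json
from pathlib import Path


def normalize_manifest(data):
    """Return a list of (path, categories) tuples from a parsed manifest."""
    # Accept either a bare list or an object wrapping it under "documents".
    entries = data["documents"] if isinstance(data, dict) else data
    documents = []
    for entry in entries:
        if "path" not in entry:
            raise ValueError(f"manifest entry is missing 'path': {entry!r}")
        categories = entry.get("categories", [])
        # A single string is treated as a one-element list.
        if isinstance(categories, str):
            categories = [categories]
        documents.append((entry["path"], list(categories)))
    return documents


def output_path_for(document_path, output_dir="outputs-ocr"):
    """Map e.g. invoice.pdf to <output_dir>/invoice.json."""
    return Path(output_dir) / (Path(document_path).stem + ".json")


if __name__ == "__main__":
    manifest = json.loads('{"documents": [{"path": "invoices/jan.pdf", "categories": "Food"}]}')
    for path, categories in normalize_manifest(manifest):
        print(path, categories, output_path_for(path))
```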
Structured Field Extraction
Every Veryfi response contains the OCR text under veryfi_response.ocr_text. The helper in veryfi_test/extractor.py uses the following cues to make sense of Switch-branded invoices:
- It first locates the vendor banner (switch … + PO Box 674592 …). Files that do not match this header are rejected immediately so other layouts are ignored.
- The invoice metadata row is parsed with a regex that captures the Invoice Date and Invoice No. fields. The first date becomes invoice_date and the number becomes invoice_number.
- The bill to block is the text between the invoice metadata row and the Account No. label. The first line becomes bill_to_name and the rest are collapsed (comma-separated) into bill_to_address.
- The vendor address is reconstructed from the city/state line and the PO Box line that were previously matched.
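As a rough illustration of the metadata-row and bill-to cues listed above, the sketch below runs against a made-up OCR excerpt. The sample text, the date format, the regex, and the parse_header name are all assumptions; the actual patterns in veryfi_test/extractor.py may differ.

```python
import re

# Hypothetical OCR excerpt; real Veryfi output for Switch invoices may differ.
SAMPLE_TEXT = """Invoice Date    Invoice No.
01/31/2024    123456
Example Customer LLC
100 Example Street
Springfield, IL 62701
Account No.    000111
"""

# Assumed pattern: an MM/DD/YYYY date followed by a numeric invoice number.
METADATA_ROW = re.compile(r"(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<number>\d+)")


def parse_header(text):
    """Return invoice and bill-to fields, or None if no metadata row is found."""
    match = METADATA_ROW.search(text)
    if match is None:
        return None
    # Bill-to block: everything after the metadata row up to "Account No.".
    rest = text[match.end():]
    block = rest.split("Account No.", 1)[0]
    lines = [line.strip() for line in block.splitlines() if line.strip()]
    return {
        "invoice_date": match.group("date"),
        "invoice_number": match.group("number"),
        "bill_to_name": lines[0] if lines else None,
        # Remaining lines are collapsed into one comma-separated address.
        "bill_to_address": ", ".join(lines[1:]) or None,
    }


if __name__ == "__main__":
    print(parse_header(SAMPLE_TEXT))
```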
Run the extraction CLI against a directory full of Veryfi JSON files:
veryfi-extract outputs-ocr/
This command scans every *.json file under outputs-ocr, extracts the supported
fields, and writes one output file per invoice inside outputs-extracted/
(extracted_<original-name>.json). Each saved file contains the InvoiceFields
payload plus the source path, ready for downstream tooling. The payload now
includes a line_items array where each row captures (assuming SKUs are
represented by the last alphanumeric token of exactly eight characters that
appears inside parentheses):
- sku: identifier derived from that eight-character token (uppercased); null if absent
- description: full text, including any wrapped lines
- quantity: Quantity column straight from the invoice
- price: Rate column
- total: Amount column
- tax_rate: null (not derivable from the invoice)
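The SKU rule can be read as: collect every parenthesized group in the description, find alphanumeric tokens of exactly eight characters inside them, and keep the last one. A minimal sketch of that reading follows; the extract_sku name, the regexes, and the sample descriptions are hypothetical, and the package may interpret edge cases differently.

```python
import re

# Text inside a single, non-nested pair of parentheses.
PAREN_GROUP = re.compile(r"\(([^()]*)\)")
# Exactly eight alphanumeric characters, not part of a longer token.
EIGHT_CHAR_TOKEN = re.compile(r"(?<![A-Za-z0-9])[A-Za-z0-9]{8}(?![A-Za-z0-9])")


def extract_sku(description):
    """Return the last eight-character parenthesized token, uppercased, or None."""
    candidates = []
    for group in PAREN_GROUP.findall(description):
        candidates.extend(EIGHT_CHAR_TOKEN.findall(group))
    return candidates[-1].upper() if candidates else None


if __name__ == "__main__":
    # Made-up descriptions, for illustration only.
    for text in ("Colocation cabinet (ab12cd34)", "Power (monthly)"):
        print(text, "->", extract_sku(text))
```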
Any JSON that does not match the Switch layout (for example examples/non_switch.json) is listed under skipped with the reason it was ignored. Running veryfi-extract outputs-ocr/ against a mix of approved and unapproved layouts prints a summary like:
{
"processed": 7,
"saved": [
"outputs-ocr/synth-switch_v5-14.json",
"outputs-ocr/synth-switch_v5-4.json",
"outputs-ocr/synth-switch_v5-68.json",
"outputs-ocr/synth-switch_v5-7.json",
"outputs-ocr/synth-switch_v5-79.json"
],
"skipped": {
"outputs-ocr/non-switch-invoice.json": "layout mismatch",
"outputs-ocr/non-switch-invoice2.json": "layout mismatch"
},
"output_dir": "outputs-extracted"
}
The extractor labels each non-Switch document with layout mismatch, so it never produces structured data for layouts we have not vetted.
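The scan-and-summarize flow can be sketched as follows. This is a simplified illustration, not the veryfi-extract implementation: looks_like_switch is a placeholder banner check, the saved payload holds only the source path instead of the full InvoiceFields, the glob is non-recursive, and treating processed as saved plus skipped is an assumption that matches the sample summary.

```python
import json
from pathlib import Path


def looks_like_switch(ocr_text):
    """Placeholder layout check based on the vendor banner cues."""
    lowered = ocr_text.lower()
    return "switch" in lowered and "po box 674592" in lowered


def run_extraction(input_dir, output_dir="outputs-extracted"):
    """Scan *.json files, write one output per accepted invoice, return a summary."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved, skipped = [], {}
    for path in sorted(Path(input_dir).glob("*.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        text = payload.get("veryfi_response", {}).get("ocr_text", "")
        if not looks_like_switch(text):
            skipped[str(path)] = "layout mismatch"
            continue
        # Real output would carry the extracted fields; only the source is kept here.
        target = out / f"extracted_{path.stem}.json"
        target.write_text(json.dumps({"source": str(path)}, indent=2), encoding="utf-8")
        saved.append(str(path))
    return {
        "processed": len(saved) + len(skipped),
        "saved": saved,
        "skipped": skipped,
        "output_dir": str(out),
    }


if __name__ == "__main__":
    print(json.dumps(run_extraction("outputs-ocr"), indent=2))
```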
Extraction Assumptions
- SKU Identification – When parsing line items we assume that the SKU is encoded as the last alphanumeric token with exactly eight characters enclosed in parentheses. The extractor uppercases that token and assigns it to sku. Items that lack such a token keep sku: null.
- Quantity – The quantity we store is exactly the value shown under the Quantity column (e.g., units, hours, bandwidth). We do not attempt to normalize units or convert them.
- Price – price comes directly from the Rate column, representing the unit price for the line item.
- Total – total is copied from the Amount column (the total per line, already calculated by the invoice and inclusive of any adjustments or discounts).
- Tax Rate – The invoices do not expose a dedicated tax-rate column and the OCR text does not provide enough information to derive one. Consequently, tax_rate is assumed to be unavailable and is always set to null.