Haystack components for structured KV extraction from financial PDFs via Azure Document Intelligence

haystack-financial-doc-extractor

Copyright 2026 Ambreen Zaver, Callisto Tech. Licensed under Apache 2.0.

Haystack components for structured key-value extraction from financial documents — IRS Form 1040, W-2, Schedule C/E, K-1 (1065) — via Azure Document Intelligence.

Designed for use cases where extracted values must be compared deterministically against an authoritative reference system (e.g. a financial aid platform, tax reconciliation engine, or audit workflow). All parsing, normalization, and delta computation are done in Python with no LLM involvement.

Why this package

Standard Haystack document loaders treat a PDF as a blob of text. Financial forms are structured: every field has a known label, a line reference, and a numeric value that must round-trip to Decimal without loss. This package handles:

4-stage Azure DI recovery chain — full doc → page splitter → DPI reduction → rotation block
Financial string normalization — $75,000, (12,500), 75000 USD, N/A, 12.5%
Non-negative field protection — W-2 box values printed in parens are positive, not negative
Delta + severity scoring — HIGH / MEDIUM / LOW against a reference value dict
MD5-based cache invalidation — skip Azure DI if the document hasn't changed
FERPA-safe by design — no PII in logs, opaque document IDs, no student data persisted in plaintext
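
To make the normalization bullet concrete, here is a minimal sketch of how the listed string formats could be reduced to Decimal. The helper name `normalize_amount`, the set of blank markers, and the choice to keep `12.5%` as `Decimal("12.5")` are assumptions for illustration; the package's own normalizer may behave differently.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical blank markers; the package's actual list is not documented here.
_MISSING = {"", "n/a", "na", "none", "-", "--"}


def normalize_amount(raw: str) -> Optional[Decimal]:
    """Parse a printed financial string into a Decimal, or None if blank/unparseable."""
    text = raw.strip()
    if text.lower() in _MISSING:
        return None
    # Accounting notation: (12,500) means -12,500.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop currency symbols, thousands separators, unit suffixes, and whitespace.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return -value if negative else value


for sample in ["$75,000", "(12,500)", "75000 USD", "N/A", "12.5%"]:
    print(f"{sample!r:>12} -> {normalize_amount(sample)!r}")
```

Parsing straight from the string into Decimal, without passing through float, is what keeps the value exact.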

Install

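pip install azure-di-financial-haystack

The distribution is published on PyPI as azure-di-financial-haystack; the code imports it as haystack_financial_doc_extractor.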

Requires Python 3.10+.

Components

Component	Input	Output
`BytesIngestionComponent`	`bytes_list`, `document_ids`, `source_names`	`list[DocumentPayload]`
`DocumentIngestionComponent`	`document_ids` (stub — implement for your DMS)	`list[DocumentPayload]`
`AzureDiExtractor`	`list[DocumentPayload]`	`list[dict]` with `kv_entries`
`KvNormalizer`	`list[dict]` from extractor	`list[ExtractedField]`
`DeltaCalculator`	`list[ExtractedField]` + `reference_values`	`list[ExtractedField]` with delta + severity
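
The 4-stage recovery chain behind `AzureDiExtractor` can be pictured as an ordered list of fallbacks, where the first stage that succeeds wins and its name is reported as `stage_used`. The sketch below is an assumption about the general shape only: `run_recovery_chain`, `ExtractionFailed`, the stand-in stage functions, and the `{"key", "value"}` entry shape are hypothetical, and no real Azure calls are made.

```python
from typing import Callable

# Each stage takes the raw PDF bytes and returns key-value entries, or raises.
Stage = Callable[[bytes], list]


class ExtractionFailed(Exception):
    """Raised when every stage in the chain has failed."""


def run_recovery_chain(pdf_bytes: bytes, stages: list) -> dict:
    errors = []
    for name, stage in stages:
        try:
            entries = stage(pdf_bytes)
        except Exception as exc:  # a real chain would catch narrower errors
            # Record the exception type only, never document content.
            errors.append(f"{name}: {type(exc).__name__}")
            continue
        return {"stage_used": name, "kv_entries": entries}
    raise ExtractionFailed("; ".join(errors))


# Stand-in stages for illustration: the first fails, the second succeeds.
def full_doc(pdf_bytes: bytes) -> list:
    raise TimeoutError("simulated failure")


def page_splitter(pdf_bytes: bytes) -> list:
    return [{"key": "adjusted gross income", "value": "$83,200"}]


outcome = run_recovery_chain(
    b"%PDF-", [("full_doc", full_doc), ("page_splitter", page_splitter)]
)
print(outcome["stage_used"])  # page_splitter
```

Ordering the stages from cheapest to most invasive means an undamaged document never pays for splitting, DPI reduction, or rotation handling.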

Quick start

from haystack_financial_doc_extractor import build_pipeline

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map={"adjusted gross income": "agi", "wages salaries tips": "wages"},
    section="HHA_INCOME",
    source_doc_type="IRS Form 1040",
)

with open("samples/f1040_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {
        "bytes_list": [pdf_bytes],
        "document_ids": ["doc-001"],
        "source_names": ["f1040_filled.pdf"],
    },
    "delta": {
        "reference_values": {"agi": 75000, "wages": 68000},
    },
})

for field in result["delta"]["fields"]:
    print(f"{field.field_name:<30} extracted={field.extracted_value}  delta={field.delta}  severity={field.severity}")
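
The `delta` and `severity` attributes printed above come from comparing each extracted value with its reference value. As a rough sketch of that comparison, assuming absolute dollar thresholds (the cut-offs of 100 and 1000 and the function name `score_delta` are invented for illustration; the package's actual rules are not documented here):

```python
from decimal import Decimal

# Hypothetical thresholds, in dollars.
MEDIUM_THRESHOLD = Decimal("100")
HIGH_THRESHOLD = Decimal("1000")


def score_delta(extracted: Decimal, reference) -> tuple:
    """Return (delta, severity) for one field."""
    # Convert via str so an int or float reference never introduces binary error.
    delta = extracted - Decimal(str(reference))
    magnitude = abs(delta)
    if magnitude >= HIGH_THRESHOLD:
        severity = "HIGH"
    elif magnitude >= MEDIUM_THRESHOLD:
        severity = "MEDIUM"
    else:
        severity = "LOW"
    return delta, severity


print(score_delta(Decimal("75000.00"), 75000))  # exact match
print(score_delta(Decimal("68250.00"), 68000))  # off by 250
print(score_delta(Decimal("70000.00"), 75000))  # off by 5000
```

Because every step is plain Decimal arithmetic, the same inputs always produce the same severity.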

Sample usage by form type

All examples below use the synthetic sample forms in samples/ — all names, SSNs, EINs, and dollar amounts are entirely fictional (see FERPA compliance).

Form 1040

from haystack_financial_doc_extractor import build_pipeline

FIELD_MAP_1040 = {
    "adjusted gross income":    "agi",
    "wages salaries tips":      "wages",
    "total income":             "total_income",
    "taxable interest":         "taxable_interest",
    "ordinary dividends":       "dividends",
    "capital gain or loss":     "capital_gain",
    "total tax":                "total_tax",
    "federal income tax withheld": "tax_withheld",
}

# Reference values from your authoritative system (e.g. PowerFAIDS, FAFSA)
REFERENCE = {"agi": 83200, "wages": 82000, "total_tax": 11500}

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map=FIELD_MAP_1040,
    section="HHA_INCOME",
    source_doc_type="IRS Form 1040",
    # capital gains and losses can legitimately be negative — no non_negative_fields here
)

with open("samples/f1040_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {"bytes_list": [pdf_bytes], "document_ids": ["1040-2023"], "source_names": ["f1040_filled.pdf"]},
    "delta": {"reference_values": REFERENCE},
})
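
The field map keys are lowercase and free of punctuation, which suggests that printed labels are normalized before lookup. The following sketch shows one way such matching could work; `normalize_label`, `match_field`, and the longest-key-first substring rule are assumptions, not the package's documented behavior.

```python
import re
from typing import Optional


def normalize_label(label: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    no_punct = re.sub(r"[^a-z0-9\s]", " ", label.lower())
    return " ".join(no_punct.split())


def match_field(label: str, field_map: dict) -> Optional[str]:
    """Return the canonical name whose key appears in the normalized label."""
    normalized = normalize_label(label)
    # Try the longest key first so a more specific label wins over a shorter one.
    for key in sorted(field_map, key=len, reverse=True):
        if key in normalized:
            return field_map[key]
    return None


FIELD_MAP = {"adjusted gross income": "agi", "wages salaries tips": "wages"}
print(match_field("1a  Wages, salaries, tips, etc.", FIELD_MAP))  # wages
print(match_field("11  Adjusted gross income", FIELD_MAP))        # agi
print(match_field("Filing status", FIELD_MAP))                    # None
```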

W-2

from haystack_financial_doc_extractor import build_pipeline

FIELD_MAP_W2 = {
    "wages tips other compensation": "wages",
    "federal income tax withheld":   "federal_withheld",
    "social security wages":         "ss_wages",
    "social security tax withheld":  "ss_tax_withheld",
    "medicare wages and tips":       "medicare_wages",
    "medicare tax withheld":         "medicare_tax_withheld",
}

REFERENCE = {"wages": 82000, "federal_withheld": 13200}

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map=FIELD_MAP_W2,
    section="HHA_INCOME",
    source_doc_type="W-2",
    # W-2 box values are never negative — parenthetical notation means something else
    non_negative_fields=["wages", "federal_withheld", "ss_wages", "ss_tax_withheld",
                         "medicare_wages", "medicare_tax_withheld"],
)

with open("samples/fw2_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {"bytes_list": [pdf_bytes], "document_ids": ["w2-2023"], "source_names": ["fw2_filled.pdf"]},
    "delta": {"reference_values": REFERENCE},
})
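
The effect of `non_negative_fields` can be illustrated with a small sketch. It assumes that a protected field which parsed as negative (for example because it was printed in parentheses) is coerced back to its absolute value, while unprotected fields keep their sign. The function name `apply_non_negative` is hypothetical.

```python
from decimal import Decimal


def apply_non_negative(values: dict, non_negative_fields: list) -> dict:
    """Force protected fields to be positive; leave all other fields untouched."""
    protected = set(non_negative_fields)
    return {
        name: abs(value) if name in protected else value
        for name, value in values.items()
    }


# "wages" was printed as (82,000) and parsed as negative; "net_profit" is a real loss.
parsed = {"wages": Decimal("-82000"), "net_profit": Decimal("-1200")}
cleaned = apply_non_negative(parsed, ["wages"])
print(cleaned["wages"], cleaned["net_profit"])  # 82000 -1200
```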

Schedule C (self-employment)

from haystack_financial_doc_extractor import build_pipeline

FIELD_MAP_SCHEDULE_C = {
    "gross receipts or sales":   "gross_receipts",
    "gross profit":              "gross_profit",
    "gross income":              "gross_income",
    "total expenses":            "total_expenses",
    # Tentative profit (line 29) and net profit (line 31) are different amounts,
    # so each gets its own canonical name.
    "tentative profit or loss":  "tentative_profit",
    "net profit or loss":        "net_profit",
}

REFERENCE = {"gross_receipts": 45000, "net_profit": 37400}

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map=FIELD_MAP_SCHEDULE_C,
    section="HHA_INCOME",
    source_doc_type="Schedule C",
    # net profit CAN be negative (a loss) — do not add to non_negative_fields
)

with open("samples/f1040sc_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {"bytes_list": [pdf_bytes], "document_ids": ["schc-2023"], "source_names": ["f1040sc_filled.pdf"]},
    "delta": {"reference_values": REFERENCE},
})

Schedule E (rental income)

from haystack_financial_doc_extractor import build_pipeline

FIELD_MAP_SCHEDULE_E = {
    "rents received":            "rental_income",
    "royalties received":        "royalties",
    "total rental real estate":  "net_rental",
    "advertising":               "expense_advertising",
    "insurance":                 "expense_insurance",
    "mortgage interest paid":    "expense_mortgage_interest",
}

REFERENCE = {"rental_income": 18000, "net_rental": 16350}

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map=FIELD_MAP_SCHEDULE_E,
    section="HHA_INCOME",
    source_doc_type="Schedule E",
)

with open("samples/f1040se_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {"bytes_list": [pdf_bytes], "document_ids": ["sche-2023"], "source_names": ["f1040se_filled.pdf"]},
    "delta": {"reference_values": REFERENCE},
})

Schedule K-1 (Form 1065 — partnership)

from haystack_financial_doc_extractor import build_pipeline

FIELD_MAP_K1 = {
    "ordinary business income loss":  "ordinary_income",
    "net rental real estate income":  "rental_income",
    "interest income":                "interest_income",
    "ordinary dividends":             "dividends",
    "net short term capital gain":    "st_capital_gain",
    "net long term capital gain":     "lt_capital_gain",
}

REFERENCE = {"ordinary_income": 18400, "interest_income": 320}

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map=FIELD_MAP_K1,
    section="HHA_INCOME",
    source_doc_type="Schedule K-1 (1065)",
    # ordinary income can be a loss — allow negatives
)

with open("samples/f1065sk1_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {"bytes_list": [pdf_bytes], "document_ids": ["k1-2023"], "source_names": ["f1065sk1_filled.pdf"]},
    "delta": {"reference_values": REFERENCE},
})

Persistence (optional)

SQLite store with MD5-based cache invalidation — skips Azure DI on re-runs if the document content hasn't changed:

from haystack_financial_doc_extractor import SqliteExtractionStore

store = SqliteExtractionStore("extractions.db")

if store.is_cached("doc-001", pdf_bytes):
    fields = store.load_cached("doc-001", pdf_bytes)
else:
    result = pipeline.run(...)
    fields = result["delta"]["fields"]
    stage = result["extractor"]["extractions"][0]["stage_used"]
    store.save("doc-001", "f1040_filled.pdf", pdf_bytes, stage, fields)

Extracted values are stored as strings and parsed back to Decimal on load. No raw PII fields (names, SSNs) are stored — only canonical field names and numeric values.
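
The two ideas in this section, a cache keyed by content hash and Decimal values stored as strings, can be sketched with the standard library alone. `TinyExtractionCache` and its table layout are illustrative stand-ins and are not the schema or API of `SqliteExtractionStore`.

```python
import hashlib
import sqlite3
from decimal import Decimal
from typing import Optional


class TinyExtractionCache:
    """Illustrative stand-in for a content-hash-keyed extraction store."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS fields ("
            "document_id TEXT, content_md5 TEXT, field_name TEXT, value TEXT, "
            "PRIMARY KEY (document_id, field_name))"
        )

    @staticmethod
    def _digest(pdf_bytes: bytes) -> str:
        # MD5 acts as a change detector here, not as a security control.
        return hashlib.md5(pdf_bytes).hexdigest()

    def save(self, document_id: str, pdf_bytes: bytes, values: dict) -> None:
        digest = self._digest(pdf_bytes)
        with self.conn:  # commits on success, rolls back on error
            # Drop rows left over from an earlier version of the document.
            self.conn.execute("DELETE FROM fields WHERE document_id = ?", (document_id,))
            self.conn.executemany(
                "INSERT INTO fields VALUES (?, ?, ?, ?)",
                [(document_id, digest, name, str(value)) for name, value in values.items()],
            )

    def load(self, document_id: str, pdf_bytes: bytes) -> Optional[dict]:
        rows = self.conn.execute(
            "SELECT field_name, value FROM fields WHERE document_id = ? AND content_md5 = ?",
            (document_id, self._digest(pdf_bytes)),
        ).fetchall()
        if not rows:
            return None  # never cached, or the content has changed
        return {name: Decimal(value) for name, value in rows}


cache = TinyExtractionCache()
cache.save("doc-001", b"version 1", {"agi": Decimal("83200.00")})
hit = cache.load("doc-001", b"version 1")
miss = cache.load("doc-001", b"version 2")
print(hit, miss)
```

Storing `str(value)` and reading it back with `Decimal(value)` preserves the exact digits, including trailing zeros, which a REAL column would not.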

Sections

from haystack_financial_doc_extractor import SectionKey

SectionKey.HHA_INCOME   # Household A income documents
SectionKey.HHB_INCOME   # Household B income documents
SectionKey.STUDENT      # Student income and assets
SectionKey.ASSETS       # Asset documentation
SectionKey.HOUSEHOLD    # Household composition
SectionKey.EXPENSES     # Expense documentation
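
The examples in this README pass the section as a plain string such as "HHA_INCOME". One way to validate such a string against the keys listed above is a str-based Enum, sketched below. `SectionKeySketch` and `parse_section` are hypothetical, and the assumption that each member's value equals its name is not confirmed by the package.

```python
from enum import Enum


class SectionKeySketch(str, Enum):
    """Stand-in for SectionKey; member values are assumed to equal their names."""

    HHA_INCOME = "HHA_INCOME"
    HHB_INCOME = "HHB_INCOME"
    STUDENT = "STUDENT"
    ASSETS = "ASSETS"
    HOUSEHOLD = "HOUSEHOLD"
    EXPENSES = "EXPENSES"


def parse_section(raw: str) -> SectionKeySketch:
    """Convert a caller-supplied string to a section key, or raise ValueError."""
    try:
        return SectionKeySketch(raw.strip().upper())
    except ValueError:
        allowed = ", ".join(member.value for member in SectionKeySketch)
        raise ValueError(f"unknown section {raw!r}; expected one of: {allowed}") from None


print(parse_section("hha_income") is SectionKeySketch.HHA_INCOME)  # True
```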

Running the example script

# Install
pip install -e ".[dev]"

# Set Azure credentials
export AZURE_DI_ENDPOINT="https://<resource>.cognitiveservices.azure.com/"
export AZURE_DI_KEY="<your-key>"

# Run against a sample form
python examples/run_pipeline.py --pdf samples/f1040_filled.pdf --section HHA_INCOME

# Output:
# FIELD                          EXTRACTED    REFERENCE      DELTA SEVERITY
# --------------------------------------------------------------------------------
# agi                             83200.00      83200.00       0.00 LOW
# wages                           82000.00      82000.00       0.00 LOW
# total_tax                       11500.00      11500.00       0.00 LOW

FERPA compliance

This package is designed for deployment in environments that process student financial aid records subject to FERPA (Family Educational Rights and Privacy Act).

What this package does

No PII in logs. The logger emits field names and numeric values only. Raw document content (which may contain names and SSNs) is never logged.
Opaque document IDs. The document_id passed to components is an opaque caller-supplied string. The package does not inspect, store, or log it in a way that exposes student identity.
No cross-document state. Each pipeline.run() call is stateless. No data from one document is accessible during processing of another.
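Numeric-only persistence. SqliteExtractionStore persists canonical field names and numeric values only (serialized as strings and parsed back to Decimal on load). It does not persist raw document text, names, SSNs, or addresses.
Cache keyed by content hash. Cache lookup uses MD5(pdf_bytes). The digest cannot feasibly be reversed to recover document content, although anyone who already holds the same file can confirm a match. MD5 is not collision-resistant, so it serves as a change detector rather than a security control.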
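
Deployers who want a second line of defense behind the "no PII in logs" guarantee could attach a redacting filter to their own log handlers. The sketch below uses only the standard logging module; `RedactSsnFilter`, `ListHandler`, and the logger name are hypothetical and are not part of this package.

```python
import logging
import re

# Matches SSN-shaped text such as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class RedactSsnFilter(logging.Filter):
    """Replace SSN-shaped substrings in a log record before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = SSN_PATTERN.sub("[REDACTED]", record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
handler.addFilter(RedactSsnFilter())
logger = logging.getLogger("extraction_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("field=%s value=%s", "agi", "83200.00")
logger.info("unexpected text: %s", "SSN 123-45-6789")
print(handler.messages)
```

A pattern filter like this only catches values that look like SSNs; it complements, and does not replace, the practice of never passing raw document content to the logger.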

Sample data

All files in samples/ were generated by samples/generate_samples.py using entirely fictional data:

Names: James Harrington (fictional)
SSNs: XXX-XX-1234 (masked — not a real SSN format)
EINs: 12-3456789, 98-7654321 (fictional)
Addresses: 742 Evergreen Terrace, Springfield IL (fictional)
Dollar amounts: representative but invented

No real taxpayer data was used. Do not commit real tax documents to this repository.

Deployer responsibilities

FERPA compliance of the overall system depends on how you deploy this package:

Concern	Your responsibility
Azure DI data retention	Disable Azure DI input/output logging in your Azure resource
Network boundary	Deploy behind VPN or private endpoint — never expose extraction endpoints publicly
Auth	Protect the endpoints that accept PDF bytes with Bearer JWT or equivalent
SQLite file	Restrict filesystem permissions on `extractions.db` — treat it as sensitive
Blob storage	If storing PDFs in Azure Blob, enable encryption at rest and restrict access
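
For the SQLite row in the table above, restricting permissions on a POSIX system can be as small as the following sketch. It assumes a POSIX filesystem (on Windows, `os.chmod` only toggles the read-only flag), and `restrict_to_owner` is a hypothetical helper name.

```python
import os
import stat
import tempfile


def restrict_to_owner(path: str) -> None:
    """Limit a file to owner read/write (POSIX mode 0o600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# Demonstrate on a throwaway file rather than a real extractions.db.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "extractions.db")
    open(db_path, "wb").close()
    restrict_to_owner(db_path)
    mode = stat.S_IMODE(os.stat(db_path).st_mode)

print(oct(mode))
```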

License

Apache 2.0.

Version 0.1.1, released Jun 9, 2026


azure-di-financial-haystack 0.1.1
