Skip to main content

Haystack components for structured KV extraction from financial PDFs via Azure Document Intelligence

Project description

haystack-financial-doc-extractor

Copyright 2026 Ambreen Zaver, Callisto Tech. Licensed under Apache 2.0.

Haystack components for structured key-value extraction from financial documents — IRS Form 1040, W-2, Schedule C/E, K-1 (1065) — via Azure Document Intelligence.

Designed for use cases where extracted values must be compared deterministically against an authoritative reference system (e.g. a financial aid platform, tax reconciliation engine, or audit workflow). All parsing, normalization, and delta computation is done in Python with no LLM involvement.


Why this package

Standard Haystack document loaders treat a PDF as a blob of text. Financial forms are structured: every field has a known label, a line reference, and a numeric value that must round-trip to Decimal without loss. This package handles:

  • 4-stage Azure DI recovery chain — full doc → page splitter → DPI reduction → rotation block
  • Financial string normalization$75,000, (12,500), 75000 USD, N/A, 12.5%
  • Non-negative field protection — W-2 box values printed in parens are positive, not negative
  • Delta + severity scoring — HIGH / MEDIUM / LOW against a reference value dict
  • MD5-based cache invalidation — skip Azure DI if the document hasn't changed
  • FERPA-safe by design — no PII in logs, opaque document IDs, no student data persisted in plaintext

Install

pip install haystack-financial-doc-extractor

Requires Python 3.10+.


Components

Component Input Output
BytesIngestionComponent bytes_list, document_ids, source_names list[DocumentPayload]
DocumentIngestionComponent document_ids (stub — implement for your DMS) list[DocumentPayload]
AzureDiExtractor list[DocumentPayload] list[dict] with kv_entries
KvNormalizer list[dict] from extractor list[ExtractedField]
DeltaCalculator list[ExtractedField] + reference_values list[ExtractedField] with delta + severity

Quick start

from haystack_financial_doc_extractor import build_pipeline

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map={"adjusted gross income": "agi", "wages salaries tips": "wages"},
    section="HHA_INCOME",
    source_doc_type="IRS Form 1040",
)

with open("samples/f1040_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {
        "bytes_list": [pdf_bytes],
        "document_ids": ["doc-001"],
        "source_names": ["f1040_filled.pdf"],
    },
    "delta": {
        "reference_values": {"agi": 75000, "wages": 68000},
    },
})

for field in result["delta"]["fields"]:
    print(f"{field.field_name:<30} extracted={field.extracted_value}  delta={field.delta}  severity={field.severity}")

Sample usage by form type

All examples below use the synthetic sample forms in samples/ — all names, SSNs, EINs, and dollar amounts are entirely fictional (see FERPA compliance).

Form 1040

from haystack_financial_doc_extractor import build_pipeline

FIELD_MAP_1040 = {
    "adjusted gross income":    "agi",
    "wages salaries tips":      "wages",
    "total income":             "total_income",
    "taxable interest":         "taxable_interest",
    "ordinary dividends":       "dividends",
    "capital gain or loss":     "capital_gain",
    "total tax":                "total_tax",
    "federal income tax withheld": "tax_withheld",
}

# Reference values from your authoritative system (e.g. PowerFAIDS, FAFSA)
REFERENCE = {"agi": 83200, "wages": 82000, "total_tax": 11500}

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map=FIELD_MAP_1040,
    section="HHA_INCOME",
    source_doc_type="IRS Form 1040",
    # capital gains and losses can legitimately be negative — no non_negative_fields here
)

with open("samples/f1040_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {"bytes_list": [pdf_bytes], "document_ids": ["1040-2023"], "source_names": ["f1040_filled.pdf"]},
    "delta": {"reference_values": REFERENCE},
})

W-2

from haystack_financial_doc_extractor import build_pipeline

FIELD_MAP_W2 = {
    "wages tips other compensation": "wages",
    "federal income tax withheld":   "federal_withheld",
    "social security wages":         "ss_wages",
    "social security tax withheld":  "ss_tax_withheld",
    "medicare wages and tips":       "medicare_wages",
    "medicare tax withheld":         "medicare_tax_withheld",
}

REFERENCE = {"wages": 82000, "federal_withheld": 13200}

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map=FIELD_MAP_W2,
    section="HHA_INCOME",
    source_doc_type="W-2",
    # W-2 box values are never negative — parenthetical notation means something else
    non_negative_fields=["wages", "federal_withheld", "ss_wages", "ss_tax_withheld",
                         "medicare_wages", "medicare_tax_withheld"],
)

with open("samples/fw2_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {"bytes_list": [pdf_bytes], "document_ids": ["w2-2023"], "source_names": ["fw2_filled.pdf"]},
    "delta": {"reference_values": REFERENCE},
})

Schedule C (self-employment)

from haystack_financial_doc_extractor import build_pipeline

FIELD_MAP_SCHEDULE_C = {
    "gross receipts or sales":   "gross_receipts",
    "gross profit":              "gross_profit",
    "gross income":              "gross_income",
    "total expenses":            "total_expenses",
    "tentative profit or loss":  "net_profit",
    "net profit or loss":        "net_profit",
}

REFERENCE = {"gross_receipts": 45000, "net_profit": 37400}

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map=FIELD_MAP_SCHEDULE_C,
    section="HHA_INCOME",
    source_doc_type="Schedule C",
    # net profit CAN be negative (a loss) — do not add to non_negative_fields
)

with open("samples/f1040sc_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {"bytes_list": [pdf_bytes], "document_ids": ["schc-2023"], "source_names": ["f1040sc_filled.pdf"]},
    "delta": {"reference_values": REFERENCE},
})

Schedule E (rental income)

from haystack_financial_doc_extractor import build_pipeline

FIELD_MAP_SCHEDULE_E = {
    "rents received":            "rental_income",
    "royalties received":        "royalties",
    "total rental real estate":  "net_rental",
    "advertising":               "expense_advertising",
    "insurance":                 "expense_insurance",
    "mortgage interest paid":    "expense_mortgage_interest",
}

REFERENCE = {"rental_income": 18000, "net_rental": 16350}

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map=FIELD_MAP_SCHEDULE_E,
    section="HHA_INCOME",
    source_doc_type="Schedule E",
)

with open("samples/f1040se_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {"bytes_list": [pdf_bytes], "document_ids": ["sche-2023"], "source_names": ["f1040se_filled.pdf"]},
    "delta": {"reference_values": REFERENCE},
})

Schedule K-1 (Form 1065 — partnership)

from haystack_financial_doc_extractor import build_pipeline

FIELD_MAP_K1 = {
    "ordinary business income loss":  "ordinary_income",
    "net rental real estate income":  "rental_income",
    "interest income":                "interest_income",
    "ordinary dividends":             "dividends",
    "net short term capital gain":    "st_capital_gain",
    "net long term capital gain":     "lt_capital_gain",
}

REFERENCE = {"ordinary_income": 18400, "interest_income": 320}

pipeline = build_pipeline(
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_api_key="...",
    field_map=FIELD_MAP_K1,
    section="HHA_INCOME",
    source_doc_type="Schedule K-1 (1065)",
    # ordinary income can be a loss — allow negatives
)

with open("samples/f1065sk1_filled.pdf", "rb") as f:
    pdf_bytes = f.read()

result = pipeline.run({
    "ingest": {"bytes_list": [pdf_bytes], "document_ids": ["k1-2023"], "source_names": ["f1065sk1_filled.pdf"]},
    "delta": {"reference_values": REFERENCE},
})

Persistence (optional)

SQLite store with MD5-based cache invalidation — skips Azure DI on re-runs if the document content hasn't changed:

from haystack_financial_doc_extractor import SqliteExtractionStore

store = SqliteExtractionStore("extractions.db")

if store.is_cached("doc-001", pdf_bytes):
    fields = store.load_cached("doc-001", pdf_bytes)
else:
    result = pipeline.run(...)
    fields = result["delta"]["fields"]
    stage = result["extractor"]["extractions"][0]["stage_used"]
    store.save("doc-001", "f1040_filled.pdf", pdf_bytes, stage, fields)

Extracted values are stored as strings and parsed back to Decimal on load. No raw PII fields (names, SSNs) are stored — only canonical field names and numeric values.


Sections

from haystack_financial_doc_extractor import SectionKey

SectionKey.HHA_INCOME   # Household A income documents
SectionKey.HHB_INCOME   # Household B income documents
SectionKey.STUDENT      # Student income and assets
SectionKey.ASSETS       # Asset documentation
SectionKey.HOUSEHOLD    # Household composition
SectionKey.EXPENSES     # Expense documentation

Running the example script

# Install
pip install -e ".[dev]"

# Set Azure credentials
export AZURE_DI_ENDPOINT="https://<resource>.cognitiveservices.azure.com/"
export AZURE_DI_KEY="<your-key>"

# Run against a sample form
python examples/run_pipeline.py --pdf samples/f1040_filled.pdf --section HHA_INCOME

# Output:
# FIELD                          EXTRACTED    REFERENCE      DELTA SEVERITY
# --------------------------------------------------------------------------------
# agi                             83200.00      83200.00       0.00 LOW
# wages                           82000.00      82000.00       0.00 LOW
# total_tax                       11500.00      11500.00       0.00 LOW

FERPA compliance

This package is designed for deployment in environments that process student financial aid records subject to FERPA (Family Educational Rights and Privacy Act).

What this package does

  • No PII in logs. The logger emits field names and numeric values only. Raw document content (which may contain names and SSNs) is never logged.
  • Opaque document IDs. The document_id passed to components is an opaque caller-supplied string. The package does not inspect, store, or log it in a way that exposes student identity.
  • No cross-document state. Each pipeline.run() call is stateless. No data from one document is accessible during processing of another.
  • Numeric-only persistence. SqliteExtractionStore persists canonical field names and Decimal values only — not raw document text, not names, not SSNs, not addresses.
  • Cache keyed by content hash. Cache lookup uses MD5(pdf_bytes) — the hash is a one-way function and reveals nothing about document content.

Sample data

All files in samples/ were generated by samples/generate_samples.py using entirely fictional data:

  • Names: James Harrington (fictional)
  • SSNs: XXX-XX-1234 (masked — not a real SSN format)
  • EINs: 12-3456789, 98-7654321 (fictional)
  • Addresses: 742 Evergreen Terrace, Springfield IL (fictional)
  • Dollar amounts: representative but invented

No real taxpayer data was used. Do not commit real tax documents to this repository.

Deployer responsibilities

FERPA compliance of the overall system depends on how you deploy this package:

Concern Your responsibility
Azure DI data retention Disable Azure DI input/output logging in your Azure resource
Network boundary Deploy behind VPN or private endpoint — never expose extraction endpoints publicly
Auth Protect the endpoints that accept PDF bytes with Bearer JWT or equivalent
SQLite file Restrict filesystem permissions on extractions.db — treat it as sensitive
Blob storage If storing PDFs in Azure Blob, enable encryption at rest and restrict access

License

Copyright 2026 Ambreen Zaver, Callisto Tech. Licensed under the Apache License, Version 2.0.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

azure_di_financial_haystack-0.1.1.tar.gz (3.3 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

azure_di_financial_haystack-0.1.1-py3-none-any.whl (24.1 kB view details)

Uploaded Python 3

File details

Details for the file azure_di_financial_haystack-0.1.1.tar.gz.

File metadata

File hashes

Hashes for azure_di_financial_haystack-0.1.1.tar.gz
Algorithm Hash digest
SHA256 90af5bafc5060007a983eab2493adcbf44c2b8097f41ac738b38266dd6743c95
MD5 a1ee4f8857baedc78520b5c41a4e581a
BLAKE2b-256 b74d541ff4f5627aa58ca65fbe92b8a75ec3ddfef21300e8857cd02265d62d62

See more details on using hashes here.

File details

Details for the file azure_di_financial_haystack-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for azure_di_financial_haystack-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 093f6c9bac740a330deee8c6de34ba4bdfb18dd8a0d6702fffdd4a55239bdede
MD5 9f10497f663b916db01f3d9768bc93ac
BLAKE2b-256 55590cfd2187ac35d90f29ad4b96138d599a54759c36296a02962460f3f54b56

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page