
Canadian Financial Document Intelligence Framework


FinLit 🍁

Extract structured data from Canadian financial documents (T4s, T5s, SEDAR filings, bank statements) with a compliance audit trail built in.

PyPI · License: Apache 2.0 · Python 3.10+ · Built on Docling

```shell
pip install finlit
python -m spacy download en_core_web_lg   # one-time, required by Presidio
export ANTHROPIC_API_KEY=sk-ant-...
```

```python
from finlit import DocumentPipeline, schemas

result = DocumentPipeline(schema=schemas.CRA_T4, extractor="claude").run("t4_2024.pdf")

print(result.fields["box_14_employment_income"])     # → 87500.0
print(result.confidence["box_14_employment_income"]) # → 0.97
print(result.needs_review)                           # → False
```

Who this is for:

  • Canadian fintechs processing user-uploaded T-slips into structured data
  • Banks and credit unions running SEDAR filing and statement pipelines
  • Accounting and tax software pre-filling CRA forms from client documents
  • Any team that needs on-premises extraction with a PIPEDA/OSFI-friendly audit trail

Not a developer? See docs/use-cases.md for business context, compliance framing, and "build vs. buy" math.

In a hurry? Read the 5-minute Quickstart: install, pick a backend, extract your first T4. Come back here when you need the reference.


Why FinLit

General-purpose extraction tools parse PDFs fine but don't know what a T4 box is, what fields the CRA requires, or what a Canadian SIN looks like. FinLit is the Canadian-document layer (pre-built, open-source, on-premises) that you'd otherwise write yourself: versioned CRA schemas, per-field confidence, source traceability, PIPEDA PII detection, and an immutable audit log.

It wraps Docling (IBM's parser) and pydantic-ai (model-agnostic LLM orchestration). It runs entirely inside your infrastructure; with extractor="ollama", even LLM calls stay on-prem, suitable for OSFI-regulated and air-gapped deployments.


Setup

Install:

```shell
pip install finlit
python -m spacy download en_core_web_lg
```

The spaCy model is required by Presidio for PII detection. Skipping it will raise an OSError on first pipeline run.
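If you want to fail fast with a clearer message than that OSError, you can check for the model at startup. The model installs as an importable package, so a stdlib check is enough; the helper name below is ours, not part of FinLit:

```python
import importlib.util

def spacy_model_installed(name: str = "en_core_web_lg") -> bool:
    """True if the named spaCy model is importable as an installed package."""
    return importlib.util.find_spec(name) is not None

# At startup, before building the pipeline:
# if not spacy_model_installed():
#     raise SystemExit("Run: python -m spacy download en_core_web_lg")
```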

Pick an extractor backend; set one of:

| Backend | Env / setup | Extractor string |
|---|---|---|
| Anthropic Claude (default) | `export ANTHROPIC_API_KEY=...` | `"claude"` |
| OpenAI | `export OPENAI_API_KEY=...` | `"openai"` |
| Local Ollama | Install Ollama · `ollama pull llama3.2` | `"ollama"` |

Docling pulls its layout models from HuggingFace on first run (~500MB, cached afterwards).


Usage

Extract a T4

```python
from finlit import DocumentPipeline, schemas

pipeline = DocumentPipeline(
    schema=schemas.CRA_T4,
    extractor="claude",       # or "openai" or "ollama"
    audit=True,
    review_threshold=0.85,
)

result = pipeline.run("john_doe_t4_2024.pdf")

# Typed, validated fields
print(result.fields["box_14_employment_income"])      # → 87500.0
print(result.fields["province_of_employment"])        # → "ON"

# Per-field confidence: box_52 came back at 71%, below the 0.85 threshold
print(result.confidence["box_52_pension_adjustment"]) # → 0.71
print(result.needs_review)                            # → True
print(result.review_fields)
# [{"field": "box_52_pension_adjustment", "confidence": 0.71, "raw": "..."}]

# Trace any value back to its location in the source PDF
print(result.source_ref["box_14_employment_income"])
# {"page": 1, "bbox": [120, 340, 280, 360], "doc": "john_doe_t4_2024.pdf"}

# Audit log: append-only, finalized at end of run
for ev in result.audit_log:
    print(ev["event"], ev.get("ts"))
# document_loaded ...
# pii_detected ...
# extraction_complete ...
# review_flagged ...
# pipeline_complete ...
```
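Since result.audit_log is a plain list of dicts, persisting it for compliance retention is straightforward. A minimal JSONL writer (the file layout and event keys here are illustrative, not a FinLit API):

```python
import json
from pathlib import Path

def write_audit_jsonl(audit_log: list[dict], dest: Path) -> int:
    """Append one JSON object per audit event; returns the number written."""
    with dest.open("a", encoding="utf-8") as fh:
        for event in audit_log:
            fh.write(json.dumps(event, sort_keys=True) + "\n")
    return len(audit_log)
```

Append-only files pair naturally with the pipeline's append-only log; rotate per day or per batch to match your retention policy.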

Batch processing

```python
from finlit import BatchPipeline, schemas
from glob import glob

batch = BatchPipeline(schema=schemas.CRA_T4, extractor="claude", workers=8)

for path in glob("uploads/*.pdf"):
    batch.add(path)

results = batch.run()
results.export_csv("extracted/t4s_2024.csv")

print(f"Processed:    {results.total}")
print(f"Needs review: {results.review_count}")
```

Vision fallback for scans and forms

Text extraction fails in two cases: image-only PDFs with no text layer, and form-heavy documents (tax slips, invoices) where 2D column alignment carries meaning. For both, enable the vision fallback, which sends rendered page images to any multimodal LLM.

```python
from finlit import DocumentPipeline, VisionExtractor, schemas

pipeline = DocumentPipeline(
    schema=schemas.CRA_T5,
    extractor="claude",                     # text path (cheap, fast)
    vision_extractor=VisionExtractor(),     # vision fallback (accurate)
)
result = pipeline.run("t5_scanned.pdf")
print(result.extraction_path)               # → "text" or "vision"
```

By default the vision extractor runs only when the text result has needs_review=True. Pass a custom callback for finer control:

```python
pipeline = DocumentPipeline(
    schema=schemas.CRA_T5,
    extractor="claude",
    vision_extractor=VisionExtractor(model="openai:gpt-4o"),
    vision_fallback_when=lambda r: any(c < 0.80 for c in r.confidence.values()),
)
```

Vision results replace the text result entirely: result.fields is whatever the vision extractor returned, and result.extraction_path == "vision". If vision fails for any reason (render error, API failure, LLM error), the pipeline keeps the text result and logs a vision_fallback_failed warning.

Fully local with Ollama

No API keys, no external network; suitable for air-gapped and OSFI-regulated deployments. Text and vision can each be local independently.

```python
pipeline = DocumentPipeline(
    schema=schemas.CRA_T5,
    extractor="ollama:llama3.2",
    vision_extractor=VisionExtractor(model="ollama:qwen2.5vl:7b"),
)
```

Vision models verified against CRA slips:

| Model | Size | Ollama tag | Notes |
|---|---|---|---|
| Qwen2.5-VL | 7B | `ollama:qwen2.5vl:7b` | Strongest on form/document tasks |
| Llama 3.2 Vision | 11B | `ollama:llama3.2-vision` | General-purpose, Meta |
| MiniCPM-V | 8B | `ollama:minicpm-v` | Fast, OpenBMB |

Any pydantic-ai-compatible multimodal model works; these are the ones explicitly tested.

Custom schemas

```python
from finlit import DocumentPipeline, Schema, Field

loan_schema = Schema(
    name="internal_loan_application",
    fields=[
        Field("applicant_name",  dtype=str,   required=True),
        Field("gross_income",    dtype=float, required=True),
        Field("sin_number",      dtype=str,   pii=True),
        Field("loan_amount",     dtype=float, required=True),
    ],
)

result = DocumentPipeline(schema=loan_schema, extractor="claude").run("loan_app.pdf")
```

Error handling

FinLit does not raise on low-confidence fields; those go into result.review_fields. It does attach structured warnings for document-level problems (sparse OCR, missing required fields, vision fallback failure).

```python
result = pipeline.run("t4.pdf")

if result.needs_review:
    for flagged in result.review_fields:
        queue_for_human_review(flagged)

for warning in result.warnings:
    if warning["code"] == "sparse_document":
        # PDF had very little extractable text; likely a scan
        ...
    elif warning["code"] == "vision_fallback_failed":
        # Vision path was tried and failed; we kept the text result
        log.warn(warning["reason"])
```

Common warning codes:

| Code | Meaning |
|---|---|
| `sparse_document` | Extracted text is very short; likely an image-only PDF |
| `missing_required_fields` | One or more `required=True` fields came back empty |
| `vision_fallback_failed` | Vision path was attempted and failed; text result retained |
| `pii_detected` | Presidio found PII entities in the source text |
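When pii_detected fires, the matches themselves sit on result.pii_entities. A small summary helper over that list of dicts; we assume the dict keys mirror Presidio's RecognizerResult (entity_type, start, end, score), and the CA_SIN entity name is illustrative:

```python
from collections import Counter

def pii_summary(entities: list[dict]) -> dict[str, int]:
    """Count detected PII entities by type, e.g. {"CA_SIN": 2}."""
    return dict(Counter(e["entity_type"] for e in entities))
```

Handy for deciding whether a document needs a redaction pass before it leaves your trust boundary.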

LangChain integration

FinLit ships a LangChain BaseLoader so you can drop extracted Canadian financial documents straight into RAG pipelines, retrievers, and agents.

Install the extra:

```shell
pip install finlit[langchain]
```

Load one file:

```python
from finlit.integrations.langchain import FinLitLoader

docs = FinLitLoader("t4.pdf", schema="cra.t4").load()
doc = docs[0]
print(doc.metadata["finlit_fields"]["employer_name"])      # "Acme Corp"
print(doc.metadata["finlit_needs_review"])                 # False
```

Batch load with compliance-friendly error surfacing:

```python
loader = FinLitLoader(
    ["t4_001.pdf", "t4_002.pdf", "t4_003.pdf"],
    schema="cra.t4",
    on_error="include",  # failures become Documents with finlit_error
)
docs = loader.load()

# Filter out failures before embedding; empty page_content breaks most embedders.
good = [d for d in docs if not d.metadata.get("finlit_error")]
```

Access the underlying ExtractionResult objects via loader.last_results; entries are in the same order as the input paths, with None for failures (whether skipped or included).

MCP server

Expose FinLit as a Model Context Protocol server so any MCP-compatible host (Claude Desktop, Claude Code, Cursor, custom agents) can extract documents through tool calls, no Python glue required.

Install the extra:

```shell
pip install finlit[mcp]
```

Run the server (two equivalent ways):

```shell
# Human-facing
finlit mcp serve --extractor claude

# Claude Desktop mcpServers config
python -m finlit.integrations.mcp
```

Claude Desktop config example:

```json
{
  "mcpServers": {
    "finlit": {
      "command": "python",
      "args": ["-m", "finlit.integrations.mcp"],
      "env": {
        "ANTHROPIC_API_KEY": "...",
        "FINLIT_EXTRACTOR": "claude",
        "FINLIT_PII_MODE": "redact"
      }
    }
  }
}
```

Tools exposed:

  • list_schemas(): discover the built-in CRA / banking schemas
  • extract_document(path, schema, ...): extract one document
  • batch_extract(paths, schema, ...): extract many in parallel
  • detect_pii(text, ...): standalone Presidio + Canadian recognizers

PII fields (per schema annotation) are redacted in tool responses by default, which suits the chat-transcript trust model. Pass redact_pii=false per call, or start the server with --pii-mode raw, to opt out.
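FinLit's redaction happens inside the tool layer; as a standalone illustration of what masking a Canadian SIN involves, using the same digit pattern the schema examples use (\d{3}-\d{3}-\d{3}, without anchors so it matches mid-text):

```python
import re

# Formatted SIN: three groups of three digits, hyphen-separated.
SIN_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{3}\b")

def redact_sins(text: str, mask: str = "***-***-***") -> str:
    """Replace anything shaped like a formatted SIN with a fixed mask."""
    return SIN_PATTERN.sub(mask, text)
```

Real redaction should lean on Presidio's detections (which also catch unformatted SINs); this sketch only shows the shape of the problem.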


CLI

```shell
finlit extract t4_2024.pdf --schema cra.t4 --extractor claude
finlit extract t4_2024.pdf --schema cra.t4 --output json
finlit extract t5_scan.pdf --schema cra.t5 --extractor claude \
    --vision-extractor claude
finlit schema list
```

Flags:

| Flag | Default | Description |
|---|---|---|
| `--schema` | required | Schema name (`cra.t4`, `cra.t5`, …) |
| `--extractor` | `claude` | Text extractor: `claude`, `openai`, `ollama`, or a pydantic-ai model string |
| `--vision-extractor` | none | Enable vision fallback. Accepts `claude`/`openai`/`ollama` or a full model string like `ollama:qwen2.5vl:7b` |
| `--output` | `table` | Output format: `table`, `json`, `csv` |
| `--review-threshold` | `0.85` | Confidence below which a field is flagged for review |

API reference

DocumentPipeline

```python
DocumentPipeline(
    schema: Schema,
    extractor: str | BaseExtractor = "claude",
    model: str | None = None,
    vision_extractor: BaseVisionExtractor | None = None,
    vision_fallback_when: Callable[[ExtractionResult], bool] | None = None,
    audit: bool = True,
    review_threshold: float = 0.85,
)
```

run(path: str | Path) -> ExtractionResult parses, extracts, validates, and audits. It never raises on low confidence; inspect .needs_review and .warnings instead.

ExtractionResult

```python
result.fields                 # dict[str, Any]   - typed, validated values
result.confidence             # dict[str, float] - 0.0-1.0 per field
result.source_ref             # dict[str, dict]  - {page, bbox, doc} per field
result.pii_entities           # list[dict]       - Presidio detections
result.audit_log              # list[dict]       - immutable event log
result.review_fields          # list[dict]       - fields below threshold
result.needs_review           # bool
result.warnings               # list[dict]       - document-level warnings
result.extracted_field_count  # int
result.extraction_path        # "text" | "vision"
```

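These fields compose naturally into a routing decision for document workflows. A sketch of a triage function over the result's flags and warnings (the route names and rules below are ours, not a FinLit API):

```python
def triage(needs_review: bool, warnings: list[dict]) -> str:
    """Illustrative routing: 'retry-with-vision', 'review', or 'auto'."""
    codes = {w["code"] for w in warnings}
    if "sparse_document" in codes:
        # Likely a scan: re-run the pipeline with vision_extractor=VisionExtractor()
        return "retry-with-vision"
    if needs_review or "missing_required_fields" in codes or "vision_fallback_failed" in codes:
        # Low-confidence fields or document-level problems: human queue
        return "review"
    return "auto"
```

Callers would pass result.needs_review and result.warnings; keeping the function free of FinLit types makes it trivial to unit-test.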
Schema and Field

```python
Schema(name: str, fields: list[Field], version: str = "1")
Field(
    name: str,
    dtype: type,              # str, int, float, bool, date
    required: bool = False,
    pii: bool = False,        # annotation only; not auto-redacted
    regex: str | None = None,
    description: str = "",
)
```

Extractor strings

extractor= accepts the shorthands "claude", "openai", "ollama", any full pydantic-ai model string ("anthropic:claude-sonnet-4-6", "ollama:llama3.2"), or your own BaseExtractor instance. vision_extractor= takes a VisionExtractor(model=...) or any BaseVisionExtractor subclass.

```python
from finlit.extractors import BaseExtractor
from finlit import BaseVisionExtractor

class MyTextExtractor(BaseExtractor):
    def extract(self, text, schema): ...

class MyVisionExtractor(BaseVisionExtractor):
    def extract(self, images, schema, text=""): ...

DocumentPipeline(
    schema=schemas.CRA_T4,
    extractor=MyTextExtractor(),
    vision_extractor=MyVisionExtractor(),
)
```

Troubleshooting

**OSError: [E050] Can't find model 'en_core_web_lg'**
Presidio needs the spaCy model. Run python -m spacy download en_core_web_lg once after install.

**anthropic.AuthenticationError / openai.AuthenticationError**
ANTHROPIC_API_KEY / OPENAI_API_KEY is missing or invalid. Check echo $ANTHROPIC_API_KEY. These are only read when extraction actually runs; imports and tests never require them.

**httpx.ConnectError when using extractor="ollama"**
Ollama isn't running or the model isn't pulled. Run ollama serve and ollama pull llama3.2 (or whichever model you passed).

**warnings contains sparse_document**
The PDF had very little extractable text, almost certainly a scan. Enable vision fallback: vision_extractor=VisionExtractor().

**warnings contains vision_fallback_failed**
The vision path was attempted and raised. Check the reason field; common causes are render_failed (pypdfium2 can't rasterize the PDF), api_error (network/auth issue with the vision model), or extraction_failed (LLM returned an unparseable response). The pipeline keeps the text result when this happens.

**Box values come back in the wrong fields on tax slips**
Form-heavy documents rely on 2D layout that text extraction flattens. Enable vision_extractor=VisionExtractor(); the vision model reads the image directly and preserves column alignment.

**First run is slow / downloads lots of data**
Docling pulls ~500MB of layout models from HuggingFace on first use. They are cached locally after that.


Built-in schemas

| Schema | Document | Source |
|---|---|---|
| `schemas.CRA_T4` | T4 Statement of Remuneration Paid | CRA XML spec |
| `schemas.CRA_T5` | T5 Statement of Investment Income | CRA XML spec |
| `schemas.CRA_T4A` | T4A Pension, Retirement, Annuity | CRA XML spec |
| `schemas.CRA_NR4` | NR4 Non-Resident Income | CRA XML spec |
| `schemas.BANK_STATEMENT` | Generic Canadian bank statement | Community |

Each schema is a versioned YAML file inside the package, updated annually when CRA publishes new XML specifications.


Adding a schema

Every schema is a YAML file. To add a new Canadian document type, create the file and register it with one line.

```yaml
# finlit/schemas/cra/t2202.yaml
name: cra_t2202
version: "2024"
document_type: "CRA T2202 Tuition and Enrolment Certificate"
description: >
  Issued by post-secondary institutions to report eligible tuition
  and months of enrolment.

fields:
  - name: institution_name
    dtype: str
    required: true
    description: "Name of the post-secondary institution"

  - name: student_sin
    dtype: str
    required: true
    pii: true
    regex: '^\d{3}-\d{3}-\d{3}$'
    description: "Student's Social Insurance Number"

  - name: eligible_tuition_fees
    dtype: float
    required: true
    description: "Box 1: Total eligible tuition fees paid"

  - name: full_time_months
    dtype: int
    required: false
    description: "Number of months enrolled full-time"
```

```python
# finlit/schemas/__init__.py - add one line
CRA_T2202 = _load("cra/t2202.yaml")
```

Schema contributions are the most useful PRs this project gets. If you know the document, the YAML is the easy part.


Compared to alternatives

| | FinLit | LlamaParse | Docling alone | Textract |
|---|---|---|---|---|
| Canadian document schemas | ✅ | ✗ | ✗ | ✗ |
| Runs on-premises | ✅ | ✗ (SaaS only) | ✅ | ✗ (AWS only) |
| Confidence per field | ✅ | Partial | ✗ | Partial |
| Source traceability | ✅ | Partial | ✗ | Partial |
| PIPEDA PII detection | ✅ | ✗ | ✗ | ✗ |
| Audit log | ✅ | ✗ | ✗ | ✗ |
| Custom schemas | ✅ | ✗ | ✗ | ✗ |
| Vision fallback for scans | ✅ | Partial | ✗ | ✅ |
| Open-source | ✅ | ✗ | ✅ | ✗ |

Roadmap

  • Core extraction pipeline (Docling + pydantic-ai)
  • CRA schema registry (T4, T5, T4A, NR4)
  • Source traceability and audit log
  • PIPEDA PII detection: SIN, CRA BNs, postal codes
  • CLI
  • OCR auto-fallback for image-only PDFs (v0.2)
  • Document-level warnings for sparse and missing-required-field results (v0.2)
  • Vision extraction fallback: Claude, OpenAI, Gemini, or local OSS via Ollama (v0.3)
  • SEDAR filing schemas (MD&A, AIF, financial statements)
  • Bank statement schemas (RBC, TD, Scotiabank, BMO, CIBC)
  • Accuracy benchmarks per schema
  • LangChain reader integration
  • LlamaIndex reader integration
  • MCP tool definitions for agentic workflows
  • French CRA form support

Contributing

Open issues and PRs are welcome. If you work in a regulated Canadian industry and need a document type that is not yet here, open an issue with the document name and the fields you need.

See CONTRIBUTING.md for dev setup.


License

Apache 2.0. See LICENSE.


Built by Caseonix · Waterloo, Ontario 🍁

FinLit is the extraction engine inside LocalMind Sovereign, Caseonix's document intelligence platform for Canadian regulated industries.
