Canadian Financial Document Intelligence Framework
FinLit
Extract structured data from Canadian financial documents (T4s, T5s, SEDAR filings, bank statements) with a compliance audit trail built in.
pip install finlit
python -m spacy download en_core_web_lg # one-time, required by Presidio
export ANTHROPIC_API_KEY=sk-ant-...
from finlit import DocumentPipeline, schemas
result = DocumentPipeline(schema=schemas.CRA_T4, extractor="claude").run("t4_2024.pdf")
print(result.fields["box_14_employment_income"]) # → 87500.0
print(result.confidence["box_14_employment_income"]) # → 0.97
print(result.needs_review) # → False
Who this is for:
- Canadian fintechs processing user-uploaded T-slips into structured data
- Banks and credit unions running SEDAR filing and statement pipelines
- Accounting and tax software pre-filling CRA forms from client documents
- Any team that needs on-premises extraction with a PIPEDA/OSFI-friendly audit trail
Not a developer? See docs/use-cases.md for business context, compliance framing, and "build vs. buy" math.
In a hurry? Read the 5-minute Quickstart: install, pick a backend, extract your first T4. Come back here when you need the reference.
Contents
- Quickstart
- Why FinLit
- Setup
- Usage
- CLI
- API reference
- Troubleshooting
- Built-in schemas
- Adding a schema
- Compared to alternatives
- Roadmap
- Contributing
Why FinLit
General-purpose extraction tools parse PDFs fine but don't know what a T4 box is, what fields CRA requires, or what a Canadian SIN looks like. FinLit is the Canadian-document layer (pre-built, open-source, on-premises) that you'd otherwise write yourself: versioned CRA schemas, per-field confidence, source traceability, PIPEDA PII detection, and an immutable audit log.
It wraps Docling (IBM's parser) and pydantic-ai (model-agnostic LLM orchestration). It runs entirely inside your infrastructure; with extractor="ollama", even LLM calls stay on-prem, which suits OSFI-regulated and air-gapped deployments.
Setup
Install:
pip install finlit
python -m spacy download en_core_web_lg
The spaCy model is required by Presidio for PII detection. Skipping it will raise an OSError on first pipeline run.
Pick an extractor backend and set one of:
| Backend | Env / setup | Extractor string |
|---|---|---|
| Anthropic Claude (default) | export ANTHROPIC_API_KEY=... | "claude" |
| OpenAI | export OPENAI_API_KEY=... | "openai" |
| Local Ollama | Install Ollama · ollama pull llama3.2 | "ollama" |
Docling pulls its layout models from HuggingFace on first run (~500MB, cached afterwards).
Usage
Extract a T4
from finlit import DocumentPipeline, schemas
pipeline = DocumentPipeline(
    schema=schemas.CRA_T4,
    extractor="claude",  # or "openai" or "ollama"
    audit=True,
    review_threshold=0.85,
)
result = pipeline.run("john_doe_t4_2024.pdf")
# Typed, validated fields
print(result.fields["box_14_employment_income"]) # → 87500.0
print(result.fields["province_of_employment"]) # → "ON"
# Per-field confidence: box_52 came back at 71%, below the 0.85 threshold
print(result.confidence["box_52_pension_adjustment"]) # → 0.71
print(result.needs_review) # → True
print(result.review_fields)
# [{"field": "box_52_pension_adjustment", "confidence": 0.71, "raw": "..."}]
# Trace any value back to its location in the source PDF
print(result.source_ref["box_14_employment_income"])
# {"page": 1, "bbox": [120, 340, 280, 360], "doc": "john_doe_t4_2024.pdf"}
# Audit log: append-only, finalized at end of run
for ev in result.audit_log:
    print(ev["event"], ev.get("ts"))
# document_loaded ...
# pii_detected ...
# extraction_complete ...
# review_flagged ...
# pipeline_complete ...
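The review flag in the example above comes down to comparing each field's confidence against review_threshold. The following is a minimal standalone sketch of that rule, not FinLit's actual implementation; the function name flag_for_review is hypothetical, and the strict "below the threshold" comparison is taken from the CLI flag description.

```python
from typing import Dict, List


def flag_for_review(confidence: Dict[str, float], threshold: float = 0.85) -> List[dict]:
    """Return one entry per field whose confidence is strictly below the threshold.

    Hypothetical helper for illustration; FinLit's internals may differ.
    """
    return [
        {"field": name, "confidence": score}
        for name, score in confidence.items()
        if score < threshold
    ]


# Illustrative values mirroring the T4 example.
confidence = {
    "box_14_employment_income": 0.97,
    "box_52_pension_adjustment": 0.71,
}
review_fields = flag_for_review(confidence, threshold=0.85)
needs_review = bool(review_fields)

print(review_fields)
print(needs_review)
```

A field that lands exactly on the threshold is not flagged under this rule, so raising review_threshold is the way to make review stricter.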
Batch processing
from finlit import BatchPipeline, schemas
from glob import glob
batch = BatchPipeline(schema=schemas.CRA_T4, extractor="claude", workers=8)
for path in glob("uploads/*.pdf"):
    batch.add(path)
results = batch.run()
results.export_csv("extracted/t4s_2024.csv")
print(f"Processed: {results.total}")
print(f"Needs review: {results.review_count}")
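For readers wondering what a workers=8 setting typically implies, here is a rough sketch of the fan-out pattern using only the standard library. It assumes a thread pool, which the source does not confirm; run_batch and fake_extract are hypothetical names, and fake_extract stands in for a real extraction call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_batch(
    paths: List[str],
    extract_one: Callable[[str], Dict],
    workers: int = 8,
) -> List[Dict]:
    """Apply extract_one to every path using a pool of worker threads.

    executor.map preserves input order, so results[i] belongs to paths[i].
    """
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(extract_one, paths))


def fake_extract(path: str) -> Dict:
    # Hypothetical stand-in: pretend that scanned files always need review.
    return {"path": path, "needs_review": path.endswith("_scan.pdf")}


results = run_batch(["a.pdf", "b_scan.pdf", "c.pdf"], fake_extract, workers=2)
total = len(results)
review_count = sum(1 for r in results if r["needs_review"])

print(f"Processed: {total}")
print(f"Needs review: {review_count}")
```

Threads suit this workload when most of the time is spent waiting on LLM API calls; CPU-bound parsing may benefit more from processes.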
Vision fallback for scans and forms
Text extraction fails in two cases: image-only PDFs with no text layer, and form-heavy documents (tax slips, invoices) where 2D column alignment carries meaning. For both, enable the vision fallback, which sends rendered page images to any multimodal LLM.
from finlit import DocumentPipeline, VisionExtractor, schemas
pipeline = DocumentPipeline(
    schema=schemas.CRA_T5,
    extractor="claude",  # text path (cheap, fast)
    vision_extractor=VisionExtractor(),  # vision fallback (accurate)
)
result = pipeline.run("t5_scanned.pdf")
print(result.extraction_path) # → "text" or "vision"
By default the vision extractor runs only when the text result has needs_review=True. Pass a custom callback for finer control:
pipeline = DocumentPipeline(
    schema=schemas.CRA_T5,
    extractor="claude",
    vision_extractor=VisionExtractor(model="openai:gpt-4o"),
    vision_fallback_when=lambda r: any(c < 0.80 for c in r.confidence.values()),
)
Vision results replace the text result entirely: result.fields is whatever the vision extractor returned, and result.extraction_path == "vision". If vision fails for any reason (render error, API failure, LLM error), the pipeline keeps the text result and logs a vision_fallback_failed warning.
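The fallback behaviour described above can be sketched as plain control flow. This is an illustrative sketch only, assuming results are simple dicts; extract_with_fallback and the stub extractors are hypothetical and do not reflect FinLit's internal code.

```python
from typing import Callable, Dict, Optional


def extract_with_fallback(
    run_text: Callable[[], Dict],
    run_vision: Callable[[], Dict],
    fallback_when: Optional[Callable[[Dict], bool]] = None,
) -> Dict:
    """Run the text path first, then the vision path only if the rule says so."""
    text_result = run_text()
    text_result["extraction_path"] = "text"
    # Default rule: fall back only when the text result needs review.
    should_fallback = fallback_when or (lambda r: r["needs_review"])
    if not should_fallback(text_result):
        return text_result
    try:
        vision_result = run_vision()
    except Exception as exc:
        # Keep the text result and attach a warning instead of raising.
        text_result.setdefault("warnings", []).append(
            {"code": "vision_fallback_failed", "reason": str(exc)}
        )
        return text_result
    # Vision output replaces the text output entirely.
    vision_result["extraction_path"] = "vision"
    return vision_result


# Hypothetical stub extractors for demonstration.
def text_ok() -> Dict:
    return {"fields": {"amount": 10.0}, "needs_review": False}


def text_weak() -> Dict:
    return {"fields": {"amount": None}, "needs_review": True}


def vision_ok() -> Dict:
    return {"fields": {"amount": 12.5}, "needs_review": False}


def vision_broken() -> Dict:
    raise RuntimeError("render_failed")


kept_text = extract_with_fallback(text_ok, vision_ok)
used_vision = extract_with_fallback(text_weak, vision_ok)
vision_failed = extract_with_fallback(text_weak, vision_broken)
print(kept_text["extraction_path"], used_vision["extraction_path"], vision_failed["extraction_path"])
```

Because the vision result replaces the text result wholesale, no field-level merging is needed, which keeps the provenance of each value unambiguous.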
Fully local with Ollama
No API keys, no external network: suitable for air-gapped and OSFI-regulated deployments. Text and vision can each be local independently.
pipeline = DocumentPipeline(
    schema=schemas.CRA_T5,
    extractor="ollama:llama3.2",
    vision_extractor=VisionExtractor(model="ollama:qwen2.5vl:7b"),
)
Vision models verified against CRA slips:
| Model | Size | Ollama tag | Notes |
|---|---|---|---|
| Qwen2.5-VL | 7B | ollama:qwen2.5vl:7b | Strongest on form/document tasks |
| Llama 3.2 Vision | 11B | ollama:llama3.2-vision | General-purpose, Meta |
| MiniCPM-V | 8B | ollama:minicpm-v | Fast, OpenBMB |
Any pydantic-ai-compatible multimodal model works; these are the ones explicitly tested.
Custom schemas
from finlit import DocumentPipeline, Schema, Field
loan_schema = Schema(
    name="internal_loan_application",
    fields=[
        Field("applicant_name", dtype=str, required=True),
        Field("gross_income", dtype=float, required=True),
        Field("sin_number", dtype=str, pii=True),
        Field("loan_amount", dtype=float, required=True),
    ],
)
result = DocumentPipeline(schema=loan_schema, extractor="claude").run("loan_app.pdf")
Error handling
FinLit does not raise on low-confidence fields; those go into result.review_fields. It does attach structured warnings for document-level problems (sparse OCR, missing required fields, vision fallback failure).
result = pipeline.run("t4.pdf")
if result.needs_review:
    for flagged in result.review_fields:
        queue_for_human_review(flagged)  # your own review-queue function
for warning in result.warnings:
    if warning["code"] == "sparse_document":
        # PDF had very little extractable text; likely a scan
        ...
    elif warning["code"] == "vision_fallback_failed":
        # Vision path was tried and failed; we kept the text result
        log.warning(warning["reason"])  # log is a standard logging.Logger
Common warning codes:
| Code | Meaning |
|---|---|
| sparse_document | Extracted text is very short; likely an image-only PDF |
| missing_required_fields | One or more required=True fields came back empty |
| vision_fallback_failed | Vision path was attempted and failed; text result retained |
| pii_detected | Presidio found PII entities in the source text |
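One way to act on these codes is a small triage function that turns needs_review and the warning list into a single next step. The routing policy below is an assumption made for illustration (for example, treating pii_detected as informational); next_action is a hypothetical helper, not part of FinLit.

```python
from typing import Dict, List


def next_action(needs_review: bool, warnings: List[Dict]) -> str:
    """Map a result's review flag and warning codes to one follow-up action."""
    codes = {w["code"] for w in warnings}
    # A sparse document is worth retrying with vision unless vision already failed.
    if "sparse_document" in codes and "vision_fallback_failed" not in codes:
        return "retry_with_vision"
    if (
        needs_review
        or "missing_required_fields" in codes
        or "vision_fallback_failed" in codes
    ):
        return "human_review"
    # Remaining codes, such as pii_detected, are treated as informational here.
    return "accept"


print(next_action(False, [{"code": "pii_detected"}]))
print(next_action(False, [{"code": "sparse_document"}]))
print(next_action(True, []))
```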
LangChain integration
FinLit ships a LangChain BaseLoader so you can drop extracted Canadian
financial documents straight into RAG pipelines, retrievers, and agents.
Install the extra:
pip install finlit[langchain]
Load one file:
from finlit.integrations.langchain import FinLitLoader
docs = FinLitLoader("t4.pdf", schema="cra.t4").load()
doc = docs[0]
print(doc.metadata["finlit_fields"]["employer_name"]) # "Acme Corp"
print(doc.metadata["finlit_needs_review"]) # False
Batch load with compliance-friendly error surfacing:
loader = FinLitLoader(
    ["t4_001.pdf", "t4_002.pdf", "t4_003.pdf"],
    schema="cra.t4",
    on_error="include",  # failures become Documents with finlit_error
)
docs = loader.load()
# Filter out failures before embedding: empty page_content breaks most embedders.
good = [d for d in docs if not d.metadata.get("finlit_error")]
Access the underlying ExtractionResult objects via loader.last_results
(same order as the input paths, with None for skipped/included failures).
MCP server
Expose FinLit as a Model Context Protocol server so any MCP-compatible host (Claude Desktop, Claude Code, Cursor, custom agents) can extract documents through tool calls, with no Python glue.
Install the extra:
pip install finlit[mcp]
Run the server (two equivalent ways):
# Human-facing
finlit mcp serve --extractor claude
# Claude Desktop mcpServers config
python -m finlit.integrations.mcp
Claude Desktop config example:
{
  "mcpServers": {
    "finlit": {
      "command": "python",
      "args": ["-m", "finlit.integrations.mcp"],
      "env": {
        "ANTHROPIC_API_KEY": "...",
        "FINLIT_EXTRACTOR": "claude",
        "FINLIT_PII_MODE": "redact"
      }
    }
  }
}
Tools exposed:
- list_schemas(): discover the built-in CRA / banking schemas
- extract_document(path, schema, ...): extract one document
- batch_extract(paths, schema, ...): extract many in parallel
- detect_pii(text, ...): standalone Presidio + Canadian recognizers
PII fields (per schema annotation) are redacted in tool responses by
default, which fits the chat-transcript trust model. Pass
redact_pii=false per call, or start with --pii-mode raw, to opt out.
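Redaction by schema annotation can be pictured as masking every field whose schema entry has pii=True before the response leaves the server. The sketch below is a simplified illustration under that assumption; redact_fields and the "[REDACTED]" placeholder are hypothetical and may not match what the MCP server actually returns.

```python
from typing import Any, Dict, Set

REDACTED = "[REDACTED]"  # hypothetical placeholder text


def redact_fields(
    fields: Dict[str, Any],
    pii_field_names: Set[str],
    redact_pii: bool = True,
) -> Dict[str, Any]:
    """Return a copy of fields with PII-annotated values masked.

    The input dict is never modified, so the raw values stay available
    to code running inside the trust boundary.
    """
    if not redact_pii:
        return dict(fields)
    return {
        name: REDACTED if name in pii_field_names and value is not None else value
        for name, value in fields.items()
    }


# Field names borrowed from the custom loan schema example; values are made up.
demo_fields = {"applicant_name": "Jane Doe", "sin_number": "000-000-000", "loan_amount": 25000.0}
pii_names = {"sin_number"}

safe = redact_fields(demo_fields, pii_names)
raw = redact_fields(demo_fields, pii_names, redact_pii=False)
print(safe)
```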
CLI
finlit extract t4_2024.pdf --schema cra.t4 --extractor claude
finlit extract t4_2024.pdf --schema cra.t4 --output json
finlit extract t5_scan.pdf --schema cra.t5 --extractor claude \
--vision-extractor claude
finlit schema list
Flags:
| Flag | Default | Description |
|---|---|---|
| --schema | required | Schema name (cra.t4, cra.t5, …) |
| --extractor | claude | Text extractor: claude, openai, ollama, or a pydantic-ai model string |
| --vision-extractor | none | Enable vision fallback. Accepts claude/openai/ollama or a full model string like ollama:qwen2.5vl:7b |
| --output | table | Output format: table, json, csv |
| --review-threshold | 0.85 | Confidence below which a field is flagged for review |
API reference
DocumentPipeline
DocumentPipeline(
    schema: Schema,
    extractor: str | BaseExtractor = "claude",
    model: str | None = None,
    vision_extractor: BaseVisionExtractor | None = None,
    vision_fallback_when: Callable[[ExtractionResult], bool] | None = None,
    audit: bool = True,
    review_threshold: float = 0.85,
)
run(path: str | Path) -> ExtractionResult parses, extracts, validates, and audits a document. It never raises on low confidence; inspect .needs_review and .warnings instead.
ExtractionResult
result.fields # dict[str, Any]: typed, validated values
result.confidence # dict[str, float]: 0.0 to 1.0 per field
result.source_ref # dict[str, dict]: {page, bbox, doc} per field
result.pii_entities # list[dict]: Presidio detections
result.audit_log # list[dict]: immutable event log
result.review_fields # list[dict]: fields below threshold
result.needs_review # bool
result.warnings # list[dict]: document-level warnings
result.extracted_field_count # int
result.extraction_path # "text" | "vision"
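To make "append-only" and "immutable" concrete, here is one common way such a log can be built: each event stores the hash of the previous one, and the log is frozen at the end of the run. This hash-chain design is an assumption chosen for illustration; the source does not say how FinLit implements its audit log, and the AuditLog class and verify function are hypothetical.

```python
import hashlib
import json
from typing import Dict, Iterable, List, Tuple


def _digest(body: Dict) -> str:
    # Stable serialization so the same event always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only event log that refuses writes once finalized."""

    def __init__(self) -> None:
        self._events: List[Dict] = []
        self._finalized = False

    def append(self, event: str, **details) -> None:
        if self._finalized:
            raise RuntimeError("audit log is finalized")
        prev_hash = self._events[-1]["hash"] if self._events else ""
        body = {"event": event, "details": details, "prev_hash": prev_hash}
        self._events.append({**body, "hash": _digest(body)})

    def finalize(self) -> Tuple[Dict, ...]:
        # A tuple blocks appends and reordering; the hash chain detects edits
        # to individual events.
        self._finalized = True
        return tuple(self._events)


def verify(events: Iterable[Dict]) -> bool:
    """Recompute the chain and confirm that no event was altered or reordered."""
    prev_hash = ""
    for ev in events:
        body = {"event": ev["event"], "details": ev["details"], "prev_hash": ev["prev_hash"]}
        if ev["prev_hash"] != prev_hash or ev["hash"] != _digest(body):
            return False
        prev_hash = ev["hash"]
    return True


log = AuditLog()
log.append("document_loaded", doc="t4.pdf")
log.append("extraction_complete", field_count=12)
log.append("pipeline_complete")
events = log.finalize()
print(verify(events))
```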
Schema and Field
Schema(name: str, fields: list[Field], version: str = "1")
Field(
    name: str,
    dtype: type,  # str, int, float, bool, date
    required: bool = False,
    pii: bool = False,  # annotation only; not auto-redacted
    regex: str | None = None,
    description: str = "",
)
Extractor strings
extractor= accepts the shorthands "claude", "openai", "ollama", any full pydantic-ai model string ("anthropic:claude-sonnet-4-6", "ollama:llama3.2"), or your own BaseExtractor instance. vision_extractor= takes a VisionExtractor(model=...) or any BaseVisionExtractor subclass.
from finlit import BaseVisionExtractor, DocumentPipeline, schemas
from finlit.extractors import BaseExtractor
class MyTextExtractor(BaseExtractor):
    def extract(self, text, schema): ...
class MyVisionExtractor(BaseVisionExtractor):
    def extract(self, images, schema, text=""): ...
DocumentPipeline(
    schema=schemas.CRA_T4,
    extractor=MyTextExtractor(),
    vision_extractor=MyVisionExtractor(),
)
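Full model strings follow a provider:model shape, and the model part may itself contain colons (as in ollama:qwen2.5vl:7b). A minimal sketch of splitting such strings, assuming only the first colon separates provider from model; split_model_string is a hypothetical helper, not a FinLit or pydantic-ai function.

```python
from typing import Optional, Tuple


def split_model_string(value: str) -> Tuple[str, Optional[str]]:
    """Split 'provider:model' on the first colon only.

    A bare shorthand such as 'claude' has no model part, so None is returned
    and the choice of default model is left to the caller.
    """
    provider, separator, model = value.partition(":")
    if not separator:
        return provider, None
    return provider, model


print(split_model_string("claude"))
print(split_model_string("anthropic:claude-sonnet-4-6"))
print(split_model_string("ollama:qwen2.5vl:7b"))
```

Splitting on the first colon only matters for Ollama tags, where the part after the second colon is the model size or variant.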
Troubleshooting
OSError: [E050] Can't find model 'en_core_web_lg'
Presidio needs the spaCy model. Run python -m spacy download en_core_web_lg once after install.
anthropic.AuthenticationError / openai.AuthenticationError
ANTHROPIC_API_KEY / OPENAI_API_KEY is missing or invalid. Check echo $ANTHROPIC_API_KEY. These are only read when extraction actually runs; imports and tests never require them.
httpx.ConnectError when using extractor="ollama"
Ollama isn't running or the model isn't pulled. Run ollama serve and ollama pull llama3.2 (or whichever model you passed).
warnings contains sparse_document
The PDF had very little extractable text, so it is almost certainly a scan. Enable vision fallback: vision_extractor=VisionExtractor().
warnings contains vision_fallback_failed
The vision path was attempted and raised. Check the reason field; common causes are render_failed (pypdfium2 can't rasterize the PDF), api_error (network/auth issue with the vision model), or extraction_failed (LLM returned an unparseable response). The pipeline keeps the text result when this happens.
Box values come back in the wrong fields on tax slips
Form-heavy documents rely on 2D layout that text extraction flattens. Enable vision_extractor=VisionExtractor(); the vision model reads the image directly and preserves column alignment.
First run is slow / downloads lots of data
Docling pulls ~500MB of layout models from HuggingFace on first use. They are cached locally after that.
Built-in schemas
| Schema | Document | Source |
|---|---|---|
| schemas.CRA_T4 | T4 Statement of Remuneration Paid | CRA XML spec |
| schemas.CRA_T5 | T5 Statement of Investment Income | CRA XML spec |
| schemas.CRA_T4A | T4A Pension, Retirement, Annuity | CRA XML spec |
| schemas.CRA_NR4 | NR4 Non-Resident Income | CRA XML spec |
| schemas.BANK_STATEMENT | Generic Canadian bank statement | Community |
Each schema is a versioned YAML file inside the package, updated annually when CRA publishes new XML specifications.
Adding a schema
Every schema is a YAML file. To add a new Canadian document type, create the file and register it with one line.
# finlit/schemas/cra/t2202.yaml
name: cra_t2202
version: "2024"
document_type: "CRA T2202 Tuition and Enrolment Certificate"
description: >
  Issued by post-secondary institutions to report eligible tuition
  and months of enrolment.
fields:
  - name: institution_name
    dtype: str
    required: true
    description: "Name of the post-secondary institution"
  - name: student_sin
    dtype: str
    required: true
    pii: true
    regex: '^\d{3}-\d{3}-\d{3}$'
    description: "Student's Social Insurance Number"
  - name: eligible_tuition_fees
    dtype: float
    required: true
    description: "Box 1: Total eligible tuition fees paid"
  - name: full_time_months
    dtype: int
    required: false
    description: "Number of months enrolled full-time"
# finlit/schemas/__init__.py: add one line
CRA_T2202 = _load("cra/t2202.yaml")
Schema contributions are the most useful PRs this project gets. If you know the document, the YAML is the easy part.
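When writing a regex for a field like student_sin, it helps to test the pattern against sample values before opening a PR. The sketch below checks the same format pattern and then, as an optional extra step that the schema above does not itself perform, applies the Luhn checksum that Social Insurance Numbers satisfy. The helper names are hypothetical, and 046-454-286 is used only as a specimen value that passes the checksum.

```python
import re

# Same format pattern as the student_sin field: three dash-separated groups of three digits.
SIN_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{3}$")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # equivalent to adding the two digits of the product
        total += digit
    return total % 10 == 0


def check_sin(value: str) -> bool:
    """Format check first, then checksum. Hypothetical helper for illustration."""
    if not SIN_PATTERN.match(value):
        return False
    return luhn_valid(value.replace("-", ""))


print(check_sin("046-454-286"))
print(check_sin("046 454 286"))
print(check_sin("123-456-789"))
```

The regex alone accepts any nine digits in the right shape, so a checksum step like this can catch OCR digit errors that a format check would let through.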
Compared to alternatives
| | FinLit | LlamaParse | Docling alone | Textract |
|---|---|---|---|---|
| Canadian document schemas | โ | โ | โ | โ |
| Runs on-premises | โ | โ SaaS only | โ | โ AWS only |
| Confidence per field | โ | Partial | โ | Partial |
| Source traceability | โ | Partial | โ | Partial |
| PIPEDA PII detection | โ | โ | โ | โ |
| Audit log | โ | โ | โ | โ |
| Custom schemas | โ | โ | โ | โ |
| Vision fallback for scans | โ | Partial | โ | โ |
| Open-source | โ | โ | โ | โ |
Roadmap
- Core extraction pipeline (Docling + pydantic-ai)
- CRA schema registry (T4, T5, T4A, NR4)
- Source traceability and audit log
- PIPEDA PII detection: SIN, CRA BNs, postal codes
- CLI
- OCR auto-fallback for image-only PDFs (v0.2)
- Document-level warnings for sparse and missing-required-field results (v0.2)
- Vision extraction fallback: Claude, OpenAI, Gemini, or local OSS via Ollama (v0.3)
- SEDAR filing schemas (MD&A, AIF, financial statements)
- Bank statement schemas (RBC, TD, Scotiabank, BMO, CIBC)
- Accuracy benchmarks per schema
- LangChain reader integration
- LlamaIndex reader integration
- MCP tool definitions for agentic workflows
- French CRA form support
Contributing
Open issues and PRs are welcome. If you work in a regulated Canadian industry and need a document type that is not yet here, open an issue with the document name and the fields you need.
See CONTRIBUTING.md for dev setup.
License
Apache 2.0. See LICENSE.
Built by Caseonix · Waterloo, Ontario
FinLit is the extraction engine inside LocalMind Sovereign, Caseonix's document intelligence platform for Canadian regulated industries.