AI-powered financial document parsing SDK. Extract structured data from financial statements, bank statements, invoices, and more.

fin-doc-parser

AI-powered financial document parsing SDK

Extract structured JSON from financial statements, bank statements, invoices, business licenses, and more.

Why fin-doc-parser?

Financial documents are messy — scanned PDFs, inconsistent Excel formats, images of licenses. Extracting structured data from them typically requires weeks of custom code.

fin-doc-parser solves this in 3 lines:

from findocparser import parse

result = parse("财务报表2024.pdf", doc_type="financial_statement")
print(result["data"]["balance_sheet"]["total_assets"])  # 125000000.0

Features

13 document types — financial statements, bank statements, business licenses, audit reports, credit reports, shareholder info, financial notes, MD&A, guarantees, equity changes, tax invoices, and more
Pluggable OCR — PaddleOCR (local, free), Prismer (GPU service), or text-only extraction
Pluggable LLM — DeepSeek, OpenAI, or any OpenAI-compatible API (Ollama, vLLM, etc.)
Bring your own client — pass a pre-configured LLMClient instance directly
Excel support — xlsx, xls, csv with automatic markdown conversion
Auto-detection — file type and document type detected from filename and content
Generic fallback — unknown document types get a best-effort extraction
Multi-period comparison — compare_periods() computes period-over-period changes with significant change detection
Async-first — parse_async() for high-throughput pipelines
Minimal core — only httpx required; OCR, Excel, PDF are optional

Quick Start

Install

pip install fin-doc-parser

# With Excel support (xlsx/xls)
pip install "fin-doc-parser[excel]"

# With PDF text extraction (PyMuPDF)
pip install "fin-doc-parser[pdf]"

# With local OCR (PaddleOCR, no external service)
pip install "fin-doc-parser[ocr]"

# Everything
pip install "fin-doc-parser[all]"

Set API key

# Pick one:
export DEEPSEEK_API_KEY="sk-..."    # Recommended (cheap + good at Chinese)
export OPENAI_API_KEY="sk-..."       # Also works

Parse a document

from findocparser import parse

# Financial statement (PDF or image)
result = parse("资产负债表2024.pdf")
balance_sheet = result["data"]["balance_sheet"]
print(f"Total assets: {balance_sheet['total_assets']}")
print(f"Total liabilities: {balance_sheet['total_liabilities']}")

# Bank statement
result = parse("银行流水_2024.pdf")
for txn in result["data"]["transactions"][:5]:
    print(f"{txn['date']}  {txn['counterparty']}  {txn['amount']}")

# Business license (image)
result = parse("营业执照.jpg")
print(f"Company: {result['data']['company_name']}")
print(f"Credit code: {result['data']['unified_social_credit_code']}")

# Excel file
result = parse("固定资产清单.xlsx", doc_type="fixed_asset")

# Auto-detect document type
result = parse("some_unknown_document.pdf")
print(f"Detected type: {result['doc_type']}")
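
Extraction is LLM-based and the generic fallback is described as best-effort, so a given result may lack some of the keys indexed above. The sketch below is one defensive way to read nested fields. `get_path` is a hypothetical helper, not part of fin-doc-parser, and `sample` is a hand-written stand-in for a `parse()` result.

```python
from typing import Any


def get_path(result: dict, dotted: str, default: Any = None) -> Any:
    """Walk a dotted key path through nested dicts; return default if a key is missing."""
    node: Any = result
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Hand-written stand-in for a parse() result; real keys depend on the document.
sample = {
    "doc_type": "financial_statement",
    "data": {"balance_sheet": {"total_assets": 125000000.0}},
}

print(get_path(sample, "data.balance_sheet.total_assets"))       # 125000000.0
print(get_path(sample, "data.balance_sheet.total_liabilities"))  # None
```

With a real result, `get_path(result, "data.company_name", default="")` would avoid a `KeyError` when the field was not extracted.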

Async usage

import asyncio
from findocparser import parse_async

async def main():
    result = await parse_async("report.pdf", llm_provider="deepseek")
    print(result["data"])

asyncio.run(main())
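
For batches, a common pattern is to cap the number of in-flight requests so the LLM API is not flooded. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`. `parse_many`, `max_concurrency`, and `fake_parse` are illustrative names, not part of the SDK; `fake_parse` stands in for `parse_async` so the example runs without an API key.

```python
import asyncio
from typing import Awaitable, Callable


async def parse_many(
    paths: list[str],
    parse_one: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 4,
) -> list[dict]:
    """Parse paths concurrently, at most max_concurrency at a time, keeping input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str) -> dict:
        # Only max_concurrency coroutines can hold the semaphore at once.
        async with semaphore:
            return await parse_one(path)

    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_parse(path: str) -> dict:
    """Stand-in for parse_async; returns a minimal result-shaped dict."""
    await asyncio.sleep(0.01)
    return {"file_name": path, "data": {}}


results = asyncio.run(parse_many(["a.pdf", "b.pdf", "c.pdf"], fake_parse, max_concurrency=2))
print([r["file_name"] for r in results])  # ['a.pdf', 'b.pdf', 'c.pdf']
```

In real use, `parse_async` (or a small wrapper that fixes its keyword arguments) would be passed as `parse_one`.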

Custom LLM endpoint

from findocparser import parse, OpenAIClient

# Option 1: Pass config through parse()
result = parse(
    "report.pdf",
    llm_base_url="http://localhost:11434/v1",  # Ollama
    llm_api_key="ollama",
    llm_model="qwen2.5:14b",
)

# Option 2: Bring your own LLM client
client = OpenAIClient(
    provider="openai",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
    model="qwen2.5:14b",
)
result = parse("report.pdf", llm_client=client)

Quick Start

Install

pip install fin-doc-parser

# With Excel support
pip install "fin-doc-parser[excel]"

# With PDF text extraction
pip install "fin-doc-parser[pdf]"

# With local OCR (no external service required)
pip install "fin-doc-parser[ocr]"

Configure API key

export DEEPSEEK_API_KEY="sk-..."    # Recommended (cheap + strong on Chinese)

Parse a document

from findocparser import parse

# Parse a financial statement in one line
result = parse("资产负债表2024.pdf")
print(result["data"]["balance_sheet"]["total_assets"])

# Parse a bank statement
result = parse("银行流水.pdf")
print(result["data"]["transactions"])

# Parse a business license (image)
result = parse("营业执照.jpg")
print(result["data"]["company_name"])

# Custom LLM endpoint (e.g., Ollama)
result = parse(
    "report.pdf",
    llm_base_url="http://localhost:11434/v1",
    llm_api_key="ollama",
    llm_model="qwen2.5:14b",
)

Multi-period comparison

from findocparser import parse, compare_periods

# Parse financial statements for two periods
r2023 = parse("财务报表2023.pdf")
r2024 = parse("财务报表2024.pdf")

# Compute year-over-year changes automatically
diff = compare_periods([r2023, r2024])

# Inspect the change in total assets
assets = diff["comparisons"][0]["balance_sheet"]["total_assets"]
print(f"Total assets change: {assets['change_pct']:+.1f}%")  # +25.0%

# List significant changes (default threshold: ±20%)
for item in diff["significant_changes"]:
    print(f"{item['field']}: {item['change_pct']:+.1f}%")

# Three-period trend analysis
r2022 = parse("财务报表2022.pdf")
diff = compare_periods([r2022, r2023, r2024])  # returns 2 period-over-period comparisons

Supported Document Types

Document Type	`doc_type`	Input Formats	Output
Financial Statement	`financial_statement`	PDF, image, Excel	Balance sheet, income statement, cash flow
Bank Statement	`bank_statement`	PDF, image	Transaction list with counterparty & amounts
Business License	`business_license`	PDF, image	Company name, credit code, legal rep, scope
Audit Report	`audit_report`	PDF	Opinion type, going concern, key audit matters, signatories
Credit Report	`credit_report`	PDF	Credit lines, overdue records, utilization
Shareholder Info	`shareholder_info`	PDF, image	Shareholder names, ratios, capital
Financial Notes	`financial_notes`	PDF	Accounting policies, related party txns, contingent liabilities
MD&A	`md_and_a`	PDF	Business overview, operating results, risk factors, outlook
Guarantee Disclosure	`guarantee`	PDF	Guarantee summary, details, non-compliant guarantees
Equity Changes Stmt	`equity_changes_stmt`	PDF	Opening/closing balance, changes, profit distribution
Tax Invoice	`tax_invoice`	PDF, image, Excel	Invoice items, amounts, tax rates
Fixed Asset	`fixed_asset`	Excel	Asset list with depreciation
Lease Contract	`lease_contract`	PDF	Terms, amounts, maturity dates
Property Cert	`property_cert`	PDF, image	Owner, location, area, registration
(any other)	`generic`	PDF, image, Excel	Auto-extracted key entities & numbers

Architecture

parse("document.pdf")
    │
    ├─ detect_file_type()      →  pdf / image / excel
    │
    ├─ OCR or Excel Parser     →  raw text (markdown)
    │   ├─ PaddleOCR (local)        [ocr]
    │   ├─ Prismer (GPU service)    env: PRISMER_OCR_BASE_URL
    │   ├─ PyMuPDF (text-only)      [pdf]
    │   └─ openpyxl / xlrd          [excel]
    │
    ├─ detect_doc_type()       →  financial_statement / bank_statement / ...
    │
    └─ LLM Extractor           →  structured JSON
        ├─ DeepSeek (default)
        ├─ OpenAI
        └─ Any OpenAI-compatible API
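
To make the two detection stages in the diagram concrete, here is a minimal sketch of filename-based detection. It is not the SDK's implementation: `guess_file_type`, `guess_doc_type`, the extension map, and the keyword list are all assumptions, with the keywords taken from the example filenames in this README. The README also mentions content-based detection, which this sketch does not attempt.

```python
from pathlib import Path

# Hypothetical extension map, not the SDK's actual table.
EXTENSION_TO_FILE_TYPE = {
    ".pdf": "pdf",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".xlsx": "excel",
    ".xls": "excel",
    ".csv": "excel",
}

# Hypothetical filename keywords, taken from the example filenames in this README.
FILENAME_KEYWORDS = {
    "资产负债表": "financial_statement",
    "财务报表": "financial_statement",
    "银行流水": "bank_statement",
    "营业执照": "business_license",
}


def guess_file_type(file_path: str) -> str:
    """Map a file extension to pdf / image / excel, or 'unknown'."""
    return EXTENSION_TO_FILE_TYPE.get(Path(file_path).suffix.lower(), "unknown")


def guess_doc_type(file_path: str) -> str:
    """Return the first keyword match in the file name, else the generic fallback."""
    name = Path(file_path).name
    for keyword, doc_type in FILENAME_KEYWORDS.items():
        if keyword in name:
            return doc_type
    return "generic"


print(guess_file_type("营业执照.JPG"), guess_doc_type("营业执照.JPG"))
# image business_license
print(guess_file_type("notes.docx"), guess_doc_type("some_unknown_document.pdf"))
# unknown generic
```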

API Reference

`parse(file_path, **kwargs)`

Parameter	Type	Default	Description
`file_path`	`str | Path`	required	Path to document
`doc_type`	`str | None`	`None`	Document type (auto-detect if None)
`llm_provider`	`str`	`"deepseek"`	LLM provider name
`llm_client`	`LLMClient | None`	`None`	Pre-configured client (overrides provider)
`llm_base_url`	`str | None`	`None`	Override provider base URL
`llm_api_key`	`str | None`	`None`	Override API key
`llm_model`	`str | None`	`None`	Override model name
`ocr_backend`	`str`	`"auto"`	OCR backend: auto, paddleocr, prismer, none

Returns a dict with keys: doc_type, file_name, file_type, data.

`parse_async(...)` takes the same parameters as `parse()` and returns a coroutine that must be awaited.

`compare_periods(results, *, significant_change_pct=20.0)`

Compare parse() results across multiple reporting periods.

Parameter	Type	Default	Description
`results`	`list[dict]`	required	List of `parse()` results, ordered earliest → latest
`significant_change_pct`	`float`	`20.0`	Threshold (%) for flagging significant changes

Returns a dict with keys: doc_type, period_count, periods, comparisons, significant_changes.

Two-period comparison

from findocparser import parse, compare_periods

r2023 = parse("财务报表2023.pdf")
r2024 = parse("财务报表2024.pdf")
diff = compare_periods([r2023, r2024])

# Numeric fields get absolute and percentage changes
assets = diff["comparisons"][0]["balance_sheet"]["total_assets"]
print(assets)
# {"previous": 100000000, "current": 125000000, "change": 25000000, "change_pct": 25.0}

# String fields show before/after when different
opinion = diff["comparisons"][0]["opinion_type"]
# {"previous": "标准无保留意见", "current": "保留意见"}

# List fields show count changes
txns = diff["comparisons"][0]["transactions"]
# {"previous_count": 120, "current_count": 185}
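
The numeric entry shown above is consistent with a plain period-over-period calculation. The sketch below reproduces those four fields. It is an illustration rather than the library's code: `numeric_change` is a hypothetical name, and both dividing by the absolute previous value and returning `None` for a zero base are assumptions about edge cases the README does not describe.

```python
from typing import Optional, Union

Number = Union[int, float]


def numeric_change(previous: Number, current: Number) -> dict:
    """Return absolute and percentage change between two period values."""
    change = current - previous
    # Assumption: percentage is relative to |previous|; undefined when previous is 0.
    change_pct: Optional[float] = (change / abs(previous) * 100.0) if previous != 0 else None
    return {
        "previous": previous,
        "current": current,
        "change": change,
        "change_pct": change_pct,
    }


print(numeric_change(100000000, 125000000))
# {'previous': 100000000, 'current': 125000000, 'change': 25000000, 'change_pct': 25.0}
```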

Significant change detection

# Flag fields whose change is at least 20% in either direction (default threshold)
for item in diff["significant_changes"]:
    print(f"{item['field']}: {item['change_pct']:+.1f}%")
# balance_sheet.inventory: +60.0%
# income_statement.net_income: -35.2%

# Custom threshold (e.g., 10%)
diff = compare_periods([r2023, r2024], significant_change_pct=10.0)
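
One way such flagging could work is to walk a comparison dict and collect every entry whose absolute `change_pct` meets the threshold, building dotted field names like those printed above. This is a sketch under that assumption, not the SDK's implementation. `flag_significant` is a hypothetical name and `comparison` is hand-written sample data.

```python
def flag_significant(comparison: dict, threshold_pct: float = 20.0, prefix: str = "") -> list[dict]:
    """Collect fields whose |change_pct| >= threshold_pct, using dotted field names."""
    flagged: list[dict] = []
    for key, value in comparison.items():
        if not isinstance(value, dict):
            continue
        field = f"{prefix}{key}"
        pct = value.get("change_pct")
        if isinstance(pct, (int, float)):
            # Leaf entry with a numeric percentage change.
            if abs(pct) >= threshold_pct:
                flagged.append({"field": field, "change_pct": pct})
        else:
            # Nested section (or a non-numeric entry): recurse one level down.
            flagged.extend(flag_significant(value, threshold_pct, prefix=f"{field}."))
    return flagged


# Hand-written sample shaped like one element of diff["comparisons"].
comparison = {
    "balance_sheet": {
        "inventory": {"previous": 50.0, "current": 80.0, "change": 30.0, "change_pct": 60.0},
        "total_assets": {"previous": 100.0, "current": 112.0, "change": 12.0, "change_pct": 12.0},
    },
    "opinion_type": {"previous": "标准无保留意见", "current": "保留意见"},
}

print(flag_significant(comparison))
# [{'field': 'balance_sheet.inventory', 'change_pct': 60.0}]
print(len(flag_significant(comparison, threshold_pct=10.0)))  # 2
```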

Three-period trend

r2022 = parse("财务报表2022.pdf")
r2023 = parse("财务报表2023.pdf")
r2024 = parse("财务报表2024.pdf")

diff = compare_periods([r2022, r2023, r2024])
print(diff["period_count"])  # 3
print(len(diff["comparisons"]))  # 2 (pairwise: 2022→2023, 2023→2024)

# Track revenue trend across 3 years
for comp in diff["comparisons"]:
    rev = comp["income_statement"]["revenue"]
    print(f"{comp['from_period']} → {comp['to_period']}: {rev['change_pct']:+.1f}%")

Configuration

OCR Backend

# Auto (default): try text extraction first, fall back to PaddleOCR
parse("doc.pdf", ocr_backend="auto")

# Local PaddleOCR (no external service)
parse("doc.pdf", ocr_backend="paddleocr")

# Prismer service (requires PRISMER_OCR_BASE_URL env var)
parse("doc.pdf", ocr_backend="prismer")

# Text-only (PDF with selectable text, no OCR)
parse("doc.pdf", ocr_backend="none")
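
The "auto" behaviour described above (text extraction first, OCR as fallback) can be sketched as a small decision function. This is an assumption about how such a fallback might be structured, not the SDK's code: `extract_text_auto`, the injected extractor callables, and the `min_chars` cutoff are all hypothetical.

```python
from typing import Callable, Optional


def extract_text_auto(
    path: str,
    text_extractor: Callable[[str], str],
    ocr_extractor: Optional[Callable[[str], str]],
    min_chars: int = 20,
) -> str:
    """Prefer embedded text; fall back to OCR when too little text is found."""
    text = text_extractor(path).strip()
    # Enough selectable text, or no OCR backend available: keep the extracted text.
    if len(text) >= min_chars or ocr_extractor is None:
        return text
    return ocr_extractor(path).strip()


# Stand-ins for real backends so the sketch runs without PyMuPDF or PaddleOCR.
digital_pdf = lambda path: "Total assets 125000000.0 as of 2024-12-31"
scanned_pdf = lambda path: ""
fake_ocr = lambda path: "text recovered by OCR"

print(extract_text_auto("digital.pdf", digital_pdf, fake_ocr))  # Total assets 125000000.0 as of 2024-12-31
print(extract_text_auto("scanned.pdf", scanned_pdf, fake_ocr))  # text recovered by OCR
```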

LLM Provider

# DeepSeek (default, recommended for Chinese documents)
parse("doc.pdf", llm_provider="deepseek")

# OpenAI
parse("doc.pdf", llm_provider="openai")

Privacy & Data Security

Important: Document content is sent to the configured LLM API (DeepSeek, OpenAI, etc.) for structured extraction. This includes any PII present in the documents — ID numbers, bank account numbers, financial figures, credit records, etc.

For sensitive documents (credit reports, bank statements, shareholder info with ID numbers):

# Use a self-hosted model to keep data on-premise
result = parse(
    "征信报告.pdf",
    llm_base_url="http://your-vllm-server:8000/v1",  # Self-hosted
    llm_api_key="local",
    llm_model="Qwen/Qwen2.5-14B",
)

Recommendations:

Use self-hosted LLM (Ollama, vLLM, TGI) for documents containing PII
Review your LLM provider's data retention policy before processing sensitive data
In China, processing credit reports and ID numbers via cloud APIs may conflict with the Personal Information Protection Law (个人信息保护法) and Regulation on Credit Information Industry (征信业管理条例)
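
A pipeline that handles sensitive documents could check the configured endpoint before calling `parse()`. The sketch below is one conservative way to do that with the standard library. `is_on_premise` and its `allowed_hosts` parameter are hypothetical, and the check only inspects the URL: it does not resolve DNS names, so internal hostnames must be listed explicitly.

```python
import ipaddress
from urllib.parse import urlparse


def is_on_premise(base_url: str, allowed_hosts: frozenset = frozenset({"localhost"})) -> bool:
    """Return True if base_url points at an allow-listed host or a private/loopback IP."""
    host = urlparse(base_url).hostname
    if host is None:
        return False
    if host in allowed_hosts:
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name that is not allow-listed: treat as external.
        return False
    return address.is_private or address.is_loopback


print(is_on_premise("http://localhost:11434/v1"))    # True
print(is_on_premise("http://10.0.0.5:8000/v1"))      # True
print(is_on_premise("https://api.openai.com/v1"))    # False
print(is_on_premise(
    "http://your-vllm-server:8000/v1",
    allowed_hosts=frozenset({"localhost", "your-vllm-server"}),
))                                                   # True
```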

Contributing

Contributions welcome! Areas that need help:

More extractors (tax invoice, fixed asset, lease, property, land cert)
Better prompt templates for higher extraction accuracy
More OCR backends (Surya, EasyOCR, Tesseract)
More LLM providers (Claude, Gemini, Kimi)
Test coverage

git clone https://github.com/willamhou/fin-doc-parser.git
cd fin-doc-parser
pip install -e ".[dev]"
pytest

License

Apache License 2.0

Related Projects

FinSight — AI-powered stock analysis tool built on fin-doc-parser

This version

0.1.0

Mar 16, 2026

File details

Details for the file fin_doc_parser-0.1.0.tar.gz.

File metadata

Download URL: fin_doc_parser-0.1.0.tar.gz
Upload date: Mar 16, 2026
Size: 167.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for fin_doc_parser-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9182d304d61fd7ba4d1238d6eec986ba240a0360da9fecc1d0c4c2fea9216a4b`
MD5	`eb9a5e075d18f2dd69cb7b7057d3274d`
BLAKE2b-256	`5b2e0d22bb245ac2e3c5dfc4bdc3c3c59d0865f7079596bdc838e5af1092c465`

File details

Details for the file fin_doc_parser-0.1.0-py3-none-any.whl.

File metadata

Download URL: fin_doc_parser-0.1.0-py3-none-any.whl
Upload date: Mar 16, 2026
Size: 32.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for fin_doc_parser-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ea547fe62b8ad034646666739af9724bf0f0d7b5eb3017f89bd592500375c7c4`
MD5	`4c5d8bd827144c6775be0dd9784ef88a`
BLAKE2b-256	`b4ff3695a3b9b8503efd63c4e0f959aa654d6e6b52fb24544cf9532a0496d55d`
