AI-powered financial document parsing SDK. Extract structured data from financial statements, bank statements, invoices, and more.
fin-doc-parser
Extract structured JSON from financial statements, bank statements, invoices, business licenses, and more.
Why fin-doc-parser?
Financial documents are messy — scanned PDFs, inconsistent Excel formats, images of licenses. Extracting structured data from them typically requires weeks of custom code.
fin-doc-parser solves this in 3 lines:
from findocparser import parse
result = parse("财务报表2024.pdf", doc_type="financial_statement")
print(result["data"]["balance_sheet"]["total_assets"]) # 125000000.0
Features
- 13 document types — financial statements, bank statements, business licenses, audit reports, credit reports, shareholder info, financial notes, MD&A, guarantees, equity changes, tax invoices, and more
- Pluggable OCR — PaddleOCR (local, free), Prismer (GPU service), or text-only extraction
- Pluggable LLM — DeepSeek, OpenAI, or any OpenAI-compatible API (Ollama, vLLM, etc.)
- Bring your own client — pass a pre-configured LLMClient instance directly
- Excel support — xlsx, xls, csv with automatic markdown conversion
- Auto-detection — file type and document type detected from filename and content
- Generic fallback — unknown document types get a best-effort extraction
- Multi-period comparison — compare_periods() computes period-over-period changes with significant change detection
- Async-first — parse_async() for high-throughput pipelines
- Minimal core — only httpx required; OCR, Excel, PDF are optional
Quick Start
Install
pip install fin-doc-parser
# With Excel support (xlsx/xls)
pip install "fin-doc-parser[excel]"
# With PDF text extraction (PyMuPDF)
pip install "fin-doc-parser[pdf]"
# With local OCR (PaddleOCR, no external service)
pip install "fin-doc-parser[ocr]"
# Everything
pip install "fin-doc-parser[all]"
Set API key
# Pick one:
export DEEPSEEK_API_KEY="sk-..." # Recommended (cheap + good at Chinese)
export OPENAI_API_KEY="sk-..." # Also works
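A script can fail fast when neither key is set instead of discovering the problem mid-parse. The following is a minimal sketch, not the library's own resolution logic: the helper name `pick_provider` and the preference order are assumptions, and only the two environment variable names come from the section above.

```python
import os
from typing import Mapping, Optional

# Environment variable per provider, in an assumed order of preference.
PROVIDER_ENV_VARS = {
    "deepseek": "DEEPSEEK_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def pick_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first provider whose API key is set and non-empty."""
    env = os.environ if env is None else env
    for provider, var in PROVIDER_ENV_VARS.items():
        if env.get(var, "").strip():
            return provider
    raise RuntimeError(
        "No API key found; set one of: " + ", ".join(PROVIDER_ENV_VARS.values())
    )
```

The returned name could then be passed as `llm_provider=pick_provider()` when calling `parse()`.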
Parse a document
from findocparser import parse
# Financial statement (PDF or image)
result = parse("资产负债表2024.pdf")
balance_sheet = result["data"]["balance_sheet"]
print(f"Total assets: {balance_sheet['total_assets']}")
print(f"Total liabilities: {balance_sheet['total_liabilities']}")
# Bank statement
result = parse("银行流水_2024.pdf")
for txn in result["data"]["transactions"][:5]:
    print(f"{txn['date']} {txn['counterparty']} {txn['amount']}")
# Business license (image)
result = parse("营业执照.jpg")
print(f"Company: {result['data']['company_name']}")
print(f"Credit code: {result['data']['unified_social_credit_code']}")
# Excel file
result = parse("固定资产清单.xlsx", doc_type="fixed_asset")
# Auto-detect document type
result = parse("some_unknown_document.pdf")
print(f"Detected type: {result['doc_type']}")
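Because the data is extracted by an LLM, a field such as `total_assets` may be absent from a given result, and chained indexing would then raise `KeyError`. One defensive pattern is a small lookup helper. This is a sketch only: `get_path` is a hypothetical helper, not part of fin-doc-parser, and `sample` merely imitates the result shape shown above.

```python
from typing import Any


def get_path(result: dict, dotted_path: str, default: Any = None) -> Any:
    """Walk nested dicts along 'a.b.c'; return default if any key is missing."""
    node: Any = result
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Hand-written stand-in for a parse() result.
sample = {
    "doc_type": "financial_statement",
    "data": {"balance_sheet": {"total_assets": 125000000.0}},
}

total_assets = get_path(sample, "data.balance_sheet.total_assets")
total_equity = get_path(sample, "data.balance_sheet.total_equity", default=0.0)
```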
Async usage
import asyncio
from findocparser import parse_async

async def main():
    result = await parse_async("report.pdf", llm_provider="deepseek")
    print(result["data"])

asyncio.run(main())
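For a batch of files, the usual asyncio pattern is to run several parses concurrently while capping the number in flight, so the LLM endpoint is not flooded. The sketch below uses a stand-in coroutine (`fake_parse_async`) so it runs without the library; `parse_many` and `max_concurrency` are hypothetical names, and in real use `parse_async` would be passed as `parse_fn`.

```python
import asyncio
from typing import Awaitable, Callable, List


async def parse_many(
    paths: List[str],
    parse_fn: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 4,
) -> List[dict]:
    """Parse all paths concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> dict:
        async with semaphore:
            return await parse_fn(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))


async def fake_parse_async(path: str) -> dict:
    """Stand-in for parse_async so the sketch is self-contained."""
    await asyncio.sleep(0)
    return {"file_name": path, "data": {}}


results = asyncio.run(parse_many(["a.pdf", "b.pdf", "c.pdf"], fake_parse_async, 2))
```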
Custom LLM endpoint
from findocparser import parse, parse_async, OpenAIClient
# Option 1: Pass config through parse()
result = parse(
    "report.pdf",
    llm_base_url="http://localhost:11434/v1",  # Ollama
    llm_api_key="ollama",
    llm_model="qwen2.5:14b",
)
# Option 2: Bring your own LLM client
client = OpenAIClient(
    provider="openai",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
    model="qwen2.5:14b",
)
result = parse("report.pdf", llm_client=client)
Quick Start
Install
pip install fin-doc-parser
# With Excel support
pip install "fin-doc-parser[excel]"
# With PDF text extraction
pip install "fin-doc-parser[pdf]"
# With local OCR (no external service required)
pip install "fin-doc-parser[ocr]"
Set API key
export DEEPSEEK_API_KEY="sk-..." # Recommended (cheap, strong at Chinese)
Parse a document
from findocparser import parse
# Parse a financial statement in one line
result = parse("资产负债表2024.pdf")
print(result["data"]["balance_sheet"]["total_assets"])
# Parse a bank statement
result = parse("银行流水.pdf")
print(result["data"]["transactions"])
# Parse a business license (image)
result = parse("营业执照.jpg")
print(result["data"]["company_name"])
# Custom LLM endpoint (e.g., Ollama)
result = parse(
    "report.pdf",
    llm_base_url="http://localhost:11434/v1",
    llm_api_key="ollama",
    llm_model="qwen2.5:14b",
)
Multi-period comparison
from findocparser import parse, compare_periods
# Parse two periods of financial statements
r2023 = parse("财务报表2023.pdf")
r2024 = parse("财务报表2024.pdf")
# Compute year-over-year changes automatically
diff = compare_periods([r2023, r2024])
# Inspect the change in assets
assets = diff["comparisons"][0]["balance_sheet"]["total_assets"]
print(f"Total assets change: {assets['change_pct']:+.1f}%") # +25.0%
# Inspect significant changes (default ±20%)
for item in diff["significant_changes"]:
    print(f"{item['field']}: {item['change_pct']:+.1f}%")
# Three-period trend analysis
r2022 = parse("财务报表2022.pdf")
diff = compare_periods([r2022, r2023, r2024]) # returns 2 period-by-period comparisons
Supported Document Types
| Document Type | doc_type | Input Formats | Output |
|---|---|---|---|
| Financial Statement | financial_statement | PDF, image, Excel | Balance sheet, income statement, cash flow |
| Bank Statement | bank_statement | PDF, image | Transaction list with counterparty & amounts |
| Business License | business_license | PDF, image | Company name, credit code, legal rep, scope |
| Audit Report | audit_report | | Opinion type, going concern, key audit matters, signatories |
| Credit Report | credit_report | | Credit lines, overdue records, utilization |
| Shareholder Info | shareholder_info | PDF, image | Shareholder names, ratios, capital |
| Financial Notes | financial_notes | | Accounting policies, related party txns, contingent liabilities |
| MD&A | md_and_a | | Business overview, operating results, risk factors, outlook |
| Guarantee Disclosure | guarantee | | Guarantee summary, details, violation guarantees |
| Equity Changes Stmt | equity_changes_stmt | | Opening/closing balance, changes, profit distribution |
| Tax Invoice | tax_invoice | PDF, image, Excel | Invoice items, amounts, tax rates |
| Fixed Asset | fixed_asset | Excel | Asset list with depreciation |
| Lease Contract | lease_contract | | Terms, amounts, maturity dates |
| Property Cert | property_cert | PDF, image | Owner, location, area, registration |
| (any other) | generic | PDF, image, Excel | Auto-extracted key entities & numbers |
Architecture
parse("document.pdf")
│
├─ detect_file_type() → pdf / image / excel
│
├─ OCR or Excel Parser → raw text (markdown)
│ ├─ PaddleOCR (local) [ocr]
│ ├─ Prismer (GPU service) env: PRISMER_OCR_BASE_URL
│ ├─ PyMuPDF (text-only) [pdf]
│ └─ openpyxl / xlrd [excel]
│
├─ detect_doc_type() → financial_statement / bank_statement / ...
│
└─ LLM Extractor → structured JSON
├─ DeepSeek (default)
├─ OpenAI
└─ Any OpenAI-compatible API
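The first pipeline stage maps a file to one of pdf / image / excel. A minimal sketch of extension-based routing follows. It is an illustration, not the library's `detect_file_type()`: the helper name `guess_file_type`, the exact extension list, and the choice to route csv to the Excel path are all assumptions, and the real implementation also inspects content.

```python
from pathlib import Path

# Assumed mapping; the extensions actually supported by the library may differ.
EXTENSION_TO_FILE_TYPE = {
    ".pdf": "pdf",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".xlsx": "excel",
    ".xls": "excel",
    ".csv": "excel",
}


def guess_file_type(file_path) -> str:
    """Classify a path as 'pdf', 'image', or 'excel' from its extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return EXTENSION_TO_FILE_TYPE[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None
```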
API Reference
parse(file_path, **kwargs)
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str \| Path | required | Path to document |
| doc_type | str \| None | None | Document type (auto-detect if None) |
| llm_provider | str | "deepseek" | LLM provider name |
| llm_client | LLMClient \| None | None | Pre-configured client (overrides provider) |
| llm_base_url | str \| None | None | Override provider base URL |
| llm_api_key | str \| None | None | Override API key |
| llm_model | str \| None | None | Override model name |
| ocr_backend | str | "auto" | OCR backend: auto, paddleocr, prismer, none |
Returns dict with keys: doc_type, file_name, file_type, data.
parse_async(...) takes the same parameters and returns a coroutine.
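The four top-level keys can be written down as a type so that downstream code and type checkers agree on the shape. This is a sketch: `ParseResult` and `missing_keys` are hypothetical names, the value types are assumptions inferred from the examples, and the structure of `data` varies by document type.

```python
from typing import Any, Dict, List, TypedDict


class ParseResult(TypedDict):
    """Assumed shape of the dict returned by parse()."""

    doc_type: str
    file_name: str
    file_type: str
    data: Dict[str, Any]


REQUIRED_KEYS = ("doc_type", "file_name", "file_type", "data")


def missing_keys(result: dict) -> List[str]:
    """List the documented top-level keys absent from a result."""
    return [key for key in REQUIRED_KEYS if key not in result]
```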
compare_periods(results, *, significant_change_pct=20.0)
Compare parse() results across multiple reporting periods.
| Parameter | Type | Default | Description |
|---|---|---|---|
| results | list[dict] | required | List of parse() results, ordered earliest → latest |
| significant_change_pct | float | 20.0 | Threshold (%) for flagging significant changes |
Returns dict with keys: doc_type, period_count, periods, comparisons, significant_changes.
Two-period comparison
from findocparser import parse, compare_periods
r2023 = parse("财务报表2023.pdf")
r2024 = parse("财务报表2024.pdf")
diff = compare_periods([r2023, r2024])
# Numeric fields get absolute and percentage changes
assets = diff["comparisons"][0]["balance_sheet"]["total_assets"]
print(assets)
# {"previous": 100000000, "current": 125000000, "change": 25000000, "change_pct": 25.0}
# String fields show before/after when different
opinion = diff["comparisons"][0]["opinion_type"]
# {"previous": "标准无保留意见", "current": "保留意见"}
# List fields show count changes
txns = diff["comparisons"][0]["transactions"]
# {"previous_count": 120, "current_count": 185}
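The numeric entry above is plain arithmetic: the change is current minus previous, and the percentage is that change relative to the previous value. A sketch under stated assumptions: `numeric_change` is a hypothetical helper, and returning `None` for a zero baseline, dividing by the absolute previous value, and rounding to two decimals are guesses rather than documented behavior.

```python
from typing import Optional


def numeric_change(previous: float, current: float) -> dict:
    """Absolute and percentage change between two period values."""
    change = current - previous
    change_pct: Optional[float] = None
    if previous != 0:
        # Divide by |previous| so a negative baseline keeps a meaningful sign.
        change_pct = round(change / abs(previous) * 100, 2)
    return {
        "previous": previous,
        "current": current,
        "change": change,
        "change_pct": change_pct,
    }


assets = numeric_change(100000000, 125000000)
```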
Significant change detection
# Flag fields with ≥20% change (default threshold)
for item in diff["significant_changes"]:
    print(f"{item['field']}: {item['change_pct']:+.1f}%")
# balance_sheet.inventory: +60.0%
# income_statement.net_income: -35.2%
# Custom threshold (e.g., 10%)
diff = compare_periods([r2023, r2024], significant_change_pct=10.0)
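Threshold flagging can be pictured as a walk over one comparison dict that collects every numeric field whose absolute percentage change meets the threshold. The sketch below is an illustration of that idea, not the library's code: `find_significant_changes` is a hypothetical helper, and the sample comparison is hand-written in the shape shown earlier.

```python
from typing import List


def find_significant_changes(
    comparison: dict, threshold_pct: float = 20.0, prefix: str = ""
) -> List[dict]:
    """Collect fields whose |change_pct| is at least threshold_pct."""
    flagged: List[dict] = []
    for key, value in comparison.items():
        if not isinstance(value, dict):
            continue  # scalar metadata such as period labels
        field = f"{prefix}{key}"
        if "change_pct" in value:
            pct = value["change_pct"]
            if pct is not None and abs(pct) >= threshold_pct:
                flagged.append({"field": field, "change_pct": pct})
        else:
            # Nested section (e.g. balance_sheet): recurse with a dotted prefix.
            flagged.extend(
                find_significant_changes(value, threshold_pct, prefix=field + ".")
            )
    return flagged


# Hand-written sample, not real output.
comparison = {
    "balance_sheet": {
        "inventory": {"previous": 50, "current": 80, "change": 30, "change_pct": 60.0},
        "total_assets": {"previous": 100, "current": 112, "change": 12, "change_pct": 12.0},
    },
    "opinion_type": {"previous": "标准无保留意见", "current": "保留意见"},
}
```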
Three-period trend
r2022 = parse("财务报表2022.pdf")
r2023 = parse("财务报表2023.pdf")
r2024 = parse("财务报表2024.pdf")
diff = compare_periods([r2022, r2023, r2024])
print(diff["period_count"]) # 3
print(len(diff["comparisons"])) # 2 (pairwise: 2022→2023, 2023→2024)
# Track revenue trend across 3 years
for comp in diff["comparisons"]:
    rev = comp["income_statement"]["revenue"]
    print(f"{comp['from_period']} → {comp['to_period']}: {rev['change_pct']:+.1f}%")
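Three periods yield two comparisons because each period is paired with its successor. The sketch below shows that pairing, plus one way to fold the pairwise percentages into a cumulative change over the whole span. Both helpers are hypothetical and are not part of fin-doc-parser.

```python
from typing import List, Tuple


def pairwise_periods(periods: List[str]) -> List[Tuple[str, str]]:
    """Pair each period with the next: n periods give n - 1 pairs."""
    return list(zip(periods, periods[1:]))


def cumulative_change_pct(pairwise_pcts: List[float]) -> float:
    """Compound successive percentage changes into one overall change."""
    factor = 1.0
    for pct in pairwise_pcts:
        factor *= 1 + pct / 100
    return round((factor - 1) * 100, 2)


pairs = pairwise_periods(["2022", "2023", "2024"])
overall = cumulative_change_pct([10.0, 20.0])  # +10% then +20%
```

Compounding matters here: a 10% rise followed by a 20% rise is a 32% rise overall, not 30%.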
Configuration
OCR Backend
# Auto (default): try text extraction first, fall back to PaddleOCR
parse("doc.pdf", ocr_backend="auto")
# Local PaddleOCR (no external service)
parse("doc.pdf", ocr_backend="paddleocr")
# Prismer service (requires PRISMER_OCR_BASE_URL env var)
parse("doc.pdf", ocr_backend="prismer")
# Text-only (PDF with selectable text, no OCR)
parse("doc.pdf", ocr_backend="none")
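The "auto" behavior described above (text extraction first, OCR as fallback) can be sketched with two injected callables. This is an assumption-laden illustration: `extract_text_auto`, the `min_chars` cutoff, and the rule "too little text means the PDF is scanned" are hypothetical and may not match how the library decides.

```python
from typing import Callable


def extract_text_auto(
    path: str,
    extract_text: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 20,
) -> str:
    """Use embedded text when there is enough of it, otherwise fall back to OCR."""
    text = extract_text(path).strip()
    if len(text) >= min_chars:
        return text
    # Little or no selectable text: treat the file as scanned.
    return run_ocr(path)


# Stand-in backends so the sketch runs without PyMuPDF or PaddleOCR.
scanned = extract_text_auto("scan.pdf", lambda p: "", lambda p: "text from OCR")
digital = extract_text_auto(
    "digital.pdf", lambda p: "Balance sheet as of 2024-12-31", lambda p: "text from OCR"
)
```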
LLM Provider
# DeepSeek (default, recommended for Chinese documents)
parse("doc.pdf", llm_provider="deepseek")
# OpenAI
parse("doc.pdf", llm_provider="openai")
Privacy & Data Security
Important: Document content is sent to the configured LLM API (DeepSeek, OpenAI, etc.) for structured extraction. This includes any PII present in the documents — ID numbers, bank account numbers, financial figures, credit records, etc.
For sensitive documents (credit reports, bank statements, shareholder info with ID numbers):
# Use a self-hosted model to keep data on-premise
result = parse(
    "征信报告.pdf",
    llm_base_url="http://your-vllm-server:8000/v1",  # Self-hosted
    llm_api_key="local",
    llm_model="Qwen/Qwen2.5-14B",
)
Recommendations:
- Use self-hosted LLM (Ollama, vLLM, TGI) for documents containing PII
- Review your LLM provider's data retention policy before processing sensitive data
- In China, processing credit reports and ID numbers via cloud APIs may conflict with the Personal Information Protection Law (个人信息保护法) and Regulation on Credit Information Industry (征信业管理条例)
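The first recommendation can be enforced in code with a guard that refuses to send sensitive document types to an endpoint that is not on a private network. The sketch below is hypothetical throughout: `SENSITIVE_DOC_TYPES`, `is_private_endpoint`, and `check_endpoint` are illustrative names, and the set of sensitive types is taken from the examples named in this section.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlparse

# Assumed list, based on the sensitive documents named in this section.
SENSITIVE_DOC_TYPES = {"credit_report", "bank_statement", "shareholder_info"}


def is_private_endpoint(base_url: str) -> bool:
    """True for localhost or a literal private/loopback IP address."""
    host = urlparse(base_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # DNS name: cannot be verified without a lookup
    return address.is_private or address.is_loopback


def check_endpoint(doc_type: str, base_url: Optional[str]) -> None:
    """Raise if a sensitive document would go to a non-private endpoint."""
    if doc_type not in SENSITIVE_DOC_TYPES:
        return
    if base_url is None or not is_private_endpoint(base_url):
        raise PermissionError(
            f"{doc_type} documents must be sent to a self-hosted endpoint"
        )
```

Internal hostnames such as `your-vllm-server` are rejected by this sketch because they cannot be classified offline; an explicit allowlist of trusted hostnames would be one way to extend it.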
Contributing
Contributions welcome! Areas that need help:
- More extractors (tax invoice, fixed asset, lease, property, land cert)
- Better prompt templates for higher extraction accuracy
- More OCR backends (Surya, EasyOCR, Tesseract)
- More LLM providers (Claude, Gemini, Kimi)
- Test coverage
git clone https://github.com/willamhou/fin-doc-parser.git
cd fin-doc-parser
pip install -e ".[dev]"
pytest
License
Related Projects
- FinSight — AI-powered stock analysis tool built on fin-doc-parser