Parse credit card PDF statements from Indian banks (HDFC, ICICI, SBI, Axis, Kotak + 4 more) into structured data. Offline, privacy-first, with Streamlit dashboard.
StmtForge — Credit Card Statement Parser & Analyzer
Open-source, offline-first Python tool to parse credit card PDF statements from Indian banks into structured data.
Install · Quick Start · Supported Banks · Dashboard · Docs
Why StmtForge?
Indian bank credit card statements are password-protected PDFs with inconsistent formats — making expense tracking painful. StmtForge solves this:
- Parse PDF statements from Indian banks (currently implemented: HDFC, ICICI, SBI, Axis, Kotak, Yes, CSB, Federal, IDFC First)
- 100% offline — no data leaves your machine, no cloud APIs, no telemetry
- Hybrid extraction — deterministic regex parsers → table extraction → OCR → local LLM (Ollama)
- One command: `pip install stmtforge` and start analyzing your credit card spend
Built for anyone in India who wants to track credit card expenses without trusting third-party apps with their financial data.
Dashboard Preview
StmtForge includes a Streamlit analytics dashboard with interactive charts, filters, and CSV export.
Analytics: total spend, monthly trends, category breakdown, top merchants, bank & card comparison, daily heatmap, drill-downs
Key Features
| Feature | Description |
|---|---|
| 9 bank-specific parsers | Dedicated parsers for HDFC, ICICI, SBI, Axis, Kotak, Yes, CSB, Federal, IDFC First |
| PDF unlock & parse | Auto-decrypts password-protected statements (DOB, PAN, custom patterns) |
| Hybrid extraction pipeline | Deterministic → table → OCR → local LLM fallback chain |
| Local LLM via Ollama | Qwen / Mistral / Llama3 for unstructured statement parsing |
| Gmail auto-fetch | Read-only OAuth2 — downloads statement PDFs from Gmail automatically |
| Multi-card tracking | Track spend across multiple cards and banks |
| Auto-categorization | Rule-based merchant classification (Shopping, Food, Travel, EMI, etc.) |
| Transaction deduplication | Hash-based dedup with incremental processing |
| Streamlit dashboard | Interactive Plotly charts, sidebar filters, CSV export |
| Privacy-first design | PII redacted from logs, HMAC pseudonymization, DPDP-aligned |
| CLI interface | stmtforge run, stmtforge dashboard, stmtforge init |
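The hash-based deduplication listed in the table can be pictured with a short sketch. This is not StmtForge's implementation: the choice of fields (date, description, amount, card) and the names `transaction_key` and `dedupe` are assumptions made for illustration.

```python
import hashlib
from typing import Optional


def transaction_key(date: str, description: str, amount: float, card: str) -> str:
    """Build a stable fingerprint for one transaction (field choice is an assumption)."""
    # Normalize case and whitespace so cosmetic differences do not defeat dedup.
    normalized = "|".join([
        date.strip(),
        " ".join(description.upper().split()),
        f"{amount:.2f}",
        card.strip().upper(),
    ])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(transactions: list[dict], seen: Optional[set[str]] = None) -> list[dict]:
    """Return only the transactions whose fingerprint is not already in `seen`."""
    seen = set() if seen is None else seen
    fresh = []
    for txn in transactions:
        key = transaction_key(txn["date"], txn["description"], txn["amount"], txn["card"])
        if key not in seen:
            seen.add(key)
            fresh.append(txn)
    return fresh
```

Passing the same `seen` set across runs gives incremental behaviour: a statement processed twice contributes nothing the second time. One limitation of this sketch is that two genuinely identical purchases on the same day would collapse into one; a real pipeline would likely need an extra discriminator such as the row position in the statement.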
Installation
From PyPI
```bash
pip install stmtforge
```
Optional extras:
```bash
pip install "stmtforge[gmail]"   # Gmail fetch support
pip install "stmtforge[ocr]"     # OCR fallback support
pip install "stmtforge[all]"     # Gmail + OCR extras
```
From Source
```bash
git clone https://github.com/madhav921/stmt-forge.git
cd stmt-forge
python -m venv .venv
.venv\Scripts\activate            # Windows
# source .venv/bin/activate       # macOS / Linux
pip install -e ".[dev]"           # developer tools only
pip install -e ".[dev,all]"       # developer tools + Gmail + OCR extras
```
Requirements
| Requirement | Purpose |
|---|---|
| Python 3.11+ | Runtime |
| Ollama (optional) | Local LLM for unstructured PDF parsing |
| Google Cloud project (optional) | Gmail API — not needed for manual PDF import |
| Tesseract OCR binary (optional) | Required by OCR fallback when stmtforge[ocr] is installed |
| qpdf (optional) | Fallback PDF decryption |
Quick Start
```bash
# 1. Set up project
mkdir ~/my-statements && cd ~/my-statements
stmtforge init                         # creates config.yaml, .env.example, data/

# 2. Configure PDF passwords
cp .env.example .env                   # then edit .env with your passwords

# 3. (Optional) Set up local LLM
ollama pull qwen2.5:3b

# 4. Run the pipeline
stmtforge run --local                  # parse local PDFs
stmtforge run --full                   # Gmail fetch + parse
stmtforge run --folder path/to/pdfs    # specific folder

# 5. View insights
stmtforge dashboard
```
Manual PDF import: Drop PDFs into `data/raw_pdfs/<bank>/` and run `stmtforge run --local`. No Gmail setup needed.
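To check what a local run would see, the per-bank folder layout can be walked with a few lines of standard-library Python. This is only a sketch of the folder convention described above; `find_statements` is a hypothetical helper, not part of the StmtForge API.

```python
from pathlib import Path


def find_statements(root: Path) -> dict[str, list[Path]]:
    """Group PDFs under root/<bank>/ by their bank folder name."""
    found: dict[str, list[Path]] = {}
    if not root.is_dir():
        return found
    for bank_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Match the extension case-insensitively; banks often send ".PDF".
        pdfs = sorted(
            f for f in bank_dir.iterdir()
            if f.is_file() and f.suffix.lower() == ".pdf"
        )
        if pdfs:
            found[bank_dir.name] = pdfs
    return found


if __name__ == "__main__":
    for bank, files in find_statements(Path("data/raw_pdfs")).items():
        print(f"{bank}: {len(files)} statement(s)")
```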
Supported Banks
| Bank | Parser | Card Detection |
|---|---|---|
| HDFC Bank | `hdfc_parser` | Swiggy, Tata Neu, Millennia, etc. |
| ICICI Bank | `icici_parser` | Amazon Pay, Coral, Platinum, etc. |
| SBI Card | `sbi_parser` | Cashback, Elite, SimplyCLICK, etc. |
| Axis Bank | `axis_parser` | Neo, Flipkart, Ace, etc. |
| Kotak Mahindra | `kotak_parser` | 811, League Platinum, etc. |
| Yes Bank | `yes_parser` | Marquee, Prosperity, etc. |
| CSB Bank | `csb_parser` | Edge, etc. |
| Federal Bank | `federal_parser` | Signet, Scapia, etc. |
| IDFC First Bank | `idfc_first_parser` | First Select, Classic, WOW, etc. |
| Any other bank | `generic_parser` + LLM | Auto-detected |
Statement formats change over time. Open an issue if a parser produces incorrect results.
How It Works
PDF → Unlock → Bank Parser → Table Extraction → OCR → LLM → Validate → SQLite → Dashboard
- PDF Unlock — Tries password combos (DOB, PAN, custom) via pikepdf
- Bank Parser — Bank-specific regex parser extracts transactions directly
- Fallback Chain — Table extraction (pdfplumber) → Layout text → OCR (Tesseract) → Local LLM (Ollama)
- Validation — Date normalization, amount bounds, dedup, confidence scoring
- Categorization — Rule-based merchant → category mapping
- Storage — SQLite with transaction-level deduplication and incremental processing
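The fallback chain in the steps above amounts to "try the cheapest extractor first and stop at the first one that returns rows". A minimal sketch of that control flow follows; the `run_chain` function and the stage callables are hypothetical stand-ins, and the real pipeline may also use confidence scores to decide when to fall through.

```python
from typing import Callable, Optional

# An extractor takes a PDF path and returns rows, or None / [] when it finds nothing.
Extractor = Callable[[str], Optional[list[dict]]]


def run_chain(pdf_path: str, stages: list[tuple[str, Extractor]]) -> tuple[str, list[dict]]:
    """Try each stage in order and return (stage name, rows) for the first non-empty result."""
    for name, extract in stages:
        try:
            rows = extract(pdf_path)
        except Exception:
            # A failing stage should not abort the chain; fall through to the next one.
            rows = None
        if rows:
            return name, rows
    return "none", []
```

In use, `stages` would list the bank parser, table extraction, OCR, and LLM steps in that order, so the slower and less deterministic stages only run when the earlier ones produce nothing.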
Privacy & Security
StmtForge is built around a local-first, zero-upload architecture.
| Aspect | Details |
|---|---|
| Processing | 100% local: no cloud, no external APIs |
| Storage | Local SQLite + local files only |
| Telemetry | None: no analytics, no phone-home |
| Log privacy | PII auto-redacted (emails, phones, PAN, card numbers) |
| PDF passwords | `.env` → memory only; never logged or stored in DB |
| Gmail | Optional, read-only OAuth2; revoke anytime at Google Permissions |
See SECURITY.md for vulnerability reporting and full security policy.
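As a rough illustration of log redaction and HMAC pseudonymization, consider the sketch below. The regular expressions, the `[LABEL]` placeholders, and the function names are assumptions chosen for the example; StmtForge's actual patterns and token format are not documented here and may differ.

```python
import hashlib
import hmac
import re

# Illustrative patterns only; real statements may need stricter or broader rules.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)"),
}


def redact(text: str) -> str:
    """Replace each PII match with a bracketed label before the text is logged."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def pseudonymize(value: str, secret: bytes) -> str:
    """Keyed hash: the same value and key always give the same token."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; the length is an arbitrary choice
```

The keyed hash lets records for the same card be grouped without storing the card number itself, and without the key an attacker cannot simply hash candidate numbers to reverse the tokens.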
Configuration
stmtforge init creates a config.yaml with these sections:
| Section | Purpose |
|---|---|
| `gmail` | Sender domains, search keywords, attachment filters |
| `credit_cards` | Your banks and card names |
| `pdf_passwords` | Password patterns (from `.env`) |
| `parsers` | Email/filename → bank mapping, card identifiers |
| `categories` | Merchant → category rules |
| `database` | SQLite path |
| `llm` | Ollama model, URL, temperature |
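The `categories` section holds merchant-to-category rules. How such rules might be applied is sketched below. The rule table is invented for the example (only the category names come from this README), and the keyword-list format is an assumption, not the actual `config.yaml` schema.

```python
import re

# Hypothetical rules: category -> keywords searched for in the merchant text.
# Order matters: the first category with a matching keyword wins.
RULES: dict[str, list[str]] = {
    "EMI": ["emi"],
    "Food": ["swiggy", "zomato"],
    "Shopping": ["amazon", "flipkart"],
    "Travel": ["irctc", "makemytrip"],
}


def categorize(description: str, rules: dict[str, list[str]] = RULES, default: str = "Other") -> str:
    """Return the first category whose keyword appears as a whole word."""
    text = description.lower()
    for category, keywords in rules.items():
        for keyword in keywords:
            # Word boundaries stop "emi" from matching inside "premium".
            if re.search(rf"\b{re.escape(keyword)}\b", text):
                return category
    return default
```

Putting `EMI` first means an instalment charged through a shopping merchant is counted as an EMI rather than as shopping; reversing the order would flip that choice.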
Adding a New Bank Parser
```python
from stmtforge.parsers.base_parser import BaseParser, parse_date, parse_amount


class MyBankParser(BaseParser):
    BANK_NAME = "mybank"

    def parse(self, pdf_path):
        # Placeholder: extract transactions from pdf_path here.
        records = [...]
        return self._get_standard_df(records)
```
Register the parser in `src/stmtforge/parsers/registry.py` and add its mappings to `config_template.yaml`.
See CONTRIBUTING.md for details.
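The extraction step inside `parse` usually comes down to matching one regex per statement line. The standalone sketch below assumes a hypothetical line layout (`DD/MM/YYYY  DESCRIPTION  1,234.56 [Cr]`) and deliberately avoids the library's `parse_date` and `parse_amount` helpers, since their signatures are not shown here. Real statements vary by bank and over time.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical layout: date, free-text description, amount, optional "Cr" marker.
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<desc>.+?)\s+"
    r"(?P<amount>[\d,]+\.\d{2})(?:\s*(?P<credit>Cr))?$",
    re.IGNORECASE,
)


def parse_line(line: str) -> Optional[dict]:
    """Turn one statement line into a record, or None if it is not a transaction."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    try:
        txn_date = datetime.strptime(match["date"], "%d/%m/%Y").date()
    except ValueError:
        return None  # looks like a date but is not a valid one, e.g. 31/02/2024
    amount = float(match["amount"].replace(",", ""))
    if match["credit"]:
        amount = -amount  # credits (refunds, payments) reduce spend
    return {
        "date": txn_date,
        "description": " ".join(match["desc"].split()),
        "amount": amount,
    }
```

A parser built this way would call `parse_line` on each line of extracted text, keep the non-`None` results, and hand them to `_get_standard_df`.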
Project Structure
```
stmt-forge/
├── src/stmtforge/              # Package source
│   ├── cli.py                  # CLI entry point
│   ├── run_pipeline.py         # Pipeline orchestrator
│   ├── hybrid_pipeline.py      # Hybrid extraction engine
│   ├── parsers/                # 9 bank parsers + generic + categorizer
│   ├── dashboard/              # Streamlit analytics app
│   ├── pdf_processing/         # PDF unlock & text extraction
│   ├── llm/                    # Ollama client & prompts
│   ├── gmail/                  # Gmail OAuth & fetcher
│   ├── database/               # SQLite layer
│   ├── validator/              # Transaction validation
│   └── utils/                  # Config, logging, privacy, hashing
├── tests/                      # Test suite
├── pyproject.toml              # Build config
└── README.md
```
Contributing
Bug reports, new bank parsers, and code fixes are welcome. See CONTRIBUTING.md.
License
File details
Details for the file stmtforge-0.1.1.tar.gz.
File metadata
- Download URL: stmtforge-0.1.1.tar.gz
- Upload date:
- Size: 3.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | da34bdd69cd3d09028f0948f27ef4c8cfafbf46daf0e84e349d1e532db36bb88 |
| MD5 | 2a28ef655ab7d68388b3db4431d4f069 |
| BLAKE2b-256 | 20dab76bfc167d4c07a9305230b0a1c240c75f0954db66b0dc4b7982821fde8a |
File details
Details for the file stmtforge-0.1.1-py3-none-any.whl.
File metadata
- Download URL: stmtforge-0.1.1-py3-none-any.whl
- Upload date:
- Size: 90.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6e1cc28d3b5b8588a8354ec34726778344d2b429dc2f0d3be70a59a7bb944aa4 |
| MD5 | 1cd2a2dee92e8323cb06dbbd63041ec0 |
| BLAKE2b-256 | fdfbc4a2dbe4a15069bbcce8b5b1eb8b904d16419e4f039cdf0aee13b13df7fa |