Credit Card Statement Parser & Analyzer for Indian Banks
StmtForge
A fully local, privacy-first credit card statement parser & analyzer for Indian banks.
StmtForge extracts transactions from bank-issued PDF statements using a hybrid pipeline (deterministic parsers → table extraction → OCR → local LLM), stores them in a local SQLite database, and presents insights through an interactive Streamlit dashboard.
All data stays on your machine. No cloud uploads. No external API calls unless you opt in to Gmail integration. The optional LLM runs locally via Ollama.
Features
| Capability | Details |
|---|---|
| Gmail integration | Fetches statement PDFs via Gmail API (read-only scope) |
| PDF unlocking | pikepdf / qpdf with configurable password patterns |
| Hybrid extraction | Deterministic → table → layout text → OCR → LLM fallback |
| Local LLM | Ollama (Qwen / Mistral / Llama3) for unstructured extraction |
| 9 bank parsers | HDFC · ICICI · SBI · Axis · Kotak · Yes · CSB · Federal · IDFC First |
| Multi-card support | Per-card tracking across banks |
| Auto-categorization | Rule-based merchant categorization |
| Validation | Deduplication, date/amount checks, confidence scoring |
| SQLite storage | Local DB with incremental processing |
| Dashboard | Streamlit + Plotly charts, filters, CSV export |
| Privacy logging | DPDP-aligned pseudonymization; PII redacted from all logs |
Installation
From PyPI
pip install stmtforge
From source (for development)
git clone https://github.com/madhav921/stmt-forge.git
cd stmt-forge
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS / Linux
pip install -e ".[dashboard,dev]"
Prerequisites
| Requirement | Purpose |
|---|---|
| Python 3.11+ | Runtime |
| Ollama | Local LLM (optional but recommended) |
| Google Cloud project | Gmail API access (optional — manual PDF import works without it) |
| qpdf | Fallback PDF decryption (optional) |
Quick Start
# 1. Initialize a project directory
mkdir ~/my-statements && cd ~/my-statements
stmtforge init # creates config.yaml, .env.example, and data/ directories
# 2. (Optional) Set up Gmail — see docs below
# 3. Add PDF passwords to .env
# Windows (PowerShell)
Copy-Item .env.example .env
# macOS / Linux
# cp .env.example .env
# then edit .env
# 4. Pull an Ollama model
ollama pull qwen2.5:3b
# 5. Run the pipeline
stmtforge run --local # local PDFs only
stmtforge run --full # full Gmail fetch + parse
stmtforge run --folder path/to/pdfs # specific folder
# 6. Launch dashboard
stmtforge dashboard
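Step 4 pulls a model for the optional LLM stage, which talks to the Ollama server running on your machine. The sketch below shows roughly what such a call could look like using Ollama's `/api/generate` endpoint on its default port. The prompt wording, the `transactions` reply shape, and the function names are assumptions made for illustration, not StmtForge's actual prompts or client code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(statement_text, model="qwen2.5:3b"):
    """Build a non-streaming, JSON-mode request body (prompt wording is illustrative)."""
    prompt = (
        "Extract the credit card transactions from the text below. "
        'Reply with JSON shaped like {"transactions": '
        '[{"date": "...", "description": "...", "amount": 0.0}]}.\n\n'
        + statement_text
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,   # ask for one complete reply instead of chunks
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
        "options": {"temperature": 0},
    }


def parse_reply(body):
    """Return the transaction list from a response body, or [] if it is unusable."""
    try:
        reply = json.loads(body)
        return json.loads(reply["response"]).get("transactions", [])
    except (KeyError, TypeError, AttributeError, json.JSONDecodeError):
        return []


def extract_with_ollama(statement_text, timeout=120):
    """Send the text to the local Ollama server and return parsed transactions."""
    data = json.dumps(build_payload(statement_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_reply(response.read().decode("utf-8"))
```

Calling `extract_with_ollama(text)` needs a running Ollama server with the model already pulled. The request goes to `localhost` only, so no statement text leaves the machine.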
Privacy & Data Handling
StmtForge is designed around a local-first, zero-upload architecture.
| Concern | How it's handled |
|---|---|
| Where data is processed | Entirely on your local machine |
| What data is accessed | PDF files (local or Gmail), extracted transactions |
| External network calls | Gmail API (opt-in, read-only) and local Ollama only |
| Analytics / tracking | None — no telemetry, no phone-home |
| Data storage | Local SQLite database + local files only |
| Log privacy | All PII (emails, phones, PAN, card numbers) is automatically redacted from logs |
This tool processes credit card statements locally on your machine.
- No data is uploaded to any external server.
- No user data is stored or logged beyond your local project directory.
- No analytics or tracking is performed.
- All parsing and analysis happens entirely offline unless you explicitly
enable Gmail integration.
Gmail Integration (optional)
If you enable Gmail integration:
- The tool uses read-only access via the Gmail API (gmail.readonly scope).
- Only emails matching your configured filters (sender domains, keywords like "credit card statement") are accessed.
- Attachments are downloaded to your local data/raw_pdfs/ directory.
- No email content is stored or transmitted externally.
- You can revoke access at any time from Google Account Permissions.

Gmail is entirely optional. You can drop PDFs into data/raw_pdfs/<bank>/ manually and run stmtforge run --local.
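To make the filtering concrete, here is a minimal sketch of how sender domains and keywords could be combined into a Gmail search string, the kind of value passed as the `q` parameter when listing messages. The operators used (`has:attachment`, `filename:`, `from:`, `newer_than:`) are standard Gmail search operators. The function name and the example domains are placeholders, not StmtForge's actual configuration.

```python
def build_gmail_query(sender_domains, keywords, newer_than_days=None):
    """Build a Gmail search string for emails that carry PDF attachments."""
    parts = ["has:attachment", "filename:pdf"]
    if sender_domains:
        senders = " OR ".join(f"from:{domain}" for domain in sender_domains)
        parts.append(f"({senders})")
    if keywords:
        # Quote each keyword so multi-word phrases are matched exactly.
        phrases = " OR ".join(f'"{keyword}"' for keyword in keywords)
        parts.append(f"({phrases})")
    if newer_than_days:
        parts.append(f"newer_than:{newer_than_days}d")
    return " ".join(parts)


# Example domains are placeholders for whatever your config lists.
query = build_gmail_query(
    ["hdfcbank.net", "icicibank.com"], ["credit card statement"], newer_than_days=90
)
print(query)
```

Narrowing the query on the server side means only matching messages are ever fetched, which fits the read-only, minimal-access approach described above.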
Security Practices
| Practice | Implementation |
|---|---|
| PDF passwords | Loaded from .env into memory; never written to logs, the database, or any other file |
| Log redaction | RedactionFilter strips emails, phones, PAN numbers, card numbers from every log line |
| Privacy logging | HMAC pseudonymization for event logs (DPDP-aligned) |
| OAuth tokens | Stored locally (token.json); git-ignored by default |
| Sensitive config | config.yaml and .env are git-ignored; only sanitized templates are committed |
| Temporary files | Unlocked PDFs are kept in data/unlocked_pdfs/; no stray temp files |
| Dependencies | Minimum versions pinned; no unnecessary or exotic packages |
For full details and how to report vulnerabilities, see SECURITY.md.
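As an illustration of the log redaction and pseudonymization rows in the table, the sketch below shows one way a `logging.Filter` can scrub PII and how HMAC can turn an identifier into a stable pseudonym. The regular expressions, the class name `SketchRedactionFilter`, and the 16-character truncation are assumptions for this example. The project's own RedactionFilter and salt handling may differ.

```python
import hashlib
import hmac
import logging
import re

# Illustrative patterns only. Order matters: card numbers are checked before phones.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"), "[PAN]"),        # PAN: 5 letters, 4 digits, 1 letter
    (re.compile(r"\b\d(?:[ -]?\d){12,18}\b"), "[CARD]"),        # 13 to 19 digits, optional separators
    (re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)"), "[PHONE]"),
]


def redact(text):
    """Replace every match of the patterns above with its placeholder label."""
    for pattern, label in PATTERNS:
        text = pattern.sub(label, text)
    return text


class SketchRedactionFilter(logging.Filter):
    """Redact the fully formatted message before any handler sees it."""

    def filter(self, record):
        record.msg = redact(record.getMessage())  # merges args into the message first
        record.args = ()
        return True


def pseudonymize(value, salt):
    """Return a keyed, repeatable pseudonym for value (salt is a bytes secret)."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


logger = logging.getLogger("stmtforge.sketch")
logger.addFilter(SketchRedactionFilter())
```

Because the pseudonym is keyed by the salt, the same card or email maps to the same token across log lines, but the token cannot be recomputed by someone who does not hold the salt.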
Supported Banks & Formats
StmtForge includes dedicated parsers for the following banks. Other formats fall back to the generic parser + LLM extraction.
| Bank | Parser | Status |
|---|---|---|
| HDFC Bank | hdfc_parser | Tested |
| ICICI Bank | icici_parser | Tested |
| SBI Card | sbi_parser | Tested |
| Axis Bank | axis_parser | Tested |
| Kotak Mahindra | kotak_parser | Tested |
| Yes Bank | yes_parser | Tested |
| CSB Bank | csb_parser | Tested |
| Federal Bank | federal_parser | Tested |
| IDFC First Bank | idfc_first_parser | Tested |
| (other) | generic_parser + LLM | Best-effort |
Note: Statement formats change over time. If a parser produces incorrect results for a recent statement, please open an issue.
How It Works
PDF ─► Unlock ─► Deterministic Parser ─► Multi-Stage Extraction ─► LLM ─► Validation ─► SQLite
- PDF Unlock — Tries password combinations (DOB, PAN, custom) via pikepdf.
- Deterministic Parser — Bank-specific regex parser runs first. If it finds 3 or more transactions, extraction stops here and the fallback stages are skipped.
- Multi-Stage Extraction (fallback):
  - Stage 1 — Table extraction (pdfplumber)
  - Stage 2 — Layout text (pdfplumber / pdftotext)
  - Stage 3 — OCR (pdf2image + Tesseract, optional)
- LLM Structuring — Local Ollama with primary → hard-mode → validation prompts.
- Validation — Date normalization, amount bounds, deduplication, confidence scoring.
- Categorization — Rule-based merchant classification.
- Storage — SQLite with transaction-level deduplication.
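The validation and storage steps can be sketched with the standard library alone. The example below normalizes dates, applies an amount sanity bound, and relies on a hashed key plus `INSERT OR IGNORE` so that re-running the pipeline does not duplicate rows. The table layout, the accepted date formats, the bound, and the key recipe are assumptions for illustration, not StmtForge's actual schema.

```python
import hashlib
import sqlite3
from datetime import datetime

DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d %b %Y", "%d/%m/%y")  # assumed input formats
MAX_AMOUNT = 10_000_000  # assumed sanity bound, in rupees


def normalize_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def txn_key(card, date_iso, description, amount):
    """Hash the fields that identify a transaction (hypothetical recipe)."""
    raw = f"{card}|{date_iso}|{description.strip().upper()}|{amount:.2f}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def store(conn, card, rows):
    """Validate (date, description, amount) rows and insert the new ones."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions ("
        "txn_key TEXT PRIMARY KEY, card TEXT, txn_date TEXT, "
        "description TEXT, amount REAL)"
    )
    inserted = 0
    for date_text, description, amount in rows:
        date_iso = normalize_date(date_text)
        if date_iso is None or not (0 < abs(amount) <= MAX_AMOUNT):
            continue  # drop rows that fail validation
        cursor = conn.execute(
            "INSERT OR IGNORE INTO transactions VALUES (?, ?, ?, ?, ?)",
            (txn_key(card, date_iso, description, amount),
             card, date_iso, description, amount),
        )
        inserted += cursor.rowcount  # 0 when the key already exists
    conn.commit()
    return inserted
```

One trade-off of keying on these fields: two genuinely separate purchases with the same card, date, merchant text, and amount would collapse into one row, so a real implementation may need an extra discriminator such as a position within the statement.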
Project Structure
stmt-forge/
├── src/stmtforge/ # Package source (src-layout)
│ ├── __init__.py
│ ├── cli.py # CLI entry point
│ ├── run_pipeline.py # Pipeline orchestrator
│ ├── hybrid_pipeline.py # Hybrid extraction engine
│ ├── config_template.yaml # Default config template
│ ├── database/ # SQLite layer
│ ├── dashboard/ # Streamlit app
│ ├── extractor/ # Multi-stage text extraction
│ ├── gmail/ # Gmail OAuth & fetcher
│ ├── llm/ # Ollama client & prompts
│ ├── parsers/ # Bank-specific parsers
│ ├── pdf_processing/ # PDF unlock & extraction
│ ├── utils/ # Config, logging, privacy, hashing
│ └── validator/ # Transaction validation
├── tests/ # Test suite
├── .github/workflows/ # CI (GitHub Actions)
├── pyproject.toml # Build configuration
├── .env.example # Environment variable template
├── LICENSE # MIT
├── SECURITY.md # Security policy
├── CONTRIBUTING.md # Contributor guide
├── CODE_OF_CONDUCT.md # Community standards
└── README.md # This file
Adding a New Bank Parser
# src/stmtforge/parsers/mybank_parser.py
from stmtforge.parsers.base_parser import BaseParser, parse_date, parse_amount


class MyBankParser(BaseParser):
    BANK_NAME = "mybank"

    def parse(self, pdf_path):
        records = [...]  # Extract transactions
        return self._get_standard_df(records)
Then register the parser in src/stmtforge/parsers/registry.py and add email / filename
mappings to config_template.yaml. See CONTRIBUTING.md for
details.
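For a sense of what could stand in for the `records = [...]` placeholder, here is a minimal regex-based sketch that works on already-extracted statement text. The line layout, the record keys, the sign convention for credits, and the helper names are hypothetical. A real parser has to match the bank's actual format and the fields that `_get_standard_df` expects.

```python
import re

# Hypothetical line layout: "DD/MM/YYYY  MERCHANT TEXT  1,234.56 [Cr]"
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<desc>.+?)\s+"
    r"(?P<amount>[\d,]+\.\d{2})(?:\s*(?P<credit>Cr))?\s*$",
    re.IGNORECASE,
)
MIN_TRANSACTIONS = 3  # threshold mentioned under "How It Works"


def extract_records(text):
    """Return one dict per line that matches the transaction layout."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # headers, totals, and other non-transaction lines
        amount = float(match.group("amount").replace(",", ""))
        if match.group("credit"):
            amount = -amount  # assumed convention: credits are negative
        records.append(
            {
                "date": match.group("date"),
                "description": match.group("desc"),
                "amount": amount,
            }
        )
    return records


def needs_fallback(records):
    """True when too few transactions were found to trust the result."""
    return len(records) < MIN_TRANSACTIONS


sample = """Date Description Amount
05/03/2024 AMAZON PAY INDIA 1,299.00
06/03/2024 PAYMENT RECEIVED 5,000.00 Cr
07/03/2024 SWIGGY BANGALORE 349.50"""
records = extract_records(sample)
```

Anchoring the pattern at both ends of the line keeps header and total rows from being picked up as transactions, at the cost of missing entries that wrap across lines.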
Testing
pytest # run full suite
pytest -v # verbose output
pytest tests/test_scope_filter.py # single file
Configuration
stmtforge init copies a sanitized config_template.yaml into your project
directory as config.yaml. Key sections:
| Section | Purpose |
|---|---|
| gmail | Sender domains, search keywords, attachment filters |
| credit_cards | Your banks and card names |
| pdf_passwords | Password patterns (auto-filled from .env) |
| parsers | Email → bank mapping, filename → bank mapping, card identifiers |
| categories | Merchant → category rules |
| database | SQLite path |
| llm | Ollama model, URL, temperature |
| privacy_logging | Retention period, pseudonymization salt |
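To illustrate how merchant-to-category rules like those in the categories section might be applied, here is a small keyword-matching sketch. The rule set, the first-match-wins behavior, and the "Uncategorized" default are assumptions for this example. The actual rule format in config.yaml is not shown here.

```python
# Hypothetical rules: category -> keywords looked for in the merchant text.
RULES = {
    "Food": ("swiggy", "zomato"),
    "Shopping": ("amazon", "flipkart"),
    "Fuel": ("hpcl", "indian oil"),
}


def categorize(description, rules=RULES, default="Uncategorized"):
    """Return the first category with a keyword found in the description."""
    text = description.lower()
    for category, keywords in rules.items():
        if any(keyword in text for keyword in keywords):
            return category
    return default


print(categorize("SWIGGY BANGALORE"))  # Food
print(categorize("UNKNOWN MERCHANT"))  # Uncategorized
```

Since the first matching category wins, rule order matters when a merchant string could match more than one category.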
Disclaimer
This tool is intended for personal use and convenience.
While care has been taken to ensure accuracy:
- Parsing errors may occur depending on statement format changes by banks.
- Users should verify extracted data before making financial decisions.
- This is not a bank-grade or auditor-certified system.
- The authors assume no liability for incorrect transaction data.
Contributing
We welcome contributions of all kinds — bug reports, new bank parsers, documentation improvements, and code fixes. See CONTRIBUTING.md for guidelines.
License
MIT. See the LICENSE file for the full text.