Parse credit card PDF statements from Indian banks (HDFC, ICICI, SBI, Axis, Kotak + 4 more) into structured data. Offline, privacy-first, with Streamlit dashboard.

StmtForge — Credit Card Statement Parser & Analyzer

Open-source, offline-first Python tool to parse credit card PDF statements from Indian banks into structured data.

Python 3.11+ · License: MIT

Install · Quick Start · Supported Banks · Dashboard · Docs


Why StmtForge?

Indian bank credit card statements are password-protected PDFs with inconsistent formats — making expense tracking painful. StmtForge solves this:

  • Parse PDF statements from Indian banks (Currently implemented: HDFC, ICICI, SBI, Axis, Kotak, Yes, CSB, Federal, IDFC First)
  • 100% offline — no data leaves your machine, no cloud APIs, no telemetry
  • Hybrid extraction — deterministic regex parsers → table extraction → OCR → local LLM (Ollama)
  • One command: pip install stmtforge, then start analyzing your credit card spend

Built for anyone in India who wants to track credit card expenses without trusting third-party apps with their financial data.


Dashboard Preview

StmtForge includes a Streamlit analytics dashboard with interactive charts, filters, and CSV export.

StmtForge Dashboard — monthly spend trend, category breakdown, top merchants, bank-wise breakdown

Analytics: total spend, monthly trends, category breakdown, top merchants, bank & card comparison, daily heatmap, drill-downs


Key Features

| Feature | Description |
| --- | --- |
| 9 bank-specific parsers | Dedicated parsers for HDFC, ICICI, SBI, Axis, Kotak, Yes, CSB, Federal, IDFC First |
| PDF unlock & parse | Auto-decrypts password-protected statements (DOB, PAN, custom patterns) |
| Hybrid extraction pipeline | Deterministic → table → OCR → local LLM fallback chain |
| Local LLM via Ollama | Qwen / Mistral / Llama3 for unstructured statement parsing |
| Gmail auto-fetch | Read-only OAuth2; downloads statement PDFs from Gmail automatically |
| Multi-card tracking | Track spend across multiple cards and banks |
| Auto-categorization | Rule-based merchant classification (Shopping, Food, Travel, EMI, etc.) |
| Transaction deduplication | Hash-based dedup with incremental processing |
| Streamlit dashboard | Interactive Plotly charts, sidebar filters, CSV export |
| Privacy-first design | PII redacted from logs, HMAC pseudonymization, DPDP-aligned |
| CLI interface | stmtforge run, stmtforge dashboard, stmtforge init |
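
To make the hash-based deduplication row concrete, here is a minimal sketch of the general idea. The field names (date, description, amount, card), the SHA-256 choice, and both function names are assumptions for illustration; the source does not show StmtForge's actual schema or hashing scheme.

```python
import hashlib


def transaction_key(txn):
    """Build a stable fingerprint from the fields that identify a transaction."""
    parts = [
        txn["date"],                                   # ISO date string
        " ".join(txn["description"].split()).upper(),  # collapse whitespace, ignore case
        f"{float(txn['amount']):.2f}",                 # fixed two-decimal amount
        txn["card"],
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(transactions, seen=None):
    """Return only transactions whose fingerprint is not already in `seen`.

    Reusing the same `seen` set across runs gives incremental behaviour:
    re-parsing an old statement contributes nothing new.
    """
    seen = set() if seen is None else seen
    fresh = []
    for txn in transactions:
        key = transaction_key(txn)
        if key not in seen:
            seen.add(key)
            fresh.append(txn)
    return fresh
```

One limitation of this simplified key: two genuinely separate purchases with the same date, merchant, amount, and card would collapse into one, so a real implementation would likely need an extra discriminator.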

Installation

From PyPI

pip install stmtforge

Optional extras:

pip install "stmtforge[gmail]"      # Gmail fetch support
pip install "stmtforge[ocr]"        # OCR fallback support
pip install "stmtforge[all]"        # Gmail + OCR extras

From Source

git clone https://github.com/madhav921/stmt-forge.git
cd stmt-forge
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # macOS / Linux
pip install -e ".[dev]"             # developer tools only
pip install -e ".[dev,all]"         # developer tools + Gmail + OCR extras

Requirements

| Requirement | Purpose |
| --- | --- |
| Python 3.11+ | Runtime |
| Ollama (optional) | Local LLM for unstructured PDF parsing |
| Google Cloud project (optional) | Gmail API; not needed for manual PDF import |
| Tesseract OCR binary (optional) | Required by OCR fallback when stmtforge[ocr] is installed |
| qpdf (optional) | Fallback PDF decryption |

Quick Start

# 1. Set up project
mkdir ~/my-statements && cd ~/my-statements
stmtforge init          # creates config.yaml, .env.example, data/

# 2. Configure PDF passwords
cp .env.example .env    # then edit .env with your passwords

# 3. (Optional) Set up local LLM
ollama pull qwen2.5:3b

# 4. Run the pipeline
stmtforge run --local               # parse local PDFs
stmtforge run --full                 # Gmail fetch + parse
stmtforge run --folder path/to/pdfs  # specific folder

# 5. View insights
stmtforge dashboard

Manual PDF import: Drop PDFs into data/raw_pdfs/<bank>/ and run stmtforge run --local. No Gmail setup needed.
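
As an illustration of the data/raw_pdfs/<bank>/ layout, the sketch below walks such a folder and groups PDF files by bank sub-folder. The function name discover_statements is hypothetical, and this is not how StmtForge itself scans the directory; it only shows what the layout implies.

```python
from pathlib import Path


def discover_statements(root):
    """Map each bank folder name under `root` to its sorted list of PDF paths."""
    found = {}
    for bank_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        pdfs = sorted(
            p for p in bank_dir.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"  # accept .pdf and .PDF
        )
        if pdfs:  # skip bank folders with no statements
            found[bank_dir.name] = pdfs
    return found
```

Calling discover_statements("data/raw_pdfs") before a run is a quick way to check that every statement sits under the bank folder you expect.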


Supported Banks

| Bank | Parser | Card Detection |
| --- | --- | --- |
| HDFC Bank | hdfc_parser | Swiggy, Tata Neu, Millennia, etc. |
| ICICI Bank | icici_parser | Amazon Pay, Coral, Platinum, etc. |
| SBI Card | sbi_parser | Cashback, Elite, SimplyCLICK, etc. |
| Axis Bank | axis_parser | Neo, Flipkart, Ace, etc. |
| Kotak Mahindra | kotak_parser | 811, League Platinum, etc. |
| Yes Bank | yes_parser | Marquee, Prosperity, etc. |
| CSB Bank | csb_parser | Edge, etc. |
| Federal Bank | federal_parser | Signet, Scapia, etc. |
| IDFC First Bank | idfc_first_parser | First Select, Classic, WOW, etc. |
| Any other bank | generic_parser + LLM | Auto-detected |

Statement formats change over time. Open an issue if a parser produces incorrect results.
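
The bank-to-parser mapping with a generic fallback can be pictured as a simple lookup. In this sketch the parser names come from the table above, but the filename keywords and the pick_parser function are assumptions made for illustration; the real mapping lives in config.yaml and the parser registry.

```python
# Hypothetical filename keywords mapped to the parser names listed above.
FILENAME_HINTS = {
    "hdfc": "hdfc_parser",
    "icici": "icici_parser",
    "sbi": "sbi_parser",
    "axis": "axis_parser",
    "kotak": "kotak_parser",
    "idfc": "idfc_first_parser",
}


def pick_parser(filename):
    """Return the first parser whose keyword appears in the filename."""
    name = filename.lower()
    for keyword, parser in FILENAME_HINTS.items():
        if keyword in name:
            return parser
    # Unknown banks fall through to the generic parser.
    return "generic_parser"
```

Plain substring matching is deliberately naive here: a short keyword such as "yes" would match many unrelated filenames, which is one reason a real mapping would also look at sender domains or statement contents.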


How It Works

PDF → Unlock → Bank Parser → Table Extraction → OCR → LLM → Validate → SQLite → Dashboard
  1. PDF Unlock — Tries password combos (DOB, PAN, custom) via pikepdf
  2. Bank Parser — Bank-specific regex parser extracts transactions directly
  3. Fallback Chain — Table extraction (pdfplumber) → Layout text → OCR (Tesseract) → Local LLM (Ollama)
  4. Validation — Date normalization, amount bounds, dedup, confidence scoring
  5. Categorization — Rule-based merchant → category mapping
  6. Storage — SQLite with transaction-level deduplication and incremental processing
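
The fallback chain in step 3 follows a common pattern: try each extraction stage in order and stop at the first one that yields usable rows. The sketch below shows that pattern only. The run_chain function, the stage names, and the min_rows threshold are assumptions, and the stages are passed in as plain callables rather than real pdfplumber, Tesseract, or Ollama calls.

```python
def run_chain(pdf_path, stages, min_rows=1):
    """Try each (name, extractor) pair in order; return the first usable result.

    Each extractor takes a PDF path and returns a list of transaction rows.
    """
    for name, extract in stages:
        try:
            rows = extract(pdf_path)
        except Exception:
            # A stage that crashes is treated like one that found nothing.
            continue
        if rows and len(rows) >= min_rows:
            return name, rows
    # No stage produced enough rows.
    return None, []
```

Ordering the stages from cheapest and most deterministic to slowest and least predictable means the LLM only runs when the regex, table, and OCR stages have all come back empty.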

Privacy & Security

StmtForge is built around a local-first, zero-upload architecture.

| Aspect | Details |
| --- | --- |
| Processing | 100% local; no cloud, no external APIs |
| Storage | Local SQLite + local files only |
| Telemetry | None; no analytics, no phone-home |
| Log privacy | PII auto-redacted (emails, phones, PAN, card numbers) |
| PDF passwords | .env → memory only; never logged or stored in DB |
| Gmail | Optional, read-only OAuth2; revoke anytime at Google Permissions |
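
Log redaction and HMAC pseudonymization can be sketched with the standard library alone. The regular expressions, placeholder tokens, and the names redact, RedactingFilter, and pseudonymize below are all assumptions for illustration and are not StmtForge's actual patterns or API.

```python
import hashlib
import hmac
import logging
import re

# Illustrative patterns only; real-world PII matching needs more care.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"), "[PAN]"),          # PAN: 5 letters, 4 digits, 1 letter
    (re.compile(r"\b(?:\d[ -]?){12,18}\d\b"), "[CARD]"),          # 13 to 19 digits, optional separators
    (re.compile(r"(?:\+91[ -]?)?\b[6-9]\d{9}\b"), "[PHONE]"),     # 10-digit Indian mobile number
]


def redact(text):
    """Replace every match of each pattern with its placeholder token."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message."""

    def filter(self, record):
        record.msg = redact(record.getMessage())
        record.args = ()  # message is already formatted
        return True


def pseudonymize(value, key):
    """Return a keyed, deterministic stand-in for `value` (HMAC-SHA256, truncated)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
```

Attaching the filter to a handler (handler.addFilter(RedactingFilter())) redacts every record that handler emits. The keyed hash lets records for the same card be grouped without storing the card number, as long as the key stays secret.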

See SECURITY.md for vulnerability reporting and full security policy.


Configuration

stmtforge init creates a config.yaml with these sections:

| Section | Purpose |
| --- | --- |
| gmail | Sender domains, search keywords, attachment filters |
| credit_cards | Your banks and card names |
| pdf_passwords | Password patterns (from .env) |
| parsers | Email/filename → bank mapping, card identifiers |
| categories | Merchant → category rules |
| database | SQLite path |
| llm | Ollama model, URL, temperature |
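
The merchant → category rules in the categories section amount to keyword matching. The source does not show the YAML layout, so the sketch below uses a hypothetical in-memory RULES dictionary and a hypothetical categorize function; only the category names (Food, Shopping, Travel, EMI) come from this README.

```python
import re

# Hypothetical rules: category -> merchant keywords, checked in insertion order.
RULES = {
    "Food": ["swiggy", "zomato"],
    "Shopping": ["amazon", "flipkart"],
    "Travel": ["irctc", "makemytrip"],
    "EMI": ["emi"],
}


def categorize(merchant, rules=RULES, default="Other"):
    """Return the first category with a keyword that appears as a whole word."""
    text = merchant.lower()
    for category, keywords in rules.items():
        for keyword in keywords:
            # Word boundaries keep "emi" from matching inside "premium".
            if re.search(rf"\b{re.escape(keyword)}\b", text):
                return category
    return default
```

Because the first matching category wins, rule order matters: put the more specific categories earlier in the mapping.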

Adding a New Bank Parser

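from stmtforge.parsers.base_parser import BaseParser, parse_date, parse_amount

class MyBankParser(BaseParser):
    # Identifier for the bank this parser handles
    BANK_NAME = "mybank"

    def parse(self, pdf_path):
        # Extract transactions here; parse_date and parse_amount are
        # imported for converting raw date and amount strings.
        records = [...]
        return self._get_standard_df(records)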

Register in src/stmtforge/parsers/registry.py and add mappings in config_template.yaml. See CONTRIBUTING.md for details.
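
For a feel of what the record-extraction step inside parse might involve, here is a self-contained sketch that turns statement text lines into records with a regular expression. The line format (DD/MM/YYYY, description, amount, optional "Cr"), the record keys, the negative-credit convention, and the helpers to_iso_date and to_amount are all assumptions; they stand in for the package's parse_date and parse_amount, whose signatures the source does not show.

```python
import re
from datetime import datetime
from decimal import Decimal

# Assumed line shape: "05/03/2024 SWIGGY BANGALORE 1,250.00" with optional " Cr".
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<desc>.+?)\s+"
    r"(?P<amount>[\d,]+\.\d{2})(?:\s+(?P<cr>Cr))?$"
)


def to_iso_date(text):
    """Convert DD/MM/YYYY to an ISO YYYY-MM-DD string."""
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()


def to_amount(text, is_credit):
    """Strip thousands separators; credits become negative (an assumed convention)."""
    value = Decimal(text.replace(",", ""))
    return -value if is_credit else value


def extract_records(page_text):
    """Return one record per line that matches the assumed transaction format."""
    records = []
    for line in page_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # headers, totals, and other non-transaction lines
        records.append({
            "date": to_iso_date(match["date"]),
            "description": match["desc"],
            "amount": to_amount(match["amount"], match["cr"] is not None),
        })
    return records
```

In a real parser, a list like this would be what gets passed to self._get_standard_df(records), with the text coming from whichever PDF extraction step the parser uses.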


Project Structure

stmt-forge/
├── src/stmtforge/           # Package source
│   ├── cli.py               # CLI entry point
│   ├── run_pipeline.py      # Pipeline orchestrator
│   ├── hybrid_pipeline.py   # Hybrid extraction engine
│   ├── parsers/             # 9 bank parsers + generic + categorizer
│   ├── dashboard/           # Streamlit analytics app
│   ├── pdf_processing/      # PDF unlock & text extraction
│   ├── llm/                 # Ollama client & prompts
│   ├── gmail/               # Gmail OAuth & fetcher
│   ├── database/            # SQLite layer
│   ├── validator/           # Transaction validation
│   └── utils/               # Config, logging, privacy, hashing
├── tests/                   # Test suite
├── pyproject.toml           # Build config
└── README.md

Contributing

Bug reports, new bank parsers, and code fixes welcome. See CONTRIBUTING.md.

License

MIT
