
Parse credit card PDF statements from Indian banks (HDFC, ICICI, SBI, Axis, Kotak + 4 more) into structured data. Offline, privacy-first, with Streamlit dashboard.


StmtForge — Credit Card Statement Parser for Indian Banks


Open-source, offline-first Python tool to parse credit card PDF statements from Indian banks into structured data.




Why StmtForge?

Indian bank credit card statements are password-protected PDFs with inconsistent formats — making expense tracking painful. StmtForge solves this:

  • Parse PDF statements from Indian banks (Currently implemented: HDFC, ICICI, SBI, Axis, Kotak, Yes, CSB, Federal, IDFC First)
  • 100% offline — no data leaves your machine, no cloud APIs, no telemetry
  • Hybrid extraction — deterministic regex parsers → table extraction → OCR → local LLM (Ollama)
  • One command: pip install stmtforge, then start analyzing your credit card spend

Built for anyone in India who wants to track credit card expenses without trusting third-party apps with their financial data.


Dashboard Preview

StmtForge includes a Streamlit analytics dashboard with interactive charts, filters, and CSV export.

StmtForge Dashboard — monthly spend trend, category breakdown, top merchants, bank-wise breakdown

Analytics: total spend, monthly trends, category breakdown, top merchants, bank & card comparison, daily heatmap, drill-downs

Card Optimizer

The Card Optimizer page provides advanced credit card selection intelligence:

  • Scope selector — filter by All Market Cards, My Cards Only, or a custom selection
  • Card recommendations — rank all cards by net annual value against your actual spend profile
  • Spend Profile with per-transaction advice — for every debit transaction, see which of your cards and which market card earns the most rewards
  • N-card portfolio — use the slider to pick the number of cards (1–8); a greedy algorithm finds the combination that maximises combined net annual rewards while minimising overlap
  • Simulator — adjust hypothetical spend amounts category-by-category and watch rankings update live
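
The greedy N-card selection described above can be sketched as follows. This is a minimal illustration, not StmtForge's actual best_n_card_combo: the Card class, the per-category reward rates, and the rule that each category's spend goes to the best card held are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Card:
    """Hypothetical card model: a flat annual fee plus per-category reward rates."""
    name: str
    annual_fee: float
    rates: dict[str, float]  # category -> reward rate, e.g. 0.05 for 5%


def portfolio_value(cards: list[Card], spend: dict[str, float]) -> float:
    """Net annual value when each category is routed to the best card held."""
    rewards = sum(
        amount * max((c.rates.get(cat, 0.0) for c in cards), default=0.0)
        for cat, amount in spend.items()
    )
    return rewards - sum(c.annual_fee for c in cards)


def greedy_portfolio(cards: list[Card], spend: dict[str, float], n: int) -> list[Card]:
    """Add the card with the largest marginal gain until n cards are held
    or no remaining card improves the portfolio."""
    chosen: list[Card] = []
    remaining = list(cards)
    while remaining and len(chosen) < n:
        base = portfolio_value(chosen, spend)
        best = max(remaining, key=lambda c: portfolio_value(chosen + [c], spend))
        if portfolio_value(chosen + [best], spend) <= base:
            break  # a card that only overlaps existing benefits adds nothing
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Because every candidate is scored by its marginal gain over the cards already chosen, a card whose best categories are already covered is skipped. A greedy pass like this is fast but is not guaranteed to find the best possible combination.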

Key Features

| Feature | Description |
| --- | --- |
| 9 bank-specific parsers | Dedicated parsers for HDFC, ICICI, SBI, Axis, Kotak, Yes, CSB, Federal, IDFC First |
| PDF unlock & parse | Auto-decrypts password-protected statements (DOB, PAN, custom patterns) |
| Hybrid extraction pipeline | Deterministic → table → OCR → local LLM fallback chain |
| Local LLM via Ollama | Qwen / Mistral / Llama3 for unstructured statement parsing |
| Gmail auto-fetch | Read-only OAuth2; downloads statement PDFs from Gmail automatically |
| User card registry | Auto-detects your cards from transaction history and stores them in user_cards DB table |
| Per-transaction card advisor | For each debit transaction, shows which of your cards and which market card earns the most |
| N-card portfolio optimizer | Greedy algorithm finds the optimal set of N cards that maximise combined annual rewards |
| Scope selector | Card Optimizer supports All Market Cards / My Cards Only / Custom Selection scopes |
| Multi-card tracking | Track spend across multiple cards and banks |
| Auto-categorization | Rule-based merchant classification (Shopping, Food, Travel, EMI, etc.) |
| Transaction deduplication | Hash-based dedup with incremental processing |
| Streamlit dashboard | Interactive Plotly charts, sidebar filters, CSV export |
| Privacy-first design | PII redacted from logs, HMAC pseudonymization, DPDP-aligned |
| CLI interface | stmtforge run, stmtforge dashboard, stmtforge init |
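
To illustrate the hash-based deduplication listed above, here is a minimal sketch. The record fields (bank, card, date, amount, description) and the choice of SHA-256 are assumptions for the example; the fields StmtForge actually hashes may differ.

```python
import hashlib


def txn_hash(txn: dict) -> str:
    """Stable fingerprint built from the fields that identify a transaction."""
    key = "|".join([
        txn["bank"].strip().lower(),
        txn["card"].strip().lower(),
        txn["date"],                                   # assumed already in ISO format
        f'{float(txn["amount"]):.2f}',                 # 450 and "450.00" hash the same
        " ".join(txn["description"].lower().split()),  # collapse case and whitespace
    ])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def dedupe(new_txns: list[dict], seen: set[str]) -> list[dict]:
    """Return only transactions whose hash is not in `seen`, updating `seen`."""
    fresh = []
    for txn in new_txns:
        h = txn_hash(txn)
        if h not in seen:
            seen.add(h)
            fresh.append(txn)
    return fresh
```

Keeping the set of seen hashes between runs is what makes processing incremental: re-importing an already processed statement yields no new rows. One trade-off of this simple key is that two genuinely identical purchases on the same day would collapse into one.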

Public Card Rules Database

StmtForge now uses a separate, public rules repository for card benefits data:

  • Repository: https://github.com/madhav921/stmtforge-cards-db
  • Purpose: transparent, versioned, community-contributable card rules
  • Runtime model: stmtforge init clones the rules repo locally into data/cards-db, then seeds data/cards/ for offline evaluation

The core package is the only PyPI install target. Rules updates happen through the local git clone.

See architecture blueprint: docs/PRODUCTION_ARCHITECTURE.md


Installation

From PyPI

pip install stmtforge

Optional extras:

pip install "stmtforge[gmail]"      # Gmail fetch support
pip install "stmtforge[ocr]"        # OCR fallback support
pip install "stmtforge[all]"        # Gmail + OCR extras

From Source

git clone https://github.com/madhav921/stmt-forge.git
cd stmt-forge
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # macOS / Linux
pip install -e ".[dev]"             # developer tools only
pip install -e ".[dev,all]"         # developer tools + Gmail + OCR extras

Requirements

| Requirement | Purpose |
| --- | --- |
| Python 3.11+ | Runtime |
| Ollama (optional) | Local LLM for unstructured PDF parsing |
| Google Cloud project (optional) | Gmail API (not needed for manual PDF import) |
| Tesseract OCR binary (optional) | Required by OCR fallback when stmtforge[ocr] is installed |
| qpdf (optional) | Fallback PDF decryption |

Quick Start

# 1. Set up project
mkdir ~/my-statements && cd ~/my-statements
stmtforge init          # creates config.yaml, .env.example, data/, and clones the cards repo

# 2. Configure PDF passwords
cp .env.example .env    # then edit .env with your passwords

# 3. (Optional) Set up local LLM
ollama pull qwen2.5:3b

# 4. Run the pipeline
stmtforge run --local               # parse local PDFs
stmtforge run --full                 # Gmail fetch + parse
stmtforge run --folder path/to/pdfs  # specific folder

# 5. View insights
stmtforge dashboard

To refresh the local rules snapshot later:

git -C data/cards-db pull
stmtforge init

Manual PDF import: Drop PDFs into data/raw_pdfs/<bank>/ and run stmtforge run --local. No Gmail setup needed.

Where do my PDFs go?
Place your password-protected statement PDFs inside data/raw_pdfs/<bank>/ — for example:

  • data/raw_pdfs/hdfc/5268XXXXXXXXXX38_19-11-2025.pdf
  • data/raw_pdfs/sbi/7411XXXXXXXXXXXX_15062024.pdf
  • data/raw_pdfs/idfc/601000XXXXXXXX_24112025_110414520.pdf

The <bank> folder name must match one of the supported bank keys: hdfc, sbi, icici, axis, kotak, yes, csb, federal, idfc. StmtForge will auto-unlock and parse all PDFs found recursively. You can also sub-organise by month (data/raw_pdfs/hdfc/2025_11/) — both flat and nested layouts are supported.
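
The folder convention above can be sketched with pathlib. This is only an illustration of the layout rules (bank key as the top-level folder, recursive search below it); discover_pdfs is a hypothetical helper, not a StmtForge function.

```python
from pathlib import Path

# Bank keys as listed in this README.
BANK_KEYS = {"hdfc", "sbi", "icici", "axis", "kotak", "yes", "csb", "federal", "idfc"}


def discover_pdfs(root: Path) -> dict[str, list[Path]]:
    """Map each supported bank key to the PDFs found under root/<bank>/, at any depth."""
    found: dict[str, list[Path]] = {}
    if not root.is_dir():
        return found
    for bank_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if bank_dir.name not in BANK_KEYS:
            continue  # folders that are not a supported bank key are ignored
        pdfs = sorted(
            p for p in bank_dir.rglob("*")
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
        if pdfs:
            found[bank_dir.name] = pdfs
    return found
```

Calling discover_pdfs(Path("data/raw_pdfs")) would pick up both data/raw_pdfs/hdfc/statement.pdf and data/raw_pdfs/hdfc/2025_11/statement.pdf, which matches the flat and nested layouts described above.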


Supported Banks

| Bank | Parser | Card Detection |
| --- | --- | --- |
| HDFC Bank | hdfc_parser | Swiggy, Tata Neu, Millennia, etc. |
| ICICI Bank | icici_parser | Amazon Pay, Coral, Platinum, etc. |
| SBI Card | sbi_parser | Cashback, Elite, SimplyCLICK, etc. |
| Axis Bank | axis_parser | Neo, Flipkart, Ace, etc. |
| Kotak Mahindra | kotak_parser | 811, League Platinum, etc. |
| Yes Bank | yes_parser | Marquee, Prosperity, etc. |
| CSB Bank | csb_parser | Edge, etc. |
| Federal Bank | federal_parser | Signet, Scapia, etc. |
| IDFC First Bank | idfc_first_parser | First Select, Classic, WOW, etc. |
| Any other bank | generic_parser + LLM | Auto-detected |

Statement formats change over time. Open an issue if a parser produces incorrect results.


How It Works

PDF → Unlock → Bank Parser → Table Extraction → OCR → LLM → Validate → SQLite → Dashboard
  1. PDF Unlock — Tries password combos (DOB, PAN, custom) via pikepdf
  2. Bank Parser — Bank-specific regex parser extracts transactions directly
  3. Fallback Chain — Table extraction (pdfplumber) → Layout text → OCR (Tesseract) → Local LLM (Ollama)
  4. Validation — Date normalization, amount bounds, dedup, confidence scoring
  5. Categorization — Rule-based merchant → category mapping
  6. Storage — SQLite with 6 tables: transactions, statements_metadata, user_cards, gmail_messages, extraction_log, pipeline_state
  7. User Card Registry: user_cards table auto-populated from transaction history; maps card name fragments to 29 YAML card definitions
  8. Card Optimizer: CardAdvisor and best_n_card_combo consume the user_cards registry for scope-aware recommendations
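
The fallback chain in steps 2 and 3 can be pictured as a loop over extraction stages that stops at the first stage returning any rows. The sketch below uses stand-in callables rather than the real pdfplumber, Tesseract, or Ollama integrations, and run_chain is a hypothetical name, not part of StmtForge's API.

```python
from typing import Callable

# Each stage takes a PDF path and returns a list of transaction records.
Extractor = Callable[[str], list[dict]]


def run_chain(pdf_path: str, stages: list[tuple[str, Extractor]]) -> tuple[str, list[dict]]:
    """Try each stage in order; return the name and rows of the first that succeeds."""
    for name, extract in stages:
        try:
            rows = extract(pdf_path)
        except Exception:
            # A failing stage should not abort the run; fall through to the next one.
            continue
        if rows:
            return name, rows
    return "none", []
```

Ordering the stages from cheapest and most deterministic (the bank parser) to slowest and least predictable (the local LLM) means the expensive stages only run for statements the earlier ones cannot handle.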

Privacy & Security

StmtForge is built around a local-first, zero-upload architecture.

| Aspect | Details |
| --- | --- |
| Processing | 100% local; no cloud, no external APIs |
| Storage | Local SQLite + local files only |
| Telemetry | None; no analytics, no phone-home |
| Log privacy | PII auto-redacted (emails, phones, PAN, card numbers) |
| PDF passwords | .env → memory only; never logged or stored in DB |
| Gmail | Optional, read-only OAuth2; revoke anytime at Google Permissions |
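
As a rough illustration of the log redaction and HMAC pseudonymization mentioned in this README, the sketch below uses only the standard library. The regular expressions, placeholder tokens, and 16-character digest length are assumptions for the example and are simpler than what a production redactor would need.

```python
import hashlib
import hmac
import re

# Order matters: card numbers are replaced before the shorter phone pattern runs.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"), "<PAN>"),        # PAN: 5 letters, 4 digits, 1 letter
    (re.compile(r"\b\d(?:[ -]?\d){12,18}\b"), "<CARD>"),         # 13 to 19 digits, optional separators
    (re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)"), "<PHONE>"),
]


def redact(message: str) -> str:
    """Replace likely PII in a log message with placeholder tokens."""
    for pattern, token in PATTERNS:
        message = pattern.sub(token, message)
    return message


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed, repeatable pseudonym: the same value and key always give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
```

A keyed HMAC, unlike a plain hash, cannot be reversed by hashing a list of guessed values unless the key is also known, while still letting the same card or merchant be correlated across log lines.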

See SECURITY.md for vulnerability reporting and full security policy.


Configuration

stmtforge init creates a config.yaml with these sections:

| Section | Purpose |
| --- | --- |
| gmail | Sender domains, search keywords, attachment filters |
| credit_cards | Your banks and card names |
| pdf_passwords | Password patterns (from .env) |
| parsers | Email/filename → bank mapping, card identifiers |
| categories | Merchant → category rules |
| database | SQLite path |
| llm | Ollama model, URL, temperature |
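
The categories section maps merchants to categories. How such rules might be applied can be sketched as a keyword lookup. The rule set and its dictionary shape below are hypothetical and do not reflect the actual config.yaml schema.

```python
# Hypothetical rules: category -> keywords to look for in the transaction description.
RULES = {
    "Food": ["swiggy", "zomato"],
    "Shopping": ["amazon", "flipkart"],
    "Travel": ["irctc", "makemytrip"],
}


def categorize(description: str, rules: dict[str, list[str]] = RULES, default: str = "Other") -> str:
    """Return the first category whose keyword appears in the description."""
    text = description.lower()
    for category, keywords in rules.items():
        if any(keyword in text for keyword in keywords):
            return category
    return default
```

Since the first matching rule wins, the order of categories matters when a description could match more than one of them.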

Adding a New Bank Parser

from stmtforge.parsers.base_parser import BaseParser, parse_date, parse_amount

class MyBankParser(BaseParser):
    BANK_NAME = "mybank"

    def parse(self, pdf_path):
        # Extract one record per transaction from the PDF at pdf_path,
        # normalising raw strings with parse_date / parse_amount as needed.
        records = []
        return self._get_standard_df(records)

Register in src/stmtforge/parsers/registry.py and add mappings in config_template.yaml. See CONTRIBUTING.md for details.


Project Structure

stmt-forge/
├── src/stmtforge/           # Package source
│   ├── cli.py               # CLI entry point
│   ├── run_pipeline.py      # Pipeline orchestrator
│   ├── hybrid_pipeline.py   # Hybrid extraction engine
│   ├── parsers/             # 9 bank parsers + generic + categorizer
│   ├── dashboard/           # Streamlit analytics app
│   │   ├── app.py           # Main dashboard (Analytics, Statements, Parse PDF)
│   │   └── pages/
│   │       └── 2_Card_Optimizer.py  # Card Optimizer with scope selector, N-card portfolio
│   ├── pdf_processing/      # PDF unlock & text extraction
│   ├── llm/                 # Ollama client & prompts
│   ├── gmail/               # Gmail OAuth & fetcher
│   ├── database/            # SQLite layer (user_cards table, sync methods)
│   ├── suggestor/           # Card recommendation engine
│   │   ├── card_db.py       # 29-card YAML database loader
│   │   ├── card_advisor.py  # Per-transaction best-card advisor (CardAdvisor)
│   │   ├── optimizer.py     # rank_cards + best_n_card_combo (greedy N-card portfolio)
│   │   ├── spend_vector.py  # SpendVector builder from DB transactions
│   │   └── report.py        # HTML report exporter
│   ├── validator/           # Transaction validation
│   └── utils/               # Config, logging, privacy, hashing
├── data/
│   ├── cards/               # 29 YAML card definition files
│   └── ccanalyser.db        # SQLite DB (transactions, user_cards, metadata)
├── tests/                   # Test suite (80 tests)
├── pyproject.toml           # Build config
└── README.md

Contributing

Bug reports, new bank parsers, and code fixes welcome. See CONTRIBUTING.md.

License

MIT
