Parse bank statement PDFs, extract transactions, and persist to Parquet and SQLite.

bank-statement-parser

Parse bank statement PDFs, extract structured transaction data, validate financial information through checks and balances, and persist results to Parquet files and a SQLite star-schema data mart. Export reports as Excel workbooks or CSV files.

Features

PDF extraction — configurable pattern-based parsing of bank statement PDFs using pdfplumber.
Checks and balances — automatic validation of opening/closing balances, payment totals, and running balances against statement header values.
Dual persistence — write results to Parquet files, a SQLite database, or both.
Star-schema data mart — automatically builds dimension and fact tables (DimTime, DimAccount, DimStatement, FactTransaction, FactBalance) plus a GapReport for detecting missing statements.
Dual report backends — read the same report classes from either Parquet or SQLite, with identical schemas.
Export — single flat transactions table (default) or separate star-schema tables, as Excel and/or CSV.
PDF anonymisation — redact personally identifiable information from statement PDFs using a user-supplied mapping file. Transaction descriptions are scrambled so merchant names cannot be recovered.
Parallel processing — async + multiprocess batch mode for large PDF sets.
Cross-platform — pure Python with no OS-specific dependencies.
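As a rough illustration of the checks-and-balances idea, the sketch below verifies that an opening balance plus all signed movements equals the closing balance. The helper name check_statement and its arguments are assumptions made for this example; they are not part of the package's API, and the real validation is driven by the parser's own configuration.

```python
from decimal import Decimal


def check_statement(opening, closing, amounts):
    """Return True when opening balance plus all movements equals closing.

    Hypothetical helper for illustration only; not part of bank_statement_parser.
    Amounts are signed: credits positive, debits negative.
    """
    running = Decimal(opening)
    for amount in amounts:
        running += Decimal(amount)
    return running == Decimal(closing)


# Strings are passed to Decimal to avoid binary floating-point rounding.
ok = check_statement("100.00", "74.50", ["-30.00", "-5.50", "10.00"])
bad = check_statement("100.00", "80.00", ["-30.00", "-5.50", "10.00"])
print(ok, bad)  # True False
```

Using Decimal rather than float keeps the equality test exact, which matters when a one-penny difference should fail the check.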

Installation

Requires Python 3.14 or later.

From PyPI

The recommended way to install for most users. Both pipx and uv tool create an isolated virtualenv and put bsp on your $PATH.

# Using pipx
pipx install uk-bank-statement-parser

# Using uv (faster)
uv tool install uk-bank-statement-parser

To upgrade later:

pipx upgrade uk-bank-statement-parser   # or
uv tool upgrade uk-bank-statement-parser

Debian / Ubuntu (.deb)

Download the .deb from the latest GitHub Release, then install:

sudo dpkg -i uk-bank-statement-parser_0.2.0a1_all.deb

This installs a self-contained virtualenv to /opt/uk-bank-statement-parser/ and a bsp wrapper to /usr/bin/bsp. Uninstall with sudo dpkg -r uk-bank-statement-parser.

Fedora / RHEL (.rpm)

Download the .rpm from the latest GitHub Release, then install:

sudo rpm -i uk-bank-statement-parser-0.2.0a1-1.noarch.rpm

Uninstall with sudo rpm -e uk-bank-statement-parser.

From source

git clone https://github.com/boscorat/bank_statement_parser.git
cd bank_statement_parser
uv sync

Quick Start

Command line

Process all PDFs in a folder and export an Excel workbook and CSV file:

bsp process --pdfs ~/statements/

This creates a bsp_project/ directory in your current working directory containing the SQLite database, Parquet files, and exported reports.

Python API

import bank_statement_parser as bsp
from pathlib import Path

# Process a batch of PDFs
batch = bsp.StatementBatch(pdfs=sorted(Path("~/statements").expanduser().glob("*.pdf")))

# Persist to Parquet + SQLite
batch.update_data()

# Export a flat transactions table as Excel and CSV
batch.export(filetype="both")

# Copy source PDFs into the project tree
batch.copy_statements_to_project()

# Clean up temporary files
batch.delete_temp_files()

Read reports directly:

import bank_statement_parser as bsp

# From the SQLite backend
flat = bsp.db.FlatTransaction().all.collect()

# From the Parquet backend
flat = bsp.parquet.FlatTransaction().all.collect()

On both backends, .all is a Polars LazyFrame with an identical schema; calling .collect() materialises it as a DataFrame.

CLI Reference

`bsp process`

Parse bank statement PDFs, persist data, and export reports.

bsp process [OPTIONS]

Option	Default	Description
`--project PATH`	`./bsp_project/`	Project folder path. Created if absent.
`--pdfs PATH`	Current directory	Folder to scan for PDF files.
`--pattern GLOB`	`**/*.pdf`	Glob pattern for PDF discovery.
`--no-turbo`	Off	Disable parallel processing.
`--company KEY`	Auto-detect	Company key for config lookup.
`--account KEY`	Auto-detect	Account key for config lookup.
`--data {parquet,database,both}`	`both`	Where to persist extracted data.
`--export-data {parquet,database}`	`database`	Which backend to read when exporting.
`--export-format {excel,csv,both}`	`both`	Output file format.
`--export-type {full,simple}`	`simple`	Export preset (see Export Options).
`--no-export`	Off	Skip the export step entirely.
`--no-copy`	Off	Skip copying source PDFs into the project.

Examples:

# Process PDFs in ~/statements, write to a specific project folder
bsp process --pdfs ~/statements --project ~/my_project

# Process only top-level PDFs (no subdirectories), export CSV only
bsp process --pdfs ~/statements --pattern "*.pdf" --export-format csv

# Process without exporting (data only)
bsp process --pdfs ~/statements --no-export

# Full star-schema export for loading into an external database
bsp process --pdfs ~/statements --export-type full

`bsp anonymise`

Replace personally identifiable information in bank statement PDFs with dummy values. The anonymiser physically rewrites the PDF content stream — sensitive text is removed from the file, not merely covered with a rectangle. Transaction descriptions are also scrambled (each letter replaced with a random different letter) so that merchant names and references cannot be recovered.
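
The letter scrambling described above can be sketched as follows. This is an illustrative stand-in, not the package's implementation: the function name scramble is hypothetical, and preserving letter case and leaving digits and punctuation untouched are assumptions made for the example.

```python
import random
import string


def scramble(text, rng=None):
    """Replace each ASCII letter with a random different letter of the same case.

    Digits, spaces, and punctuation are left unchanged. Hypothetical sketch;
    the package's own scrambler may differ.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            pool = string.ascii_lowercase
        elif ch in string.ascii_uppercase:
            pool = string.ascii_uppercase
        else:
            out.append(ch)
            continue
        # Exclude the original letter so every letter is guaranteed to change.
        out.append(rng.choice([c for c in pool if c != ch]))
    return "".join(out)


original = "TESCO STORES 2041 London"
scrambled = scramble(original, random.Random(0))
print(scrambled)
```

Because each letter is drawn independently at random, the same merchant name scrambles differently on every occurrence, so the output carries no recoverable mapping back to the original text.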

Setting up your anonymise config

Anonymisation is driven by a TOML config file (anonymise.toml) that maps your real personal details to dummy replacements. Because this file contains PII, it is never shipped with the default project and is excluded from source control via .gitignore.

When you create a project (via bsp process or validate_or_initialise_project()), an example template is copied into the project config directory:

bsp_project/config/anonymise_example.toml

To set up anonymisation:

Copy anonymise_example.toml to anonymise.toml in the same directory (or any location you prefer).
Edit anonymise.toml — replace the left-hand (search) values with the real text as it appears in your PDFs, and the right-hand (replacement) values with the dummy text you want rendered instead.
Pass the path to your anonymise.toml via the --config flag, since the default project directory will never contain one.

The config has two sections:

[global_replacements] — applied on every page across the full page area. Use for names, account numbers, sort codes, IBANs, and card numbers.
[address_replacements] — applied on page 1 only, within the personal address block at the top-left corner. Use for address lines, city names, and postcodes that might also appear as merchant/location names in transaction descriptions (where you would not want them replaced).

Ordering matters: within each section, entries are applied top-to-bottom. Always place longer, more specific strings before shorter fragments. For example, list "John William Surname" before "Surname" — otherwise the fragment match fires first and corrupts the full-name replacement.
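
The ordering rule is easy to reproduce with plain string replacement. The sketch below is a simplification: the real anonymiser rewrites PDF content streams, and apply_replacements is a hypothetical helper, but the top-to-bottom behaviour is the same.

```python
def apply_replacements(text, mapping):
    """Apply search -> replacement pairs in insertion order (top to bottom)."""
    for search, replacement in mapping.items():
        text = text.replace(search, replacement)
    return text


line = "Mr John William Surname"

# Longer string first: the full name is replaced as one unit.
good = apply_replacements(line, {
    "John William Surname": "Alex Sample Person",
    "Surname": "Person",
})

# Fragment first: "Surname" is rewritten before the full name can match.
bad = apply_replacements(line, {
    "Surname": "Person",
    "John William Surname": "Alex Sample Person",
})

print(good)  # Mr Alex Sample Person
print(bad)   # Mr John William Person
```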

Checking your output

Anonymised PDFs should always be reviewed carefully before sharing. The anonymiser cannot guarantee perfect results in every case — font encoding differences, unusual character spacing, or layout variations may cause some replacements to render incorrectly or miss certain occurrences. Open each output file and verify that:

All personal details (names, addresses, account numbers) have been replaced.
Replacement text renders correctly and is the expected length.
No sensitive information remains in headers, footers, or transaction descriptions.

You may need to make manual edits to the PDF or adjust your anonymise.toml mappings and re-run.

Command reference

bsp anonymise PATH [OPTIONS]

Option	Default	Description
`PATH`	(required)	PDF file or folder (with `--folder`).
`--folder`	Off	Treat PATH as a directory.
`--pattern GLOB`	`*.pdf`	Glob for PDF discovery in folder mode.
`--output OUT_FILE`	`<stem>_anonymised.pdf`	Output path (single-file mode).
`--output-dir OUT_DIR`	Alongside source	Output directory (folder mode).
`--config CONFIG_TOML`	Project config	Path to a custom anonymise.toml.

Examples:

# Anonymise a single PDF using a config in your home directory
bsp anonymise statement.pdf --config ~/anonymise.toml

# Anonymise all PDFs in a folder
bsp anonymise ~/statements --folder --config ~/anonymise.toml

# Anonymise to a specific output directory
bsp anonymise ~/statements --folder --output-dir ~/anonymised --config ~/anonymise.toml

Python API Reference

Statement Processing

`StatementBatch`

The main entry point for processing PDFs. Extraction starts on construction.

batch = bsp.StatementBatch(
    pdfs=[Path("a.pdf"), Path("b.pdf")],  # list of PDF paths
    company_key="hsbc_uk",                 # optional — auto-detected if omitted
    account_key="current_account",         # optional — auto-detected if omitted
    turbo=True,                            # parallel processing (default: True)
    project_path=Path("my_project"),       # optional — uses default project if omitted
)

Method	Description
`update_data(datadestination="both")`	Persist results. `"parquet"`, `"database"`, or `"both"`.
`export(...)`	Export reports. See Export Options.
`copy_statements_to_project()`	Copy source PDFs into `project/statements/{year}/{account}/`.
`delete_temp_files()`	Remove temporary per-PDF Parquet files.
`debug(project_path=None)`	Re-process failing PDFs and write diagnostic JSON.

Property	Description
`pdf_count`	Number of PDFs in the batch.
`errors`	Number of PDFs that failed processing.
`duration_secs`	Wall-clock processing time in seconds.
`processed_pdfs`	List of processed `Statement` objects.
`ID_BATCH`	Unique batch identifier (UUID).

Report Backends

Both bsp.db (SQLite) and bsp.parquet (Parquet files) expose identical report classes. Each class has an .all attribute that returns a Polars LazyFrame.

# All classes accept an optional project_path keyword argument.
# When omitted, the default project directory is used.
flat  = bsp.db.FlatTransaction(project_path=Path("my_project"))
df    = flat.all.collect()

Class	Description
`FlatTransaction`	Denormalised transactions with account and statement details.
`FactTransaction`	Transaction fact table (one row per transaction line).
`FactBalance`	Daily balance series per account (fills gaps between statements).
`DimTime`	Date dimension with calendar attributes (year, quarter, month, weekday, etc.).
`DimAccount`	Account dimension (company, account type, number, sort code, holder).
`DimStatement`	Statement dimension (statement date, filename, batch timestamp).
`GapReport`	Statement continuity check — flags gaps where closing/opening balances disagree.
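
To make the GapReport idea concrete, here is a minimal sketch of a continuity check: statements are sorted by date, and a gap is flagged wherever one statement's closing balance differs from the next statement's opening balance. The tuple layout, the find_gaps name, and the sample figures are all assumptions for illustration; the package's own report has its own schema.

```python
from datetime import date
from decimal import Decimal

# (statement_date, opening_balance, closing_balance); made-up sample data
statements = [
    (date(2025, 1, 31), Decimal("100.00"), Decimal("250.00")),
    (date(2025, 2, 28), Decimal("250.00"), Decimal("180.00")),
    (date(2025, 4, 30), Decimal("95.00"), Decimal("120.00")),  # March is missing
]


def find_gaps(rows):
    """Yield (previous_date, next_date) where closing != next opening."""
    ordered = sorted(rows, key=lambda r: r[0])
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev[2] != nxt[1]:
            yield prev[0], nxt[0]


gaps = list(find_gaps(statements))
print(gaps)  # [(datetime.date(2025, 2, 28), datetime.date(2025, 4, 30))]
```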

Export Helpers

Module-level functions on both backends:

# Export from the SQLite backend
bsp.db.export_csv(folder=None, type="simple", project_path=None)
bsp.db.export_excel(path=None, type="simple", project_path=None)

# Export from the Parquet backend
bsp.parquet.export_csv(folder=None, type="simple", project_path=None)
bsp.parquet.export_excel(path=None, type="simple", project_path=None)

When folder / path is omitted, files are written to the project's export/csv/ or export/excel/ sub-directory automatically.

Database Utilities

Function / Class	Description
`build_datamart(db_path)`	Drop and rebuild all star-schema mart tables.
`create_db(db_path)`	Create (or recreate) the raw SQLite schema.
`Housekeeping(db_path)`	Orphan detection and cascaded delete utilities.

Project Scaffolding

Function	Description
`validate_or_initialise_project(path)`	Validate an existing project or scaffold a new one.
`copy_project_folders(dest)`	Copy the project directory structure (directories only).
`copy_default_config(dest)`	Copy shipped TOML config files to a directory.

PDF Anonymisation

bsp.anonymise_pdf(input_path, output_path=None, config_path=None, scramble_descriptions=True)
bsp.anonymise_folder(folder_path, pattern="*.pdf", output_dir=None, config_path=None, scramble_descriptions=True)

Both functions need an anonymise.toml. No default file ships with the project, so create one from the anonymise_example.toml template (see Setting up your anonymise config above) and pass its location via config_path. If config_path is omitted, the function looks in the default project config directory and raises FileNotFoundError with instructions if the file is missing.

Set scramble_descriptions=False to disable the random letter substitution of transaction descriptions (enabled by default).

Always review the output files before sharing — see Checking your output above.

Project Structure

Running bsp process creates the following project layout:

bsp_project/
├── config/              # TOML configuration files
│   └── anonymise_example.toml  # Template — copy to anonymise.toml and edit
├── database/
│   └── project.db       # SQLite database (raw tables + star-schema mart)
├── export/
│   ├── csv/             # Exported CSV files
│   └── excel/           # Exported Excel workbooks
├── log/                 # Processing logs and debug output
├── parquet/             # Parquet data files
│   ├── batch_lines.parquet
│   ├── statement_heads.parquet
│   └── statement_lines.parquet
└── statements/          # Archived source PDFs (organised by year/account)

The SQLite database contains both raw extraction tables (statement_heads, statement_lines) and a full star-schema data mart that is rebuilt automatically on each update_data() call.
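
Because the mart lives in an ordinary SQLite file, it can be queried with the standard-library sqlite3 module. The sketch below builds a tiny in-memory stand-in rather than opening project.db: the table names DimAccount and FactTransaction come from this README, but every column name and value is hypothetical, so inspect the real schema before adapting the query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Column names are assumptions for illustration only.
    CREATE TABLE DimAccount (account_id INTEGER PRIMARY KEY, account_name TEXT);
    CREATE TABLE FactTransaction (
        transaction_id INTEGER PRIMARY KEY,
        account_id INTEGER REFERENCES DimAccount(account_id),
        amount_pence INTEGER
    );
    INSERT INTO DimAccount VALUES (1, 'Current'), (2, 'Savings');
    INSERT INTO FactTransaction VALUES
        (1, 1, -1250), (2, 1, 50000), (3, 2, 10000);
""")

# Typical star-schema query: join the fact table to a dimension and aggregate.
totals = con.execute("""
    SELECT a.account_name, SUM(f.amount_pence)
    FROM FactTransaction AS f
    JOIN DimAccount AS a USING (account_id)
    GROUP BY a.account_name
    ORDER BY a.account_name
""").fetchall()
con.close()

print(totals)  # [('Current', 48750), ('Savings', 10000)]
```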

Export Options

The type parameter controls what gets exported:

`simple` (default)

Exports a single flat transactions table — one row per transaction with account and statement details denormalised into each row. This is the most useful format for analysis in Excel, Google Sheets, or Pandas/Polars.

CSV: transactions_table.csv
Excel: single transactions_table sheet in transactions.xlsx

`full`

Exports separate star-schema tables intended for loading into an external database or BI tool. Since the SQLite database is already available in the project folder, this is mainly useful when you need the data in a different database system.

CSV: statement.csv, account.csv, calendar.csv, transactions.csv, balances.csv, gaps.csv
Excel: one sheet per table in transactions.xlsx

Contributing

Developer guidelines, architecture notes, code style rules, and test commands are documented in AGENTS.md.

# Run the test suite
pytest -v

# Lint and format
ruff check .
ruff format .

Releasing a new version

Bump the version in pyproject.toml (the single source of truth).

Commit and tag:

git add pyproject.toml uv.lock
git commit -m "release: v0.2.0"
git tag -a v0.2.0 -m "v0.2.0"
git push origin main --tags

The release.yml workflow then runs automatically: it builds and publishes to PyPI, builds .deb and .rpm packages, and creates a GitHub Release with all assets attached.

License

MIT
