bankstract
Convert Nigerian bank PDF statements into structured CSV. Plugin architecture — one parser per bank.
pip install bankstract
bankstract palmpay statement.pdf -o out.csv
bankstract auto unknown.pdf -o out.csv
bankstract list
Status
| Bank | Status |
|---|---|
| PalmPay | v0.7 — alpha |
| First Bank | v0.7 — alpha |
| Zenith | v0.7 — alpha |
Install
pip install bankstract
Optional extras:
pip install "bankstract[ocr]" # pytesseract for scanned PDFs
pip install "bankstract[camelot]" # camelot lattice fallback
Develop
The project uses uv for dependency and virtual-environment management.
uv sync --all-extras # create .venv, install deps + extras from uv.lock
uv run pre-commit install # one-time: enable the pre-commit hook
uv run pytest # run tests
uv run ruff check src tests
uv run pyright src tests # strict type check (see CLAUDE.md directive 8)
uv run bankstract list # invoke CLI
Add a dependency with uv add <pkg> (dev: uv add --dev <pkg>). Commit uv.lock.
The pre-commit hook runs ruff check, ruff format --check, pyright (strict), and pytest before every commit. Bypass only in genuine emergencies with git commit --no-verify; the same checks run again in CI.
Releasing
CI publishes to PyPI automatically on push to main via .github/workflows/publish.yml. The workflow runs the full gate (ruff + pyright + pytest), and if the current pyproject.toml version already exists on PyPI it auto-bumps the minor component and commits the bump before publishing. PyPI auth uses OIDC trusted publishing — no token in repo or CI secrets.
To prepare a release locally:
scripts/bump-version.sh # patch bump
scripts/bump-version.sh minor # 0.2.x -> 0.3.0
scripts/bump-version.sh major # 0.x.x -> 1.0.0
scripts/bump-version.sh 0.3.0 # exact set
uv build # dist/*.whl + dist/*.tar.gz
uv publish dist/* # only if not using the GH workflow; needs --token or UV_PUBLISH_TOKEN
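The four bump modes above amount to ordinary MAJOR.MINOR.PATCH arithmetic. The sketch below illustrates that arithmetic in Python. It is not the contents of scripts/bump-version.sh, and the function name bump_version is a hypothetical one chosen for this example.

```python
import re

# Matches a plain MAJOR.MINOR.PATCH version such as "0.7.0".
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def bump_version(current: str, arg: str = "patch") -> str:
    """Return the next version for 'patch', 'minor', 'major', or an exact version."""
    match = _VERSION_RE.fullmatch(current)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {current!r}")
    major, minor, patch = (int(part) for part in match.groups())

    if arg == "patch":
        return f"{major}.{minor}.{patch + 1}"
    if arg == "minor":
        # A minor bump resets the patch component.
        return f"{major}.{minor + 1}.0"
    if arg == "major":
        # A major bump resets both minor and patch.
        return f"{major + 1}.0.0"
    if _VERSION_RE.fullmatch(arg):
        # An exact version is used as given.
        return arg
    raise ValueError(f"unknown bump argument: {arg!r}")
```

For example, bump_version("0.2.5", "minor") returns "0.3.0", matching the comment on the minor line above.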
Trusted-publisher setup (one-time, owner only): create a publisher at https://pypi.org/manage/account/publishing/ with workflow publish.yml, repo logickoder/bankstract.
Usage
bankstract <bank> <pdf> -o <out> # explicit parser
bankstract auto <pdf> -o <out> # auto-detect via Parser.detect_confidence()
bankstract list # show registered parsers
bankstract <bank> <pdf> -o out.json -f json # JSON instead of CSV
cat statement.pdf | bankstract auto - -o - # stdin / stdout pipeline
Pass - as the PDF arg to read from stdin, or - to -o to write to stdout. When stdout is the data sink, informational messages go to stderr so the data stream stays clean.
Unparseable blocks are written to a .log sidecar next to the output file.
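The - convention can be sketched in a few lines of Python. This is a minimal illustration of the behaviour described above, not bankstract's actual CLI code. The helper names open_input and message_stream are hypothetical, and buffering stdin into memory is an assumption made here because pipes are not seekable.

```python
from __future__ import annotations

import io
import sys
from typing import BinaryIO, TextIO


def open_input(pdf_arg: str, stdin: BinaryIO | None = None) -> BinaryIO:
    """Return a seekable binary stream for a file path, or for '-' (stdin)."""
    if pdf_arg == "-":
        source = stdin if stdin is not None else sys.stdin.buffer
        # Read the whole PDF into memory so the result supports seek().
        return io.BytesIO(source.read())
    return open(pdf_arg, "rb")


def message_stream(out_arg: str) -> TextIO:
    """Send informational messages to stderr when stdout carries the data."""
    return sys.stderr if out_arg == "-" else sys.stdout
```

Keeping messages off stdout in the - case is what lets the output be piped straight into another tool.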
Python API
import bankstract
bankstract.list_parsers() # ['fbn', 'palmpay', 'zenith']
bankstract.detect("statement.pdf") # 'palmpay' | None
result = bankstract.parse("statement.pdf") # auto-detect
result = bankstract.parse(fp, bank="fbn") # explicit; fp is BytesIO
result.metadata.account_holder
result.metadata.statement_period_start
result.transactions[0].balance
result.format_version
Public surface (semver-locked)
Only the names re-exported from bankstract are part of the semver contract:
| Name | Kind | Purpose |
|---|---|---|
| parse | function | parse(source, *, bank=None) -> ParseResult |
| detect | function | detect(source) -> str \| None (max-score bank) |
| list_parsers | function | sorted bank names |
| Parser | ABC | base class for new parsers |
| Transaction | pydantic | row schema |
| StatementMetadata | dataclass | account holder / period / opening + closing balance |
| ParseResult | dataclass | transactions[], totals, format_version, metadata |
| ParseError | exception | layout mismatch |
| ReconciliationError | exception | invariant break |
| \_\_version\_\_ | str | package version |
source accepts pathlib.Path, a string path (treated as a path), or a seekable binary stream (e.g. io.BytesIO). Auto-detection picks the parser with the highest detect_confidence score — ties resolve to registration order. Anything imported from a submodule prefixed with _ (bankstract._api, bankstract._pdfplumber, bankstract._layout) is internal and may change in any release.
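The selection rule described above (highest score wins, ties go to the parser registered first) can be sketched as follows. This is an illustration under stated assumptions, not the library's internals: FakeParser and pick_parser are hypothetical names, and treating a score of 0 as "no match" is an assumption made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FakeParser:
    """Hypothetical stand-in for a registered parser with a fixed score."""

    bank: str
    score: float

    def detect_confidence(self, source: object) -> float:
        return self.score


def pick_parser(parsers: list[FakeParser], source: object) -> str | None:
    """Return the bank with the highest confidence, or None if nothing matches."""
    best_bank: str | None = None
    best_score = 0.0
    for parser in parsers:  # iterate in registration order
        score = parser.detect_confidence(source)
        # Strict '>' means an earlier-registered parser keeps the win on a tie.
        if score > best_score:
            best_bank, best_score = parser.bank, score
    return best_bank
```

With parsers registered as fbn (0.4), palmpay (0.9), zenith (0.9), the sketch returns 'palmpay' because it was registered before zenith.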
Reconciliation invariant
Two complementary checks; the CLI picks whichever applies per bank.
- Row-wise (banks that print a running balance): prev.balance ± debit/credit == curr.balance. Mismatch raises ReconciliationError with the row index.
- Totals-based (banks like PalmPay that omit a balance column): the parser reads Total Money In / Total Money Out from the statement header and the CLI asserts that the sum of parsed credits/debits equals those totals.
Both modes exist to catch silently-dropped rows — the failure mode of naive PDF parsers.
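Both checks can be sketched with plain Decimal arithmetic. This is a minimal illustration, not bankstract's implementation. The Row dataclass, the function names, and the local ReconciliationError are stand-ins defined here, and the sign convention (credits add to the balance, debits subtract) is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from decimal import Decimal


class ReconciliationError(Exception):
    """Local stand-in for bankstract.ReconciliationError."""


@dataclass(frozen=True)
class Row:
    """Hypothetical transaction row; balance is None when the bank omits it."""

    debit: Decimal
    credit: Decimal
    balance: Decimal | None = None


def check_row_wise(opening: Decimal, rows: list[Row]) -> None:
    """Verify that each printed balance follows from the previous one."""
    prev = opening
    for index, row in enumerate(rows):
        expected = prev + row.credit - row.debit
        if row.balance != expected:
            raise ReconciliationError(
                f"row {index}: expected balance {expected}, got {row.balance}"
            )
        prev = expected


def check_totals(rows: list[Row], total_in: Decimal, total_out: Decimal) -> None:
    """Verify that parsed credits and debits sum to the header totals."""
    credits = sum((row.credit for row in rows), Decimal("0"))
    debits = sum((row.debit for row in rows), Decimal("0"))
    if credits != total_in or debits != total_out:
        raise ReconciliationError(
            f"totals mismatch: in {credits} vs {total_in}, out {debits} vs {total_out}"
        )
```

If a row is dropped during parsing, the next row's balance no longer follows from the previous one (row-wise) or the sums fall short of the header totals (totals-based), so either check fails instead of passing silently.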
Contributing a bank parser
- Copy src/bankstract/parsers/palmpay.py to src/bankstract/parsers/<bank>.py.
- Implement detect() and parse() -> ParseResult from parsers/base.py. Populate total_credit / total_debit if the statement only ships header totals.
- Add a Redactor subclass under src/bankstract/redactors/<bank>.py for the fixture pipeline.
- Drop the raw statement at tests/<bank>/fixtures/_local/ (gitignored), then run uv run bankstract redact <bank> <raw> tests/<bank>/fixtures/sample.pdf to produce the committable fixture.
- Add tests under tests/<bank>/test_parser.py and tests/<bank>/test_redactor.py.
CI runs ruff + pyright (strict) + pytest. All three must pass clean, and the reconciliation invariant must hold on every fixture.
Fixture PDFs must be redacted: account numbers, names, addresses, transaction IDs scrubbed. Never commit unredacted statements.
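The overall shape of a new parser might look like the sketch below. Everything in it is an assumption made for illustration. The Parser and ParseResult classes are local stand-ins rather than the real ones in parsers/base.py, and the method signatures, the text-based input, the row pattern, and the bank name examplebank are all hypothetical. Check palmpay.py for the actual interface before copying anything.

```python
from __future__ import annotations

import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class ParseResult:
    """Stand-in result type; the real one carries more fields."""

    transactions: list[dict[str, str]] = field(default_factory=list)
    total_credit: Decimal | None = None
    total_debit: Decimal | None = None


class Parser(ABC):
    """Stand-in base class; assumes parsers work on already-extracted text."""

    bank: str

    @abstractmethod
    def detect_confidence(self, text: str) -> float: ...

    @abstractmethod
    def parse(self, text: str) -> ParseResult: ...


class ExampleBankParser(Parser):
    bank = "examplebank"
    # Hypothetical row layout: ISO date, free-text description, signed amount.
    _ROW = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")

    def detect_confidence(self, text: str) -> float:
        # Crude heuristic: full confidence if the bank name appears at all.
        return 1.0 if "EXAMPLE BANK" in text.upper() else 0.0

    def parse(self, text: str) -> ParseResult:
        result = ParseResult()
        for line in text.splitlines():
            match = self._ROW.match(line.strip())
            if match is None:
                continue  # a real parser would log unparseable blocks
            date, description, amount = match.groups()
            result.transactions.append(
                {"date": date, "description": description, "amount": amount.replace(",", "")}
            )
        return result
```

A test for such a parser would typically feed it the redacted fixture and assert both the row count and the reconciliation invariant.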
License
MIT. Author: logickoder.