bankstract

Convert Nigerian bank statements (PDF and XLSX) into structured CSV or JSON. Parsers are plugins: there is one parser per bank, and each parser declares the formats it supports.

pip install bankstract

bankstract palmpay statement.pdf -o out.csv
bankstract opay statement.xlsx -o out.json -f json
bankstract auto unknown.pdf -o out.csv
bankstract list                                # list banks and their supported formats

Status

Bank        Formats    Status
PalmPay     PDF        v0.13, alpha
First Bank  PDF        v0.13, alpha
Zenith      PDF        v0.13, alpha
OPay        PDF, XLSX  v0.13, alpha

Install

pip install bankstract

Optional extras:

pip install "bankstract[ocr]"      # pytesseract for scanned PDFs
pip install "bankstract[camelot]"  # camelot lattice fallback

Develop

The project uses uv for dependency and virtual-environment management.

uv sync --all-extras       # create .venv, install deps + extras from uv.lock
uv run pre-commit install  # one-time: enable the pre-commit hook
uv run pytest              # run tests
uv run ruff check src tests
uv run pyright src tests   # strict type check (see CLAUDE.md directive 8)
uv run bankstract list     # invoke CLI

Add a runtime dependency with uv add <pkg>, or a development dependency with uv add --dev <pkg>, and commit the updated uv.lock.

The pre-commit hook runs ruff check, ruff format --check, pyright (strict), and pytest before every commit. Bypass only in genuine emergencies with git commit --no-verify; the same checks run again in CI.

Releasing

CI publishes to PyPI automatically on every push to main via .github/workflows/publish.yml. The workflow runs the full gate (ruff + pyright + pytest). If the version in pyproject.toml already exists on PyPI, the workflow auto-bumps the minor component and commits the bump before publishing. PyPI auth uses OIDC trusted publishing, so no token is stored in the repo or in CI secrets.

To prepare a release locally:

scripts/bump-version.sh                 # patch bump
scripts/bump-version.sh minor           # 0.2.x -> 0.3.0
scripts/bump-version.sh major           # 0.x.x -> 1.0.0
scripts/bump-version.sh 0.3.0           # exact set
uv build                                # dist/*.whl + dist/*.tar.gz
uv publish dist/*                       # only if not using the GH workflow; needs --token or UV_PUBLISH_TOKEN
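
The bump arguments above follow ordinary MAJOR.MINOR.PATCH arithmetic. As a rough illustration only (this is a Python sketch, not the contents of scripts/bump-version.sh, and the function name bump is hypothetical), the four forms can be modelled like this:

```python
def bump(version: str, part: str = "patch") -> str:
    """Return the next version for a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    if part == "minor":
        # A minor bump resets the patch component: 0.2.x -> 0.3.0
        return f"{major}.{minor + 1}.0"
    if part == "major":
        # A major bump resets minor and patch: 0.x.x -> 1.0.0
        return f"{major + 1}.0.0"
    # Any other argument is treated as an exact version to set.
    pieces = part.split(".")
    if len(pieces) != 3 or not all(p.isdigit() for p in pieces):
        raise ValueError(f"not a bump keyword or X.Y.Z version: {part!r}")
    return part
```

With this model, bump("0.2.7", "minor") gives "0.3.0" and bump("0.2.7") gives "0.2.8". Whether the real script validates its input the same way is an assumption.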

Trusted-publisher setup (one-time, owner only): create a publisher at https://pypi.org/manage/account/publishing/ with workflow publish.yml, repo logickoder/bankstract.

Usage

bankstract <bank> <pdf> -o <out>               # explicit parser
bankstract auto <pdf> -o <out>                 # auto-detect via Parser.detect_confidence()
bankstract list                                # show registered parsers
bankstract <bank> <pdf> -o out.json -f json    # JSON instead of CSV
cat statement.pdf | bankstract auto - -o -     # stdin / stdout pipeline

Pass - as the input argument to read the statement from stdin, and pass - to -o to write the result to stdout. When stdout is the data sink, informational messages go to stderr so the data stream stays clean.
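
The stdout / stderr split is a common pattern for pipeline-friendly tools. The sketch below shows the idea with a hypothetical write_output helper; it is not bankstract's implementation, and for simplicity it sends status messages to stderr in both branches.

```python
import io
import sys
from typing import BinaryIO, Optional, TextIO


def write_output(
    data: bytes,
    target: str,
    *,
    stdout: Optional[BinaryIO] = None,
    stderr: Optional[TextIO] = None,
) -> None:
    """Write data to target, where "-" means stdout (hypothetical helper)."""
    err = stderr if stderr is not None else sys.stderr
    if target == "-":
        out = stdout if stdout is not None else sys.stdout.buffer
        out.write(data)
        # stdout is carrying the data, so the status line must go to stderr.
        print(f"wrote {len(data)} bytes to stdout", file=err)
    else:
        with open(target, "wb") as fh:
            fh.write(data)
        # Simplification: status goes to stderr here too.
        print(f"wrote {len(data)} bytes to {target}", file=err)


# Demo with in-memory streams standing in for the real stdout / stderr.
fake_stdout, fake_stderr = io.BytesIO(), io.StringIO()
write_output(
    b"date,amount\n2024-01-01,100.00\n",
    "-",
    stdout=fake_stdout,
    stderr=fake_stderr,
)
```

After the demo call, fake_stdout holds only the CSV bytes and fake_stderr holds only the status line, which is what lets a downstream consumer read stdout without filtering.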

Unparseable blocks are written to a .log sidecar next to the output file.

Python API

import bankstract

bankstract.list_parsers()             # ['fbn', 'opay', 'palmpay', 'zenith']
bankstract.list_redactors()           # ['fbn', 'opay', 'palmpay', 'zenith']
bankstract.detect("statement.pdf")    # 'palmpay' | None

result = bankstract.parse("statement.pdf")            # auto-detect
result = bankstract.parse(fp, bank="fbn")             # explicit; fp is BytesIO

result.metadata.account_holder
result.metadata.statement_period_start
result.transactions[0].balance
result.format_version

# Parse + serialize in one call — byte-identical to the CLI's output.
csv_bytes  = bankstract.parse_to("statement.pdf")                   # default format="csv"
json_bytes = bankstract.parse_to(fp, format="json", bank="opay")    # explicit
debug_bytes = bankstract.parse_to(fp, reconcile=False)              # skip invariant

# Low-level writers — if you already hold a ParseResult.
from pathlib import Path
bankstract.write_csv(result.transactions, Path("out.csv"))
bankstract.write_json(result, Path("out.json"))

# Redact PII in-memory (no disk write); .data is the redacted file bytes.
redacted = bankstract.redact("statement.pdf")         # auto-detect bank
redacted = bankstract.redact(fp, bank="opay")         # explicit, stream input
redacted.data                                         # bytes — stream to HTTP / write to disk
redacted.report.redactions                            # count
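
The calls above pass either a path string or an in-memory stream (fp). As a minimal sketch of how such a dual input can be normalised, assuming a hypothetical helper named read_source that is not part of bankstract:

```python
import io
from pathlib import Path
from typing import BinaryIO, Union

Source = Union[str, Path, BinaryIO]


def read_source(source: Source) -> bytes:
    """Return the raw bytes behind a path or a seekable binary stream."""
    if isinstance(source, (str, Path)):
        # Strings are always interpreted as filesystem paths, never as content.
        return Path(source).read_bytes()
    if not source.seekable():
        raise ValueError("stream must be seekable")
    # Rewind first so a partially consumed stream still yields the whole file.
    source.seek(0)
    return source.read()


# Demo: a stream that has already been read from is still returned in full.
stream = io.BytesIO(b"%PDF-1.7 example")
stream.read(4)
payload = read_source(stream)
```

Requiring a seekable stream is what makes it safe to inspect the file once for detection and then read it again for parsing.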

Public surface (semver-locked)

Only the names re-exported from bankstract are part of the semver contract:

Name                  Kind        Purpose
parse                 function    parse(source, *, bank=None) -> ParseResult
parse_to              function    parse_to(source, *, format="csv", bank=None, reconcile=True) -> bytes; byte-identical to CLI
detect                function    detect(source) -> str | None (max-score bank)
list_parsers          function    sorted bank names (parsers)
write_csv             function    write_csv(transactions, target: Path | TextIO) -> int
write_json            function    write_json(result, target: Path | TextIO) -> int
redact                function    redact(source, *, bank=None) -> RedactResult; in-memory bytes
list_redactors        function    sorted bank names (redactors)
Parser                ABC         base class for new parsers
Redactor              ABC         base class for new redactors
Transaction           pydantic    row schema
StatementMetadata     dataclass   account holder / period / opening + closing balance
ParseResult           dataclass   transactions[], totals, format_version, metadata
RedactResult          dataclass   data: bytes, bank, format, format_version, report
RedactReport          dataclass   bank, pages, redactions, audit
Format                type alias  Literal["pdf", "xlsx"]
ParseError            exception   base; undiagnosable parse failure
EncryptedSourceError  exception   source PDF / XLSX is password-protected
EmptyStatementError   exception   parser ran clean, zero rows; .marker_coverage field
LayoutDriftError      exception   anchor missing / column shifted post-detect
ReconciliationError   exception   invariant break
__version__           str         package version

source accepts a pathlib.Path, a str (always interpreted as a filesystem path), or a seekable binary stream such as io.BytesIO.

Auto-detection picks the parser or redactor with the highest detect_confidence score; ties resolve to registration order.

redact() returns the redacted bytes in memory, with no tempfile and no disk write, so callers can stream the payload straight to an HTTP response, an archive, or Path.write_bytes() as needed.

Anything imported from a submodule prefixed with _ (bankstract._api, bankstract._pdfplumber, bankstract._xlsx, bankstract._layout) is internal and may change in any release.
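
The auto-detection rule (highest score wins, ties go to the earlier registrant) can be sketched as follows. The registry shape, the pick_bank name, and the convention that a score of zero means "no match" are all assumptions for illustration, not bankstract internals.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical registry entry: (bank name, scoring function), in registration order.
Detector = Callable[[bytes], float]


def pick_bank(registry: List[Tuple[str, Detector]], head: bytes) -> Optional[str]:
    """Return the bank with the highest score, or None if nothing matches."""
    best_name: Optional[str] = None
    best_score = 0.0
    for name, score_fn in registry:
        score = score_fn(head)
        # Strict '>' keeps the earlier registrant when two scores are equal.
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Two toy detectors that both score 0.9 on a PalmPay header.
registry = [
    ("palmpay", lambda head: 0.9 if b"PalmPay" in head else 0.0),
    ("opay", lambda head: 0.9 if b"Pay" in head else 0.0),
]
tie_winner = pick_bank(registry, b"PalmPay statement")
no_match = pick_bank(registry, b"Zenith Bank Plc")
```

Here tie_winner is "palmpay" because it was registered first, and no_match is None because neither detector scores above zero.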

Reconciliation invariant

Two complementary checks; the CLI picks whichever applies per bank.

  • Row-wise (banks that print a running balance): prev.balance - debit + credit == curr.balance, where debit and credit belong to the current row. A mismatch raises ReconciliationError with the row index.
  • Totals-based (banks like PalmPay that omit a balance column): the parser reads Total Money In / Total Money Out from the statement header and the CLI asserts that the sum of parsed credits/debits equals those totals.

Both modes exist to catch silently dropped rows, the typical failure mode of naive PDF parsers.
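
To make the two checks concrete, here is a minimal sketch. Row, ReconciliationFailure, check_row_wise, and check_totals are hypothetical stand-ins rather than bankstract's own types, and the row-wise check reads the invariant as "previous balance minus the current debit plus the current credit".

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Sequence


@dataclass
class Row:
    """Hypothetical stand-in for a parsed transaction row."""
    debit: Decimal
    credit: Decimal
    balance: Optional[Decimal] = None  # None for banks without a balance column


class ReconciliationFailure(Exception):
    """Stand-in for the library's ReconciliationError."""


def check_row_wise(rows: Sequence[Row]) -> None:
    """Each row's balance must follow from the previous row's balance."""
    for index in range(1, len(rows)):
        prev, curr = rows[index - 1], rows[index]
        expected = prev.balance - curr.debit + curr.credit
        if curr.balance != expected:
            raise ReconciliationFailure(
                f"row {index}: expected balance {expected}, got {curr.balance}"
            )


def check_totals(rows: Sequence[Row], total_in: Decimal, total_out: Decimal) -> None:
    """Summed credits / debits must equal the totals printed in the header."""
    credits = sum((row.credit for row in rows), Decimal("0"))
    debits = sum((row.debit for row in rows), Decimal("0"))
    if credits != total_in or debits != total_out:
        raise ReconciliationFailure(
            f"parsed in/out {credits}/{debits} != header {total_in}/{total_out}"
        )


D = Decimal
rows = [
    Row(debit=D("0"), credit=D("500.00"), balance=D("1500.00")),
    Row(debit=D("200.00"), credit=D("0"), balance=D("1300.00")),
    Row(debit=D("0"), credit=D("50.00"), balance=D("1350.00")),
]
check_row_wise(rows)  # passes: every balance follows from the one before
check_totals(rows, total_in=D("550.00"), total_out=D("200.00"))  # passes
```

Dropping the middle row makes both checks raise: the balances no longer chain (1500.00 + 50.00 is not 1350.00), and the summed debits fall short of the header total. Decimal is used instead of float so that currency amounts compare exactly.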

Contributing a bank parser

See CONTRIBUTING.md for the full checklist: gate setup, shared helpers (parsers/_money.py, parsers/_columnar.py, _xlsx.py), supported_formats declaration, XLSX redactor dispatch, dual-fixture testing rule, fixture privacy, and the Conventional Commits release gate.

Quick form:

  • Copy parsers/palmpay.py (PDF-only) or parsers/opay.py (PDF + XLSX) as the template.
  • Reuse the shared helpers and declare supported_formats.
  • Drop the raw statement at tests/<bank>/fixtures/_local/statement.{pdf,xlsx} (gitignored).
  • Redact it into sample.{pdf,xlsx} and commit only the redacted sample.

CI runs ruff, pyright (strict), and pytest; all three must pass clean. The reconciliation invariant must hold on every fixture, unless the parser opts out via ParseResult.row_wise_reconcilable=False and supplies header totals for verify_totals.

Fixtures must be redacted, with account numbers, names, addresses, and transaction IDs scrubbed. Never commit unredacted statements.
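
As an extra safety net before committing a sample, text extracted from the fixture can be scanned for anything that still looks like an account number. The sketch below is a hypothetical check, not part of bankstract or its redactors, and it assumes account numbers are 10-digit NUBAN values.

```python
import re
from typing import List

# Assumption: an account number is a run of exactly 10 digits (NUBAN).
# The lookarounds stop the pattern matching inside longer digit runs,
# such as 11-digit phone numbers.
ACCOUNT_LIKE = re.compile(r"(?<!\d)\d{10}(?!\d)")


def find_account_like(text: str) -> List[str]:
    """Return every standalone 10-digit run found in the text."""
    return ACCOUNT_LIKE.findall(text)


leaky = find_account_like("Transfer to 0123456789 completed")
clean = find_account_like("Transfer to ******6789, phone 08012345678")
```

Here leaky contains the unredacted number and clean is empty. A check like this catches only one class of leak; names, addresses, and transaction IDs still need the redactor and a manual review.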

License

MIT. Author: logickoder.
