Anonymise UK bank statement PDFs by scrambling personal data while preserving document structure.

uk-bank-statement-anonymiser

Anonymise UK bank statement PDFs by scrambling personal data while preserving the document's visual structure and layout. All letters in transaction descriptions are replaced with random alternatives; dates, payment codes, protected phrases, and numeric identifiers (sort codes, account numbers, IBANs, card numbers) are handled deterministically so the anonymised output remains internally consistent across pages.

Supported statement types

  • HSBC UK current account
  • HSBC UK savings account
  • NatWest current account
  • TSB Spend & Save account
  • TSB credit card

Other UK bank PDFs may work, but have not been tested.

Requirements

  • Python 3.14+
  • pikepdf (installed automatically)

Installation

pip install uk-bank-statement-anonymiser

Quick start

from bank_statement_anonymiser import anonymise_pdf

# Minimal — output written alongside input as "anonymised_<original_name>.pdf"
anonymise_pdf("statement.pdf")

# Explicit output path (recommended — avoids exposing the original filename)
anonymise_pdf("statement.pdf", "safe_output_name.pdf")

User config files

The library ships two system config files (bundled in the package, committed to source control) that cover common protected phrases and known numeric patterns:

File                          Purpose
always_anonymise_system.toml  Force specific strings to a known replacement value
never_anonymise_system.toml   Protect specific phrases from being scrambled

You can supplement these with your own files passed as arguments to anonymise_pdf:

anonymise_pdf(
    "statement.pdf",
    "output.pdf",
    always_anonymise_path="my_always_anonymise.toml",
    never_anonymise_path="my_never_anonymise.toml",
)

User entries are merged with system entries. On a clash in always_anonymise, the user file wins. never_anonymise is a union of both files.

User config files should not be committed to source control — they will typically contain real account numbers, sort codes, or names that you are trying to protect.

always_anonymise.toml format

# Force exact string replacements before the scramble pass.
# User file wins over system file on a clash.

"40-37-28" = "00-00-00"
"12345678" = "00000000"
"Jason Farrar" = "John Doe"

never_anonymise.toml format

# Phrases listed here are left exactly as-is during the scramble pass.
# Matching is case-insensitive and whitespace-insensitive.

exclude = [
    "My Bank",
    "My Employer Ltd",
]
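
One plausible reading of "case-insensitive and whitespace-insensitive" is sketched below. The function names normalise and is_protected are hypothetical, and the sketch assumes that all whitespace is ignored and that a fragment is protected when it contains a listed phrase; the library's actual matching rules may differ.

```python
def normalise(text: str) -> str:
    # Remove every whitespace character and fold case,
    # so "My  Bank" and "MYBANK" compare equal.
    return "".join(text.split()).casefold()


def is_protected(fragment: str, exclude: list[str]) -> bool:
    # Assumption: a fragment is protected if it contains any listed phrase
    # after normalisation. Blank phrases are skipped so they never match.
    haystack = normalise(fragment)
    return any(normalise(phrase) in haystack for phrase in exclude if phrase.strip())


exclude = ["My Bank", "My Employer Ltd"]
print(is_protected("Payment from MY  EMPLOYER LTD", exclude))
print(is_protected("Tesco Stores", exclude))
```

Ignoring whitespace entirely makes matching robust to PDF text that is split into fragments, but it can also match across word boundaries, so short phrases are best avoided in the exclude list under this interpretation.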

API reference

anonymise_pdf

def anonymise_pdf(
    input_path: str | Path,
    output_path: str | Path | None = None,
    always_anonymise_path: str | Path | None = None,
    never_anonymise_path: str | Path | None = None,
    debug: bool = False,
) -> Path

Anonymises a single PDF and returns the path to the output file.

Parameter              Description
input_path             Path to the input PDF
output_path            Path for the output PDF. If omitted, writes anonymised_<stem><suffix> in the same directory as the input
always_anonymise_path  Path to a user always_anonymise.toml (optional)
never_anonymise_path   Path to a user never_anonymise.toml (optional)
debug                  Print diagnostic information to stdout when True

How it works

  1. Numeric ID detection — a document-level scan identifies sort codes, account numbers, IBANs, and card numbers. Each is replaced with a deterministic fake value (last two digits tiled across the full length, e.g. 40-37-28 → 28-28-28). always_anonymise overrides take priority.

  2. Protected phrase detection — fragments matching dates, payment type codes, URLs, numeric values, or entries in never_anonymise configs are marked as protected and left unchanged.

  3. Content stream rewrite — pikepdf rewrites the PDF content streams directly, substituting scrambled bytes for original text bytes. Font encoding (Latin-1 and ToUnicode/CMap) is handled transparently, including subset-embedded fonts.
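
The digit tiling in step 1 can be sketched as follows. The function tile_last_two_digits is a hypothetical illustration rather than the library's implementation; it assumes that separators such as hyphens and spaces stay in place and that only digit positions are overwritten.

```python
def tile_last_two_digits(identifier: str) -> str:
    """Hypothetical sketch: repeat the last two digits across every digit position."""
    digits = [ch for ch in identifier if ch.isdigit()]
    if len(digits) < 2:
        # Nothing to tile with; return the input unchanged.
        return identifier
    pair = digits[-2:]
    out = []
    position = 0
    for ch in identifier:
        if ch.isdigit():
            # Alternate between the two trailing digits.
            out.append(pair[position % 2])
            position += 1
        else:
            # Keep separators (hyphens, spaces) where they are.
            out.append(ch)
    return "".join(out)


print(tile_last_two_digits("40-37-28"))  # 28-28-28
print(tile_last_two_digits("12345678"))  # 78787878
```

Because the result depends only on the input string, the same identifier maps to the same fake value wherever it appears, which is what keeps the output consistent across pages.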

Licence

MIT — see LICENSE.
