Anonymise UK bank statement PDFs by scrambling personal data while preserving document structure.
uk-bank-statement-anonymiser
Anonymise UK bank statement PDFs by scrambling personal data while preserving the document's visual structure and layout. All letters in transaction descriptions are replaced with random alternatives; dates, payment codes, protected phrases, and numeric identifiers (sort codes, account numbers, IBANs, card numbers) are handled deterministically so the anonymised output remains internally consistent across pages.
Supported statement types
- HSBC UK current account
- HSBC UK savings account
- Natwest current account
- TSB Spend & Save account
- TSB credit card
Why only these banks? The library supports these specific banks because each uses a different PDF encoding strategy. Other UK bank PDFs may work if they use one of the same approaches, but have not been tested.
The supported encoding types are:
- HSBC uses single-byte Latin-1 encoding (WinAnsiEncoding)
- Natwest uses multi-byte Identity-H CID fonts
- TSB uses custom ToUnicode CMap reencoding
For technical details on how these encodings are detected and handled, see Technical design → Encoding strategies.
Requirements
- Python 3.14+
- pikepdf (installed automatically)
Installation
pip install uk-bank-statement-anonymiser
Quick start
By default, the library automatically detects and anonymises dates, sort codes, account numbers, card numbers, and other sensitive patterns. For custom rules—to force specific replacements or protect additional phrases—see User config files below.
from bank_statement_anonymiser import anonymise_pdf
# Minimal — output written alongside input
anonymise_pdf("statement.pdf")
# Output: anonymised_statement.pdf (in the same directory as input)
# Explicit output path (recommended — avoids exposing the original filename)
anonymise_pdf("statement.pdf", "safe_output_name.pdf")
User config files
The library ships two system config files (bundled in the package, committed to source control) that cover common protected phrases and known numeric patterns:
| File | Purpose |
|---|---|
| always_anonymise_system.toml | Force specific strings to a known replacement value |
| never_anonymise_system.toml | Protect specific phrases from being scrambled |
You can supplement these with your own files passed as arguments to anonymise_pdf:
anonymise_pdf(
    "statement.pdf",
    "output.pdf",
    always_anonymise_path="my_always_anonymise.toml",
    never_anonymise_path="my_never_anonymise.toml",
)
System config provides defaults; your custom config overrides or extends them. For always_anonymise: your rules win on any clash. For never_anonymise: both system and user lists are combined (union).
User config files should not be committed to source control — they will typically contain real account numbers, sort codes, or names that you are trying to protect.
always_anonymise.toml format
# Force exact string replacements before the scramble pass.
# User file wins over system file on a clash.
"40-37-28" = "00-00-00"
"12345678" = "00000000"
"Jason Farrar" = "John Doe"
never_anonymise.toml format
# Phrases listed here are left exactly as-is during the scramble pass.
# Matching is case-insensitive and whitespace-insensitive.
exclude = [
    "My Bank",
    "My Employer Ltd",
]
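One way to picture case-insensitive and whitespace-insensitive matching is to normalise both sides before comparing. The sketch below assumes that "whitespace-insensitive" means all whitespace is ignored. The library may instead collapse runs of whitespace, and the helper names normalise and is_protected are hypothetical.

```python
def normalise(text: str) -> str:
    """Drop all whitespace and fold case so comparisons ignore both."""
    return "".join(text.split()).casefold()


def is_protected(candidate: str, exclude: list[str]) -> bool:
    """Return True when the candidate equals a protected phrase after normalising."""
    wanted = {normalise(phrase) for phrase in exclude}
    return normalise(candidate) in wanted


exclude = ["My Bank", "My Employer Ltd"]
```

Under this assumption, is_protected("MY  EMPLOYER LTD", exclude) is true, while "My Banker" is not protected because the normalised strings differ.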
Example: Custom protection rules
If your bank statement includes regular suppliers, employers, or other names you want to keep readable, add them to your user never_anonymise.toml:
# my_never_anonymise.toml
exclude = [
    "ACME Corporation",
    "Salary Payment",
    "Rent Payment",
]
Then pass it to anonymise_pdf:
anonymise_pdf(
    "statement.pdf",
    "output.pdf",
    never_anonymise_path="my_never_anonymise.toml",
)
The matched phrases will remain readable in the output PDF, while everything else is scrambled.
API reference
anonymise_pdf
def anonymise_pdf(
    input_path: str | Path,
    output_path: str | Path | None = None,
    always_anonymise_path: str | Path | None = None,
    never_anonymise_path: str | Path | None = None,
    debug: bool = False,
) -> Path
Anonymises a single PDF and returns the path to the output file.
| Parameter | Description |
|---|---|
| input_path | Path to the input PDF |
| output_path | Path for the output PDF. If omitted, writes anonymised_<stem><suffix> in the same directory as the input |
| always_anonymise_path | Path to a user always_anonymise.toml (optional) |
| never_anonymise_path | Path to a user never_anonymise.toml (optional) |
| debug | When True, print diagnostic information about config loading, numeric ID detection, and per-page pair building to stdout (optional; default False) |
| Returns | Absolute path to the output PDF file |
| Raises | FileNotFoundError if input_path does not exist |
Error handling
FileNotFoundError: Raised if input_path does not exist.
Other errors: PDF parsing or re-encoding failures are logged to stdout when debug=True but do not stop processing — the output PDF is written with any non-matching or un-reencoded fragments left unchanged.
When debug=True, diagnostic output includes:
- Config file loading status (system and user files)
- Numeric ID detection results (IBANs, sort codes, card numbers found)
- Per-page pair building details (count of always-anonymise, protected, and scramble pairs)
How it works
The anonymiser works in three steps:
- Identify sensitive data: detects sort codes, account numbers, IBANs, card numbers, and other patterns defined in config. Each gets a deterministic fake replacement (e.g. 40-37-28 → 28-28-28, the last two digits repeated). This ensures the same data point is always replaced with the same fake value, even across multiple pages.
- Protect structural text: dates, payment type codes, bank URLs, and any phrases in your never_anonymise config are left unchanged. This preserves the document's readability and structure.
- Scramble remaining text: all other letters are scrambled (e.g. Barclays → Dqhyqbvd), while digits and symbols stay intact. The PDF's layout, fonts, images, and line breaks remain unchanged.
Because the replacements in the first step are deterministic, a given sensitive value maps to the same fake value on every page, while all remaining letters are scrambled.
Technical design
Why content-stream parsing?
The library parses PDF content streams directly via pikepdf.parse_content_stream() rather than using pdfplumber or similar tools. This approach is essential because pdfplumber merges multiple Tj text operators into visual "words," which loses the fragment boundaries that the anonymiser relies on to match phrases like sort codes and account numbers. Direct content-stream parsing preserves the original PDF encoding structure, enabling accurate font detection and re-encoding of scrambled text back to valid PDF bytes.
Encoding strategies
Different banks use different PDF text encodings. The library automatically detects and handles three encoding types:
- Latin-1 (WinAnsiEncoding) uses single-byte glyph codes, where each byte from 0–255 maps to one character. HSBC statements use this encoding.
- ToUnicode CMaps define custom character-to-Unicode mappings embedded in the PDF font itself. TSB statements use this approach for special layout control.
- Identity-H CID fonts use multi-byte character IDs (CIDs) for complex encoding scenarios. Natwest 2025 statements employ this strategy with 2-byte big-endian CID sequences.
For each text fragment discovered in the content stream, the library: (1) decodes the raw bytes using the font's encoding (consulting the ToUnicode CMap if present, otherwise falling back to Latin-1), (2) applies the appropriate transformation (replacement, protection, or scrambling), and (3) re-encodes the result using the same encoding path, ensuring that replacement bytes are always valid for the target font.
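The decode, transform, re-encode round trip can be illustrated for the single-byte and two-byte cases. The sketch below is a simplification: the ToUnicode table is a hypothetical three-glyph dictionary rather than a parsed CMap, and the function names are invented for the example.

```python
def decode_latin1(raw: bytes) -> str:
    """Single-byte path: each byte maps to one character."""
    return raw.decode("latin-1")


def encode_latin1(text: str) -> bytes:
    """Inverse of decode_latin1."""
    return text.encode("latin-1")


def decode_cid(raw: bytes, to_unicode: dict[int, str]) -> str:
    """Read 2-byte big-endian CIDs and look each one up in the table."""
    cids = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return "".join(to_unicode[cid] for cid in cids)


def encode_cid(text: str, to_unicode: dict[int, str]) -> bytes:
    """Invert the table so every emitted byte pair exists in the font."""
    from_unicode = {char: cid for cid, char in to_unicode.items()}
    return b"".join(from_unicode[char].to_bytes(2, "big") for char in text)


# Hypothetical ToUnicode table for a font with three glyphs.
TO_UNICODE = {0x0024: "A", 0x0025: "B", 0x0026: "C"}

raw = bytes.fromhex("002400250026")
text = decode_cid(raw, TO_UNICODE)                       # decode
replaced = text.translate(str.maketrans("ABC", "CAB"))   # transform
reencoded = encode_cid(replaced, TO_UNICODE)             # re-encode
```

Re-encoding through the inverse of the same table is what keeps the replacement bytes valid for the target font. A character with no entry in the table would raise a KeyError in this sketch.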
Per-page processing (three phases)
The three-step architecture described in "How it works" above operates as follows at the implementation level.
Phase 1 (line-aware scan) walks through all text operators in the PDF content stream. A line accumulator tracks the current visual line and resets at Td, TD, T*, Tm, or ET operators; this enables phrase matching that spans multiple Tj operators rendered on the same line (critical for multi-word phrases like "My Employer Ltd"). Within each line, a sliding window tests each start position by extending rightward, joining decoded fragment texts, and comparing against user rules in always_anonymise, system and user rules in never_anonymise, and built-in patterns (dates, amounts, sort codes, payment codes). The first match wins, and the start pointer advances past the matched span.
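The line accumulator and sliding window can be sketched over a simplified list of (operator, text) tuples. This is an assumption-laden illustration: real content streams carry encoded operands rather than decoded strings, matching here is exact string equality rather than the library's rules and patterns, and the function names are hypothetical.

```python
LINE_BREAK_OPERATORS = {"Td", "TD", "T*", "Tm", "ET"}


def split_into_lines(operations: list[tuple[str, str]]) -> list[list[str]]:
    """Group Tj fragment texts into visual lines, resetting at line-break operators."""
    lines: list[list[str]] = []
    current: list[str] = []
    for operator, text in operations:
        if operator == "Tj":
            current.append(text)
        elif operator in LINE_BREAK_OPERATORS and current:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


def find_matches(fragments: list[str], phrases: set[str]) -> list[tuple[int, int]]:
    """Return (start, end) fragment spans whose joined text equals a phrase."""
    matches: list[tuple[int, int]] = []
    start = 0
    while start < len(fragments):
        for end in range(start + 1, len(fragments) + 1):
            if "".join(fragments[start:end]) in phrases:
                matches.append((start, end))
                start = end  # first match wins; skip past the matched span
                break
        else:
            start += 1  # nothing matched from this start position
    return matches


operations = [
    ("Tm", ""), ("Tj", "My "), ("Tj", "Employer "), ("Tj", "Ltd"),
    ("Tj", " 1,200.00"), ("Td", ""), ("Tj", "Tesco"), ("ET", ""),
]
lines = split_into_lines(operations)
spans = find_matches(lines[0], {"My Employer Ltd"})
```

In this example the phrase is split across three Tj fragments, and the window finds it as the span covering fragments 0 to 2 of the first line.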
Phase 2 (build byte pairs) iterates through all fragments discovered in Phase 1. If a fragment matched an always_anonymise rule, the replacement text is distributed across the original fragment slots (filling each to its original length, with the last slot absorbing any overflow or underflow), creating an (original_bytes, replacement_bytes) pair. If the fragment is protected (it matched never_anonymise or a built-in pattern), it is skipped. Otherwise, the fragment is scramblable: letters are replaced via a per-document scramble map, while digits and symbols remain unchanged, producing an (original_bytes, scrambled_bytes) pair.
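Distributing a replacement across the original fragment slots can be sketched as below. The function name distribute and its list-of-lengths interface are hypothetical, and the sketch works on decoded text rather than encoded bytes.

```python
def distribute(replacement: str, slot_lengths: list[int]) -> list[str]:
    """Fill each slot to its original length; the last slot takes what remains."""
    pieces: list[str] = []
    position = 0
    for length in slot_lengths[:-1]:
        pieces.append(replacement[position:position + length])
        position += length
    # The final slot absorbs overflow (longer text) or underflow (shorter text).
    pieces.append(replacement[position:])
    return pieces
```

With original fragments "Jason " and "Farrar" (lengths 6 and 6), the replacement "John Doe" would be split into "John D" and "oe", so the last slot ends up shorter than the original.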
Phase 3 (rewrite content stream) takes the pairs built in Phase 2 and performs a dictionary-based lookup: wherever an original byte sequence appears in the content stream, it is replaced with the corresponding replacement byte sequence. Because every substitution comes from the same dictionary and is applied in one pass, the output is deterministic and consistent.
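A dictionary-based rewrite can be pictured as a lookup on each text operand. The sketch below assumes whole-operand matching, which may differ from how the library locates byte sequences, and the name rewrite_operands is hypothetical.

```python
def rewrite_operands(operands: list[bytes], pairs: dict[bytes, bytes]) -> list[bytes]:
    """Swap each operand found in the pairs dictionary; leave the rest untouched."""
    return [pairs.get(operand, operand) for operand in operands]


pairs = {b"Barclays": b"Dqhyqbvd", b"40-37-28": b"28-28-28"}
operands = [b"Barclays", b"12 Jan 25", b"40-37-28", b"Barclays"]
rewritten = rewrite_operands(operands, pairs)
```

Operands without an entry, such as the protected date here, pass through unchanged, and repeated operands always receive the same replacement.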
Licence
MIT — see LICENSE.