
A hiding-in-plain-sight module for Dutch medical text.


🇳🇱 dutch-med-hips

A robust, highly configurable PHI anonymization and surrogate generation toolkit designed for Dutch medical and radiology reports.
It replaces PHI tokens (e.g. <PERSOON>, <Z-NUMMER>, <DATUM>) with realistic surrogate values such as names, ages, dates, hospitals, study names, IDs, BSNs, IBANs, phone numbers, URLs, emails, and more.

dutch-med-hips uses:

  • Faker for Dutch-language surrogate generation
  • Configurable templates for ID-like fields
  • Locale dictionaries for hospitals, months, cities, study names
  • A combined regex engine for fast and safe substitution
  • Document-level deterministic seeding (optional)
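
The "combined regex" idea can be sketched as follows. This is only an illustration of the technique, not the library's actual code: the `TAGS` and `GENERATORS` tables and the `substitute` helper are hypothetical, and the fixed strings stand in for the Faker, template, and locale-pool generators.

```python
import re

# Hypothetical category -> surrogate generator table (fixed strings here;
# the real toolkit generates varied surrogates).
GENERATORS = {
    "person": lambda: "Jan Steen",
    "date": lambda: "3 februari 2025",
    "znumber": lambda: "Z123456",
}

# Hypothetical category -> literal tags table.
TAGS = {
    "person": ["<PERSOON>"],
    "date": ["<DATUM>"],
    "znumber": ["<Z-NUMMER>"],
}

# One alternation with a named group per category, so the text is scanned
# once. re.escape keeps characters such as "." and "-" literal.
COMBINED = re.compile(
    "|".join(
        "(?P<{}>{})".format(name, "|".join(re.escape(tag) for tag in tags))
        for name, tags in TAGS.items()
    )
)

def substitute(text: str) -> str:
    # m.lastgroup is the name of the category whose group matched.
    return COMBINED.sub(lambda m: GENERATORS[m.lastgroup](), text)

print(substitute("Patiënt <PERSOON>, <Z-NUMMER>, gezien op <DATUM>."))
# Patiënt Jan Steen, Z123456, gezien op 3 februari 2025.
```

Because every tag is replaced in a single pass, a surrogate that happens to look like another tag is never substituted a second time.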

Installation

You can install dutch-med-hips via pip:

pip install dutch-med-hips

or from source:

git clone https://github.com/DIAGNijmegen/dutch-med-hips.git
cd dutch-med-hips
pip install -e .

Quickstart

Python API

Running HideInPlainSight in your Python code is straightforward:

from dutch_med_hips import HideInPlainSight

text = """
Patiënt <PERSOON> werd opgenomen in <HOSPITAL_NAME> op <DATUM>.
Z-nummer: <Z-NUMMER>, BSN: <BSN>, Email: <EMAIL>.
Rapport ID: <RAPPORT_ID.T_NUMMER>.
""" 

hips = HideInPlainSight()
result = hips.run(text)

print(result["text"])
print(result["mapping"])  # Shows original -> surrogate mapping

Command-Line Interface

You can also use dutch-med-hips directly from the command line:

dutch-med-hips [OPTIONS]

Common Options

Option                     Meaning
-i, --input PATH           Input file (UTF-8). Use - or omit to read from stdin.
-o, --output PATH          Output file (UTF-8). Use - or omit to write to stdout.
--mapping-out PATH         Write the JSON mapping (original → surrogate) to a file.
--seed N                   Use a fixed seed for deterministic surrogate generation.
--no-document-hash-seed    Disable automatic seeding based on the document hash.
--no-header                Disable the anonymization disclaimer header.
--disable-typos            Disable random typo injection in surrogates.

Features

🔐 PHI Surrogates

👤 People & Demographics

  • Person names
    • Dutch-style names (first/last), tussenvoegsels (van, de, …)
    • Variants: first-only, last-only, full, initials (J. Jansen, J.S. Jansen)
    • Randomized casing (jan jansen, JAN JANSEN, Jansen, Jan)
  • Person initials
    • Derived from full fake names: Jan Steen → JS, Vincent van Gogh → VvG
  • Age
    • Sampled from a hospital-like Gaussian mixture model (more 40–85 year olds)
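
Two of the items above, deriving initials and sampling ages from a Gaussian mixture, can be sketched with the standard library. The mixture weights, means, and standard deviations below are made-up illustrative values, and `initials` and `sample_age` are hypothetical helper names, not part of the package.

```python
import random

def initials(full_name: str) -> str:
    # First letter of each word, keeping its case, so a lowercase
    # tussenvoegsel such as "van" contributes a lowercase letter.
    return "".join(word[0] for word in full_name.split())

# Hypothetical mixture: (weight, mean, standard deviation) per component.
# Most of the weight sits on the older component.
AGE_COMPONENTS = [(0.2, 30.0, 12.0), (0.8, 65.0, 12.0)]

def sample_age(rng: random.Random) -> int:
    weights = [weight for weight, _, _ in AGE_COMPONENTS]
    # Pick a component by weight, then draw from its Gaussian.
    _, mean, std = rng.choices(AGE_COMPONENTS, weights=weights, k=1)[0]
    # Clamp to a plausible human age range.
    return max(0, min(110, round(rng.gauss(mean, std))))

rng = random.Random(0)
print(initials("Jan Steen"), initials("Vincent van Gogh"))  # JS VvG
print([sample_age(rng) for _ in range(5)])
```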

🧾 Identifiers & Numbers

  • Patient IDs / Z-numbers / generic PHI numbers
    • All driven by simple templates per tag (e.g. <Z-NUMMER> → Z######)
    • Template mini-language (# = digit, etc.)
  • Document IDs & sub-IDs
    • Main report IDs from templates
    • Sub-IDs like <RAPPORT_ID.T_NUMMER> → T123456
  • BSN
    • Dutch BSN-like numbers via Faker ssn()
  • IBAN
    • Dutch IBANs via Faker, compact or grouped (NL91ABNA0417164300, NL91 ABNA 0417 1643 00)
  • Accreditation number
    • Always M + 3 digits (e.g. M007, M123)
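
A minimal sketch of how such a template could be expanded, assuming only the `#` = digit rule stated above. Any other symbols of the mini-language are not covered here, and `expand_template` is a hypothetical helper rather than a function of the package.

```python
import random

def expand_template(template: str, rng: random.Random) -> str:
    # "#" becomes a random digit; every other character is copied as-is.
    return "".join(
        str(rng.randrange(10)) if ch == "#" else ch for ch in template
    )

rng = random.Random(42)
print(expand_template("Z######", rng))    # Z followed by six digits
print(expand_template("Z-###-###", rng))  # grouped variant
print(expand_template("M###", rng))       # accreditation-style number
```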

🏥 Hospitals, Locations & Studies

  • Hospital names
    • Realistic Dutch hospital pool with full names and abbreviations
      e.g. Amsterdam UMC locatie AMC, AMC, Radboudumc, LUMC, ADRZ
    • Sometimes uses only the city as shorthand (e.g. Amsterdam, Nijmegen)
  • Locations
    • Dutch cities and place names drawn from hospital/location data
  • Study names
    • Curated list of real-looking study labels and variants
      e.g. LEMA, Donan, M-SPECT/mSPECT, Alpe d'Huzes MRI, TULIP, PRIAS

📅 Dates & Times

  • Dates
    • Dutch-style formats:
      • Numeric: D-M, DD-MM, with/without year (03-02-2025, 3-2-12)
      • Named months: 3 februari, 3 feb 2025
    • Mix of year/no-year, numeric vs month-name
    • Start/end date range configuration (e.g. last 10 years)
  • Times
    • 24h clock formats: 13:45, 13:45 uur, 13.45, 13u45
    • Natural Dutch phrases: kwart voor zes, kwart over drie, half vier

📞 Contact & Online

  • Phone numbers
    • Dutch mobile numbers (06-12345678, +31 6 12345678)
    • Landlines / hospital numbers (020-5669111, 088-…)
    • Internal SEIN/pager numbers (4–5 digit codes)
  • Email addresses
    • Fake but valid emails via Faker (customizable domains)
  • URLs
    • Fake but valid http(s) URLs via Faker (can be styled to look like portals/EPD endpoints)

🏠 Addresses & Misc

  • Addresses
    • Dutch-style street + number + postcode + city via Faker (nl_NL)
  • Other PHI
    • Any additional tag-based IDs or tokens configured via templates can be mapped to surrogates in the same way.

🔧 Flexible Configuration

All defaults live in settings.py and can be overridden at runtime:

from dutch_med_hips import settings

settings.ID_TEMPLATES_BY_TAG["<Z-NUMMER>"] = "Z-###-###"
settings.PERSON_NAME_REUSE_PROB = 0.15
settings.ENABLE_TYPOS = True

🧪 Deterministic Output

By default, the system hashes the document to derive a seed, so the same document always produces the same surrogates. You can turn this off or provide your own fixed seed:

hips = HideInPlainSight(seed=123)

Note: Using a fixed seed means the same input document will always yield the same output document and the same surrogate mappings.
Different documents will still produce different outputs.
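
Document-hash seeding can be illustrated with a short sketch. The choice of SHA-256 and the helper names `seed_from_document` and `make_rng` are assumptions; the package may derive its seed differently.

```python
import hashlib
import random
from typing import Optional

def seed_from_document(text: str) -> int:
    # SHA-256 is an assumption; any stable hash would do. Python's built-in
    # hash() is unsuitable because string hashes are salted per process.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def make_rng(text: str, seed: Optional[int] = None) -> random.Random:
    # An explicit seed wins; otherwise the seed is derived from the document.
    return random.Random(seed if seed is not None else seed_from_document(text))

doc = "Patiënt <PERSOON> werd opgenomen op <DATUM>."
# The same document gives the same random stream on every run.
print(make_rng(doc).random() == make_rng(doc).random())  # True
```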

✏️ Optional Typo Injection

Using the typo Python package, some surrogates can receive:

  • Adjacent-key typos
  • Insertions
  • Deletions

You can enable/disable this globally.
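
To make the three typo kinds concrete, here is a standard-library sketch. It deliberately does not use the typo package, so its API is not shown; the neighbour map is a small hypothetical subset of a QWERTY layout, and the default probability is an arbitrary illustrative value.

```python
import random

# Partial, hypothetical QWERTY neighbour map; a real one would cover all keys.
NEIGHBOURS = {"a": "qsz", "e": "wrd", "n": "bm", "s": "adw", "t": "ry"}

def adjacent_key(text: str, rng: random.Random) -> str:
    # Replace one character that has known neighbours with a nearby key.
    positions = [i for i, ch in enumerate(text) if ch.lower() in NEIGHBOURS]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + rng.choice(NEIGHBOURS[text[i].lower()]) + text[i + 1:]

def insertion(text: str, rng: random.Random) -> str:
    # Duplicate one character to simulate a double key press.
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] + text[i:]

def deletion(text: str, rng: random.Random) -> str:
    # Drop one character, but never empty the string.
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def maybe_typo(text: str, rng: random.Random, prob: float = 0.05) -> str:
    # Leave most surrogates untouched; otherwise apply one random typo kind.
    if rng.random() >= prob:
        return text
    return rng.choice([adjacent_key, insertion, deletion])(text, rng)

rng = random.Random(0)
print([maybe_typo("Jansen", rng, prob=1.0) for _ in range(3)])
```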

⚠️ Automatic Disclaimer Header

Every anonymized document can automatically receive an anonymization disclaimer at the top.
You may customize or disable it.


Defining Custom PHI Tags

dutch-med-hips allows users to extend PHI detection and surrogate generation without modifying the library.
Below are the three most common customization patterns, with clear examples:

  1. Add a new regex to an existing PHI category (easiest)
  2. Create a new ID-style tag with its own format (advanced)
  3. Extend surrogate pools (e.g. add a new study name)

1️⃣ Adding a New Regex to Person Names (Simple Example)

If your reports use additional person‑name markers such as <NM> or <PERSOON_NAAM>, you can plug them into the existing system.

Just add them to the existing PHI type:

from dutch_med_hips import schema
from dutch_med_hips.schema import PHIType

schema.DEFAULT_PATTERNS[PHIType.PERSON_NAME].extend([
    r"<NM>",
    r"<PERSOON_NAAM>",
])

These new tags will:

✔ Be recognized as person names
✔ Use the existing name surrogate logic
✔ Appear in the mapping as phi_type="person_name"

Example:

from dutch_med_hips import HideInPlainSight

text = "Patiënt <NM> en begeleider <PERSOON_NAAM> kwamen binnen."
print(HideInPlainSight(seed=1).run(text)["text"])

2️⃣ Creating a New ID Tag with a Custom Format (Advanced)

Suppose your system uses <CENTER_ID> and you want outputs like:

CEN-123456

Step 1 — Add a detection pattern

from dutch_med_hips import schema
from dutch_med_hips.schema import PHIType

schema.DEFAULT_PATTERNS.setdefault(PHIType.GENERIC_ID, []).append(
    r"<CENTER_ID>"
)

Step 2 — Assign a surrogate template

Use the template mini‑language (# = digit):

from dutch_med_hips import settings

settings.ID_TEMPLATES_BY_TAG["<CENTER_ID>"] = "CEN-######"

Step 3 — Use it

from dutch_med_hips import HideInPlainSight

text = "Centrum: <CENTER_ID>."
print(HideInPlainSight(seed=42).run(text)["text"])
# e.g. Centrum: CEN-123456. (the six digits are generated, so yours will differ)

3️⃣ Adding New Items to Surrogate Pools (e.g., Study Names)

Some surrogate categories use pools, such as the curated list of Dutch medical study names in locale.py:

STUDY_NAME_POOL = [
    "LEMA",
    "Donan",
    ["M-SPECT", "mSPECT"],
    ...
]

To add your own study:

from dutch_med_hips import settings

settings.STUDY_NAME_POOL.append("MY-NEW-STUDY")

Or add multiple variants:

settings.STUDY_NAME_POOL.append([
    "CUSTOMTRIAL",
    "Custom Trial",
    "CT-Study"
])

The generator will randomly pick one of the variants.

Example

from dutch_med_hips import HideInPlainSight

text = "Onderzoek: <STUDY_NAME>."
hips = HideInPlainSight(seed=123)
print(hips.run(text)["text"])
# e.g. Onderzoek: MY-NEW-STUDY. (any name from the pool may be chosen)
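
The pool shape used above, where an entry is either a single name or a list of variants, could be sampled in two steps. This is a sketch under that assumption; `pick_study_name` is a hypothetical helper, and the library's own selection logic may differ.

```python
import random

# Same shape as the pool above: a plain name, or a list of variants.
pool = ["LEMA", "Donan", ["M-SPECT", "mSPECT"]]
pool.append("MY-NEW-STUDY")
pool.append(["CUSTOMTRIAL", "Custom Trial", "CT-Study"])

def pick_study_name(pool: list, rng: random.Random) -> str:
    entry = rng.choice(pool)      # first pick a study
    if isinstance(entry, list):
        return rng.choice(entry)  # then pick one of its spellings
    return entry

rng = random.Random(123)
print([pick_study_name(pool, rng) for _ in range(5)])
```

Picking the study first and the variant second keeps a study with many spellings from being drawn more often than one with a single spelling.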

Optional: Full Custom Generator

For more complex surrogate rules, you can define your own PHI type and generator:

import re
from dutch_med_hips import schema, surrogates

def generate_center(match: re.Match) -> str:
    return "CENTER-" + "123456"

schema.DEFAULT_PATTERNS["center"] = [r"<CENTER_SPECIAL>"]
surrogates.DEFAULT_GENERATORS["center"] = generate_center

Now <CENTER_SPECIAL> maps through your custom function.


Summary

Task                                         Best Method
Add a new tag to an existing PHI category    Add regex to schema.DEFAULT_PATTERNS[...]
Create a new ID-like tag                     Add regex + assign template via settings.ID_TEMPLATES_BY_TAG
Add new study/hospital/name variants         Append to the appropriate pool in settings
Create fully custom surrogate logic          Register generator in surrogates.DEFAULT_GENERATORS

In short, customizing dutch-med-hips comes down to:
add a regex → (optionally) assign a template or generator → done.


Mapping Output Structure

result = hips.run(text) returns:

{
    "text": "anonymized text...",
    "mapping": [
        {
            "original": "<PERSOON>",
            "surrogate": "Jan Steen",
            "phi_type": "person_name",
            "start": 10,
            "end": 18
        },
        ...
    ]
}
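
As an example of consuming this structure, the sketch below counts replacements per PHI type and lists the original → surrogate pairs. The `result` dictionary is hand-written in the shape shown above, not real library output; the second entry and its "generic_id" value are assumptions, and the offsets are placeholders whose exact convention is not stated here.

```python
import json
from collections import Counter

# Hand-written sample in the documented shape (not actual output).
result = {
    "text": "anonymized text...",
    "mapping": [
        {"original": "<PERSOON>", "surrogate": "Jan Steen",
         "phi_type": "person_name", "start": 10, "end": 18},
        # Hypothetical second entry; the phi_type string is assumed.
        {"original": "<Z-NUMMER>", "surrogate": "Z123456",
         "phi_type": "generic_id", "start": 30, "end": 37},
    ],
}

# How many replacements were made per PHI type.
counts = Counter(entry["phi_type"] for entry in result["mapping"])

# Original tag -> surrogate pairs, in document order.
pairs = [(entry["original"], entry["surrogate"]) for entry in result["mapping"]]

print(dict(counts))
print(pairs)
# ensure_ascii=False keeps Dutch characters such as "ë" readable in the JSON.
print(json.dumps(result["mapping"], ensure_ascii=False, indent=2))
```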
