
Pseudonymization extensions for Dapla


Dapla Toolbelt Pseudo


Pseudonymize, repseudonymize and depseudonymize data on Dapla.

Features

More examples are available as notebook files for pseudo and depseudo.

Pseudonymize

from dapla_pseudo import Pseudonymize
import polars as pl

file_path="data/personer.csv"
dtypes = {"fnr": pl.Utf8, "fornavn": pl.Utf8, "etternavn": pl.Utf8, "kjonn": pl.Categorical, "fodselsdato": pl.Utf8}
df = pl.read_csv(file_path, dtypes=dtypes) # Create DataFrame from file

# Example: Single field default encryption (DAEAD)
result_df = (
    Pseudonymize.from_polars(df)                   # Specify what dataframe to use
    .on_fields("fornavn")                          # Select the field to pseudonymize
    .with_default_encryption()                     # Select the pseudonymization algorithm to apply
    .run()                                         # Apply pseudonymization to the selected field
    .to_polars()                                   # Get the result as a polars dataframe
)

# Example: Multiple fields default encryption (DAEAD)
result_df = (
    Pseudonymize.from_polars(df)                   # Specify what dataframe to use
    .on_fields("fornavn", "etternavn")             # Select multiple fields to pseudonymize
    .with_default_encryption()                     # Select the pseudonymization algorithm to apply
    .run()                                         # Apply pseudonymization to the selected fields
    .to_polars()                                   # Get the result as a polars dataframe
)

# Example: Single field sid mapping and pseudonymization (FPE)
result_df = (
    Pseudonymize.from_polars(df)                   # Specify what dataframe to use
    .on_fields("fnr")                              # Select the field to pseudonymize
    .with_stable_id()                              # Map the selected field to stable id
    .run()                                         # Apply pseudonymization to the selected fields
    .to_polars()                                   # Get the result as a polars dataframe
)

The default encryption algorithm is DAEAD (Deterministic Authenticated Encryption with Associated Data). However, if the field is a valid Norwegian personal identification number (fnr, dnr), the recommended way to pseudonymize is to use the function with_stable_id() to convert the identification number to a stable ID (SID) prior to pseudonymization. In that case, the pseudonymization algorithm is FPE (Format Preserving Encryption).
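
To illustrate why a deterministic algorithm matters, the sketch below uses HMAC-SHA256 from the Python standard library as a stand-in. It is not DAEAD and not what the pseudonymization service runs (HMAC cannot be reversed, so depseudonymization would be impossible with it); the function name `toy_pseudonym` and the key are hypothetical. It only shows the property that equal inputs under the same key give equal outputs, which is what keeps joins and group-bys working on pseudonymized columns.

```python
import hashlib
import hmac


def toy_pseudonym(value: str, key: bytes) -> str:
    """Deterministic keyed stand-in (HMAC-SHA256); NOT DAEAD and NOT reversible."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"demo-key-not-a-real-secret"  # hypothetical key, for illustration only
names = ["Kari", "Ola", "Kari"]
pseudonyms = [toy_pseudonym(name, key) for name in names]

# Equal inputs map to equal pseudonyms, different inputs to different ones.
same_input_same_output = pseudonyms[0] == pseudonyms[2]
different_input_different_output = pseudonyms[0] != pseudonyms[1]

# The same value under another key gives another pseudonym.
other_key_differs = toy_pseudonym("Kari", b"another-demo-key") != pseudonyms[0]
```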

Important: FPE requires a minimum of two bytes/characters to perform encryption, and a minimum of four bytes in the case of Unicode.
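
A pre-check for the length requirement could look like the sketch below. It encodes one reading of the rule (at least two characters, and at least four UTF-8 bytes when the value contains non-ASCII characters); that reading and the helper name `long_enough_for_fpe` are assumptions, not the library's own validation.

```python
def long_enough_for_fpe(value: str) -> bool:
    """One reading of the FPE length rule; an assumption, not the library's check."""
    if len(value) < 2:
        return False
    if value.isascii():
        return True
    # Non-ASCII input: require at least four bytes when encoded as UTF-8.
    return len(value.encode("utf-8")) >= 4


values = ["12345678901", "7", "", "øå", "ø1"]

# Values that would be rejected under this reading of the rule.
too_short = [v for v in values if not long_enough_for_fpe(v)]
```

Running such a check before pseudonymizing makes it easier to decide how to treat short or empty values up front.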

If a field cannot be converted using the function with_stable_id(), the default behaviour is to use the original value as input to the FPE encryption function. However, this behaviour can be changed by supplying an on_map_failure argument like this:

from dapla_pseudo import Pseudonymize

# Example: Single field sid mapping and pseudonymization (FPE), unmatching SIDs will return Null
result_df = (
    Pseudonymize.from_polars(df)
    .on_fields("fnr")
    .with_stable_id(on_map_failure="RETURN_NULL")
    .run()
    .to_polars()
)
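
The difference between the two behaviours can be simulated without the library. In the sketch below, `TOY_SID_CATALOG` and `toy_map_to_sid` are hypothetical stand-ins with made-up values; they only mimic the outcomes described above and say nothing about how the real SID catalog is queried.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the SID catalog (all values are made up).
TOY_SID_CATALOG = {"11111111111": "0001ha3", "22222222222": "0002kb7"}


def toy_map_to_sid(value: str, on_map_failure: Optional[str] = None) -> Optional[str]:
    """Mimic the two documented outcomes for a value that has no SID."""
    sid = TOY_SID_CATALOG.get(value)
    if sid is not None:
        return sid
    if on_map_failure == "RETURN_NULL":
        return None
    return value  # default: the original value is passed on to encryption


fnrs = ["11111111111", "99999999999"]
default_result = [toy_map_to_sid(v) for v in fnrs]
null_result = [toy_map_to_sid(v, on_map_failure="RETURN_NULL") for v in fnrs]
```

With the default, an unmatched value still ends up encrypted and is hard to tell apart from a mapped one afterwards; with "RETURN_NULL" the failure stays visible as a null.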

Reading dataframes

Note that you may also use a Pandas DataFrame as input or output by replacing from_polars with from_pandas and to_polars with to_pandas. However, Pandas is much less performant, so take care, especially if your dataset is large.

Example:

# Example: Single field default encryption (DAEAD)
df_pandas = (
    Pseudonymize.from_pandas(df)                   # Specify what dataframe to use
    .on_fields("fornavn")                          # Select the field to pseudonymize
    .with_default_encryption()                     # Select the pseudonymization algorithm to apply
    .run()                                         # Apply pseudonymization to the selected field
    .to_pandas()                                   # Get the result as a pandas dataframe
)

Validate SID mapping

from dapla_pseudo import Validator
import polars as pl

file_path="data/personer.csv"
dtypes = {"fnr": pl.Utf8, "fornavn": pl.Utf8, "etternavn": pl.Utf8, "kjonn": pl.Categorical, "fodselsdato": pl.Utf8}
df = pl.read_csv(file_path, dtypes=dtypes)

result = (
    Validator.from_polars(df)                   # Specify what dataframe to use
    .on_field("fnr")                            # Select the field to validate
    .validate_map_to_stable_id()                # Validate that all the field values can be mapped to a SID
)
# The resulting dataframe contains the field values that didn't have a corresponding SID
result.to_polars()

A sid_snapshot_date can also be specified to validate that the field values can be mapped to a SID at a specific date:

from dapla_pseudo import Validator
import polars as pl

file_path="data/personer.csv"
dtypes = {"fnr": pl.Utf8, "fornavn": pl.Utf8, "etternavn": pl.Utf8, "kjonn": pl.Categorical, "fodselsdato": pl.Utf8}

df = pl.read_csv(file_path, dtypes=dtypes)

result = (
    Validator.from_polars(df)
    .on_field("fnr")
    .validate_map_to_stable_id(
        sid_snapshot_date="2023-08-29"
    )
)
# Show metadata about the validation (e.g. which version of the SID catalog was used)
result.metadata
# Show the field values that didn't have a corresponding SID
result.to_polars()
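
The example passes the date as a "YYYY-MM-DD" string. Assuming that ISO format is what is expected (an inference from the example, not a documented guarantee), a small hypothetical helper can catch malformed dates before any request is made:

```python
from datetime import date


def parse_snapshot_date(text: str) -> date:
    """Parse an ISO date string, raising a clearer error on bad input."""
    try:
        return date.fromisoformat(text)
    except ValueError as exc:
        raise ValueError(
            f"sid_snapshot_date should look like 2023-08-29, got {text!r}"
        ) from exc


snapshot = parse_snapshot_date("2023-08-29")

# Convert back to the string form used in the example above.
sid_snapshot_date = snapshot.isoformat()
```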

Advanced usage

Pseudonymize

Pseudonymize using custom keys/keysets

from dapla_pseudo import Pseudonymize, PseudoKeyset

# Pseudonymize fields in a local file using the default key:
df = (
    Pseudonymize.from_polars(df)                            # Specify what dataframe to use
    .on_fields("fornavn")                                   # Select the field to pseudonymize
    .with_default_encryption()                              # Select the pseudonymization algorithm to apply
    .run()                                                  # Apply pseudonymization to the selected field
    .to_polars()                                            # Get the result as a polars dataframe
)

# Pseudonymize fields in a local file, explicitly denoting the key to use:
df = (
    Pseudonymize.from_polars(df)                            # Specify what dataframe to use
    .on_fields("fornavn")                                   # Select the field to pseudonymize
    .with_default_encryption(custom_key="ssb-common-key-2") # Select the pseudonymization algorithm to apply
    .run()                                                  # Apply pseudonymization to the selected field
    .to_polars()                                            # Get the result as a polars dataframe
)

# Pseudonymize a local file using a custom keyset:
import json
custom_keyset = PseudoKeyset(
    encrypted_keyset="CiQAp91NBhLdknX3j9jF6vwhdyURaqcT9/M/iczV7fLn...8XYFKwxiwMtCzDT6QGzCCCM=",
    keyset_info={
        "primaryKeyId": 1234567890,
        "keyInfo": [
            {
                "typeUrl": "type.googleapis.com/google.crypto.tink.AesSivKey",
                "status": "ENABLED",
                "keyId": 1234567890,
                "outputPrefixType": "TINK",
            }
        ],
    },
    kek_uri="gcp-kms://projects/some-project-id/locations/europe-north1/keyRings/some-keyring/cryptoKeys/some-kek-1",
)

df = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption(custom_key="1234567890") # Note that the custom key has to be the same as "primaryKeyId" in the custom keyset
    .run(custom_keyset=custom_keyset)
    .to_polars()
)
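
Because the custom key must match "primaryKeyId", it can be safer to derive it from the keyset information than to type it twice. The helper below is a hypothetical sketch on a plain dict shaped like the keyset_info above; the extra check that the primary key is listed as ENABLED is an added sanity check, not something the example requires.

```python
keyset_info = {
    "primaryKeyId": 1234567890,
    "keyInfo": [
        {"keyId": 1234567890, "status": "ENABLED"},
    ],
}


def primary_key_id(info: dict) -> str:
    """Return primaryKeyId as a string, checking that it refers to an enabled key."""
    primary = info["primaryKeyId"]
    enabled_ids = {k["keyId"] for k in info["keyInfo"] if k.get("status") == "ENABLED"}
    if primary not in enabled_ids:
        raise ValueError(f"primaryKeyId {primary} is not an enabled key in keyInfo")
    return str(primary)


# Value to pass as custom_key, matching the string form used in the example above.
custom_key = primary_key_id(keyset_info)
```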

Pseudonymize using custom rules

Instead of declaring the pseudonymization rules via the Pseudonymize functions, one can define the rules manually. This can be done via the PseudoRule class like this:

from dapla_pseudo import Pseudonymize, PseudoRule

rule_json = {
    'name': 'my-rule',
    'pattern': '**/identifiers/*',
    'func': 'redact(placeholder=#)' # This is a shorthand representation of the redact function
}

rule = PseudoRule.from_json(rule_json)

df = (
    Pseudonymize.from_polars(df)
    .add_rules(rule) # Add to pseudonymization rules
    .run()
    .to_polars()
)

Pseudonymization rules can also be read from file. This is especially handy when there are several rules, or if you prefer to store and maintain pseudonymization rules externally. For example:

from dapla_pseudo import Pseudonymize, PseudoRule
import json

with open("pseudo-rules.json", 'r') as rules_file:
    rules_json = json.load(rules_file)

pseudo_rules = [PseudoRule.from_json(rule) for rule in rules_json]

df = (
    Pseudonymize.from_polars(df)
    .add_rules(pseudo_rules)
    .run()
    .to_polars()
)
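
The loading code above iterates over the parsed JSON, which suggests the file holds a list of rule objects. The sketch below writes such a file to a temporary folder and reads it back, using only the standard library. The second rule, its pattern, and the `REQUIRED_KEYS` check are made up for illustration; the key names are taken from the rule example earlier in this section.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"name", "pattern", "func"}  # keys used in the rule example

rules = [
    {"name": "my-rule", "pattern": "**/identifiers/*", "func": "redact(placeholder=#)"},
    {"name": "another-rule", "pattern": "**/names/*", "func": "redact(placeholder=#)"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    rules_path = Path(tmp_dir) / "pseudo-rules.json"
    rules_path.write_text(json.dumps(rules, indent=2), encoding="utf-8")

    # Read the file back the same way as in the example above.
    with open(rules_path, "r", encoding="utf-8") as rules_file:
        rules_json = json.load(rules_file)

# Names of rules that lack one of the expected keys (empty when all are complete).
incomplete = [r.get("name", "<unnamed>") for r in rules_json if not REQUIRED_KEYS.issubset(r)]
```

Checking the parsed rules before building PseudoRule objects gives a clearer error than a failure deep inside a pseudonymization run.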

Depseudonymize

The "Depseudonymize" functions are almost exactly the same as the "Pseudonymize" functions. Users can also map from a stable ID (SID) back to an fnr.

from dapla_pseudo import Depseudonymize
import polars as pl

file_path="data/personer_pseudonymized.csv"
dtypes = {"fnr": pl.Utf8, "fornavn": pl.Utf8, "etternavn": pl.Utf8, "kjonn": pl.Categorical, "fodselsdato": pl.Utf8}
df = pl.read_csv(file_path, dtypes=dtypes) # Create DataFrame from file

# Example: Single field default encryption (DAEAD)
result_df = (
    Depseudonymize.from_polars(df)                 # Specify what dataframe to use
    .on_fields("fornavn")                          # Select the field to depseudonymize
    .with_default_encryption()                     # Select the depseudonymization algorithm to apply
    .run()                                         # Apply depseudonymization to the selected field
    .to_polars()                                   # Get the result as a polars dataframe
)

# Example: Multiple fields default encryption (DAEAD)
result_df = (
    Depseudonymize.from_polars(df)                 # Specify what dataframe to use
    .on_fields("fornavn", "etternavn")             # Select multiple fields to depseudonymize
    .with_default_encryption()                     # Select the depseudonymization algorithm to apply
    .run()                                         # Apply depseudonymization to the selected fields
    .to_polars()                                   # Get the result as a polars dataframe
)

# Example: Depseudonymize Fnr field with SID mapping
result_df = (
    Depseudonymize.from_polars(df)                 # Specify what dataframe to use
    .on_fields("fnr")                              # Select fnr field to depseudonymize
    .with_stable_id()                              # Select the depseudonymization method (SID mapping) to apply
    .run()                                         # Apply depseudonymization to the selected fields
    .to_polars()                                   # Get the result as a polars dataframe
)

Note that depseudonymization requires elevated access privileges.

Repseudonymize

Repseudonymize can 1) change the algorithm used to pseudonymize, and/or 2) change the key used in pseudonymization while keeping the algorithm.

from dapla_pseudo import Repseudonymize

# Example: Repseudonymize from PAPIS-compatible encryption to Stable ID
result_df = (
    Repseudonymize.from_polars(df)                 # Specify what dataframe to use
    .on_fields("fnr")                              # Select the field to pseudonymize
    .from_papis_compatible_encryption()            # Select the pseudonymization algorithm previously used
    .to_stable_id()                                # Select the new pseudonymization rule
    .run()                                         # Apply pseudonymization to the selected field
    .to_polars()                                   # Get the result as a polars dataframe
)
# Example: Repseudonymize with the same algorithm, but with a different key
result_df = (
    Repseudonymize.from_polars(df)                     # Specify what dataframe to use
    .on_fields("fnr")                                  # Select the field to pseudonymize
    .from_papis_compatible_encryption()                # Select the pseudonymization algorithm previously used
    .to_papis_compatible_encryption(key_id="some-key") # Select the new pseudonymization rule
    .run()                                             # Apply pseudonymization to the selected field
    .to_polars()                                       # Get the result as a polars dataframe
)

Datadoc

Datadoc metadata is gathered while pseudonymizing, and can be seen like so:

result = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption()
    .run()
)

print(result.datadoc)

Datadoc metadata is automatically written to the same folder or bucket as the pseudonymized data when using the to_file() method on the result object. The metadata file has the suffix __DOC, and is always a .json file. The data and metadata are written to file like so:

result = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption()
    .run()
)

# The line of code below also writes the file "gs://bucket/test__DOC.json"
result.to_file("gs://bucket/test.parquet")
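
The naming convention can be expressed as a small helper, which is handy when other code needs to locate the metadata file. `datadoc_path` is a hypothetical function that only mirrors the rule described above (same folder, file stem plus `__DOC`, always `.json`); the library derives the path itself when to_file() is called.

```python
import posixpath


def datadoc_path(data_path: str) -> str:
    """Derive the metadata path: same folder, file stem plus "__DOC", always .json."""
    folder, file_name = posixpath.split(data_path)
    stem, _extension = posixpath.splitext(file_name)
    return posixpath.join(folder, f"{stem}__DOC.json")


metadata_path = datadoc_path("gs://bucket/test.parquet")
nested_metadata_path = datadoc_path("gs://bucket/folder/personer.csv")
```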

Note that if you choose to only use the DataFrame from the result, the metadata will be lost forever! An example of how this can happen:

import dapla as dp
result = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption()
    .run()
)
df = result.to_pandas()

dp.write_pandas(df, "gs://bucket/test.parquet", file_format="parquet") # The metadata is lost!!

Requirements

  • Python >= 3.10
  • Dependencies can be found in pyproject.toml

Installation

You can install Dapla Toolbelt Pseudo via pip from PyPI:

pip install dapla-toolbelt-pseudo

Usage

Please see the Reference Guide for details.

Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

License

Distributed under the terms of the MIT license, Dapla Toolbelt Pseudo is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

Credits

This project was generated from Statistics Norway's SSB PyPI Template.

