Pseudonymization extensions for Dapla Toolbelt
Pseudonymize, repseudonymize and depseudonymize data on Dapla.
Usage
See the command-line reference for details.
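The examples that follow read a local file at ./data/personer.json. As a hedged sketch for trying them out, the snippet below writes such a file with two synthetic records. The helper name write_sample_people, the record values, and the choice of a JSON array of objects are assumptions made for illustration; only the column names (fnr, fornavn, etternavn, kjonn, fodselsdato) are taken from the examples in this document.

```python
import json
from pathlib import Path

# Synthetic placeholder records. Every value here is invented for illustration;
# only the column names come from the examples in this README.
SAMPLE_PEOPLE = [
    {"fnr": "00000000001", "fornavn": "Donald", "etternavn": "Duck",
     "kjonn": "M", "fodselsdato": "010170"},
    {"fnr": "00000000002", "fornavn": "Dolly", "etternavn": "Duck",
     "kjonn": "F", "fodselsdato": "020280"},
]


def write_sample_people(path):
    """Write the synthetic records to `path` as a JSON array and return the Path."""
    target = Path(path)
    # Create the parent directory (for example ./data) if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(
        json.dumps(SAMPLE_PEOPLE, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return target


if __name__ == "__main__":
    print(write_sample_people("./data/personer.json"))
```

Whether the pseudonymization service expects exactly this layout is not stated here, so treat the file as a stand-in and adjust it to match your real data.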
Pseudonymize
from dapla_pseudo import pseudonymize
# Pseudonymize fields in a local file using the default key:
pseudonymize(file_path="./data/personer.json", fields=["fnr", "fornavn"])
# Pseudonymize fields in a local file, explicitly denoting the key to use:
pseudonymize(file_path="./data/personer.json", fields=["fnr", "fornavn"], key="ssb-common-key-1")
# Pseudonymize a local file using a custom key:
import json
custom_keyset = json.dumps({
    "encryptedKeyset": "CiQAp91NBhLdknX3j9jF6vwhdyURaqcT9/M/iczV7fLn...8XYFKwxiwMtCzDT6QGzCCCM=",
    "keysetInfo": {
        "primaryKeyId": 1234567890,
        "keyInfo": [
            {
                "typeUrl": "type.googleapis.com/google.crypto.tink.AesSivKey",
                "status": "ENABLED",
                "keyId": 1234567890,
                "outputPrefixType": "TINK",
            }
        ],
    },
    "kekUri": "gcp-kms://projects/some-project-id/locations/europe-north1/keyRings/some-keyring/cryptoKeys/some-kek-1",
})
pseudonymize(file_path="./data/personer.json", fields=["fnr", "fornavn"], key=custom_keyset)
# Operate on data in a streaming manner:
import shutil
with pseudonymize("./data/personer.json", fields=["fnr", "fornavn", "etternavn"], stream=True) as res:
    with open("./data/personer_deid.json", 'wb') as f:
        res.raw.decode_content = True
        shutil.copyfileobj(res.raw, f)
# Map certain fields to stable ID:
pseudonymize(file_path="./data/personer.json", fields=["fornavn"], sid_fields=["fnr"])
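The custom keyset shown earlier is passed as a JSON string, so a malformed keyset only surfaces once the request is made. As a minimal sketch, the hypothetical helper check_keyset below performs a local sanity check before the call: it verifies that the three top-level fields from the example are present and that primaryKeyId refers to a key marked ENABLED. The helper is not part of dapla_pseudo, the field list is inferred only from the example above, and the encryptedKeyset value is a placeholder.

```python
import json


def check_keyset(keyset_json):
    """Parse a keyset JSON string and run basic structural checks.

    Hypothetical helper, not part of dapla_pseudo. Returns the parsed dict,
    or raises ValueError if a check fails.
    """
    keyset = json.loads(keyset_json)

    # Top-level fields as they appear in the README example (assumed required).
    for name in ("encryptedKeyset", "keysetInfo", "kekUri"):
        if name not in keyset:
            raise ValueError(f"missing top-level field: {name}")

    info = keyset["keysetInfo"]
    primary = info["primaryKeyId"]
    enabled_ids = [
        key["keyId"] for key in info["keyInfo"] if key.get("status") == "ENABLED"
    ]
    # The primary key must be one of the enabled keys in the keyset.
    if primary not in enabled_ids:
        raise ValueError(f"primaryKeyId {primary} is not an ENABLED key")
    return keyset


example_keyset = json.dumps({
    "encryptedKeyset": "<placeholder, not a real keyset>",
    "keysetInfo": {
        "primaryKeyId": 1234567890,
        "keyInfo": [
            {
                "typeUrl": "type.googleapis.com/google.crypto.tink.AesSivKey",
                "status": "ENABLED",
                "keyId": 1234567890,
                "outputPrefixType": "TINK",
            }
        ],
    },
    "kekUri": "gcp-kms://projects/some-project-id/locations/europe-north1/keyRings/some-keyring/cryptoKeys/some-kek-1",
})

checked = check_keyset(example_keyset)
```

A check like this cannot tell whether the encrypted keyset is valid or whether you have access to the key encryption key; it only catches structural mistakes early.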
Builder pattern pseudonymization examples
# Import necessary modules
from dapla_pseudo import PseudoData
from dapla import AuthClient
import pandas as pd
file_path="data/personer.json"
options = {
    # Specify data types of columns in the dataset
    "dtype": {"fnr": "string", "fornavn": "string", "etternavn": "string", "kjonn": "category", "fodselsdato": "string"}
}
# Example: Single field default encryption (DAEAD)
df = pd.read_json(file_path,**options) # Create DataFrame from file
result_df = (
    PseudoData.from_pandas(df)  # Specify what dataframe to use
    .on_field("fornavn")        # Select the field to pseudonymize
    .pseudonymize()             # Apply pseudonymization to the selected field
    .to_polars()                # Get the result as a polars dataframe
)
# Example: Multiple fields default encryption (DAEAD)
result_df = (
    PseudoData.from_file(file_path, **options)  # Read the DataFrame from file
    .on_fields("fornavn", "etternavn")          # Select multiple fields to pseudonymize
    .pseudonymize()                             # Apply pseudonymization to the selected fields
    .to_polars()                                # Get the result as a polars dataframe
)
# Example: Single field sid mapping (FPE)
options = {
    # Specify data types of columns in the dataset
    "dtype": {"fnr": "string", "fornavn": "string", "etternavn": "string", "kjonn": "category", "fodselsdato": "string"},
    # Specify storage options for Google Cloud Storage (GCS)
    "storage_options": {"token": AuthClient.fetch_google_token()}
}
gcs_file_path = "gs://ssb-staging-dapla-felles-data-delt/felles/pseudo-examples/andeby_personer.csv"
result_df = (
    PseudoData.from_file(gcs_file_path, **options)  # Read DataFrame from GCS
    .on_field("fnr")                                # Select the field to map and pseudonymize
    .map_to_stable_id()                             # Map the selected field to stable ID
    .pseudonymize()                                 # Apply pseudonymization to the selected field
    .to_polars()                                    # Get the result as a polars dataframe
)
Repseudonymize
from dapla_pseudo import repseudonymize
# Repseudonymize fields in a local file, denoting source and target keys to use:
repseudonymize(file_path="./data/personer_deid.json", fields=["fnr", "fornavn"], source_key="ssb-common-key-1", target_key="ssb-common-key-2")
Depseudonymize
from dapla_pseudo import depseudonymize
# Depseudonymize fields in a local file using the default key:
depseudonymize(file_path="./data/personer_deid.json", fields=["fnr", "fornavn"])
# Depseudonymize fields in a local file, explicitly denoting the key to use:
depseudonymize(file_path="./data/personer_deid.json", fields=["fnr", "fornavn"], key="ssb-common-key-1")
Note that depseudonymization requires elevated access privileges.
Requirements
Installation
You can install dapla-toolbelt-pseudo via pip from PyPI:
pip install dapla-toolbelt-pseudo
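To confirm that the installation succeeded, one option is to look up the installed distribution with the standard library's importlib.metadata. This is a small sketch rather than an official check; the helper name installed_version is an assumption introduced here, while the distribution name is the one used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("dapla-toolbelt-pseudo")
    if found is None:
        print("dapla-toolbelt-pseudo is not installed in this environment")
    else:
        print(f"dapla-toolbelt-pseudo {found} is installed")
```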
Contributing
Contributions are very welcome. To learn more, see the Contributor Guide.
License
Distributed under the terms of the MIT license, Pseudonymization extensions for Dapla Toolbelt is free and open source software.
Issues
If you encounter any problems, please file an issue along with a detailed description.
Credits
This project was generated from @cjolowicz's Hypermodern Python Cookiecutter template.