Anonymizes a pandas dataset and provides a hash dictionary to de-anonymize it
NER Anonymizer
This package contains some developmental tools to anonymize a pandas dataframe.
NER Anonymizer contains a class, DataAnonymizer, which handles anonymization for both free text and categorical columns in a pandas dataframe:
- For free text columns, it uses a pretrained model from the transformers package to perform named entity recognition (NER) and pick up user-specified entities such as locations and persons. It generates an MD5 hash for each entity, replaces the entity with the hash, and stores the hash-to-entity mapping in a dictionary.
- For categorical columns, it generates an MD5 hash for every category, replaces the category with the hash, and stores the hash-to-category mapping in a dictionary.
The saved dictionary can then be used to de-anonymize the data and recover the original dataset. Referential integrity is preserved because the same category or entity always produces the same hash.
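The categorical case can be illustrated with a minimal sketch that uses only the standard library. The helper name `hash_values` and the sample values are hypothetical and not part of the package's API. The sketch only mirrors the behaviour described above: one MD5 hash per distinct value, plus a hash-to-value dictionary for reversal.

```python
import hashlib


def hash_values(values):
    """Replace each value with its MD5 hex digest and record hash -> value."""
    hash_dictionary = {}
    anonymized = []
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        hash_dictionary[digest] = value
        anonymized.append(digest)
    return anonymized, hash_dictionary


# Hypothetical categorical column, shown here as a plain list.
categories = ["Singapore", "London", "Singapore"]
anonymized, hash_dictionary = hash_values(categories)

# The same category always maps to the same hash (referential integrity).
assert anonymized[0] == anonymized[2]

# De-anonymization is a dictionary lookup.
restored = [hash_dictionary[h] for h in anonymized]
assert restored == categories
```

With a pandas column, a comparable mapping could be applied through `Series.map`; how the package does this internally is not shown here.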
Installation
Install the package with pip
pip install ner-anonymizer
Example Usage
The package uses the NER model dslim/bert-base-NER by default. To anonymize a particular pandas dataframe, df, using a pretrained NER model:
import ner_anonymizer

# to anonymize (df is an existing pandas dataframe)
anonymizer = ner_anonymizer.DataAnonymizer(
    pretrained_model_name="dslim/bert-base-NER",
    label_list=["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"],
    labels_to_anonymize=["B-PER", "I-PER", "B-LOC", "I-LOC"],
)
anonymized_df, hash_dictionary = anonymizer.anonymize(
    df=df,
    free_text_columns=["free_text_column_1", "free_text_column_2"],
    categorical_columns=["categorical_column_1"],
)

# to de-anonymize
de_anonymized_df = ner_anonymizer.de_anonymize_data(anonymized_df, hash_dictionary)
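To make the round trip concrete for free text, here is a rough standard-library sketch. It is not the package's `de_anonymize_data` implementation. The functions `anonymize_text` and `de_anonymize_text` are hypothetical, and the entity list is assumed to have been produced by an NER model beforehand.

```python
import hashlib


def anonymize_text(text, entities):
    """Replace each given entity string in text with its MD5 hex digest."""
    hash_dictionary = {}
    for entity in entities:
        digest = hashlib.md5(entity.encode("utf-8")).hexdigest()
        hash_dictionary[digest] = entity
        text = text.replace(entity, digest)
    return text, hash_dictionary


def de_anonymize_text(text, hash_dictionary):
    """Substitute every known hash in text with the entity it stands for."""
    for digest, entity in hash_dictionary.items():
        text = text.replace(digest, entity)
    return text


original = "Alice met Bob in Paris."
# Entities assumed to have been found by an NER model beforehand.
masked, mapping = anonymize_text(original, ["Alice", "Bob", "Paris"])
recovered = de_anonymize_text(masked, mapping)
assert recovered == original
```

Plain string replacement is the simplest possible choice here. A real implementation would more likely replace by character offsets returned from the model, so that substrings of other words are not touched.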
You may specify for the argument pretrained_model_name any available pre-trained NER model from the transformers package. Note that you will also need to specify the labels that the NER model uses, label_list, and from that list, the labels you want to anonymize, labels_to_anonymize.
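The relationship between the model's labels and labels_to_anonymize can be sketched as follows, assuming the model emits one BIO-style tag per token (B- opens an entity, I- continues it), as the default label list suggests. The function `collect_entities` and the sample tokens are hypothetical and only illustrate the filtering idea. The sketch is simplified and does not check that an I- tag has the same entity type as the preceding B- tag.

```python
def collect_entities(tokens, labels, labels_to_anonymize):
    """Group consecutive tokens whose labels are selected into entity strings."""
    entities = []
    current = []
    for token, label in zip(tokens, labels):
        selected = label in labels_to_anonymize
        if selected and label.startswith("B-"):
            # A B- tag opens a new entity; close any open one first.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif selected and label.startswith("I-") and current:
            # An I- tag continues the entity opened by a previous tag.
            current.append(token)
        else:
            # Unselected labels (e.g. "O", "B-ORG") and stray I- tags
            # end the current entity.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


tokens = ["John", "Smith", "works", "at", "Acme", "in", "New", "York"]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
labels_to_anonymize = ["B-PER", "I-PER", "B-LOC", "I-LOC"]

entities = collect_entities(tokens, labels, labels_to_anonymize)
print(entities)  # ['John Smith', 'New York']
```

Because "B-ORG" is left out of labels_to_anonymize, the organisation name stays in the text, while persons and locations are collected for hashing.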
You may also view an example notebook at examples/example_usage.ipynb.
Built Distribution
Hashes for ner_anonymizer-0.1.6-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 280dbbbf08e543ff88bbbafd6a7dc9848ea028dd25de048255e8424640dd75d2
MD5 | 549cd1a118b8e9ca89bd9b5824b96a5d
BLAKE2b-256 | 142b1cd6198e2a68ad62a75434ea3303aa7e3d4079cfeef96d2f4ed30c35e755