Anonymizes a pandas dataframe and provides a hash dictionary to de-anonymize it
Project description
NER Anonymizer
This package contains some developmental tools to anonymize a pandas dataframe.
NER Anonymizer contains a class, DataAnonymizer, which handles anonymization for both free text and categorical columns in a pandas dataframe:
- For free text columns, it uses a pretrained model from the transformers package to perform named entity recognition (NER) and pick up user-specified entities such as locations and persons. It generates an MD5 hash for each entity, replaces the entity with the hash, and stores the hash-to-entity mapping in a dictionary.
- For categorical columns, it generates an MD5 hash for every category, replaces the category with the hash, and stores the hash-to-category mapping in a dictionary.
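The two cases above can be sketched with the standard library alone. This is a minimal illustration and not the package's implementation: the helper names (md5_hash, anonymize_categories, anonymize_text) are hypothetical, and the list of entities stands in for whatever an NER model would return.

```python
import hashlib


def md5_hash(value: str) -> str:
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def anonymize_categories(values, hash_dictionary):
    """Replace every category with its hash and record hash -> category."""
    anonymized = []
    for value in values:
        digest = md5_hash(value)
        hash_dictionary[digest] = value
        anonymized.append(digest)
    return anonymized


def anonymize_text(text, entities, hash_dictionary):
    """Replace each entity string found in `text` with its hash.

    `entities` is assumed to come from an NER step; here it is given by hand.
    Longer entities are replaced first so that a short entity never
    clobbers part of a longer one.
    """
    for entity in sorted(set(entities), key=len, reverse=True):
        digest = md5_hash(entity)
        hash_dictionary[digest] = entity
        text = text.replace(entity, digest)
    return text


hash_dictionary = {}
departments = anonymize_categories(["Sales", "Legal", "Sales"], hash_dictionary)
note = anonymize_text("Alice flew to Paris.", ["Alice", "Paris"], hash_dictionary)
```

Both helpers write into the same dictionary, so a single mapping is enough to reverse every replacement later.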
The saved dictionary can then be used for de-anonymization to recover the original dataset. Referential integrity is preserved because the same category or entity always produces the same hash.
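As a rough illustration of why the round trip works, the sketch below hashes two entities, then reverses the replacement using only the dictionary. The function de_anonymize_text is a hypothetical stand-in written for this example; it is not the package's de_anonymize_data.

```python
import hashlib


def de_anonymize_text(text, hash_dictionary):
    """Replace every known hash in `text` with the value it was derived from."""
    for digest, original_value in hash_dictionary.items():
        text = text.replace(digest, original_value)
    return text


original = "Alice flew to Paris. Alice likes Paris."

# Build the anonymized text and the hash -> entity dictionary by hand.
hash_dictionary = {}
anonymized = original
for entity in ["Alice", "Paris"]:
    digest = hashlib.md5(entity.encode("utf-8")).hexdigest()
    hash_dictionary[digest] = entity
    anonymized = anonymized.replace(entity, digest)

restored = de_anonymize_text(anonymized, hash_dictionary)
```

Because MD5 is deterministic, both mentions of the same entity map to the same digest, which is what keeps references consistent across rows and columns.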
Installation
Install the package with pip:
pip install ner-anonymizer
Example Usage
The package uses the NER model dslim/bert-base-NER by default. To anonymize a particular pandas dataframe, df, using a pretrained NER model:
import ner_anonymizer

# to anonymize: load the NER model and declare which of its labels to hash
anonymizer = ner_anonymizer.DataAnonymizer(
    pretrained_model_name="dslim/bert-base-NER",
    label_list=["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"],
    labels_to_anonymize=["B-PER", "I-PER", "B-LOC", "I-LOC"],
)

# returns the anonymized dataframe and the hash dictionary needed to reverse it
anonymized_df, hash_dictionary = anonymizer.anonymize(
    df=df,
    free_text_columns=["free_text_column_1", "free_text_column_2"],
    categorical_columns=["categorical_column_1"],
)

# to de-anonymize
de_anonymized_df = ner_anonymizer.de_anonymize_data(anonymized_df, hash_dictionary)
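For readers who want something concrete to pass as df, a small input dataframe might look like the sketch below. The column names are the placeholders used in the example above, and the cell values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical input data: two free text columns and one categorical column.
df = pd.DataFrame(
    {
        "free_text_column_1": [
            "Alice visited the Paris office last week.",
            "The report was reviewed by Bob.",
        ],
        "free_text_column_2": [
            "Follow up with Alice in London.",
            "No further action needed.",
        ],
        "categorical_column_1": ["Sales", "Legal"],
    }
)

# These lists mirror the arguments passed to anonymize() in the example above.
free_text_columns = ["free_text_column_1", "free_text_column_2"]
categorical_columns = ["categorical_column_1"]
```

Only the listed columns would be touched; any other columns in the dataframe are presumably left as they are, though that is an assumption and not something stated here.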
For the argument pretrained_model_name, you may specify any available pretrained NER model from the transformers package. Note that you will also need to specify the labels that the NER model uses, label_list, and from that list, the labels you want to anonymize, labels_to_anonymize.
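When switching to a different model, the two label arguments can be derived from the model's id-to-label mapping instead of being typed by hand. The sketch below is only one possible approach: both helper functions are hypothetical, the mapping is written out to match the label list shown for dslim/bert-base-NER in the example above, and it assumes that label_list should be ordered by label id. With the transformers package, such a mapping is typically available as the id2label attribute of the model configuration.

```python
def label_list_from_id2label(id2label):
    """Return label names ordered by their integer id."""
    return [id2label[i] for i in sorted(id2label)]


def labels_for_entity_types(label_list, entity_types):
    """Keep the B-/I- labels whose entity type (e.g. PER, LOC) is requested."""
    wanted = set(entity_types)
    return [label for label in label_list if label.split("-", 1)[-1] in wanted]


# Illustrative mapping, written by hand to match the example's label list.
id2label = {
    0: "O",
    1: "B-MISC",
    2: "I-MISC",
    3: "B-PER",
    4: "I-PER",
    5: "B-ORG",
    6: "I-ORG",
    7: "B-LOC",
    8: "I-LOC",
}

label_list = label_list_from_id2label(id2label)
labels_to_anonymize = labels_for_entity_types(label_list, ["PER", "LOC"])
```

Selecting by entity type keeps the B- and I- variants together, so multi-token names are hashed as a whole and not left half anonymized.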
You may also view an example notebook at examples/example_usage.ipynb.
File details
Details for the file ner-anonymizer-0.1.6.tar.gz.
File metadata
- Download URL: ner-anonymizer-0.1.6.tar.gz
- Upload date:
- Size: 5.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4a3a21ac3ce956dd5847b90006cafb5121c6f23682d3fa9340130e37ec392d47
MD5 | 52b3581819dea23dbeca920562046703
BLAKE2b-256 | 38a9fc67158397973552c5bc67f6a22e8c9f998baf10cb39f84731bfacde4209
File details
Details for the file ner_anonymizer-0.1.6-py3-none-any.whl.
File metadata
- Download URL: ner_anonymizer-0.1.6-py3-none-any.whl
- Upload date:
- Size: 6.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 280dbbbf08e543ff88bbbafd6a7dc9848ea028dd25de048255e8424640dd75d2
MD5 | 549cd1a118b8e9ca89bd9b5824b96a5d
BLAKE2b-256 | 142b1cd6198e2a68ad62a75434ea3303aa7e3d4079cfeef96d2f4ed30c35e755