Anonymizes pandas dataset and provides a hash dictionary to de-anonymize

Project description

NER Anonymizer

This package contains some developmental tools to anonymize a pandas dataframe.

NER Anonymizer contains a class DataAnonymizer which handles anonymization for both free text and categorical columns in a pandas dataframe:

  • For free text columns, it uses a pretrained model from the transformers package to perform named entity recognition (NER) and pick up user-specified entities such as locations and persons. It then generates an MD5 hash for each entity, replaces the entity with the hash, and stores the hash-to-entity mapping in a dictionary.
  • For categorical columns, it generates an MD5 hash for every category, replaces the category with the hash, and stores the hash-to-category mapping in a dictionary.
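The free text step can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the helper names md5_hex and replace_entities are hypothetical, and the character spans are written by hand where a real NER model would supply them.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def replace_entities(text, spans, hash_dictionary):
    """Replace each (start, end) character span in `text` with its MD5 hash.

    `spans` stands in for the output of an NER model (hypothetical format).
    Every hash is recorded in `hash_dictionary` as hash -> original entity.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        entity = text[start:end]
        digest = md5_hex(entity)
        hash_dictionary[digest] = entity   # remember how to reverse the hash
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(digest)              # the hash takes the entity's place
        cursor = end
    pieces.append(text[cursor:])           # untouched text after the last entity
    return "".join(pieces)


hash_dictionary = {}
sentence = "Alice flew to Paris."
# Hand-written spans for "Alice" and "Paris"; a real model would detect these.
spans = [(0, 5), (14, 19)]
anonymized = replace_entities(sentence, spans, hash_dictionary)
print(anonymized)
```

Keeping the dictionary keyed by hash makes the later reverse lookup a plain dictionary access.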

The saved dictionary can then be used to de-anonymize the data and recover the original dataset. Referential integrity is preserved because the same category or entity always produces the same hash.
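To make the round trip and the referential-integrity claim concrete, here is a small sketch using plain pandas and hashlib. It does not call the package itself; the column names, the md5_hex helper, and the way the dictionary is built are assumptions made for illustration.

```python
import hashlib

import pandas as pd


def md5_hex(value: str) -> str:
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


# Hypothetical data: the same city names appear in two different columns.
df = pd.DataFrame({
    "home_city": ["Paris", "Oslo", "Paris"],
    "work_city": ["Oslo", "Paris", "Lima"],
})

hash_dictionary = {}
anonymized = df.copy()
for column in ["home_city", "work_city"]:
    # Record hash -> category once per distinct category.
    for category in df[column].unique():
        hash_dictionary[md5_hex(category)] = category
    # Replace every category with its hash.
    anonymized[column] = df[column].map(md5_hex)

# De-anonymize by looking each hash up in the saved dictionary.
restored = anonymized.apply(lambda col: col.map(hash_dictionary))

# "Paris" hashes to the same value in both columns, so joins and
# group-bys on the anonymized data still line up.
same_hash = anonymized.loc[0, "home_city"] == anonymized.loc[1, "work_city"]
print(same_hash)
```

Because the hash is deterministic, anyone who can guess a category can also recompute its hash, so the dictionary is not the only route back to the original values.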

Installation

Install the package with pip:

pip install ner-anonymizer

Example Usage

The package uses the NER model dslim/bert-base-NER by default. To anonymize a particular pandas dataframe, df, using a pretrained NER model:

import ner_anonymizer

# to anonymize
anonymizer = ner_anonymizer.DataAnonymizer(
    pretrained_model_name="dslim/bert-base-NER",
    label_list=["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"],
    labels_to_anonymize=["B-PER", "I-PER", "B-LOC", "I-LOC"]
)
anonymized_df, hash_dictionary = anonymizer.anonymize(
    df=df,
    free_text_columns=["free_text_column_1", "free_text_column_2"],
    categorical_columns=["categorical_column_1"],
)

# to de-anonymize
de_anonymized_df = ner_anonymizer.de_anonymize_data(anonymized_df, hash_dictionary)

For the argument pretrained_model_name, you may specify any available pre-trained NER model from the transformers package. Note that you will also need to specify the labels that the NER model uses, label_list, and from that list, the labels you want to anonymize, labels_to_anonymize.
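When a model uses IOB-style tags like the ones shown in the example above, labels_to_anonymize can be derived from label_list instead of being typed by hand. The helper below is a hypothetical convenience, not part of the package, and it assumes every entity tag has the form "B-TYPE" or "I-TYPE".

```python
def labels_for_entities(label_list, entity_types):
    """Pick the tags in `label_list` whose entity type is in `entity_types`.

    Assumes IOB-style tags such as "B-PER" / "I-PER"; the "O" tag has no
    entity type and is never selected.
    """
    wanted = set(entity_types)
    return [
        label
        for label in label_list
        if "-" in label and label.split("-", 1)[1] in wanted
    ]


label_list = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
labels_to_anonymize = labels_for_entities(label_list, ["PER", "LOC"])
print(labels_to_anonymize)
```

The result keeps the order of label_list, so it can be passed directly as the labels_to_anonymize argument.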

An example notebook is also available at examples/example_usage.ipynb.

Download files

Source Distribution

ner-anonymizer-0.1.6.tar.gz (5.0 kB view details)

Uploaded Source

Built Distribution

ner_anonymizer-0.1.6-py3-none-any.whl (6.6 kB view details)

Uploaded Python 3

File details

Details for the file ner-anonymizer-0.1.6.tar.gz.

File metadata

  • Download URL: ner-anonymizer-0.1.6.tar.gz
  • Upload date:
  • Size: 5.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.7

File hashes

Hashes for ner-anonymizer-0.1.6.tar.gz
Algorithm Hash digest
SHA256 4a3a21ac3ce956dd5847b90006cafb5121c6f23682d3fa9340130e37ec392d47
MD5 52b3581819dea23dbeca920562046703
BLAKE2b-256 38a9fc67158397973552c5bc67f6a22e8c9f998baf10cb39f84731bfacde4209

File details

Details for the file ner_anonymizer-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: ner_anonymizer-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 6.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.7

File hashes

Hashes for ner_anonymizer-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 280dbbbf08e543ff88bbbafd6a7dc9848ea028dd25de048255e8424640dd75d2
MD5 549cd1a118b8e9ca89bd9b5824b96a5d
BLAKE2b-256 142b1cd6198e2a68ad62a75434ea3303aa7e3d4079cfeef96d2f4ed30c35e755
