Python package to help data scientists remove or redact Personally Identifiable Information (PII)

sanityze

Data scientists often need to remove or redact Personally Identifiable Information (PII) from their data. This package provides utilities to spot and redact PII in Pandas data frames.

PII is information that can be used to uniquely identify a person, such as names, addresses, credit card numbers, phone numbers, email addresses, and social security numbers. Regulations such as the European Union's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) therefore require that PII be removed or redacted from data sets before they are shared and further processed.

Contributors and Maintainers

Why `sanityze`?

Because it's a fun name, and it's a play on the word "sanitize", which is what we are doing to the data.

Similar packages in Python

The closest Python package in functionality to sanityze is scrubadub, a package for finding and removing PII from text. scrubadub is not designed to work with Pandas data frames or other data structures. We believe sanityze will be more useful to data scientists as we add more spotters (mechanisms for finding PII), support more data structures, and provide mechanisms for users to define their own spotters.

Quick Start

To get started with sanityze, install it using pip:

pip install sanityze

Then visit the documentation for more information and examples.

Features and Usage

Conceptually, sanityze is a package that provides a way to remove PII from Pandas data frames. The package provides a number of default spotters, which can be used to identify PII in the data and redact it.

The main entry point to the package is the Cleanser class. A Cleanser holds a set of Spotters, which identify PII in the data. The cleanser can then be used to cleanse a given data frame, redacting the PII it contains (and, in the future, any other data structures the package comes to support).

The package comes with a number of default spotters, implemented as subclasses of Spotter:

CreditCardSpotter - identifies credit card numbers
EmailSpotter - identifies email addresses
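
To make the idea of "spotting" concrete, here is a minimal sketch of pattern-based detection using only the standard library. The regular expressions and the `find_pii` helper are assumptions made for illustration; they are not taken from sanityze, whose spotters may use different and stricter rules.

```python
import re

# Simplified patterns for illustration only; real-world PII detection
# usually needs stricter rules (for example, checksum validation of
# credit card numbers).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def find_pii(text):
    """Return a dict mapping a PII kind to the matches found in `text`."""
    return {
        "email": EMAIL_RE.findall(text),
        "credit_card": CARD_RE.findall(text),
    }


found = find_pii("Mail jane@example.com, card 1234-5678-9012-3456.")
print(found)
```

Each pattern plays the role of one spotter: it knows how to recognise a single kind of PII and nothing else.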

Spotters can be added to a cleanser using the add_spotter() method. The cleanser can then be used to cleanse data using the clean() method, which takes a Pandas data frame and returns a Pandas data frame with PII redacted.

The redaction options provided by `sanityze` are:

Redact using a fixed string - The string in this case is the ID of the spotter. For example, the string will be {{CREDITCARD}} if the spotter is an instance of CreditCardSpotter, or {{EMAILADDRS}} if it is an instance of EmailSpotter.
Redact using a hash of the input - The hash is computed using the hashlib package, and the hash function is md5. For example, if the spotter is an instance of CreditCardSpotter and the input contains the PII 1234-5678-9012-3456, the string will be {{6a8b8c6c8c62bc939a11f36089ac75dd}}.
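
The two redaction options can be sketched with the standard library as follows. This is an illustration, not sanityze's code: the function names `redact_fixed` and `redact_hash`, the simplified email pattern, and the choice to hash only the matched text (rather than, say, the whole cell) are all assumptions.

```python
import hashlib
import re

# Simplified email pattern, assumed for this example only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_fixed(text, spotter_id="EMAILADDRS"):
    """Replace every match with the spotter's ID wrapped in double braces."""
    return EMAIL_RE.sub("{{" + spotter_id + "}}", text)


def redact_hash(text):
    """Replace every match with the md5 hex digest of the matched text."""
    def _hash(match):
        digest = hashlib.md5(match.group(0).encode("utf-8")).hexdigest()
        return "{{" + digest + "}}"
    return EMAIL_RE.sub(_hash, text)


sample = "Contact jane@example.com for details."
print(redact_fixed(sample))  # Contact {{EMAILADDRS}} for details.
print(redact_hash(sample))
```

Because a hash is deterministic, the same input always yields the same token, so redacted values can still be grouped or joined on; the fixed string discards that information entirely.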

Classes and Functions

Cleanser: the main class of the package. It holds a set of spotters and uses them to cleanse data.
1. add_spotter(): adds a spotter to the cleanser
2. remove_spotter(): removes a spotter from the cleanser
3. clean(): cleanses the data in the given data frame, and returns a new data frame with PII redacted
EmailSpotter: a spotter that identifies email addresses
1. getUID(): returns the unique ID of the spotter
2. process(): performs the PII matching and redaction
CreditCardSpotter: a spotter that identifies credit card numbers
1. getUID(): returns the unique ID of the spotter
2. process(): performs the PII matching and redaction
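
How these pieces fit together can be sketched in plain Python. The method names below mirror the list above, but `MiniCleanser` and `DigitRunSpotter` are hypothetical stand-ins written for this example, the `process(text)` signature is an assumption, and a dict of columns is used in place of a Pandas data frame so that the sketch needs only the standard library.

```python
import re


class DigitRunSpotter:
    """Hypothetical spotter that redacts runs of 6 or more digits."""

    def getUID(self):
        return "DIGITRUN"

    def process(self, text):
        # Match and redact in one step, using the spotter ID as the token.
        return re.sub(r"\d{6,}", "{{" + self.getUID() + "}}", text)


class MiniCleanser:
    """Illustrative stand-in for a cleanser; not sanityze's implementation."""

    def __init__(self):
        self._spotters = {}

    def add_spotter(self, spotter):
        self._spotters[spotter.getUID()] = spotter

    def remove_spotter(self, spotter):
        self._spotters.pop(spotter.getUID(), None)

    def clean(self, columns):
        """Return a new dict of columns with every string cell processed."""
        cleaned = {}
        for name, values in columns.items():
            new_values = []
            for value in values:
                if isinstance(value, str):
                    for spotter in self._spotters.values():
                        value = spotter.process(value)
                new_values.append(value)
            cleaned[name] = new_values
        return cleaned


data = {"note": ["ref 12345678", "no digits"], "qty": [3, 4]}
cleanser = MiniCleanser()
cleanser.add_spotter(DigitRunSpotter())
result = cleanser.clean(data)
print(result)
```

Keying spotters by their ID is one possible design: it makes adding the same kind of spotter twice harmless and lets a spotter be removed without holding on to the original instance.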

You can check out the detailed API documentation here.

Below is a simple quick start example:

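import pandas as pd
from sanityze import Cleanser, EmailSpotter

# Example data frame; the "email" column contains PII
df = pd.DataFrame({"name": ["Alice"], "email": ["alice@example.com"]})

# Create a cleanser, and don't add the default spotters
cleanser = Cleanser(include_default_spotters=False)
# Add only the email spotter
cleanser.add_spotter(EmailSpotter())
# Get back a data frame with email addresses redacted
cleaned_df = cleanser.clean(df)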

High-level Design

To better understand the design of the package, we have provided a high-level design document, which will be kept up to date as the package evolves. The document can be found here.

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

sanityze was created by Caesar Wong, Jonah Hamilton and Tony Zoght. It is licensed under the terms of the MIT license.

Credits

sanityze was created with cookiecutter and the py-pkgs-cookiecutter template.

Quick Links

Release history

1.0.2 - Feb 4, 2023 (this version)
1.0.1 - Feb 4, 2023
1.0.0 - Feb 4, 2023
0.1.3 - Jan 27, 2023
0.1.2 - Jan 27, 2023
0.1.1 - Jan 27, 2023
0.1.0 - Jan 27, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sanityze-1.0.2.tar.gz (8.4 kB)

Uploaded Feb 4, 2023 Source

Built Distribution

sanityze-1.0.2-py3-none-any.whl (8.5 kB)

Uploaded Feb 4, 2023 Python 3

Hashes for sanityze-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`d974e06c35c70d16ad5689e2705bad0690d96d9b6444b6ee9fcaae160be6c842`
MD5	`654d603cea93ac70b0ca12d62ba4852c`
BLAKE2b-256	`e2d1713124cdcfd82a8c5ed7b5b16109f96e54534a05ab231390100e65ac55f0`

Hashes for sanityze-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`300022e5d555580da6fc25a69dc7e0da474cc37b8d9b2596674713f8c026f4a4`
MD5	`bbddcc620492444d61673282a14ee588`
BLAKE2b-256	`6a6cd2540cdba2a043addaa0c3b45e44755436390f6a3d38605fb0df5b5f48fd`
