Python library to anonymize CSV datasets

Anonymization library

Overview

Anonymize your CSV datasets with 4 techniques:

  • Hash function
  • Aggregation and K-anonymity
  • Noise addition
  • Block words

Your CSV Dataset

  • Your dataset must be a CSV file with a comma or semicolon separator, encoded in UTF-8. A loading sketch with pandas is shown below.
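
A minimal sketch of loading such a file with pandas (the file name dataset.csv and the semicolon separator are assumptions for illustration):

import pandas as pd

# Load a UTF-8 encoded CSV; use sep="," if the file is comma-separated.
# "dataset.csv" is a placeholder path.
data = pd.read_csv("dataset.csv", sep=";", encoding="utf-8")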

The configuration file

  • The configuration file (.json) is generated with the Cosmian Anonymization tool. Install the tool with Docker and use it to generate a config file.

Using the library

The anonymize_dataset function takes 2 arguments:

  • data: the dataset to anonymize. It must be a pandas DataFrame.
  • config: the JSON anonymization config, loaded as a dictionary.

Example

from pathlib import Path
import json

import pandas as pd

from cosmian_lib_anonymization import anonymize_dataset

# dataset to anonymize, loaded as a DataFrame
data_to_anonymize = pd.DataFrame({
    "first_name": ["Jane", "Bob", "John"],
    "last_name": ["Warner", "Smith", "Moor"],
    "birthdate": ["01/12/1979", "06/28/85", "08/20/96"],
    "city": ["London", "Munich", "Beijing"],
})

# load the config file and turn the JSON config into a dictionary
json_config_path = Path("config.json").resolve()
config = json.loads(json_config_path.read_text(encoding="utf-8"))

# get the anonymized result
df_result = anonymize_dataset(data_to_anonymize, config)
df_result.head()

Anonymization techniques

Hash function

This corresponds to a function which returns a fixed-size output from an input of any size (the input may be a single attribute or a set of attributes) and cannot be reversed; this means that the reversal risk seen with encryption no longer exists. However, if the range of input values to the hash function is known, they can be replayed through the hash function in order to derive the correct value for a particular record.

Example
For instance, if a dataset was pseudonymised by hashing the national identification number, then this can be derived simply by hashing all possible input values and comparing the result with those values in the dataset. Hash functions are usually designed to be relatively fast to compute, and are subject to brute force attacks. Pre-computed tables can also be created to allow for the bulk reversal of a large set of hash values.
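
As a generic illustration of the technique (not this library's internal implementation), a salted hash of a single column might look like the sketch below; the salt value and the column content are assumptions:

from hashlib import sha256

# assumption for illustration: a secret salt and one column of names
salt = "some-secret-salt"
last_names = ["Warner", "Smith", "Moor"]

# each value is replaced by a fixed-size digest that cannot be reversed directly,
# but known inputs can still be replayed through the same function and compared
hashed = [sha256((salt + name).encode("utf-8")).hexdigest() for name in last_names]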

Aggregation and K-anonymity

Aggregation and K-anonymity techniques aim to prevent a data subject from being singled out by grouping them with at least k other individuals. To achieve this, the attribute values are generalized to an extent such that each individual shares the same value with at least k other individuals.

Example
For example, by lowering the granularity of a location from a city to a country, a higher number of data subjects are included. Individual dates of birth can be generalized into a range of dates, or grouped by month or year. Other numerical attributes (e.g. salaries, weight, height, or the dose of a medicine) can be generalized into interval values (e.g. salary €20,000 – €30,000). These methods may be used when the correlation of precise attribute values may create quasi-identifiers.
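
As a generic sketch of this kind of generalization (the salary values and the €10,000 interval width are assumptions for illustration), pandas can bin a numerical attribute into interval values:

import pandas as pd

# hypothetical salaries, generalized into €10,000-wide intervals
df = pd.DataFrame({"salary": [21500, 28700, 23200, 29900]})
df["salary_range"] = pd.cut(df["salary"], bins=range(20000, 40001, 10000))
print(df["salary_range"].value_counts())  # each interval now covers several individuals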

Noise addition

The technique of noise addition is especially useful when attributes may have an important adverse effect on individuals and consists of modifying attributes in the dataset such that they are less accurate whilst retaining the overall distribution. When processing a dataset, an observer will assume that values are accurate but this will only be true to a certain degree.

Example
As an example, if an individual’s height was originally measured to the nearest centimetre, the anonymized dataset may contain a height accurate only to ±10 cm. If this technique is applied effectively, a third party will not be able to identify an individual, nor should they be able to repair the data or otherwise detect how the data have been modified. Noise addition will commonly need to be combined with other anonymization techniques such as the removal of obvious attributes and quasi-identifiers. The level of noise should depend on the level of information required and on the impact on individuals’ privacy resulting from disclosure of the protected attributes.
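
A minimal sketch of the idea, using uniform noise of up to ±10 cm on hypothetical height values (the data and the noise magnitude mirror the example above and are assumptions):

import random

# hypothetical heights in centimetres, measured to the nearest centimetre
heights = [172, 181, 165]

# add uniform noise of up to +-10 cm: individual values become less accurate
# while the overall distribution is roughly preserved
noisy_heights = [h + random.randint(-10, 10) for h in heights]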

Block words

This technique can be used to hide certain sensitive words in your dataset. These words can be either masked, i.e. replaced with XXXX, or tokenized, i.e. replaced with a non-deterministic UID. During an anonymization run, every occurrence of a word from the word list is replaced by the same UID. However, if you run a new anonymization, a new UID is generated, even if the word is the same. Also, please note that this system does not provide any way to decipher the token and reveal the original data.

Example
As an example, if the word list contains the words “Doe” and “Smith”, the text “Mr. Doe and Ms. Smith were sentenced to 100 months imprisonment” will become:

  • “Mr. XXXX and Ms. XXXX were sentenced to 100 months imprisonment” if you choose the Mask option.
  • “Mr. edc8e2e1-2963-49b0-b32b-f553a7378985 and Ms. 15bfc4a5-dc2d-4c27-8187-3c0e84c9043d were sentenced to 100 months imprisonment” if you choose the Tokenize option.
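
As a generic sketch of the two options (not this library's internal implementation), masking and tokenizing a word list could be done like this:

import uuid

text = "Mr. Doe and Ms. Smith were sentenced to 100 months imprisonment"
word_list = ["Doe", "Smith"]

# Mask option: every listed word is replaced with XXXX
masked = text
for word in word_list:
    masked = masked.replace(word, "XXXX")

# Tokenize option: each listed word gets one random UID per anonymization run;
# a new run would generate different UIDs, and the tokens cannot be deciphered
tokens = {word: str(uuid.uuid4()) for word in word_list}
tokenized = text
for word, token in tokens.items():
    tokenized = tokenized.replace(word, token)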
