
Python library to anonymize CSV datasets

Project description

Anonymization library

Overview

Anonymize your CSV datasets with 4 techniques: hash function, aggregation and K-anonymity, noise addition, and block words.

Your CSV Dataset

  • Your dataset must be a CSV file with a comma or semicolon separator, encoded in UTF-8 (a loading sketch follows below).
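
As a minimal loading sketch (the file name and separator used here are assumptions, not part of the library):

import pandas as pd

# load a semicolon-separated, UTF-8 encoded CSV file into a DataFrame
# ("my_dataset.csv" is only an illustrative file name)
data = pd.read_csv("my_dataset.csv", sep=";", encoding="utf-8")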

The configuration file

  • The configuration file (.json) is generated with the Cosmian Anonymization tool. Install the tool with Docker and generate a config file.

Using library

The anonymize_dataset function takes 2 arguments:

  • data: the dataset to anonymize. It must be a pandas DataFrame
  • config: the JSON anonymization configuration, loaded as a dictionary

Example

from pathlib import Path
import json

import pandas as pd

from cosmian_lib_anonymization import anonymize_dataset

# dataset to anonymize, as a pandas DataFrame
data_to_anonymize = pd.DataFrame({
    "first_name": ["Jane", "Bob", "John"],
    "last_name": ["Warner", "Smith", "Moor"],
    "birthdate": ["01/12/1979", "06/28/85", "08/20/96"],
    "city": ["London", "Munich", "Beijing"],
})

# load the config file and parse the JSON into a dictionary
json_config_path = Path("config.json").resolve()
config = json.loads(json_config_path.read_text(encoding="utf-8"))

# get the anonymized result
df_result = anonymize_dataset(data_to_anonymize, config)
df_result.head()

Anonymization techniques

Hash function

This corresponds to a function which returns a fixed-size output from an input of any size (the input may be a single attribute or a set of attributes) and cannot be reversed; this means that the reversal risk seen with encryption no longer exists. However, if the range of input values to the hash function is known, they can be replayed through the hash function in order to derive the correct value for a particular record.

Example
For instance, if a dataset was pseudonymized by hashing the national identification number, then the original number can be derived simply by hashing all possible input values and comparing the results with the values in the dataset. Hash functions are usually designed to be relatively fast to compute, and are subject to brute-force attacks. Pre-computed tables can also be created to allow for the bulk reversal of a large set of hash values.
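
The sketch below illustrates the principle with Python's standard hashlib; it is not the library's own implementation, and the column name is an assumption:

import hashlib

import pandas as pd

# illustrative data: the "national_id" column name is an assumption
df = pd.DataFrame({"national_id": ["1790275123456", "8506281234567"]})

# replace each identifier with its SHA-256 digest; the mapping cannot be
# reversed directly, but known input values can still be replayed through
# the same hash function and matched against the dataset
df["national_id"] = df["national_id"].apply(
    lambda value: hashlib.sha256(value.encode("utf-8")).hexdigest()
)
print(df)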

Aggregation and K-anonymity

Aggregation and K-anonymity techniques aim to prevent a data subject from being singled out by grouping them with, at least, k other individuals. To achieve this, the attribute values are generalized to an extent such that at least k individuals share the same value.

Example
For example, by lowering the granularity of a location from a city to a country, a higher number of data subjects are included. Individual dates of birth can be generalized into a range of dates, or grouped by month or year. Other numerical attributes (e.g. salaries, weight, height, or the dose of a medicine) can be generalized into interval values (e.g. salary €20,000 – €30,000). These methods may be used when the correlation of the precise values of attributes may create quasi-identifiers.
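
As an illustrative sketch (not the library's implementation; column names, values and the country mapping are assumptions), dates of birth can be generalized to the year and cities to countries, and the resulting group sizes checked against k:

import pandas as pd

# illustrative records
df = pd.DataFrame({
    "birthdate": ["1979-12-01", "1979-03-15", "1985-06-28"],
    "city": ["London", "Manchester", "Munich"],
})

# generalize: keep only the year of birth and replace each city by its country
df["birth_year"] = pd.to_datetime(df["birthdate"]).dt.year
df["country"] = df["city"].map({"London": "UK", "Manchester": "UK", "Munich": "Germany"})

# check the size of each group of generalized quasi-identifiers against k
k = 2
group_sizes = df.groupby(["birth_year", "country"]).size()
print(group_sizes[group_sizes < k])  # groups that still contain fewer than k individuals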

Noise addition

The technique of noise addition is especially useful when attributes may have an important adverse effect on individuals and consists of modifying attributes in the dataset such that they are less accurate whilst retaining the overall distribution. When processing a dataset, an observer will assume that values are accurate but this will only be true to a certain degree.

Example
As an example, if an individual's height was originally measured to the nearest centimetre, the anonymized dataset may contain a height accurate to only ±10 cm. If this technique is applied effectively, a third party will not be able to identify an individual, nor should they be able to restore the original data or otherwise detect how the data have been modified. Noise addition will commonly need to be combined with other anonymization techniques such as the removal of obvious attributes and quasi-identifiers. The level of noise should depend on the level of information required and the impact on individuals' privacy as a result of disclosure of the protected attributes.
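
A minimal sketch of this idea on a numeric column (the ±10 cm magnitude mirrors the example above; the column name, values and noise distribution are assumptions):

import numpy as np
import pandas as pd

# illustrative heights measured to the nearest centimetre
df = pd.DataFrame({"height_cm": [162.0, 175.0, 181.0, 169.0]})

# add uniform noise of up to +-10 cm, then round, so each individual value is
# less accurate while the overall distribution is roughly preserved
rng = np.random.default_rng()
noise = rng.uniform(-10.0, 10.0, size=len(df))
df["height_cm"] = (df["height_cm"] + noise).round()
print(df)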

Block words

This technique can be used to hide certain sensitive words in your dataset. These words can be either masked, i.e. replaced with XXXX, or tokenized, i.e. replaced with a non-deterministic UID. During an anonymization run, every occurrence of a word that is in the word list is replaced with the same UID. But if you run a new anonymization, a new UID will be generated, even if the word is the same. Also, please note that this system does not provide any way to decipher the token and reveal the original data.

Example
As an example, if the word list contains the words “Doe” and “Smith”, the text “Mr. Doe and Ms. Smith were sentenced to 100 months imprisonment” will become:

  • “Mr. XXXX and Ms. XXXX were sentenced to 100 months imprisonment” if you choose the Mask option.
  • “Mr. edc8e2e1-2963-49b0-b32b-f553a7378985 and Ms. 15bfc4a5-dc2d-4c27-8187-3c0e84c9043d were sentenced to 100 months imprisonment” if you choose the Tokenize option.
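
A minimal sketch of both options (not the library's implementation; the word list and text come from the example above):

import re
import uuid

word_list = ["Doe", "Smith"]
text = "Mr. Doe and Ms. Smith were sentenced to 100 months imprisonment"

# Mask option: every listed word is replaced with XXXX
masked = text
for word in word_list:
    masked = re.sub(rf"\b{re.escape(word)}\b", "XXXX", masked)

# Tokenize option: each listed word gets one random UID for this run; a new
# run would generate different UIDs, and the tokens cannot be reversed
tokens = {word: str(uuid.uuid4()) for word in word_list}
tokenized = text
for word, token in tokens.items():
    tokenized = re.sub(rf"\b{re.escape(word)}\b", token, tokenized)

print(masked)
print(tokenized)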
