Python package anonym

anonym

  • The anonym library anonymizes sensitive data in Python so that users can work with, share, or publish their data without compromising privacy or violating data protection regulations. It uses Named Entity Recognition (NER) from spaCy to identify sensitive information in the data. Once identified, the library uses the faker library to generate fake but realistic replacements. The faker method is chosen to match the type of sensitive information (names, addresses, dates, and so on), so the anonymized data keeps a structure and format similar to the original and remains suitable for further data analysis or testing.

  • The anonym algorithm is designed to anonymize data in a DataFrame. It works by replacing real data with fake data, while maintaining the structure and format of the original data. Here's a step-by-step explanation of how it works:

1. Initialization: The anonym class is initialized with a language parameter (default is 'dutch') and a verbosity level (default is 'info'). The language parameter is used to load the appropriate language model for named entity recognition (NER), and the verbosity level sets the logger's verbosity.

2. Data Import: The import_data method is used to import a dataset from a given file path. The data is read into a pandas DataFrame.

3. Data Anonymization: The anonymize method is the core of the algorithm. It takes a DataFrame, optional parameters that specify which columns to fake and which to leave untouched, and a NER blacklist. The method works as follows:

   a. It calls the extract_entities function to extract all entities from the DataFrame. This function uses spaCy's NER capabilities to identify entities in the data. If a column is specified in the fakeit parameter, the entities in that column are replaced with the specified fake replacement. If a column is specified in the do_not_fake parameter, it is left untouched. Otherwise, NER is performed on each row of the column.

   b. The generate_fake_labels function is then called to generate fake labels for the extracted entities. This function uses the faker library to generate fake data that matches the type of the original data (e.g., names, companies, dates, cities).

   c. The replace_label_with_fake function then replaces the original entities in the DataFrame with the generated fake labels.

4. Data Export: The to_csv method is used to write the anonymized DataFrame to a CSV file.

5. Example Data Import: The import_example method is used to import example datasets from a GitHub source or a specified URL.
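The column routing described in the anonymization step (fakeit columns replaced, do_not_fake columns untouched, everything else scanned for entities) can be illustrated with a small standard-library sketch. This is not the anonym implementation: `anonymize_table`, `toy_ner_replace`, and the `entity_map` argument are hypothetical names, a plain dict of lists stands in for the DataFrame, and a fixed lookup table stands in for spaCy NER and faker.

```python
import re


def toy_ner_replace(text, entity_map):
    """Replace each known entity in `text` with its fake label.

    `entity_map` (entity -> fake) is a stand-in for real NER output.
    """
    for real, fake in entity_map.items():
        text = re.sub(r"\b" + re.escape(real) + r"\b", fake, text)
    return text


def anonymize_table(table, fakeit=None, do_not_fake=None, entity_map=None):
    """Route every column through one of three branches.

    table:       dict mapping column name -> list of cell values
    fakeit:      dict mapping column name -> callable(row_index) -> fake value
    do_not_fake: column names that must stay unchanged
    entity_map:  known entities and their replacements (toy NER)
    """
    fakeit = fakeit or {}
    do_not_fake = set(do_not_fake or [])
    entity_map = entity_map or {}
    result = {}
    for column, values in table.items():
        if column in fakeit:
            # Whole cell is replaced by the generator chosen for this column.
            result[column] = [fakeit[column](i) for i in range(len(values))]
        elif column in do_not_fake:
            # Column is copied as-is.
            result[column] = list(values)
        else:
            # Fallback: look for entities in every row of the column.
            result[column] = [toy_ner_replace(str(v), entity_map) for v in values]
    return result


table = {
    "name": ["Alice Smith", "Bob Jones"],
    "ticket": ["A-100", "B-200"],
    "note": ["Alice Smith boarded in Paris", "Bob Jones stayed home"],
}
out = anonymize_table(
    table,
    fakeit={"name": lambda i: f"Person {i + 1}"},
    do_not_fake=["ticket"],
    entity_map={"Alice Smith": "Jane Doe", "Bob Jones": "John Roe", "Paris": "Springfield"},
)
print(out)
```

In this sketch the `name` column is overwritten wholesale, `ticket` is preserved, and only the free-text `note` column goes through entity replacement.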

Start
  |
  v
Initialize `anonym` class
  |
  v
Import data using `import_data` method
  |
  v
Anonymize data using `anonymize` method
  |         |
  |         v
  |     Extract entities using `extract_entities` function
  |         |
  |         v
  |     Generate fake labels using `generate_fake_labels` function
  |         |
  |         v
  |     Replace original labels with fake ones using `replace_label_with_fake` function
  v
Export anonymized data using `to_csv` method
  |
  v
End
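
The three inner steps of the flowchart (extract entities, generate fake labels, replace) can be sketched with the standard library alone. The code below is a toy illustration, not anonym's code: the `toy_*` functions are hypothetical, two regular expressions stand in for a spaCy NER model, and numbered placeholders stand in for the realistic values faker would produce. The point it shows is that each distinct entity gets one replacement, so repeated mentions stay consistent across rows.

```python
import re

# Toy patterns standing in for a real NER model (assumption, not spaCy).
PATTERNS = {
    "PERSON": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def toy_extract_entities(texts):
    """Return {entity_text: label} for every pattern match in `texts`."""
    entities = {}
    for text in texts:
        for label, pattern in PATTERNS.items():
            for match in pattern.findall(text):
                entities.setdefault(match, label)
    return entities


def toy_generate_fake_labels(entities):
    """Assign one placeholder per distinct entity, numbered per label."""
    counters = {}
    fakes = {}
    for entity, label in entities.items():
        counters[label] = counters.get(label, 0) + 1
        fakes[entity] = f"{label}_{counters[label]}"
    return fakes


def toy_replace_with_fake(texts, fakes):
    """Replace entities in every text, longest entity first."""
    ordered = sorted(fakes, key=len, reverse=True)
    replaced = []
    for text in texts:
        for entity in ordered:
            text = text.replace(entity, fakes[entity])
        replaced.append(text)
    return replaced


texts = [
    "Alice Smith arrived on 2021-03-04.",
    "On 2021-03-04 Alice Smith met Bob Jones.",
]
entities = toy_extract_entities(texts)
fakes = toy_generate_fake_labels(entities)
anonymized = toy_replace_with_fake(texts, fakes)
print(anonymized)
```

Replacing the longest entities first is a small design choice that keeps a short entity from overwriting part of a longer one that contains it.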

The algorithm also includes several utility functions for text cleaning, preprocessing, filtering values, checking the spaCy model, and setting the logger. The main function at the end of the script demonstrates how to use the anonym class to import an example dataset, anonymize it, and plot the results.

Installation

  • Install anonym from PyPI (recommended). anonym is compatible with Python 3.6+ and runs on Linux, macOS, and Windows.
  • A new environment can be created as follows:
conda create -n env_anonym python=3.10
conda activate env_anonym
pip install anonym            # normal install
pip install --upgrade anonym # or update if needed
  • Alternatively, you can install from the GitLab source:
# Install directly from the GitLab source
pip install git+https://gitlab.com/datainnovatielab/public/anonym.git
# Or pin the 0.1.0 tag
pip install git+https://gitlab.com/datainnovatielab/public/anonym.git@0.1.0

# By cloning
git clone https://gitlab.com/datainnovatielab/public/anonym.git
cd anonym
pip install -U .
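
After installing, one way to confirm that the package is visible to the active environment is to query its installed version. The helper below is a minimal sketch using only the standard library; `installed_version` is a hypothetical name, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if anonym is installed in this environment, else None.
print(installed_version("anonym"))
```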

Import anonym package

import anonym

Example:

# Load the library
from anonym import anonym
# Initialize with the English language model
model = anonym(language='english', verbose='info')
# Import the example data set
df = model.import_example('titanic')
# Anonymize the data set
df_fake = model.anonymize(df)

Citation

Please cite in your publications if this is useful for your research (see citation).

Contribute

  • All kinds of contributions are welcome!

Licence

See LICENSE for details.
