Python package anonym
anonym
The anonym library is designed to anonymize sensitive data in Python, allowing users to work with, share, or publish their data without compromising privacy or violating data protection regulations. It uses Named Entity Recognition (NER) from spacy to identify sensitive information in the data. Once identified, the library leverages the faker library to generate fake but realistic replacements. Depending on the type of sensitive information (such as names, addresses, or dates), the corresponding faker methods are used, so the anonymized data keeps a structure and format similar to the original and remains suitable for further data analysis or testing.
The anonym algorithm is designed to anonymize data in a DataFrame. It works by replacing real data with fake data while maintaining the structure and format of the original data. Here's a step-by-step explanation of how it works:
1. Initialization: The anonym class is initialized with a language parameter (default is 'dutch') and a verbosity level (default is 'info'). The language parameter is used to load the appropriate language model for named entity recognition (NER), and the verbosity level sets the logger's verbosity.
2. Data Import: The import_data method is used to import a dataset from a given file path. The data is read into a pandas DataFrame.
3. Data Anonymization: The anonymize method is the core of the algorithm. It takes a DataFrame and optional parameters for specifying columns to fake or not to fake, and a NER blacklist. The method works as follows:
   - It calls the extract_entities function to extract all entities from the DataFrame. This function uses the spacy library's NER capabilities to identify entities in the data. If a column is specified in the fakeit parameter, the entities in that column are replaced with the specified fake replacement. If a column is specified in the do_not_fake parameter, it is left untouched. Otherwise, NER is performed on each row of the column.
   - The generate_fake_labels function is then called to generate fake labels for the extracted entities. This function uses the faker library to generate fake data that matches the type of the original data (e.g., names, companies, dates, cities).
   - The replace_label_with_fake function is then used to replace the original entities in the DataFrame with the generated fake labels.
4. Data Export: The to_csv method is used to write the anonymized DataFrame to a CSV file.
5. Example Data Import: The import_example method is used to import example datasets from a GitHub source or a specified URL.
Start
|
v
Initialize `anonym` class
|
v
Import data using `import_data` method
|
v
Anonymize data using `anonymize` method
| |
| v
| Extract entities using `extract_entities` function
| |
| v
| Generate fake labels using `generate_fake_labels` function
| |
| v
| Replace original labels with fake ones using `replace_label_with_fake` function
v
Export anonymized data using `to_csv` method
|
v
End
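The three inner stages of the flow above (extract, generate, replace) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: a list of dicts stands in for the DataFrame, a regular expression for capitalized words stands in for spacy NER, numbered placeholders stand in for faker output, and the `_sketch` function names and their signatures are hypothetical.

```python
import re
from itertools import count

# Stand-in for NER: runs of capitalized words count as entities.
# anonym uses spacy here; this regex is for illustration only.
_ENTITY = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")


def extract_entities_sketch(rows, do_not_fake=()):
    """Collect distinct entities from every column not listed in do_not_fake."""
    found = []
    for row in rows:
        for column, value in row.items():
            if column in do_not_fake or not isinstance(value, str):
                continue
            for entity in _ENTITY.findall(value):
                if entity not in found:
                    found.append(entity)
    return found


def generate_fake_labels_sketch(entities):
    """Map each entity to one fake label so repeated mentions stay consistent."""
    numbers = count(1)
    return {entity: f"Entity{next(numbers)}" for entity in entities}


def replace_label_with_fake_sketch(rows, mapping, do_not_fake=()):
    """Return new rows with mapped entities replaced; the input is not modified."""
    if not mapping:
        return [dict(row) for row in rows]
    # Longest entities first, so a full name wins over a shorter entity it contains.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    result = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column in do_not_fake or not isinstance(value, str):
                new_row[column] = value
            else:
                new_row[column] = pattern.sub(lambda m: mapping[m.group(0)], value)
        result.append(new_row)
    return result


rows = [
    {"id": "A1", "name": "Anna Berg", "note": "Anna Berg met Tom Lind in Oslo"},
    {"id": "B2", "name": "Tom Lind", "note": "paid by Tom Lind"},
]
entities = extract_entities_sketch(rows, do_not_fake=("id",))
mapping = generate_fake_labels_sketch(entities)
fake_rows = replace_label_with_fake_sketch(rows, mapping, do_not_fake=("id",))
for row in fake_rows:
    print(row)
```

Building the entity-to-label mapping once, before any replacement, is what keeps the same person mapped to the same fake label in every row and column.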
The algorithm also includes several utility functions for text cleaning, preprocessing, filtering values, checking the spacy model, and setting the logger. The main function at the end of the script demonstrates how to use the anonym class to import an example dataset, anonymize it, and plot the results.
Documentation
Installation
- Install anonym from PyPI (recommended). anonym is compatible with Python 3.6+ and runs on Linux, macOS, and Windows.
- A new environment can be created as follows:
conda create -n env_anonym python=3.10
conda activate env_anonym
pip install anonym # normal install
pip install --upgrade anonym # or update if needed
- Alternatively, you can install from the GitLab source:
# Install directly from the GitLab source (pick one of the following)
pip install -e git+https://gitlab.com/datainnovatielab/public/anonym.git@0.1.0#egg=anonym
pip install git+https://gitlab.com/datainnovatielab/public/anonym#egg=anonym
pip install git+https://gitlab.com/datainnovatielab/public/anonym
# By cloning
git clone https://gitlab.com/datainnovatielab/public/anonym.git
cd anonym
pip install -U .
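After installing, it can be useful to confirm which version ended up in the active environment. The helper below is a small sketch using the standard library's importlib.metadata, which requires Python 3.8 or newer; the function name installed_version is an illustrative choice and not part of anonym.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version if anonym is installed in this environment, otherwise None.
print(installed_version("anonym"))
```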
Import anonym package
import anonym as anonym
Example:
# Load library
from anonym import anonym
# Initialize
model = anonym(language='english', verbose='info')
# Import example data set
df = model.import_example('titanic')
# Anonymize the data set
df_fake = model.anonymize(df)
Citation
Please cite in your publications if this is useful for your research (see citation).
Contribute
- All kinds of contributions are welcome!
Licence
See LICENSE for details.
File details
Details for the file anonym-0.1.1.tar.gz.
File metadata
- Download URL: anonym-0.1.1.tar.gz
- Upload date:
- Size: 11.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | c970ae49426b4226ebaa51d79cd4608085f67ec75768d6488e86529c92a21066
MD5 | b39c4861f616b7abf0e2f23eaeb2c40d
BLAKE2b-256 | 3d2eceb1c9a14718d09e72cac5ae8156359c8a74ef73ce33d713edeb52271540
File details
Details for the file anonym-0.1.1-py3-none-any.whl.
File metadata
- Download URL: anonym-0.1.1-py3-none-any.whl
- Upload date:
- Size: 10.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | b01027ce0cd747c0d19de13e56fab8bbd80abcde45cb1a3719ff47350ab8baad
MD5 | 45e2862335d37dc8cac1fd91cc5b1b32
BLAKE2b-256 | 29e6249fe755dcacd24ec8be482b4287ba34dd7a4c338b93a6f08683bbddd92b