Anonymizing Library for Apache Spark
spark-privacy-preserver
This module provides a simple tool for anonymizing a dataset using PySpark. Given a spark.sql.dataframe with relevant metadata, mondrian_privacy_preserver generates an anonymized spark.sql.dataframe. It provides the following privacy-preserving techniques for anonymization:
- K Anonymity
- L Diversity
- T Closeness
Note: Only works with PySpark
Demo
A Jupyter notebook is included for each of the following modules:
- Mondrian Based Anonymity (Single User Anonymization included)
- Clustering Based K-Anonymity
- Differential Privacy
Requirements
- PySpark 2.4.5. You can easily install it with pip install pyspark.
- PyArrow. You can easily install it with pip install pyarrow.
- Pandas. You can easily install it with pip install pandas.
Installation
Using pip
Use pip install spark_privacy_preserver to install the library.
Using source code
Add the spark_privacy_preserver folder to your working directory and you can import the required submodule from the library.
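The usage examples below assume an active SparkSession named spark. A minimal setup sketch follows; the app name is arbitrary, and the Arrow setting (shown with its Spark 2.4.x config key) is an optional speed-up for pandas interop rather than a documented requirement of this library.

from pyspark.sql import SparkSession

# Minimal session for the examples that follow.
spark = SparkSession.builder \
    .appName('spark-privacy-preserver-example') \
    .config('spark.sql.execution.arrow.enabled', 'true') \
    .getOrCreate()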
Usage - Basic Mondrian
K Anonymity
The spark.sql.dataframe you get after anonymizing will always contain an extra column count, which indicates the number of similar rows. The return type of all non-categorical columns will be string.
from spark_privacy_preserver.mondrian_preserver import Preserver  # requires pandas

# df - spark.sql.dataframe - original dataframe
# k - int - value of k
# feature_columns - list - columns you want in the output dataframe
# sensitive_column - string - the sensitive attribute
# categorical - set - all categorical columns of the original dataframe, as a set
# schema - spark.sql.types StructType - schema of the output dataframe you are expecting
df = spark.read.csv(your_csv_file).toDF('age',
                                        'occupation',
                                        'race',
                                        'sex',
                                        'hours-per-week',
                                        'income')

categorical = set((
    'occupation',
    'sex',
    'race'
))

feature_columns = ['age', 'occupation']
sensitive_column = 'income'

your_anonymized_dataframe = Preserver.k_anonymize(df,
                                                  k,
                                                  feature_columns,
                                                  sensitive_column,
                                                  categorical,
                                                  schema)
The following code snippet shows how to construct an example schema. Always include the count column when constructing the schema; it is an integer-type column.
from pyspark.sql.types import *

# age, occupation - feature columns
# income - sensitive column
schema = StructType([
    StructField("age", DoubleType()),
    StructField("occupation", StringType()),
    StructField("income", StringType()),
    StructField("count", IntegerType())
])
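With the schema defined and a value chosen for k, the call shown earlier can be run and the result inspected. The snippet below is a sketch; k = 3 is an arbitrary example value.

k = 3  # example: each group of similar rows must contain at least 3 rows

your_anonymized_dataframe = Preserver.k_anonymize(df, k, feature_columns,
                                                  sensitive_column, categorical, schema)
your_anonymized_dataframe.show()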
K Anonymity (without row suppression)
This function provides a simple way to anonymize a dataset that has a user identification attribute, without grouping the rows.
Unlike the function above, it doesn't return a dataframe with a count column. Instead it returns the same dataframe, k-anonymized. The return type of all non-categorical columns will be string.
The user attribute column must not be given as a feature column, and its return type will be the same as the input type.
The function takes exactly the same parameters as the function above. To anonymize the dataset with this method, call k_anonymize_w_user instead of k_anonymize, as in the sketch below.
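A minimal sketch reusing the variables from the previous example. Since no count column is returned, the expected schema presumably mirrors the input columns instead of adding count; that is an assumption, not documented behavior.

# Same arguments as k_anonymize; the user identifier column stays out of feature_columns.
your_anonymized_dataframe = Preserver.k_anonymize_w_user(df,
                                                         k,
                                                         feature_columns,
                                                         sensitive_column,
                                                         categorical,
                                                         schema)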
L Diversity
As with K Anonymity, the spark.sql.dataframe you get after anonymizing will always contain an extra column count, which indicates the number of similar rows. The return type of all non-categorical columns will be string. Here l controls the diversity of the sensitive attribute (in basic l-diversity, each group of similar rows contains at least l distinct sensitive values).
from spark_privacy_preserver.mondrian_preserver import Preserver  # requires pandas

# df - spark.sql.dataframe - original dataframe
# k - int - value of k
# l - int - value of l
# feature_columns - list - columns you want in the output dataframe
# sensitive_column - string - the sensitive attribute
# categorical - set - all categorical columns of the original dataframe, as a set
# schema - spark.sql.types StructType - schema of the output dataframe you are expecting
df = spark.read.csv(your_csv_file).toDF('age',
                                        'occupation',
                                        'race',
                                        'sex',
                                        'hours-per-week',
                                        'income')

categorical = set((
    'occupation',
    'sex',
    'race'
))

feature_columns = ['age', 'occupation']
sensitive_column = 'income'

your_anonymized_dataframe = Preserver.l_diversity(df,
                                                  k,
                                                  l,
                                                  feature_columns,
                                                  sensitive_column,
                                                  categorical,
                                                  schema)
L Diversity (without row suppression)
This function provides a simple way to anonymize a dataset that has a user identification attribute, without grouping the rows.
Unlike the function above, it doesn't return a dataframe with a count column. Instead it returns the same dataframe, l-diversity anonymized. The return type of all non-categorical columns will be string.
The user attribute column must not be given as a feature column, and its return type will be the same as the input type.
The function takes exactly the same parameters as the function above. To anonymize the dataset with this method, call l_diversity_w_user instead of l_diversity, as in the sketch below.
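As with k_anonymize_w_user, a one-line sketch with the same argument order as l_diversity:

your_anonymized_dataframe = Preserver.l_diversity_w_user(df, k, l, feature_columns,
                                                         sensitive_column, categorical, schema)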
T Closeness
As with K Anonymity, the spark.sql.dataframe you get after anonymizing will always contain an extra column count, which indicates the number of similar rows. The return type of all non-categorical columns will be string. Here t bounds how far the distribution of the sensitive attribute within each group may differ from its distribution in the whole dataset.
from spark_privacy_preserver.mondrian_preserver import Preserver  # requires pandas

# df - spark.sql.dataframe - original dataframe
# k - int - value of k
# t - float - value of t (the closeness threshold)
# feature_columns - list - columns you want in the output dataframe
# sensitive_column - string - the sensitive attribute
# categorical - set - all categorical columns of the original dataframe, as a set
# schema - spark.sql.types StructType - schema of the output dataframe you are expecting
df = spark.read.csv(your_csv_file).toDF('age',
                                        'occupation',
                                        'race',
                                        'sex',
                                        'hours-per-week',
                                        'income')

categorical = set((
    'occupation',
    'sex',
    'race'
))

feature_columns = ['age', 'occupation']
sensitive_column = 'income'

your_anonymized_dataframe = Preserver.t_closeness(df,
                                                  k,
                                                  t,
                                                  feature_columns,
                                                  sensitive_column,
                                                  categorical,
                                                  schema)
T Closeness (without row suppression)
This function provides a simple way to anonymize a dataset that has a user identification attribute, without grouping the rows.
Unlike the function above, it doesn't return a dataframe with a count column. Instead it returns the same dataframe, t-closeness anonymized. The return type of all non-categorical columns will be string.
The user attribute column must not be given as a feature column, and its return type will be the same as the input type.
The function takes exactly the same parameters as the function above. To anonymize the dataset with this method, call t_closeness_w_user instead of t_closeness, as in the sketch below.
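Likewise, a one-line sketch with the same argument order as t_closeness:

your_anonymized_dataframe = Preserver.t_closeness_w_user(df, k, t, feature_columns,
                                                         sensitive_column, categorical, schema)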
Single User K Anonymity
This function provides a simple way to anonymize a given user in a dataset. Even though it doesn't use the Mondrian algorithm, the function is included in mondrian_preserver. The user identification attribute and the name of the column containing it are needed as parameters.
This doesn't return a dataframe with a count column. Instead it returns the same dataframe, anonymized for the given user. The return type of the user column and all non-categorical columns will be string.
from spark_privacy_preserver.mondrian_preserver import Preserver  # requires pandas

# df - spark.sql.dataframe - original dataframe
# k - int - value of k
# user - name, id, or number identifying the user; the unique user identification attribute
# usercolumn_name - name of the column containing the unique user identification attribute
# sensitive_column - string - the sensitive attribute
# categorical - set - all categorical columns of the original dataframe, as a set
# schema - spark.sql.types StructType - schema of the output dataframe you are expecting
# random - a flag, False by default. If the algorithm can't find similar rows for the
#          given user and this is set to True, it will randomly select rows from the dataframe.
df = spark.read.csv(your_csv_file).toDF('name',
                                        'age',
                                        'occupation',
                                        'race',
                                        'sex',
                                        'hours-per-week',
                                        'income')

categorical = set((
    'occupation',
    'sex',
    'race'
))

sensitive_column = 'income'
user = 'Jon'
usercolumn_name = 'name'
random = True

your_anonymized_dataframe = Preserver.anonymize_user(df,
                                                     k,
                                                     user,
                                                     usercolumn_name,
                                                     sensitive_column,
                                                     categorical,
                                                     schema,
                                                     random)
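Since this variant returns the original columns rather than grouped rows with a count, a matching schema presumably lists every input column. The sketch below is an assumption based on the return-type note above (user column and non-categorical columns come back as string), not taken from the library's documentation.

from pyspark.sql.types import StructType, StructField, StringType

# Assumed output schema: same columns as the input dataframe.
# The user column ('name') and the non-categorical columns are StringType,
# per the note above; the categorical columns were read from CSV as strings anyway.
schema = StructType([
    StructField('name', StringType()),
    StructField('age', StringType()),
    StructField('occupation', StringType()),
    StructField('race', StringType()),
    StructField('sex', StringType()),
    StructField('hours-per-week', StringType()),
    StructField('income', StringType())
])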