Improved k-Anonymity and l-Diversity library
pyikaild
A Python package for data privacy and anonymization implementing Improved k-Anonymity (IKA) and Improved l-Diversity (ILD) algorithms.
Overview
pyikaild provides implementations of two key privacy-preserving techniques for sensitive data:
- Improved k-Anonymity (IKA): Ensures each record is indistinguishable from at least k-1 other records based on quasi-identifier attributes through generalization techniques.
- Improved l-Diversity (ILD): Ensures each equivalence class (group of records with identical quasi-identifiers) contains at least l distinct values for sensitive attributes.
These techniques reduce re-identification risk in a dataset while preserving its utility for analysis, which can support compliance with privacy regulations and best practices.
Installation
pip install pyikaild
Key Concepts
- Quasi-Identifiers (QI): Attributes that, when combined, could potentially identify an individual (e.g., age, zip code, gender)
- Sensitive Attribute (SA): Data that should be protected (e.g., disease, salary)
- k-Anonymity: Each record is indistinguishable from at least k-1 other records
- l-Diversity: Each group of records with identical QIs has at least l different values for sensitive attributes
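The two definitions above can be checked directly on a table. The following is a minimal sketch in plain Python, independent of pyikaild; the helper names (`equivalence_classes`, `is_k_anonymous`, `is_l_diverse`) and the generalized toy values are illustrative assumptions, not part of the package.

```python
from collections import defaultdict


def equivalence_classes(records, qi_attributes):
    """Group records that share identical quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[qi] for qi in qi_attributes)
        groups[key].append(record)
    return dict(groups)


def is_k_anonymous(records, qi_attributes, k):
    """True if every equivalence class holds at least k records."""
    classes = equivalence_classes(records, qi_attributes)
    return all(len(group) >= k for group in classes.values())


def is_l_diverse(records, qi_attributes, sa_attribute, l):
    """True if every equivalence class has at least l distinct SA values."""
    classes = equivalence_classes(records, qi_attributes)
    return all(
        len({record[sa_attribute] for record in group}) >= l
        for group in classes.values()
    )


# Hypothetical, already-generalized toy table (values invented for illustration).
table = [
    {"Age": "[45-52]", "Zipcode": "4000**", "Disease": "Flu"},
    {"Age": "[45-52]", "Zipcode": "4000**", "Disease": "Pneumonia"},
    {"Age": "[45-52]", "Zipcode": "4000**", "Disease": "Flu"},
    {"Age": "[53-67]", "Zipcode": "*", "Disease": "Hepatitis"},
    {"Age": "[53-67]", "Zipcode": "*", "Disease": "Hepatitis"},
    {"Age": "[53-67]", "Zipcode": "*", "Disease": "Hepatitis"},
]

print(is_k_anonymous(table, ["Age", "Zipcode"], k=3))           # True
print(is_l_diverse(table, ["Age", "Zipcode"], "Disease", l=2))  # False
```

The second check fails because every record in the second class shares the same disease: the table is 3-anonymous, yet an attacker who places someone in that class still learns the sensitive value. This is the gap l-diversity is meant to close.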
Usage
Basic Example
from pyikaild.ika import IKA
from pyikaild.ild import ILD
import pandas as pd
# Sample dataset
data = {
    'Age': [45, 47, 52, 53, 64, 67, 62],
    'Zipcode': [400052, 400058, 400032, 400045, 100032, 100053, 200045],
    'Disease': ['Flu', 'Pneumonia', 'Flu', 'Stomach ulcers', 'Stomach infection', 'Hepatitis', 'Stomach cancer']
}
df = pd.DataFrame(data)
# Define quasi-identifiers and sensitive attribute
qi_attributes = ['Age', 'Zipcode']
sa_attribute = 'Disease'
numerical_qi = ['Age', 'Zipcode'] # Specify which QIs are numerical
# Apply k-anonymity (k=3)
ika = IKA(k=3,
          qi_attributes=qi_attributes,
          sa_attribute=sa_attribute,
          numerical_qi=numerical_qi)
anonymized_df = ika.fit_transform(df)
# Calculate information loss
info_loss = ika.get_information_loss()
print(f"Information Loss: {info_loss:.4f}")
# Apply l-diversity (l=2) on the k-anonymized data
ild = ILD(l=2, qi_attributes=qi_attributes, sa_attribute=sa_attribute)
diverse_df = ild.transform(anonymized_df)
# Verify l-diversity
for name, group in diverse_df.groupby(qi_attributes):
    print(f"Group {name}: SA count = {group[sa_attribute].nunique()}")
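This README does not define the metric behind `get_information_loss()`. As a rough intuition for what such a number can capture, the sketch below computes one common measure for a numerical QI: the width of each generalized range divided by the width of the attribute's full domain, averaged over records. The function names are hypothetical, and the package's own metric may differ.

```python
def range_loss(low, high, domain_low, domain_high):
    """Fraction of the attribute's domain covered by a generalized range."""
    if domain_high == domain_low:
        return 0.0  # a constant attribute loses nothing when generalized
    return (high - low) / (domain_high - domain_low)


def average_numeric_loss(groups, domain):
    """Record-weighted average range loss for one numerical QI.

    groups: one list of original values per equivalence class.
    domain: (min, max) of the attribute over the whole dataset.
    """
    total_records = sum(len(group) for group in groups)
    weighted = sum(
        len(group) * range_loss(min(group), max(group), *domain)
        for group in groups
    )
    return weighted / total_records


# Ages from the sample dataset, split into two illustrative classes.
ages = [[45, 47, 52], [53, 62, 64, 67]]
loss = average_numeric_loss(ages, domain=(45, 67))
print(f"Information Loss: {loss:.4f}")  # Information Loss: 0.5000
```

A value near 0 means the ranges stay narrow relative to the domain; a value near 1 means most records were generalized to nearly the whole domain.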
Adult Dataset Example
import pandas as pd
from pyikaild.ika import IKA
from pyikaild.ild import ILD

# Load adult dataset (available from UCI ML Repository)
adult_colnames = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]
# skipinitialspace=True strips the space after each comma, so missing
# values appear as '?' rather than ' ?'
adult_df = pd.read_csv('adult.data', header=None, names=adult_colnames,
                       na_values='?', skipinitialspace=True)
# Define QIs and SA
qi_adult = ['age', 'workclass', 'education', 'race', 'sex']
sa_adult = 'occupation'
num_qi_adult = ['age']
cat_qi_adult = ['workclass', 'education', 'race', 'sex']
# Apply IKA (k=10)
ika_adult = IKA(k=10,
                qi_attributes=qi_adult,
                sa_attribute=sa_adult,
                numerical_qi=num_qi_adult,
                categorical_qi=cat_qi_adult,
                max_split_level=8)
anonymized_adult_df = ika_adult.fit_transform(adult_df)
# Apply ILD (l=5)
ild_adult = ILD(l=5, qi_attributes=qi_adult, sa_attribute=sa_adult)
diverse_adult_df = ild_adult.transform(anonymized_adult_df)
API Reference
IKA (Improved k-Anonymization)
class IKA:
    def __init__(self, k, qi_attributes, sa_attribute, numerical_qi=None,
                 categorical_qi=None, max_split_level=10):
        """
        Parameters:
        -----------
        k : int
            The minimum size of an equivalence class (>= 2)
        qi_attributes : List[str]
            List of column names to be treated as Quasi-Identifiers
        sa_attribute : str
            Column name of the Sensitive Attribute
        numerical_qi : List[str], optional
            List of QI attributes that are numerical
        categorical_qi : List[str], optional
            List of QI attributes that are categorical
        max_split_level : int, default=10
            Maximum recursion depth for splitting (controls granularity)
        """

    def fit(self, df):
        """Fit the model to the DataFrame, partitioning it for k-anonymity"""

    def transform(self, df):
        """Transform the DataFrame to achieve k-anonymity"""

    def fit_transform(self, df):
        """Fit and transform in one step"""

    def get_information_loss(self):
        """Calculate information loss due to anonymization"""
ILD (Improved l-Diversity)
class ILD:
    def __init__(self, l, qi_attributes, sa_attribute):
        """
        Parameters:
        -----------
        l : int
            The minimum number of distinct sensitive values required per group (>= 2)
        qi_attributes : List[str]
            List of column names treated as Quasi-Identifiers (should match those used in IKA)
        sa_attribute : str
            Column name of the Sensitive Attribute
        """

    def transform(self, df):
        """Apply l-diversity enforcement to a k-anonymized DataFrame"""

    def fit_transform(self, df):
        """Transform the data to enforce l-diversity (fit is not needed)"""
Algorithm Details
IKA Algorithm
- Recursively partition the dataset based on QI attributes
- Ensure each partition has at least k records
- Generalize QI values within each partition
  - Numerical: Represented as ranges [min-max]
  - Categorical: Set to common value or '*' if values differ
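The steps above can be sketched in plain Python. This is a simplified illustration and not the package's implementation: it assumes a median split on the numerical QI with the widest range (the README does not say how pyikaild chooses splits), and the `Sex` column is a hypothetical addition to the sample data so that categorical generalization has something to act on.

```python
def partition(records, numerical_qi, k, level=0, max_split_level=10):
    """Recursively halve the records while both halves keep at least k rows."""
    if level < max_split_level and len(records) >= 2 * k:
        def spread(qi):
            values = [r[qi] for r in records]
            return max(values) - min(values)

        # Assumed heuristic: split on the QI with the widest value range.
        widest = max(numerical_qi, key=spread)
        ordered = sorted(records, key=lambda r: r[widest])
        mid = len(ordered) // 2  # len >= 2k guarantees both halves have >= k rows
        return (partition(ordered[:mid], numerical_qi, k, level + 1, max_split_level)
                + partition(ordered[mid:], numerical_qi, k, level + 1, max_split_level))
    return [records]


def generalize(group, numerical_qi, categorical_qi):
    """Replace QI values in one partition with their generalized form."""
    summary = {}
    for qi in numerical_qi:
        values = [r[qi] for r in group]
        summary[qi] = f"[{min(values)}-{max(values)}]"  # range [min-max]
    for qi in categorical_qi:
        values = {r[qi] for r in group}
        summary[qi] = values.pop() if len(values) == 1 else "*"
    return [{**r, **summary} for r in group]


ages = [45, 47, 52, 53, 64, 67, 62]
zipcodes = [400052, 400058, 400032, 400045, 100032, 100053, 200045]
sexes = ["M", "F", "M", "F", "F", "F", "F"]  # hypothetical column
records = [{"Age": a, "Zipcode": z, "Sex": s}
           for a, z, s in zip(ages, zipcodes, sexes)]

groups = partition(records, ["Age", "Zipcode"], k=3)
anonymized = [row for g in groups for row in generalize(g, ["Age", "Zipcode"], ["Sex"])]
for row in anonymized:
    print(row)
```

With k=3 the seven records split once on `Zipcode` into classes of three and four rows. Neither class can be split again without dropping below k, so recursion stops before `max_split_level` is reached.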
ILD Algorithm
- Identify equivalence classes that violate l-diversity
- Borrow sensitive attribute values from other diverse classes
- Modify records in violating classes to ensure l-diversity
- Verify the result satisfies l-diversity constraints
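A minimal sketch of these four steps in plain Python follows. It is an assumption about how the borrowing could work, not the package's code: the function name `enforce_l_diversity` is hypothetical, and the choice to overwrite only records whose sensitive value is duplicated within their class is one possible policy.

```python
from collections import defaultdict


def enforce_l_diversity(records, qi_attributes, sa_attribute, l):
    """Return a copy in which every equivalence class has >= l distinct SA values."""
    records = [dict(r) for r in records]  # work on copies, leave the input intact
    classes = defaultdict(list)
    for record in records:
        classes[tuple(record[qi] for qi in qi_attributes)].append(record)

    def distinct(group):
        return {r[sa_attribute] for r in group}

    # Values are borrowed only from classes that are already l-diverse.
    donor_values = []
    for group in classes.values():
        if len(distinct(group)) >= l:
            for value in sorted(distinct(group)):
                if value not in donor_values:
                    donor_values.append(value)

    for group in classes.values():
        present = distinct(group)
        if len(present) >= l:
            continue  # class already satisfies l-diversity
        candidates = [v for v in donor_values if v not in present]
        seen = set()
        for record in group:
            if len(present) >= l or not candidates:
                break
            if record[sa_attribute] in seen:
                # Duplicate within the class: overwrite with a borrowed value.
                record[sa_attribute] = candidates.pop(0)
                present.add(record[sa_attribute])
            else:
                seen.add(record[sa_attribute])
        # Final verification for this class.
        if len(present) < l:
            raise ValueError("not enough distinct values available to borrow")
    return records


# Hypothetical generalized table: the second class violates 2-diversity.
table = [
    {"Age": "[45-52]", "Disease": "Flu"},
    {"Age": "[45-52]", "Disease": "Pneumonia"},
    {"Age": "[45-52]", "Disease": "Flu"},
    {"Age": "[53-67]", "Disease": "Hepatitis"},
    {"Age": "[53-67]", "Disease": "Hepatitis"},
    {"Age": "[53-67]", "Disease": "Hepatitis"},
]
result = enforce_l_diversity(table, ["Age"], "Disease", l=2)
print([r["Disease"] for r in result[3:]])  # ['Hepatitis', 'Flu', 'Hepatitis']
```

Because this approach overwrites sensitive values, the output no longer matches the original records exactly. Any analysis of the sensitive attribute should account for that trade-off.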
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License