Improved k-Anonymity and l-Diversity library

pyikaild

A Python package for data privacy and anonymization implementing Improved k-Anonymity (IKA) and Improved l-Diversity (ILD) algorithms.

Overview

pyikaild provides implementations of two key privacy-preserving techniques for sensitive data:

  1. Improved k-Anonymity (IKA): Generalizes quasi-identifier attributes so that each record is indistinguishable from at least k-1 other records.

  2. Improved l-Diversity (ILD): Ensures each equivalence class (group of records with identical quasi-identifiers) contains at least l distinct values for sensitive attributes.

These techniques reduce the risk of re-identification while preserving data utility for analysis. They can support compliance with privacy regulations but do not guarantee it on their own.

Installation

pip install pyikaild

Key Concepts

  • Quasi-Identifiers (QI): Attributes that, when combined, could potentially identify an individual (e.g., age, zip code, gender)
  • Sensitive Attribute (SA): Data that should be protected (e.g., disease, salary)
  • k-Anonymity: Each record is indistinguishable from at least k-1 other records
  • l-Diversity: Each group of records with identical QIs has at least l different values for sensitive attributes
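The two definitions above can be checked mechanically. The following is a minimal sketch in plain Python, independent of pyikaild; the helper names (`equivalence_classes`, `is_k_anonymous`, `is_l_diverse`) and the sample rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict

def equivalence_classes(records, qi):
    """Group records (dicts) by their tuple of quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in qi)].append(rec)
    return dict(groups)

def is_k_anonymous(records, qi, k):
    """True if every equivalence class holds at least k records."""
    return all(len(g) >= k for g in equivalence_classes(records, qi).values())

def is_l_diverse(records, qi, sa, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    return all(len({r[sa] for r in g}) >= l
               for g in equivalence_classes(records, qi).values())

# Hypothetical, already-generalized table with two equivalence classes
rows = [
    {"Age": "45-53", "Zipcode": "4000**", "Disease": "Flu"},
    {"Age": "45-53", "Zipcode": "4000**", "Disease": "Pneumonia"},
    {"Age": "45-53", "Zipcode": "4000**", "Disease": "Flu"},
    {"Age": "62-67", "Zipcode": "*", "Disease": "Hepatitis"},
    {"Age": "62-67", "Zipcode": "*", "Disease": "Hepatitis"},
    {"Age": "62-67", "Zipcode": "*", "Disease": "Hepatitis"},
]
qi = ["Age", "Zipcode"]
print(is_k_anonymous(rows, qi, k=3))           # True
print(is_l_diverse(rows, qi, "Disease", l=2))  # False
```

The second class is 3-anonymous yet every record in it carries the same disease, which is the kind of leak l-diversity is meant to close.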

Usage

Basic Example

from pyikaild.ika import IKA
from pyikaild.ild import ILD
import pandas as pd

# Sample dataset
data = {
    'Age': [45, 47, 52, 53, 64, 67, 62],
    'Zipcode': [400052, 400058, 400032, 400045, 100032, 100053, 200045],
    'Disease': ['Flu', 'Pneumonia', 'Flu', 'Stomach ulcers', 'Stomach infection', 'Hepatitis', 'Stomach cancer']
}
df = pd.DataFrame(data)

# Define quasi-identifiers and sensitive attribute
qi_attributes = ['Age', 'Zipcode']
sa_attribute = 'Disease'
numerical_qi = ['Age', 'Zipcode']  # Specify which QIs are numerical

# Apply k-anonymity (k=3)
ika = IKA(k=3, 
          qi_attributes=qi_attributes, 
          sa_attribute=sa_attribute, 
          numerical_qi=numerical_qi)
anonymized_df = ika.fit_transform(df)

# Calculate information loss
info_loss = ika.get_information_loss()
print(f"Information Loss: {info_loss:.4f}")

# Apply l-diversity (l=2) on the k-anonymized data
ild = ILD(l=2, qi_attributes=qi_attributes, sa_attribute=sa_attribute)
diverse_df = ild.transform(anonymized_df)

# Verify l-diversity
for name, group in diverse_df.groupby(qi_attributes):
    print(f"Group {name}: SA count = {group[sa_attribute].nunique()}")

Adult Dataset Example

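import pandas as pd
from pyikaild.ika import IKA
from pyikaild.ild import ILD

# Load adult dataset (available from UCI ML Repository)
adult_colnames = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]
# skipinitialspace=True strips the blank after each comma, so missing
# values arrive as '?' (not ' ?') and must be matched in that form
adult_df = pd.read_csv('adult.data', header=None, names=adult_colnames,
                       na_values='?', skipinitialspace=True)
# Assumption: rows with missing values are dropped before anonymization
adult_df = adult_df.dropna().reset_index(drop=True)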

# Define QIs and SA
qi_adult = ['age', 'workclass', 'education', 'race', 'sex']
sa_adult = 'occupation'
num_qi_adult = ['age']
cat_qi_adult = ['workclass', 'education', 'race', 'sex']

# Apply IKA (k=10)
ika_adult = IKA(k=10,
                qi_attributes=qi_adult,
                sa_attribute=sa_adult,
                numerical_qi=num_qi_adult,
                categorical_qi=cat_qi_adult,
                max_split_level=8)

anonymized_adult_df = ika_adult.fit_transform(adult_df)

# Apply ILD (l=5)
ild_adult = ILD(l=5, qi_attributes=qi_adult, sa_attribute=sa_adult)
diverse_adult_df = ild_adult.transform(anonymized_adult_df)

API Reference

IKA (Improved k-Anonymity)

class IKA:
    def __init__(self, k, qi_attributes, sa_attribute, numerical_qi=None, 
                 categorical_qi=None, max_split_level=10):
        """
        Parameters:
        -----------
        k : int
            The minimum size of an equivalence class (>= 2)
        qi_attributes : List[str]
            List of column names to be treated as Quasi-Identifiers
        sa_attribute : str
            Column name of the Sensitive Attribute
        numerical_qi : List[str], optional
            List of QI attributes that are numerical
        categorical_qi : List[str], optional
            List of QI attributes that are categorical
        max_split_level : int, default=10
            Maximum recursion depth for splitting (controls granularity)
        """
        
    def fit(self, df):
        """Fit the model to the DataFrame, partitioning it for k-anonymity"""
        
    def transform(self, df):
        """Transform the DataFrame to achieve k-anonymity"""
        
    def fit_transform(self, df):
        """Fit and transform in one step"""
        
    def get_information_loss(self):
        """Calculate information loss due to anonymization"""
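The exact metric behind get_information_loss is not spelled out here. As a rough illustration of what an information-loss figure can express, the sketch below computes one common choice for numerical QIs: the width of each partition's range divided by the width of the full domain, averaged over all records. The function name `range_information_loss` and the two-way split are assumptions made for this example and may not match what the library computes.

```python
def range_information_loss(original, partitions, numeric_qi):
    """Average normalized range width over all records and numeric QIs.

    0.0 means no generalization; 1.0 means every value was widened to the
    full domain. Illustrative only: not necessarily the metric used by
    IKA.get_information_loss().
    """
    total, count = 0.0, 0
    for attr in numeric_qi:
        values = [rec[attr] for rec in original]
        domain = max(values) - min(values)
        for part in partitions:
            part_vals = [rec[attr] for rec in part]
            width = max(part_vals) - min(part_vals)
            # Every record in the partition pays the partition's normalized width
            total += len(part) * (width / domain if domain else 0.0)
            count += len(part)
    return total / count if count else 0.0

ages = [45, 47, 52, 53, 64, 67, 62]
records = [{"Age": a} for a in ages]
partitions = [records[:4], records[4:]]  # hypothetical split into two classes
loss = range_information_loss(records, partitions, ["Age"])
print(f"Information Loss: {loss:.4f}")   # Information Loss: 0.3052
```

Under this definition a lower value means the generalized ranges are tighter, so more of the original detail survives.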

ILD (Improved l-Diversity)

class ILD:
    def __init__(self, l, qi_attributes, sa_attribute):
        """
        Parameters:
        -----------
        l : int
            The minimum number of distinct sensitive values required per group (>= 2)
        qi_attributes : List[str]
            List of column names treated as Quasi-Identifiers (should match those used in IKA)
        sa_attribute : str
            Column name of the Sensitive Attribute
        """
        
    def transform(self, df):
        """Apply l-diversity enforcement to a k-anonymized DataFrame"""
        
    def fit_transform(self, df):
        """Transform the data to enforce l-diversity (fit is not needed)"""

Algorithm Details

IKA Algorithm

  1. Recursively partition the dataset based on QI attributes
  2. Ensure each partition has at least k records
  3. Generalize QI values within each partition
    • Numerical: Represented as ranges [min-max]
    • Categorical: Kept as-is when all records in the partition share one value; otherwise replaced with '*'
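The steps above can be sketched for numerical QIs as follows. This is a simplified median-split partitioner written for illustration, not the library's implementation: the functions `partition` and `generalize`, the choice of splitting on the widest attribute, and the bracketed range format are all assumptions.

```python
def partition(records, qi, k, max_depth, depth=0):
    """Recursively median-split on the widest QI while both halves keep >= k records."""
    # Stop at the depth limit, or when a split would leave a half below k records
    if depth >= max_depth or len(records) < 2 * k:
        return [records]

    def spread(attr):
        vals = [r[attr] for r in records]
        return max(vals) - min(vals)

    attr = max(qi, key=spread)                    # widest attribute in this partition
    ordered = sorted(records, key=lambda r: r[attr])
    mid = len(ordered) // 2                       # both halves have >= k records here
    return (partition(ordered[:mid], qi, k, max_depth, depth + 1)
            + partition(ordered[mid:], qi, k, max_depth, depth + 1))

def generalize(part, qi):
    """Replace each numeric QI value with the partition's [min-max] range."""
    ranges = {}
    for attr in qi:
        vals = [r[attr] for r in part]
        ranges[attr] = f"[{min(vals)}-{max(vals)}]"
    return [{**r, **ranges} for r in part]

ages = [45, 47, 52, 53, 64, 67, 62]
zips = [400052, 400058, 400032, 400045, 100032, 100053, 200045]
rows = [{"Age": a, "Zipcode": z} for a, z in zip(ages, zips)]
qi = ["Age", "Zipcode"]

parts = partition(rows, qi, k=3, max_depth=10)
anonymized = [row for part in parts for row in generalize(part, qi)]
print([len(p) for p in parts])  # [3, 4]
print(anonymized[0])            # {'Age': '[62-67]', 'Zipcode': '[100032-200045]'}
```

A smaller depth limit stops splitting earlier, which yields larger partitions and wider ranges; this is the same trade-off that max_split_level is described as controlling.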

ILD Algorithm

  1. Identify equivalence classes that violate l-diversity
  2. Borrow sensitive attribute values from other diverse classes
  3. Modify records in violating classes to ensure l-diversity
  4. Verify the result satisfies l-diversity constraints
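A minimal sketch of these four steps in plain Python is shown below. It is not the library's code: the function `enforce_l_diversity` is hypothetical, and as a simplification it borrows values from the whole table rather than only from classes that are already diverse.

```python
from collections import Counter, defaultdict

def enforce_l_diversity(records, qi, sa, l):
    """Return a copy of records in which each QI group has >= l distinct SA values."""
    out = [dict(r) for r in records]              # never mutate the caller's data
    groups = defaultdict(list)
    for rec in out:
        groups[tuple(rec[a] for a in qi)].append(rec)
    # Values available for borrowing: every SA value in the table, first-seen order
    pool = list(dict.fromkeys(r[sa] for r in records))

    for group in groups.values():
        if len(group) < l:
            raise ValueError("a group smaller than l cannot be made l-diverse")
        counts = Counter(r[sa] for r in group)    # step 1: measure diversity
        donors = (v for v in pool if v not in counts)
        for rec in group:
            if len(counts) >= l:
                break
            if counts[rec[sa]] > 1:               # overwrite only duplicated values
                new_value = next(donors, None)    # step 2: borrow a missing value
                if new_value is None:
                    raise ValueError("not enough distinct sensitive values to borrow")
                counts[rec[sa]] -= 1
                rec[sa] = new_value               # step 3: modify the record
                counts[new_value] = 1

    # Step 4: verify the result
    assert all(len({r[sa] for r in g}) >= l for g in groups.values())
    return out

rows = [
    {"Age": "[45-53]", "Disease": "Flu"},
    {"Age": "[45-53]", "Disease": "Flu"},
    {"Age": "[45-53]", "Disease": "Flu"},
    {"Age": "[62-67]", "Disease": "Hepatitis"},
    {"Age": "[62-67]", "Disease": "Stomach cancer"},
    {"Age": "[62-67]", "Disease": "Flu"},
]
diverse = enforce_l_diversity(rows, ["Age"], "Disease", l=2)
print([r["Disease"] for r in diverse[:3]])  # ['Hepatitis', 'Flu', 'Flu']
```

Because this approach overwrites sensitive values, the output no longer reflects the true distribution of the sensitive attribute, which is worth keeping in mind when analyzing the result.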

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License
