Skip to main content

Improved K-Anonymity and l-Diversity library

Project description

pyikaild

A Python package for data privacy and anonymization implementing Improved k-Anonymity (IKA) and Improved l-Diversity (ILD) algorithms.

Overview

pyikaild provides implementations of two key privacy-preserving techniques for sensitive data:

  1. Improved k-Anonymity (IKA): Ensures each record is indistinguishable from at least k-1 other records based on quasi-identifier attributes through generalization techniques.

  2. Improved l-Diversity (ILD): Ensures each equivalence class (group of records with identical quasi-identifiers) contains at least l distinct values for sensitive attributes.

These techniques help protect privacy in datasets while maintaining data utility for analysis, in compliance with privacy regulations and best practices.

Installation

pip install pyikaild

Key Concepts

  • Quasi-Identifiers (QI): Attributes that, when combined, could potentially identify an individual (e.g., age, zip code, gender)
  • Sensitive Attribute (SA): Data that should be protected (e.g., disease, salary)
  • k-Anonymity: Each record is indistinguishable from at least k-1 other records
  • l-Diversity: Each group of records with identical QIs has at least l different values for sensitive attributes

Usage

Basic Example

from pyikaild.ika import IKA
from pyikaild.ild import ILD
import pandas as pd

# Sample dataset
data = {
    'Age': [45, 47, 52, 53, 64, 67, 62],
    'Zipcode': [400052, 400058, 400032, 400045, 100032, 100053, 200045],
    'Disease': ['Flu', 'Pneumonia', 'Flu', 'Stomach ulcers', 'Stomach infection', 'Hepatitis', 'Stomach cancer']
}
df = pd.DataFrame(data)

# Define quasi-identifiers and sensitive attribute
qi_attributes = ['Age', 'Zipcode']
sa_attribute = 'Disease'
numerical_qi = ['Age', 'Zipcode']  # Specify which QIs are numerical

# Apply k-anonymity (k=3)
ika = IKA(k=3, 
          qi_attributes=qi_attributes, 
          sa_attribute=sa_attribute, 
          numerical_qi=numerical_qi)
anonymized_df = ika.fit_transform(df)

# Calculate information loss
info_loss = ika.get_information_loss()
print(f"Information Loss: {info_loss:.4f}")

# Apply l-diversity (l=2) on the k-anonymized data
ild = ILD(l=2, qi_attributes=qi_attributes, sa_attribute=sa_attribute)
diverse_df = ild.transform(anonymized_df)

# Verify l-diversity
for name, group in diverse_df.groupby(qi_attributes):
    print(f"Group {name}: SA count = {group[sa_attribute].nunique()}")

Adult Dataset Example

# Load adult dataset (available from UCI ML Repository)
adult_colnames = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]
adult_df = pd.read_csv('adult.data', header=None, names=adult_colnames, 
                      na_values=' ?', skipinitialspace=True)

# Define QIs and SA
qi_adult = ['age', 'workclass', 'education', 'race', 'sex']
sa_adult = 'occupation'
num_qi_adult = ['age']
cat_qi_adult = ['workclass', 'education', 'race', 'sex']

# Apply IKA (k=10)
ika_adult = IKA(k=10,
                qi_attributes=qi_adult,
                sa_attribute=sa_adult,
                numerical_qi=num_qi_adult,
                categorical_qi=cat_qi_adult,
                max_split_level=8)

anonymized_adult_df = ika_adult.fit_transform(adult_df)

# Apply ILD (l=5)
ild_adult = ILD(l=5, qi_attributes=qi_adult, sa_attribute=sa_adult)
diverse_adult_df = ild_adult.transform(anonymized_adult_df)

API Reference

IKA (Improved k-Anonymization)

class IKA:
    def __init__(self, k, qi_attributes, sa_attribute, numerical_qi=None, 
                 categorical_qi=None, max_split_level=10):
        """
        Parameters:
        -----------
        k : int
            The minimum size of an equivalence class (>= 2)
        qi_attributes : List[str]
            List of column names to be treated as Quasi-Identifiers
        sa_attribute : str
            Column name of the Sensitive Attribute
        numerical_qi : List[str], optional
            List of QI attributes that are numerical
        categorical_qi : List[str], optional
            List of QI attributes that are categorical
        max_split_level : int, default=10
            Maximum recursion depth for splitting (controls granularity)
        """
        
    def fit(self, df):
        """Fit the model to the DataFrame, partitioning it for k-anonymity"""
        
    def transform(self, df):
        """Transform the DataFrame to achieve k-anonymity"""
        
    def fit_transform(self, df):
        """Fit and transform in one step"""
        
    def get_information_loss(self):
        """Calculate information loss due to anonymization"""

ILD (Improved l-Diversity)

class ILD:
    def __init__(self, l, qi_attributes, sa_attribute):
        """
        Parameters:
        -----------
        l : int
            The minimum number of distinct sensitive values required per group (>= 2)
        qi_attributes : List[str]
            List of column names treated as Quasi-Identifiers (should match those used in IKA)
        sa_attribute : str
            Column name of the Sensitive Attribute
        """
        
    def transform(self, df):
        """Apply l-diversity enforcement to a k-anonymized DataFrame"""
        
    def fit_transform(self, df):
        """Transform the data to enforce l-diversity (fit is not needed)"""

Algorithm Details

IKA Algorithm

  1. Recursively partition the dataset based on QI attributes
  2. Ensure each partition has at least k records
  3. Generalize QI values within each partition
    • Numerical: Represented as ranges [min-max]
    • Categorical: Set to common value or '*' if values differ

ILD Algorithm

  1. Identify equivalence classes that violate l-diversity
  2. Borrow sensitive attribute values from other diverse classes
  3. Modify records in violating classes to ensure l-diversity
  4. Verify the result satisfies l-diversity constraints

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyikaild-0.0.6.tar.gz (17.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyikaild-0.0.6-py3-none-any.whl (17.4 kB view details)

Uploaded Python 3

File details

Details for the file pyikaild-0.0.6.tar.gz.

File metadata

  • Download URL: pyikaild-0.0.6.tar.gz
  • Upload date:
  • Size: 17.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.12.3 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for pyikaild-0.0.6.tar.gz
Algorithm Hash digest
SHA256 6d4fc54f162baaa272fca3b9d07da446b5a13873c8296121dde27bbff9b98a8d
MD5 d509d2cf5f96472a72b310ae561f0e5d
BLAKE2b-256 f28ef18766597430bd1565bee1df40ce16f2eddca5e75afeb07a44cefc9ef8ff

See more details on using hashes here.

File details

Details for the file pyikaild-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: pyikaild-0.0.6-py3-none-any.whl
  • Upload date:
  • Size: 17.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.12.3 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for pyikaild-0.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 fb3d5a596b9f9c573ff5a30fa09b80cb808418ca7c48fb4887ba531ea961c350
MD5 c92d0679a61dcb80d6991387c381b96a
BLAKE2b-256 6c3d87062122dcdd3cf97c9eb10b0074271990a2d8759a3ac9ffa71de04a041a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page