Improved K-Anonymity and l-Diversity library

These details have not been verified by PyPI

Project description

pyikaild

A Python package for data privacy and anonymization implementing Improved k-Anonymity (IKA) and Improved l-Diversity (ILD) algorithms.

Overview

pyikaild provides implementations of two key privacy-preserving techniques for sensitive data:

Improved k-Anonymity (IKA): Generalizes quasi-identifier attributes so that each record is indistinguishable from at least k-1 other records.
Improved l-Diversity (ILD): Ensures each equivalence class (a group of records with identical quasi-identifiers) contains at least l distinct values of the sensitive attribute.

These techniques reduce the risk of re-identification in released datasets while preserving as much analytical utility as possible, which can support compliance with privacy regulations.

Installation

pip install pyikaild

Key Concepts

Quasi-Identifiers (QI): Attributes that, in combination, can identify an individual (e.g., age, zip code, gender).
Sensitive Attribute (SA): The attribute whose value must be protected from disclosure (e.g., disease, salary).
k-Anonymity: Each record is indistinguishable from at least k-1 other records with respect to its QI values.
l-Diversity: Each group of records with identical QI values contains at least l distinct values of the sensitive attribute.
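
The two definitions above can be checked directly with pandas. The following is a minimal sketch and not part of pyikaild: the helper names `is_k_anonymous` and `is_l_diverse` and the toy table are hypothetical and exist only to illustrate the definitions.

```python
import pandas as pd


def is_k_anonymous(df, qi_attributes, k):
    """Return True if every QI combination occurs in at least k records."""
    group_sizes = df.groupby(qi_attributes).size()
    return bool((group_sizes >= k).all())


def is_l_diverse(df, qi_attributes, sa_attribute, l):
    """Return True if every QI group holds at least l distinct SA values."""
    distinct_counts = df.groupby(qi_attributes)[sa_attribute].nunique()
    return bool((distinct_counts >= l).all())


# Toy table whose QIs are already generalized (values are illustrative only).
toy = pd.DataFrame({
    "Age": ["[45-52]"] * 3 + ["[53-67]"] * 3,
    "Zipcode": ["4000**"] * 3 + ["*"] * 3,
    "Disease": ["Flu", "Pneumonia", "Flu",
                "Hepatitis", "Hepatitis", "Hepatitis"],
})

print(is_k_anonymous(toy, ["Age", "Zipcode"], k=3))           # True
print(is_l_diverse(toy, ["Age", "Zipcode"], "Disease", l=2))  # False
```

The second group is 3-anonymous but holds a single disease, so the table fails 2-diversity even though it satisfies 3-anonymity. This is the gap that l-diversity is meant to close.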

Usage

Basic Example

from pyikaild.ika import IKA
from pyikaild.ild import ILD
import pandas as pd

# Sample dataset
data = {
    'Age': [45, 47, 52, 53, 64, 67, 62],
    'Zipcode': [400052, 400058, 400032, 400045, 100032, 100053, 200045],
    'Disease': ['Flu', 'Pneumonia', 'Flu', 'Stomach ulcers', 'Stomach infection', 'Hepatitis', 'Stomach cancer']
}
df = pd.DataFrame(data)

# Define quasi-identifiers and sensitive attribute
qi_attributes = ['Age', 'Zipcode']
sa_attribute = 'Disease'
numerical_qi = ['Age', 'Zipcode']  # Specify which QIs are numerical

# Apply k-anonymity (k=3)
ika = IKA(k=3, 
          qi_attributes=qi_attributes, 
          sa_attribute=sa_attribute, 
          numerical_qi=numerical_qi)
anonymized_df = ika.fit_transform(df)

# Calculate information loss
info_loss = ika.get_information_loss()
print(f"Information Loss: {info_loss:.4f}")

# Apply l-diversity (l=2) on the k-anonymized data
ild = ILD(l=2, qi_attributes=qi_attributes, sa_attribute=sa_attribute)
diverse_df = ild.transform(anonymized_df)

# Verify l-diversity
for name, group in diverse_df.groupby(qi_attributes):
    print(f"Group {name}: SA count = {group[sa_attribute].nunique()}")

Adult Dataset Example

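# Load the Adult dataset (available from the UCI ML Repository)
import pandas as pd
from pyikaild.ika import IKA
from pyikaild.ild import ILD

adult_colnames = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]
# skipinitialspace=True strips the space after each comma, so missing
# values arrive as '?' rather than ' ?'
adult_df = pd.read_csv('adult.data', header=None, names=adult_colnames,
                       na_values='?', skipinitialspace=True)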

# Define QIs and SA
qi_adult = ['age', 'workclass', 'education', 'race', 'sex']
sa_adult = 'occupation'
num_qi_adult = ['age']
cat_qi_adult = ['workclass', 'education', 'race', 'sex']

# Apply IKA (k=10)
ika_adult = IKA(k=10,
                qi_attributes=qi_adult,
                sa_attribute=sa_adult,
                numerical_qi=num_qi_adult,
                categorical_qi=cat_qi_adult,
                max_split_level=8)

anonymized_adult_df = ika_adult.fit_transform(adult_df)

# Apply ILD (l=5)
ild_adult = ILD(l=5, qi_attributes=qi_adult, sa_attribute=sa_adult)
diverse_adult_df = ild_adult.transform(anonymized_adult_df)

API Reference

IKA (Improved k-Anonymization)

class IKA:
    def __init__(self, k, qi_attributes, sa_attribute, numerical_qi=None, 
                 categorical_qi=None, max_split_level=10):
        """
        Parameters:
        -----------
        k : int
            The minimum size of an equivalence class (>= 2)
        qi_attributes : List[str]
            List of column names to be treated as Quasi-Identifiers
        sa_attribute : str
            Column name of the Sensitive Attribute
        numerical_qi : List[str], optional
            List of QI attributes that are numerical
        categorical_qi : List[str], optional
            List of QI attributes that are categorical
        max_split_level : int, default=10
            Maximum recursion depth for splitting (controls granularity)
        """
        
    def fit(self, df):
        """Fit the model to the DataFrame, partitioning it for k-anonymity"""
        
    def transform(self, df):
        """Transform the DataFrame to achieve k-anonymity"""
        
    def fit_transform(self, df):
        """Fit and transform in one step"""
        
    def get_information_loss(self):
        """Calculate information loss due to anonymization"""

ILD (Improved l-Diversity)

class ILD:
    def __init__(self, l, qi_attributes, sa_attribute):
        """
        Parameters:
        -----------
        l : int
            The minimum number of distinct sensitive values required per group (>= 2)
        qi_attributes : List[str]
            List of column names treated as Quasi-Identifiers (should match those used in IKA)
        sa_attribute : str
            Column name of the Sensitive Attribute
        """
        
    def transform(self, df):
        """Apply l-diversity enforcement to a k-anonymized DataFrame"""
        
    def fit_transform(self, df):
        """Transform the data to enforce l-diversity (fit is not needed)"""

Algorithm Details

IKA Algorithm

Recursively partition the dataset based on QI attributes
Ensure each partition has at least k records
Generalize QI values within each partition:
  - Numerical: represented as ranges [min-max]
  - Categorical: set to the common value, or '*' if values differ
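
The steps above can be sketched in plain pandas. This is an illustration under stated assumptions, not pyikaild's actual implementation: the functions `partition` and `generalize_partition` are hypothetical, and splitting at the median of the first numerical QI that yields two halves of at least k rows is an assumed strategy chosen for brevity.

```python
import pandas as pd


def partition(df, numerical_qi, k, max_split_level, level=0):
    """Recursively split df at a QI median while both halves keep >= k rows."""
    if level < max_split_level:
        for col in numerical_qi:
            median = df[col].median()
            left, right = df[df[col] <= median], df[df[col] > median]
            if len(left) >= k and len(right) >= k:
                return (partition(left, numerical_qi, k, max_split_level, level + 1)
                        + partition(right, numerical_qi, k, max_split_level, level + 1))
    return [df]  # no allowed split left: this frame is one equivalence class


def generalize_partition(part, numerical_qi, categorical_qi):
    """Replace the QI values of one partition with a shared generalized value."""
    out = part.copy()
    for col in numerical_qi:
        # Numerical QIs become a [min-max] range string.
        out[col] = f"[{part[col].min()}-{part[col].max()}]"
    for col in categorical_qi:
        # Categorical QIs keep the common value, or become '*' if values differ.
        values = part[col].unique()
        out[col] = values[0] if len(values) == 1 else "*"
    return out


df = pd.DataFrame({
    "Age": [45, 47, 52, 53, 64, 67, 62],
    "Zipcode": [400052, 400058, 400032, 400045, 100032, 100053, 200045],
})
qi = ["Age", "Zipcode"]
parts = partition(df, qi, k=3, max_split_level=10)
anonymized = pd.concat(
    [generalize_partition(p, qi, []) for p in parts]
).sort_index()
print(anonymized)
```

Because a split is accepted only when both halves keep at least k rows, every returned partition has at least k records, and generalizing each partition to a single value per QI makes its records indistinguishable on the QIs.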

ILD Algorithm

Identify equivalence classes that violate l-diversity
Borrow sensitive attribute values from other diverse classes
Modify records in violating classes to ensure l-diversity
Verify the result satisfies l-diversity constraints
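
A rough sketch of these steps in plain pandas follows. It is not pyikaild's implementation: the function `enforce_l_diversity` is hypothetical, and the choices of which rows to overwrite (those holding a duplicated sensitive value) and which values to borrow (the first ones found elsewhere in the table) are assumptions made for illustration. The sketch can only reach l distinct values in a group that has at least l records.

```python
import pandas as pd


def enforce_l_diversity(df, qi_attributes, sa_attribute, l):
    """Overwrite SA values in under-diverse groups with values borrowed from elsewhere."""
    out = df.copy()
    all_values = list(df[sa_attribute].unique())
    for _, group in df.groupby(qi_attributes):
        present = list(group[sa_attribute].unique())
        missing = l - len(present)
        if missing <= 0:
            continue  # this equivalence class is already l-diverse
        # Values that occur in the table but not in this group.
        donors = [v for v in all_values if v not in present]
        # Rows repeating an SA value can change without losing a distinct value.
        spare_rows = group[group[sa_attribute].duplicated()].index
        for idx, value in zip(spare_rows, donors[:missing]):
            out.loc[idx, sa_attribute] = value
    return out


table = pd.DataFrame({
    "Age": ["[45-52]"] * 3 + ["[53-67]"] * 3,
    "Disease": ["Flu", "Pneumonia", "Flu",
                "Hepatitis", "Hepatitis", "Hepatitis"],
})
diverse = enforce_l_diversity(table, ["Age"], "Disease", l=2)

# Final step: verify that every group now holds at least l distinct values.
print(diverse.groupby("Age")["Disease"].nunique())
```

Note that this kind of enforcement changes sensitive values in the output, so the result trades some accuracy of the sensitive attribute for diversity.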

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License

Project details


Release history

0.0.6 (Apr 5, 2025)
0.0.5 (Apr 1, 2025)
0.0.4 (Apr 1, 2025)
0.0.3 (Apr 1, 2025), this version
0.0.2 (Apr 1, 2025)

Download files


Source Distribution

pyikaild-0.0.3.tar.gz (12.6 kB)

Uploaded Apr 1, 2025 Source

Built Distribution


pyikaild-0.0.3-py3-none-any.whl (12.8 kB)

Uploaded Apr 1, 2025 Python 3

File details

Details for the file pyikaild-0.0.3.tar.gz.

File metadata

Download URL: pyikaild-0.0.3.tar.gz
Upload date: Apr 1, 2025
Size: 12.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.1 CPython/3.12.3 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for pyikaild-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`0cde19a0cbb82303cf438f9226fc25149fd4fb7abcdbf40603dcc76f25c4845d`
MD5	`dfad78543b98d92e1c25614f0c6757b0`
BLAKE2b-256	`327c7d3dbc0f296e204b20ebb7536f01941749de97c32a84cf3d98bd561b0b67`


File details

Details for the file pyikaild-0.0.3-py3-none-any.whl.

File metadata

Download URL: pyikaild-0.0.3-py3-none-any.whl
Upload date: Apr 1, 2025
Size: 12.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.1 CPython/3.12.3 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for pyikaild-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a0f8215cc47bd00ebad0e33bdb63ce3921ead063e17cc9401166b8d3da18a745`
MD5	`813077f8fa31384faf548cb99c14e7ac`
BLAKE2b-256	`0212b337008e2ef7b60f9c22057829b25ddab168663c0a1652fb461f51606a0d`


pyikaild 0.0.3
