
CTAB_XTRA_DP


A privacy-preserving synthetic tabular data generator based on GANs.

Installation

pip install --upgrade ctab-xtra-dp

Disclaimer

This library is currently under development and looks like a mess. Some credits may be missing; these will be added shortly.

Overview

CTAB_XTRA_DP is a generative model for creating high-quality synthetic tabular data with differential privacy guarantees. It extends the CTAB-GAN+ architecture to generate synthetic datasets that preserve the statistical properties of the original data while providing formal privacy protection.

Features

  • Generate synthetic tabular data with similar statistical properties to the original data
  • Automatic handling of various data types (categorical, numerical, mixed, log-transformed)
  • Handles Missing Not at Random (MNAR) null values
  • Built-in differential privacy to protect sensitive information
  • Conditional generation based on specific column values

Quick Start

A more detailed example can be found in examples/car.ipynb.

from ctab_xtra_dp import CTAB_XTRA_DP, load_demo
import pandas as pd

# Load your data
df = load_demo("car").drop(columns=['Year','Model'])

# Initialize the model
synthesizer = CTAB_XTRA_DP(
    df=df,
    categorical_columns=["Brand","Fuel_Type","Transmission"],
)

# Train the model
synthesizer.fit(epochs=50)

# Generate synthetic data
synthetic_data = synthesizer.generate_samples(n=df.shape[0])



# Generate samples with specific conditions
synthetic_data_electric = synthesizer.generate_samples(
    n=500, 
    conditioning_column='Fuel_Type', 
    conditioning_value='Electric'
)

API Reference

CTAB_XTRA_DP

CTAB_XTRA_DP(
    df,
    categorical_columns=[], 
    log_columns=[],
    mixed_columns={},
    gaussian_columns=[],
    integer_columns=[],
    problem_type=("Classification", 'target_column'),
    dp_constraints={
        "epsilon_budget": 10,
        "delta": None,
        "sigma": None,
        "clip_coeff": 1
    }
)

Parameters

  • df : pandas.DataFrame

    • The input dataframe to train on
  • categorical_columns : list

    • List of column names that should be treated as categorical
  • log_columns : list

    • List of column names that should be log-transformed before modeling
  • mixed_columns : dict

    • Dictionary mapping column names to their unique modal values
    • Used for columns with mixed continuous-discrete distributions
    • Example: {'capital-loss': [0]} indicates that 'capital-loss' has a special value at 0
    • Specifying {'capital-loss': [np.nan]} tells the model that the column contains MNAR values
  • gaussian_columns : list

    • List of column names that should be modeled with a Gaussian distribution
  • integer_columns : list

    • List of column names that should be treated as integers
    • This overrides the original datatype presented to the model
    • If the column type is already integer, this override is not necessary
  • problem_type : tuple

    • Set the target column used for the auxiliary classifier during training
    • A tuple of (problem_type, target_column)
    • problem_type can be "Classification" or "Regression"
    • If set to None, no auxiliary classifier is used. (Generation works fine without it.)
  • dp_constraints : dict

    • Differential privacy parameters:
      • epsilon_budget: Privacy budget for the entire training process. The sigma noise is computed from the given epsilon to ensure the privacy guarantee.
      • delta: Probabilistic relaxation parameter; should be set to a number much smaller than 1/n (default: 0.1/n)
      • sigma: Standard deviation of the Gaussian noise added at each iteration. If this is set, it overrides any epsilon value.
      • clip_coeff: Coefficient for gradient clipping. A common practice is to leave this at 1. (default: 1)
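The interplay between these parameters can be sketched as follows. This is an illustrative helper, not part of the ctab_xtra_dp API; it only encodes the documented defaults (delta falling back to 0.1/n, and an explicit sigma taking precedence over the epsilon budget).

```python
def resolve_dp_constraints(n_rows, epsilon_budget=10, delta=None, sigma=None):
    """Hypothetical illustration of how dp_constraints defaults resolve."""
    if delta is None:
        # Documented default: delta much smaller than 1/n, namely 0.1/n
        delta = 0.1 / n_rows
    if sigma is not None:
        # An explicit sigma overrides any epsilon budget
        return {"sigma": sigma, "delta": delta}
    return {"epsilon_budget": epsilon_budget, "delta": delta}

resolve_dp_constraints(n_rows=10_000)
```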

Methods

fit

fit(epochs=100)

Train the model on the input dataframe provided in the constructor.

Parameters:

  • epochs : int
    • Number of training epochs (default: 100)

Returns:

  • None

generate_samples

generate_samples(n=100, conditioning_column=None, conditioning_value=None)

Generate synthetic samples, with optional conditional generation.

Parameters:

  • n : int
    • Number of synthetic samples to generate (default: 100)
  • conditioning_column : str, optional
    • Column name to condition on
  • conditioning_value : any, optional
    • Value of the conditioning column

Returns:

  • pandas.DataFrame
    • Generated synthetic data

Data Type Handling

CTAB_XTRA_DP processes each declared data type as follows (automatic type detection is not yet implemented, so column types must be declared explicitly):

  • Categorical data: One-hot encoded
  • Mixed data: Modeled using a mixture of discrete modes and continuous distributions
  • Log-transformed data: Log-transformed before modeling, exponentiated during generation
  • Integer data: Values are rounded to integers during generation
  • Gaussian data: Modeled directly with a Gaussian distribution
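The per-type transformations above can be sketched in plain pandas/NumPy. This is a simplified illustration of the pre- and post-processing steps, not the library's actual code, and the example DataFrame is hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fuel": ["Petrol", "Electric", "Petrol"],  # categorical column
    "price": [1000.0, 50000.0, 2500.0],        # log-transformed column
    "doors": [3.7, 4.2, 5.0],                  # raw model output for an integer column
})

# Categorical data: one-hot encoded before training
one_hot = pd.get_dummies(df["fuel"])

# Log-transformed data: log before modeling, exponentiate when generating
log_price = np.log(df["price"])
recovered = np.exp(log_price)  # inverse transform at generation time

# Integer data: generated values are rounded back to integers
doors_int = df["doors"].round().astype(int)
```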

Differential Privacy

The model implements differential privacy using:

  • Gradient clipping to bound the sensitivity of the training process
  • Gaussian noise addition to gradients according to the specified privacy parameters
  • Privacy accounting to track epsilon expenditure
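The first two mechanisms can be illustrated with a minimal DP-SGD-style gradient step. This is a sketch of the general technique (per-sample clipping followed by Gaussian noise), not the library's internal implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_gradient(per_sample_grads, clip_coeff=1.0, sigma=1.0):
    """Clip each per-sample gradient to norm clip_coeff, average, add Gaussian noise."""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        # Rescale so no single sample contributes more than clip_coeff (bounds sensitivity)
        clipped.append(g * min(1.0, clip_coeff / norm))
    avg = np.mean(clipped, axis=0)
    # Gaussian noise scaled to the clipping bound and batch size
    noise = rng.normal(0.0, sigma * clip_coeff / len(per_sample_grads), size=avg.shape)
    return avg + noise

grads = [np.array([3.0, 4.0]), np.array([0.1, 0.2])]  # norms 5.0 and ~0.22
noisy = dp_gradient(grads, clip_coeff=1.0, sigma=1.0)
```

The privacy accountant (not shown) would track the cumulative epsilon spent across such noisy steps.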

Examples

Basic Usage

import pandas as pd
from ctab_xtra_dp import CTAB_XTRA_DP, load_demo

# Load data
df = load_demo()

# Initialize model
synthesizer = CTAB_XTRA_DP(
    df=df,
    categorical_columns=['workclass', 'education', 'marital-status', 'occupation', 
                         'relationship', 'race', 'gender', 'native-country'],
    mixed_columns={'capital-loss': [0], 'capital-gain': [0]},
    integer_columns=['age', 'fnlwgt', 'hours-per-week']
)

# Train model
synthesizer.fit(epochs=150)

# Generate synthetic data
synthetic_data = synthesizer.generate_samples(n=1000)

# Save synthetic data
synthetic_data.to_csv("synthetic_adult.csv", index=False)

Generating Conditioned Samples

# Generate samples with specific education level
bachelors_samples = synthesizer.generate_samples(
    n=500,
    conditioning_column='education',
    conditioning_value='Bachelors'
)

# Generate samples with specific occupation
managers_samples = synthesizer.generate_samples(
    n=500,
    conditioning_column='occupation',
    conditioning_value='Exec-managerial'
)

With Auxiliary Classifier

import pandas as pd
from ctab_xtra_dp import CTAB_XTRA_DP, load_demo

# Load data
df = load_demo()

# Initialize model
synthesizer = CTAB_XTRA_DP(
    df=df,
    categorical_columns=['workclass', 'education', 'marital-status', 'occupation', 
                         'relationship', 'race', 'gender', 'native-country'],
    mixed_columns={'capital-loss': [0], 'capital-gain': [0]},
    integer_columns=['age', 'fnlwgt', 'hours-per-week'],
    problem_type=("Classification", 'income')
)

# Train model
synthesizer.fit(epochs=150)

# Generate synthetic data
synthetic_data = synthesizer.generate_samples(n=1000)

# Save synthetic data
synthetic_data.to_csv("synthetic_adult.csv", index=False)

With Differential Privacy

# Initialize with stronger privacy guarantees
private_synthesizer = CTAB_XTRA_DP(
    df=df,
    categorical_columns=['workclass', 'education', 'marital-status', 'occupation', 
                         'relationship', 'race', 'gender', 'native-country'],
    integer_columns=['age', 'hours-per-week'],
    dp_constraints={
        "epsilon_budget": 1.0,  # Stricter privacy budget
        "clip_coeff": 1.0
    }
)

private_synthesizer.fit(epochs=100)
private_samples = private_synthesizer.generate_samples(n=1000)

Not specifying delta lets the model compute a reasonable delta value.

With MNAR Values

A handful of records in 'capital-loss' and 'capital-gain' have missing financial data. Removing the null values would lose valuable information, while interpreting them as 0 would fail to distinguish people with little financial activity from people whose values are null for some data-access reason.

import numpy as np

# Initialize with MNAR values declared as modes in mixed_columns
private_synthesizer = CTAB_XTRA_DP(
    df=df,
    categorical_columns=['workclass', 'education', 'marital-status', 'occupation', 
                         'relationship', 'race', 'gender', 'native-country'],
    integer_columns=['age', 'hours-per-week'],
    mixed_columns={'capital-loss': [0, np.nan], 'capital-gain': [0, np.nan]},
    dp_constraints={
        "epsilon_budget": 1.0,  # Stricter privacy budget
        "clip_coeff": 1.0
    }
)

private_synthesizer.fit(epochs=100)
private_samples = private_synthesizer.generate_samples(n=1000)


Evaluation Framework

CTAB_XTRA_DP includes a comprehensive evaluation framework to assess both the utility and privacy of the generated synthetic data.

Utility Evaluation

from ctab_xtra_dp.evaluation import get_utility_metrics, stat_sim

# Generate synthetic data
synthetic_data = synthesizer.generate_samples(n=1000)

# Evaluate supervised learning performance difference
# Lower difference values indicate better utility preservation
utility_diff = get_utility_metrics(
    data_real=df,
    data_synthetic=synthetic_data,
    scaler="MinMax",  # or "Standard"
    type={"Classification": ["lr", "dt", "rf", "mlp"]},  # for classification tasks
    # type={"Regression": ["l_reg", "ridge", "lasso", "B_ridge"]},  # for regression tasks
    test_ratio=0.2
)

# Evaluate statistical similarity
# Lower values indicate better statistical preservation
cat_columns = ['workclass', 'education', 'marital-status', 'occupation']
stat_metrics = stat_sim(df, synthetic_data, cat_cols=cat_columns)

# stat_metrics[0]: Average Wasserstein distance for numerical columns
# stat_metrics[1]: Average Jensen-Shannon divergence for categorical columns
# stat_metrics[2]: Correlation matrix distance

Privacy Evaluation

from ctab_xtra_dp.evaluation import privacy_metrics

# Assess privacy protection
privacy_results = privacy_metrics(
    real=df, 
    fake=synthetic_data, 
    data_percent=15  # Percentage of data to sample for efficiency
)

# Key metrics in privacy_results:
# - min_dist_rf_5th: Minimum distance from real to fake records (5th percentile)
# - min_dist_rr_5th: Minimum distance within real records (5th percentile)
# - min_dist_ff_5th: Minimum distance within fake records (5th percentile)
# - privacy_risk_score: Overall privacy protection score (higher is better)

Interpreting Evaluation Results

Utility Metrics

  • Machine Learning Performance: Lower difference values indicate synthetic data that better preserves predictive relationships
  • Statistical Similarity: Lower values indicate better preservation of distributions and correlations

Privacy Metrics

  • Minimum Distances: Higher real-to-fake distances relative to within-dataset distances suggest better privacy protection
  • Privacy Risk Score: Values greater than 1.0 indicate good privacy protection, with higher values being better
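These two rules of thumb can be combined into a single check. The helper below is hypothetical (not part of ctab_xtra_dp.evaluation); it simply encodes the interpretation stated above:

```python
def interpret_privacy(min_dist_rf_5th, min_dist_rr_5th, privacy_risk_score):
    """Summarize privacy metrics: protection looks good when the real-to-fake
    distance exceeds the within-real distance and the risk score is above 1.0."""
    distance_ratio = min_dist_rf_5th / min_dist_rr_5th  # > 1: fakes sit farther
    # from real records than real records sit from each other
    protected = privacy_risk_score > 1.0 and distance_ratio >= 1.0
    return {"distance_ratio": distance_ratio, "protected": protected}
```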

Complete Evaluation Example

import pandas as pd
from ctab_xtra_dp import CTAB_XTRA_DP, load_demo
from ctab_xtra_dp.evaluation import get_utility_metrics, stat_sim, privacy_metrics

# Load data
real_data = load_demo()

# Initialize and train model
synthesizer = CTAB_XTRA_DP(
    df=real_data,
    categorical_columns=['workclass', 'education', 'marital-status', 'occupation',
                        'relationship', 'race', 'gender', 'native-country'],
    mixed_columns={'capital-loss': [0], 'capital-gain': [0]},
    integer_columns=['age', 'hours-per-week'],
    dp_constraints={"epsilon_budget": 5.0}
)

synthesizer.fit(epochs=100)
synthetic_data = synthesizer.generate_samples(n=len(real_data))

# Comprehensive evaluation
# 1. Utility evaluation
cat_cols = ['workclass', 'education', 'marital-status', 'occupation',
            'relationship', 'race', 'gender', 'native-country', 'income']

ml_diff = get_utility_metrics(
    data_real=real_data,
    data_synthetic=synthetic_data,
    type={"Classification": ["lr", "dt", "rf"]},
    test_ratio=0.2
)

print("Machine Learning Utility Difference:")
print(f"Accuracy diff: {ml_diff[0][0]:.4f}")
print(f"AUC diff: {ml_diff[0][1]:.4f}")
print(f"F1-score diff: {ml_diff[0][2]:.4f}")

# 2. Statistical similarity
stat_results = stat_sim(real_data, synthetic_data, cat_cols=cat_cols)
print("\nStatistical Similarity:")
print(f"Numerical columns (Wasserstein): {stat_results[0]:.4f}")
print(f"Categorical columns (JSD): {stat_results[1]:.4f}")
print(f"Correlation distance: {stat_results[2]:.4f}")

# 3. Privacy evaluation
priv_results = privacy_metrics(real_data, synthetic_data)
print("\nPrivacy Evaluation:")
print(f"Privacy Risk Score: {priv_results['privacy_risk_score']:.4f}")
print(f"Real-to-Fake Min Distance (5th): {priv_results['min_dist_rf_5th']:.4f}")
print(f"Real-to-Real Min Distance (5th): {priv_results['min_dist_rr_5th']:.4f}")

Citation

A citation is not yet available. If you use this package in your research, please cite:

@article{ctab-xtra-dp,
  title={CTAB-XTRA-DP: Improved Tabular Data Synthesis with Differential Privacy},
  author={Your Name},
  journal={arXiv preprint},
  year={2025}
}

License

MIT
