
CTAB_XTRA_DP


A privacy-preserving synthetic tabular data generator based on GANs.

Installation

pip install --upgrade ctab-xtra-dp

Disclaimer

This library is currently under development and looks like a mess. Some credits may be missing; these will be added shortly.

Overview

CTAB_XTRA_DP is a generative model for creating high-quality synthetic tabular data with differential privacy guarantees. It extends the CTAB-GAN+ architecture to generate synthetic datasets that preserve the statistical properties of the original data while providing formal privacy protection.

Features

  • Generate synthetic tabular data with similar statistical properties to the original data
  • Automatic handling of various data types (categorical, numerical, mixed, log-transformed)
  • Handles Missing Not at Random (MNAR) null values
  • Built-in differential privacy to protect sensitive information
  • Conditional generation based on specific column values

Quick Start

A more detailed example can be found in examples/car.ipynb.

from ctab_xtra_dp import CTAB_XTRA_DP, load_demo
import pandas as pd

# Load your data
df = load_demo("car").drop(columns=['Year','Model'])

# Initialize the model
synthesizer = CTAB_XTRA_DP(
    df=df,
    categorical_columns=["Brand","Fuel_Type","Transmission"],
)

# Train the model
synthesizer.fit(epochs=50)

# Generate synthetic data
synthetic_data = synthesizer.generate_samples(n=df.shape[0])



# Generate samples with specific conditions
synthetic_data_electric = synthesizer.generate_samples(
    n=500, 
    conditioning_column='Fuel_Type', 
    conditioning_value='Electric'
)

API Reference

CTAB_XTRA_DP

CTAB_XTRA_DP(
    df,
    categorical_columns=[], 
    log_columns=[],
    mixed_columns={},
    gaussian_columns=[],
    integer_columns=[],
    problem_type=("Classification", 'target_column'),
    dp_constraints={
        "epsilon_budget": 10,
        "delta": None,
        "sigma": None,
        "clip_coeff": 1
    }
)

Parameters

  • df : pandas.DataFrame

    • The input dataframe to train on
  • categorical_columns : list

    • List of column names that should be treated as categorical
  • log_columns : list

    • List of column names that should be log-transformed before modeling
  • mixed_columns : dict

    • Dictionary mapping column names to their unique modal values
    • Used for columns with mixed continuous-discrete distributions
    • Example: {'capital-loss': [0]} indicates that 'capital-loss' has a special value at 0
    • Specifying {'capital-loss': [np.nan]} tells the model that the column contains MNAR values
  • gaussian_columns : list

    • List of column names that should be modeled with a Gaussian distribution
  • integer_columns : list

    • List of column names that should be treated as integers
    • This overrides the original datatype presented to the model
    • If the column type is already integer, this override is not necessary
  • problem_type : tuple

    • Set the target column used for the auxiliary classifier during training
    • A tuple of (problem_type, target_column)
    • problem_type can be "Classification" or "Regression"
    • If set to None, no auxiliary classifier is used. (Generation works fine without it.)
  • dp_constraints : dict

    • Differential privacy parameters:
      • epsilon_budget: Privacy budget for the entire training process. The sigma noise is computed from the given epsilon to ensure the privacy guarantee.
      • delta: Probabilistic relaxation parameter; should be set to a number much smaller than 1/n (default: 0.1/n)
      • sigma: Standard deviation of the Gaussian noise added at each iteration. If this is set, it overrides any epsilon value.
      • clip_coeff: Coefficient for gradient clipping. A common practice is to leave this at 1. (default: 1)
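The interplay between these parameters can be sketched as follows. This is an illustrative helper, not part of the ctab_xtra_dp API; it only encodes the documented defaults (delta falling back to 0.1/n, and an explicit sigma taking precedence over the epsilon budget).

```python
def resolve_dp_constraints(n_rows, epsilon_budget=10, delta=None, sigma=None):
    """Hypothetical illustration of how dp_constraints defaults resolve."""
    if delta is None:
        # Documented default: delta much smaller than 1/n, namely 0.1/n
        delta = 0.1 / n_rows
    if sigma is not None:
        # An explicit sigma overrides any epsilon budget
        return {"sigma": sigma, "delta": delta}
    return {"epsilon_budget": epsilon_budget, "delta": delta}

resolve_dp_constraints(n_rows=10_000)
```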

Methods

fit

fit(epochs=100)

Train the model on the input dataframe provided in the constructor.

Parameters:

  • epochs : int
    • Number of training epochs (default: 100)

Returns:

  • None

generate_samples

generate_samples(n=100, conditioning_column=None, conditioning_value=None)

Generate synthetic samples, with optional conditional generation.

Parameters:

  • n : int
    • Number of synthetic samples to generate (default: 100)
  • conditioning_column : str, optional
    • Column name to condition on
  • conditioning_value : any, optional
    • Value of the conditioning column

Returns:

  • pandas.DataFrame
    • Generated synthetic data

Data Type Handling

CTAB_XTRA_DP processes each declared data type as follows (automatic type detection is not yet implemented, so column types must be declared explicitly):

  • Categorical data: One-hot encoded
  • Mixed data: Modeled using a mixture of discrete modes and continuous distributions
  • Log-transformed data: Log-transformed before modeling, exponentiated during generation
  • Integer data: Values are rounded to integers during generation
  • Gaussian data: Modeled directly with a Gaussian distribution
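The per-type transformations above can be sketched in plain pandas/NumPy. This is a simplified illustration of the pre- and post-processing steps, not the library's actual code, and the example DataFrame is hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fuel": ["Petrol", "Electric", "Petrol"],  # categorical column
    "price": [1000.0, 50000.0, 2500.0],        # log-transformed column
    "doors": [3.7, 4.2, 5.0],                  # raw model output for an integer column
})

# Categorical data: one-hot encoded before training
one_hot = pd.get_dummies(df["fuel"])

# Log-transformed data: log before modeling, exponentiate when generating
log_price = np.log(df["price"])
recovered = np.exp(log_price)  # inverse transform at generation time

# Integer data: generated values are rounded back to integers
doors_int = df["doors"].round().astype(int)
```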

Differential Privacy

The model implements differential privacy using:

  • Gradient clipping to bound the sensitivity of the training process
  • Gaussian noise addition to gradients according to the specified privacy parameters
  • Privacy accounting to track epsilon expenditure
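The first two mechanisms can be illustrated with a minimal DP-SGD-style gradient step. This is a sketch of the general technique (per-sample clipping followed by Gaussian noise), not the library's internal implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_gradient(per_sample_grads, clip_coeff=1.0, sigma=1.0):
    """Clip each per-sample gradient to norm clip_coeff, average, add Gaussian noise."""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        # Rescale so no single sample contributes more than clip_coeff (bounds sensitivity)
        clipped.append(g * min(1.0, clip_coeff / norm))
    avg = np.mean(clipped, axis=0)
    # Gaussian noise scaled to the clipping bound and batch size
    noise = rng.normal(0.0, sigma * clip_coeff / len(per_sample_grads), size=avg.shape)
    return avg + noise

grads = [np.array([3.0, 4.0]), np.array([0.1, 0.2])]  # norms 5.0 and ~0.22
noisy = dp_gradient(grads, clip_coeff=1.0, sigma=1.0)
```

The privacy accountant (not shown) would track the cumulative epsilon spent across such noisy steps.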

Examples

Basic Usage

import pandas as pd
from ctab_xtra_dp import CTAB_XTRA_DP, load_demo

# Load data
df = load_demo()

# Initialize model
synthesizer = CTAB_XTRA_DP(
    df=df,
    categorical_columns=['workclass', 'education', 'marital-status', 'occupation', 
                         'relationship', 'race', 'gender', 'native-country'],
    mixed_columns={'capital-loss': [0], 'capital-gain': [0]},
    integer_columns=['age', 'fnlwgt', 'hours-per-week']
)

# Train model
synthesizer.fit(epochs=150)

# Generate synthetic data
synthetic_data = synthesizer.generate_samples(n=1000)

# Save synthetic data
synthetic_data.to_csv("synthetic_adult.csv", index=False)

Generating Conditioned Samples

# Generate samples with specific education level
bachelors_samples = synthesizer.generate_samples(
    n=500,
    conditioning_column='education',
    conditioning_value='Bachelors'
)

# Generate samples with specific occupation
managers_samples = synthesizer.generate_samples(
    n=500,
    conditioning_column='occupation',
    conditioning_value='Exec-managerial'
)

With Auxiliary Classifier

import pandas as pd
from ctab_xtra_dp import CTAB_XTRA_DP, load_demo

# Load data
df = load_demo()

# Initialize model
synthesizer = CTAB_XTRA_DP(
    df=df,
    categorical_columns=['workclass', 'education', 'marital-status', 'occupation', 
                         'relationship', 'race', 'gender', 'native-country'],
    mixed_columns={'capital-loss': [0], 'capital-gain': [0]},
    integer_columns=['age', 'fnlwgt', 'hours-per-week'],
    problem_type=("Classification", 'income')
)

# Train model
synthesizer.fit(epochs=150)

# Generate synthetic data
synthetic_data = synthesizer.generate_samples(n=1000)

# Save synthetic data
synthetic_data.to_csv("synthetic_adult.csv", index=False)

With Differential Privacy

# Initialize with stronger privacy guarantees
private_synthesizer = CTAB_XTRA_DP(
    df=df,
    categorical_columns=['workclass', 'education', 'marital-status', 'occupation', 
                         'relationship', 'race', 'gender', 'native-country'],
    integer_columns=['age', 'hours-per-week'],
    dp_constraints={
        "epsilon_budget": 1.0,  # Stricter privacy budget
        "clip_coeff": 1.0
    }
)

private_synthesizer.fit(epochs=100)
private_samples = private_synthesizer.generate_samples(n=1000)

Not specifying delta lets the model compute a reasonable delta value.

With MNAR Values

A handful of records in 'capital-loss' and 'capital-gain' have missing financial data. Removing the null values would lose valuable information, while interpreting them as 0 would fail to distinguish people with little financial activity from people whose values are null for some data-access reason.

import numpy as np

# Initialize with MNAR values declared as modes in mixed_columns
private_synthesizer = CTAB_XTRA_DP(
    df=df,
    categorical_columns=['workclass', 'education', 'marital-status', 'occupation', 
                         'relationship', 'race', 'gender', 'native-country'],
    integer_columns=['age', 'hours-per-week'],
    mixed_columns={'capital-loss': [0, np.nan], 'capital-gain': [0, np.nan]},
    dp_constraints={
        "epsilon_budget": 1.0,  # Stricter privacy budget
        "clip_coeff": 1.0
    }
)

private_synthesizer.fit(epochs=100)
private_samples = private_synthesizer.generate_samples(n=1000)


Evaluation Framework

CTAB_XTRA_DP includes a comprehensive evaluation framework to assess both the utility and privacy of the generated synthetic data.

Utility Evaluation

from ctab_xtra_dp.evaluation import get_utility_metrics, stat_sim

# Generate synthetic data
synthetic_data = synthesizer.generate_samples(n=1000)

# Evaluate supervised learning performance difference
# Lower difference values indicate better utility preservation
utility_diff = get_utility_metrics(
    data_real=df,
    data_synthetic=synthetic_data,
    scaler="MinMax",  # or "Standard"
    type={"Classification": ["lr", "dt", "rf", "mlp"]},  # for classification tasks
    # type={"Regression": ["l_reg", "ridge", "lasso", "B_ridge"]},  # for regression tasks
    test_ratio=0.2
)

# Evaluate statistical similarity
# Lower values indicate better statistical preservation
cat_columns = ['workclass', 'education', 'marital-status', 'occupation']
stat_metrics = stat_sim(df, synthetic_data, cat_cols=cat_columns)

# stat_metrics[0]: Average Wasserstein distance for numerical columns
# stat_metrics[1]: Average Jensen-Shannon divergence for categorical columns
# stat_metrics[2]: Correlation matrix distance

Privacy Evaluation

from ctab_xtra_dp.evaluation import privacy_metrics

# Assess privacy protection
privacy_results = privacy_metrics(
    real=df, 
    fake=synthetic_data, 
    data_percent=15  # Percentage of data to sample for efficiency
)

# Key metrics in privacy_results:
# - min_dist_rf_5th: Minimum distance from real to fake records (5th percentile)
# - min_dist_rr_5th: Minimum distance within real records (5th percentile)
# - min_dist_ff_5th: Minimum distance within fake records (5th percentile)
# - privacy_risk_score: Overall privacy protection score (higher is better)

Interpreting Evaluation Results

Utility Metrics

  • Machine Learning Performance: Lower difference values indicate synthetic data that better preserves predictive relationships
  • Statistical Similarity: Lower values indicate better preservation of distributions and correlations

Privacy Metrics

  • Minimum Distances: Higher real-to-fake distances relative to within-dataset distances suggest better privacy protection
  • Privacy Risk Score: Values greater than 1.0 indicate good privacy protection, with higher values being better
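These two rules of thumb can be combined into a single check. The helper below is hypothetical (not part of ctab_xtra_dp.evaluation); it simply encodes the interpretation stated above:

```python
def interpret_privacy(min_dist_rf_5th, min_dist_rr_5th, privacy_risk_score):
    """Summarize privacy metrics: protection looks good when the real-to-fake
    distance exceeds the within-real distance and the risk score is above 1.0."""
    distance_ratio = min_dist_rf_5th / min_dist_rr_5th  # > 1: fakes sit farther
    # from real records than real records sit from each other
    protected = privacy_risk_score > 1.0 and distance_ratio >= 1.0
    return {"distance_ratio": distance_ratio, "protected": protected}
```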

Complete Evaluation Example

import pandas as pd
from ctab_xtra_dp import CTAB_XTRA_DP, load_demo
from ctab_xtra_dp.evaluation import get_utility_metrics, stat_sim, privacy_metrics

# Load data
real_data = load_demo()

# Initialize and train model
synthesizer = CTAB_XTRA_DP(
    df=real_data,
    categorical_columns=['workclass', 'education', 'marital-status', 'occupation',
                        'relationship', 'race', 'gender', 'native-country'],
    mixed_columns={'capital-loss': [0], 'capital-gain': [0]},
    integer_columns=['age', 'hours-per-week'],
    dp_constraints={"epsilon_budget": 5.0}
)

synthesizer.fit(epochs=100)
synthetic_data = synthesizer.generate_samples(n=len(real_data))

# Comprehensive evaluation
# 1. Utility evaluation
cat_cols = ['workclass', 'education', 'marital-status', 'occupation',
            'relationship', 'race', 'gender', 'native-country', 'income']

ml_diff = get_utility_metrics(
    data_real=real_data,
    data_synthetic=synthetic_data,
    type={"Classification": ["lr", "dt", "rf"]},
    test_ratio=0.2
)

print("Machine Learning Utility Difference:")
print(f"Accuracy diff: {ml_diff[0][0]:.4f}")
print(f"AUC diff: {ml_diff[0][1]:.4f}")
print(f"F1-score diff: {ml_diff[0][2]:.4f}")

# 2. Statistical similarity
stat_results = stat_sim(real_data, synthetic_data, cat_cols=cat_cols)
print("\nStatistical Similarity:")
print(f"Numerical columns (Wasserstein): {stat_results[0]:.4f}")
print(f"Categorical columns (JSD): {stat_results[1]:.4f}")
print(f"Correlation distance: {stat_results[2]:.4f}")

# 3. Privacy evaluation
priv_results = privacy_metrics(real_data, synthetic_data)
print("\nPrivacy Evaluation:")
print(f"Privacy Risk Score: {priv_results['privacy_risk_score']:.4f}")
print(f"Real-to-Fake Min Distance (5th): {priv_results['min_dist_rf_5th']:.4f}")
print(f"Real-to-Real Min Distance (5th): {priv_results['min_dist_rr_5th']:.4f}")

Citation

A citation is not yet available. If you use this package in your research, please cite:

@article{ctab-xtra-dp,
  title={CTAB-XTRA-DP: Improved Tabular Data Synthesis with Differential Privacy},
  author={Your Name},
  journal={arXiv preprint},
  year={2025}
}

License

MIT
