A library to generate synthetic tabular or RDF data using Conditional Generative Adversarial Networks (GANs) combined with Differential Privacy techniques.

👯 DP-CGANS (Differentially Private - Conditional Generative Adversarial NetworkS)

Abstract: This repository presents a Conditional Generative Adversarial Network (GAN) for tabular data (and RDF data) combined with Differential Privacy techniques. Our pre-print publication: Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy.

Author: Chang Sun, Institute of Data Science, Maastricht University

Start date: Nov-2021

Status: Under development

Note: "Standing on the shoulders of giants". This repository is inspired by the excellent work of CTGAN from Synthetic Data Vault (SDV), Tensorflow Privacy, and RdfPdans. We highly appreciate that they shared their ideas and implementations, made their code publicly available, and wrote clear documentation. More related work can be found in the References below.

This package is extended from SDV (https://github.com/sdv-dev/SDV), CTGAN (https://github.com/sdv-dev/CTGAN), and Differential Privacy in GANs (https://github.com/civisanalytics/dpwgan). The author modified the conditional matrix and cost functions to emphasize the relations between variables. The main changes are in ctgan/synthesizers/ctgan.py, ../data_sampler.py, and ../data_transformer.py.

📥️ Installation

You will need Python >=3.8 and <=3.11, with sdv==1.6.0 and rdt==1.9.0.

pip install dp-cgans
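Before installing, it can help to confirm that the interpreter falls inside the supported range and to see which versions of the pinned dependencies are already present. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are hypothetical and not part of dp-cgans.

```python
import sys
from importlib import metadata

# Supported interpreter range stated above (inclusive, compared on major.minor).
MIN_PY = (3, 8)
MAX_PY = (3, 11)


def python_is_supported(version=None):
    """Return True if (major, minor) lies within the supported range."""
    info = sys.version_info if version is None else version
    major_minor = tuple(info[:2])
    return MIN_PY <= major_minor <= MAX_PY


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    # Pins taken from the requirements above.
    for name, pin in (("sdv", "1.6.0"), ("rdt", "1.9.0")):
        print(name, "installed:", installed_version(name), "expected:", pin)
```

The check compares only the major and minor components, so any 3.11.x patch release counts as supported.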

🪄 Usage

⌨️ Use as a command-line interface

You can easily generate synthetic data for a file using your terminal after installing dp-cgans with pip.

To quickly run our example, you can download the example data:

wget https://raw.githubusercontent.com/sunchang0124/dp_cgans/main/resources/example_tabular_data_UCIAdult.csv

Then run dp-cgans:

dp-cgans gen example_tabular_data_UCIAdult.csv --epochs 100 --output out.csv --gen-size 100

Get a full rundown of the available options for generating synthetic data with:

dp-cgans gen --help
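If you want to drive the CLI from a Python script (for example, to loop over several input files), one option is to assemble the same command shown above and pass it to `subprocess`. This is a sketch under stated assumptions: the helper names `build_gen_command` and `run_gen` are hypothetical, and only the flags that appear in the example above (`--epochs`, `--output`, `--gen-size`) are used.

```python
import shutil
import subprocess


def build_gen_command(input_csv, output_csv, epochs=100, gen_size=100):
    """Assemble the argument list for `dp-cgans gen` using the flags shown above."""
    return [
        "dp-cgans", "gen", str(input_csv),
        "--epochs", str(epochs),
        "--output", str(output_csv),
        "--gen-size", str(gen_size),
    ]


def run_gen(input_csv, output_csv, **options):
    """Run the CLI if it is on PATH; raise a clear error otherwise."""
    if shutil.which("dp-cgans") is None:
        raise RuntimeError("dp-cgans is not installed or not on PATH")
    cmd = build_gen_command(input_csv, output_csv, **options)
    # check=True raises CalledProcessError on a non-zero exit code.
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with file names that contain spaces.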

🐍 Use with python

This library can also be used directly in Python scripts.

If your input is tabular data (e.g., csv):

from dp_cgans import DP_CGAN
import pandas as pd
import time

tabular_data=pd.read_csv("../resources/example_tabular_data_UCIAdult.csv")

### Add your pre-processing if needed
for col in tabular_data.columns:
   tabular_data[col] = pd.to_numeric(tabular_data[col], errors='ignore', downcast='integer')
for col in tabular_data.columns:
   if tabular_data[col].nunique() < 10: 
       tabular_data[col] = tabular_data[col].astype('object')

# Configure model hyper-parameters
model = DP_CGAN(
   epochs=500, # number of training epochs
   batch_size=100, # the size of each batch
   log_frequency=True,
   verbose=True,
   generator_dim=(128, 128, 128),
   discriminator_dim=(128, 128, 128),
   generator_lr=2e-4, 
   discriminator_lr=2e-4,
   discriminator_steps=10, 
   private=False,
)

start_time = time.time()
print("Start training the model: ")
model.fit(tabular_data)
end_time = time.time()

elapsed_time = end_time - start_time
print("Training model time ", elapsed_time)

print("Saving the trained generator...")
model.save("generator.pkl")

# print("load the trained file.")
# loaded_model=DP_CGAN.load("PATH_TO_MODEL")

# Generate 100 synthetic rows
syn_data = model.sample(100)
syn_data.to_csv("syn_data_file.csv",index=None)
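The pre-processing step in the script above can be wrapped in a small reusable function. The sketch below is one possible variant, not the library's own code: the name `preprocess` and the `max_categories` parameter are hypothetical, and it replaces `errors='ignore'` (deprecated in recent pandas releases) with an explicit `try`/`except`, which is intended to leave non-numeric columns unchanged in the same way.

```python
import pandas as pd


def preprocess(df, max_categories=10):
    """Return a copy with numeric-looking columns converted and
    low-cardinality columns cast to object (treated as categorical)."""
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col], downcast="integer")
        except (ValueError, TypeError):
            pass  # leave columns that cannot be parsed as numbers unchanged
    for col in out.columns:
        # Same threshold as the script above: fewer than 10 distinct values.
        if out[col].nunique() < max_categories:
            out[col] = out[col].astype("object")
    return out
```

With this helper, the loading step would become `tabular_data = preprocess(pd.read_csv(...))`, and the original DataFrame is left unmodified because the function works on a copy.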

🧑‍💻 Development setup

For development, we recommend installing and using Hatch, as it will automatically install and sync the dependencies when running development scripts. You can also create a virtual environment yourself and install the library in editable mode with pip install -e .

Install

Clone the repository:

git clone https://github.com/sunchang0124/dp_cgans
cd dp_cgans

When working in development, Hatch automatically installs and syncs the dependencies whenever you run a script.

Run

Run the library with the CLI:

hatch -v run dp-cgans gen --help

You can also enter a new shell with the virtual environments automatically activated:

hatch shell
dp-cgans gen --help

Tests

Run the tests locally:

hatch run pytest -s

Format

Run formatting and linting (black and ruff):

hatch run fmt

Reset the virtual environments

If the virtual environments are not updating as expected, you can reset them with:

hatch env prune

📦️ New release process

The deployment of new releases is done automatically by a GitHub Action workflow when a new release is created on GitHub. To release a new version:

  1. Make sure the PYPI_API_TOKEN secret has been defined in the GitHub repository (in Settings > Secrets > Actions). You can create an API token in your PyPI account settings.

  2. Increment the version number in the src/dp_cgans/__init__.py file:

    hatch version fix    # Bump from 0.0.1 to 0.0.2
    hatch version minor  # Bump from 0.0.1 to 0.1.0
    hatch version 0.1.1  # Bump to the specified version
    
  3. Create a new release on GitHub, which will automatically trigger the publish workflow, and publish the new release to PyPI.
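The version bumps in step 2 follow ordinary MAJOR.MINOR.PATCH arithmetic. The sketch below only illustrates that arithmetic for plain three-part versions; the function `bump` is hypothetical and is not how Hatch implements it.

```python
def bump(version, segment):
    """Bump a simple MAJOR.MINOR.PATCH version string.

    `segment` is "major", "minor", or "fix" (alias "patch").
    Lower-order components are reset to zero.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if segment == "major":
        return f"{major + 1}.0.0"
    if segment == "minor":
        return f"{major}.{minor + 1}.0"
    if segment in ("fix", "patch"):
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown segment: {segment!r}")
```

For example, `bump("0.0.1", "fix")` gives `"0.0.2"` and `bump("0.0.1", "minor")` gives `"0.1.0"`, matching the comments in step 2.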

You can also manually build and publish from your laptop:

hatch build
hatch publish

📚️ References / Further reading

There is much excellent work on generating synthetic data using GANs and other methods. We list the studies that made great contributions to the field and inspired our work.

GANs
  1. Synthetic Data Vault (SDV) [Paper] [Github]
  2. Modeling Tabular Data using Conditional GAN (a part of SDV) [Paper] [Github]
  3. Wasserstein GAN [Paper]
  4. Improved Training of Wasserstein GANs [Paper]
  5. Synthesising Tabular Data using Wasserstein Conditional GANs with Gradient Penalty (WCGAN-GP) [Paper]
  6. PacGAN: The power of two samples in generative adversarial networks [Paper]
  7. CTAB-GAN: Effective Table Data Synthesizing [Paper]
  8. Conditional Tabular GAN-Based Two-Stage Data Generation Scheme for Short-Term Load Forecasting [Paper]
  9. TabFairGAN: Fair Tabular Data Generation with Generative Adversarial Networks [Paper]
  10. Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning [Paper]
Differential Privacy
  1. Tensorflow Privacy [Github]
  2. Renyi Differential Privacy [Paper]
  3. DP-CGAN : Differentially Private Synthetic Data and Label Generation [Paper]
  4. Differentially Private Generative Adversarial Network [Paper] [Github] Another implementation [Github]
  5. Private Data Generation Toolbox [Github]
  6. autodp: Automating differential privacy computation [Github]
  7. Differentially Private Synthetic Medical Data Generation using Convolutional GANs [Paper]
  8. DTGAN: Differential Private Training for Tabular GANs [Paper]
  9. Differentially Private Synthetic Data: Applied Evaluations and Enhancements [Paper]
  10. FFPDG: Fast, Fair and Private Data Generation [Paper]
Others
  1. EvoGen: a Generator for Synthetic Versioned RDF [Paper]
  2. Generation and evaluation of synthetic patient data [Paper]
  3. Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation [Paper]
  4. Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy [Paper]
  5. Synthetic data for open and reproducible methodological research in social sciences and official statistics [Paper]
  6. A Study of the Impact of Synthetic Data Generation Techniques on Data Utility using the 1991 UK Samples of Anonymised Records [Paper]
