A library to generate synthetic tabular data using Conditional Generative Adversarial Networks (CGANs) combined with Differential Privacy techniques.
👯 DP-CGANS (Differentially Private - Conditional Generative Adversarial NetworkS)
Abstract: This repository presents a Conditional Generative Adversarial Network (CGAN) for tabular data (and RDF data), combined with Differential Privacy techniques. Our pre-print publication: Improving Correlation Capture in Generating Imbalanced Data using Differentially Private Conditional GANs.
Author: Chang Sun, Institute of Data Science, Maastricht University
Start date: Nov-2021
Status: Under development
Note: "Standing on the shoulders of giants". This repository is inspired by the excellent work of CTGAN from the Synthetic Data Vault (SDV), TensorFlow Privacy, and RdfPdans. We highly appreciate that the authors shared their ideas and implementations, made their code publicly available, and wrote clear documentation. More related work can be found in the References below.
This package extends SDV (https://github.com/sdv-dev/SDV), CTGAN (https://github.com/sdv-dev/CTGAN), and Differential Privacy in GANs (https://github.com/civisanalytics/dpwgan). The author modified the conditional matrix and cost functions to emphasize the relations between variables. The main changes are in ctgan/synthesizers/ctgan.py ../data_sampler.py ../data_transformer.py
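For readers new to differential privacy in GAN training, the following is a minimal, generic sketch of the clip-and-add-noise idea used by DP-SGD-style methods. It is not taken from this package's code: the function names, the plain-list gradient representation, and the parameters max_norm and noise_multiplier are all assumptions made for illustration.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum them, add Gaussian noise, then average.

    Hypothetical helper for illustration only; not the dp-cgans implementation.
    """
    if not per_sample_grads:
        raise ValueError("per_sample_grads must not be empty")
    clipped = [clip_gradient(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Toy per-sample gradients (made-up numbers).
grads = [[3.0, 4.0], [0.0, 0.5]]
rng = random.Random(0)
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single record can influence an update, and the added noise hides what remains of that influence. The privacy guarantee then depends on the noise level, the batch sampling, and the number of training steps.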
📥️ Installation
You will need Python >=3.8 and <3.10.
pip install dp-cgans
🪄 Usage
⌨️ Use as a command-line interface
You can generate synthetic data for a file from your terminal after installing dp-cgans with pip.
To quickly run our example, you can download the example data:
wget https://raw.githubusercontent.com/sunchang0124/dp_cgans/main/resources/example_tabular_data_UCIAdult.csv
Then run dp-cgans:
dp-cgans gen example_tabular_data_UCIAdult.csv --epochs 2 --output out.csv --gen-size 100
Get a full rundown of the available options for generating synthetic data with:
dp-cgans gen --help
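If you want to drive the command-line interface from a script (for example, to generate several files in a loop), the sketch below assembles the same command shown above. It uses only the flags that appear in this README (--epochs, --output, --gen-size). The helper names build_gen_command and run_gen are hypothetical and not part of dp-cgans.

```python
import shutil
import subprocess


def build_gen_command(input_csv, output_csv, epochs=2, gen_size=100):
    """Assemble the argument list for `dp-cgans gen` (hypothetical helper)."""
    return [
        "dp-cgans", "gen", input_csv,
        "--epochs", str(epochs),
        "--output", output_csv,
        "--gen-size", str(gen_size),
    ]


def run_gen(command):
    """Run the command, failing early if the dp-cgans CLI is not installed."""
    if shutil.which(command[0]) is None:
        raise RuntimeError("dp-cgans was not found on PATH; install it with pip first")
    subprocess.run(command, check=True)


command = build_gen_command("example_tabular_data_UCIAdult.csv", "out.csv")
print(" ".join(command))
# Call run_gen(command) to execute it once dp-cgans and the example CSV are available.
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.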
🐍 Use with python
This library can also be used directly in Python scripts.
If your input is tabular data (e.g., csv):
from dp_cgans import DP_CGAN
import pandas as pd
tabular_data=pd.read_csv("../resources/example_tabular_data_UCIAdult.csv")
# We adjusted the original CTGAN model from SDV. Instead of looking at the distribution of each individual variable, we extended it to pairs of variables to keep their correlations.
model = DP_CGAN(
    epochs=100,  # number of training epochs
    batch_size=1000,  # the size of each batch
    log_frequency=True,
    verbose=True,
    generator_dim=(128, 128, 128),
    discriminator_dim=(128, 128, 128),
    generator_lr=2e-4,
    discriminator_lr=2e-4,
    discriminator_steps=1,
    private=False,
)
print("Start training model")
model.fit(tabular_data)
model.save("generator.pkl")
# Generate 100 synthetic rows
syn_data = model.sample(100)
syn_data.to_csv("syn_data_file.csv")
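Because the model aims to preserve relations between pairs of variables, one simple sanity check after sampling is to compare the joint distribution of two categorical columns in the real and synthetic data. The sketch below is not part of dp-cgans; it uses only the standard library, and the helper names and toy values are assumptions for illustration. It computes the total variation distance, where 0 means the joint frequencies are identical and 1 means they do not overlap at all.

```python
from collections import Counter


def joint_frequencies(pairs):
    """Return the relative frequency of each (value_a, value_b) pair."""
    counts = Counter(pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("pairs must not be empty")
    return {key: count / total for key, count in counts.items()}


def total_variation_distance(real_pairs, synthetic_pairs):
    """Half the sum of absolute differences between two joint distributions."""
    p = joint_frequencies(real_pairs)
    q = joint_frequencies(synthetic_pairs)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy data (made up): pairs of values from two categorical columns.
real = [("Male", "Yes"), ("Male", "No"), ("Female", "Yes"), ("Female", "No")]
synthetic = [("Male", "Yes"), ("Male", "Yes"), ("Female", "No"), ("Female", "No")]
tvd = total_variation_distance(real, synthetic)
print(f"Total variation distance: {tvd:.2f}")
```

With pandas DataFrames, the pairs can be built with zip(tabular_data["col_a"], tabular_data["col_b"]) and zip(syn_data["col_a"], syn_data["col_b"]), where col_a and col_b are placeholders for two categorical columns in your own data.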
🧑💻 Development setup
You will need to install Poetry.
Be careful: Poetry sometimes picks an unexpected Python version by default. You can check which environment Poetry uses for the current folder by running:
poetry env list
You can tell Poetry to use your current version of Python for this folder by running the following command:
poetry env use $(which python)
Install
Clone the repository:
git clone https://github.com/sunchang0124/dp_cgans
cd dp_cgans
Install the dependencies:
poetry install
Run
Run the library with the CLI:
poetry run dp-cgans gen --help
Run the tests locally:
poetry run pytest -s
Add a new dependency
You can edit the pyproject.toml file and run:
poetry update
Or you can do it directly with the CLI (e.g. for pandas here):
poetry add pandas
Build and publish
Build:
poetry build
Publishing a new release is done automatically by a GitHub Actions workflow when a release is created on GitHub. To publish manually instead:
poetry publish
📦️ New release process
The deployment of new releases is done automatically by a GitHub Action workflow when a new release is created on GitHub. To release a new version:
- Make sure the PYPI_API_TOKEN secret has been defined in the GitHub repository (in Settings > Secrets > Actions). You can get an API token from your PyPI account settings.
- Increment the version number in the pyproject.toml file in the root folder of the repository.
- Create a new release on GitHub, which will automatically trigger the publish workflow and publish the new release to PyPI.
You can also manually trigger the workflow from the Actions tab in your GitHub repository webpage.
📚️ References / Further reading
There is much excellent work on generating synthetic data using GANs and other methods. We list the studies that made great contributions to the field and inspired our work.
GANs
- Synthetic Data Vault (SDV) [Paper] [Github]
- Modeling Tabular Data using Conditional GAN (a part of SDV) [Paper] [Github]
- Wasserstein GAN [Paper]
- Improved Training of Wasserstein GANs [Paper]
- Synthesising Tabular Data using Wasserstein Conditional GANs with Gradient Penalty (WCGAN-GP) [Paper]
- PacGAN: The power of two samples in generative adversarial networks [Paper]
- CTAB-GAN: Effective Table Data Synthesizing [Paper]
- Conditional Tabular GAN-Based Two-Stage Data Generation Scheme for Short-Term Load Forecasting [Paper]
- TabFairGAN: Fair Tabular Data Generation with Generative Adversarial Networks [Paper]
- Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning [Paper]
Differential Privacy
- TensorFlow Privacy [Github]
- Renyi Differential Privacy [Paper]
- DP-CGAN: Differentially Private Synthetic Data and Label Generation [Paper]
- Differentially Private Generative Adversarial Network [Paper] [Github] Another implementation [Github]
- Private Data Generation Toolbox [Github]
- autodp: Automating differential privacy computation [Github]
- Differentially Private Synthetic Medical Data Generation using Convolutional GANs [Paper]
- DTGAN: Differential Private Training for Tabular GANs [Paper]
- Differentially Private Synthetic Data: Applied Evaluations and Enhancements [Paper]
- FFPDG: Fast, Fair and Private Data Generation [Paper]
Others
- EvoGen: a Generator for Synthetic Versioned RDF [Paper]
- Generation and evaluation of synthetic patient data [Paper]
- Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation [Paper]
- Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy [Paper]
- Synthetic data for open and reproducible methodological research in social sciences and official statistics [Paper]
- A Study of the Impact of Synthetic Data Generation Techniques on Data Utility using the 1991 UK Samples of Anonymised Records [Paper]