
Create tabular synthetic data using a conditional GAN


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.


Overview

CTGAN is a collection of Deep Learning based synthetic data generators for single table data, which are able to learn from real data and generate synthetic data with high fidelity.

Important Links
  • Website: Check out the SDV Website for more information about our overall synthetic data ecosystem.
  • Blog: A deeper look at open source, synthetic data creation and evaluation.
  • Documentation: Quickstarts, User and Development Guides, and API Reference.
  • Repository: The link to the GitHub Repository of this library.
  • Development Status: This software is in its Pre-Alpha stage.
  • Community: Join our Slack Workspace for announcements and discussions.

Currently, this library implements the CTGAN and TVAE models described in the Modeling Tabular data using Conditional GAN paper, presented at the 2019 NeurIPS conference.

Install

Use CTGAN through the SDV library

Warning: If you're just getting started with synthetic data, we recommend installing the SDV library, which provides user-friendly APIs for accessing CTGAN.

The SDV library provides wrappers for preprocessing your data as well as additional usability features like constraints. See the SDV documentation to get started.

Use the CTGAN standalone library

Alternatively, you can install and use CTGAN directly as a standalone library:

Using pip:

pip install ctgan

Using conda:

conda install -c pytorch -c conda-forge ctgan

When using the CTGAN library directly, you may need to manually preprocess your data into the correct format, for example:

  • Continuous data must be represented as floats
  • Discrete data must be represented as ints or strings
  • The data should not contain any missing values
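The requirements above can be met with a few lines of pandas. The following is only a minimal sketch: the helper name `prepare_for_ctgan` and the toy columns are hypothetical, and dropping incomplete rows is just one possible way to handle missing values (imputation may suit your data better).

```python
import pandas as pd


def prepare_for_ctgan(df, discrete_columns):
    """Hypothetical helper: coerce a DataFrame into the format listed above."""
    # Drop incomplete rows first, so a missing value never becomes the string 'nan'.
    clean = df.dropna().reset_index(drop=True)
    for column in clean.columns:
        if column in discrete_columns:
            # Discrete data: represent as strings.
            clean[column] = clean[column].astype(str)
        else:
            # Continuous data: represent as floats.
            clean[column] = clean[column].astype(float)
    return clean


# Toy data with one missing value in each column.
raw = pd.DataFrame({
    'age': [39, 50, None, 28],
    'workclass': ['Private', 'State-gov', 'Private', None],
})
prepared = prepare_for_ctgan(raw, discrete_columns=['workclass'])
print(prepared.dtypes)
```

The returned frame can then be passed to `fit` together with the same list of discrete column names.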

Usage Example

In this example we load the Adult Census Dataset*, which is a built-in demo dataset. We use CTGAN to learn from the real data and then generate some synthetic data.

from ctgan import CTGAN
from ctgan import load_demo

real_data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGAN(epochs=10)
ctgan.fit(real_data, discrete_columns)

# Create synthetic data
synthetic_data = ctgan.sample(1000)

*For more information about the dataset see: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
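After sampling, a quick sanity check is to compare how often each category appears in the real and synthetic tables. The sketch below is an informal check, not an SDV quality metric. The helper `compare_category_frequencies` is hypothetical, and the two tiny frames are stand-ins so the snippet runs without training a model.

```python
import pandas as pd


def compare_category_frequencies(real, synthetic, column):
    """Hypothetical helper: side-by-side relative frequencies for one discrete column."""
    real_freq = real[column].value_counts(normalize=True)
    synthetic_freq = synthetic[column].value_counts(normalize=True)
    table = pd.concat({'real': real_freq, 'synthetic': synthetic_freq}, axis=1)
    # A category missing from one side shows up as NaN; treat it as frequency 0.
    table = table.fillna(0.0)
    table['abs_diff'] = (table['real'] - table['synthetic']).abs()
    return table.sort_values('abs_diff', ascending=False)


# Stand-in frames; in practice pass real_data and synthetic_data instead.
real_sample = pd.DataFrame({'sex': ['Male', 'Male', 'Female', 'Female']})
synthetic_sample = pd.DataFrame({'sex': ['Male', 'Male', 'Male', 'Female']})
report = compare_category_frequencies(real_sample, synthetic_sample, 'sex')
print(report)
```

Large values in the `abs_diff` column point at categories the model over- or under-produces, which may improve with more training epochs.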

Join our community

Join our Slack channel to discuss CTGAN and synthetic data. If you find a bug or have a feature request, you can also open an issue on our GitHub.

Interested in contributing to CTGAN? Read our Contribution Guide to get started.

Citing CTGAN

If you use CTGAN, please cite the following work:

Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.

@inproceedings{ctgan,
  title={Modeling Tabular data using Conditional GAN},
  author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}

Related Projects

Please note that these projects are external to the SDV Ecosystem. They are not affiliated with or maintained by DataCebo.




The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

History

v0.7.2 - 2023-05-09

This release adds support for Pandas 2.0! It also fixes a bug in the load_demo function.

Bugs Fixed

  • load_demo raises urllib.error.HTTPError: HTTP Error 403: Forbidden - Issue #284 by @amontanez24

Maintenance

  • Remove upper bound for pandas - Issue #282 by @frances-h

v0.7.1 - 2023-02-23

This release fixes a bug that prevented the CTGAN model from being saved after sampling.

Bugs Fixed

  • Cannot save CTGANSynthesizer after sampling (TypeError) - Issue #270 by @pvk-developer

v0.7.0 - 2023-01-20

This release adds support for Python 3.10 and drops support for Python 3.6. It also fixes a couple of the most common warnings that were surfacing.

New Features

  • Support Python 3.10 and 3.11 - Issue #259 by @pvk-developer

Bugs Fixed

  • Fix SettingWithCopyWarning (may be leading to a numerical calculation bug) - Issue #215 by @amontanez24
  • FutureWarning in data_transformer with pandas 1.5.0 - Issue #246 by @amontanez24

Maintenance

  • CTGAN Package Maintenance Updates - Issue #257 by @amontanez24

v0.6.0 - 2022-10-07

This release renames the models in CTGAN. CTGANSynthesizer is now called CTGAN and TVAESynthesizer is now called TVAE.

New Features

  • Rename synthesizers - Issue #243 by @amontanez24

v0.5.2 - 2022-08-18

This release updates CTGAN to use the latest version of RDT. It also includes performance and robustness updates to the data transformer.

Issues closed

  • Bump rdt version - Issue #242 by @katxiao
  • Single thread data transform is slow for huge table - Issue #151 by @mfhbree
  • Fix RDT api - Issue #232 by @pvk-developer
  • Update macos to use latest version. - Issue #237 by @pvk-developer
  • Update the RDT version to 1.0 - Issue #224 by @pvk-developer
  • Update slack invite link. - Issue #222 by @pvk-developer
  • robustness fix, when data have less rows than the default number of cl… - Issue #211 by @Deathn0t

v0.5.1 - 2022-02-25

This release fixes a bug with the decoder instantiation, and also allows users to set a random state for the model fitting and sampling.

Issues closed

  • Update self.decoder with correct variable name - Issue #203 by @tejuafonja
  • Add random state - Issue #204 by @katxiao

v0.5.0 - 2021-11-18

This release adds support for Python 3.9, updates dependencies to ensure compatibility with the rest of the SDV ecosystem, and upgrades to the latest RDT release.

Issues closed

  • Add support for Python 3.9 - Issue #177 by @pvk-developer
  • Add pip check to CI workflows - Issue #174 by @pvk-developer
  • Typo in CTGAN code - Issue #158 by @ori-katz100 and @fealho

v0.4.3 - 2021-07-12

Dependency upgrades to ensure compatibility with the rest of the SDV ecosystem.

v0.4.2 - 2021-04-27

In this release, the computation of the TVAE loss function has been fixed. In addition, the default value of discriminator_decay has been changed to 1e-6, and the tests have been improved.

Issues closed

  • TVAE: loss function - Issue #143 by @fealho and @DingfanChen
  • Set discriminator_decay to 1e-6 - Pull request #145 by @fealho
  • Adds unit tests - Pull request #140 by @fealho

v0.4.1 - 2021-03-30

This release exposes all the hyperparameters which the user may find useful for both CTGAN and TVAE. Also TVAE can now be fitted on datasets that are shorter than the batch size and drops the last batch only if the data size is not divisible by the batch size.

Issues closed

  • TVAE: Adapt batch_size to data size - Issue #135 by @fealho and @csala
  • ValueError from validate_discre_columns with uniqueCombinationConstraint - Issue #133 by @fealho and @MLjungg

v0.4.0 - 2021-02-24

Maintenance release to upgrade dependencies to ensure compatibility with the rest of the SDV libraries.

Also add a validation on the CTGAN condition_column and condition_value inputs.

Improvements

  • Validate condition_column and condition_value - Issue #124 by @fealho
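To illustrate what this kind of validation guards against, here is a minimal sketch of a pre-check you could run yourself before conditional sampling. It is not the library's own validation code: the function `check_condition`, its error messages, and the toy data are all hypothetical.

```python
import pandas as pd


def check_condition(real_data, discrete_columns, condition_column, condition_value):
    """Hypothetical pre-check, not the library's own validation code."""
    # Conditioning is only meaningful on a column declared as discrete.
    if condition_column not in discrete_columns:
        raise ValueError(f"'{condition_column}' is not one of the discrete columns")
    # The requested value must have been seen in the training data.
    seen = set(real_data[condition_column].unique())
    if condition_value not in seen:
        raise ValueError(
            f"'{condition_value}' never appears in column '{condition_column}'"
        )
    return condition_column, condition_value


training_data = pd.DataFrame({
    'age': [39.0, 50.0, 38.0],
    'income': ['<=50K', '>50K', '<=50K'],
})
column, value = check_condition(training_data, ['income'], 'income', '>50K')
print(column, value)
```

With a fitted model, the checked pair would then be passed to `sample` through its `condition_column` and `condition_value` arguments.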

v0.3.1 - 2021-01-27

Improvements

  • Check discrete_columns valid before fitting - Issue #35 by @fealho

Bugs fixed

  • ValueError: max() arg is an empty sequence - Issue #115 by @fealho

v0.3.0 - 2020-12-18

In this release we add a new TVAE model which was presented in the original CTGAN paper. It also exposes more hyperparameters and moves epochs and log_frequency from fit to the constructor.

A new verbose argument has been added to optionally disable unnecessary printing, and a new hyperparameter called discriminator_steps has been added to CTGAN to control the number of optimization steps performed in the discriminator for each generator epoch.

The code has also been reorganized and cleaned up for better readability and interpretability.

Special thanks to @Baukebrenninkmeijer @fealho @leix28 @csala for the contributions!

Improvements

  • Add TVAE - Issue #111 by @fealho
  • Move log_frequency to __init__ - Issue #102 by @fealho
  • Add discriminator steps hyperparameter - Issue #101 by @Baukebrenninkmeijer
  • Code cleanup / Expose hyperparameters - Issue #59 by @fealho and @leix28
  • Publish to conda repo - Issue #54 by @fealho

Bugs fixed

  • Fixed NaN != NaN counting bug. - Issue #100 by @fealho
  • Update dependencies and testing - Issue #90 by @csala

v0.2.2 - 2020-11-13

In this release we introduce several minor improvements to make CTGAN more versatile and properly support new types of data, such as categorical NaN values, as well as conditional sampling and features to save and load models.

Additionally, the dependency ranges and python versions have been updated to support up to date runtimes.

Many thanks @fealho @leix28 @csala @oregonpillow and @lurosenb for working on making this release possible!

Improvements

  • Drop Python 3.5 support - Issue #79 by @fealho
  • Support NaN values in categorical variables - Issue #78 by @fealho
  • Sample synthetic data conditioning on a discrete column - Issue #69 by @leix28
  • Support recent versions of pandas - Issue #57 by @csala
  • Easy solution for restoring original dtypes - Issue #26 by @oregonpillow

Bugs fixed

  • Loss to nan - Issue #73 by @fealho
  • Swapped the sklearn utils testing import statement - Issue #53 by @lurosenb

v0.2.1 - 2020-01-27

Minor version including changes to ensure the logs are properly printed and adding an option to disable the log transformation of the discrete column frequencies.

Special thanks to @kevinykuo for the contributions!

Issues Resolved:

  • Option to sample from true data frequency instead of logged frequency - Issue #16 by @kevinykuo
  • Flush stdout buffer for epoch updates - Issue #14 by @kevinykuo
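The difference between the two frequency modes can be shown with a small calculation. This is only a sketch of the idea, assuming the log transformation is log(count + 1) followed by normalization; the function name `category_probabilities` and the example counts are hypothetical, and the library's internals may differ.

```python
import math


def category_probabilities(counts, log_frequency=True):
    """Turn category counts into sampling probabilities (illustrative only)."""
    if log_frequency:
        # Assumed transform: log(count + 1), which flattens large differences.
        weights = {cat: math.log(n + 1) for cat, n in counts.items()}
    else:
        # True data frequency: use the raw counts directly.
        weights = {cat: float(n) for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}


counts = {'Private': 99, 'Never-worked': 1}
logged = category_probabilities(counts, log_frequency=True)
true_freq = category_probabilities(counts, log_frequency=False)
print(logged)
print(true_freq)
```

Under this assumption the rare category is picked far more often with the log transformation than with the true frequency, which is the trade-off the option lets you control.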

v0.2.0 - 2019-12-18

Reorganization of the project structure with a new Python API, new Command Line Interface and increased data format support.

Issues Resolved:

  • Reorganize the project structure - Issue #10 by @csala
  • Move epochs to the fit method - Issue #5 by @csala

v0.1.0 - 2019-11-07

First Release - NeurIPS 2019 Version.


