
Conditional GAN for Tabular Data


An Open Source Project from the Data to AI Lab, at MIT


Overview

CTGAN is a collection of Deep Learning-based synthetic data generators for single-table data. The generators learn from real data and produce synthetic clones with high fidelity.

Currently, this library implements the CTGAN and TVAE models proposed in the paper "Modeling Tabular data using Conditional GAN". For more information about these models, please check out their respective user guides.

Install

Requirements

CTGAN has been developed and tested on Python 3.6, 3.7 and 3.8.

Install from PyPI

The recommended way to install CTGAN is with pip:

pip install ctgan

This will pull and install the latest stable release from PyPI.

Install with conda

CTGAN can also be installed using conda:

conda install -c sdv-dev -c pytorch -c conda-forge ctgan

This will pull and install the latest stable release from Anaconda.

Usage Example

WARNING: If you're just getting started with synthetic data, we recommend using the SDV library, which provides user-friendly APIs for interacting with CTGAN. To learn more about using CTGAN through SDV, check out the SDV user guide.

To get started with CTGAN, you should prepare your data as either a numpy.ndarray or a pandas.DataFrame object with two types of columns:

  • Continuous Columns: can contain any numerical value.
  • Discrete Columns: contain a finite number of values, whether these are string values or not.

In this example we load the Adult Census Dataset, which is a built-in demo dataset. We then model it using the CTGANSynthesizer and generate a synthetic copy of it.

from ctgan import CTGANSynthesizer
from ctgan import load_demo

data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGANSynthesizer(epochs=10)
ctgan.fit(data, discrete_columns)

# Synthetic copy
samples = ctgan.sample(1000)
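
The discrete column names above are listed by hand. As a rough sketch, a first guess could also be derived from the data itself. The helper `guess_discrete_columns` and its `max_unique` threshold are hypothetical and not part of CTGAN. The helper works on plain dictionaries so that it needs no extra dependencies, and its result should be reviewed manually before it is passed to `fit`.

```python
from numbers import Number


def guess_discrete_columns(rows, max_unique=10):
    """Guess which columns are discrete in a list of dict rows.

    Hypothetical helper, not part of CTGAN. A column is treated as
    discrete if it holds any non-numeric value, or if it has at most
    ``max_unique`` distinct values.
    """
    columns = list(rows[0].keys()) if rows else []
    discrete = []
    for name in columns:
        values = [row[name] for row in rows]
        # bool is a subclass of int, so it is treated as non-numeric here.
        all_numeric = all(
            isinstance(value, Number) and not isinstance(value, bool)
            for value in values
        )
        if not all_numeric or len(set(values)) <= max_unique:
            discrete.append(name)
    return discrete


# Toy rows that only mimic the shape of the demo data.
rows = [
    {'age': 39, 'workclass': 'State-gov', 'hours-per-week': 40.0},
    {'age': 50, 'workclass': 'Private', 'hours-per-week': 13.0},
    {'age': 38, 'workclass': 'Private', 'hours-per-week': 40.5},
]
print(guess_discrete_columns(rows, max_unique=2))  # ['workclass']
```

A low `max_unique` is used here only because the toy table has three rows; on real data the threshold is a judgment call.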

Join our community

  1. Please have a look at the Contributing Guide to see how you can contribute to the project.
  2. If you have any questions or feature requests, or if you detect an error, please open an issue on GitHub or join our Slack Workspace.
  3. Also, do not forget to check the project documentation site!

Citing CTGAN

If you use CTGAN, please cite the following work:

  • Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
@inproceedings{xu2019modeling,
  title={Modeling Tabular data using Conditional GAN},
  author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}

Related Projects

Please note that these libraries are external contributions and are neither maintained nor supervised by the MIT DAI-Lab team.

R interface for CTGAN

A wrapper around CTGAN has been implemented by Kevin Kuo @kevinykuo, bringing the functionalities of CTGAN to R users.

More details can be found in the corresponding repository: https://github.com/kasaai/ctgan

CTGAN Server CLI

A package to easily deploy CTGAN onto a remote server. This package is developed by Timothy Pillow @oregonpillow.

More details can be found in the corresponding repository: https://github.com/oregonpillow/ctgan-server-cli

The Synthetic Data Vault

This repository is part of The Synthetic Data Vault Project.

History

v0.3.1 - 2021-01-27

Improvements

  • Check discrete_columns valid before fitting - Issue #35 by @fealho

Bugs fixed

  • ValueError: max() arg is an empty sequence - Issue #115 by @fealho
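
The kind of check described in this release can be sketched as follows. The function name `validate_discrete_columns` and the error message are assumptions made for illustration; the library's internal check may look different.

```python
def validate_discrete_columns(column_names, discrete_columns):
    """Raise ValueError if a discrete column is missing from the data.

    Hypothetical helper for illustration; CTGAN's own check may differ.
    """
    known = set(column_names)
    missing = [name for name in discrete_columns if name not in known]
    if missing:
        raise ValueError('Invalid columns found: {}'.format(missing))


columns = ['age', 'workclass', 'education', 'income']

# All requested columns exist, so this passes silently.
validate_discrete_columns(columns, ['workclass', 'income'])

# A misspelled column name is reported before any training starts.
try:
    validate_discrete_columns(columns, ['workclass', 'incom'])
except ValueError as error:
    print(error)
```

Failing early like this avoids a less readable error later in the fitting process.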

v0.3.0 - 2020-12-18

In this release we add a new TVAE model, which was presented in the original CTGAN paper. The release also exposes more hyperparameters and moves epochs and log_frequency from fit to the constructor.

A new verbose argument has been added to optionally disable unnecessary printing, and a new hyperparameter called discriminator_steps has been added to CTGAN to control the number of optimization steps performed in the discriminator for each generator epoch.
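
To illustrate what a setting like discriminator_steps controls, here is a schematic sketch of the update order. It assumes that each generator update is preceded by discriminator_steps discriminator updates; the function `update_schedule` is hypothetical and only records markers instead of running optimizer steps.

```python
def update_schedule(generator_updates, discriminator_steps=1):
    """Return the order of updates as a list of 'D' and 'G' markers.

    Illustrative only: real training would run optimizer steps here.
    """
    schedule = []
    for _ in range(generator_updates):
        # Several discriminator updates ...
        schedule.extend(['D'] * discriminator_steps)
        # ... followed by a single generator update.
        schedule.append('G')
    return schedule


print(''.join(update_schedule(generator_updates=3, discriminator_steps=2)))
# prints: DDGDDGDDG
```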

The code has also been reorganized and cleaned up for better readability and interpretability.

Special thanks to @Baukebrenninkmeijer @fealho @leix28 @csala for the contributions!

Improvements

  • Add TVAE - Issue #111 by @fealho
  • Move log_frequency to __init__ - Issue #102 by @fealho
  • Add discriminator steps hyperparameter - Issue #101 by @Baukebrenninkmeijer
  • Code cleanup / Expose hyperparameters - Issue #59 by @fealho and @leix28
  • Publish to conda repo - Issue #54 by @fealho

Bugs fixed

  • Fixed NaN != NaN counting bug. - Issue #100 by @fealho
  • Update dependencies and testing - Issue #90 by @csala
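
The NaN counting problem comes from the fact that NaN never compares equal to itself, so naive counting can split missing values into separate categories. The sketch below shows the effect with the standard library only. The helper `count_categories` is hypothetical and is not the fix applied in the library; it assumes that None is not itself used as a category.

```python
import math
from collections import Counter


def count_categories(values):
    """Count category occurrences, treating every NaN as one category.

    Hypothetical helper: NaN values are mapped to the sentinel None
    before counting (assumes None is not itself a category).
    """
    def normalize(value):
        if isinstance(value, float) and math.isnan(value):
            return None
        return value

    return Counter(normalize(value) for value in values)


values = ['Private', float('nan'), 'Private', float('nan')]

# Each float('nan') is a distinct object that compares unequal to the
# other one, so a plain Counter keeps one entry per NaN.
print(len(Counter(values)))             # 3
print(count_categories(values)[None])   # 2
```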

v0.2.2 - 2020-11-13

In this release we introduce several minor improvements to make CTGAN more versatile and properly support new types of data, such as categorical NaN values, as well as conditional sampling and features to save and load models.

Additionally, the dependency ranges and Python versions have been updated to support up-to-date runtimes.

Many thanks @fealho @leix28 @csala @oregonpillow and @lurosenb for working on making this release possible!

Improvements

  • Drop Python 3.5 support - Issue #79 by @fealho
  • Support NaN values in categorical variables - Issue #78 by @fealho
  • Sample synthetic data conditioning on a discrete column - Issue #69 by @leix28
  • Support recent versions of pandas - Issue #57 by @csala
  • Easy solution for restoring original dtypes - Issue #26 by @oregonpillow

Bugs fixed

  • Loss to nan - Issue #73 by @fealho
  • Swapped the sklearn utils testing import statement - Issue #53 by @lurosenb

v0.2.1 - 2020-01-27

Minor release that ensures the logs are printed properly and adds an option to disable the log transformation of the discrete column frequencies.

Special thanks to @kevinykuo for the contributions!

Issues Resolved:

  • Option to sample from true data frequency instead of logged frequency - Issue #16 by @kevinykuo
  • Flush stdout buffer for epoch updates - Issue #14 by @kevinykuo
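
The difference between sampling from true and logged frequencies can be sketched with a small calculation. This example assumes that the logged variant weights each category by log(count + 1); the function `category_probabilities` and the counts are made up for illustration.

```python
import math


def category_probabilities(counts, log_frequency=True):
    """Turn category counts into sampling probabilities.

    Assumption for illustration: the logged variant uses log(count + 1).
    """
    if log_frequency:
        weights = {name: math.log(count + 1) for name, count in counts.items()}
    else:
        weights = dict(counts)
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Made-up counts for an imbalanced discrete column.
counts = {'<=50K': 900, '>50K': 100}
print(category_probabilities(counts, log_frequency=False))
print(category_probabilities(counts, log_frequency=True))
```

With the logged weights the rare category receives a noticeably larger share than its true frequency, which is why an option to turn the transformation off can matter.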

v0.2.0 - 2019-12-18

Reorganization of the project structure with a new Python API, new Command Line Interface and increased data format support.

Issues Resolved:

  • Reorganize the project structure - Issue #10 by @csala
  • Move epochs to the fit method - Issue #5 by @csala

v0.1.0 - 2019-11-07

First Release - NeurIPS 2019 Version.
