Applying GAN in tabular data generation for uneven distribution

GANs for tabular data

GANs are well known for their success in realistic image generation. However, they can also be applied to tabular data generation. We will review and examine some recent papers about tabular GANs in action.

Github project: "GAN-for-tabular-data"
Arxiv article: "Tabular GANs for uneven distribution"
Medium post: GANs for tabular data

Library goal

Suppose we have T_train and T_test (the train and test sets, respectively). We need to train a model on T_train and make predictions on T_test. To improve those predictions, we enlarge the train set by generating new data with a GAN so that it resembles T_test, without using the ground-truth labels of T_test.
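One common way to select train rows that are "similar to T_test" is adversarial filtering: fit a classifier to tell train rows from test rows, then keep the train rows it finds most test-like. The following is a minimal sketch of that idea, assuming scikit-learn is available. The function adversarial_filter and its keep_frac argument are hypothetical names used for illustration only; they are not part of tabgan, and the library's own filtering may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


def adversarial_filter(train, test, keep_frac=0.5, seed=42):
    """Keep the fraction of train rows that a train-vs-test classifier
    scores as most test-like (hypothetical helper, not the tabgan API)."""
    # Label train rows 0 and test rows 1, then fit a classifier on both.
    combined = pd.concat([train, test], ignore_index=True)
    is_test = np.r_[np.zeros(len(train), dtype=int), np.ones(len(test), dtype=int)]
    clf = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=seed)
    clf.fit(combined, is_test)

    # Column 1 of predict_proba is the probability of the "test" class.
    test_likeness = clf.predict_proba(train)[:, 1]

    # Keep the rows with the highest test-likeness scores.
    n_keep = int(len(train) * keep_frac)
    keep_idx = np.argsort(test_likeness)[::-1][:n_keep]
    return train.iloc[keep_idx].reset_index(drop=True)


rng = np.random.default_rng(0)
train = pd.DataFrame(rng.integers(-10, 150, size=(150, 4)), columns=list("ABCD"))
test = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

filtered = adversarial_filter(train, test, keep_frac=0.5)
print(len(train), "->", len(filtered))
```

No labels of the test set are needed here: the classifier only uses the features and the train/test membership of each row.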

How to use library

Installation: pip install tabgan
To generate new training data by sampling and then filtering it with adversarial training, call GANGenerator().generate_data_pipe:

from tabgan.sampler import OriginalGenerator, GANGenerator
import pandas as pd
import numpy as np

# random input data
train = pd.DataFrame(np.random.randint(-10, 150, size=(150, 4)), columns=list("ABCD"))
target = pd.DataFrame(np.random.randint(0, 2, size=(150, 1)), columns=list("Y"))
test = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))

# generate data
new_train1, new_target1 = OriginalGenerator().generate_data_pipe(train, target, test)
new_train2, new_target2 = GANGenerator().generate_data_pipe(train, target, test)

# example with all params defined
new_train3, new_target3 = GANGenerator(gen_x_times=1.1, cat_cols=None,
           bot_filter_quantile=0.001, top_filter_quantile=0.999, is_post_process=True,
           adversarial_model_params={
               "metrics": "AUC", "max_depth": 2, "max_bin": 100, 
               "learning_rate": 0.02, "random_state": 42, "n_estimators": 500,
           }, pregeneration_frac=2, only_generated_data=False,
           gan_params = {"batch_size": 500, "patience": 25, "epochs" : 500,}).generate_data_pipe(train, target,
                                          test, deep_copy=True, only_adversarial=False, use_adversarial=True)

Both samplers, OriginalGenerator and GANGenerator, accept the same input parameters:

gen_x_times: float = 1.1 - how much data to generate; the output might be smaller because of postprocessing and adversarial filtering
cat_cols: list = None - categorical columns
bot_filter_quantile: float = 0.001 - bottom quantile for postprocess filtering
top_filter_quantile: float = 0.999 - top quantile for postprocess filtering
is_post_process: bool = True - whether to perform postfiltering; if False, bot_filter_quantile and top_filter_quantile are ignored
adversarial_model_params: dict - params for the adversarial filtering model; the default values are for a binary task
pregeneration_frac: float = 2 - in the generation step, gen_x_times * pregeneration_frac amount of data will be generated. However, in postprocessing (1 + gen_x_times) % of original data will be returned
gan_params: dict - params for GAN training
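The role of bot_filter_quantile and top_filter_quantile can be illustrated with a small pandas sketch. It assumes that the bounds are computed per column from a reference dataframe and that generated rows outside those bounds are dropped. The helper quantile_filter is hypothetical and is not the tabgan API; the library's actual postprocessing may differ in detail.

```python
import pandas as pd


def quantile_filter(generated, reference, bot_q=0.001, top_q=0.999):
    """Drop generated rows with any value outside the [bot_q, top_q]
    quantile range of the matching reference column (hypothetical helper)."""
    mask = pd.Series(True, index=generated.index)
    for col in generated.columns:
        lower = reference[col].quantile(bot_q)
        upper = reference[col].quantile(top_q)
        # A row survives only if every column is inside its bounds.
        mask &= generated[col].between(lower, upper)
    return generated[mask].reset_index(drop=True)


# Reference values 0..100; wide quantiles are used so the effect is visible.
reference = pd.DataFrame({"A": range(101)})
generated = pd.DataFrame({"A": [-5, 20, 50, 80, 120]})

kept = quantile_filter(generated, reference, bot_q=0.1, top_q=0.9)
print(kept["A"].tolist())
```

With the defaults of 0.001 and 0.999 only extreme outliers are removed, which is one reason the output can be smaller than gen_x_times suggests.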

Parameters of the generate_data_pipe method:

train_df: pd.DataFrame - train dataframe, with the target kept separately
target: pd.DataFrame - input target for the train dataset
test_df: pd.DataFrame - test dataframe; the newly generated train dataframe should be close to it
deep_copy: bool = True - whether to copy the input dataframes. If False, the input dataframes will be overwritten
only_adversarial: bool = False - if True, only adversarial filtering of the train dataframe is performed
use_adversarial: bool = True - whether to perform adversarial filtering
only_generated_data: bool = False - if True, return only the newly generated data, without concatenating the input train dataframe. Note that the examples in this document pass it to the sampler constructor rather than to generate_data_pipe
@return: -> Tuple[pd.DataFrame, pd.DataFrame] - the newly generated train dataframe and the corresponding target (new_train and new_target in the examples)

Thus, you may use this library to improve your dataset quality:

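import numpy as np
import pandas as pd
import sklearn.datasets
import sklearn.ensemble
import sklearn.metrics
import sklearn.model_selection
from tabgan.sampler import OriginalGenerator, GANGenerator


def fit_predict(clf, X_train, y_train, X_test, y_test):
    # Fit on the (possibly augmented) train set and report ROC AUC on the test set.
    clf.fit(X_train, np.ravel(y_train))
    return sklearn.metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])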


if __name__ == "__main__":
    dataset = sklearn.datasets.load_breast_cancer()
    clf = sklearn.ensemble.RandomForestClassifier(n_estimators=25, max_depth=6)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        pd.DataFrame(dataset.data), pd.DataFrame(dataset.target, columns=["target"]), test_size=0.33, random_state=42)
    print("initial metric", fit_predict(clf, X_train, y_train, X_test, y_test))

    new_train1, new_target1 = OriginalGenerator().generate_data_pipe(X_train, y_train, X_test, )
    print("OriginalGenerator metric", fit_predict(clf, new_train1, new_target1, X_test, y_test))

    new_train1, new_target1 = GANGenerator().generate_data_pipe(X_train, y_train, X_test, )
    print("GANGenerator metric", fit_predict(clf, new_train1, new_target1, X_test, y_test))

Timeseries GAN generation TimeGAN

You can easily adjust the code to generate multidimensional time-series data. The approach extracts the day, month and year from a date column. The example below shows how to use it:

import pandas as pd
import numpy as np
from tabgan.utils import get_year_mnth_dt_from_date, make_two_digit, collect_dates
from tabgan.sampler import OriginalGenerator, GANGenerator


train_size = 100
train = pd.DataFrame(
        np.random.randint(-10, 150, size=(train_size, 4)), columns=list("ABCD")
    )
min_date = pd.to_datetime('2019-01-01')
max_date = pd.to_datetime('2021-12-31')
d = (max_date - min_date).days + 1

# pd.np was removed from pandas, so call NumPy directly
train['Date'] = min_date + pd.to_timedelta(np.random.randint(d, size=train_size), unit='d')
train = get_year_mnth_dt_from_date(train, 'Date')

new_train, new_target = GANGenerator(gen_x_times=1.1, cat_cols=['year'], bot_filter_quantile=0.001,
                                     top_filter_quantile=0.999,
                                     is_post_process=True, pregeneration_frac=2, only_generated_data=False).\
                                     generate_data_pipe(train.drop('Date', axis=1), None,
                                                        train.drop('Date', axis=1)
                                                                    )
new_train = collect_dates(new_train)
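The date handling described above (split a date into year, month and day before generation, then reassemble it afterwards) can be sketched with pandas alone. The helpers split_date and join_date below are hypothetical stand-ins written for illustration; they are not the tabgan.utils functions, whose column names and behavior may differ.

```python
import pandas as pd


def split_date(df, date_col="Date"):
    """Add integer year, month and day columns derived from date_col."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day"] = dates.dt.day
    return out


def join_date(df, date_col="Date"):
    """Rebuild date_col from year, month and day columns.
    Impossible combinations (e.g. February 30) become NaT."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[["year", "month", "day"]], errors="coerce")
    return out.drop(columns=["year", "month", "day"])


df = pd.DataFrame({"A": [1, 2], "Date": ["2019-01-15", "2021-12-31"]})
parts = split_date(df)
restored = join_date(parts.drop(columns="Date"))
print(restored)
```

Because a generator treats year, month and day as independent columns, it can produce combinations that are not real calendar dates, so the reassembly step should tolerate invalid rows, as errors="coerce" does here.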

Experiments

Datasets and experiment design

Running experiment

To run the experiments, follow these steps:

Clone the repository. All required datasets are stored in the ./Research/data folder.
Install the requirements: pip install -r requirements.txt
Run all experiments: python ./Research/run_experiment.py. You may add more datasets and adjust the validation type and categorical encoders.
Observe the metrics across all experiments in the console or in ./Research/results/fit_predict_scores.txt

Acknowledgments

The author would like to thank the Open Data Science community [8] for many valuable discussions and educational help in the growing field of machine and deep learning. Special thanks also go to Sber [9] for the opportunity to work on such tasks and for providing computational resources.

References

[1] Jonathan Hui. GAN — What is Generative Adversarial Networks GAN? (2018), medium article

[2] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. Generative Adversarial Networks (2014). arXiv:1406.2661

[3] Lei Xu, Kalyan Veeramachaneni. Synthesizing Tabular Data using Generative Adversarial Networks (2018). arXiv:1811.11264v1 [cs.LG]

[4] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular Data using Conditional GAN (2019). arXiv:1907.00503v2 [cs.LG]

[5] Denis Vorotyntsev. Benchmarking Categorical Encoders (2019). Medium post

[6] Insaf Ashrapov. GAN-for-tabular-data (2020). Github repository.

[7] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, Timo Aila. Analyzing and Improving the Image Quality of StyleGAN (2019) arXiv:1912.04958v2 [cs.CV]

[8] ODS.ai: Open data science (2020), https://ods.ai/

[9] Sber (2020), https://www.sberbank.ru/
