GANBLR Toolbox

GANBLR Toolbox contains the GANBLR models proposed by Tulip Lab for tabular data generation. The models learn from real data and can then sample fully synthetic data.

Currently, this package contains the following GANBLR models:

  • GANBLR
  • GANBLR++

For a quick start, you can check out this usage example in Google Colab.

Install

We recommend installing ganblr through pip:

pip install ganblr

Alternatively, you can clone the repository and install it from source:

git clone git@github.com:tulip-lab/ganblr.git
cd ganblr
python setup.py install

Usage Example

In this example we load the Adult dataset, which is a built-in demo dataset. We use GANBLR to learn from the real data and then generate synthetic data.

from ganblr import get_demo_data
from ganblr.models import GANBLR

# this is a discrete version of adult since GANBLR requires discrete data.
df = get_demo_data('adult')
x, y = df.values[:,:-1], df.values[:,-1]

model = GANBLR()
model.fit(x, y, epochs=10)

# generate synthetic data
synthetic_data = model.sample(1000)
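
The comment in the example above notes that GANBLR requires discrete data. If you bring your own table with numerical columns and are not using GANBLR++, one option is to bin those columns before calling fit. The following is a minimal sketch of equal-width binning using only the standard library. The helper names `equal_width_edges` and `discretize` are hypothetical and not part of ganblr, and this is not necessarily how the built-in demo dataset was discretized.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Return the n_bins - 1 interior cut points that split [min, max] evenly."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def discretize(values, edges):
    """Map each value to the index of its bin, from 0 to len(edges)."""
    return [bisect_right(edges, v) for v in values]


# Illustrative values only, not taken from the Adult dataset.
ages = [17, 25, 38, 52, 90]
edges = equal_width_edges(ages, n_bins=4)
codes = discretize(ages, edges)
print(edges)  # [35.25, 53.5, 71.75]
print(codes)  # [0, 0, 1, 1, 3]
```

The cut points should be computed on the training data and reused for any data discretized later, so that the same bin index always refers to the same value range.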

Generating synthetic data with GANBLR++ follows the same steps as with GANBLR, but the model takes an additional parameter, numerical_columns, which gives the indices of the numerical columns.

from ganblr import get_demo_data
from ganblr.models import GANBLRPP
import numpy as np

# raw version of adult, which still contains numerical columns
df = get_demo_data('adult-raw')
x, y = df.values[:,:-1], df.values[:,-1]

def is_numerical(dtype):
    return dtype.kind in 'iuf'

column_is_numerical = df.dtypes.apply(is_numerical).values
numerical_columns = np.argwhere(column_is_numerical).ravel()

model = GANBLRPP(numerical_columns)
model.fit(x, y, epochs=10)

# generate synthetic data
synthetic_data = model.sample(1000)

Documentation

You can check the documentation at https://ganblr-docs.readthedocs.io/en/latest/.

Leaderboard

Here we show the results of the TSTR (Training on Synthetic data, Testing on Real data) evaluation on the Adult dataset, based on the experiments in our paper.

TRTR (Train on Real, Test on Real) is used as the baseline for comparison. You are welcome to update this leaderboard.

Model     LR      MLP     RF      XGBT
TRTR      0.8741  0.8561  0.8379  0.8562
GANBLR    0.74    0.842   0.81    0.851
CTGAN     0.787   0.831   0.792   0.839
...       ...     ...     ...     ...
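
The TSTR and TRTR protocols described above can be sketched as follows. This is a minimal illustration and not the evaluation code used in the paper. The function names `tstr_score` and `trtr_score` are hypothetical, accuracy as the score is an assumption, and `MajorityClassifier` is only a stand-in for a real classifier.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty test set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def tstr_score(classifier, x_synth, y_synth, x_real_test, y_real_test):
    """TSTR: fit on synthetic data, score on held-out real data."""
    classifier.fit(x_synth, y_synth)
    return accuracy(y_real_test, classifier.predict(x_real_test))


def trtr_score(classifier, x_real_train, y_real_train, x_real_test, y_real_test):
    """TRTR baseline: fit on real training data, score on held-out real data."""
    classifier.fit(x_real_train, y_real_train)
    return accuracy(y_real_test, classifier.predict(x_real_test))


class MajorityClassifier:
    """Illustrative stand-in: always predicts the most frequent training label."""

    def fit(self, x, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return [self.label_ for _ in x]
```

Any object with `fit(x, y)` and `predict(x)` methods can be passed as `classifier`, for example a scikit-learn `LogisticRegression`. Both scores should be computed on the same real test split so that they are comparable.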

Citation

If you use GANBLR, please cite the following work:

Y. Zhang, N. A. Zaidi, J. Zhou and G. Li, "GANBLR: A Tabular Data Generation Model," 2021 IEEE International Conference on Data Mining (ICDM), 2021, pp. 181-190, doi: 10.1109/ICDM51629.2021.00103.

@inproceedings{ganblr,
    author={Zhang, Yishuo and Zaidi, Nayyar A. and Zhou, Jiahui and Li, Gang},  
    booktitle={2021 IEEE International Conference on Data Mining (ICDM)},   
    title={GANBLR: A Tabular Data Generation Model},   
    year={2021},  
    pages={181-190},  
    doi={10.1109/ICDM51629.2021.00103}
}
@inbook{ganblrpp,
    author = {Yishuo Zhang and Nayyar Zaidi and Jiahui Zhou and Gang Li},
    title = {GANBLR++: Incorporating Capacity to Generate Numeric Attributes and Leveraging Unrestricted Bayesian Networks},
    booktitle = {Proceedings of the 2022 SIAM International Conference on Data Mining (SDM)},
    pages = {298-306},
    doi = {10.1137/1.9781611977172.34},
}
