
A synthetic tabular data generation library.

GenTab

Synthetic Tabular Data Generation Library

Overview

GenTab is a Python library for generating synthetic tabular data. It provides a range of statistical, Machine Learning (ML), and Deep Learning (DL) methods that learn the patterns in a real dataset and reproduce them in synthetic data. Its applications include pre-processing of tabular datasets, data balancing, and resampling, among others.

Features

Pre-process your data.

State-of-the-art models.

Easy to use and customize.

Install

The gentab library can be installed with pip. We recommend installing it in a virtual environment to avoid conflicts with other software on your machine.

pip install gentab
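
To confirm that the install landed in the environment you expect, a short check like the one below may help. It is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and is not part of gentab.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical helper, not part of gentab: report what the active environment sees.
found = installed_version("gentab")
if found is None:
    print("gentab is not installed in this environment")
else:
    print(f"gentab {found} is installed")
```

If the message says the package is missing right after installing it, the interpreter running the check is probably not the one from the virtual environment.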

Available Generators

Below is the list of the generators currently available in the library.

Linear

Model                 Example        Paper
Random Over-Sampling  Open In Colab  link
SMOTE                 Open In Colab  link
ADASYN                Open In Colab  link
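
For readers unfamiliar with this family, random over-sampling is the simplest of the three: rows from the smaller classes are duplicated at random until every class is as large as the biggest one. The sketch below illustrates the idea with the standard library only. It is not GenTab's implementation, and the function name `random_oversample` and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, members in by_class.items():
        for _ in range(target - len(members)):
            out_rows.append(rng.choice(members))  # sample with replacement
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (illustrative only): three rows of class "a", one row of class "b".
rows = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [5.0, 4.0]]
labels = ["a", "a", "a", "b"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

SMOTE and ADASYN differ in that they create new points by interpolating between neighbouring minority-class rows rather than copying existing ones.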

PDF

Model            Example        Paper
Gaussian Copula  Open In Colab  link

AE

Model  Example        Paper
TVAE   Open In Colab  link

GAN

Model      Example        Paper
CTGAN      Open In Colab  link
CTAB-GAN   Open In Colab  link
CTAB-GAN+  Open In Colab  link

Diffusion

Model            Example        Paper
ForestDiffusion  Open In Colab  link

LLM

Model   Example        Paper
GReaT   Open In Colab  link
Tabula  Open In Colab  link

Hybrid

Model          Example        Papers
Copula GAN     Open In Colab  link, link
AutoDiffusion  Open In Colab  link

Examples

Generation

from gentab.generators import AutoDiffusion
from gentab.evaluators import MLP
from gentab.data import Config, Dataset
from gentab.utils import console

# Load the dataset described by the JSON configuration file.
config = Config("configs/playnet.json")

dataset = Dataset(config)

# Reduce the size of the listed classes by the given per-class ratios.
dataset.reduce_size(
    {
        "left_attack": 0.97,
        "right_attack": 0.97,
        "right_transition": 0.9,
        "left_transition": 0.9,
        "time_out": 0.8,
        "left_penal": 0.5,
        "right_penal": 0.5,
    }
)

# Merge the left/right variants of each event into a single class.
dataset.merge_classes(
    {
        "attack": ["left_attack", "right_attack"],
        "transition": ["left_transition", "right_transition"],
        "penalty": ["left_penal", "right_penal"],
    }
)
dataset.reduce_mem()

# Print the class and row counts before generation.
console.print(dataset.class_counts(), dataset.row_count())

# Generate synthetic data, then print the counts of the generated rows.
generator = AutoDiffusion(dataset)
generator.generate()
console.print(dataset.generated_class_counts(), dataset.generated_row_count())

# Evaluate the generated data with the MLP evaluator.
evaluator = MLP(generator)
evaluator.evaluate()

# Store the generated dataset on disk.
dataset.save_to_disk(generator)
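
The mapping passed to merge_classes reads as "new class name: list of old class names". The sketch below shows that relabelling on a plain list of labels. It assumes this reading of the mapping, based only on the names in the example; `merge_labels` is a hypothetical helper, not a GenTab function, and the sample labels are made up.

```python
from collections import Counter


def merge_labels(labels, mapping):
    """Replace each label listed in a mapping value with the corresponding mapping key."""
    # Invert {"new": ["old1", "old2"]} into {"old1": "new", "old2": "new"}.
    old_to_new = {old: new for new, olds in mapping.items() for old in olds}
    # Labels that the mapping does not mention are left unchanged.
    return [old_to_new.get(label, label) for label in labels]


# Illustrative labels only; the real dataset is loaded by Dataset(config).
labels = ["left_attack", "right_attack", "left_penal", "time_out", "right_transition"]
mapping = {
    "attack": ["left_attack", "right_attack"],
    "transition": ["left_transition", "right_transition"],
    "penalty": ["left_penal", "right_penal"],
}

merged = merge_labels(labels, mapping)
print(Counter(merged))
```

Under this reading, a class such as time_out that appears in no list keeps its original name.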

Tuning

from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")

dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

generator = AutoDiffusion(dataset)

evaluator = LightGBM(generator)

trials = 10
timeout = 60 * 60 * 8  # 8 hours expressed in seconds; avoids shadowing the stdlib time module
tuner = AutoDiffusionTuner(evaluator, trials, timeout=timeout)
tuner.tune()
tuner.save_to_disk()
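
The tuner above receives both a trial count and a time budget. One common way to combine the two is to stop at whichever limit is reached first. The sketch below illustrates that pattern with a toy objective. Whether AutoDiffusionTuner behaves this way is an assumption; the function `tune`, the `lr` parameter, and the objective are hypothetical and unrelated to GenTab's internals.

```python
import random
import time


def tune(objective, sample_params, trials, timeout):
    """Run up to `trials` evaluations, stopping early once `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    best_params, best_score = None, float("-inf")
    completed = 0
    for _ in range(trials):
        if time.monotonic() >= deadline:
            break  # time budget exhausted before all trials ran
        params = sample_params()
        score = objective(params)
        completed += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, completed


rng = random.Random(0)
best_params, best_score, completed = tune(
    objective=lambda p: -(p["lr"] - 0.01) ** 2,  # toy objective, highest at lr = 0.01
    sample_params=lambda: {"lr": rng.uniform(0.0, 0.1)},
    trials=10,
    timeout=60 * 60 * 8,  # same budget as the example above
)
print(completed, best_params, best_score)
```

With a cheap objective like this one, all ten trials finish long before the time budget matters; the deadline only comes into play when a single trial is expensive.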

Loading Stored Synthetic Datasets

from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")

dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

# Load previously saved dataset...
generator = AutoDiffusion(dataset)
generator.load_from_disk()

# Do work with previously generated but not tuned dataset...
evaluator = LightGBM(generator)
evaluator.evaluate()
evaluator.evaluate_baseline()

# Load previously tuned and saved dataset (0 trials, since nothing is tuned here)...
tuner = AutoDiffusionTuner(evaluator, 0)
tuner.load_from_disk()

# Do work with previously tuned dataset...
evaluator.evaluate()
evaluator.evaluate_baseline()
