A synthetic tabular data generation library.
GenTab
Synthetic Tabular Data Generation Library
Overview
GenTab is a Python library for generating synthetic tabular data. It offers a range of statistical, Machine Learning (ML), and Deep Learning (DL) methods that capture the patterns in real datasets and replicate them in synthetic data. Its applications include pre-processing of tabular datasets, data balancing, and resampling, among others.
Features
:nut_and_bolt: Pre-process your data.
:clock130: State-of-the-art models.
:recycle: Easy to use and customize.
Install
The gentab library is available via pip. We recommend installing it in a virtual environment to avoid conflicts with other software on your machine.
pip install gentab
Available Generators
Below is the list of the generators currently available in the library.
Linear
Model | Example | Paper |
---|---|---|
Random Over-Sampling | link | |
SMOTE | link | |
ADASYN | link | |
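To illustrate the simplest technique in the table above, here is a minimal sketch of random over-sampling using only the Python standard library. It is not GenTab's implementation: the function name `random_oversample` and the toy rows and labels are hypothetical and exist only for this example. The idea is to duplicate randomly chosen minority-class rows until every class is as large as the biggest one.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Indices of the rows that belong to this class.
        pool = [i for i, label in enumerate(labels) if label == cls]
        # Sample with replacement to close the gap to the majority size.
        for i in rng.choices(pool, k=target - count):
            out_rows.append(rows[i])
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: four "attack" rows and a single "penalty" row.
rows = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.4, 1.2], [5.0, 7.0]]
labels = ["attack", "attack", "attack", "attack", "penalty"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Because rows are only duplicated, this approach adds no new information. SMOTE and ADASYN instead interpolate between neighboring minority rows to create new points.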
Copula
Model | Example | Paper |
---|---|---|
Gaussian Copula | link | |
AE
Model | Example | Paper |
---|---|---|
TVAE | link | |
GAN
Model | Example | Paper |
---|---|---|
CTGAN | link | |
CTAB-GAN | link | |
CTAB-GAN+ | link | |
Diffusion
Model | Example | Paper |
---|---|---|
ForestDiffusion | link | |
LLM
Model | Example | Paper |
---|---|---|
GReaT | link | |
Tabula | link | |
Hybrid
Model | Example | Papers |
---|---|---|
Copula GAN | link link | |
AutoDiffusion | link | |
Examples
Generation
from gentab.generators import AutoDiffusion
from gentab.evaluators import MLP
from gentab.data import Config, Dataset
from gentab.utils import console

# Load the dataset described by the configuration file.
config = Config("configs/playnet.json")
dataset = Dataset(config)

# Reduce the size of selected classes using per-class ratios.
dataset.reduce_size(
    {
        "left_attack": 0.97,
        "right_attack": 0.97,
        "right_transition": 0.9,
        "left_transition": 0.9,
        "time_out": 0.8,
        "left_penal": 0.5,
        "right_penal": 0.5,
    }
)

# Merge related classes under a single label.
dataset.merge_classes(
    {
        "attack": ["left_attack", "right_attack"],
        "transition": ["left_transition", "right_transition"],
        "penalty": ["left_penal", "right_penal"],
    }
)

# Reduce the dataset's memory footprint, then inspect the class balance.
dataset.reduce_mem()
console.print(dataset.class_counts(), dataset.row_count())

# Fit the generator and produce synthetic rows.
generator = AutoDiffusion(dataset)
generator.generate()
console.print(dataset.generated_class_counts(), dataset.generated_row_count())

# Evaluate the generated data with an MLP evaluator.
evaluator = MLP(generator)
evaluator.evaluate()

# Persist the results.
dataset.save_to_disk(generator)
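The mapping passed to `merge_classes` above lists, for each new class name, the old labels it absorbs. As a rough illustration of that relabeling idea, the sketch below applies the same mapping to a plain list of labels. It is an assumption about the intended behavior, not GenTab's code: `merge_labels` is a hypothetical helper, and the sample labels are made up.

```python
from collections import Counter


def merge_labels(labels, mapping):
    """Relabel classes: `mapping` maps each new name to the old names it absorbs."""
    # Invert {"new": ["old_a", "old_b"]} into {"old_a": "new", "old_b": "new"}.
    old_to_new = {old: new for new, olds in mapping.items() for old in olds}
    # Labels that the mapping does not mention are kept unchanged.
    return [old_to_new.get(label, label) for label in labels]


# Hypothetical sample of labels using the class names from the example.
labels = ["left_attack", "right_attack", "time_out", "left_penal", "right_transition"]
merged = merge_labels(
    labels,
    {
        "attack": ["left_attack", "right_attack"],
        "transition": ["left_transition", "right_transition"],
        "penalty": ["left_penal", "right_penal"],
    },
)
print(Counter(merged))
```

Merging left and right variants like this reduces the number of classes, which usually makes the per-class counts less sparse before generation.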
Tuning
from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")
dataset = Dataset(config)

# Fold the labels that carry a trailing period into their canonical classes.
dataset.merge_classes({"<=50K": ["<=50K."], ">50K": [">50K."]})
dataset.reduce_mem()

generator = AutoDiffusion(dataset)
evaluator = LightGBM(generator)

trials = 10
timeout = 60 * 60 * 8  # 8 hours, expressed in seconds

tuner = AutoDiffusionTuner(evaluator, trials, timeout=timeout)
tuner.tune()
tuner.save_to_disk()
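The tuning example bounds the search with both a trial count and a timeout. The sketch below shows, in generic terms, how such a double budget can be enforced in a search loop. It is not how `AutoDiffusionTuner` works internally: `budgeted_search`, `objective`, `sample_params`, the `lr` parameter, and the toy score are all hypothetical names invented for this illustration.

```python
import random
import time


def budgeted_search(objective, sample_params, trials, timeout, clock=time.monotonic):
    """Evaluate up to `trials` candidates, stopping early once `timeout` seconds pass."""
    start = clock()
    best_params, best_score = None, float("-inf")
    completed = 0
    for _ in range(trials):
        if clock() - start >= timeout:
            break  # time budget exhausted before all trials ran
        params = sample_params()
        score = objective(params)
        completed += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, completed


rng = random.Random(0)
timeout = 60 * 60 * 8  # 8 hours expressed in seconds (28800)

best_params, best_score, completed = budgeted_search(
    objective=lambda p: -(p["lr"] - 0.01) ** 2,  # toy score that peaks at lr = 0.01
    sample_params=lambda: {"lr": rng.uniform(0.001, 0.1)},
    trials=10,
    timeout=timeout,
)
print(completed, best_params, best_score)
```

Whichever limit is reached first ends the search, so a generous timeout with a small trial count behaves like a plain fixed-trial search.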