A synthetic tabular data generation library.

Project description

GenTab

Synthetic Tabular Data Generation Library

Overview

GenTab is a Python library for generating synthetic tabular data. It provides a range of statistical, Machine Learning (ML), and Deep Learning (DL) methods that learn the patterns in a real dataset and reproduce them in synthetic data. Its applications include pre-processing tabular datasets, balancing classes, and resampling, among others.

Features

Pre-process your data.

State-of-the-art models.

Easy to use and customize.

Install

The gentab library is available on PyPI and can be installed with pip. We recommend installing it in a virtual environment to avoid conflicts with other packages on your machine.

pip install gentab
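After installing, it can help to confirm which version ended up in the active environment. The following is a minimal sketch that uses only the standard library; `installed_version` is a hypothetical helper name for illustration and is not part of gentab.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("gentab")
    if found:
        print(f"gentab version: {found}")
    else:
        print("gentab is not installed in this environment")
```

Running this inside the virtual environment shows whether pip installed the package where you expect it.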

Available Generators

Below is the list of the generators currently available in the library.

Linear

Model | Example | Paper
Random Over-Sampling | Open In Colab | link
SMOTE | Open In Colab | link
ADASYN | Open In Colab | link
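To give a feel for the simplest method in this group, here is a minimal sketch of random over-sampling in plain Python: rows of the smaller classes are duplicated at random until every class matches the largest one. This is an illustration of the general technique, not gentab's implementation; `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Candidate rows to duplicate for this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data: class "a" has two rows, class "b" has three.
rows = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [5.5, 8.5]]
labels = ["a", "a", "b", "b", "b"]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

SMOTE and ADASYN differ in that they interpolate new points between neighbouring minority rows instead of copying existing rows.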

PDF

Model | Example | Paper
Gaussian Copula | Open In Colab | link

AE

Model | Example | Paper
TVAE | Open In Colab | link

GAN

Model | Example | Paper
CTGAN | Open In Colab | link
CTAB-GAN | Open In Colab | link
CTAB-GAN+ | Open In Colab | link

Diffusion

Model | Example | Paper
ForestDiffusion | Open In Colab | link

LLM

Model | Example | Paper
GReaT | Open In Colab | link
Tabula | Open In Colab | link

Hybrid

Model | Example | Papers
Copula GAN | Open In Colab | link, link
AutoDiffusion | Open In Colab | link

Examples

Generation

from gentab.generators import AutoDiffusion
from gentab.evaluators import MLP
from gentab.data import Config, Dataset
from gentab.utils import console

config = Config("configs/playnet.json")

dataset = Dataset(config)
dataset.reduce_size(
    {
        "left_attack": 0.97,
        "right_attack": 0.97,
        "right_transition": 0.9,
        "left_transition": 0.9,
        "time_out": 0.8,
        "left_penal": 0.5,
        "right_penal": 0.5,
    }
)
dataset.merge_classes(
    {
        "attack": ["left_attack", "right_attack"],
        "transition": ["left_transition", "right_transition"],
        "penalty": ["left_penal", "right_penal"],
    }
)
dataset.reduce_mem()

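# Report class counts and row count before generation.
console.print(dataset.class_counts(), dataset.row_count())
generator = AutoDiffusion(dataset)
generator.generate()
# Report class counts and row count of the generated data.
console.print(dataset.generated_class_counts(), dataset.generated_row_count())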

evaluator = MLP(generator)
evaluator.evaluate()

dataset.save_to_disk(generator)
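The dictionary passed to `merge_classes` above maps each new class name to the list of original labels it absorbs. The sketch below applies the same mapping format to a plain Python list of labels. It is an assumption about the semantics based on the example, not gentab's implementation, and `merge_labels` is a hypothetical helper.

```python
def merge_labels(labels, merges):
    """Relabel `labels` using a {new_name: [old_name, ...]} mapping.
    Labels not mentioned in `merges` are left unchanged."""
    # Invert the mapping so each old label points at its new name.
    old_to_new = {old: new for new, olds in merges.items() for old in olds}
    return [old_to_new.get(label, label) for label in labels]


merges = {
    "attack": ["left_attack", "right_attack"],
    "transition": ["left_transition", "right_transition"],
    "penalty": ["left_penal", "right_penal"],
}
labels = ["left_attack", "time_out", "right_penal", "right_transition"]
merged = merge_labels(labels, merges)
print(merged)
```

Here `time_out` passes through untouched because it does not appear in any of the merge lists.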

Tuning

from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")

dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

generator = AutoDiffusion(dataset)

evaluator = LightGBM(generator)

trials = 10
timeout_seconds = 60 * 60 * 8  # 8 hours, expressed in seconds
tuner = AutoDiffusionTuner(evaluator, trials, timeout=timeout_seconds)
tuner.tune()
tuner.save_to_disk()
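The tuner above receives both a trial count and a timeout of 60 * 60 * 8 seconds (8 hours). A common convention for such budgets, assumed here and not confirmed for gentab, is that the search stops at whichever limit is reached first. The sketch below shows that stopping rule in plain Python; `run_trials` and the toy `objective` are hypothetical.

```python
import time


def run_trials(objective, n_trials, timeout, clock=time.monotonic):
    """Call `objective(trial_index)` up to `n_trials` times, stopping early
    once `timeout` seconds have elapsed. Returns (best_score, trials_run)."""
    start = clock()
    best, done = None, 0
    for i in range(n_trials):
        if clock() - start >= timeout:
            break  # time budget exhausted before all trials ran
        score = objective(i)
        if best is None or score > best:
            best = score
        done += 1
    return best, done


def objective(i):
    # Toy objective: the score peaks at trial index 3.
    return -(i - 3) ** 2


best, done = run_trials(objective, n_trials=10, timeout=60 * 60 * 8)
print(best, done)
```

With a generous timeout all ten trials run; with a timeout of zero the loop exits before the first trial.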

Loading Stored Synthetic Datasets

from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")

dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

# Load previously saved dataset...
generator = AutoDiffusion(dataset)
generator.load_from_disk()

# Do work with previously generated but not tuned dataset...
evaluator = LightGBM(generator)
evaluator.evaluate()
evaluator.evaluate_baseline()

# Load previously tuned and saved dataset...
tuner = AutoDiffusionTuner(evaluator, 0)
tuner.load_from_disk()

# Do work with previously tuned dataset...
evaluator.evaluate()
evaluator.evaluate_baseline()

Download files

Source Distribution

gentab-0.1.2.tar.gz (105.3 kB view details)

Uploaded Source

Built Distribution

gentab-0.1.2-py3-none-any.whl (133.0 kB view details)

Uploaded Python 3

File details

Details for the file gentab-0.1.2.tar.gz.

File metadata

  • Download URL: gentab-0.1.2.tar.gz
  • Upload date:
  • Size: 105.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.18

File hashes

Hashes for gentab-0.1.2.tar.gz
Algorithm Hash digest
SHA256 7036bc206158455bf8fec342b6f5fc4a0fdd032407bed736a8f65189f6cb127b
MD5 a2595f4506057819ab423756a6db3e1c
BLAKE2b-256 6d56c39995680f2aef83dd4b69cafa9e0c2739d9afbcaa74cb8666f892572b1d
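A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard library; `sha256_matches` is a hypothetical helper and the local file name is an assumption about where the download was saved.

```python
import hashlib
import os


def sha256_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True if the SHA256 digest of the file at `path` equals `expected_hex`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# SHA256 digest listed above for gentab-0.1.2.tar.gz.
EXPECTED = "7036bc206158455bf8fec342b6f5fc4a0fdd032407bed736a8f65189f6cb127b"

if __name__ == "__main__":
    archive = "gentab-0.1.2.tar.gz"  # assumed local download location
    if os.path.exists(archive):
        print("hash OK" if sha256_matches(archive, EXPECTED) else "hash MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

pip can perform the same check automatically when requirements are pinned with `--hash` entries.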

File details

Details for the file gentab-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: gentab-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 133.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.18

File hashes

Hashes for gentab-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 5e0da06342304f20e469b685f4ce1a3d870dbeae99c6e26d4a4348aaeb044999
MD5 ff567a6ec347638894b0ed9999876156
BLAKE2b-256 946d43666d9314ea9950a73fe7d0f83890921c15f6d59ce5b3c77c120d7e404e
