NoGAN Tabular Synthetic Data Generation

NOGAN SYNTHESIZER

NoGANSynthesizer is a library that generates synthetic tabular data using multivariate binning. It offers a faster, more accurate, and less complex alternative to GANs.

Class

  • NoGANSynthesizer: Synthetic data generator that is fitted on a tabular dataset

Functions

  • wrap_category_columns: Function to compress all specified categorical columns into one
  • unwrap_category_columns: Function to expand all wrapped categorical columns

Installation

The package can be installed with

pip install nogan_synthesizer

Tests

The tests can be run by cloning the repo and running:

pytest tests

If the tests fail to run, install the package locally first and then run them again:

pip install -e .

Usage

Start by importing the class, the preprocessing functions, and the evaluation functions:

from nogan_synthesizer import NoGANSynth
from nogan_synthesizer.preprocessing import wrap_category_columns, unwrap_category_columns
from genai_evaluation import multivariate_ecdf, ks_statistic

Assume we have a pandas dataframe (real_data) with some categorical columns, and we want to generate synthetic data based on it. We first preprocess the categorical columns. This returns the preprocessed (wrapped) real dataset together with two dictionaries: one mapping each flag vector index to its key, and one mapping each key back to its index.

cat_cols = [category columns list...]
wrapped_real_data, idx_to_key, key_to_idx = \
                        wrap_category_columns(real_data, cat_cols)
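
To make the idea of wrapping concrete, here is a minimal sketch of how several categorical columns could be compressed into one integer-coded column and later expanded again. It uses only pandas. The helper names wrap_columns_sketch and unwrap_columns_sketch, the column name cat_label, and the toy dataframe are all hypothetical; this is an illustration of the concept, not the package's actual implementation.

```python
import pandas as pd

def wrap_columns_sketch(df, cat_cols):
    """Hypothetical helper: merge several categorical columns into one coded column."""
    # One composite key (a tuple of strings) per row.
    keys = list(zip(*(df[col].astype(str) for col in cat_cols)))
    # Give each distinct key an integer index, in order of first appearance.
    key_to_idx = {key: idx for idx, key in enumerate(dict.fromkeys(keys))}
    idx_to_key = {idx: key for key, idx in key_to_idx.items()}
    wrapped = df.drop(columns=cat_cols).copy()
    wrapped["cat_label"] = [key_to_idx[key] for key in keys]  # assumed column name
    return wrapped, idx_to_key, key_to_idx

def unwrap_columns_sketch(wrapped, idx_to_key, cat_cols):
    """Hypothetical helper: expand the coded column back into the original columns."""
    decoded = [idx_to_key[int(idx)] for idx in wrapped["cat_label"]]
    restored = wrapped.drop(columns=["cat_label"]).copy()
    for pos, col in enumerate(cat_cols):
        restored[col] = [key[pos] for key in decoded]
    return restored

# Toy data, invented for this example.
demo = pd.DataFrame({
    "age": [30, 41, 30],
    "city": ["Paris", "Oslo", "Paris"],
    "plan": ["free", "pro", "free"],
})
demo_cat_cols = ["city", "plan"]
demo_wrapped, demo_idx_to_key, demo_key_to_idx = wrap_columns_sketch(demo, demo_cat_cols)
demo_restored = unwrap_columns_sketch(demo_wrapped, demo_idx_to_key, demo_cat_cols)
```

Keeping both dictionaries makes the round trip cheap: one direction encodes the real data, the other decodes generated rows.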

We then fit the NoGANSynth model on the wrapped dataset and generate synthetic data:

nogan = NoGANSynth(wrapped_real_data)
nogan.fit()

n_synth_rows = len(real_data)
synth_data = nogan.generate_synthetic_data(no_of_rows=n_synth_rows)
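
For intuition about what a binning-based generator does, the sketch below shows one generic way to synthesize rows from multivariate bins: assign each real row to a bin, draw bins in proportion to their row counts, then sample uniformly inside each drawn bin. The function binning_synth_sketch, its parameters, and the quantile-based bin edges are assumptions made for illustration with NumPy only; the package's own algorithm may differ in every detail.

```python
import numpy as np

def binning_synth_sketch(data, n_bins=10, n_rows=100, seed=0):
    """Illustrative only: sample multivariate bins, then points inside them."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_features = data.shape[1]
    # Per-feature bin edges taken from quantiles (an assumption of this sketch).
    edges = [np.quantile(data[:, j], np.linspace(0, 1, n_bins + 1))
             for j in range(n_features)]
    # Map every real row to a multivariate bin: one bin index per feature.
    idx = np.column_stack([
        np.clip(np.searchsorted(edges[j], data[:, j], side="right") - 1,
                0, n_bins - 1)
        for j in range(n_features)
    ])
    # Count how many real rows fall into each occupied bin.
    bins, counts = np.unique(idx, axis=0, return_counts=True)
    # Draw bins in proportion to their counts.
    chosen = bins[rng.choice(len(bins), size=n_rows, p=counts / counts.sum())]
    # Sample uniformly between the edges of each chosen bin, feature by feature.
    synth = np.empty((n_rows, n_features))
    for j in range(n_features):
        low = edges[j][chosen[:, j]]
        high = edges[j][chosen[:, j] + 1]
        synth[:, j] = rng.uniform(low, high)
    return synth

# Toy numeric data, invented for this example.
demo_real = np.random.default_rng(1).normal(size=(500, 2))
demo_synth = binning_synth_sketch(demo_real, n_bins=8, n_rows=200, seed=42)
```

Because only occupied bins are sampled, the generated rows stay inside regions where real data was observed, and every value lies within the observed range of its column.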

We can then compare the synthetic and real data distributions using the genai_evaluation package:

_, ecdf_val1, ecdf_synth = \
            multivariate_ecdf(wrapped_real_data, 
                              synth_data, 
                              n_nodes = 1000,
                              verbose = True,
                              random_seed=42)

ks_stat = ks_statistic(ecdf_val1, ecdf_synth)                              
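
As a rough guide to reading the result, a Kolmogorov-Smirnov style statistic is commonly defined as the largest absolute gap between two ECDFs evaluated at the same points, so values near 0 indicate similar distributions. The sketch below computes that quantity with NumPy. The function ks_distance_sketch and the toy ECDF values are hypothetical, and whether genai_evaluation computes exactly this is an assumption.

```python
import numpy as np

def ks_distance_sketch(ecdf_a, ecdf_b):
    """Hypothetical helper: largest absolute gap between two ECDFs on shared nodes."""
    ecdf_a = np.asarray(ecdf_a, dtype=float)
    ecdf_b = np.asarray(ecdf_b, dtype=float)
    if ecdf_a.shape != ecdf_b.shape:
        raise ValueError("both ECDFs must be evaluated at the same nodes")
    return float(np.max(np.abs(ecdf_a - ecdf_b)))

# Toy ECDF values at four shared nodes, invented for this example.
ecdf_real_demo = [0.10, 0.35, 0.60, 1.00]
ecdf_synth_demo = [0.12, 0.30, 0.65, 1.00]
demo_ks = ks_distance_sketch(ecdf_real_demo, ecdf_synth_demo)
```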

Once we are satisfied with the evaluation results, we can unwrap the categorical columns of the generated synthetic dataset using the flag vector index to key dictionary created earlier:

unwrapped_synth_data = unwrap_category_columns(synth_data, idx_to_key, cat_cols)

Motivation

The motivation for this package comes from Dr. Vincent Granville's paper "Generative AI Technology Break-through: Spectacular Performance of New Synthesizer".

If you have any tips or suggestions, please contact us by email.

History

0.1.0 (2023-09-19)

  • First release on PyPI.

0.1.1 (2023-09-27)

Fixed

  • Resolved issues with single categorical columns

0.1.2 (2023-09-27)

Feature

  • Added flexible Uniform and Gaussian sampling for columns in the generate_synthetic_data method

0.1.3 (2023-10-10)

Fixed

  • Resolved issues with float column when selected as category column

0.1.4 (2023-10-16)

Fixed

  • Resolved issues with brackets "(" & ")" in category column values

0.1.5 (2023-10-24)

Feature

  • Added gen random seed, which can be set during generation
