
NoGAN Tabular Synthetic Data Generation


NOGAN SYNTHESIZER


NoGANSynthesizer is a library that generates synthetic tabular data using multivariate binning. It offers a faster, more accurate, and less complex alternative to GANs.

Class

  • NoGANSynth: synthetic data generator that is fitted on a tabular dataset

Functions

  • wrap_category_columns: compresses all specified categorical columns into a single column
  • unwrap_category_columns: expands the wrapped column back into the original categorical columns

Authors

Installation

The package can be installed with

pip install nogan_synthesizer

Tests

The tests can be run by cloning the repo and running:

pytest tests

In case of any issues running the tests, please run them after installing the package locally:

pip install -e .

Usage

Start by importing the class and the helper functions:

from nogan_synthesizer import NoGANSynth
from nogan_synthesizer.preprocessing import wrap_category_columns, unwrap_category_columns
from genai_evaluation import multivariate_ecdf, ks_statistic

Assume we have a pandas DataFrame (real_data) with some categorical columns, and we want to generate synthetic data from it. We first preprocess the categorical columns. This returns the wrapped real dataset together with two dictionaries: one mapping each flag vector index to its key, and its inverse mapping each key to its index.

cat_cols = [...]  # list of the categorical column names in real_data
wrapped_real_data, idx_to_key, key_to_idx = \
                        wrap_category_columns(real_data, cat_cols)
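To make the wrapping step concrete, here is a minimal pure-Python sketch of the general idea: each distinct combination of categorical values is replaced by one integer index, and two dictionaries record the mapping in both directions. The helper name `wrap_sketch`, the list-of-dicts layout, and the `cat_label` column name are hypothetical, chosen only for illustration. This is not the package's implementation, which may encode keys differently.

```python
def wrap_sketch(rows, cat_cols, wrapped_name="cat_label"):
    """Replace the columns in cat_cols with one integer column.

    rows is a list of dicts (one per record). Hypothetical helper for
    illustration only; not the nogan_synthesizer implementation.
    """
    key_to_idx = {}
    idx_to_key = {}
    wrapped_rows = []
    for row in rows:
        # The key is the tuple of categorical values for this row.
        key = tuple(row[col] for col in cat_cols)
        if key not in key_to_idx:
            idx = len(key_to_idx)
            key_to_idx[key] = idx
            idx_to_key[idx] = key
        # Keep the non-categorical columns and add the single index column.
        new_row = {c: v for c, v in row.items() if c not in cat_cols}
        new_row[wrapped_name] = key_to_idx[key]
        wrapped_rows.append(new_row)
    return wrapped_rows, idx_to_key, key_to_idx


# Made-up demo records.
rows = [
    {"age": 31, "sex": "F", "smoker": "no"},
    {"age": 45, "sex": "M", "smoker": "yes"},
    {"age": 52, "sex": "F", "smoker": "no"},
]
wrapped, idx_to_key_demo, key_to_idx_demo = wrap_sketch(rows, ["sex", "smoker"])
```

After wrapping, every row is numeric apart from the single index column, which is what lets a binning-based generator treat the categorical information as one extra dimension.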

We then fit the NoGANSynth model on the wrapped dataset and generate synthetic data:

nogan = NoGANSynth(wrapped_real_data)
nogan.fit()

n_synth_rows = len(real_data)
synth_data = nogan.generate_synthetic_data(no_of_rows=n_synth_rows)
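For readers unfamiliar with binning-based synthesis, the sketch below shows the general idea in pure Python: partition the feature space into grid cells, count the real rows in each cell, then sample cells in proportion to their counts and draw a uniform point inside each sampled cell. The equal-width grid and the names `fit_bins` and `sample_rows` are assumptions made for this illustration. It is not the algorithm implemented by NoGANSynth, whose binning and sampling choices may differ.

```python
import random
from collections import Counter


def fit_bins(rows, n_bins):
    """Count how many rows fall in each cell of an equal-width grid."""
    n_cols = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    # Fall back to width 1.0 for constant columns to avoid dividing by zero.
    widths = [(highs[j] - lows[j]) / n_bins or 1.0 for j in range(n_cols)]

    def cell_of(row):
        # Clamp so the column maximum lands in the last bin.
        return tuple(
            min(int((row[j] - lows[j]) / widths[j]), n_bins - 1)
            for j in range(n_cols)
        )

    counts = Counter(cell_of(r) for r in rows)
    return counts, lows, widths


def sample_rows(counts, lows, widths, n_rows, seed=None):
    """Pick cells proportionally to their counts, then sample uniformly inside."""
    rng = random.Random(seed)
    cells = list(counts)
    weights = [counts[c] for c in cells]
    out = []
    for cell in rng.choices(cells, weights=weights, k=n_rows):
        out.append([
            lows[j] + (cell[j] + rng.random()) * widths[j]
            for j in range(len(cell))
        ])
    return out


# Made-up numeric data with two clusters.
real = [[1.0, 10.0], [2.0, 12.0], [2.5, 11.0], [8.0, 30.0], [9.0, 33.0]]
counts, lows, widths = fit_bins(real, n_bins=4)
synth = sample_rows(counts, lows, widths, n_rows=100, seed=42)
```

Because only occupied cells are ever sampled, the synthetic rows stay inside regions where real data was observed, and no iterative training is needed.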

We can then compare the synthetic and real data distributions using the genai_evaluation package:

_, ecdf_val1, ecdf_synth = \
            multivariate_ecdf(wrapped_real_data, 
                              synth_data, 
                              n_nodes = 1000,
                              verbose = True,
                              random_seed=42)

ks_stat = ks_statistic(ecdf_val1, ecdf_synth)                              
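As background for reading the result, a Kolmogorov-Smirnov distance is, in general terms, the largest absolute gap between two empirical CDFs evaluated at the same points, so values near 0 indicate similar distributions. The one-dimensional sketch below illustrates that definition. The names `ecdf_at` and `ks_sketch` are hypothetical, and genai_evaluation works with multivariate ECDFs, so its internals may differ.

```python
def ecdf_at(sample, points):
    """Fraction of the sample that is <= each evaluation point."""
    n = len(sample)
    return [sum(1 for x in sample if x <= p) / n for p in points]


def ks_sketch(ecdf_a, ecdf_b):
    """Largest absolute difference between two ECDFs on shared points."""
    return max(abs(a - b) for a, b in zip(ecdf_a, ecdf_b))


# Made-up one-dimensional samples.
real_sample = [1.0, 2.0, 3.0, 4.0]
synth_sample = [1.5, 2.5, 3.5, 4.5]
points = sorted(real_sample + synth_sample)

ks_value = ks_sketch(ecdf_at(real_sample, points), ecdf_at(synth_sample, points))
```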

Once we are satisfied with the evaluation results, we can unwrap the categorical columns of the generated synthetic dataset using the flag vector index to key dictionary produced earlier:

unwrapped_synth_data = unwrap_category_columns(synth_data, idx_to_key, cat_cols)
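Conceptually, unwrapping is a dictionary lookup: each integer index in the wrapped column is mapped back to its tuple of categorical values, which is then spread across the original columns. The sketch below shows this under assumed names (`unwrap_sketch`, a list-of-dicts layout, and a `cat_label` column). It is a hypothetical illustration, not the package's implementation.

```python
def unwrap_sketch(rows, idx_to_key, cat_cols, wrapped_name="cat_label"):
    """Expand one integer column back into the original categorical columns.

    Hypothetical helper for illustration only.
    """
    unwrapped = []
    for row in rows:
        # Copy every column except the wrapped index column.
        new_row = {c: v for c, v in row.items() if c != wrapped_name}
        # Look up the tuple of categorical values and restore each column.
        key = idx_to_key[row[wrapped_name]]
        new_row.update(zip(cat_cols, key))
        unwrapped.append(new_row)
    return unwrapped


# Made-up mapping and synthetic rows.
idx_to_key_demo = {0: ("F", "no"), 1: ("M", "yes")}
synth_rows = [{"age": 38.2, "cat_label": 1}, {"age": 50.7, "cat_label": 0}]
restored = unwrap_sketch(synth_rows, idx_to_key_demo, ["sex", "smoker"])
```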

Motivation

The motivation for this package comes from Dr. Vincent Granville's paper "Generative AI Technology Break-through: Spectacular Performance of New Synthesizer".

If you have any tips or suggestions, please contact us by email.

History

0.1.0 (2023-09-19)

  • First release on PyPI.

0.1.1 (2023-09-27)

Fixed

  • Resolved issues with single categorical columns

0.1.2 (2023-09-27)

Feature

  • Added flexible per-column Uniform and Gaussian sampling in the generate_synthetic_data method

0.1.3 (2023-10-10)

Fixed

  • Resolved issues with float column when selected as category column

0.1.4 (2023-10-16)

Fixed

  • Resolved issues with brackets "(" & ")" in category column values

0.1.5 (2023-10-24)

Feature

  • Added a generation random seed that can be set during generation
