NoGAN Tabular Synthetic Data Generation
NOGAN SYNTHESIZER
NoGANSynthesizer is a library that generates synthetic tabular data using multivariate binning methods. It offers a faster, more accurate and less complex alternative to GANs.
Class
- NoGANSynth: Synthetic data generator that is fit on a tabular dataset
Functions
- wrap_category_columns: Function to compress all specified categorical columns into one
- unwrap_category_columns: Function to expand all wrapped categorical columns
Authors
- Dr. Vincent Granville - Research
- Rajiv Iyer - Development/Maintenance
Installation
The package can be installed with
pip install nogan_synthesizer
Tests
The tests can be run by cloning the repo and running:
pytest tests
In case of any issues running the tests, please run them after installing the package locally:
pip install -e .
Usage
Start by importing the class, the preprocessing helpers and the evaluation functions:
from nogan_synthesizer import NoGANSynth
from nogan_synthesizer.preprocessing import wrap_category_columns, unwrap_category_columns
from genai_evaluation import multivariate_ecdf, ks_statistic
Assume we have a pandas DataFrame (real_data) with some categorical columns, and we want to generate synthetic data from it. We first preprocess the categorical columns. This returns the preprocessed (wrapped) real dataset together with two dictionaries: one mapping flag vector indices to keys, and one mapping keys back to indices.
cat_cols = [category columns list...]
wrapped_real_data, idx_to_key, key_to_idx = \
wrap_category_columns(real_data, cat_cols)
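To make the idea of wrapping concrete, here is a minimal sketch of compressing several categorical columns into one integer column. It is only an illustration, not the library's implementation: the helper name `toy_wrap`, the column name `cat_label` and the tuple-based keys are all assumptions made for this example.

```python
import pandas as pd

def toy_wrap(df, cat_cols, wrapped_name="cat_label"):
    """Hypothetical helper: collapse several categorical columns into one integer column."""
    # Each row's combination of category values becomes one tuple key.
    keys = list(df[cat_cols].itertuples(index=False, name=None))
    unique_keys = sorted(set(keys))
    # Two lookup tables, mirroring the key_to_idx / idx_to_key pair described above.
    key_to_idx = {key: i for i, key in enumerate(unique_keys)}
    idx_to_key = {i: key for key, i in key_to_idx.items()}
    # Replace the categorical columns with a single integer label column.
    wrapped = df.drop(columns=cat_cols).copy()
    wrapped[wrapped_name] = [key_to_idx[k] for k in keys]
    return wrapped, idx_to_key, key_to_idx

real = pd.DataFrame({
    "age": [23, 41, 35, 23],
    "sex": ["F", "M", "F", "F"],
    "smoker": ["no", "yes", "no", "no"],
})
wrapped, idx_to_key, key_to_idx = toy_wrap(real, ["sex", "smoker"])
```

After wrapping, the dataset is fully numeric, which is what a binning-based synthesizer needs, and the dictionaries keep enough information to restore the original categories later.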
We then fit the NoGANSynth Model on the wrapped dataset and generate synthetic data
nogan = NoGANSynth(wrapped_real_data)
nogan.fit()
n_synth_rows = len(real_data)
synth_data = nogan.generate_synthetic_data(no_of_rows=n_synth_rows)
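For readers unfamiliar with binning-based synthesis, the following sketch shows the general idea in NumPy: cut each column into quantile bins, count how often each multivariate bin occurs, resample bins in proportion to those counts, and draw values uniformly inside each sampled bin. This is an assumption-laden toy, not the NoGANSynth algorithm: the function `toy_binning_synth` and its parameters are hypothetical, and the library's actual binning and sampling may differ.

```python
import numpy as np

def toy_binning_synth(data, n_bins=4, n_rows=100, seed=0):
    """Hypothetical sketch: resample multivariate quantile bins, fill each bin uniformly."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    # Per-column bin edges at equally spaced quantiles, shape (n_bins + 1, d).
    edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1), axis=0)
    # Bin index of every value, per column, clipped to 0 .. n_bins - 1.
    codes = np.empty((n, d), dtype=int)
    for j in range(d):
        pos = np.searchsorted(edges[:, j], data[:, j], side="right") - 1
        codes[:, j] = np.clip(pos, 0, n_bins - 1)
    # Count how often each multivariate bin (a row of codes) occurs.
    bins, counts = np.unique(codes, axis=0, return_counts=True)
    # Sample bins in proportion to their observed counts.
    picked = bins[rng.choice(len(bins), size=n_rows, p=counts / counts.sum())]
    # Draw each coordinate uniformly between the edges of its bin.
    cols = np.arange(d)
    low = edges[picked, cols]
    high = edges[picked + 1, cols]
    return rng.uniform(low, high)

rng = np.random.default_rng(42)
x = rng.normal(size=500)
real = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=500)])
synth = toy_binning_synth(real, n_bins=8, n_rows=500, seed=1)
```

Because whole multivariate bins are resampled rather than columns independently, the dependence between columns is carried over to the synthetic rows, at a resolution set by `n_bins`.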
We can then compare the synthetic and real data distributions using the genai_evaluation package
_, ecdf_val1, ecdf_synth = \
multivariate_ecdf(wrapped_real_data,
synth_data,
n_nodes = 1000,
verbose = True,
random_seed=42)
ks_stat = ks_statistic(ecdf_val1, ecdf_synth)
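As a rough guide to what this comparison measures, the sketch below computes a multivariate empirical CDF for two datasets at a shared set of nodes and takes the largest absolute gap between them as a Kolmogorov-Smirnov style distance. The helpers `toy_ecdf` and `toy_ks` are hypothetical, and choosing the nodes as a random subset of real rows is an assumption of this example; genai_evaluation may select nodes and compute the statistic differently.

```python
import numpy as np

def toy_ecdf(data, nodes):
    """Fraction of rows in `data` that are <= each node in every coordinate."""
    # below[i, k] is True when row i is component-wise <= node k.
    below = (data[:, None, :] <= nodes[None, :, :]).all(axis=2)
    return below.mean(axis=0)

def toy_ks(ecdf_a, ecdf_b):
    """Largest absolute difference between two ECDFs evaluated at the same nodes."""
    return float(np.max(np.abs(ecdf_a - ecdf_b)))

rng = np.random.default_rng(0)
real = rng.normal(size=(300, 2))
same = rng.normal(size=(300, 2))                 # drawn from the same distribution
shifted = rng.normal(loc=1.5, size=(300, 2))     # drawn from a shifted distribution

# Evaluate both ECDFs at the same nodes (here: 50 rows sampled from the real data).
nodes = real[rng.choice(len(real), size=50, replace=False)]
ks_same = toy_ks(toy_ecdf(real, nodes), toy_ecdf(same, nodes))
ks_shifted = toy_ks(toy_ecdf(real, nodes), toy_ecdf(shifted, nodes))
```

The distance lies between 0 and 1. Values near 0 indicate that the two datasets have similar joint distributions at the evaluated nodes, so a good synthesizer should score closer to `ks_same` than to `ks_shifted`.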
Once we are satisfied with the evaluation results, we can unwrap the generated synthetic dataset, that is, expand the wrapped column back into the original categorical columns, using the flag vector index to key dictionary returned earlier
unwrapped_synth_data = unwrap_category_columns(synth_data, idx_to_key, cat_cols)
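Conceptually, unwrapping is a dictionary lookup followed by a column expansion. The sketch below illustrates this with a hand-built index-to-key dictionary. It is not the library's implementation: `toy_unwrap`, the `cat_label` column name and the tuple-valued dictionary are assumptions made for the example.

```python
import pandas as pd

def toy_unwrap(df, idx_to_key, cat_cols, wrapped_name="cat_label"):
    """Hypothetical helper: expand an integer label column back into category columns."""
    # Look up the tuple of category values behind each integer label.
    keys = [idx_to_key[int(i)] for i in df[wrapped_name]]
    restored = pd.DataFrame(keys, columns=cat_cols, index=df.index)
    # Drop the label column and append the restored categorical columns.
    out = df.drop(columns=[wrapped_name]).copy()
    return pd.concat([out, restored], axis=1)

# Hand-built mapping for illustration; in practice it comes from the wrapping step.
idx_to_key = {0: ("F", "no"), 1: ("M", "yes")}
synth = pd.DataFrame({"age": [30.2, 44.9, 25.1], "cat_label": [1, 0, 0]})
unwrapped = toy_unwrap(synth, idx_to_key, ["sex", "smoker"])
```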
Motivation
The motivation for this package comes from Dr. Vincent Granville's paper "Generative AI Technology Break-through: Spectacular Performance of New Synthesizer".
If you have any tips or suggestions, please contact us by email.
History
0.1.0 (2023-09-19)
- First release on PyPI.
0.1.1 (2023-09-27)
Fixed
- Resolved issues with single categorical columns
0.1.2 (2023-09-27)
Feature
- Added flexible Uniform and Gaussian sampling for columns in the generate_synthetic_data method
0.1.3 (2023-10-10)
Fixed
- Resolved issues with float column when selected as category column
0.1.4 (2023-10-16)
Fixed
- Resolved issues with brackets "(" & ")" in category column values
0.1.5 (2023-10-24)
Feature
- Added a random seed that can be set during generation
Hashes for nogan_synthesizer-0.1.5-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 226ab9dca9c392b1a561ef36873b773f95b43e534a0be92e22da171590f78de0
MD5 | 2d58fe4e47289380aa3fec7622ca8c63
BLAKE2b-256 | b33de518f55d8b156b80edf40235325276a35886ccd28ac7520a65c934a4e7d9