
The tool uncovers patterns, trends, and correlations hidden within your production datasets.

Syngen is an unsupervised tool for generating tabular data, based on a variational autoencoder (VAE). A Bayesian Gaussian mixture model is used to further disentangle the latent space. Linked tables, i.e., tables that share a key, can also be generated using a simple statistical approach.

Quick Start guide

Use pip to install the library:

pip install syngen

Training and inference are separate processes, each with its own CLI entry point. The training entry point takes the path to the original table, a metadata JSON file or a table name, and the hyperparameters to use. To start training with sensible defaults, run

train PATH_TO_ORIGINAL_CSV --table_name TABLE_NAME

This will train a model and save the model artifacts to disk.

To generate data simply call

infer SIZE TABLE_NAME

This will create a CSV file with the synthetic table at ./model_artifacts/tmp_store/TABLE_NAME/merged_infer.csv
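
Once infer has finished, the synthetic table can be read like any other CSV file. The following is a minimal sketch using only the Python standard library. The helper names synthetic_csv_path and load_synthetic_rows are illustrative and not part of syngen, and the path layout is taken from the sentence above, relative to the directory in which infer was run.

```python
import csv
from pathlib import Path


def synthetic_csv_path(table_name, root="."):
    """Return the path where infer is described as writing its output.

    Hypothetical helper, not part of syngen. `root` is the directory
    in which `infer` was run.
    """
    return Path(root) / "model_artifacts" / "tmp_store" / table_name / "merged_infer.csv"


def load_synthetic_rows(table_name, root="."):
    """Read the synthetic table into a list of dicts, one per row.

    All values are returned as strings, exactly as stored in the CSV.
    """
    path = synthetic_csv_path(table_name, root)
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))
```

For the quick example, load_synthetic_rows("Churn") would return the generated rows, provided infer has already been run in the current directory.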

Here is a quick example:

pip install syngen
train ./data/Churn_modelling.csv --table_name Churn
infer 5000 Churn

Available hyperparameters

Training

You can add flexibility to the training and inference processes using additional hyperparameters.

train PATH_TO_ORIGINAL_CSV --metadata_path PATH_TO_METADATA_JSON --table_name TABLE_NAME --epochs INT --row_limit INT --dropna BOOL --keys_mode BOOL

  • PATH_TO_ORIGINAL_CSV – a path to the CSV table that you want to use as a reference
  • metadata_path – a path to the JSON file containing the metadata (see below)
  • table_name – an arbitrary string used to name the directories. If a table name is provided and --keys_mode is False, the --metadata_path argument is optional
  • epochs – the number of training epochs. Because early stopping is implemented, a larger value is generally better
  • row_limit – the number of rows to train on. If it is less than the number of rows in the original table, a random subset of that many rows is used
  • dropna – whether to drop rows with at least one missing value
  • keys_mode – whether to train linked tables (see below)
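
When training is launched from a script, it can help to assemble the command from named arguments. The sketch below builds an argument list for the train entry point using only the options listed above. The helper build_train_command is hypothetical and not part of syngen, and it assumes that BOOL values are passed as the literal text True or False.

```python
import shlex


def build_train_command(csv_path, table_name=None, metadata_path=None,
                        epochs=None, row_limit=None, dropna=None, keys_mode=None):
    """Assemble the argument list for the `train` entry point.

    Hypothetical helper, not part of syngen. Options left as None are
    omitted so that the tool's own defaults apply.
    Assumption: BOOL values are written as the text True / False.
    """
    options = {
        "metadata_path": metadata_path,
        "table_name": table_name,
        "epochs": epochs,
        "row_limit": row_limit,
        "dropna": dropna,
        "keys_mode": keys_mode,
    }
    argv = ["train", str(csv_path)]
    for name, value in options.items():
        if value is not None:
            argv += [f"--{name}", str(value)]
    return argv


argv = build_train_command("./data/Churn_modelling.csv", table_name="Churn",
                           epochs=20, dropna=True)
# Print the command as a single shell-quoted line (Python 3.8+).
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) once syngen is installed; the epoch count of 20 is only an illustrative value.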

Inference

You can customize the inference process by calling

infer SIZE TABLE_NAME --run_parallel BOOL --batch_size INT --keys_mode BOOL --metadata_path PATH_TO_METADATA --random_seed INT --print_report BOOL

  • SIZE – the desired number of rows to generate
  • TABLE_NAME – the name of the table, same as in training
  • run_parallel – whether to use multiprocessing (worthwhile for tables with more than 5000 rows)
  • batch_size – if specified, the generation is split into batches, which can reduce RAM usage
  • keys_mode – whether to generate linked tables (see below)
  • metadata_path – a path to the metadata JSON file. If --keys_mode is set to False, the argument is optional
  • random_seed – if specified, generates a reproducible result
  • print_report – whether to generate plots of pairwise distributions and an accuracy matrix, and to print the median accuracy
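
The inference options can be assembled in the same way. The sketch below is a hypothetical helper, not part of syngen. It accepts only the option names listed above, rejects anything else, and assumes that BOOL values are passed as the literal text True or False.

```python
import shlex

# Option names taken from the list above, in the documented order.
INFER_OPTIONS = ("run_parallel", "batch_size", "keys_mode",
                 "metadata_path", "random_seed", "print_report")


def build_infer_command(size, table_name, **options):
    """Assemble the argument list for the `infer` entry point.

    Hypothetical helper, not part of syngen. Unknown option names raise
    ValueError; options that are missing or None are omitted.
    """
    unknown = set(options) - set(INFER_OPTIONS)
    if unknown:
        raise ValueError(f"unknown infer option(s): {sorted(unknown)}")
    argv = ["infer", str(size), table_name]
    for name in INFER_OPTIONS:
        if options.get(name) is not None:
            argv += [f"--{name}", str(options[name])]
    return argv


argv = build_infer_command(5000, "Churn", random_seed=42, batch_size=1000)
# Print the command as a single shell-quoted line (Python 3.8+).
print(shlex.join(argv))
```

Fixing random_seed, as in this call, is what makes repeated runs comparable; the seed and batch size shown are illustrative values only.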

Linked tables generation

To generate linked tables, you need to train and run inference on the tables in a specific order:

Primary-key table (training) -> primary-key table (inference) -> foreign-key table (training) -> foreign-key table (inference)

Set --keys_mode to True in every step. For training and inference of the foreign-key table, provide the metadata as a JSON file with the following structure:

{"table_name": "NAME_OF_FK_TABLE", "fk": {"NAME_OF_FK_COLUMN": {"pk_table": "NAME_OF_PK_TABLE", "pk_column": "NAME_OF_PK_COLUMN (in PK table)"}}}
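
The metadata structure and the four-step order can be sketched in Python as follows. Both helpers are hypothetical and not part of syngen, and the table and column names (customers, orders, id, customer_id) are made up for illustration. The sketch passes --metadata_path only in the foreign-key steps, following the description above; whether the primary-key steps also need a metadata file is not stated here, so treat that as an assumption.

```python
import json


def fk_metadata(fk_table, fk_column, pk_table, pk_column):
    """Build the metadata dict for a foreign-key table (structure shown above)."""
    return {
        "table_name": fk_table,
        "fk": {fk_column: {"pk_table": pk_table, "pk_column": pk_column}},
    }


def linked_table_steps(pk_csv, pk_table, fk_csv, fk_table,
                       metadata_path, pk_size, fk_size):
    """Return the four commands in the required order, as argument lists.

    Assumption: BOOL values are written as the text True / False, and
    metadata is supplied only for the foreign-key table.
    """
    keys = ["--keys_mode", "True"]
    meta = ["--metadata_path", str(metadata_path)]
    return [
        ["train", str(pk_csv), "--table_name", pk_table] + keys,
        ["infer", str(pk_size), pk_table] + keys,
        ["train", str(fk_csv), "--table_name", fk_table] + keys + meta,
        ["infer", str(fk_size), fk_table] + keys + meta,
    ]


# Illustrative names only: an "orders" table referencing "customers".
metadata = fk_metadata("orders", "customer_id", "customers", "id")
print(json.dumps(metadata, indent=2))

steps = linked_table_steps("customers.csv", "customers", "orders.csv", "orders",
                           "orders_metadata.json", 1000, 5000)
for step in steps:
    print(" ".join(step))
```

Before running the foreign-key steps, the dict would need to be saved to the metadata path, for example with json.dump(metadata, open_file).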


