Syngen

Syngen is an unsupervised tool for generating tabular data. It produces test data that uses a given table as a template. Most data types are supported, including floats, integers, datetimes, text, categorical and binary values. Linked tables, i.e. tables that share a key, can also be generated using a simple statistical approach.

The tool is based on a variational autoencoder (VAE). A Bayesian Gaussian mixture model is used to further disentangle the latent space.

Getting started

Use pip to install the library:

pip install syngen

Training and inference are separate processes, each with its own CLI entry point. The training entry point receives the path to the original table, a metadata JSON file or a table name, and the hyperparameters to use. To start training with sensible defaults, run

train PATH_TO_ORIGINAL_CSV --table_name TABLE_NAME

This will train a model and save the model artifacts to disk.

To generate data simply call

infer SIZE TABLE_NAME

This will create a csv file with the synthetic table in ./model_artifacts/tmp_store/TABLE_NAME/merged_infer.csv
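Once inference finishes, the generated file can be inspected with the Python standard library. The sketch below builds the output path stated above and counts the data rows. The helper names `synthetic_csv_path` and `count_rows` are illustrative and not part of syngen, and the code assumes the file starts with a header row.

```python
import csv
from pathlib import Path


def synthetic_csv_path(table_name: str, base_dir: str = ".") -> Path:
    """Return the location where `infer` is documented to write its output."""
    return Path(base_dir) / "model_artifacts" / "tmp_store" / table_name / "merged_infer.csv"


def count_rows(csv_path: Path) -> int:
    """Count data rows, assuming the first row is a header (an assumption)."""
    with csv_path.open(newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)


if __name__ == "__main__":
    path = synthetic_csv_path("Churn")
    if path.exists():
        print(f"{path}: {count_rows(path)} synthetic rows")
    else:
        print(f"{path} not found; run `infer` first")
```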

Here is a quick example:

pip install syngen
train ./data/Churn_modelling.csv --table_name Churn
infer 5000 Churn

Features

Training

You can tune the training process with additional hyperparameters.

train PATH_TO_ORIGINAL_CSV --metadata_path PATH_TO_METADATA_JSON --table_name TABLE_NAME --epochs INT --row_limit INT --dropna BOOL

  • PATH_TO_ORIGINAL_CSV – a path to the CSV table that you want to use as a reference
  • metadata_path – a path to the JSON file containing the metadata for linked tables generation
  • table_name – an arbitrary string used to name the directories
  • epochs – the number of training epochs. Because early stopping is implemented, a larger value is better
  • row_limit – the number of rows to train on. A number less than the original table length trains on a random subset of that many rows
  • dropna – whether to drop rows with at least one missing value
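When training is started from a script, the options above can be assembled programmatically. The following is a minimal sketch, not part of syngen: `build_train_command` is a hypothetical helper, the example values (20 epochs, 5000 rows) are arbitrary, and passing booleans as `True`/`False` is an assumption about how the CLI parses `BOOL`.

```python
from typing import List, Optional


def build_train_command(
    csv_path: str,
    table_name: str,
    metadata_path: Optional[str] = None,
    epochs: Optional[int] = None,
    row_limit: Optional[int] = None,
    dropna: Optional[bool] = None,
) -> List[str]:
    """Assemble the `train` call; options left as None are omitted."""
    command = ["train", csv_path, "--table_name", table_name]
    options = {
        "--metadata_path": metadata_path,
        "--epochs": epochs,
        "--row_limit": row_limit,
        "--dropna": dropna,  # sent as "True"/"False"; the accepted spelling is an assumption
    }
    for flag, value in options.items():
        if value is not None:
            command += [flag, str(value)]
    return command


if __name__ == "__main__":
    # Arbitrary example values, chosen only for illustration.
    command = build_train_command(
        "./data/Churn_modelling.csv", "Churn", epochs=20, row_limit=5000
    )
    print(" ".join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine where syngen is installed.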

Inference

You can customize the inference process by calling

infer SIZE TABLE_NAME --run_parallel BOOL --batch_size INT --metadata_path PATH_TO_METADATA --random_seed INT --print_report BOOL

  • SIZE – the desired number of rows to generate
  • TABLE_NAME – the name of the table, same as in training
  • run_parallel – whether to use multiprocessing (worthwhile for tables with more than 5000 rows)
  • batch_size – if specified, the generation is split into batches. This can reduce RAM usage
  • metadata_path – a path to the metadata JSON file used to generate linked tables
  • random_seed – if specified, makes the generated result reproducible
  • print_report – whether to generate plots of pairwise distributions and an accuracy matrix, and to print the median accuracy
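A scripted inference call can guard against mistyped option names before the CLI is invoked. The sketch below is illustrative only: `build_infer_command` and `INFER_OPTIONS` are hypothetical names, the option list is copied from the bullets above, and the seed and batch size in the usage line are arbitrary.

```python
from typing import List

# Option names taken from the list above; nothing else is accepted here.
INFER_OPTIONS = ("run_parallel", "batch_size", "metadata_path", "random_seed", "print_report")


def build_infer_command(size: int, table_name: str, **options) -> List[str]:
    """Assemble the `infer` call from keyword options, rejecting unknown names."""
    unknown = set(options) - set(INFER_OPTIONS)
    if unknown:
        raise ValueError(f"unsupported options: {sorted(unknown)}")
    command = ["infer", str(size), table_name]
    for name in INFER_OPTIONS:  # fixed order keeps the output deterministic
        if name in options:
            command += [f"--{name}", str(options[name])]
    return command


if __name__ == "__main__":
    # Arbitrary example values, chosen only for illustration.
    print(" ".join(build_infer_command(5000, "Churn", random_seed=42, batch_size=1000)))
```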

Linked tables generation

To generate linked tables, you need to train and infer the tables in a specific order:

table with the primary key (training) -> table with the primary key (inference) -> table with the foreign key (training) -> table with the foreign key (inference)

You can train and infer the primary key (PK) table as is. For training and inference of the foreign key (FK) table, you have to provide the metadata as a .json file with the following structure:

{"table_name": "NAME_OF_FK_TABLE", "fk": {"NAME_OF_FK_COLUMN": {"pk_table": "NAME_OF_PK_TABLE", "pk_column": "NAME_OF_PK_COLUMN (in PK table)"}}}
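The metadata structure and the four-step order can be sketched together in Python. Everything named here is hypothetical: the tables `customers` and `orders`, the columns `id` and `customer_id`, the file paths, and the helpers `fk_metadata` and `linked_run_order`. Passing `--table_name` alongside `--metadata_path` for the FK table is also an assumption.

```python
import json
from typing import Dict, List


def fk_metadata(fk_table: str, fk_column: str, pk_table: str, pk_column: str) -> Dict:
    """Build the metadata structure shown above for a single foreign key."""
    return {
        "table_name": fk_table,
        "fk": {fk_column: {"pk_table": pk_table, "pk_column": pk_column}},
    }


def linked_run_order(
    pk_csv: str, pk_table: str, pk_size: int,
    fk_csv: str, fk_table: str, fk_size: int,
    metadata_path: str,
) -> List[List[str]]:
    """Return the four CLI calls in the required order: PK train, PK infer, FK train, FK infer."""
    return [
        ["train", pk_csv, "--table_name", pk_table],
        ["infer", str(pk_size), pk_table],
        # Supplying --table_name together with the metadata file is an assumption.
        ["train", fk_csv, "--metadata_path", metadata_path, "--table_name", fk_table],
        ["infer", str(fk_size), fk_table, "--metadata_path", metadata_path],
    ]


if __name__ == "__main__":
    # Hypothetical table and column names, used only for illustration.
    metadata = fk_metadata("orders", "customer_id", "customers", "id")
    print(json.dumps(metadata, indent=2))
    for command in linked_run_order(
        "customers.csv", "customers", 1000,
        "orders.csv", "orders", 5000,
        "orders_metadata.json",
    ):
        print(" ".join(command))
```

Save the printed JSON to the path given as `metadata_path` before running the FK steps.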
