EPAM Syngen

The tool uncovers patterns, trends, and correlations hidden within your production datasets.

EPAM Syngen is an unsupervised tabular data generation tool. It is useful for generating test data using a given table as a template. Most data types are supported, including floats, integers, datetimes, text, categorical, and binary. Linked tables, i.e., tables sharing a key, can also be generated using a simple statistical approach. Source data may be in CSV or Avro format; it must be stored locally and encoded in UTF-8.
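Since the tool expects locally stored, UTF-8 encoded input, a small pre-flight check can catch encoding problems before training starts. The following sketch is illustrative and not part of the Syngen API; the helper name `is_utf8_csv` is an assumption:

```python
import csv
import tempfile

def is_utf8_csv(path: str) -> bool:
    """Return True if the file decodes as UTF-8 and parses as CSV."""
    try:
        with open(path, encoding="utf-8", newline="") as fh:
            for _ in csv.reader(fh):
                pass
        return True
    except (UnicodeDecodeError, csv.Error):
        return False

# Demo: write a small UTF-8 CSV and verify it.
with tempfile.NamedTemporaryFile("w", suffix=".csv", encoding="utf-8",
                                 delete=False) as tmp:
    tmp.write("id,price\n1,100.5\n2,200.0\n")
    sample = tmp.name

print(is_utf8_csv(sample))  # True for a well-formed UTF-8 CSV
```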

The tool is based on a variational autoencoder (VAE) model. A Bayesian Gaussian mixture model is used to further disentangle the latent space.

Getting started

Use pip to install the library:

pip install syngen

The training and inference processes are separated into two CLI entry points. The training entry point receives the path to the original table or to a metadata file, the table name, and the hyperparameters to use.

To start training with default parameters, run:

train --source PATH_TO_ORIGINAL_CSV \
    --table_name TABLE_NAME

This will train a model and save the model artifacts to disk.

To generate data with default parameters, simply call:

infer --table_name TABLE_NAME

Note that the table name must match the one used during training.
This will create a CSV file with the synthetic table at ./model_artifacts/tmp_store/TABLE_NAME/merged_infer_TABLE_NAME.csv.

Here is a quick example:

pip install syngen
train --source ./example-data/housing.csv --table_name Housing
infer --table_name Housing

As an example, you can use the "Housing" dataset in example-data/housing.csv; this real-world data comes from Kaggle.

Features

Training

You can add flexibility to the training and inference processes using additional hyperparameters.
To train a single table, call:

train --source PATH_TO_ORIGINAL_CSV \
    --table_name TABLE_NAME \
    --epochs INT \
    --row_limit INT \
    --drop_null BOOL \
    --print_report BOOL \
    --batch_size INT

To train multiple linked tables, call:

train --metadata_path PATH_TO_METADATA_YAML

The parameters you can set for the training process:

  • source – required parameter for training of a single table; a path to the file used as a reference
  • table_name – required parameter for training of a single table; an arbitrary string used to name the artifact directories
  • epochs – the number of training epochs. Since an early stopping mechanism is implemented, larger values are generally better
  • row_limit – the number of rows to train over. A value smaller than the original table length randomly subsets that many rows
  • drop_null – whether to drop rows with at least one missing value
  • batch_size – if specified, the training is split into batches. This can save RAM
  • print_report – whether to generate plots for the accuracy and sample reports
  • metadata_path – a path to the metadata file describing the linked tables
  • column_types – may include a categorical section listing the columns a user wants treated as categorical

Parameter requirements for the training process:

  • source - data type - string
  • table_name - data type - string
  • epochs - data type - integer, must be equal to or more than 1, default value is 10
  • row_limit - data type - integer
  • drop_null - data type - boolean, default value - False
  • batch_size - data type - integer, must be equal to or more than 1, default value - 32
  • print_report - data type - boolean, default value is False
  • metadata_path - data type - string
  • column_types - data type - dictionary with the key categorical - the list of columns (data type - string)
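The type and range requirements above can be enforced before invoking the CLI. The following sketch is illustrative only; the function name and validation helper are assumptions, not part of the syngen package:

```python
# Defaults and constraints mirror the documented training requirements.
TRAIN_DEFAULTS = {"epochs": 10, "drop_null": False, "batch_size": 32,
                  "print_report": False}

def validate_train_params(params: dict) -> dict:
    """Merge user parameters over defaults and check documented constraints."""
    merged = {**TRAIN_DEFAULTS, **params}
    if not isinstance(merged.get("table_name"), str):
        raise ValueError("table_name must be a string")
    for key in ("epochs", "batch_size"):
        if not isinstance(merged[key], int) or merged[key] < 1:
            raise ValueError(f"{key} must be an integer >= 1")
    for key in ("drop_null", "print_report"):
        if not isinstance(merged[key], bool):
            raise ValueError(f"{key} must be a boolean")
    return merged

checked = validate_train_params({"table_name": "Housing", "epochs": 20})
print(checked["epochs"], checked["batch_size"])  # 20 32
```

Unspecified parameters fall back to the documented defaults (epochs 10, batch_size 32, drop_null and print_report False).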

Inference (generation)

You can customize the inference process for one table by calling:

infer --size INT \
    --table_name STR \
    --run_parallel BOOL \
    --batch_size INT \
    --random_seed INT \
    --print_report BOOL

For linked tables you can simply call:

infer --metadata_path PATH_TO_METADATA

The parameters you can set for the generation process:

  • size – the desired number of rows to generate
  • table_name – required parameter for inference of a single table; must match the name used in training
  • run_parallel – whether to use multiprocessing (worthwhile for tables with more than 5000 rows)
  • batch_size – if specified, the generation is split into batches. This can save RAM
  • random_seed – if specified, makes the generation reproducible
  • print_report – whether to generate plots for the accuracy and sample reports
  • metadata_path – a path to the metadata file used to generate linked tables

Parameter requirements for the generation process:

  • size - data type - integer, must be equal to or more than 1, default value is 100
  • table_name - data type - string
  • run_parallel - data type - boolean, default value is False
  • batch_size - data type - integer, must be equal to or more than 1
  • random_seed - data type - integer, must be equal to or more than 0
  • print_report - data type - boolean, default value is False
  • metadata_path - data type - string
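As with training, the inference constraints can be checked up front. This is a sketch under the documented requirements; the function name and defaults dictionary are illustrative, not part of the syngen package:

```python
# Defaults and constraints mirror the documented generation requirements.
INFER_DEFAULTS = {"size": 100, "run_parallel": False, "print_report": False}

def validate_infer_params(params: dict) -> dict:
    """Merge user parameters over defaults and check documented constraints."""
    merged = {**INFER_DEFAULTS, **params}
    if not isinstance(merged["size"], int) or merged["size"] < 1:
        raise ValueError("size must be an integer >= 1")
    seed = merged.get("random_seed")
    if seed is not None and (not isinstance(seed, int) or seed < 0):
        raise ValueError("random_seed must be an integer >= 0")
    return merged

out = validate_infer_params({"table_name": "Housing", "random_seed": 42})
print(out["size"], out["random_seed"])  # 100 42
```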

The metadata file can contain any of the arguments above for each table; when it does, the duplicated CLI arguments are ignored.
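The precedence rule amounts to a simple override: metadata values win over CLI duplicates. A minimal sketch of that resolution (the function name is illustrative, not part of syngen):

```python
def resolve_settings(cli_args: dict, metadata_args: dict) -> dict:
    """Apply the documented precedence: metadata overrides duplicated CLI args."""
    resolved = dict(cli_args)
    resolved.update(metadata_args)  # metadata values replace CLI duplicates
    return resolved

cli = {"size": 500, "print_report": True, "run_parallel": False}
meta = {"size": 100, "print_report": False}
print(resolve_settings(cli, meta))
# {'size': 100, 'print_report': False, 'run_parallel': False}
```

CLI-only arguments (run_parallel here) survive; anything duplicated in the metadata takes the metadata's value.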

Linked tables generation

To generate linked tables, you should provide metadata in YAML format. It is used to handle complex relations for any number of tables. You can also specify additional training and inference parameters in the metadata file; in that case, the corresponding CLI arguments are ignored.

The YAML metadata file should match the following template:

CUSTOMER:                                       # Table name
    source: "./files/customer.csv"              # Local file in CSV or Avro format
             
    train_settings:                             # Settings for training process
        epochs: 10                              # Number of epochs if different from the default in the command line options
        drop_null: False                        # Drop rows with NULL values
        row_limit: None                         # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number
        batch_size: 32                          # If specified, the training is split into batches. This can save the RAM
        print_report: False                     # Turn on or turn off generation of the report
        column_types:
            categorical:                        # Force listed columns to have categorical type (use dictionary of values)
                - gender
                - marital_status
             
    infer_settings:                             # Settings for infer process
        size: 100                               # Size for generated data
        run_parallel: False                     # Turn on or turn off parallel generation process
        print_report: False                     # Turn on or turn off generation of the report
        batch_size: None                        # If specified, the generation is split into batches. This can save the RAM
        random_seed: None                       # If specified, generates a reproducible result
    keys:
        PK_CUSTOMER_ID:                         # Name of a key. Only one PK per table.
            type: "PK"                          # The key type. Supported: PK - primary key, UQ - unique key, FK - foreign key, TKN - token key
            columns:                            # Array of column names
                - customer_id
 
        UQ1:                                    # Name of a key
            type: "UQ"                          # One or many unique keys
            columns:
                - e_mail
 
        FK1:                                    # One or many foreign keys
            type: "FK"
            columns:                            # Array of columns in the current table
                - e_mail
                - alias
            references:
                table: "PROFILE"                # Name of the parent table
                columns:                        # Array of columns in the parent table
                    - e_mail
                    - alias
   
        FK2:
            type: "FK"
            columns:
                - address_id
            references:
                table: "ADDRESS"
                columns:
                    - address_id

 
ORDER:
    source: "./files/order.csv"
 
    train_settings:
        epochs: 10                              # Number of epochs if different from the default in the command line options
        drop_null: False                        # Drop rows with NULL values
        row_limit: None                         # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number
        batch_size: 32                          # If specified, the training is split into batches. This can save the RAM
        print_report: False                     # Turn on or turn off generation of the report
        column_types:
            categorical:                        # Force listed columns to have categorical type (use dictionary of values)
                - gender
                - marital_status
 
    infer_settings:                             # Settings for infer process
        size: 100                               # Size for generated data
        run_parallel: False                     # Turn on or turn off parallel generation process
        print_report: False                     # Turn on or turn off generation of the report
        batch_size: None                        # If specified, the generation is split into batches. This can save the RAM
        random_seed: None                       # If specified, generates a reproducible result
    keys:
        pk_order_id:
            type: "PK"
            columns:
                - order_id
 
        FK1:
            type: "FK"
            columns:
                - customer_id
            references:
                table: "CUSTOMER"
                columns:
                    - customer_id
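Because foreign keys must point at tables defined elsewhere in the metadata, it can be worth checking referential consistency before training. The sketch below works on a plain dict mirroring the template above (in practice you would load the YAML, e.g. with PyYAML); the helper name is an assumption, not part of syngen:

```python
# A trimmed-down metadata structure mirroring the YAML template.
metadata = {
    "CUSTOMER": {"keys": {
        "PK_CUSTOMER_ID": {"type": "PK", "columns": ["customer_id"]},
        "FK2": {"type": "FK", "columns": ["address_id"],
                "references": {"table": "ADDRESS", "columns": ["address_id"]}},
    }},
    "ORDER": {"keys": {
        "pk_order_id": {"type": "PK", "columns": ["order_id"]},
        "FK1": {"type": "FK", "columns": ["customer_id"],
                "references": {"table": "CUSTOMER", "columns": ["customer_id"]}},
    }},
}

def dangling_foreign_keys(meta: dict) -> list:
    """Return (table, key) pairs whose FK references an undefined table."""
    missing = []
    for table, spec in meta.items():
        for key_name, key in spec.get("keys", {}).items():
            if key.get("type") == "FK" and key["references"]["table"] not in meta:
                missing.append((table, key_name))
    return missing

print(dangling_foreign_keys(metadata))
# [('CUSTOMER', 'FK2')] (ADDRESS is not defined in this snippet)
```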

You can find an example metadata file at example-metadata/housing_metadata.yaml.

To train and generate related tables, use the commands:

train --metadata_path=PATH_TO_YAML_METADATA_FILE
infer --metadata_path=PATH_TO_YAML_METADATA_FILE

Here is a quick example:

train --metadata_path="./example-metadata/housing_metadata.yaml"
infer --metadata_path="./example-metadata/housing_metadata.yaml"

If --metadata_path is present and the metadata contains the necessary parameters, other CLI parameters will be ignored.

Docker images

The train and inference components of Syngen are available as public Docker images:

https://hub.docker.com/r/tdspora/syngen-train

https://hub.docker.com/r/tdspora/syngen-infer

To run the dockerized code for one table (see the parameter descriptions in the Training and Inference sections), call:

docker pull tdspora/syngen-train:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-train \
  --table_name=TABLE_NAME \
  --source=./model_artifacts/YOUR_CSV_FILE.csv

docker pull tdspora/syngen-infer:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-infer \
  --table_name=TABLE_NAME

PATH_TO_LOCAL_FOLDER is the absolute path to the folder where your original CSV is stored.

You can add any of the arguments listed in the training and inference sections above to the CLI call.

To run the dockerized code for linked tables, simply call:

docker pull tdspora/syngen-train:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-train \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML

docker pull tdspora/syngen-infer:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-infer \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML

You can add any of the arguments listed in the training and inference sections to the CLI call; however, they will be overwritten by the corresponding arguments in the metadata file.

Contribution

We welcome contributions from the community to help us improve and maintain our public GitHub repository. We appreciate any feedback, bug reports, or feature requests, and we encourage developers to submit fixes or new features via pull requests.

If you have found a bug or have a feature request, please submit an issue to our GitHub repository. Please provide as much detail as possible, including steps to reproduce the issue or a clear description of the feature request. Our team will review the issue and work with you to address any problems or discuss any potential new features.

If you would like to contribute a fix or a new feature, please submit a pull request to our GitHub repository. Please make sure your code follows our coding standards and best practices. Our team will review your pull request and work with you to ensure that it meets our standards and is ready for inclusion in our codebase.

We appreciate your contributions and thank you for your interest in helping us maintain and improve our public GitHub repository.
