Synthetic Data Generation with optional Differential Privacy

These details have not been verified by PyPI

Project description

Gretel Synthetics

An open source synthetic data library from Gretel.ai

Documentation

Try it out now!

If you want to quickly discover gretel-synthetics, simply click the button below and follow the tutorials!

Check out additional examples here.

Getting Started

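By default, we do not install TensorFlow via pip, because many developers and cloud services such as Google Colab run customized versions for their hardware.

To install from a local checkout of the source:

$ pip install -U .

Or, to install the released package from PyPI:

$ pip install gretel-synthetics

Then, to run the example notebooks: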

$ pip install jupyter
$ jupyter notebook

When the UI launches in your browser, navigate to examples/synthetic_records.ipynb and get generating!

If you want to install gretel-synthetics locally and use a GPU (recommended):

Create a virtual environment (e.g. using conda)

$ conda create --name tf python=3.8

Activate the virtual environment

$ conda activate tf

Run the setup script ./setup-utils/setup-gretel-synthetics-tensorflow24-with-gpu.sh

The last step installs all the software packages needed for GPU usage, TensorFlow 2.4, and gretel-synthetics. Note that this script works only on Ubuntu 18.04; you may need to modify it for other OS versions.
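
After the setup script finishes, it can be useful to confirm that TensorFlow actually sees a GPU. The following is a minimal sketch, not part of gretel-synthetics: the helper name `visible_gpus` is hypothetical, and the only TensorFlow call it relies on is `tf.config.list_physical_devices`. It returns an empty list when TensorFlow is not installed instead of raising.

```python
import importlib.util


def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or [] if TensorFlow is missing."""
    # Check for the package first so a missing install does not raise ImportError.
    if importlib.util.find_spec("tensorflow") is None:
        return []
    import tensorflow as tf

    # list_physical_devices("GPU") returns one entry per GPU visible to TensorFlow.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
```

An empty list on a machine with a GPU usually points at a driver or CUDA library mismatch rather than at gretel-synthetics itself.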

Overview

This package allows developers to quickly get started with synthetic data generation using neural networks. The more complex pieces of working with libraries like TensorFlow and differential privacy are bundled into friendly Python classes and functions. There are two high-level modes that can be used.

Simple Mode

The simple mode trains line by line on an input file of text. When generating data, the generator yields a custom object that can be used in a variety of ways depending on your use case. This notebook demonstrates this mode.
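
To make the idea of a generator that yields one object per line concrete, here is a small sketch in plain Python. It does not use the gretel-synthetics API: `GeneratedLine`, `wrap_lines`, and `three_fields` are hypothetical names, and the list of sample lines stands in for model output. The object the library actually yields may carry different fields.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class GeneratedLine:
    """Hypothetical stand-in for the object a generator yields per line."""

    text: str
    valid: Optional[bool] = None   # None means no validator was supplied
    explain: Optional[str] = None  # reason validation failed, if it did


def wrap_lines(
    raw_lines: Iterable[str],
    validator: Optional[Callable[[str], None]] = None,
) -> Iterator[GeneratedLine]:
    """Yield one GeneratedLine per raw line, recording the validator result."""
    for text in raw_lines:
        if validator is None:
            yield GeneratedLine(text=text)
            continue
        try:
            validator(text)  # in this sketch a validator signals failure by raising
        except Exception as err:
            yield GeneratedLine(text=text, valid=False, explain=str(err))
        else:
            yield GeneratedLine(text=text, valid=True)


def three_fields(line: str) -> None:
    """Example validator: require exactly three comma-separated fields."""
    if len(line.split(",")) != 3:
        raise ValueError("expected 3 fields")


# Stand-in for model output; a trained model would produce these lines.
sample_output = ["alice,34,NY", "bob,41", "carol,29,TX"]
valid_rows = [
    item.text.split(",")
    for item in wrap_lines(sample_output, three_fields)
    if item.valid
]
```

Because the wrapper is a generator, a caller can stop consuming as soon as it has collected enough valid rows.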

DataFrame Mode

This library supports CSV / DataFrames natively using the DataFrame "batch" mode. This module provides a wrapper around our simple mode that is geared toward working with tabular data. Additionally, it is capable of handling a high number of columns by breaking the input DataFrame up into "batches" of columns and training a model on each batch. This notebook shows an overview of using this library with DataFrames natively.
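
The column batching described above can be pictured with a short sketch. This is an illustration under assumptions, not the library's implementation: `split_columns` and `join_batches` are hypothetical helpers, and the way the library chooses batch sizes or reassembles rows may differ.

```python
from typing import Dict, List, Sequence


def split_columns(columns: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split a list of column names into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(columns[start:start + batch_size])
        for start in range(0, len(columns), batch_size)
    ]


def join_batches(batch_rows: Sequence[Dict[str, str]]) -> Dict[str, str]:
    """Merge one generated row from each batch back into a single full record."""
    merged: Dict[str, str] = {}
    for row in batch_rows:
        merged.update(row)
    return merged


batches = split_columns(["id", "age", "city", "income", "score"], batch_size=2)
record = join_batches([{"id": "7", "age": "34"}, {"city": "NY", "income": "50"}, {"score": "9"}])
```

Each batch would get its own model, so wide tables trade a single large model for several smaller ones.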

Components

There are four primary components to be aware of when using this library.

Configurations. Configurations are classes specific to the underlying ML engine used to train models and generate data. For example, TensorFlowConfig creates all the parameters needed to train a TensorFlow-based model. LocalConfig is aliased to TensorFlowConfig for backwards compatibility with older versions of the library. A model is saved to a designated directory, which can optionally be archived and used later.
Tokenizers. Tokenizers convert input text into integer-based IDs that are used by the underlying ML engine. These tokenizers can be created and sent to the training input. This is optional; if no tokenizer is specified, a default one is used. You can find an example here that uses a simple char-by-char tokenizer to build a model from an input CSV. When training in a non-differentially private mode, we suggest using the default SentencePiece tokenizer, an unsupervised tokenizer that learns subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]), for faster training and increased accuracy of the synthetic model.
Training. Training combines the configuration and tokenizer to build a model, stored in the designated directory, that can be used to generate new records.
Generation. Once a model is trained, any number of new lines or records can be generated. Optionally, a record validator can be provided to ensure that the generated data meets any necessary constraints. See our notebooks for examples of validators.
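
As an illustration of what "converting input text into integer-based IDs" means for a char-by-char tokenizer, here is a minimal sketch in plain Python. `CharTokenizer` is a hypothetical class written for this example; it is not the tokenizer class shipped with the library, which also handles details such as special tokens and persistence.

```python
from typing import Dict, Iterable, List


class CharTokenizer:
    """Hypothetical char-by-char tokenizer; not the library's own class."""

    def __init__(self, lines: Iterable[str]) -> None:
        # Sort the distinct characters so ID assignment is deterministic.
        vocab = sorted({ch for line in lines for ch in line})
        self.char_to_id: Dict[str, int] = {ch: i for i, ch in enumerate(vocab)}
        self.id_to_char: Dict[int, str] = {i: ch for ch, i in self.char_to_id.items()}

    def encode(self, text: str) -> List[int]:
        """Map each character to its integer ID (KeyError on unseen characters)."""
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids: Iterable[int]) -> str:
        """Map integer IDs back to the original characters."""
        return "".join(self.id_to_char[i] for i in ids)


tokenizer = CharTokenizer(["ab,1", "ba,2"])
ids = tokenizer.encode("ab,2")
round_trip = tokenizer.decode(ids)
```

The vocabulary here never grows beyond the set of distinct characters in the training lines, which is the property the Differential Privacy section relies on.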

Differential Privacy

Differential privacy support for our TensorFlow mode is built on the great work being done by the Google TF team and their TensorFlow Privacy library.

When using DP, we currently recommend the character tokenizer: it creates a vocabulary of single characters only, which removes the risk of sensitive data being memorized as whole tokens that can be replayed during generation.

A few configuration options are also notable:

predict_batch_size should be set to 1
dp should be enabled
learning_rate, dp_noise_multiplier, dp_l2_norm_clip, and dp_microbatches can be adjusted to achieve various epsilon values.
reset_states should be disabled
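
To show how the clipping and noise options interact, the sketch below walks through the gradient step that differentially private SGD performs: clip each microbatch gradient to an L2 norm bound, sum, add Gaussian noise scaled by the noise multiplier, and average. It is plain Python for illustration, assuming gradients are flat lists of floats; the function names are hypothetical, and the real work is done inside TensorFlow Privacy's optimizers. The arguments `l2_norm_clip` and `noise_multiplier` play the roles of `dp_l2_norm_clip` and `dp_noise_multiplier`, and the number of gradients passed in plays the role of `dp_microbatches`.

```python
import math
import random
from typing import List, Sequence


def clip_gradient(grad: Sequence[float], l2_norm_clip: float) -> List[float]:
    """Scale grad down so that its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(
    microbatch_grads: Sequence[Sequence[float]],
    l2_norm_clip: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each microbatch gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, l2_norm_clip) for g in microbatch_grads]
    summed = [sum(parts) for parts in zip(*clipped)]
    # Noise standard deviation is tied to the clipping bound.
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [n / len(microbatch_grads) for n in noisy]


rng = random.Random(0)
update = dp_average_gradient(
    [[3.0, 4.0], [0.3, 0.4]], l2_norm_clip=1.0, noise_multiplier=1.1, rng=rng
)
```

Raising the noise multiplier or lowering the clipping bound generally yields a smaller epsilon at the cost of model accuracy, which is why these values are tuned together with the learning rate.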

Please see our example Notebook for training a DP model based on the Netflix Prize dataset.

Release history

0.22.20

Jun 24, 2025

0.22.19

Feb 25, 2025

0.22.17

Jan 28, 2025

0.22.16

Dec 10, 2024

0.22.15

Dec 5, 2024

0.22.14

Oct 29, 2024

0.22.13

Oct 8, 2024

0.22.12

Sep 10, 2024

0.22.11

Jun 26, 2024

0.22.10

May 14, 2024

0.22.9

Apr 30, 2024

0.22.8

Apr 2, 2024

0.22.7

Mar 19, 2024

0.22.6

Feb 20, 2024

0.22.5

Dec 21, 2023

0.22.4

Nov 28, 2023

0.22.3

Oct 17, 2023

0.22.2

Sep 5, 2023

0.22.1

Aug 15, 2023

0.22.0

Jul 18, 2023

0.21.0

Jun 13, 2023

0.20.0

Jan 24, 2023

0.19.0

Aug 30, 2022

0.18.1

Jul 27, 2022

0.18.0

May 24, 2022

0.18.0rc1 pre-release

May 23, 2022

0.17.0

Nov 11, 2021

0.16.12

Oct 14, 2021

0.16.11

Sep 22, 2021

0.16.10

Jul 13, 2021

0.15.10

May 20, 2021

0.15.9

May 19, 2021

This version

0.15.8

May 12, 2021

0.15.7

May 11, 2021

0.15.6

May 5, 2021

0.15.6rc1 pre-release

Apr 30, 2021

0.15.5

Apr 20, 2021

0.15.4

Apr 6, 2021

0.15.3

Jan 29, 2021

0.15.2

Dec 8, 2020

0.15.1

Nov 25, 2020

0.15.0

Nov 17, 2020

0.15.0rc0 pre-release

Nov 13, 2020

0.14.1

Oct 20, 2020

0.14.0

Oct 5, 2020

0.13.0

Sep 18, 2020

0.12.0

Sep 4, 2020

0.11.2

Aug 28, 2020

0.11.1

Aug 19, 2020

0.11.0

Aug 12, 2020

0.11.0rc5 pre-release

Aug 4, 2020

0.11.0rc4 pre-release

Aug 4, 2020

0.11.0rc3 pre-release

Aug 4, 2020

0.11.0rc1 pre-release

Aug 4, 2020

0.10.3

Jun 19, 2020

0.10.2

Jun 19, 2020

0.10.1

Jun 18, 2020

0.10.0 yanked

Jun 16, 2020

Reason this release was yanked:

Bug fix, released 0.10.1 in lieu of

0.9.3

Jun 3, 2020

0.9.2

May 26, 2020

0.9.1 yanked

May 22, 2020

Reason this release was yanked:

Corrupted wheel.

0.9.0

May 19, 2020

0.8.0

May 10, 2020

0.7.1

Apr 30, 2020

0.7.0

Apr 30, 2020

0.6.1

Apr 21, 2020

0.6.0

Mar 24, 2020

0.5.0

Mar 2, 2020

Download files

Source Distribution

gretel-synthetics-0.15.8.tar.gz (987.8 kB view details)

Uploaded May 12, 2021 Source

Built Distribution

gretel_synthetics-0.15.8-py3-none-any.whl (54.2 kB view details)

Uploaded May 12, 2021 Python 3

File details

Details for the file gretel-synthetics-0.15.8.tar.gz.

File metadata

Download URL: gretel-synthetics-0.15.8.tar.gz
Upload date: May 12, 2021
Size: 987.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.5

File hashes

Hashes for gretel-synthetics-0.15.8.tar.gz
Algorithm	Hash digest
SHA256	`90a70faa2a538386f91ba5d8dfe4226e21f94f4fff0627d4903b8877d1a8c366`
MD5	`b827cea2ca75f9ac46be5bea61901933`
BLAKE2b-256	`11bc5d6f8dbb1af4dd8409488c020c9b136ccbea3099abe73ef89023432a7f83`
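
A downloaded file can be checked against the SHA256 digest listed above with a few lines of standard-library Python. This is a minimal sketch; the helper names `sha256_of` and `matches_published_hash` are hypothetical, and the path passed in is assumed to be wherever the archive was saved locally.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("gretel-synthetics-0.15.8.tar.gz", "90a70faa2a538386f91ba5d8dfe4226e21f94f4fff0627d4903b8877d1a8c366")` should return True for an intact copy of the source distribution.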

File details

Details for the file gretel_synthetics-0.15.8-py3-none-any.whl.

File metadata

Download URL: gretel_synthetics-0.15.8-py3-none-any.whl
Upload date: May 12, 2021
Size: 54.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.5

File hashes

Hashes for gretel_synthetics-0.15.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bc6e65869004c7e5c62b5c679bc9d2f2d0031c3efe46d49be91cb4ea4089665c`
MD5	`879043ee015fff6d64d730f81ea9a8db`
BLAKE2b-256	`f20b5b008430f9afd02211ee686de7459750ab9c5607c29fe057bdf088355458`
