Benchmark tabular synthetic data generators using a variety of datasets

These details have been verified by PyPI

Maintainers

amontanez24 fealho kveerama mit_dai_lab npatki pvkdeveloper

Project description

This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

The Synthetic Data Gym (SDGym) is a benchmarking framework for modeling and generating synthetic data. Measure performance and memory usage across different synthetic data modeling techniques – classical statistics, deep learning and more!

The SDGym library integrates with the Synthetic Data Vault ecosystem. You can use any of its synthesizers, datasets or metrics for benchmarking. You can also customize the process to include your own work.

Datasets: Select any of the publicly available datasets from the SDV project, or input your own data.
Synthesizers: Choose from any of the SDV synthesizers and baselines. Or write your own custom machine learning model.
Evaluation: In addition to performance and memory usage, you can also measure synthetic data quality and privacy through a variety of metrics.

Install

Install SDGym using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install sdgym

conda install -c pytorch -c conda-forge sdgym

For more information about using SDGym, visit the SDGym Documentation.
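After installing, one quick way to confirm which version ended up in your environment is to query the package metadata. This is a minimal sketch using only the Python standard library; the helper name `installed_version` is made up for illustration and is not part of SDGym.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("sdgym")
print(f"sdgym version: {version}" if version else "sdgym is not installed")
```

Running this inside the virtual environment you installed into helps catch the common case where pip or conda targeted a different interpreter.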

Usage

Let's benchmark synthetic data generation for single tables. First, let's define which modeling techniques we want to use. Let's choose a few synthesizers from the SDV library and a few others to use as baselines.

# these synthesizers come from the SDV library
# each one uses different modeling techniques
sdv_synthesizers = ['GaussianCopulaSynthesizer', 'CTGANSynthesizer']

# these basic synthesizers are available in SDGym
# as baselines
baseline_synthesizers = ['UniformSynthesizer']

Now, we can benchmark the different techniques:

import sdgym

sdgym.benchmark_single_table(
    synthesizers=(sdv_synthesizers + baseline_synthesizers)
)

The result is a detailed performance, memory and quality evaluation across the synthesizers on a variety of publicly available datasets.
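To give a feel for what a performance and memory measurement involves, here is a small standard-library sketch that times a function and records its peak Python memory allocation. It is an illustration only and does not claim to be how SDGym collects its numbers; `measure` and `toy_training_step` are hypothetical names, and `tracemalloc` only sees allocations made through Python's allocator.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def toy_training_step(n):
    # Hypothetical stand-in for fitting a synthesizer
    return [i * i for i in range(n)]


result, seconds, peak_bytes = measure(toy_training_step, 100_000)
print(f"time: {seconds:.4f}s, peak memory: {peak_bytes / 1e6:.2f} MB")
```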

Supplying a custom synthesizer

Benchmark your own synthetic data generation techniques. Define your synthesizer by specifying the training logic (using machine learning) and the sampling logic.

def my_training_logic(data, metadata):
    # create an object to represent your synthesizer
    # and train it using the data
    synthesizer = ...  # placeholder: replace with your trained model
    return synthesizer

def my_sampling_logic(trained_synthesizer, num_rows):
    # use the trained synthesizer to create
    # num_rows of synthetic data
    synthetic_data = ...  # placeholder: replace with the sampled rows
    return synthetic_data

Learn more in the Custom Synthesizers Guide.
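As a concrete illustration of the two functions above, the sketch below fills them in with a deliberately simple technique: remember the observed values of every column, then sample each column independently with replacement. This is an assumption-laden toy, not a recommended model. It uses a plain dict of lists as the table so it runs without third-party packages; the toy data and column names are invented.

```python
import random


def my_training_logic(data, metadata):
    # "Training" here just stores the observed values of each column.
    # Iterating a dict (or a pandas DataFrame) yields its column names.
    return {column: list(data[column]) for column in data}


def my_sampling_logic(trained_synthesizer, num_rows):
    # Draw each column independently, with replacement, from observed values.
    return {
        column: random.choices(values, k=num_rows)
        for column, values in trained_synthesizer.items()
    }


# Toy data in a plain dict so the sketch needs no third-party packages
toy_data = {
    "age": [23, 35, 47, 51],
    "city": ["Boston", "Austin", "Denver", "Boston"],
}
synthesizer = my_training_logic(toy_data, metadata={})
synthetic_data = my_sampling_logic(synthesizer, num_rows=10)
print(synthetic_data)
```

Because columns are sampled independently, this approach discards any correlation between columns. If your benchmark passes tabular data as a pandas DataFrame, you would likely wrap the sampled dict in `pandas.DataFrame(...)` before returning it; check the Custom Synthesizers Guide for the exact expected types.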

Customizing your datasets

The SDGym library includes many publicly available datasets that you can include right away. List these using the get_available_datasets feature.

sdgym.get_available_datasets()

dataset_name   size_MB     num_tables
KRK_v1         0.072128    1
adult          3.907448    1
alarm          4.520128    1
asia           1.280128    1
...

You can also include any custom, private datasets that are stored on your computer or in an Amazon S3 bucket.

my_datasets_folder = 's3://my-datasets-bucket'

For more information, see the docs for Customized Datasets.
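One way to wire a custom folder into a benchmark run is to assemble the keyword arguments first and pass them along in a single call. The sketch below is hedged: the parameter names `additional_datasets_folder` and `sdv_datasets` are assumptions based on the SDGym documentation and should be checked against your installed version, and the helper functions are hypothetical. The benchmark itself is not executed here.

```python
def build_benchmark_kwargs(synthesizers, datasets_folder=None, dataset_names=None):
    """Collect keyword arguments for a single-table benchmark run."""
    kwargs = {"synthesizers": list(synthesizers)}
    if datasets_folder is not None:
        # Assumed parameter name for a local path or S3 location
        kwargs["additional_datasets_folder"] = datasets_folder
    if dataset_names is not None:
        # Assumed parameter name for selecting built-in datasets
        kwargs["sdv_datasets"] = list(dataset_names)
    return kwargs


def run_benchmark(kwargs):
    # Imported lazily so this sketch loads even without SDGym installed
    import sdgym

    return sdgym.benchmark_single_table(**kwargs)


kwargs = build_benchmark_kwargs(
    synthesizers=["GaussianCopulaSynthesizer", "UniformSynthesizer"],
    datasets_folder="s3://my-datasets-bucket",
    dataset_names=["adult"],
)
print(kwargs)
```

Calling `run_benchmark(kwargs)` would start the actual benchmark, which may download data and take a while.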

What's next?

Visit the SDGym Documentation to learn more!

The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi-table and time series data.
📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

Release history

version        released        notes
0.10.0         Feb 7, 2025     this version
0.10.0.dev0    Feb 6, 2025     pre-release
0.9.1          Aug 29, 2024
0.9.1.dev0     Aug 28, 2024    pre-release
0.9.0          Aug 7, 2024
0.9.0.dev0     Aug 6, 2024     pre-release
0.8.0          Jun 7, 2024
0.8.0.dev1     Jun 7, 2024     pre-release
0.8.0.dev0     Jun 4, 2024     pre-release
0.7.0          Jun 14, 2023
0.7.0.dev0     Jun 13, 2023    pre-release
0.6.0          Feb 1, 2023
0.6.0.dev1     Feb 1, 2023     pre-release
0.6.0.dev0     Jan 27, 2023    pre-release
0.5.0          Dec 13, 2021
0.5.0.dev0     Dec 13, 2021    pre-release
0.4.1          Aug 20, 2021
0.4.1.dev2     Aug 20, 2021    pre-release
0.4.1.dev1     Jul 12, 2021    pre-release
0.4.1.dev0     Jul 12, 2021    pre-release
0.4.0          Jun 17, 2021
0.4.0.dev1     Jun 16, 2021    pre-release
0.4.0.dev0     Jun 14, 2021    pre-release
0.3.1          May 21, 2021
0.3.1.dev2     May 20, 2021    pre-release
0.3.1.dev1     Apr 12, 2021    pre-release
0.3.1.dev0     Apr 6, 2021     pre-release
0.3.0          Jan 28, 2021
0.3.0.dev0     Jan 28, 2021    pre-release
0.2.2          Oct 17, 2020
0.2.2.dev0     Oct 9, 2020     pre-release
0.2.1          May 12, 2020
0.2.1.dev0     May 12, 2020    pre-release
0.2.0          Apr 10, 2020
0.2.0.dev1     Apr 10, 2020    pre-release
0.2.0.dev0     Apr 10, 2020    pre-release
0.1.0          Aug 8, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sdgym-0.10.0.tar.gz (38.5 kB)

Uploaded Feb 7, 2025 Source

Built Distribution

sdgym-0.10.0-py3-none-any.whl (39.7 kB)

Uploaded Feb 7, 2025 Python 3

File details

Details for the file sdgym-0.10.0.tar.gz.

File metadata

Download URL: sdgym-0.10.0.tar.gz
Upload date: Feb 7, 2025
Size: 38.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.11

File hashes

Hashes for sdgym-0.10.0.tar.gz
Algorithm	Hash digest
SHA256	`2245fdb0a5f6f769d82903e76871d1598fe92333cedcf656181ea85e5a5e6e8b`
MD5	`432cd35b820b707b6f994a5b357e2a14`
BLAKE2b-256	`3ff42d6bcfdd4ae9fd0644059504bd3d483cb445ed8332a39c010b09e4f17541`

See more details on using hashes here.
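A downloaded file can be checked against the SHA256 digest listed above with a few lines of standard-library Python. This is a minimal sketch: the helper names are hypothetical, and the local path in the usage comment assumes the archive was saved to the current directory under its original name.

```python
import hashlib

# SHA256 digest listed above for sdgym-0.10.0.tar.gz
EXPECTED_SHA256 = "2245fdb0a5f6f769d82903e76871d1598fe92333cedcf656181ea85e5a5e6e8b"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.lower()


# Usage (assumes the archive is in the current directory):
# matches_expected("sdgym-0.10.0.tar.gz", EXPECTED_SHA256)
```

SHA256 is the digest to rely on here; MD5 is listed for completeness but is not collision resistant.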

File details

Details for the file sdgym-0.10.0-py3-none-any.whl.

File metadata

Download URL: sdgym-0.10.0-py3-none-any.whl
Upload date: Feb 7, 2025
Size: 39.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.11

File hashes

Hashes for sdgym-0.10.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`be863cf45e04c0b7a51ca53808b41bc2a688d2ec0cb0a377b6b621b24e96e3cb`
MD5	`7d286fa48773ec612948e282ba76895c`
BLAKE2b-256	`cf395fcf4cf7a3e0f7888a8e3dbe030e76e80c025d723fba30c20615c87f23ed`

