Benchmark tabular synthetic data generators using a variety of datasets

These details have been verified by PyPI

Maintainers

amontanez24 fealho kveerama mit_dai_lab npatki pvkdeveloper

These details have not been verified by PyPI

Project links

Homepage

Project description

This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators based on SDV and SDMetrics.

Important Links
:computer: Website	Check out the SDV Website for more information about the project.
:orange_book: SDV Blog	Regularly published, useful content about synthetic data generation.
:book: Documentation	Quickstarts, User and Development Guides, and API Reference.
:octocat: Repository	The link to the GitHub repository of this library.
:keyboard: Development Status	This software is in its Pre-Alpha stage.
Community	Join our Slack Workspace for announcements and discussions.
Tutorials	Run the SDV Tutorials in a Binder environment.

What is a Synthetic Data Generator?

A Synthetic Data Generator is a Python function (or method) that takes as input some data, which we call the real data, learns a model from it, and outputs new synthetic data that has the same structure and similar mathematical properties as the real one.

Please refer to the synthesizers documentation for instructions about how to implement your own Synthetic Data Generator and integrate with SDGym. You can also read about how to use the ones already included in SDGym and see how to run them.
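
As a rough illustration of the definition above, the sketch below is a deliberately naive generator written with the Python standard library only. It is not one of the SDGym synthesizers: the function name, the dict-of-lists table format and the per-column Gaussian model are all assumptions made for the example. It takes some real data, learns a model from it (a mean and standard deviation per column) and returns new data with the same columns and number of rows.

```python
import random
import statistics


def independent_gaussian_generator(real_data, num_rows=None, seed=0):
    """Fit one Gaussian per column and sample each column independently.

    `real_data` is a dict mapping column names to lists of numbers, a
    hypothetical stand-in for the table a real generator would receive.
    """
    rng = random.Random(seed)
    if num_rows is None:
        # By default, produce as many rows as the real table has.
        num_rows = len(next(iter(real_data.values())))

    # "Learn a model": here, only the mean and standard deviation per column.
    model = {
        column: (statistics.mean(values), statistics.pstdev(values))
        for column, values in real_data.items()
    }

    # "Output synthetic data": same column names, freshly sampled values.
    return {
        column: [rng.gauss(mean, std) for _ in range(num_rows)]
        for column, (mean, std) in model.items()
    }


real = {"age": [23, 35, 41, 52, 29], "income": [31.0, 48.5, 52.0, 75.2, 39.9]}
synthetic = independent_gaussian_generator(real)
print(sorted(synthetic))      # column names match the real data
print(len(synthetic["age"]))  # row count matches the real data
```

Because every column is sampled independently, this toy model ignores correlations between columns, which is exactly the kind of weakness a benchmark is meant to expose.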

Benchmark datasets

SDGym evaluates the performance of Synthetic Data Generators using single-table, multi-table and time series datasets, each stored as CSV files alongside an SDV Metadata JSON file.

Further details about the list of available datasets and how to add your own datasets to the collection can be found in the datasets documentation.
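
To make the "CSV files alongside a metadata JSON file" layout more concrete, here is a minimal sketch that builds a tiny dataset on disk and reads it back using only the standard library. The file name metadata.json, the "tables", "path", "primary_key" and "fields" keys, and the load_dataset helper are assumptions made for illustration; the authoritative layout is the one described in the datasets documentation.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_dataset(dataset_dir):
    """Load every table listed in metadata.json from its CSV file."""
    dataset_dir = Path(dataset_dir)
    metadata = json.loads((dataset_dir / "metadata.json").read_text())
    tables = {}
    for table_name, table_meta in metadata["tables"].items():
        with open(dataset_dir / table_meta["path"], newline="") as csv_file:
            tables[table_name] = list(csv.DictReader(csv_file))
    return tables, metadata


# Build a small hypothetical dataset so that the sketch is runnable as is.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "users.csv").write_text("user_id,country\n1,ES\n2,US\n")
    (tmp / "metadata.json").write_text(json.dumps({
        "tables": {
            "users": {
                "path": "users.csv",
                "primary_key": "user_id",
                "fields": {
                    "user_id": {"type": "id"},
                    "country": {"type": "categorical"},
                },
            }
        }
    }))
    tables, metadata = load_dataset(tmp)

print(tables["users"])
```

Note that csv.DictReader returns every value as a string; a real loader would use the field types in the metadata to convert them.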

Install

SDGym can be installed using the following commands:

Using pip:

pip install sdgym

Using conda:

conda install -c pytorch -c conda-forge sdgym

For more installation options, please visit the SDGym installation guide.
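
After installing, one way to confirm which version ended up in the environment is to query the package metadata. The helper below is a small sketch based on the standard library importlib.metadata module (Python 3.8+); the function name installed_version is an assumption and not part of SDGym.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("sdgym")
print(f"sdgym version: {version}" if version else "sdgym is not installed")
```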

Usage

Benchmarking your own Synthesizer

SDGym evaluates Synthetic Data Generators, which are Python functions (or classes) that take as input some data, which we call the real data, learn a model from it, and output new synthetic data that has the same structure and similar mathematical properties as the real one.

As an example, let us define a synthesizer function that applies the GaussianCopula model from SDV with a Gaussian distribution.

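from sdv.tabular import GaussianCopula


def create_gaussian_copula(real_data, metadata):
    # Fit a GaussianCopula model on the first table listed in the metadata.
    gc = GaussianCopula(default_distribution='gaussian')
    table_name = metadata.get_tables()[0]
    gc.fit(real_data[table_name])
    num_rows = len(real_data[table_name])
    # Bundle everything that the sampling function needs.
    return (table_name, num_rows, gc)


def sample_gaussian_copula(synthesizer, num_samples):
    # Unpack the object returned by create_gaussian_copula.
    table_name, num_rows, gc = synthesizer
    # num_samples is not used here: as many rows as the real table are sampled.
    return {table_name: gc.sample(num_rows)}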

:information_source: You can learn how to create your own synthesizer function here.

We can now try to evaluate this function on the asia and alarm datasets:

import sdgym

scores = sdgym.benchmark_single_table(
    synthesizers=(create_gaussian_copula, sample_gaussian_copula),
    sdv_datasets=['asia', 'alarm'],
)

:information_source: You can learn about the different arguments of the `sdgym.benchmark_single_table` function here.

The output of the sdgym.benchmark_single_table function will be a pd.DataFrame containing the results obtained by your synthesizer on each dataset.

synthesizer	dataset	modality	metric	score	metric_time	model_time
gaussian_copula	asia	single-table	BNLogLikelihood	-2.842690	2.762427	0.752364
gaussian_copula	alarm	single-table	BNLogLikelihood	-20.223178	7.009401	3.173832
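
As a small follow-up, the sketch below shows one way such a results table could be summarized, here by adding metric_time and model_time for each dataset. It re-enters the two rows shown above as tab-separated text and uses only the standard library so that it runs on its own; with the actual pd.DataFrame the same sum would be a one-line column operation. The helper name total_time_per_dataset is an assumption.

```python
import csv
import io

# The two result rows shown above, re-entered as tab-separated text.
RESULTS_TSV = (
    "synthesizer\tdataset\tmodality\tmetric\tscore\tmetric_time\tmodel_time\n"
    "gaussian_copula\tasia\tsingle-table\tBNLogLikelihood\t-2.842690\t2.762427\t0.752364\n"
    "gaussian_copula\talarm\tsingle-table\tBNLogLikelihood\t-20.223178\t7.009401\t3.173832\n"
)


def total_time_per_dataset(tsv_text):
    """Add metric_time and model_time for every row, grouped by dataset."""
    totals = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        elapsed = float(row["metric_time"]) + float(row["model_time"])
        totals[row["dataset"]] = totals.get(row["dataset"], 0.0) + elapsed
    return totals


totals = total_time_per_dataset(RESULTS_TSV)
for dataset, elapsed in totals.items():
    print(f"{dataset}: {elapsed:.6f}")
```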

Benchmarking the SDGym Synthesizers

If you want to run the SDGym benchmark on the SDGym Synthesizers you can directly pass the corresponding class, or a list of classes, to the sdgym.run function.

For example, if you want to run the complete benchmark suite to evaluate all the existing synthesizers you can run (:warning: this will take a lot of time to run!):

import sdgym
from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, HMA1, Identity, Independent,
    MedGAN, PAR, PrivBN, SDV, TableGAN, TVAE,
    Uniform, VEEGAN)

all_synthesizers = [
    CLBN,
    CTGAN,
    CopulaGAN,
    HMA1,
    Identity,
    Independent,
    MedGAN,
    PAR,
    PrivBN,
    SDV,
    TVAE,
    TableGAN,
    Uniform,
    VEEGAN,
]
scores = sdgym.run(synthesizers=all_synthesizers)

For further details about all the arguments and possibilities that the benchmark function offers, please refer to the benchmark documentation.

Additional References

Datasets used in SDGym are detailed here.
How to write a synthesizer is detailed here.
How to use the benchmark function is detailed here.
Detailed leaderboard results for all the releases are available here.

The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

History

v0.5.0 - 2021-12-13

This release adds support for Python 3.9, and updates dependencies to accept the latest versions when possible.

Issues closed

Add support for Python 3.9 - Issue #127 by @katxiao
Add pip check workflow - Issue #124 by @pvk-developer
Fix meta.yaml dependencies - PR #119 by @fealho
Upgrade dependency ranges - Issue #118 by @katxiao

v0.4.1 - 2021-08-20

This release fixes a bug where passing a JSON file as configuration for a multi-table synthesizer crashed the model. It also adds a number of fixes and enhancements, including: (1) a function and CLI command to list the available synthesizer names, (2) a curated set of dependencies, with Gretel made into an optional dependency, (3) an update to Gretel to use temp directories, (4) the use of nvidia-smi to get the number of GPUs and (5) multiple Dockerfile updates to improve functionality.

Issues closed

Bug when using JSON configuration for multiple multi-table evaluation - Issue #115 by @pvk-developer
Use nvidia-smi to get number of gpus - PR #113 by @katxiao
List synthesizer names - Issue #82 by @fealho
Use nvidia base for dockerfile - PR #108 by @katxiao
Add Makefile target to install gretel and ydata - PR #107 by @katxiao
Curate dependencies and make Gretel optional - PR #106 by @csala
Update gretel checkpoints to use temp directory - PR #105 by @katxiao
Initialize variable before reference - PR #104 by @katxiao

v0.4.0 - 2021-06-17

This release adds new synthesizers for Gretel and ydata, and creates a Docker image for SDGym. It also includes enhancements to the accepted SDGym arguments, adds a summary command to aggregate metrics, and adds the normalized score to the benchmark results.

New Features

Add normalized score to benchmark results - Issue #102 by @katxiao
Add max rows and max columns args - Issue #96 by @katxiao
Automatically detect number of workers - Issue #97 by @katxiao
Add summary function and command - Issue #92 by @amontanez24
Allow jobs list/JSON to be passed - Issue #93 by @fealho
Add ydata to sdgym - Issue #90 by @fealho
Add dockerfile for sdgym - Issue #88 by @katxiao
Add Gretel to SDGym synthesizer - Issue #87 by @amontanez24

v0.3.1 - 2021-05-20

This release adds new features to store results and cache contents into an S3 bucket as well as a script to collect results from a cache dir and compile a single results CSV file.

Issues closed

Collect cached results from s3 bucket - Issue #85 by @katxiao
Store cache contents into an S3 bucket - Issue #81 by @katxiao
Store SDGym results into an S3 bucket - Issue #80 by @katxiao
Add a way to collect cached results - Issue #79 by @katxiao
Allow reading datasets from private s3 bucket - Issue #74 by @katxiao
Typos in the sdgym.run function docstring documentation - Issue #69 by @sbrugman

v0.3.0 - 2021-01-27

Major rework of the SDGym functionality to support a collection of new features:

Add relational and timeseries model benchmarking.
Use SDMetrics for model scoring.
Update datasets format to match SDV metadata based storage format.
Centralize default datasets collection in the sdv-datasets S3 bucket.
Add options to download and use datasets from different S3 buckets.
Rename synthesizers to baselines and adapt to the new metadata format.
Add model execution and metric computation time logging.
Add optional synthetic data and error traceback caching.

v0.2.2 - 2020-10-17

This version adds a rework of the benchmark function and a few new synthesizers.

New Features

New CLI with run, make-leaderboard and make-summary commands
Parallel execution via Dask or Multiprocessing
Download datasets without executing the benchmark
Support for Python from 3.6 to 3.8

New Synthesizers

sdv.tabular.CTGAN
sdv.tabular.CopulaGAN
sdv.tabular.GaussianCopulaOneHot
sdv.tabular.GaussianCopulaCategorical
sdv.tabular.GaussianCopulaCategoricalFuzzy

v0.2.1 - 2020-05-12

New updated leaderboard and minor improvements.

New Features

Add parameters for PrivBNSynthesizer - Issue #37 by @csala

v0.2.0 - 2020-04-10

New Benchmark API and lots of improved documentation.

New Features

The benchmark function now returns a complete leaderboard instead of only one score
Class Synthesizers can be directly passed to the benchmark function

Bug Fixes

One hot encoding errors in the Independent, VEEGAN and MedGAN Synthesizers.
Proper usage of the eval mode during sampling.
Fix improperly configured datasets.

v0.1.0 - 2019-08-07

First release to PyPI.

Release history

0.10.0

Feb 7, 2025

0.10.0.dev0 pre-release

Feb 6, 2025

0.9.1

Aug 29, 2024

0.9.1.dev0 pre-release

Aug 28, 2024

0.9.0

Aug 7, 2024

0.9.0.dev0 pre-release

Aug 6, 2024

0.8.0

Jun 7, 2024

0.8.0.dev1 pre-release

Jun 7, 2024

0.8.0.dev0 pre-release

Jun 4, 2024

0.7.0

Jun 14, 2023

0.7.0.dev0 pre-release

Jun 13, 2023

0.6.0

Feb 1, 2023

0.6.0.dev1 pre-release

Feb 1, 2023

This version

0.6.0.dev0 pre-release

Jan 27, 2023

0.5.0

Dec 13, 2021

0.5.0.dev0 pre-release

Dec 13, 2021

0.4.1

Aug 20, 2021

0.4.1.dev2 pre-release

Aug 20, 2021

0.4.1.dev1 pre-release

Jul 12, 2021

0.4.1.dev0 pre-release

Jul 12, 2021

0.4.0

Jun 17, 2021

0.4.0.dev1 pre-release

Jun 16, 2021

0.4.0.dev0 pre-release

Jun 14, 2021

0.3.1

May 21, 2021

0.3.1.dev2 pre-release

May 20, 2021

0.3.1.dev1 pre-release

Apr 12, 2021

0.3.1.dev0 pre-release

Apr 6, 2021

0.3.0

Jan 28, 2021

0.3.0.dev0 pre-release

Jan 28, 2021

0.2.2

Oct 17, 2020

0.2.2.dev0 pre-release

Oct 9, 2020

0.2.1

May 12, 2020

0.2.1.dev0 pre-release

May 12, 2020

0.2.0

Apr 10, 2020

0.2.0.dev1 pre-release

Apr 10, 2020

0.2.0.dev0 pre-release

Apr 10, 2020

0.1.0

Aug 8, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sdgym-0.6.0.dev0.tar.gz (57.0 kB)

Uploaded Jan 27, 2023 Source

Built Distribution

sdgym-0.6.0.dev0-py2.py3-none-any.whl (51.0 kB)

Uploaded Jan 27, 2023 Python 2, Python 3

File details

Details for the file sdgym-0.6.0.dev0.tar.gz.

File metadata

Download URL: sdgym-0.6.0.dev0.tar.gz
Upload date: Jan 27, 2023
Size: 57.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.9.2 readme-renderer/37.3 requests/2.28.1 requests-toolbelt/0.10.1 urllib3/1.26.13 tqdm/4.64.1 importlib-metadata/5.1.0 keyring/23.11.0 rfc3986/2.0.0 colorama/0.4.6 CPython/3.8.15

File hashes

Hashes for sdgym-0.6.0.dev0.tar.gz
Algorithm	Hash digest
SHA256	`ddf336da9f8d7654c0a9e797bb115535b3fbd01d34eced6b20a59cf7503e88fd`
MD5	`fd68abcf04d85afef83aee62d09ce8f3`
BLAKE2b-256	`0ea89e8a8d005b876b2cc02ae73f032bac896ac61c4c74acc89ab7fb6c06a038`

See more details on using hashes here.

File details

Details for the file sdgym-0.6.0.dev0-py2.py3-none-any.whl.

File metadata

Download URL: sdgym-0.6.0.dev0-py2.py3-none-any.whl
Upload date: Jan 27, 2023
Size: 51.0 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.9.2 readme-renderer/37.3 requests/2.28.1 requests-toolbelt/0.10.1 urllib3/1.26.13 tqdm/4.64.1 importlib-metadata/5.1.0 keyring/23.11.0 rfc3986/2.0.0 colorama/0.4.6 CPython/3.8.15

File hashes

Hashes for sdgym-0.6.0.dev0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`aabeba4624664b0f40ea6978f1a3abbfd297c70a4c20e5ba29c4d438ce4239d3`
MD5	`392547f560e8bce1bc9631826069d8d8`
BLAKE2b-256	`761410d1758815681cbf78b6da6b88a5b44ac9a3f6d4a8778b24ad8012e2e7f3`

See more details on using hashes here.
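
The digests listed above can be used to check that a downloaded file is intact. The sketch below, which relies only on the standard library hashlib module, computes the SHA256 digest of a file and compares it with an expected value. The function name sha256_matches and the demo file are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_matches(path, expected_hex, chunk_size=65536):
    """Return True if the SHA256 digest of the file equals expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as file:
        # Read in chunks so that large archives need not fit in memory.
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# Self-contained demonstration on a temporary file with known contents.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.bin"
    demo_file.write_bytes(b"hello")
    demo_ok = sha256_matches(demo_file, hashlib.sha256(b"hello").hexdigest())
    demo_bad = sha256_matches(demo_file, "0" * 64)

print(demo_ok, demo_bad)  # prints: True False
```

For the source distribution, the same function would be called with the path of the downloaded sdgym-0.6.0.dev0.tar.gz and the SHA256 value from the corresponding table.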

sdgym 0.6.0.dev0
