A framework to benchmark the performance of synthetic data generators for non-temporal tabular data
An open source project from Data to AI Lab at MIT.
- License: MIT
- Development Status: Pre-Alpha
- Homepage: https://github.com/sdv-dev/SDGym
Overview
Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators for tabular data. SDGym is a project of the Data to AI Laboratory at MIT.
What is a Synthetic Data Generator?
A Synthetic Data Generator is a Python function (or class method) that takes some data as input, which we call the real data, learns a model from it, and outputs new synthetic data with mathematical properties similar to those of the real data.
Please refer to the synthesizers documentation for instructions about how to implement your own Synthetic Data Generator and integrate with SDGym. You can also read about how to use the ones included in SDGym and see the current leaderboard.
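To make the definition above concrete, here is a minimal sketch of such a function. It assumes the `(real_data, categorical_columns, ordinal_columns)` signature used in the Usage section of this document and assumes that the real data arrives as a 2D numeric array. The function name and the toy data are hypothetical, and the "model" is deliberately trivial: the empirical distribution of each column taken on its own.

```python
import numpy as np


def independent_column_synthesizer(real_data, categorical_columns, ordinal_columns):
    """Hypothetical baseline: resample every column independently of the others."""
    real_data = np.asarray(real_data)
    rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
    n_rows, n_columns = real_data.shape
    synthetic = np.empty_like(real_data)
    for column in range(n_columns):
        # Draw with replacement from the observed values of this column only.
        # This keeps each marginal distribution but discards correlations.
        synthetic[:, column] = rng.choice(real_data[:, column], size=n_rows, replace=True)
    return synthetic


# Toy "real" data (assumed layout): column 0 is numeric, column 1 is categorical.
real = np.array([[25.0, 0.0], [38.0, 1.0], [52.0, 1.0], [41.0, 0.0]])
synthetic = independent_column_synthesizer(real, categorical_columns=[1], ordinal_columns=[])
```

The output has the same shape as the input and only contains values that were observed in the corresponding real column, which is what makes it a useful sanity-check baseline rather than a serious generator.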
Benchmark datasets
SDGym evaluates the performance of Synthetic Data Generators using datasets from three families:
- Simulated data generated using Gaussian Mixtures
- Simulated data generated using Bayesian Networks
- Real world datasets
Further details about how these datasets were generated can be found in the Modeling Tabular data using Conditional GAN paper and in the datasets documentation.
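As an illustration of what "simulated data generated using Gaussian Mixtures" can look like, the following sketch samples points from a small two-dimensional mixture. It is not the procedure used to build the SDGym datasets; the function name, weights, means and standard deviations are all assumptions chosen for the example.

```python
import numpy as np


def sample_gaussian_mixture(n_samples, weights, means, stds, seed=0):
    """Sample rows from an isotropic Gaussian mixture (illustrative only)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # Pick which component generates each row, according to the mixture weights.
    components = rng.choice(len(weights), size=n_samples, p=weights)
    # Shift and scale standard normal noise by the chosen component's parameters.
    noise = rng.standard_normal(size=(n_samples, means.shape[1]))
    return means[components] + noise * stds[components, None]


# Hypothetical mixture with three components in two dimensions.
data = sample_gaussian_mixture(
    n_samples=1000,
    weights=[0.5, 0.3, 0.2],
    means=[[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]],
    stds=[1.0, 0.5, 0.8],
)
```

Because the generating distribution is known exactly, data simulated this way lets a benchmark compare synthetic samples against the true model, for example through a likelihood score.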
Current Leaderboard
This is a summary of the current SDGym leaderboard, showing the number of datasets in which each Synthesizer obtained the best score.
The complete scores table can be found in the synthesizers document and can also be downloaded as a CSV file from here: sdgym/leaderboard.csv
Detailed leaderboard results for all the releases are available in this Google Docs Spreadsheet.
Gaussian Mixture Simulated Data
Synthesizer | 0.2.2 | 0.2.1 | 0.2.0 |
---|---|---|---|
CLBNSynthesizer | 0 | 0.0 | 1.0 |
CTGAN | 0 | N/E | N/E |
CTGANSynthesizer | 0 | 0.0 | 1.0 |
CopulaGAN | 0 | N/E | N/E |
GaussianCopulaCategorical | 1 | N/E | N/E |
GaussianCopulaCategoricalFuzzy | 0 | N/E | N/E |
GaussianCopulaOneHot | 0 | N/E | N/E |
MedganSynthesizer | 0 | 0.0 | 0.0 |
PrivBNSynthesizer | 0 | 0.0 | 0.0 |
TVAESynthesizer | 5 | 5.0 | 4.0 |
TableganSynthesizer | 0 | 1.0 | 0.0 |
VEEGANSynthesizer | 0 | 0.0 | 0.0 |
Bayesian Networks Simulated Data
Synthesizer | 0.2.2 | 0.2.1 | 0.2.0 |
---|---|---|---|
CLBNSynthesizer | 0 | 0.0 | 0.0 |
CTGAN | 0 | N/E | N/E |
CTGANSynthesizer | 0 | 0.0 | 0.0 |
CopulaGAN | 0 | N/E | N/E |
GaussianCopulaCategorical | 0 | N/E | N/E |
GaussianCopulaCategoricalFuzzy | 0 | N/E | N/E |
GaussianCopulaOneHot | 0 | N/E | N/E |
MedganSynthesizer | 4 | 4.0 | 1.0 |
PrivBNSynthesizer | 3 | 3.0 | 6.0 |
TVAESynthesizer | 1 | 1.0 | 3.0 |
TableganSynthesizer | 0 | 0.0 | 0.0 |
VEEGANSynthesizer | 0 | 0.0 | 0.0 |
Real World Datasets
Synthesizer | 0.2.2 | 0.2.1 | 0.2.0 |
---|---|---|---|
CLBNSynthesizer | 0 | 0.0 | 0.0 |
CTGAN | 1 | N/E | N/E |
CTGANSynthesizer | 0 | 3.0 | 3.0 |
CopulaGAN | 3 | N/E | N/E |
GaussianCopulaCategorical | 0 | N/E | N/E |
GaussianCopulaCategoricalFuzzy | 0 | N/E | N/E |
GaussianCopulaOneHot | 0 | N/E | N/E |
MedganSynthesizer | 0 | 0.0 | 0.0 |
PrivBNSynthesizer | 0 | 0.0 | 0.0 |
TVAESynthesizer | 4 | 5.0 | 5.0 |
TableganSynthesizer | 0 | 0.0 | 0.0 |
VEEGANSynthesizer | 0 | 0.0 | 0.0 |
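The tables above count, for each synthesizer, the number of datasets on which it obtained the best score. A minimal sketch of that kind of tally is shown below. The dataset names, synthesizer names and scores are made up, and the sketch assumes that a higher score is better and that each dataset has a single score per synthesizer, which may not match how the actual leaderboard is computed.

```python
# Hypothetical scores: {dataset: {synthesizer: score}}, higher is assumed better.
scores = {
    "dataset_a": {"SynthA": 0.81, "SynthB": 0.79, "SynthC": 0.60},
    "dataset_b": {"SynthA": 0.55, "SynthB": 0.72, "SynthC": 0.70},
    "dataset_c": {"SynthA": 0.90, "SynthB": 0.88, "SynthC": 0.10},
}


def count_wins(scores):
    """Count the datasets on which each synthesizer has the highest score."""
    # Start every synthesizer at zero so that non-winners still appear.
    wins = {name: 0 for results in scores.values() for name in results}
    for results in scores.values():
        # On ties, max() keeps the first synthesizer encountered.
        best = max(results, key=results.get)
        wins[best] += 1
    return wins


wins = count_wins(scores)
print(wins)  # {'SynthA': 2, 'SynthB': 1, 'SynthC': 0}
```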
Install
Requirements
SDGym has been developed and tested on Python 3.6, 3.7 and 3.8.
Also, although it is not strictly required, the usage of a virtualenv is highly recommended in order to avoid interfering with other software installed in the system where SDGym is run.
Install with pip
The easiest and recommended way to install SDGym is using pip:
pip install sdgym
This will pull and install the latest stable release from PyPI.
If you want to install it from source or contribute to the project please read the Contributing Guide for more details about how to do it.
Usage
Benchmarking your own synthesizer
All you need to do in order to use the SDGym Benchmark is to import `sdgym` and call its `run` function, passing it your synthesizer function and the settings that you want to use for the evaluation.
For example, if we want to evaluate a simple synthesizer function on the `adult` dataset, we can execute:
import numpy as np
import sdgym


def my_synthesizer_function(real_data, categorical_columns, ordinal_columns):
    """Dummy synthesizer that just returns a permutation of the real data."""
    return np.random.permutation(real_data)


scores = sdgym.run(synthesizers=my_synthesizer_function, datasets=['adult'])
- You can learn how to create your own synthesizer function here.
- You can learn about the different arguments of the `sdgym.run` function here.
The output of the `sdgym.run` function will be a `pd.DataFrame` containing the results obtained by your synthesizer on each dataset, as well as the results obtained previously by the SDGym synthesizers:
adult/accuracy adult/f1 ... ring/test_likelihood
IndependentSynthesizer 0.56530 0.134593 ... -1.958888
UniformSynthesizer 0.39695 0.273753 ... -2.519416
IdentitySynthesizer 0.82440 0.659250 ... -1.705487
... ... ... ... ...
my_synthesizer_function 0.64865 0.210103 ... -1.964966
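Once the scores are available as a DataFrame, ordinary pandas operations can be used to compare synthesizers. The sketch below rebuilds a small part of the sample output above by hand, so it does not call SDGym at all. It assumes that synthesizer names are the row index and that columns are named `dataset/metric`, as the sample output suggests.

```python
import pandas as pd

# Hand-copied subset of the sample output shown above (assumed layout).
scores = pd.DataFrame(
    {
        "adult/accuracy": [0.56530, 0.39695, 0.82440, 0.64865],
        "adult/f1": [0.134593, 0.273753, 0.659250, 0.210103],
    },
    index=[
        "IndependentSynthesizer",
        "UniformSynthesizer",
        "IdentitySynthesizer",
        "my_synthesizer_function",
    ],
)

# Rank synthesizers by accuracy on the adult dataset, best first.
ranked = scores.sort_values("adult/accuracy", ascending=False)

# Select every metric that belongs to a given dataset by its column prefix.
adult_columns = [column for column in scores.columns if column.startswith("adult/")]
print(ranked[adult_columns])
```

In this subset, `IdentitySynthesizer` ranks first, which is expected for a baseline that returns the real data unchanged.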
Benchmarking the SDGym Synthesizers
If you want to run the SDGym benchmark on the SDGym Synthesizers, you can directly pass the corresponding class, or a list of classes, to the `sdgym.run` function.
For example, if you want to run the complete benchmark suite to evaluate all the existing synthesizers you can run (this will take a lot of time to run!):
import sdgym
from sdgym.synthesizers import (
    CLBNSynthesizer, CTGANSynthesizer, IdentitySynthesizer, IndependentSynthesizer,
    MedganSynthesizer, PrivBNSynthesizer, TableganSynthesizer, TVAESynthesizer,
    UniformSynthesizer, VEEGANSynthesizer)

all_synthesizers = [
    CLBNSynthesizer,
    IdentitySynthesizer,
    IndependentSynthesizer,
    MedganSynthesizer,
    PrivBNSynthesizer,
    TableganSynthesizer,
    CTGANSynthesizer,
    TVAESynthesizer,
    UniformSynthesizer,
    VEEGANSynthesizer,
]

# Omitting the datasets argument runs the benchmark on the complete suite.
scores = sdgym.run(synthesizers=all_synthesizers)
For further details about all the arguments and possibilities that the benchmark function offers, please refer to the benchmark documentation.
Additional References
- Datasets used in SDGym are detailed here.
- How to write a synthesizer is detailed here.
- How to use the benchmark function is detailed here.
- Detailed leaderboard results for all the releases are available here.
Related Projects
SDV
SDV, for Synthetic Data Vault, is the end-user library for synthesizing data in development under the HDI Project. SDV allows you to easily model and sample relational datasets using Copulas through a simple API. Other features include anonymization of Personal Identifiable Information (PII) and preserving relational integrity on sampled records.
CTGAN
CTGAN is the GAN-based model for synthesizing tabular data presented in the Modeling Tabular data using Conditional GAN paper. It is also developed by MIT's Data to AI Lab and is under active development.
TGAN
TGAN is another GAN-based model for synthesizing tabular data. It is also developed by MIT's Data to AI Lab and is under active development.
History
v0.2.2 - 2020-10-17
This version adds a rework of the benchmark function and a few new synthesizers.
New Features
- New CLI with `run`, `make-leaderboard` and `make-summary` commands
- Parallel execution via Dask or Multiprocessing
- Download datasets without executing the benchmark
- Support for Python from 3.6 to 3.8
New Synthesizers
- `sdv.tabular.CTGAN`
- `sdv.tabular.CopulaGAN`
- `sdv.tabular.GaussianCopulaOneHot`
- `sdv.tabular.GaussianCopulaCategorical`
- `sdv.tabular.GaussianCopulaCategoricalFuzzy`
v0.2.1 - 2020-05-12
New updated leaderboard and minor improvements.
New Features
- Add parameters for PrivBNSynthesizer - Issue #37 by @csala
v0.2.0 - 2020-04-10
New Benchmark API and lots of improved documentation.
New Features
- The benchmark function now returns a complete leaderboard instead of only one score
- Class Synthesizers can be directly passed to the benchmark function
Bug Fixes
- One hot encoding errors in the Independent, VEEGAN and Medgan Synthesizers.
- Proper usage of the `eval` mode during sampling.
- Fix improperly configured datasets.
v0.1.0 - 2019-08-07
First release to PyPI.