A framework to benchmark the performance of synthetic data generators for non-temporal tabular data
An open source project from the Data to AI Lab at MIT
Benchmarking framework for Synthetic Data Generators
- Website: https://sdv.dev
- Documentation: https://sdv.dev/SDV
- Repository: https://github.com/sdv-dev/SDGym
- License: MIT
- Development Status: Pre-Alpha
Overview
Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators based on SDV and SDMetrics.
SDGym is part of The Synthetic Data Vault project.
What is a Synthetic Data Generator?
A Synthetic Data Generator is a Python function (or method) that takes as input some data, which we call the real data, learns a model from it, and outputs new synthetic data that has the same structure and similar mathematical properties as the real one.
Please refer to the synthesizers documentation for instructions about how to implement your own Synthetic Data Generator and integrate with SDGym. You can also read about how to use the ones already included in SDGym and see how to run them.
Benchmark datasets
SDGym evaluates the performance of Synthetic Data Generators using single-table, multi-table and timeseries datasets stored as CSV files alongside an SDV Metadata JSON file.
Further details about the list of available datasets and how to add your own datasets to the collection can be found in the datasets documentation.
Install
SDGym can be installed with either of the following commands:

Using pip:

```bash
pip install sdgym
```

Using conda:

```bash
conda install -c sdv-dev -c conda-forge sdgym
```
For more installation options, please visit the SDGym installation guide.
Usage
Benchmarking your own Synthesizer
SDGym evaluates Synthetic Data Generators: Python functions (or classes) that take real data as input, learn a model from it, and output new synthetic data with the same structure and similar mathematical properties.
As an example, let us define a synthesizer function that applies the GaussianCopula model from SDV with a gaussian distribution:
```python
from sdv.tabular import GaussianCopula


def gaussian_copula(real_data, metadata):
    # Fit a GaussianCopula model on the single table in the dataset
    # and return a sampled synthetic table under the same name.
    gc = GaussianCopula(default_distribution='gaussian')
    table_name = metadata.get_tables()[0]
    gc.fit(real_data[table_name])
    return {table_name: gc.sample()}
```
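To make the expected interface concrete, here is a minimal identity "synthesizer" sketch that simply copies the real data. The plain-dict tables and the `{'tables': [...]}` metadata shape are hypothetical stand-ins chosen so the sketch runs without SDV installed; SDGym actually passes pandas DataFrames and an SDV Metadata object.

```python
# Hypothetical sketch: a synthesizer takes (real_data, metadata) and
# returns synthetic data with the same structure. The dict-based tables
# and metadata shape below are illustrative stand-ins, not SDGym's API.
def identity_synthesizer(real_data, metadata):
    table_names = metadata['tables']
    # Copy each table so the "synthetic" output is a new object.
    return {name: dict(real_data[name]) for name in table_names}

real = {'users': {'age': [23, 45, 31], 'city': ['NY', 'SF', 'LA']}}
meta = {'tables': ['users']}
synthetic = identity_synthesizer(real, meta)
```

An identity synthesizer like this is only useful as an upper-bound baseline (SDGym ships a real `Identity` baseline for that purpose), but it shows the contract: tables in, same-shaped tables out.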
> :information_source: You can learn how to create your own synthesizer function here.
We can now evaluate this function on the `asia` and `alarm` datasets:
```python
import sdgym

scores = sdgym.run(synthesizers=gaussian_copula, datasets=['asia', 'alarm'])
```
> :information_source: You can learn about the different arguments of the `sdgym.run` function here.
The output of the `sdgym.run` function will be a `pd.DataFrame` containing the results obtained by your synthesizer on each dataset.
| synthesizer | dataset | modality | metric | score | metric_time | model_time |
|---|---|---|---|---|---|---|
| gaussian_copula | asia | single-table | BNLogLikelihood | -2.842690 | 2.762427 | 0.752364 |
| gaussian_copula | alarm | single-table | BNLogLikelihood | -20.223178 | 7.009401 | 3.173832 |
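Since the results come back as one row per synthesizer/dataset/metric combination, summarizing them is ordinary data wrangling. Below is a plain-Python sketch using the two example rows above that averages each synthesizer's score across datasets; with the real `pd.DataFrame` output you would typically use a pandas `groupby` instead.

```python
from collections import defaultdict

# The two example result rows from the table above, as plain dicts.
rows = [
    {'synthesizer': 'gaussian_copula', 'dataset': 'asia',
     'metric': 'BNLogLikelihood', 'score': -2.842690},
    {'synthesizer': 'gaussian_copula', 'dataset': 'alarm',
     'metric': 'BNLogLikelihood', 'score': -20.223178},
]

# Accumulate each synthesizer's total score and row count,
# then divide to get the mean score across datasets.
totals = defaultdict(float)
counts = defaultdict(int)
for row in rows:
    totals[row['synthesizer']] += row['score']
    counts[row['synthesizer']] += 1

averages = {name: totals[name] / counts[name] for name in totals}
```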
Benchmarking the SDGym Synthesizers
If you want to run the SDGym benchmark on the SDGym Synthesizers, you can directly pass the corresponding class, or a list of classes, to the `sdgym.run` function.
For example, if you want to run the complete benchmark suite to evaluate all the existing synthesizers, you can run the following (:warning: this will take a long time to run!):
```python
import sdgym
from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, HMA1, Identity, Independent,
    MedGAN, PAR, PrivBN, SDV, TableGAN, TVAE,
    Uniform, VEEGAN)

all_synthesizers = [
    CLBN,
    CTGAN,
    CopulaGAN,
    HMA1,
    Identity,
    Independent,
    MedGAN,
    PAR,
    PrivBN,
    SDV,
    TVAE,
    TableGAN,
    Uniform,
    VEEGAN,
]
scores = sdgym.run(synthesizers=all_synthesizers)
```
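If you only want to benchmark a subset of the suite, you can filter the list by class name before passing it to `sdgym.run`. A small sketch, using stand-in classes so it runs without SDGym installed:

```python
# Stand-in synthesizer classes; in practice these would be the classes
# imported from sdgym.synthesizers as in the block above.
class CLBN: pass
class CTGAN: pass
class TVAE: pass

all_synthesizers = [CLBN, CTGAN, TVAE]

# Keep only the synthesizers whose class names appear in `wanted`.
wanted = {'CTGAN', 'TVAE'}
subset = [s for s in all_synthesizers if s.__name__ in wanted]
```

The resulting `subset` list can be passed to `sdgym.run(synthesizers=subset)` just like the full list.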
For further details about all the arguments and possibilities that the benchmark function offers, please refer to the benchmark documentation.
Additional References
- Datasets used in SDGym are detailed here.
- How to write a synthesizer is detailed here.
- How to use the benchmark function is detailed here.
- Detailed leaderboard results for all the releases are available here.
The Synthetic Data Vault
This repository is part of The Synthetic Data Vault Project
- Website: https://sdv.dev
- Documentation: https://sdv.dev/SDV
History
v0.3.1 - 2021-05-20
This release adds new features to store results and cache contents into an S3 bucket as well as a script to collect results from a cache dir and compile a single results CSV file.
Issues closed
- Collect cached results from s3 bucket - Issue #85 by @katxiao
- Store cache contents into an S3 bucket - Issue #81 by @katxiao
- Store SDGym results into an S3 bucket - Issue #80 by @katxiao
- Add a way to collect cached results - Issue #79 by @katxiao
- Allow reading datasets from private s3 bucket - Issue #74 by @katxiao
- Typos in the sdgym.run function docstring documentation - Issue #69 by @sbrugman
v0.3.0 - 2021-01-27
Major rework of the SDGym functionality to support a collection of new features:
- Add relational and timeseries model benchmarking.
- Use SDMetrics for model scoring.
- Update datasets format to match SDV metadata based storage format.
- Centralize default datasets collection in the `sdv-datasets` S3 bucket.
- Add options to download and use datasets from different S3 buckets.
- Rename synthesizers to baselines and adapt to the new metadata format.
- Add model execution and metric computation time logging.
- Add optional synthetic data and error traceback caching.
v0.2.2 - 2020-10-17
This version adds a rework of the benchmark function and a few new synthesizers.
New Features
- New CLI with `run`, `make-leaderboard` and `make-summary` commands
- Parallel execution via Dask or Multiprocessing
- Download datasets without executing the benchmark
- Support for Python 3.6 to 3.8
New Synthesizers
- `sdv.tabular.CTGAN`
- `sdv.tabular.CopulaGAN`
- `sdv.tabular.GaussianCopulaOneHot`
- `sdv.tabular.GaussianCopulaCategorical`
- `sdv.tabular.GaussianCopulaCategoricalFuzzy`
v0.2.1 - 2020-05-12
New updated leaderboard and minor improvements.
New Features
- Add parameters for PrivBNSynthesizer - Issue #37 by @csala
v0.2.0 - 2020-04-10
New Benchmark API and lots of improved documentation.
New Features
- The benchmark function now returns a complete leaderboard instead of only one score
- Class Synthesizers can be directly passed to the benchmark function
Bug Fixes
- One-hot encoding errors in the Independent, VEEGAN and MedGAN Synthesizers.
- Proper usage of the `eval` mode during sampling.
- Fix improperly configured datasets.
v0.1.0 - 2019-08-07
First release to PyPI.