A framework to benchmark the performance of synthetic data generators for non-temporal tabular data

An Open Source Project from the Data to AI Lab, at MIT

Benchmarking framework for Synthetic Data Generators

Overview

Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators based on SDV and SDMetrics.

SDGym is a part of The Synthetic Data Vault project.

What is a Synthetic Data Generator?

A Synthetic Data Generator is a Python function (or method) that takes as input some data, which we call the real data, learns a model from it, and outputs new synthetic data that has the same structure and similar mathematical properties as the real one.

Please refer to the synthesizers documentation for instructions about how to implement your own Synthetic Data Generator and integrate with SDGym. You can also read about how to use the ones already included in SDGym and see how to run them.

Benchmark datasets

SDGym evaluates the performance of Synthetic Data Generators using single-table, multi-table and timeseries datasets stored as CSV files alongside an SDV Metadata JSON file.

Further details about the list of available datasets and how to add your own datasets to the collection can be found in the datasets documentation.
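
As a rough illustration of the "CSV files alongside a metadata JSON file" layout, the sketch below writes a tiny dataset to a temporary directory and reads it back using only the standard library. The file names, the `demo` table and the metadata keys (`tables`, `path`) are assumptions made up for this example; the actual SDV Metadata schema is described in the datasets documentation and may differ.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_dataset(dataset_dir):
    """Read metadata.json and every CSV table it lists (assumed layout)."""
    dataset_dir = Path(dataset_dir)
    metadata = json.loads((dataset_dir / "metadata.json").read_text())
    tables = {}
    for table_name, table_meta in metadata["tables"].items():
        with open(dataset_dir / table_meta["path"], newline="") as csv_file:
            # Each row becomes a dict keyed by the CSV header.
            tables[table_name] = list(csv.DictReader(csv_file))
    return metadata, tables


# Build a throwaway dataset so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "demo.csv").write_text("age,smoker\n34,yes\n51,no\n")
    (tmp / "metadata.json").write_text(
        json.dumps({"tables": {"demo": {"path": "demo.csv"}}})
    )
    metadata, tables = load_dataset(tmp)

print(list(tables))         # table names found in the metadata
print(len(tables["demo"]))  # number of rows read from demo.csv
```

Note that `csv.DictReader` returns every value as a string; a real loader would also use the metadata to restore column types.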

Install

SDGym can be installed using the following commands:

Using pip:

pip install sdgym

Using conda:

conda install -c sdv-dev -c conda-forge sdgym

For more installation options, please visit the SDGym installation guide.
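
After installing, one way to confirm which version ended up in the active environment is to query the package metadata. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is introduced here for illustration and is not part of SDGym.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("sdgym")
if version is None:
    print("sdgym is not installed in this environment")
else:
    print(f"sdgym {version} is installed")
```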

Usage

Benchmarking your own Synthesizer

SDGym evaluates Synthetic Data Generators, which are Python functions (or classes) that take as input some data, which we call the real data, learn a model from it, and output new synthetic data that has the same structure and similar mathematical properties as the real one.

As an example, let us define a synthesizer function that applies the GaussianCopula model from SDV with a Gaussian distribution.

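from sdv.tabular import GaussianCopula


def gaussian_copula(real_data, metadata):
    # Model every column with a Gaussian marginal distribution.
    gc = GaussianCopula(default_distribution='gaussian')
    # Use the first (and, for single-table datasets, only) table.
    table_name = metadata.get_tables()[0]
    gc.fit(real_data[table_name])
    # Return the synthetic data keyed by table name, like the input.
    return {table_name: gc.sample()}

Note: you can learn how to create your own synthesizer function in the synthesizers documentation.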
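
To make the input and output shape of such a function more concrete without depending on SDV, here is a toy synthesizer that simply resamples each column independently. It is a sketch under assumptions: `StubMetadata` is a hypothetical stand-in that exposes only the `get_tables()` call used in the example above, the `seed` parameter is an addition for reproducibility, and whether `sdgym.run` accepts this function unchanged depends on the installed version.

```python
import pandas as pd


class StubMetadata:
    """Hypothetical stand-in exposing only the get_tables() call used above."""

    def __init__(self, table_names):
        self._table_names = list(table_names)

    def get_tables(self):
        return self._table_names


def independent_resample(real_data, metadata, seed=0):
    """Toy synthesizer: resample every column on its own, with replacement."""
    table_name = metadata.get_tables()[0]
    real_table = real_data[table_name]
    synthetic_columns = {}
    for position, column in enumerate(real_table.columns):
        # A different seed per column breaks the row-wise link between columns.
        sampled = real_table[column].sample(
            n=len(real_table), replace=True, random_state=seed + position
        )
        synthetic_columns[column] = sampled.to_numpy()
    return {table_name: pd.DataFrame(synthetic_columns)}


real_data = {
    "demo": pd.DataFrame(
        {"age": [34, 51, 29, 47], "smoker": ["yes", "no", "no", "yes"]}
    )
}
synthetic = independent_resample(real_data, StubMetadata(["demo"]))
print(synthetic["demo"].shape)
```

Because columns are sampled separately, this baseline keeps each column's marginal distribution but discards correlations between columns, which makes it a useful lower bound when comparing against models such as GaussianCopula.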

We can now try to evaluate this function on the asia and alarm datasets:

import sdgym

scores = sdgym.run(synthesizers=gaussian_copula, datasets=['asia', 'alarm'])

Note: you can learn about the different arguments of the sdgym.run function in the benchmark documentation.

The output of the sdgym.run function will be a pd.DataFrame containing the results obtained by your synthesizer on each dataset.

synthesizer      dataset  modality      metric           score       metric_time  model_time
gaussian_copula  asia     single-table  BNLogLikelihood  -2.842690   2.762427     0.752364
gaussian_copula  alarm    single-table  BNLogLikelihood  -20.223178  7.009401     3.173832
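
Since the result is an ordinary pd.DataFrame, it can be post-processed with the usual pandas tools. The sketch below rebuilds the two example rows shown above by hand (so it runs without SDGym) and derives a total time per run and a synthesizer-by-dataset view of the score. The `total_time` column is an addition for illustration and is not produced by sdgym.run.

```python
import pandas as pd

# The two example rows from the table above, entered manually.
scores = pd.DataFrame(
    {
        "synthesizer": ["gaussian_copula", "gaussian_copula"],
        "dataset": ["asia", "alarm"],
        "modality": ["single-table", "single-table"],
        "metric": ["BNLogLikelihood", "BNLogLikelihood"],
        "score": [-2.842690, -20.223178],
        "metric_time": [2.762427, 7.009401],
        "model_time": [0.752364, 3.173832],
    }
)

# Total time spent per row: modelling plus metric computation.
scores["total_time"] = scores["metric_time"] + scores["model_time"]

# One row per synthesizer and one column per dataset.
by_dataset = scores.pivot_table(
    index="synthesizer", columns="dataset", values="score"
)
print(by_dataset)
```

With several synthesizers and metrics in the results, filtering on the `metric` column before pivoting keeps scores on different scales from being averaged together.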

Benchmarking the SDGym Synthesizers

If you want to run the SDGym benchmark on the SDGym Synthesizers you can directly pass the corresponding class, or a list of classes, to the sdgym.run function.

For example, if you want to run the complete benchmark suite to evaluate all the existing synthesizers you can run the following (warning: this will take a long time to run):

from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, HMA1, Identity, Independent,
    MedGAN, PAR, PrivBN, SDV, TableGAN, TVAE,
    Uniform, VEEGAN)

all_synthesizers = [
    CLBN,
    CTGAN,
    CopulaGAN,
    HMA1,
    Identity,
    Independent,
    MedGAN,
    PAR,
    PrivBN,
    SDV,
    TVAE,
    TableGAN,
    Uniform,
    VEEGAN,
]
scores = sdgym.run(synthesizers=all_synthesizers)

For further details about all the arguments and possibilities that the benchmark function offers, please refer to the benchmark documentation.

Additional References

  • Datasets used in SDGym are detailed here.
  • How to write a synthesizer is detailed here.
  • How to use benchmark function is detailed here.
  • Detailed leaderboard results for all the releases are available here.

The Synthetic Data Vault

This repository is part of The Synthetic Data Vault Project

History

v0.4.1 - 2021-06-17

This release fixes a bug where passing a JSON file as configuration for a multi-table synthesizer crashed the model. It also adds a number of fixes and enhancements, including: (1) a function and CLI command to list the available synthesizer names, (2) a curated set of dependencies, with Gretel made an optional dependency, (3) updating Gretel to use temp directories, (4) using nvidia-smi to get the number of GPUs and (5) multiple Dockerfile updates to improve functionality.

Issues closed

  • Bug when using JSON configuration for multiple multi-table evaluation - Issue #115 by @pvk-developer
  • Use nvidia-smi to get number of gpus - PR #113 by @katxiao
  • List synthesizer names - Issue #82 by @fealho
  • Use nvidia base for dockerfile - PR #108 by @katxiao
  • Add Makefile target to install gretel and ydata - PR #107 by @katxiao
  • Curate dependencies and make Gretel optional - PR #106 by @csala
  • Update gretel checkpoints to use temp directory - PR #105 by @katxiao
  • Initialize variable before reference - PR #104 by @katxiao

v0.4.0 - 2021-06-17

This release adds new synthesizers for Gretel and ydata, and creates a Docker image for SDGym. It also includes enhancements to the accepted SDGym arguments, adds a summary command to aggregate metrics, and adds the normalized score to the benchmark results.

New Features

  • Add normalized score to benchmark results - Issue #102 by @katxiao
  • Add max rows and max columns args - Issue #96 by @katxiao
  • Automatically detect number of workers - Issue #97 by @katxiao
  • Add summary function and command - Issue #92 by @amontanez24
  • Allow jobs list/JSON to be passed - Issue #93 by @fealho
  • Add ydata to sdgym - Issue #90 by @fealho
  • Add dockerfile for sdgym - Issue #88 by @katxiao
  • Add Gretel to SDGym synthesizer - Issue #87 by @amontanez24

v0.3.1 - 2021-05-20

This release adds new features to store results and cache contents into an S3 bucket as well as a script to collect results from a cache dir and compile a single results CSV file.

Issues closed

  • Collect cached results from s3 bucket - Issue #85 by @katxiao
  • Store cache contents into an S3 bucket - Issue #81 by @katxiao
  • Store SDGym results into an S3 bucket - Issue #80 by @katxiao
  • Add a way to collect cached results - Issue #79 by @katxiao
  • Allow reading datasets from private s3 bucket - Issue #74 by @katxiao
  • Typos in the sdgym.run function docstring documentation - Issue #69 by @sbrugman

v0.3.0 - 2021-01-27

Major rework of the SDGym functionality to support a collection of new features:

  • Add relational and timeseries model benchmarking.
  • Use SDMetrics for model scoring.
  • Update datasets format to match SDV metadata based storage format.
  • Centralize default datasets collection in the sdv-datasets S3 bucket.
  • Add options to download and use datasets from different S3 buckets.
  • Rename synthesizers to baselines and adapt to the new metadata format.
  • Add model execution and metric computation time logging.
  • Add optional synthetic data and error traceback caching.

v0.2.2 - 2020-10-17

This version adds a rework of the benchmark function and a few new synthesizers.

New Features

  • New CLI with run, make-leaderboard and make-summary commands
  • Parallel execution via Dask or Multiprocessing
  • Download datasets without executing the benchmark
  • Support for Python from 3.6 to 3.8

New Synthesizers

  • sdv.tabular.CTGAN
  • sdv.tabular.CopulaGAN
  • sdv.tabular.GaussianCopulaOneHot
  • sdv.tabular.GaussianCopulaCategorical
  • sdv.tabular.GaussianCopulaCategoricalFuzzy

v0.2.1 - 2020-05-12

New updated leaderboard and minor improvements.

New Features

  • Add parameters for PrivBNSynthesizer - Issue #37 by @csala

v0.2.0 - 2020-04-10

New Benchmark API and lots of improved documentation.

New Features

  • The benchmark function now returns a complete leaderboard instead of only one score
  • Class Synthesizers can be directly passed to the benchmark function

Bug Fixes

  • One hot encoding errors in the Independent, VEEGAN and MedGAN Synthesizers.
  • Proper usage of the eval mode during sampling.
  • Fix improperly configured datasets.

v0.1.0 - 2019-08-07

First release to PyPI.
