Metrics for Synthetic Data Generation Projects

An Open Source Project from the Data to AI Lab, at MIT

Overview

The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after.

It supports multiple data modalities:

  • Single Columns: Compare 1-dimensional numpy arrays representing individual columns.
  • Column Pairs: Compare how columns in a pandas.DataFrame relate to each other, in groups of 2.
  • Single Table: Compare an entire table, represented as a pandas.DataFrame.
  • Multi Table: Compare multi-table and relational datasets represented as a Python dict with multiple tables passed as pandas.DataFrames.
  • Time Series: Compare tables representing ordered sequences of events.
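
As a rough illustration of the input shapes listed above, the following sketch builds one toy object per modality using only pandas. The table names, column names, and values are hypothetical; they are not part of SDMetrics or its demo data, and the sketch does not call any SDMetrics function.

```python
import pandas as pd

# Hypothetical toy tables; names and values are invented for illustration.
users = pd.DataFrame({
    'user_id': [0, 1, 2],
    'age': [34, 45, 27],
    'country': ['ES', 'US', 'FR'],
})
sessions = pd.DataFrame({
    'session_id': [0, 1, 2, 3],
    'user_id': [0, 0, 1, 2],
    'timestamp': pd.to_datetime(
        ['2020-01-03', '2020-01-01', '2020-01-02', '2020-01-05']
    ),
    'duration': [12.5, 3.0, 7.25, 9.0],
})

# Single column: a 1-dimensional numpy array.
single_column = users['age'].to_numpy()

# Column pair: a DataFrame restricted to two columns.
column_pair = users[['age', 'country']]

# Single table: one complete DataFrame.
single_table = users

# Multi table: a dict mapping table names to DataFrames.
multi_table = {'users': users, 'sessions': sessions}

# Time series: a table of events ordered within each entity (assumed here
# to be identified by 'user_id' and ordered by 'timestamp').
time_series = sessions.sort_values(['user_id', 'timestamp']).reset_index(drop=True)
```

The real and synthetic inputs to a comparison would each take one of these shapes, with matching columns on both sides.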

It includes a variety of metrics such as:

  • Statistical metrics which use statistical tests to compare the distributions of the real and synthetic data.
  • Detection metrics which use machine learning to try to distinguish between real and synthetic data.
  • Efficacy metrics which compare the performance of machine learning models when run on the synthetic and real data.
  • Bayesian Network and Gaussian Mixture metrics which learn the distribution of the real data and evaluate the likelihood of the synthetic data belonging to the learned distribution.
  • Privacy metrics which evaluate whether the synthetic data is leaking information about the real data.
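
To make the idea behind a statistical metric concrete, here is a minimal pure-Python sketch of a two-sample Kolmogorov-Smirnov comparison between a real and a synthetic numerical column. It is not SDMetrics' implementation: the function names `ks_statistic` and `ks_similarity` are hypothetical, and turning the statistic into a score as `1 - statistic` is an assumption made here so that higher means more similar.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest gap between the empirical CDFs of two samples."""
    if not real or not synthetic:
        raise ValueError('Both samples must be non-empty')

    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)

    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every value seen in either sample.
    max_gap = 0.0
    for x in sorted(set(real_sorted) | set(synth_sorted)):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))

    return max_gap


def ks_similarity(real, synthetic):
    """Hypothetical score in [0, 1]; 1.0 means identical empirical CDFs."""
    return 1.0 - ks_statistic(real, synthetic)


identical = ks_similarity([1, 2, 3], [1, 2, 3])
disjoint = ks_similarity([1, 2, 3], [4, 5, 6])
```

With these inputs, `identical` is 1.0 because the two empirical CDFs coincide, and `disjoint` is 0.0 because the samples do not overlap at all.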

Install

Requirements

SDMetrics has been developed and tested on Python 3.6, 3.7 and 3.8.

Although it is not strictly required, using a virtualenv is highly recommended to avoid interfering with other software installed on the system where SDMetrics is run.

Install with pip

The easiest and recommended way to install SDMetrics is using pip:

pip install sdmetrics

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project, please read the Contributing Guide.

Install with conda

SDMetrics can also be installed using conda:

conda install -c sdv-dev -c conda-forge sdmetrics

This will pull and install the latest stable release from Anaconda.

Basic Usage

The following snippet shows how to use SDMetrics to evaluate how similar a toy multi-table dataset and its synthetic replica are:

  1. The demo data is loaded.
  2. The list of available multi-table metrics is retrieved.
  3. All the metrics are run to compare the real and synthetic data.
  4. A pandas.DataFrame is built with the results.

import pandas as pd
import sdmetrics

# Load the demo data, which includes:
# - A dict containing the real tables as pandas.DataFrames.
# - A dict containing the synthetic clones of the real data.
# - A dict containing metadata about the tables.
real_data, synthetic_data, metadata = sdmetrics.load_demo()

# Obtain the list of multi table metrics, which is returned as a dict
# containing the metric names and the corresponding metric classes.
metrics = sdmetrics.multi_table.MultiTableMetric.get_subclasses()

# Iterate over the metrics and compute them, capturing the scores obtained.
scores = []
for name, metric in metrics.items():
    try:
        scores.append({
            'metric': name,
            'score': metric.compute(real_data, synthetic_data, metadata)
        })
    except ValueError:
        pass   # Ignore metrics that do not support this data

# Put the results in a DataFrame for pretty printing.
scores = pd.DataFrame(scores)

The result will be a table containing the list of metrics that have been computed and the scores obtained, similar to this one:

metric                          score
CSTest                          0.76651
KSTest                          0.75
KSTestExtended                  0.777778
LogisticDetection               0.925926
SVCDetection                    0.703704
LogisticParentChildDetection    0.541667
SVCParentChildDetection         0.923611
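
Once the results are in a pandas.DataFrame, they can be inspected with ordinary pandas operations. The sketch below rebuilds the sample table above by hand (so it runs without SDMetrics), looks up one score by metric name, and sorts the rows for display. Whether scores from different metrics are directly comparable depends on how each metric is defined, which this page does not state, so the ranking is only a display convenience.

```python
import pandas as pd

# Sample scores copied from the table above.
scores = pd.DataFrame([
    {'metric': 'CSTest', 'score': 0.76651},
    {'metric': 'KSTest', 'score': 0.75},
    {'metric': 'KSTestExtended', 'score': 0.777778},
    {'metric': 'LogisticDetection', 'score': 0.925926},
    {'metric': 'SVCDetection', 'score': 0.703704},
    {'metric': 'LogisticParentChildDetection', 'score': 0.541667},
    {'metric': 'SVCParentChildDetection', 'score': 0.923611},
])

# Look up a single score by metric name.
by_metric = scores.set_index('metric')['score']
kstest_score = by_metric['KSTest']

# Sort from highest to lowest score for easier reading.
ranked = scores.sort_values('score', ascending=False).reset_index(drop=True)
print(ranked)
```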

What's next?

For more details about SDMetrics and SDV please visit the documentation site.

More details about each individual type of metrics can also be found here:

The Synthetic Data Vault

This repository is part of The Synthetic Data Vault Project.

History

v0.0.4 - 2020-11-27

Patch release to relax dependencies and avoid conflicts when using the latest SDV version.

v0.0.3 - 2020-11-20

Fix error on detection metrics when input data contains infinity or NaN values.

Issues closed

  • ValueError: Input contains infinity or a value too large for dtype('float64') - Issue #11 by @csala

v0.0.2 - 2020-08-08

Add support for Python 3.8 and a broader range of dependencies.

v0.0.1 - 2020-06-26

First release to PyPI.
