
Metrics for Synthetic Data Generation Projects

Project description

DAI-Lab: an open source project from the Data to AI Lab at MIT.


Overview

The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after. It includes a variety of metrics such as:

  • Statistical metrics, which use statistical tests to compare the distributions of the real and synthetic data.
  • Detection metrics, which use machine learning to try to distinguish between real and synthetic data.
  • Descriptive metrics, which compute descriptive statistics on the real and synthetic datasets independently and then compare the values.
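As a toy illustration of the descriptive approach (a minimal sketch, not the SDMetrics API): compute the same statistic on each dataset independently, then compare the results.

```python
# Minimal sketch of a descriptive metric (not the SDMetrics API):
# compute the same statistic on each dataset independently, then compare.
real = [5.0, 7.0, 8.0, 6.0, 9.0]
synthetic = [5.5, 6.5, 8.5, 6.0, 8.0]

def mean(values):
    return sum(values) / len(values)

# The metric value is the absolute difference between the two means;
# lower means the synthetic data matches the real data more closely.
score = abs(mean(real) - mean(synthetic))
print(score)
```

The built-in metrics are more sophisticated, but follow the same shape: summarize each dataset, then compare the summaries.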

Install

Requirements

SDMetrics has been developed and tested on Python 3.6, 3.7, and 3.8.

Although it is not strictly required, using a virtualenv is highly recommended in order to avoid interfering with other software installed on the system where SDMetrics is run.
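For example, a virtualenv can be set up like this (the directory name sdmetrics-env is arbitrary):

```shell
# Create an isolated environment in the sdmetrics-env directory
python3 -m venv sdmetrics-env

# Activate it; any subsequent `pip install` now targets this environment only
. sdmetrics-env/bin/activate
```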

Install with pip

The easiest and recommended way to install SDMetrics is using pip:

pip install sdmetrics

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project please read the Contributing Guide.

Basic Usage

Let's run the demo code from SDV to generate a simple synthetic dataset:

from sdv import load_demo, SDV

metadata, real_tables = load_demo(metadata=True)

sdv = SDV()
sdv.fit(metadata, real_tables)

synthetic_tables = sdv.sample_all(20)

Now that we have a synthetic dataset, we can evaluate it using SDMetrics by calling the evaluate function, which returns an instance of MetricsReport with the default metrics:

from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)

Examining Metrics

This report object makes it easy to examine the metrics at different levels of granularity. For example, the overall method returns a single scalar value which functions as a composite score combining all of the metrics. This score can be passed to an optimization routine (e.g. to tune the hyperparameters in a model) and minimized in order to obtain higher quality synthetic data.

print(report.overall())
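A hypothetical sketch of such a tuning loop; train_and_score is a stand-in (not a real API) for fitting a model with a given hyperparameter, sampling synthetic data, and returning report.overall():

```python
# Hypothetical sketch: minimize the composite score over hyperparameter candidates.
# `train_and_score` stands in for: fit a model with this setting, sample
# synthetic data, evaluate it, and return report.overall(); it is NOT a real API.
def train_and_score(epochs):
    # Toy objective: pretend synthetic quality is best around epochs=30
    return (epochs - 30) ** 2 / 100.0

candidates = [10, 20, 30, 40, 50]

# Pick the candidate whose synthetic data scores best (lowest)
best = min(candidates, key=train_and_score)
print(best)
```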

In addition, the report provides a highlights method which identifies the worst performing metrics. This provides useful hints to help users identify where their synthetic data falls short (e.g. which tables, columns, or relationships are not being modeled properly).

print(report.highlights())

Visualizing Metrics

Finally, the report object provides a visualize method which generates a figure showing some of the key metrics.

figure = report.visualize()
figure.savefig("sdmetrics-report.png")

Advanced Usage

Specifying Metrics

Instead of running all the default metrics, you can specify exactly which metrics you want to run by creating an empty MetricsReport and adding the metrics yourself. For example, the following code computes only the machine learning detection-based metrics.

from sdmetrics import detection
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(detection.metrics(metadata, real_tables, synthetic_tables))

Creating Metrics

Suppose you want to add some new metrics to this library. To do this, you simply need to write a function which yields instances of the Metric object:

from sdmetrics.report import Metric

def my_custom_metrics(metadata, real_tables, synthetic_tables):
    name = "abs-diff-in-number-of-rows"

    for table_name in metadata.get_tables():

        # Absolute difference in number of rows
        nb_real_rows = len(real_tables[table_name])
        nb_synthetic_rows = len(synthetic_tables[table_name])
        value = float(abs(nb_real_rows - nb_synthetic_rows))

        # Specify some useful tags for the user
        tags = {
            "priority:high",
            f"table:{table_name}",
        }

        yield Metric(name, value, tags)

To attach your metrics to a MetricsReport object, you can use the add_metrics method and provide your custom metrics iterator:

from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(my_custom_metrics(metadata, real_tables, synthetic_tables))

See sdmetrics.detection, sdmetrics.efficacy, and sdmetrics.statistical for more examples of how to implement metrics.

Filtering Metrics

The MetricsReport object includes a details method which returns all of the metrics that were computed.

from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)
report.details()

To filter these metrics, you can provide a filter function. For example, to only see metrics that are associated with the users table, you can run:

def my_custom_filter(metric):
    return "table:users" in metric.tags

report.details(my_custom_filter)

Examples of standard tags implemented by the built-in metrics are shown below.

  • priority:high: Tells the user to pay extra attention to this metric. It typically indicates that the objects being evaluated by the metric are unusually bad (i.e. the synthetic values look very different from the real values).
  • table:TABLE_NAME: Indicates that the metric involves the table specified by TABLE_NAME.
  • column:COL_NAME: Indicates that the metric involves the column specified by COL_NAME. If the column names are not unique across the entire database, this tag needs to be combined with the table:TABLE_NAME tag to uniquely identify a specific column.
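For instance, tags can be combined in a filter. A sketch, using a stand-in Metric with the same (name, value, tags) shape as the objects yielded in the custom-metrics example above (not the real class):

```python
from collections import namedtuple

# Stand-in with the same (name, value, tags) shape as the Metric objects
# in the custom-metrics example; NOT the real sdmetrics class.
Metric = namedtuple("Metric", ["name", "value", "tags"])

metrics = [
    Metric("abs-diff-in-number-of-rows", 3.0, {"priority:high", "table:users"}),
    Metric("abs-diff-in-number-of-rows", 0.0, {"table:sessions"}),
]

# Keep only high-priority metrics that involve the users table
flagged = [
    m for m in metrics
    if "priority:high" in m.tags and "table:users" in m.tags
]
print([m.name for m in flagged])
```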

As this library matures, we will define additional standard tags and/or promote them to first-class attributes.

What's next?

For more details about SDMetrics and all its possibilities and features, please check the documentation site.

History

v0.0.4 - 2020-11-27

Patch release to relax dependencies and avoid conflicts when using the latest SDV version.

v0.0.3 - 2020-11-20

Fix error on detection metrics when input data contains infinity or NaN values.

Issues closed

  • ValueError: Input contains infinity or a value too large for dtype('float64') - Issue #11 by @csala

v0.0.2 - 2020-08-08

Add support for Python 3.8 and a broader range of dependencies.

v0.0.1 - 2020-06-26

First release to PyPI.


Download files

Download the file for your platform.

Source Distribution

sdmetrics-0.0.4.tar.gz (140.2 kB)

Uploaded: Source

Built Distribution

sdmetrics-0.0.4-py2.py3-none-any.whl (24.3 kB)

Uploaded: Python 2, Python 3

File details

Details for the file sdmetrics-0.0.4.tar.gz.

File metadata

  • Download URL: sdmetrics-0.0.4.tar.gz
  • Upload date:
  • Size: 140.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/49.1.3 requests-toolbelt/0.9.1 tqdm/4.52.0 CPython/3.8.6

File hashes

Hashes for sdmetrics-0.0.4.tar.gz:

  • SHA256: 5244aef229153dd6a778f13ef32a5b14ec87d25dabf3be799621c386636ae81e
  • MD5: 6f074a77b20281d4b1d16beecb68bab8
  • BLAKE2b-256: b776dce396f085abf41faa144791ab3aa0552e26d9bce72fc2c852921a8da2e5

File details

Details for the file sdmetrics-0.0.4-py2.py3-none-any.whl.

File metadata

  • Download URL: sdmetrics-0.0.4-py2.py3-none-any.whl
  • Upload date:
  • Size: 24.3 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/49.1.3 requests-toolbelt/0.9.1 tqdm/4.52.0 CPython/3.8.6

File hashes

Hashes for sdmetrics-0.0.4-py2.py3-none-any.whl:

  • SHA256: b713cb097fcb3eb73ae0b04dea03def0f8b0baa364ab1b23ec77c8b512c17cfe
  • MD5: dbf09a4be5dcef9508d35a9ae993f694
  • BLAKE2b-256: 48e1b809bd3a6cbea5433cf4bd693acfaeab707031d339a7d423056c1ebaf3da
