
Metrics for Synthetic Data Generation Projects

Project description

DAI-Lab: an open source project from the Data to AI Lab at MIT.


Overview

The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after. It includes a variety of metrics such as:

  • Statistical metrics, which use statistical tests to compare the distributions of the real and synthetic data.
  • Detection metrics, which use machine learning to try to distinguish between real and synthetic data.
  • Descriptive metrics, which compute descriptive statistics on the real and synthetic datasets independently and then compare the values.
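As a toy illustration of the descriptive approach (a minimal sketch, not the SDMetrics API): compute the same statistic on each dataset independently, then compare the results.

```python
# Minimal sketch of a descriptive metric (not the SDMetrics API):
# compute the same statistic on each dataset independently, then compare.
real = [5.0, 7.0, 8.0, 6.0, 9.0]
synthetic = [5.5, 6.5, 8.5, 6.0, 8.0]

def mean(values):
    return sum(values) / len(values)

# The metric value is the absolute difference between the two means;
# lower means the synthetic data matches the real data more closely.
score = abs(mean(real) - mean(synthetic))
print(score)
```

The built-in metrics are more sophisticated, but follow the same shape: summarize each dataset, then compare the summaries.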

Install

Requirements

SDMetrics has been developed and tested on Python 3.6, 3.7, and 3.8.

Although it is not strictly required, using a virtualenv is highly recommended in order to avoid interfering with other software installed on the system where SDMetrics is run.
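For example, a virtualenv can be set up like this (the directory name sdmetrics-env is arbitrary):

```shell
# Create an isolated environment in the sdmetrics-env directory
python3 -m venv sdmetrics-env

# Activate it; any subsequent `pip install` now targets this environment only
. sdmetrics-env/bin/activate
```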

Install with pip

The easiest and recommended way to install SDMetrics is using pip:

pip install sdmetrics

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project please read the Contributing Guide.

Basic Usage

Let's run the demo code from SDV to generate a simple synthetic dataset:

from sdv import load_demo, SDV

metadata, real_tables = load_demo(metadata=True)

sdv = SDV()
sdv.fit(metadata, real_tables)

synthetic_tables = sdv.sample_all(20)

Now that we have a synthetic dataset, we can evaluate it using SDMetrics by calling the evaluate function, which returns an instance of MetricsReport with the default metrics:

from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)

Examining Metrics

This report object makes it easy to examine the metrics at different levels of granularity. For example, the overall method returns a single scalar value which functions as a composite score combining all of the metrics. This score can be passed to an optimization routine (e.g. to tune the hyperparameters in a model) and minimized in order to obtain higher quality synthetic data.

print(report.overall())
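A hypothetical sketch of such a tuning loop; train_and_score is a stand-in (not a real API) for fitting a model with a given hyperparameter, sampling synthetic data, and returning report.overall():

```python
# Hypothetical sketch: minimize the composite score over hyperparameter candidates.
# `train_and_score` stands in for: fit a model with this setting, sample
# synthetic data, evaluate it, and return report.overall(); it is NOT a real API.
def train_and_score(epochs):
    # Toy objective: pretend synthetic quality is best around epochs=30
    return (epochs - 30) ** 2 / 100.0

candidates = [10, 20, 30, 40, 50]

# Pick the candidate whose synthetic data scores best (lowest)
best = min(candidates, key=train_and_score)
print(best)
```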

In addition, the report provides a highlights method which identifies the worst performing metrics. This provides useful hints to help users identify where their synthetic data falls short (e.g. which tables, columns, or relationships are not being modeled properly).

print(report.highlights())

Visualizing Metrics

Finally, the report object provides a visualize method which generates a figure showing some of the key metrics.

figure = report.visualize()
figure.savefig("sdmetrics-report.png")

Advanced Usage

Specifying Metrics

Instead of running all the default metrics, you can specify exactly which metrics you want to run by creating an empty MetricsReport and adding the metrics yourself. For example, the following code computes only the machine learning detection-based metrics.

from sdmetrics import detection
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(detection.metrics(metadata, real_tables, synthetic_tables))

Creating Metrics

Suppose you want to add some new metrics to this library. To do this, you simply need to write a function which yields instances of the Metric object:

from sdmetrics.report import Metric

def my_custom_metrics(metadata, real_tables, synthetic_tables):
    name = "abs-diff-in-number-of-rows"

    for table_name in metadata.get_tables():

        # Absolute difference in number of rows
        nb_real_rows = len(real_tables[table_name])
        nb_synthetic_rows = len(synthetic_tables[table_name])
        value = float(abs(nb_real_rows - nb_synthetic_rows))

        # Specify some useful tags for the user
        tags = {
            "priority:high",
            f"table:{table_name}",
        }

        yield Metric(name, value, tags)

To attach your metrics to a MetricsReport object, you can use the add_metrics method and provide your custom metrics iterator:

from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(my_custom_metrics(metadata, real_tables, synthetic_tables))

See sdmetrics.detection, sdmetrics.efficacy, and sdmetrics.statistical for more examples of how to implement metrics.

Filtering Metrics

The MetricsReport object includes a details method which returns all of the metrics that were computed.

from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)
report.details()

To filter these metrics, you can provide a filter function. For example, to only see metrics that are associated with the users table, you can run:

def my_custom_filter(metric):
    return "table:users" in metric.tags

report.details(my_custom_filter)

Examples of standard tags implemented by the built-in metrics are shown below.

  • priority:high: Tells the user to pay extra attention to this metric. It typically indicates that the objects being evaluated by the metric are unusually bad (i.e. the synthetic values look very different from the real values).
  • table:TABLE_NAME: Indicates that the metric involves the table specified by TABLE_NAME.
  • column:COL_NAME: Indicates that the metric involves the column specified by COL_NAME. If the column names are not unique across the entire database, this tag needs to be combined with the table:TABLE_NAME tag to uniquely identify a specific column.
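For instance, tags can be combined in a filter. A sketch, using a stand-in Metric with the same (name, value, tags) shape as the objects yielded in the custom-metrics example above (not the real class):

```python
from collections import namedtuple

# Stand-in with the same (name, value, tags) shape as the Metric objects
# in the custom-metrics example; NOT the real sdmetrics class.
Metric = namedtuple("Metric", ["name", "value", "tags"])

metrics = [
    Metric("abs-diff-in-number-of-rows", 3.0, {"priority:high", "table:users"}),
    Metric("abs-diff-in-number-of-rows", 0.0, {"table:sessions"}),
]

# Keep only high-priority metrics that involve the users table
flagged = [
    m for m in metrics
    if "priority:high" in m.tags and "table:users" in m.tags
]
print([m.name for m in flagged])
```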

As this library matures, we will define additional standard tags and/or promote them to first-class attributes.

What's next?

For more details about SDMetrics and all its possibilities and features, please check the documentation site.

History

v0.0.4 - 2020-11-27

Patch release to relax dependencies and avoid conflicts when using the latest SDV version.

v0.0.3 - 2020-11-20

Fix error on detection metrics when input data contains infinity or NaN values.

Issues closed

  • ValueError: Input contains infinity or a value too large for dtype('float64') - Issue #11 by @csala

v0.0.2 - 2020-08-08

Add support for Python 3.8 and a broader range of dependencies.

v0.0.1 - 2020-06-26

First release to PyPI.


Download files

Download the file for your platform.

Source Distribution

sdmetrics-0.0.4.tar.gz (140.2 kB)

Uploaded: Source

Built Distribution

sdmetrics-0.0.4-py2.py3-none-any.whl (24.3 kB)

Uploaded: Python 2, Python 3

File details

Details for the file sdmetrics-0.0.4.tar.gz.

File metadata

  • Download URL: sdmetrics-0.0.4.tar.gz
  • Upload date:
  • Size: 140.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/49.1.3 requests-toolbelt/0.9.1 tqdm/4.52.0 CPython/3.8.6

File hashes

Hashes for sdmetrics-0.0.4.tar.gz:

  • SHA256: 5244aef229153dd6a778f13ef32a5b14ec87d25dabf3be799621c386636ae81e
  • MD5: 6f074a77b20281d4b1d16beecb68bab8
  • BLAKE2b-256: b776dce396f085abf41faa144791ab3aa0552e26d9bce72fc2c852921a8da2e5

File details

Details for the file sdmetrics-0.0.4-py2.py3-none-any.whl.

File metadata

  • Download URL: sdmetrics-0.0.4-py2.py3-none-any.whl
  • Upload date:
  • Size: 24.3 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/49.1.3 requests-toolbelt/0.9.1 tqdm/4.52.0 CPython/3.8.6

File hashes

Hashes for sdmetrics-0.0.4-py2.py3-none-any.whl:

  • SHA256: b713cb097fcb3eb73ae0b04dea03def0f8b0baa364ab1b23ec77c8b512c17cfe
  • MD5: dbf09a4be5dcef9508d35a9ae993f694
  • BLAKE2b-256: 48e1b809bd3a6cbea5433cf4bd693acfaeab707031d339a7d423056c1ebaf3da
