Metrics for Synthetic Data Generation Projects
An open source project from Data to AI Lab at MIT.
- License: MIT
- Development Status: Pre-Alpha
- Documentation: https://sdv-dev.github.io/SDMetrics
- Homepage: https://github.com/sdv-dev/SDMetrics
Overview
The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after. It includes a variety of metrics such as:
- Statistical metrics which use statistical tests to compare the distributions of the real and synthetic data.
- Detection metrics which use machine learning to try to distinguish between real and synthetic data.
- Descriptive metrics which compute descriptive statistics on the real and synthetic datasets independently and then compare the values.
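To make the first category concrete, here is a minimal, standalone sketch of the idea behind a statistical metric: a two-sample Kolmogorov-Smirnov test comparing one real column against its synthetic counterpart. This uses `scipy` directly and is an illustration of the concept, not SDMetrics' internal implementation.

```python
import numpy as np
from scipy.stats import ks_2samp

# Toy data standing in for one numeric column of the real and synthetic tables.
rng = np.random.default_rng(seed=0)
real_column = rng.normal(loc=0.0, scale=1.0, size=1000)
synthetic_column = rng.normal(loc=0.1, scale=1.1, size=1000)

# The KS statistic measures the largest gap between the two empirical CDFs.
statistic, p_value = ks_2samp(real_column, synthetic_column)

# A high p-value means the test cannot distinguish the two distributions,
# which suggests the synthetic column is statistically similar to the real one.
print("KS statistic: %.3f, p-value: %.3f" % (statistic, p_value))
```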
Install
Requirements
SDMetrics has been developed and tested on Python 3.5, 3.6, and 3.7.
Although not strictly required, using a virtualenv is highly recommended to avoid interfering with other software installed on the system where SDMetrics is run.
Install with pip
The easiest and recommended way to install SDMetrics is using pip:
```bash
pip install sdmetrics
```
This will pull and install the latest stable release from PyPI.
If you want to install from source or contribute to the project please read the Contributing Guide.
Basic Usage
Let's run the demo code from SDV to generate a simple synthetic dataset:
```python
from sdv import load_demo, SDV

metadata, real_tables = load_demo(metadata=True)

sdv = SDV()
sdv.fit(metadata, real_tables)
synthetic_tables = sdv.sample_all(20)
```
Now that we have a synthetic dataset, we can evaluate it using SDMetrics by calling the `evaluate` function, which returns an instance of `MetricsReport` with the default metrics:
```python
from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)
```
Examining Metrics
This `report` object makes it easy to examine the metrics at different levels of granularity. For example, the `overall` method returns a single scalar value that serves as a composite score combining all of the metrics. This score can be passed to an optimization routine (e.g. to tune the hyperparameters in a model) and minimized in order to obtain higher quality synthetic data.
```python
print(report.overall())
```
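The idea of minimizing this score in an optimization loop can be sketched as follows. Here `score_for` is a hypothetical stand-in for fitting a model with a given hyperparameter, sampling synthetic data, and calling `report.overall()` on the result; it is not part of SDMetrics.

```python
# Hypothetical stand-in objective: in a real workflow this would fit the
# model with `n_components`, generate synthetic tables, run evaluate(), and
# return report.overall(). Here we pretend quality is best at n_components=4.
def score_for(n_components):
    return abs(n_components - 4) * 0.1

# A trivial grid search: pick the candidate with the lowest (best) score.
candidates = [1, 2, 4, 8, 16]
best = min(candidates, key=score_for)
print("best hyperparameter:", best)  # → 4
```

Any optimizer that minimizes a scalar objective (grid search, Bayesian optimization, etc.) can be plugged in the same way.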
In addition, the `report` object provides a `highlights` method which identifies the worst performing metrics. This provides useful hints to help users identify where their synthetic data falls short (e.g. which tables, columns, or relationships are not being modeled properly).
```python
print(report.highlights())
```
Visualizing Metrics
Finally, the `report` object provides a `visualize` method which generates a figure showing some of the key metrics.
```python
figure = report.visualize()
figure.savefig("sdmetrics-report.png")
```
Advanced Usage
Specifying Metrics
Instead of running all the default metrics, you can specify exactly which metrics you want to run by creating an empty `MetricsReport` and adding the metrics yourself. For example, the following code only computes the machine learning detection-based metrics.
```python
from sdmetrics import detection
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(detection.metrics(metadata, real_tables, synthetic_tables))
```
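To build intuition for what a detection metric measures, here is a toy, numpy-only sketch (not SDMetrics' implementation): label real rows 0 and synthetic rows 1, fit a trivial nearest-centroid classifier, and report its accuracy. Accuracy near 0.5 means the classifier cannot tell real from synthetic, i.e. the synthetic data is convincing; accuracy near 1.0 means the two are easy to separate.

```python
import numpy as np

# Toy data: real and "synthetic" rows drawn from the same distribution,
# so a good detection score should be close to chance (0.5).
rng = np.random.default_rng(seed=42)
real = rng.normal(0.0, 1.0, size=(500, 3))
synthetic = rng.normal(0.0, 1.0, size=(500, 3))

X = np.vstack([real, synthetic])
y = np.array([0] * len(real) + [1] * len(synthetic))

# Nearest-centroid classifier: predict "synthetic" when a row lies closer
# to the synthetic centroid than to the real one.
centroid_real = real.mean(axis=0)
centroid_synth = synthetic.mean(axis=0)
predictions = (
    np.linalg.norm(X - centroid_synth, axis=1)
    < np.linalg.norm(X - centroid_real, axis=1)
).astype(int)

accuracy = (predictions == y).mean()
print("detection accuracy: %.3f" % accuracy)
```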
Creating Metrics
Suppose you want to add some new metrics to this library. To do this, you simply need to write a function which yields instances of the `Metric` object:
```python
from sdmetrics.report import Metric

def my_custom_metrics(metadata, real_tables, synthetic_tables):
    name = "abs-diff-in-number-of-rows"
    for table_name in metadata.get_tables():
        # Absolute difference in number of rows
        nb_real_rows = len(real_tables[table_name])
        nb_synthetic_rows = len(synthetic_tables[table_name])
        value = float(abs(nb_real_rows - nb_synthetic_rows))

        # Specify some useful tags for the user
        tags = set([
            "priority:high",
            "table:%s" % table_name
        ])

        yield Metric(name, value, tags)
```
To attach your metrics to a `MetricsReport` object, use the `add_metrics` method and provide your custom metrics iterator:
```python
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(my_custom_metrics(metadata, real_tables, synthetic_tables))
```
See `sdmetrics.detection`, `sdmetrics.efficacy`, and `sdmetrics.statistical` for more examples of how to implement metrics.
Filtering Metrics
The `MetricsReport` object includes a `details` method which returns all of the metrics that were computed.
```python
from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)
report.details()
```
To filter these metrics, you can provide a filter function. For example, to only see metrics that are associated with the `users` table, you can run:
```python
def my_custom_filter(metric):
    if "table:users" in metric.tags:
        return True
    return False

report.details(my_custom_filter)
```
Examples of standard tags implemented by the built-in metrics are shown below.
| Tag | Description |
|---|---|
| `priority:high` | This tag tells the user to pay extra attention to this metric. It typically indicates that the objects being evaluated by the metric are unusually bad (i.e. the synthetic values look very different from the real values). |
| `table:TABLE_NAME` | This tag indicates that the metric involves the table specified by `TABLE_NAME`. |
| `column:COL_NAME` | This tag indicates that the metric involves the column specified by `COL_NAME`. If the column names are not unique across the entire database, it must be combined with the `table:TABLE_NAME` tag to uniquely identify a specific column. |
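The combination of `table:` and `column:` tags can be sketched with a small filter. The `Metric` here is a minimal stand-in with the same `(name, value, tags)` shape as `sdmetrics.report.Metric`, and the tag values are made up for illustration.

```python
from collections import namedtuple

# Minimal stand-in for sdmetrics.report.Metric.
Metric = namedtuple("Metric", ["name", "value", "tags"])

# Two hypothetical metrics; "age" exists in more than one table, so the
# column tag alone would be ambiguous.
metrics = [
    Metric("ks-test", 0.2, {"table:users", "column:age"}),
    Metric("ks-test", 0.9, {"table:transactions", "column:age", "priority:high"}),
]

def users_age_filter(metric):
    # Require BOTH tags so we match only the users.age column.
    return {"table:users", "column:age"} <= metric.tags

selected = [m for m in metrics if users_age_filter(m)]
print(selected)  # only the users.age metric remains
```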
As this library matures, we will define additional standard tags and/or promote them to first class attributes.
What's next?
For more details about SDMetrics and all its possibilities and features, please check the documentation site.
History
v0.0.1 - 2020-06-26
First release to PyPI.