Metrics for Synthetic Data Generation Projects
An open source project from Data to AI Lab at MIT.
- License: MIT
- Development Status: Pre-Alpha
- Documentation: https://sdv-dev.github.io/SDMetrics
- Homepage: https://github.com/sdv-dev/SDMetrics
Overview
The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after. It includes a variety of metrics such as:
- Statistical metrics which use statistical tests to compare the distributions of the real and synthetic data.
- Detection metrics which use machine learning to try to distinguish between real and synthetic data.
- Descriptive metrics which compute descriptive statistics on the real and synthetic datasets independently and then compare the values.
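To make the first category concrete, here is a minimal, standalone sketch of the idea behind a statistical metric: a two-sample Kolmogorov-Smirnov test comparing one real column against its synthetic counterpart. This uses `scipy` directly and is an illustration of the concept, not SDMetrics' internal implementation.

```python
import numpy as np
from scipy.stats import ks_2samp

# Toy data standing in for one numeric column of the real and synthetic tables.
rng = np.random.default_rng(seed=0)
real_column = rng.normal(loc=0.0, scale=1.0, size=1000)
synthetic_column = rng.normal(loc=0.1, scale=1.1, size=1000)

# The KS statistic measures the largest gap between the two empirical CDFs.
statistic, p_value = ks_2samp(real_column, synthetic_column)

# A high p-value means the test cannot distinguish the two distributions,
# which suggests the synthetic column is statistically similar to the real one.
print("KS statistic: %.3f, p-value: %.3f" % (statistic, p_value))
```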
Install
Requirements
SDMetrics has been developed and tested on Python 3.5, 3.6, and 3.7.
Although not strictly required, using a virtualenv is highly recommended to avoid interfering with other software installed on the system where SDMetrics is run.
Install with pip
The easiest and recommended way to install SDMetrics is using pip:
```bash
pip install sdmetrics
```
This will pull and install the latest stable release from PyPI.
If you want to install from source or contribute to the project please read the Contributing Guide.
Basic Usage
Let's run the demo code from SDV to generate a simple synthetic dataset:
```python
from sdv import load_demo, SDV

metadata, real_tables = load_demo(metadata=True)

sdv = SDV()
sdv.fit(metadata, real_tables)
synthetic_tables = sdv.sample_all(20)
```
Now that we have a synthetic dataset, we can evaluate it using SDMetrics by calling the `evaluate` function, which returns an instance of `MetricsReport` with the default metrics:
```python
from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)
```
Examining Metrics
This `report` object makes it easy to examine the metrics at different levels of granularity. For example, the `overall` method returns a single scalar value that serves as a composite score combining all of the metrics. This score can be passed to an optimization routine (e.g. to tune the hyperparameters in a model) and minimized in order to obtain higher quality synthetic data.
```python
print(report.overall())
```
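The idea of minimizing this score in an optimization loop can be sketched as follows. Here `score_for` is a hypothetical stand-in for fitting a model with a given hyperparameter, sampling synthetic data, and calling `report.overall()` on the result; it is not part of SDMetrics.

```python
# Hypothetical stand-in objective: in a real workflow this would fit the
# model with `n_components`, generate synthetic tables, run evaluate(), and
# return report.overall(). Here we pretend quality is best at n_components=4.
def score_for(n_components):
    return abs(n_components - 4) * 0.1

# A trivial grid search: pick the candidate with the lowest (best) score.
candidates = [1, 2, 4, 8, 16]
best = min(candidates, key=score_for)
print("best hyperparameter:", best)  # → 4
```

Any optimizer that minimizes a scalar objective (grid search, Bayesian optimization, etc.) can be plugged in the same way.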
In addition, the `report` object provides a `highlights` method which identifies the worst performing metrics. This provides useful hints to help users identify where their synthetic data falls short (e.g. which tables, columns, or relationships are not being modeled properly).
```python
print(report.highlights())
```
Visualizing Metrics
Finally, the `report` object provides a `visualize` method which generates a figure showing some of the key metrics.
```python
figure = report.visualize()
figure.savefig("sdmetrics-report.png")
```
Advanced Usage
Specifying Metrics
Instead of running all the default metrics, you can specify exactly which metrics you want to run by creating an empty `MetricsReport` and adding the metrics yourself. For example, the following code only computes the machine learning detection-based metrics.
```python
from sdmetrics import detection
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(detection.metrics(metadata, real_tables, synthetic_tables))
```
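To build intuition for what a detection metric measures, here is a toy, numpy-only sketch (not SDMetrics' implementation): label real rows 0 and synthetic rows 1, fit a trivial nearest-centroid classifier, and report its accuracy. Accuracy near 0.5 means the classifier cannot tell real from synthetic, i.e. the synthetic data is convincing; accuracy near 1.0 means the two are easy to separate.

```python
import numpy as np

# Toy data: real and "synthetic" rows drawn from the same distribution,
# so a good detection score should be close to chance (0.5).
rng = np.random.default_rng(seed=42)
real = rng.normal(0.0, 1.0, size=(500, 3))
synthetic = rng.normal(0.0, 1.0, size=(500, 3))

X = np.vstack([real, synthetic])
y = np.array([0] * len(real) + [1] * len(synthetic))

# Nearest-centroid classifier: predict "synthetic" when a row lies closer
# to the synthetic centroid than to the real one.
centroid_real = real.mean(axis=0)
centroid_synth = synthetic.mean(axis=0)
predictions = (
    np.linalg.norm(X - centroid_synth, axis=1)
    < np.linalg.norm(X - centroid_real, axis=1)
).astype(int)

accuracy = (predictions == y).mean()
print("detection accuracy: %.3f" % accuracy)
```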
Creating Metrics
Suppose you want to add some new metrics to this library. To do this, you simply need to write a function which yields instances of the `Metric` object:
```python
from sdmetrics.report import Metric

def my_custom_metrics(metadata, real_tables, synthetic_tables):
    name = "abs-diff-in-number-of-rows"
    for table_name in metadata.get_tables():
        # Absolute difference in number of rows
        nb_real_rows = len(real_tables[table_name])
        nb_synthetic_rows = len(synthetic_tables[table_name])
        value = float(abs(nb_real_rows - nb_synthetic_rows))

        # Specify some useful tags for the user
        tags = set([
            "priority:high",
            "table:%s" % table_name
        ])

        yield Metric(name, value, tags)
```
To attach your metrics to a `MetricsReport` object, use the `add_metrics` method and provide your custom metrics iterator:
```python
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(my_custom_metrics(metadata, real_tables, synthetic_tables))
```
See `sdmetrics.detection`, `sdmetrics.efficacy`, and `sdmetrics.statistical` for more examples of how to implement metrics.
Filtering Metrics
The `MetricsReport` object includes a `details` method which returns all of the metrics that were computed.
```python
from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)
report.details()
```
To filter these metrics, you can provide a filter function. For example, to only see metrics that are associated with the `users` table, you can run:
```python
def my_custom_filter(metric):
    if "table:users" in metric.tags:
        return True
    return False

report.details(my_custom_filter)
```
Examples of standard tags implemented by the built-in metrics are shown below.
| Tag | Description |
|---|---|
| `priority:high` | This tag tells the user to pay extra attention to this metric. It typically indicates that the objects being evaluated by the metric are unusually bad (i.e. the synthetic values look very different from the real values). |
| `table:TABLE_NAME` | This tag indicates that the metric involves the table specified by `TABLE_NAME`. |
| `column:COL_NAME` | This tag indicates that the metric involves the column specified by `COL_NAME`. If the column names are not unique across the entire database, it must be combined with the `table:TABLE_NAME` tag to uniquely identify a specific column. |
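The combination of `table:` and `column:` tags can be sketched with a small filter. The `Metric` here is a minimal stand-in with the same `(name, value, tags)` shape as `sdmetrics.report.Metric`, and the tag values are made up for illustration.

```python
from collections import namedtuple

# Minimal stand-in for sdmetrics.report.Metric.
Metric = namedtuple("Metric", ["name", "value", "tags"])

# Two hypothetical metrics; "age" exists in more than one table, so the
# column tag alone would be ambiguous.
metrics = [
    Metric("ks-test", 0.2, {"table:users", "column:age"}),
    Metric("ks-test", 0.9, {"table:transactions", "column:age", "priority:high"}),
]

def users_age_filter(metric):
    # Require BOTH tags so we match only the users.age column.
    return {"table:users", "column:age"} <= metric.tags

selected = [m for m in metrics if users_age_filter(m)]
print(selected)  # only the users.age metric remains
```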
As this library matures, we will define additional standard tags and/or promote them to first class attributes.
What's next?
For more details about SDMetrics and all its possibilities and features, please check the documentation site.
History
v0.0.1 - 2020-06-26
First release to PyPI.