Metrics for Synthetic Data Generation Projects
An Open Source Project from the Data to AI Lab, at MIT
- Website: https://sdv.dev
- Documentation: https://sdv.dev/SDV
- Repository: https://github.com/sdv-dev/SDMetrics
- License: MIT
- Development Status: Pre-Alpha
Overview
The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after.
It supports multiple data modalities:
- Single Columns: Compare 1-dimensional numpy arrays representing individual columns.
- Column Pairs: Compare how columns in a pandas.DataFrame relate to each other, in groups of 2.
- Single Table: Compare an entire table, represented as a pandas.DataFrame.
- Multi Table: Compare multi-table and relational datasets represented as a Python dict with multiple tables passed as pandas.DataFrames.
- Time Series: Compare tables representing ordered sequences of events.
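To make the modalities above concrete, the following sketch builds one toy input of each kind using only numpy and pandas. The table names, column names, and values are invented for illustration and are not part of SDMetrics or its demo data; the time series layout (an entity id plus an ordering column) is likewise an assumption.

```python
import numpy as np
import pandas as pd

# Single column: a 1-dimensional numpy array.
ages = np.array([23, 35, 41, 58])

# Single table: a pandas.DataFrame.
users = pd.DataFrame({
    "user_id": [0, 1, 2, 3],
    "age": ages,
    "country": ["ES", "US", "US", "FR"],
})

# Column pair: two columns taken from the same DataFrame.
age_country = users[["age", "country"]]

# Multi table: a dict mapping (hypothetical) table names to DataFrames.
sessions = pd.DataFrame({
    "session_id": [10, 11, 12],
    "user_id": [0, 0, 2],
    "device": ["mobile", "tablet", "mobile"],
})
tables = {"users": users, "sessions": sessions}

# Time series: a table whose rows are ordered events (assumed layout).
events = pd.DataFrame({
    "user_id": [0, 0, 2],
    "timestamp": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01"]),
    "value": [0.1, 0.4, 0.3],
})
```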
It includes a variety of metrics such as:
- Statistical metrics which use statistical tests to compare the distributions of the real and synthetic data.
- Detection metrics which use machine learning to try to distinguish between real and synthetic data.
- Efficacy metrics which compare the performance of machine learning models when run on the synthetic and real data.
- Bayesian Network and Gaussian Mixture metrics which learn the distribution of the real data and evaluate the likelihood of the synthetic data belonging to the learned distribution.
- Privacy metrics which evaluate whether the synthetic data is leaking information about the real data.
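As an illustration of the statistical family listed above, the sketch below computes a two-sample Kolmogorov-Smirnov D statistic between a real and a synthetic column and inverts it, so that 1 means the two empirical distributions coincide. It is written with plain numpy and is not taken from the SDMetrics source; the function name inverted_ks_statistic is hypothetical.

```python
import numpy as np


def inverted_ks_statistic(real, synthetic):
    """Return 1 - D, where D is the two-sample KS statistic."""
    real = np.sort(np.asarray(real, dtype=float))
    synthetic = np.sort(np.asarray(synthetic, dtype=float))

    # Evaluate both empirical CDFs at every observed value.
    points = np.concatenate([real, synthetic])
    cdf_real = np.searchsorted(real, points, side="right") / len(real)
    cdf_synth = np.searchsorted(synthetic, points, side="right") / len(synthetic)

    # D is the largest vertical gap between the two empirical CDFs.
    d_statistic = np.max(np.abs(cdf_real - cdf_synth))
    return 1.0 - d_statistic


same = inverted_ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = inverted_ks_statistic([0, 1, 2], [10, 11, 12])
print(same, disjoint)
```

Identical samples give the best possible score and samples with no overlap give the worst, which matches the 0 to 1 range and MAXIMIZE goal that a score of this kind is meant to have.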
Install
SDMetrics is part of the SDV project and is automatically installed alongside it. For details about this process, please visit the SDV Installation Guide.
Optionally, SDMetrics can also be installed as a standalone library using the following commands:
Using pip:
pip install sdmetrics
Using conda:
conda install -c sdv-dev -c conda-forge -c pytorch sdmetrics
For more installation options, please visit the SDMetrics Installation Guide.
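After installing, one way to confirm which version ended up in the environment is to query the package metadata with the Python standard library. This is a minimal sketch; the helper name installed_version is hypothetical, and it assumes Python 3.8 or later, where importlib.metadata is available.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


found = installed_version("sdmetrics")
if found is None:
    print("sdmetrics is not installed in this environment")
else:
    print("sdmetrics version:", found)
```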
Usage
SDMetrics is included as part of the framework offered by SDV to evaluate the quality of your synthetic dataset. For more details about how to use it, please visit the corresponding User Guide.
Standalone usage
SDMetrics can also be used as a standalone library to run metrics individually.
In this short example we show how to use it to evaluate a toy multi-table dataset and its synthetic replica by running all the compatible multi-table metrics on it:
import sdmetrics
# Load the demo data, which includes:
# - A dict containing the real tables as pandas.DataFrames.
# - A dict containing the synthetic clones of the real data.
# - A dict containing metadata about the tables.
real_data, synthetic_data, metadata = sdmetrics.load_demo()
# Obtain the list of multi table metrics, which is returned as a dict
# containing the metric names and the corresponding metric classes.
metrics = sdmetrics.multi_table.MultiTableMetric.get_subclasses()
# Run all the compatible metrics and get a report
sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata)
The output will be a table with the details of each executed metric and its score:
metric | name | score | min_value | max_value | goal |
---|---|---|---|---|---|
CSTest | Chi-Squared | 0.76651 | 0 | 1 | MAXIMIZE |
KSTest | Inverted Kolmogorov-Smirnov D statistic | 0.75 | 0 | 1 | MAXIMIZE |
KSTestExtended | Inverted Kolmogorov-Smirnov D statistic | 0.777778 | 0 | 1 | MAXIMIZE |
LogisticDetection | LogisticRegression Detection | 0.882716 | 0 | 1 | MAXIMIZE |
SVCDetection | SVC Detection | 0.833333 | 0 | 1 | MAXIMIZE |
BNLikelihood | BayesianNetwork Likelihood | nan | 0 | 1 | MAXIMIZE |
BNLogLikelihood | BayesianNetwork Log Likelihood | nan | -inf | 0 | MAXIMIZE |
LogisticParentChildDetection | LogisticRegression Detection | 0.619444 | 0 | 1 | MAXIMIZE |
SVCParentChildDetection | SVC Detection | 0.916667 | 0 | 1 | MAXIMIZE |
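A report like the one above can be post-processed with ordinary pandas operations. The sketch below copies a few rows from the table by hand and assumes the report is available as a pandas.DataFrame with these column names; it sets aside the rows whose score is nan and ranks the rest.

```python
import pandas as pd

# A few rows copied from the report above. In practice this would be the
# object returned by compute_metrics (assumed here to be a DataFrame).
report = pd.DataFrame({
    "metric": ["CSTest", "KSTest", "BNLikelihood", "SVCParentChildDetection"],
    "score": [0.76651, 0.75, float("nan"), 0.916667],
    "goal": ["MAXIMIZE"] * 4,
})

# Rows with a nan score carry no usable value, so list them separately.
missing = report[report["score"].isna()]["metric"].tolist()
valid = report.dropna(subset=["score"])

# Rank the remaining metrics from lowest to highest score.
ranked = valid.sort_values("score").reset_index(drop=True)

print("No score:", missing)
print(ranked)
```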
What's next?
If you want to read more about each individual metric, please visit the following folders:
- Single Column Metrics: sdmetrics/single_column
- Single Table Metrics: sdmetrics/single_table
- Multi Table Metrics: sdmetrics/multi_table
- Time Series Metrics: sdmetrics/timeseries
The Synthetic Data Vault
This repository is part of The Synthetic Data Vault Project.
- Website: https://sdv.dev
- Documentation: https://sdv.dev/SDV
History
v0.3.0 - 2021-03-30
This release includes privacy metrics to evaluate if the real data could be obtained or deduced from the synthetic samples. Additionally, all the metrics have a normalize method which takes the raw_score generated by the metric and returns a value between 0 and 1.
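To illustrate what mapping a raw score into the 0 to 1 range can involve, the sketch below handles bounded ranges as well as ranges with an infinite end, such as the -inf to 0 range shown for BNLogLikelihood in the report table. It is one plausible approach under stated assumptions, not the SDMetrics implementation; the function name normalize_sketch and its parameters are hypothetical.

```python
import math


def normalize_sketch(raw_score, min_value, max_value, goal="MAXIMIZE"):
    """Map raw_score into [0, 1] given a metric's declared range and goal."""
    if math.isinf(min_value) and math.isinf(max_value):
        # Unbounded on both sides: logistic squashing, written so that
        # large negative inputs do not overflow math.exp.
        if raw_score >= 0:
            score = 1.0 / (1.0 + math.exp(-raw_score))
        else:
            exp_raw = math.exp(raw_score)
            score = exp_raw / (1.0 + exp_raw)
    elif math.isinf(min_value):
        # Bounded above only: the maximum maps to 1, -inf tends to 0.
        score = math.exp(raw_score - max_value)
    elif math.isinf(max_value):
        # Bounded below only: the minimum maps to 0, +inf tends to 1.
        score = 1.0 - math.exp(min_value - raw_score)
    else:
        # Finite range: plain min-max scaling.
        score = (raw_score - min_value) / (max_value - min_value)

    # When lower raw scores are better, flip so that 1 is always best.
    return score if goal == "MAXIMIZE" else 1.0 - score


print(normalize_sketch(0.75, 0, 1))
print(normalize_sketch(-2.0, -math.inf, 0))
```

With a scheme like this, scores from metrics with very different raw ranges become comparable on a single scale.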
Issues closed
- Add normalize method to metrics - Issue #51 by @csala and @fealho
- Implement privacy metrics - Issue #36 by @ZhuofanXie and @fealho
v0.2.0 - 2021-02-24
Dependency upgrades to ensure compatibility with the rest of the SDV ecosystem.
v0.1.3 - 2021-02-13
Updates the required dependencies to facilitate a conda release.
Issues closed
- Upgrade sktime - Issue #49 by @fealho
v0.1.2 - 2021-01-27
Bug fixing release that addresses several minor errors.
Issues closed
- More splits than classes - Issue #46 by @fealho
- Scipy 1.6.0 causes an AttributeError - Issue #44 by @fealho
- Time series metrics fails with variable length timeseries - Issue #42 by @fealho
- ParentChildDetection metrics KeyError - Issue #39 by @csala
v0.1.1 - 2020-12-30
This version adds Time Series Detection and Efficacy metrics, as well as a fix to ensure that Single Table binary classification efficacy metrics work well with binary targets which are not boolean.
Issues closed
- Timeseries efficacy metrics - Issue #35 by @csala
- Timeseries detection metrics - Issue #34 by @csala
- Ensure binary classification targets are bool - Issue #33 by @csala
v0.1.0 - 2020-12-18
This release introduces a new project organization in which metrics are grouped by data modality and share a common API:
- Single Column
- Column Pair
- Single Table
- Multi Table
- Time Series
Within each data modality, different families of metrics have been implemented:
- Statistical
- Detection
- Bayesian Network and Gaussian Mixture Likelihood
- Machine Learning Efficacy
v0.0.4 - 2020-11-27
Patch release to relax dependencies and avoid conflicts when using the latest SDV version.
v0.0.3 - 2020-11-20
Fix error on detection metrics when input data contains infinity or NaN values.
Issues closed
- ValueError: Input contains infinity or a value too large for dtype('float64') - Issue #11 by @csala
v0.0.2 - 2020-08-08
Add support for Python 3.8 and a broader range of dependencies.
v0.0.1 - 2020-06-26
First release to PyPI.
Download files
Hashes for sdmetrics-0.3.1.dev0-py2.py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | c17e540f71c67d8b32abf50ace84434ed878c8424e5776d3f55d537a97ac01bb |
MD5 | a3305950706efe35ca0537070f8cd2d6 |
BLAKE2b-256 | 2114b7ff2f9af45bbe95b26f6d2482650cbb15b944e09e6bb28c464275dfcdd1 |