Metrics for Synthetic Data Generation Projects
An Open Source Project from the Data to AI Lab, at MIT
- Website: https://sdv.dev
- Documentation: https://sdv.dev/SDV
- Repository: https://github.com/sdv-dev/SDMetrics
- License: MIT
- Development Status: Pre-Alpha
Overview
The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after.
It supports multiple data modalities:
- Single Columns: Compare 1 dimensional numpy arrays representing individual columns.
- Column Pairs: Compare how columns in a pandas.DataFrame relate to each other, in groups of 2.
- Single Table: Compare an entire table, represented as a pandas.DataFrame.
- Multi Table: Compare multi-table and relational datasets represented as a python dict with multiple tables passed as pandas.DataFrames.
- Time Series: Compare tables representing ordered sequences of events.
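To make these input shapes concrete, the following sketch builds a tiny example of each container type named in the list. All table names, column names and values are invented for illustration, and the time series layout (one row per event, with an entity column and a timestamp column) is an assumption rather than something SDMetrics prescribes.

```python
import numpy as np
import pandas as pd

# Single Columns: 1 dimensional numpy arrays, one per column.
real_column = np.array([1.0, 2.0, 3.0, 4.0])
synthetic_column = np.array([1.1, 1.9, 3.2, 3.8])

# Single Table: an entire table as a pandas.DataFrame
# (hypothetical "users" table).
real_users = pd.DataFrame({
    'user_id': [0, 1, 2],
    'age': [34, 51, 27],
    'country': ['ES', 'US', 'FR'],
})

# Column Pairs: two columns taken from the same DataFrame.
column_pair = real_users[['age', 'country']]

# Multi Table: a dict that maps table names to DataFrames.
real_tables = {
    'users': real_users,
    'sessions': pd.DataFrame({
        'session_id': [10, 11, 12, 13],
        'user_id': [0, 0, 1, 2],
        'device': ['mobile', 'tablet', 'mobile', 'desktop'],
    }),
}

# Time Series (assumed layout): one row per event, ordered
# by timestamp within each entity.
real_events = pd.DataFrame({
    'entity_id': [1, 0, 0, 1, 0],
    'timestamp': pd.to_datetime([
        '2020-01-02', '2020-01-01', '2020-01-03', '2020-01-01', '2020-01-02',
    ]),
    'value': [5.0, 1.0, 3.0, 4.0, 2.0],
}).sort_values(['entity_id', 'timestamp']).reset_index(drop=True)
```

The synthetic counterpart of each object would have the same structure, so that both can be handed to a metric side by side.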
It includes a variety of metrics such as:
- Statistical metrics which use statistical tests to compare the distributions of the real and synthetic data.
- Detection metrics which use machine learning to try to distinguish between real and synthetic data.
- Efficacy metrics which compare the performance of machine learning models when run on the synthetic and real data.
- Bayesian Network and Gaussian Mixture metrics which learn the distribution of the real data and evaluate the likelihood of the synthetic data belonging to the learned distribution.
- Privacy metrics which evaluate whether the synthetic data is leaking information about the real data.
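As an illustration of the first family, the sketch below computes a two-sample Kolmogorov-Smirnov statistic for one numerical column using only the standard library. It is not the SDMetrics implementation: the function names are hypothetical, and mapping the statistic to `1 - statistic` so that higher means more similar is an assumption made for this example.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    real_sorted = sorted(real)
    synthetic_sorted = sorted(synthetic)
    if not real_sorted or not synthetic_sorted:
        raise ValueError('both samples must be non-empty')

    largest_gap = 0.0
    # Empirical CDFs only change at observed values, so it is enough to
    # compare them at every value that appears in either sample.
    for value in set(real_sorted) | set(synthetic_sorted):
        real_cdf = bisect_right(real_sorted, value) / len(real_sorted)
        synthetic_cdf = bisect_right(synthetic_sorted, value) / len(synthetic_sorted)
        largest_gap = max(largest_gap, abs(real_cdf - synthetic_cdf))
    return largest_gap


def ks_similarity(real, synthetic):
    """Assumed score convention: 1.0 means identical empirical CDFs."""
    return 1.0 - ks_statistic(real, synthetic)


print(ks_similarity([1, 2, 3, 4], [3, 4, 5, 6]))
```

With this convention, two samples with no overlap score 0.0 and two identical samples score 1.0.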
Install
Requirements
SDMetrics has been developed and tested on Python 3.6, 3.7 and 3.8.
Although it is not strictly required, using a virtualenv is highly recommended to avoid interfering with other software installed on the system where SDMetrics is run.
Install with pip
The easiest and recommended way to install SDMetrics is using pip:
pip install sdmetrics
This will pull and install the latest stable release from PyPI.
If you want to install from source or contribute to the project, please read the Contributing Guide.
Install with conda
SDMetrics can also be installed using conda:
conda install -c sdv-dev -c conda-forge sdmetrics
This will pull and install the latest stable release from Anaconda.
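After either install method, one way to confirm which release ended up in the active environment is to query the package metadata. This is a minimal sketch that relies on `importlib.metadata`, which is part of the standard library only from Python 3.8 onwards; on 3.6 and 3.7 the `importlib_metadata` backport would be needed instead. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata  # standard library on Python 3.8+


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed release, or None if sdmetrics is not installed.
print(installed_version('sdmetrics'))
```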
Basic Usage
The following snippet shows how to use SDMetrics to evaluate how similar a toy multi-table dataset and its synthetic replica are:
- The demo data is loaded.
- The list of available multi-table metrics is retrieved.
- All the metrics are run to compare the real and synthetic data.
- A pandas.DataFrame is built with the results.
import pandas as pd
import sdmetrics
# Load the demo data, which includes:
# - A dict containing the real tables as pandas.DataFrames.
# - A dict containing the synthetic clones of the real data.
# - A dict containing metadata about the tables.
real_data, synthetic_data, metadata = sdmetrics.load_demo()
# Obtain the list of multi table metrics, which is returned as a dict
# containing the metric names and the corresponding metric classes.
metrics = sdmetrics.multi_table.MultiTableMetric.get_subclasses()
# Iterate over the metrics and compute them, capturing the scores obtained.
scores = []
for name, metric in metrics.items():
    try:
        scores.append({
            'metric': name,
            'score': metric.compute(real_data, synthetic_data, metadata)
        })
    except ValueError:
        pass  # Ignore metrics that do not support this data
# Put the results in a DataFrame for pretty printing.
scores = pd.DataFrame(scores)
The result will be a table containing the list of metrics that have been computed and the scores obtained, similar to this one:
| metric | score |
|---|---|
| CSTest | 0.76651 |
| KSTest | 0.75 |
| KSTestExtended | 0.777778 |
| LogisticDetection | 0.925926 |
| SVCDetection | 0.703704 |
| LogisticParentChildDetection | 0.541667 |
| SVCParentChildDetection | 0.923611 |
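Once the scores are collected, a short summary can help spot which metrics stand out. The sketch below works on a plain list of dicts shaped like the one built in the loop, with the values copied from the example table. It only orders the numbers and assumes nothing about how each individual metric should be interpreted.

```python
from statistics import mean

# Scores copied from the example table above.
scores = [
    {'metric': 'CSTest', 'score': 0.76651},
    {'metric': 'KSTest', 'score': 0.75},
    {'metric': 'KSTestExtended', 'score': 0.777778},
    {'metric': 'LogisticDetection', 'score': 0.925926},
    {'metric': 'SVCDetection', 'score': 0.703704},
    {'metric': 'LogisticParentChildDetection', 'score': 0.541667},
    {'metric': 'SVCParentChildDetection', 'score': 0.923611},
]

# Order the rows from the lowest to the highest score.
ranked = sorted(scores, key=lambda row: row['score'])
lowest = ranked[0]
highest = ranked[-1]
average = mean(row['score'] for row in scores)

print('lowest:  {} ({:.3f})'.format(lowest['metric'], lowest['score']))
print('highest: {} ({:.3f})'.format(highest['metric'], highest['score']))
print('average: {:.3f}'.format(average))
```

The same summary could be obtained from the pandas.DataFrame with `sort_values('score')` and `scores['score'].mean()`.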
What's next?
For more details about SDMetrics and SDV please visit the documentation site.
More details about each type of metric can also be found here:
- Single Column Metrics: sdmetrics/single_column
- Single Table Metrics: sdmetrics/single_table
- Multi Table Metrics: sdmetrics/multi_table
The Synthetic Data Vault
This repository is part of The Synthetic Data Vault Project.
- Website: https://sdv.dev
- Documentation: https://sdv.dev/SDV
History
v0.0.4 - 2020-11-27
Patch release to relax dependencies and avoid conflicts when using the latest SDV version.
v0.0.3 - 2020-11-20
Fix error on detection metrics when input data contains infinity or NaN values.
Issues closed
- ValueError: Input contains infinity or a value too large for dtype('float64') - Issue #11 by @csala
v0.0.2 - 2020-08-08
Add support for Python 3.8 and a broader range of dependencies.
v0.0.1 - 2020-06-26
First release to PyPI.