
Metrics for Synthetic Data Generation Projects



This repository is part of The Synthetic Data Vault Project, a project from DataCebo.


Overview

The SDMetrics library evaluates synthetic data by comparing it to the real data that you're trying to mimic. It includes a variety of metrics to capture different aspects of the data, for example quality and privacy. It also includes reports that you can run to generate insights and share with your team.

The SDMetrics library is model-agnostic, meaning you can evaluate synthetic data from any source. The library does not need to know how you created the data.

Important Links
  • :computer: Website: Check out the SDV Website for more information about the project.
  • :orange_book: Blog: A deeper look at open source, synthetic data creation and evaluation.
  • :book: Documentation: Quickstarts, User and Development Guides, and API Reference.
  • :octocat: Repository: The link to the Github Repository of this library.
  • :scroll: License: The library is published under the MIT License.
  • :keyboard: Development Status: This software is in its Pre-Alpha stage.
  • Community: Join our Slack Workspace for announcements and discussions.
  • Tutorials: Get started with SDMetrics in a notebook.

Features

Quickly generate insights and share results with your team using SDMetrics Reports. For example, the Diagnostic Report quickly checks for common problems, and the Quality Report provides visualizations comparing the real and synthetic data.

You can also explore and apply individual metrics as needed. The SDMetrics library includes a variety of metrics for different goals:

  • Privacy metrics evaluate whether the synthetic data is leaking information about the real data
  • ML Efficacy metrics estimate the outcomes of using the synthetic data to solve machine learning problems
  • … and more!
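To give a feel for the ML Efficacy idea (train a model on the synthetic data, then evaluate it on the real data), here is a dependency-free toy sketch. The function name and data are hypothetical illustrations, not part of the SDMetrics API, and real metrics train full models on features rather than a trivial classifier on labels:

```python
from collections import Counter

def toy_ml_efficacy(synthetic_labels, real_labels):
    """Toy sketch of ML Efficacy: 'train' a majority-class classifier on
    synthetic labels, then score its accuracy on the real labels."""
    predicted = Counter(synthetic_labels).most_common(1)[0][0]
    return sum(1 for y in real_labels if y == predicted) / len(real_labels)

# The majority class in the synthetic training labels is 'yes';
# that guess is correct for 3 of the 4 real labels.
score = toy_ml_efficacy(
    ['yes', 'yes', 'no'],
    ['yes', 'no', 'yes', 'yes'],
)
print(score)  # 0.75
```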

Some of these metrics are experimental and actively being researched by the data science community.

Install

Install SDMetrics using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

Using pip:

pip install sdmetrics

Using conda:

conda install -c conda-forge -c pytorch sdmetrics

For more installation options, please visit the SDMetrics Installation Guide.

Usage

Get started with SDMetrics Reports using some demo data:

from sdmetrics import load_demo
from sdmetrics.reports.single_table import QualityReport

real_data, synthetic_data, metadata = load_demo(modality='single_table')

my_report = QualityReport()
my_report.generate(real_data, synthetic_data, metadata)
Creating report: 100%|██████████| 4/4 [00:00<00:00,  5.22it/s]

Overall Quality Score: 82.84%

Properties:
Column Shapes: 82.78%
Column Pair Trends: 82.9%
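In this demo output, the overall Quality Score matches the simple mean of the two per-property scores, which can be checked in plain Python (a sanity check on the printed numbers, not a claim about the library's internal aggregation):

```python
# The demo's overall Quality Score matches the mean of its property scores.
property_scores = {'Column Shapes': 82.78, 'Column Pair Trends': 82.90}
overall = sum(property_scores.values()) / len(property_scores)
print(f'{overall:.2f}%')  # 82.84%
```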

Once you generate the report, you can drill down on the details and visualize the results.

my_report.get_visualization(property_name='Column Pair Trends')

Save the report and share it with your team.

my_report.save(filepath='demo_data_quality_report.pkl')

# load it at any point in the future
my_report = QualityReport.load(filepath='demo_data_quality_report.pkl')

Want more metrics? You can also manually apply any of the metrics in this library to your data.

# calculate whether the synthetic data respects the min/max bounds
# set by the real data
from sdmetrics.single_table import BoundaryAdherence

BoundaryAdherence.compute(
    real_data['start_date'],
    synthetic_data['start_date']
)
0.8503937007874016
# calculate whether an attacker will be able to guess sensitive 
# information based on combination of synthetic data and their
# own information
from sdmetrics.single_table import CategoricalCAP

CategoricalCAP.compute(
    real_data,
    synthetic_data,
    key_fields=['gender', 'work_experience'],
    sensitive_fields=['degree_type']
)
0.4601209799017264
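To make the first score above concrete, here is a pure-Python sketch of the quantity BoundaryAdherence describes: the fraction of synthetic values that fall inside the real data's [min, max] range. This is an illustration with made-up data, not the library's implementation:

```python
def boundary_adherence_sketch(real_values, synthetic_values):
    """Fraction of synthetic values inside the real data's [min, max]."""
    lo, hi = min(real_values), max(real_values)
    inside = [lo <= value <= hi for value in synthetic_values]
    return sum(inside) / len(inside)

real = [1.0, 2.5, 4.0, 10.0]
synthetic = [2.0, 5.0, 11.0, 0.5, 3.0]

# 2.0, 5.0 and 3.0 fall inside [1.0, 10.0]; 11.0 and 0.5 do not.
print(boundary_adherence_sketch(real, synthetic))  # 0.6
```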

What's next?

To learn more about the reports and metrics, visit the SDMetrics Documentation.




The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprises, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi-table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

History

v0.6.0 - 2022-08-12

This release removes SDMetrics' dependency on the RDT library and also introduces new quality and diagnostic metrics. Additionally, we introduce a new compute_breakdown method that returns a breakdown of metric results.

New Features

  • Handle null values correctly - Issue #194 by @katxiao
  • Add wrapper classes for new single and multi table metrics - Issue #169 by @katxiao
  • Add CorrelationSimilarity metric - Issue #143 by @katxiao
  • Add CardinalityShapeSimilarity metric - Issue #160 by @katxiao
  • Add CardinalityStatisticSimilarity metric - Issue #145 by @katxiao
  • Add ContingencySimilarity Metric - Issue #159 by @katxiao
  • Add TVComplement metric - Issue #142 by @katxiao
  • Add MissingValueSimilarity metric - Issue #139 by @katxiao
  • Add CategoryCoverage metric - Issue #140 by @katxiao
  • Add compute breakdown column for single column - Issue #152 by @katxiao
  • Add BoundaryAdherence metric - Issue #138 by @katxiao
  • Get KSComplement Score Breakdown - Issue #130 by @katxiao
  • Add StatisticSimilarity Metric - Issue #137 by @katxiao
  • New features for KSTest.compute - Issue #129 by @amontanez24

Internal Improvements

  • Add integration tests and fixes - Issue #183 by @katxiao
  • Remove rdt hypertransformer dependency in timeseries metrics - Issue #176 by @katxiao
  • Replace rdt LabelEncoder with sklearn - Issue #178 by @katxiao
  • Remove rdt as a dependency - Issue #182 by @katxiao
  • Use sklearn's OneHotEncoder instead of rdt - Issue #170 by @katxiao
  • Remove KSTestExtended - Issue #180 by @katxiao
  • Remove TSFClassifierEfficacy and TSFCDetection metrics - Issue #171 by @katxiao
  • Update the default tags for a feature request - Issue #172 by @katxiao
  • Bump github macos version - Issue #174 by @katxiao
  • Fix pydocstyle to check sdmetrics - Issue #153 by @pvk-developer
  • Update the RDT version to 1.0 - Issue #150 by @pvk-developer
  • Update slack invite link - Issue #132 by @pvk-developer

v0.5.0 - 2022-05-11

This release fixes an error where the relational KSTest crashes if a table doesn't have numerical columns. It also includes some housekeeping, updating the pomegranate and copulas version requirements.

Issues closed

  • Cap pomegranate to <0.14.7 - Issue #116 by @csala
  • Relational KSTest crashes with IncomputableMetricError if a table doesn't have numerical columns - Issue #109 by @katxiao

v0.4.1 - 2021-12-09

This release improves the handling of metric errors, and updates the default transformer behavior used in SDMetrics.

Issues closed

  • Report metric errors from compute_metrics - Issue #107 by @katxiao
  • Specify default categorical transformers - Issue #105 by @katxiao

v0.4.0 - 2021-11-16

This release adds support for Python 3.9 and updates dependencies to ensure compatibility with the rest of the SDV ecosystem, upgrading to the latest RDT release.

Issues closed

  • Replace sktime for pyts - Issue #103 by @pvk-developer
  • Add support for Python 3.9 - Issue #102 by @pvk-developer
  • Increase code style lint - Issue #80 by @fealho
  • Add pip check to CI workflows - Issue #79 by @pvk-developer
  • Upgrade dependency ranges - Issue #69 by @katxiao

v0.3.2 - 2021-08-16

This release makes pomegranate an optional dependency.

Issues closed

  • Make pomegranate an optional dependency - Issue #63 by @fealho

v0.3.1 - 2021-07-12

This release fixes a bug to make the privacy metrics available in the API docs. It also updates dependencies to ensure compatibility with the rest of the SDV ecosystem.

Issues closed

  • CategoricalSVM not being imported - Issue #65 by @csala

v0.3.0 - 2021-03-30

This release includes privacy metrics to evaluate if the real data could be obtained or deduced from the synthetic samples. Additionally all the metrics have a normalize method which takes the raw_score generated by the metric and returns a value between 0 and 1.

Issues closed

  • Add normalize method to metrics - Issue #51 by @csala and @fealho
  • Implement privacy metrics - Issue #36 by @ZhuofanXie and @fealho

v0.2.0 - 2021-02-24

Dependency upgrades to ensure compatibility with the rest of the SDV ecosystem.

v0.1.3 - 2021-02-13

Updates the required dependencies to facilitate a conda release.

Issues closed

  • Upgrade sktime - Issue #49 by @fealho

v0.1.2 - 2021-01-27

Bug-fix release that addresses several minor errors.

Issues closed

  • More splits than classes - Issue #46 by @fealho
  • Scipy 1.6.0 causes an AttributeError - Issue #44 by @fealho
  • Time series metrics fails with variable length timeseries - Issue #42 by @fealho
  • ParentChildDetection metrics KeyError - Issue #39 by @csala

v0.1.1 - 2020-12-30

This version adds Time Series Detection and Efficacy metrics, as well as a fix to ensure that Single Table binary classification efficacy metrics work well with binary targets which are not boolean.

Issues closed

  • Timeseries efficacy metrics - Issue #35 by @csala
  • Timeseries detection metrics - Issue #34 by @csala
  • Ensure binary classification targets are bool - Issue #33 by @csala

v0.1.0 - 2020-12-18

This release introduces a new project organization and API, with metrics grouped by data modality, with a common API:

  • Single Column
  • Column Pair
  • Single Table
  • Multi Table
  • Time Series

Within each data modality, different families of metrics have been implemented:

  • Statistical
  • Detection
  • Bayesian Network and Gaussian Mixture Likelihood
  • Machine Learning Efficacy

v0.0.4 - 2020-11-27

Patch release to relax dependencies and avoid conflicts when using the latest SDV version.

v0.0.3 - 2020-11-20

Fix error on detection metrics when input data contains infinity or NaN values.

Issues closed

  • ValueError: Input contains infinity or a value too large for dtype('float64') - Issue #11 by @csala

v0.0.2 - 2020-08-08

Add support for Python 3.8 and a broader range of dependencies.

v0.0.1 - 2020-06-26

First release to PyPI.


Download files

Source Distribution: sdmetrics-0.7.0.tar.gz (508.9 kB)

Built Distribution: sdmetrics-0.7.0-py2.py3-none-any.whl (132.0 kB)
