Metrics for Synthetic Data Generation Projects

This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

The SDMetrics library evaluates synthetic data by comparing it to the real data that you're trying to mimic. It includes a variety of metrics that capture different aspects of the data, such as quality and privacy. It also includes reports that you can run to generate insights, visualize the data, and share the results with your team.

The SDMetrics library is model-agnostic, meaning you can use any synthetic data. The library does not need to know how you created the data.

Install

Install SDMetrics using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install sdmetrics
conda install -c conda-forge sdmetrics
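
To confirm that the install landed in the environment you expect, you can query the package metadata from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of SDMetrics.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if sdmetrics is installed in the active
# environment, or None if it is not.
print(installed_version("sdmetrics"))
```

If this prints `None`, the interpreter you are running is probably not the one from the virtual environment where you installed the package.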

For more information about using SDMetrics, visit the SDMetrics Documentation.

Usage

Get started with SDMetrics Reports using some demo data:

from sdmetrics import load_demo
from sdmetrics.reports.single_table import QualityReport

real_data, synthetic_data, metadata = load_demo(modality='single_table')

my_report = QualityReport()
my_report.generate(real_data, synthetic_data, metadata)
Creating report: 100%|██████████| 4/4 [00:00<00:00,  5.22it/s]

Overall Quality Score: 82.84%

Properties:
Column Shapes: 82.78%
Column Pair Trends: 82.9%
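
The numbers in the output above are consistent with the overall score being the plain average of the two property scores. The short check below reproduces that arithmetic. It is a sketch under that assumption, with the two scores copied by hand from the output, and it does not call SDMetrics.

```python
from statistics import mean

# Property scores copied from the report output above (percentages).
property_scores = {
    "Column Shapes": 82.78,
    "Column Pair Trends": 82.90,
}

# Assumption: the overall score is the unweighted mean of the properties.
overall = mean(list(property_scores.values()))
print(f"Overall Quality Score: {overall:.2f}%")  # Overall Quality Score: 82.84%
```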

Once you generate the report, you can drill down on the details and visualize the results.

my_report.get_visualization(property_name='Column Pair Trends')

Save the report and share it with your team.

my_report.save(filepath='demo_data_quality_report.pkl')

# load it at any point in the future
my_report = QualityReport.load(filepath='demo_data_quality_report.pkl')

Want more metrics? You can also manually apply any of the metrics in this library to your data.

# calculate whether the synthetic data respects the min/max bounds
# set by the real data
from sdmetrics.single_column import BoundaryAdherence

BoundaryAdherence.compute(
    real_data['start_date'],
    synthetic_data['start_date']
)
0.8503937007874016

# calculate whether the synthetic data is new or whether it's an exact copy
# of the real data
from sdmetrics.single_table import NewRowSynthesis

NewRowSynthesis.compute(
    real_data,
    synthetic_data,
    metadata
)
1.0
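
To build intuition for what the two scores above measure, here is a simplified pure-Python sketch of both ideas. It is not the library's implementation: the function names `boundary_adherence` and `new_row_fraction` and the toy data are hypothetical, and the real metrics may treat missing values, datetimes, and near-matches on numerical columns differently.

```python
def boundary_adherence(real_values, synthetic_values):
    """Fraction of synthetic values that fall inside the real min/max range."""
    low, high = min(real_values), max(real_values)
    in_bounds = [low <= value <= high for value in synthetic_values]
    return sum(in_bounds) / len(in_bounds)


def new_row_fraction(real_rows, synthetic_rows):
    """Fraction of synthetic rows that do not exactly match any real row."""
    seen = {tuple(row) for row in real_rows}
    is_new = [tuple(row) not in seen for row in synthetic_rows]
    return sum(is_new) / len(is_new)


# Toy data: two of the four synthetic ages fall outside the real range [25, 52].
real_ages = [25, 31, 47, 52]
synthetic_ages = [24, 30, 47, 60]
print(boundary_adherence(real_ages, synthetic_ages))  # 0.5

# Toy data: one of the two synthetic rows is an exact copy of a real row.
real_rows = [(25, "NY"), (31, "CA")]
synthetic_rows = [(25, "NY"), (40, "TX")]
print(new_row_fraction(real_rows, synthetic_rows))  # 0.5
```

In both cases a score of 1.0 is the best outcome: every synthetic value respects the real bounds, or every synthetic row is new.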

What's next?

To learn more about the reports and metrics, visit the SDMetrics Documentation.




The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After four years of research and traction with enterprises, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi-table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.
