Utility metrics for tabular data

TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Evaluation - Utility Metrics

Extensive evaluation of the utility of synthetic data sets. The original and synthetic data are compared on distinguishability and at the univariate, bivariate and multivariate level. All four metrics are summarized in a single spider plot, where one means 'complete overlap' and zero means 'no overlap' between original and synthetic data. The plot can depict multiple synthetic data sets, so it can be used to compare different levels of privacy protection, different parameter settings of a synthetic data generator, or entirely different synthetic data generators.

All individual metrics depicted in the spiderplot can be visualized as well. The example_script.py shows you step by step how to generate all visualizations. The main functionalities of the scripts are:

  • Univariate distributions: shows the distributions of one variable for the original and synthetic data.
  • Bivariate correlations: visualizes a Pearson-r correlation matrix for all variables.
  • Multivariate predictions: shows the prediction accuracy of an SVM classifier for each variable, trained on either the original or the synthetic data and tested on the original data.
  • Distinguishability: shows the AUC of a logistic classifier that classifies samples as either original or synthetic.
  • Spiderplot: generates a spider plot of these four metrics.
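
To make two of these metrics concrete, the following is a minimal sketch of a Pearson-r correlation comparison and a distinguishability AUC, written directly with pandas and scikit-learn. It does not use this package's API: the function names `correlation_difference` and `distinguishability_auc`, the toy data, and the choice of a 70/30 train/test split are all assumptions made for illustration, and the package's own implementation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def correlation_difference(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean absolute difference between the two Pearson-r correlation matrices.

    The diagonal is included, so it contributes zeros to the mean.
    """
    diff = original.corr(method="pearson") - synthetic.corr(method="pearson")
    return float(np.abs(diff.to_numpy()).mean())


def distinguishability_auc(
    original: pd.DataFrame, synthetic: pd.DataFrame, seed: int = 0
) -> float:
    """AUC of a logistic regression that separates original from synthetic rows."""
    X = pd.concat([original, synthetic], ignore_index=True)
    # Label 0 = original row, label 1 = synthetic row.
    y = np.concatenate([np.zeros(len(original)), np.ones(len(synthetic))])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return float(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))


def toy_table(rng, n=500, shift=0.0, noise=0.5) -> pd.DataFrame:
    """Hypothetical two-column numeric table; b is a noisy copy of a."""
    a = rng.normal(shift, 1.0, n)
    b = a + rng.normal(0.0, noise, n)
    return pd.DataFrame({"a": a, "b": b})


rng = np.random.default_rng(0)
original = toy_table(rng)
close_synthetic = toy_table(rng)                        # same distribution
far_synthetic = toy_table(rng, shift=3.0, noise=3.0)    # shifted, weaker correlation

auc_close = distinguishability_auc(original, close_synthetic)
auc_far = distinguishability_auc(original, far_synthetic)
corr_close = correlation_difference(original, close_synthetic)
corr_far = correlation_difference(original, far_synthetic)
```

In this sketch an AUC near 0.5 means the classifier cannot tell original from synthetic rows, while an AUC near 1 means the two are easy to separate. How such raw values are mapped onto the zero-to-one scale of the spider plot is not shown here.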

Note that any required pre-processing of the (synthetic) data sets should be done beforehand. This includes handling NaNs and other missing values, treating outliers, and scaling the data.
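
As an illustration of such pre-processing, here is one possible recipe in pandas: median imputation, quantile clipping and min-max scaling of the numeric columns. The helper `preprocess`, its parameters and the toy table are hypothetical and not part of this package. The statistics are taken from a reference table (here the original data) so that original and synthetic data can be transformed in the same way.

```python
import numpy as np
import pandas as pd


def preprocess(
    df: pd.DataFrame,
    reference: pd.DataFrame,
    lower_q: float = 0.01,
    upper_q: float = 0.99,
) -> pd.DataFrame:
    """Impute, clip and scale the numeric columns of df using statistics of reference."""
    numeric = reference.select_dtypes(include="number").columns
    ref = reference[numeric]
    out = df.copy()
    # Replace missing numeric values with the reference column median.
    out[numeric] = out[numeric].fillna(ref.median())
    # Clip values to the reference quantiles to limit the influence of outliers.
    lower, upper = ref.quantile(lower_q), ref.quantile(upper_q)
    out[numeric] = out[numeric].clip(lower=lower, upper=upper, axis=1)
    # Scale to [0, 1]; a zero span (constant column) is replaced by 1 to avoid division by zero.
    span = (upper - lower).replace(0, 1)
    out[numeric] = (out[numeric] - lower) / span
    return out


# Hypothetical toy table with one missing value and one outlier.
raw = pd.DataFrame(
    {
        "age": [20.0, 30.0, np.nan, 40.0, 500.0],
        "group": ["a", "b", "c", "a", "b"],
    }
)
cleaned = preprocess(raw, reference=raw)
```

Non-numeric columns are left untouched in this sketch; categorical variables would need their own encoding before they can be used in a classifier.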

For more information on the selected metrics, please refer to the paper (link will be added upon publication) or contact madelon.molhoek@tno.nl. As we aim to keep developing our code, feedback and tips are welcome.

Figure: utility of synthetic versions of the adult data set for different values of epsilon, depicted in a spider plot. The data were generated with CTGAN and can be found in scripts/datasets.
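
A plot of this kind can be sketched with matplotlib's polar axes. The function `spider_plot` below is a hypothetical helper, not the package's plotting function, and the scores in `example_scores` are made-up placeholder numbers, not results from the adult data set.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np


def spider_plot(scores: dict[str, dict[str, float]]):
    """Draw one closed polygon per synthetic data set on a polar axis."""
    labels = list(next(iter(scores.values())))
    angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
    angles += angles[:1]  # repeat the first angle to close the polygon
    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    for name, metrics in scores.items():
        values = [metrics[label] for label in labels]
        values += values[:1]
        ax.plot(angles, values, label=name)
        ax.fill(angles, values, alpha=0.1)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels)
    ax.set_ylim(0, 1)  # one = complete overlap, zero = no overlap
    ax.legend(loc="upper right")
    return fig


# Placeholder scores for two hypothetical synthetic data sets.
example_scores = {
    "synthetic A": {"univariate": 0.9, "bivariate": 0.8,
                    "multivariate": 0.7, "distinguishability": 0.6},
    "synthetic B": {"univariate": 0.6, "bivariate": 0.5,
                    "multivariate": 0.4, "distinguishability": 0.3},
}
fig = spider_plot(example_scores)
fig.savefig("spiderplot_sketch.png")
```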

PET Lab

The TNO PET Lab consists of generic software components, procedures, and functionalities developed and maintained on a regular basis to facilitate and aid in the development of PET solutions. The lab is a cross-project initiative allowing us to integrate and reuse previously developed PET functionalities to boost the development of new protocols and solutions.

The package tno.sdg.tabular.eval.utility_metrics is part of the TNO Python Toolbox.

Limitations in (end-)use: the content of this software package may solely be used for applications that comply with international export control laws.
This implementation of cryptographic software has not been audited. Use at your own risk.

Documentation

Documentation of the tno.sdg.tabular.eval.utility_metrics package can be found here.

Install

Easily install the tno.sdg.tabular.eval.utility_metrics package using pip:

$ python -m pip install tno.sdg.tabular.eval.utility_metrics

Note: If you are cloning the repository and wish to edit the source code, be sure to install the package in editable mode:

$ python -m pip install -e 'tno.sdg.tabular.eval.utility_metrics'

If you wish to run the tests you can use:

$ python -m pip install 'tno.sdg.tabular.eval.utility_metrics[tests]'

Usage

See the script in the scripts directory.
