A package for evaluating synthetic data fidelity on various performance dimensions.

SynthEval

The SynthEval library is a tool for evaluating the quality of tabular synthetic data compared with real data. Synthetic data is microdata that is artificially generated and thus does not directly correspond to real-world individuals, making it a possible alternative to conventional data anonymisation. This tool builds on many previous works and compiles them into a single tool to make evaluation of synthetic data utility easier for data scientists and researchers alike.

Latest version

The current version of the tool offers a wide selection of utility metrics to evaluate how well your synthetic data aligns with the real data in terms of quality, resemblance and usability. Only three high-level privacy tools are included at present, but the aim is to provide a more extensive assessment of disclosure risk in a future version.

Installation

Install from PyPI using:

pip install syntheval

User guide

In this section we briefly outline how to run the main test; for further details see the notebook. The library is made to be run with two datasets that look alike, i.e. the same number of columns, the same variable types and the same column and variable names. The data should be supplied as pandas DataFrames. In Python the library is accessed and run in the following way:

from syntheval import SynthEval

evaluator = SynthEval(df_real, hold_out=df_test, cat_cols=class_cat_col)
evaluator.full_eval(df_fake, class_lab_col)

Here the user supplies df_real, df_test and df_fake as pandas DataFrames, together with class_cat_col, a list of column names for the categorical variables, and class_lab_col, a string designating one column with discrete values as the target for usability predictions and plot colouring.

Results are saved to a csv file; multiple runs of the same SynthEval instance with different synthetic data files append new rows, allowing for various uses such as snapshots, checkpoints and benchmarking.
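A benchmarking loop might look like the following sketch. The file paths, categorical columns and target column name are hypothetical placeholders; only SynthEval and full_eval come from the library itself.

import pandas as pd
from syntheval import SynthEval

df_real = pd.read_csv("real_train.csv")      # hypothetical file paths
df_test = pd.read_csv("real_holdout.csv")
cat_cols = ["sex", "smoker"]                 # hypothetical categorical columns

evaluator = SynthEval(df_real, hold_out=df_test, cat_cols=cat_cols)

# Each call appends a new row of results, so several synthetic datasets can be
# benchmarked against the same real data in one results file.
for synth_file in ["ctgan_sample.csv", "tvae_sample.csv"]:
    evaluator.full_eval(pd.read_csv(synth_file), "target")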

Command line interface

SynthEval can also be run from the command line with the following syntax:

> SynthEval [OPTIONS] [EVALUATE]

Options:
  -r, --real-data-file PATH   Path to csv file with real data.
  -s, --synt-data-file PATH   Path to csv file with synthetic data.
  -h, --test-data-file PATH   Path to csv file with real data that wasn't used
                              for training.
  -l, --category-labels PATH  Path to txt file with comma separated labels.
  -c, --class-label TEXT      Label to use for prediction usability and
                              coloring on plots.
  --help                      Show this message and exit.
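A typical invocation, using only the options listed above (the file names and target label are placeholders), could look like this:

> SynthEval -r real_train.csv -s synthetic.csv -h real_holdout.csv -l labels.txt -c target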

Included metrics overview

The SynthEval library comes equipped with a broad selection of metrics to evaluate various aspects of synthetic tabular data.

Quality evaluation

Quality metrics are used for checking whether the statistical properties of the real data carry over into the synthetic version. This is mainly done by checking pairwise properties, such as correlation and distributional similarity.

In the code we have implemented:

  • Correlation matrix difference (for the numerical attributes only)
  • Pairwise mutual information matrix difference (for all datatypes)
  • Kolmogorov–Smirnov test (avg. distance, avg. p-value and number and fraction of significant tests)
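
As an illustration of the last item, the sketch below runs a per-column two-sample Kolmogorov–Smirnov comparison with scipy. This is not SynthEval's internal implementation; the function name and the significance level are assumptions for illustration.

import numpy as np
from scipy.stats import ks_2samp

def ks_summary(df_real, df_fake, num_cols, alpha=0.05):
    """Average KS distance, average p-value, and number/fraction of significant tests."""
    dists, pvals = [], []
    for col in num_cols:
        dist, pval = ks_2samp(df_real[col].dropna(), df_fake[col].dropna())
        dists.append(dist)
        pvals.append(pval)
    n_sig = sum(p < alpha for p in pvals)
    return np.mean(dists), np.mean(pvals), n_sig, n_sig / len(num_cols)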

Resemblance evaluation

Resemblance metrics assess whether the synthetic data can be distinguished from the real data. While the preliminary tests already visualize the data, additional tools are used to check synthetic data resemblance. We include:

  • Confidence interval overlap (average and count of nonoverlaps)
  • Hellinger distance (average)
  • Propensity mean squared error (pMSE)
  • Nearest neighbour adversarial accuracy (NNAA)
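
To give an idea of the propensity mean squared error, here is a rough sketch: a discriminator is trained to tell real from synthetic records, and the squared deviation of its predicted propensities from the expected constant is averaged. This is illustrative only, assumes numerically encoded data, and is not the library's exact implementation.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_mse(df_real, df_fake):
    """pMSE sketch: 0 means the discriminator cannot tell the datasets apart."""
    combined = pd.concat([df_real, df_fake], ignore_index=True)
    labels = np.concatenate([np.zeros(len(df_real)), np.ones(len(df_fake))])
    c = len(df_fake) / len(combined)                    # expected propensity
    model = LogisticRegression(max_iter=1000).fit(combined, labels)
    probs = model.predict_proba(combined)[:, 1]
    return np.mean((probs - c) ** 2)                    # worst case is 0.25 for balanced sets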

Usability evaluation

Usability is a core attribute of utility and entails how well the synthetic data can act as a replacement for real data and support a similar analysis. In this tool we test usability by training four different sklearn classifiers on real and synthetic data with 5-fold cross-validation (testing both models on the real validation fold):

  • DecisionTreeClassifier
  • AdaBoostClassifier
  • RandomForestClassifier
  • LogisticRegression

The average accuracy is reported together with the difference in accuracy between models trained on real data and models trained on synthetic data. If a test set is provided, the classifiers are also trained once on the entire training set, and the accuracies and accuracy differences are again reported, this time on the test data.

By default the results are given in terms of accuracy (micro F1 scores). To change this, set the SynthEval.F1_type attribute to one of {‘micro’, ‘macro’, ‘weighted’}.
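The sketch below illustrates the train-on-synthetic/test-on-real idea for a single classifier. SynthEval runs this internally for all four classifiers when full_eval is called, so the helper function, the NumPy-array inputs and the exact fold handling here are assumptions for illustration.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def usability_gap(X_real, y_real, X_fake, y_fake, f1_type="micro"):
    """Compare models trained on real vs. synthetic data on the same real validation folds."""
    real_scores, fake_scores = [], []
    fake_model = DecisionTreeClassifier(random_state=0).fit(X_fake, y_fake)  # trained on synthetic
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_real):
        real_model = DecisionTreeClassifier(random_state=0).fit(X_real[train_idx], y_real[train_idx])
        X_val, y_val = X_real[val_idx], y_real[val_idx]
        real_scores.append(f1_score(y_val, real_model.predict(X_val), average=f1_type))
        fake_scores.append(f1_score(y_val, fake_model.predict(X_val), average=f1_type))
    return np.mean(real_scores), np.mean(fake_scores), np.mean(real_scores) - np.mean(fake_scores)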

Utility score

Finally, a summary utility score is calculated based on the tests described above. Specifically, the utility score is calculated in the following way: $$\mathrm{UTIL} = \frac{1}{10}\Big[(1-\tanh(\text{corr. diff.})) + (1-\tanh(\text{MI diff.})) + (1-\text{KS dist.}) + (1-\text{KS sig. frac.}) + \text{CIO} + (1-\text{H dist.}) + \Big(1-\frac{\text{pMSE}}{0.25}\Big) + (1-\text{NNAA}) + (1-\text{train F1 diff.}) + (1-\text{test F1 diff.})\Big]$$
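Written out as code, the score is a direct transcription of the formula above; the argument names are simply placeholders for the individual metric values SynthEval reports.

import numpy as np

def utility_score(corr_diff, mi_diff, ks_dist, ks_sig_frac, cio, h_dist,
                  pmse, nnaa, train_f1_diff, test_f1_diff):
    """Average of the ten bounded terms from the SynthEval utility formula."""
    terms = [
        1 - np.tanh(corr_diff),
        1 - np.tanh(mi_diff),
        1 - ks_dist,
        1 - ks_sig_frac,
        cio,
        1 - h_dist,
        1 - pmse / 0.25,
        1 - nnaa,
        1 - train_f1_diff,
        1 - test_f1_diff,
    ]
    return sum(terms) / 10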

Privacy evaluation

Privacy is a crucial aspect of evaluating synthetic data. We currently include only three high-level metrics, with more to be added in the future.

  • Average distance to closest record (normalised, and divided by the average NN distance)
  • Hitting rate (for numerical attributes defined as being within the attribute range / 30)
  • Privacy loss (difference in NNAA between the test and training sets; also useful for checking overfitting)
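
As an illustration of the first metric, the sketch below computes an average distance to closest record with scikit-learn, dividing by the average nearest-neighbour distance within the real data. SynthEval's exact normalisation may differ; the function name and inputs are assumptions for illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def avg_dcr(X_real, X_fake):
    """Average distance from each synthetic record to its closest real record,
    divided by the average nearest-neighbour distance within the real data."""
    nn_real = NearestNeighbors(n_neighbors=2).fit(X_real)
    dcr = nn_real.kneighbors(X_fake, n_neighbors=1)[0].ravel()      # synthetic -> closest real
    within = nn_real.kneighbors(X_real, n_neighbors=2)[0][:, 1]     # skip zero self-distance
    return dcr.mean() / within.mean()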

