
A package for evaluating synthetic data fidelity on various performance dimensions.

Project description

SynthEval

The SynthEval library is a tool for evaluating the quality of tabular synthetic data compared with real data. Synthetic data is microdata that is artificially generated and thus does not directly correspond to real-world individuals, making it a possible alternative to conventional data anonymisation. This tool builds on many previous works and compiles them into a single tool to make evaluation of synthetic data utility easier for data scientists and researchers alike.

Latest version

The current version of the tool offers a wide selection of utility metrics to evaluate how well your synthetic data aligns with the real data in terms of quality, resemblance and usability. The current version includes only three high-level privacy tools, but the aim is to provide a more extensive assessment of disclosure risk in a future version.

Installation
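
The package is published on PyPI, so a standard pip install of the syntheval distribution should be sufficient (assuming a Python 3 environment):

pip install syntheval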

User guide

In this section we briefly outline how to run the main test; for further details see the "syntheval_guide.ipynb". The library is made to be run with two datasets that look similar, i.e. same number of columns, same variable types and same column and variable names. The data should be supplied as pandas dataframes. In Python the library is accessed and run in the following way:

from syntheval import SynthEval

# df_real is the real training data, df_test an optional hold-out set, both pandas dataframes
evaluator = SynthEval(df_real, hold_out = df_test, cat_cols = class_cat_col)

# run the full evaluation of the synthetic dataframe df_fake against the real data
evaluator.full_eval(df_fake, class_lab_col)

Here the user supplies df_real, df_test and df_fake as pandas dataframes, as well as class_cat_col, a list of column names for the categorical variables, and class_lab_col, a string designating one column with discrete values as the target for the usability predictions and for colouring the plots.

Results are saved to a csv file; multiple runs of the same SynthEval instance with different synthetic data files add new rows, allowing for various uses such as snapshots, checkpoints and benchmarking.
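
For example (a minimal sketch; df_fake_a and df_fake_b are hypothetical outputs from two different generators), each call appends a new row of results:

for df_fake in [df_fake_a, df_fake_b]:
    evaluator.full_eval(df_fake, class_lab_col)   # one new row in the results csv per synthetic dataset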

Included metrics overview

The SynthEval library comes equipped with a broad selection of metrics to evaluate various aspects of synthetic tabular data.

Quality evaluation

Quality metrics are used for checking whether the statistical properties of the real data carry over into the synthetic version. This is mainly done by checking pairwise properties, such as correlation and distributional similarity.

In the code we have implemented:

  • Correlation matrix difference (for the numerical variables only; see the sketch after this list)
  • Pairwise mutual information matrix difference (for all datatypes)
  • Kolmogorov–Smirnov test (avg. distance, avg. p-value and number and fraction of significant tests)
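
As an illustration of the first of these metrics, a correlation matrix difference can be computed roughly as follows (a minimal sketch of the general idea, not SynthEval's exact implementation; num_cols is an assumed list of the numerical column names):

import numpy as np

def corr_matrix_difference(df_real, df_fake, num_cols):
    # Frobenius norm of the difference between the two Pearson correlation matrices
    corr_real = df_real[num_cols].corr()
    corr_fake = df_fake[num_cols].corr()
    return np.linalg.norm(corr_real.values - corr_fake.values, ord='fro')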

Resemblance evaluation

Resemblance metrics assess whether the synthetic data can be distinguished from the real data. While the preliminary tests already visualise the data, additional tools are used for checking synthetic data resemblance. We include:

  • Confidence interval overlap (average and count of nonoverlaps)
  • Hellinger distance (average)
  • Propensity mean squared error (see the sketch after this list)
  • Nearest neighbour adversarial accuracy
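
To give an idea of the propensity mean squared error: a discriminator is trained to tell real from synthetic rows, and the pMSE measures how far its predicted probabilities are from the uninformative value 0.5. A rough sketch, not necessarily the model or preprocessing used internally:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_mse(df_real, df_fake):
    combined = pd.concat([df_real, df_fake], ignore_index=True)
    X = pd.get_dummies(combined)               # naive one-hot encoding of the categoricals
    y = np.r_[np.zeros(len(df_real)), np.ones(len(df_fake))]
    probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    return np.mean((probs - 0.5) ** 2)         # 0 means the two datasets are indistinguishable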

Usability evaluation

Usability is a core attribute of utility and describes how well the synthetic data can act as a replacement for the real data and support a similar analysis. In this tool we test usability by training four different sklearn classifiers on real and synthetic data with 5-fold cross-validation (testing both models on the real validation fold):

  • DecisionTreeClassifier
  • AdaBoostClassifier
  • RandomForestClassifier
  • LogisticRegression

The average accuracy is reported together with the difference in accuracy between the models trained on real and on synthetic data. If a test set is provided, the classifiers are additionally trained once on the entire training set, and the accuracies and accuracy differences are reported again, this time on the test data.

By default the results are given in terms of accuracy (micro F1 scores). To change this, set the SynthEval.F1_type attribute to one of {'micro', 'macro', 'weighted'}.
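
A rough sketch of the comparison carried out for a single classifier (here the DecisionTreeClassifier; the actual tool runs all four models and reports F1 scores, and X_real, y_real, X_fake, y_fake are assumed to be preprocessed numpy arrays):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def usability_gap(X_real, y_real, X_fake, y_fake, n_splits=5):
    gaps = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X_real):
        model_real = DecisionTreeClassifier().fit(X_real[train_idx], y_real[train_idx])
        model_fake = DecisionTreeClassifier().fit(X_fake, y_fake)   # trained on the synthetic data
        # both models are scored on the held-out real validation fold
        acc_real = model_real.score(X_real[val_idx], y_real[val_idx])
        acc_fake = model_fake.score(X_real[val_idx], y_real[val_idx])
        gaps.append(acc_real - acc_fake)
    return float(np.mean(gaps))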

Utility score

Finally, a summary utility score is calculated based on the tests described above. Specifically, the utility score is calculated in the following way: $$\mathrm{UTIL} = \frac{1}{10}\left[(1-\tanh(\text{corr. diff.})) + (1-\tanh(\text{MI diff.})) + (1-\text{KS dist.}) + (1-\text{KS sig. frac.}) + \mathrm{CIO} + (1-\text{H dist.}) + \left(1-\frac{\mathrm{pMSE}}{0.25}\right) + (1-\mathrm{NNAA}) + (1-\text{train F1 diff.}) + (1-\text{test F1 diff.})\right]$$
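
Written out as code, the combination amounts to the following (the arguments are the values reported by the tests above):

import numpy as np

def utility_score(corr_diff, mi_diff, ks_dist, ks_sig_frac, cio,
                  h_dist, pmse, nnaa, train_f1_diff, test_f1_diff):
    terms = [
        1 - np.tanh(corr_diff),      # correlation matrix difference
        1 - np.tanh(mi_diff),        # mutual information matrix difference
        1 - ks_dist,                 # average Kolmogorov-Smirnov distance
        1 - ks_sig_frac,             # fraction of significant KS tests
        cio,                         # confidence interval overlap
        1 - h_dist,                  # average Hellinger distance
        1 - pmse / 0.25,             # propensity MSE (0.25 is the worst possible value)
        1 - nnaa,                    # nearest neighbour adversarial accuracy
        1 - train_f1_diff,           # F1 difference on the training folds
        1 - test_f1_diff,            # F1 difference on the test set
    ]
    return sum(terms) / 10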

Privacy evaluation

Privacy is a crucial aspect of evaluating synthetic data. We include only three high-level metrics, with more to be added in the future:

  • Average distance to closest record (normalised, and divided by the average nearest-neighbour distance; see the sketch after this list)
  • Hitting rate (for numericals, defined to be within the attribute range / 30)
  • Privacy loss (difference in NNAA between test and training set; also useful for checking overfitting)
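
A minimal sketch of the distance to closest record idea (not SynthEval's exact normalisation; assumes purely numerical, already scaled data):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def avg_distance_to_closest_record(df_real, df_fake):
    nn = NearestNeighbors(n_neighbors=2).fit(df_real.values)
    # distance from each synthetic record to its closest real record
    dcr = nn.kneighbors(df_fake.values, n_neighbors=1)[0][:, 0]
    # average nearest-neighbour distance within the real data (column 0 is the self-match)
    within = nn.kneighbors(df_real.values, n_neighbors=2)[0][:, 1]
    return dcr.mean() / within.mean()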

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

syntheval-1.tar.gz (454.6 kB)

Uploaded Source

Built Distribution

syntheval-1-py3-none-any.whl (15.2 kB)

Uploaded Python 3

File details

Details for the file syntheval-1.tar.gz.

File metadata

  • Download URL: syntheval-1.tar.gz
  • Upload date:
  • Size: 454.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for syntheval-1.tar.gz:

  • SHA256: 3d6caee2a3820aae803b5e47e2fadc1f442ad18aa5e116e413329ee04103e83f
  • MD5: 539eeaf7545cd4c4838214b832d64129
  • BLAKE2b-256: d6e1e006d9a5eeddcb85c22eef60d8c7330f641594f588aae83f4e84e96401e1

See more details on using hashes here.

File details

Details for the file syntheval-1-py3-none-any.whl.

File metadata

  • Download URL: syntheval-1-py3-none-any.whl
  • Upload date:
  • Size: 15.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for syntheval-1-py3-none-any.whl:

  • SHA256: 11d9612e09e7f162c9a384905db2de8c5dc785601d02971941f3d7ee5c89d728
  • MD5: 664fd13d07478f0f86788ac6168a1283
  • BLAKE2b-256: 31db39fb1f6bb51af62fc9b8a6c2c3d4e3522f2662d17a9eed1bf5e5b0be7665

See more details on using hashes here.
