Synthetic Data: Utility, Regulatory compliance, and Ethical privacy

The SURE package is an open-source Python library for assessing the utility and privacy of tabular synthetic datasets.

The library consists of several Python modules that can be imported into any Python script once the library is installed.

[!WARNING] This is a beta version of the library. It currently runs only on Linux and macOS.

[!IMPORTANT] Requires Python >= 3.10

Installation

To install the library run the following command in your terminal:

$ pip install clearbox-sure
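Before installing, it can help to confirm that the environment matches the requirements stated above (Python 3.10 or newer, on Linux or macOS). The following is a minimal sketch using only the standard library; the helper name `is_supported_environment` is hypothetical and is not part of SURE.

```python
import platform
import sys


def is_supported_environment(version_info=None, system=None):
    """Return True if the interpreter and OS match the stated requirements.

    Hypothetical helper for illustration; not part of the SURE library.
    """
    version_info = sys.version_info if version_info is None else version_info
    system = platform.system() if system is None else system
    python_ok = tuple(version_info[:2]) >= (3, 10)
    # platform.system() reports "Linux" on Linux and "Darwin" on macOS
    os_ok = system in ("Linux", "Darwin")
    return python_ok and os_ok


print("Supported environment:", is_supported_environment())
```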

Modules overview

The SURE library features the following modules:

  1. Preprocessor
  2. Statistical similarity metrics
  3. Model garden
  4. ML utility metrics
  5. Distance metrics
  6. Privacy attack sandbox
  7. Report generator

Preprocessor

The preprocessor module transforms the input datasets into the standard structure expected by the subsequent modules. It is built on the Polars library, which makes this step significantly faster than with other data processing libraries.
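To give a feel for the kind of transformation a preprocessing step performs, here is a minimal standard-library sketch of forward-filling missing numeric values and then standardizing the column. The option names `num_fill_null='forward'` and `scaling='standardize'` appear in the usage snippet further down; what they do internally is an assumption here, and the functions `forward_fill` and `standardize` are hypothetical illustrations, not SURE's implementation.

```python
from statistics import mean, pstdev


def forward_fill(values):
    """Replace each None with the most recent non-null value seen so far.

    A leading None has no predecessor and is left as None.
    """
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [20.0, None, 40.0, None]   # toy numeric column with missing entries
filled = forward_fill(ages)
scaled = standardize(filled)
print(filled, scaled)
```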

Utility

The statistical similarity metrics, the ML utility metrics and the model garden modules constitute the data utility evaluation part.

The statistical similarity module and the distance metrics module take the preprocessed datasets as input and assess how statistically similar the datasets are and how much the content of the synthetic dataset differs from that of the original dataset. In particular, the statistical similarity metrics module uses the real and synthetic datasets to assess how close they are in statistical properties such as mean, correlation, and distribution.

The model garden executes a classification or regression task on the given dataset with multiple machine learning models, returning the performance metrics of each of the models tested on the given task and dataset.

The ML utility metrics module uses the best-performing models from the model garden to measure how useful the synthetic data is for a given ML task (classification or regression).
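The idea behind an ML utility score can be sketched by training the same model once on real data and once on synthetic data, then testing both on held-out real data (the usage snippet below calls this TSTR, Train on Synthetic, Test on Real). The example uses a tiny 1-nearest-neighbour classifier written in plain Python as a stand-in for the model garden's models; all names and data here are hypothetical.

```python
def nearest_neighbor_predict(train_X, train_y, test_X):
    """Predict each test label as the label of the closest training point (1-NN)."""
    predictions = []
    for x in test_X:
        distances = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
        predictions.append(train_y[distances.index(min(distances))])
    return predictions


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: two numeric features and a binary label
X_real = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (0.9, 1.1)]
y_real = [0, 0, 1, 1]
X_synth = [(0.1, 0.0), (0.0, 0.2), (1.1, 1.0), (0.8, 0.9)]
y_synth = [0, 0, 1, 1]
X_test = [(0.1, 0.1), (1.0, 1.0), (0.3, 0.2), (0.7, 0.8)]
y_test = [0, 1, 0, 1]

trtr = accuracy(y_test, nearest_neighbor_predict(X_real, y_real, X_test))    # train on real
tstr = accuracy(y_test, nearest_neighbor_predict(X_synth, y_synth, X_test))  # train on synthetic
print(f"TRTR accuracy: {trtr:.2f}, TSTR accuracy: {tstr:.2f}")
```

The closer the synthetic-trained score is to the real-trained one, the more useful the synthetic data is for that task.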

Privacy

The distance metrics and the privacy attack sandbox make up the synthetic data privacy assessment modules.

The distance metrics module computes the Gower distance between the two input datasets and, for each record of the first dataset, the distance to its closest record in the second.
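For readers unfamiliar with the Gower distance, the sketch below shows the standard definition for mixed-type records: numeric features contribute their absolute difference divided by the feature's range, categorical features contribute 0 when equal and 1 otherwise, and the result is averaged over features. The distance to the closest record is then the minimum over a reference dataset. The function names and toy data are hypothetical, and this is not SURE's (much faster) implementation.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower distance between two records given as dicts.

    numeric_ranges maps each numeric column to its value range (max - min);
    columns absent from it are treated as categorical.
    """
    total = 0.0
    for column in a:
        if column in numeric_ranges:
            span = numeric_ranges[column]
            total += abs(a[column] - b[column]) / span if span else 0.0
        else:
            total += 0.0 if a[column] == b[column] else 1.0
    return total / len(a)


def distance_to_closest(records, reference, numeric_ranges):
    """For each record, the smallest Gower distance to any reference record."""
    return [
        min(gower_distance(r, ref, numeric_ranges) for ref in reference)
        for r in records
    ]


real = [{"age": 20, "city": "Rome"}, {"age": 60, "city": "Milan"}]
synth = [{"age": 20, "city": "Rome"}, {"age": 40, "city": "Milan"}]
ranges = {"age": 40}  # range of "age" across the data: 60 - 20

dcr = distance_to_closest(synth, real, ranges)
print(dcr)
```

A distance of zero means the synthetic record is an exact copy of a real one, which is why zero-distance records are counted separately in the usage snippet below.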

The ML privacy attack sandbox lets you simulate a Membership Inference Attack that attempts to re-identify the vulnerable records found by the distance metrics module, and evaluate how exposed the synthetic dataset is to this kind of attack.
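One simple way to picture a membership inference attack is a distance-based adversary: given a mix of training and holdout records, it guesses that a record was in the training set whenever a synthetic record lies very close to it. The sketch below implements that idea on one-dimensional toy data and reports the attack's precision. It is an assumption-laden illustration, not necessarily how SURE's `membership_inference_test` works; all function names are hypothetical.

```python
def closest_distance(record, dataset):
    """Smallest absolute difference between a numeric record and any dataset entry."""
    return min(abs(record - other) for other in dataset)


def membership_guesses(adversary_records, synthetic, threshold):
    """Guess 'member' (True) when a record lies within threshold of a synthetic record."""
    return [closest_distance(r, synthetic) <= threshold for r in adversary_records]


def precision(guesses, ground_truth):
    """Share of 'member' guesses that are actually training records."""
    flagged = [truth for guess, truth in zip(guesses, ground_truth) if guess]
    return sum(flagged) / len(flagged) if flagged else 0.0


training = [10.0, 20.0, 30.0]    # records used to fit the generator
holdout = [15.0, 25.0, 35.0]     # records never seen by the generator
synthetic = [10.1, 19.8, 33.0]   # toy synthetic data; two points nearly copy training rows

adversary_records = training + holdout
ground_truth = [True] * len(training) + [False] * len(holdout)

guesses = membership_guesses(adversary_records, synthetic, threshold=0.5)
print("Attack precision:", precision(guesses, ground_truth))
```

A precision well above the share of training records in the adversary's dataset indicates that the synthetic data leaks membership information.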

Report

Finally, the report generator summarizes the utility and privacy metrics computed by the previous modules in a visual digest with charts and tables.

The following diagram shows how each module contributes to the utility-privacy assessment process and how the modules connect to one another.

[Diagram: SURE module workflow]

Usage

The library is built on Polars for fast computation and accepts both Polars and pandas DataFrames as input.

To enable the library's modules to perform the evaluation, the user must provide three datasets: the original real training dataset (which was used to train the generative model that produced the synthetic dataset), the real holdout dataset (which was NOT used to train the generative model), and the corresponding synthetic dataset.
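If only a single real dataset is available, the training and holdout sets have to be separated before the generative model is trained. A minimal sketch of such a split, using only the standard library, is shown here; `split_train_holdout`, the 20% holdout fraction, and the seed are all illustrative assumptions rather than anything SURE prescribes.

```python
import random


def split_train_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (training, holdout) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


rows = list(range(10))  # stand-in for the rows of a real dataset
training, holdout = split_train_holdout(rows)
print(len(training), len(holdout))
```

Only the training portion should be shown to the generative model; the holdout portion is kept aside for the evaluation.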

Below is a code snippet example for the usage of the library:

# Import the necessary modules from the SURE library
from sure import Preprocessor, report
from sure.utility import (compute_statistical_metrics, compute_mutual_info,
                          compute_utility_metrics_class)
from sure.privacy import (distance_to_closest_record, dcr_stats,
                          number_of_dcr_equal_to_zero, validation_dcr_test,
                          adversary_dataset, membership_inference_test)

# Assuming real_data, valid_data and synth_data are three pandas DataFrames

# Preprocessor initialization and query execution on the real, synthetic and validation datasets
preprocessor            = Preprocessor(real_data, get_discarded_info=False, num_fill_null='forward', scaling='standardize')

real_data_preprocessed  = preprocessor.transform(real_data)
valid_data_preprocessed = preprocessor.transform(valid_data)
synth_data_preprocessed = preprocessor.transform(synth_data)

# Statistical properties and mutual information
num_features_stats, cat_features_stats, temporal_feat_stats = compute_statistical_metrics(real_data, synth_data)
corr_real, corr_synth, corr_difference                      = compute_mutual_info(real_data_preprocessed, synth_data_preprocessed)

# ML utility: TSTR - Train on Synthetic, Test on Real
# Assuming the datasets have a "label" column for the machine learning task they are intended for.
# Note: .drop(..., axis=1) is pandas syntax, whereas .limit() is a Polars method (the pandas
# equivalent is .head()); adapt these calls to the DataFrame type returned by preprocessor.transform.
X_train      = real_data_preprocessed.drop("label", axis=1)
y_train      = real_data_preprocessed["label"]
X_synth      = synth_data_preprocessed.drop("label", axis=1)
y_synth      = synth_data_preprocessed["label"]
# Test the trained models on a portion of the real holdout dataset (first 10k rows)
X_test       = valid_data_preprocessed.drop("label", axis=1).limit(10000)
y_test       = valid_data_preprocessed["label"].limit(10000)
TSTR_metrics = compute_utility_metrics_class(X_train, X_synth, X_test, y_train, y_synth, y_test)

# Distance to closest record
dcr_synth_train       = distance_to_closest_record("synth_train", synth_data, real_data)
dcr_synth_valid       = distance_to_closest_record("synth_val", synth_data, valid_data)
dcr_stats_synth_train = dcr_stats("synth_train", dcr_synth_train)
dcr_stats_synth_valid = dcr_stats("synth_val", dcr_synth_valid)
dcr_zero_synth_train  = number_of_dcr_equal_to_zero("synth_train", dcr_synth_train)
dcr_zero_synth_valid  = number_of_dcr_equal_to_zero("synth_val", dcr_synth_valid)
share                 = validation_dcr_test(dcr_synth_train, dcr_synth_valid)

# ML privacy attack sandbox initialization and simulation
adversary_df = adversary_dataset(real_data_preprocessed, valid_data_preprocessed)
# The function adversary_dataset adds a column "privacy_test_is_training" to the adversary dataset, indicating whether the record was part of the training set or not
adversary_guesses_ground_truth = adversary_df["privacy_test_is_training"]
MIA = membership_inference_test(adversary_df, synth_data_preprocessed, adversary_guesses_ground_truth)

# Report generation as HTML page
report(real_data, synth_data)
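The `share` value in the snippet above comes from comparing the two distance-to-closest-record lists. One plausible reading, sketched here as an assumption rather than as the library's definition, is the percentage of synthetic records that lie closer to the training set than to the holdout set. The function `share_closer_to_training` and the toy distances are hypothetical.

```python
def share_closer_to_training(dcr_synth_train, dcr_synth_valid):
    """Percentage of synthetic records strictly closer to training than to holdout."""
    if len(dcr_synth_train) != len(dcr_synth_valid):
        raise ValueError("Both DCR lists must describe the same synthetic records")
    closer = sum(t < v for t, v in zip(dcr_synth_train, dcr_synth_valid))
    return 100.0 * closer / len(dcr_synth_train)


# One entry per synthetic record: distance to its closest training / holdout record
dcr_train = [0.00, 0.10, 0.30, 0.25]
dcr_valid = [0.20, 0.15, 0.10, 0.20]
print(share_closer_to_training(dcr_train, dcr_valid))
```

Under this reading, a value near 50% suggests the synthetic data is no closer to the training records than to unseen ones, while a value much higher hints that the generator may have memorized training data.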

Follow the step-by-step guide to test the library.


