Clover: Synthetic Health Data Generation and Validation Library

Clover is a library for generating tabular synthetic data and critically assessing it. It provides eight synthetic data generators and a unified evaluation framework for the generated data. Evaluation covers two aspects: how much information from the original data is preserved, and the level of privacy protection achieved.

Acknowledging the inherent trade-off between data utility and privacy, Clover is designed to support the creation of synthetic datasets that strike an effective balance between real-world usefulness and the imperative of safeguarding patient privacy. For each generator included in the library, a differentially private version is also available.

Useful Links

Documentation

Documentation is available at: https://crchum-citadel.github.io/clover/

Current Features

  • Synthetic data generators with integrated differential privacy, supporting continuous and categorical variables (unique identifiers are not handled)
  • Utility and privacy reports to assess the fidelity of the synthetic data:
    • Summary table
    • Detailed report with figures
  • The following utility metrics are implemented:
    • Univariate metrics
      • Continuous & categorical consistency
      • Continuous & categorical statistics
      • Hellinger distance
      • Kullback-Leibler divergence
    • Bivariate metrics
      • Pairwise Pearson and Spearman correlation difference
      • Pairwise Chi-square correlation difference
    • Population metrics
      • Distinguishability
      • Cross learning (regression & classification)
    • Application metrics
      • Prediction (regression & classification)
      • F-Score for binary classification with continuous variables only
      • Feature importance
  • The following privacy metrics are implemented:
    • Reidentification metrics: Assess the risk of linking records in the synthetic data back to specific individuals in the original real dataset.
      • Distance to Closest Record: Measures how similar each synthetic record is to its nearest neighbor in the real data, indicating potential for identifying near-duplicates.
      • Nearest Neighbor Distance Ratio: Compares the distance to the nearest neighbor within the synthetic data to the distance to the nearest neighbor in the real data for synthetic points, highlighting if synthetic points are too close to real ones.
    • Membership inference attack (MIA): Evaluates how well an adversary can determine if a particular record was part of the original training dataset used to generate the synthetic data.
      • GAN-Leaks: Specifically assesses the leakage of information from the training data in synthetic data generated by Generative Adversarial Networks (GANs).
      • Monte Carlo membership inference attack: A specific type of membership inference attack that uses Monte Carlo simulation to estimate the probability of a record being in the training data.
      • Logan: Assesses the risk of membership inference by training a model to distinguish between the first and second generations of synthetic data.
      • TableGan: Evaluates the vulnerability to membership inference by training both a discriminator (to distinguish between real and synthetic data) and a classifier (likely to predict whether a record was part of the training set).
      • Detector: Measures the susceptibility to membership inference by training a model to classify between the first generation of synthetic data and real data that was not used to generate the synthetic data.
      • Collision: Measures the frequency of identical or very similar records appearing in the synthetic dataset, which could indicate a privacy risk if unique real records are being replicated.
  • Metareport to compare several synthetic datasets with respect to the metrics
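
To make two of the metrics listed above concrete, here is a minimal sketch of the Hellinger distance (a univariate utility metric) and the Distance to Closest Record (a reidentification metric). It is not Clover's implementation: the function names `hellinger_distance` and `distance_to_closest_record` are hypothetical, the Hellinger example assumes a categorical variable, and the distance example assumes numeric features that have already been scaled and compares them with a plain Euclidean distance.

```python
import math
from collections import Counter


def hellinger_distance(real, synthetic):
    """Hellinger distance between the empirical distributions of two
    categorical samples: 0 means identical, 1 means disjoint support."""
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    categories = set(p) | set(q)
    # Counter returns 0 for categories missing from one of the samples.
    squared = sum(
        (math.sqrt(p[c] / n_p) - math.sqrt(q[c] / n_q)) ** 2
        for c in categories
    )
    return math.sqrt(squared / 2)


def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row.
    Assumption: rows are equal-length numeric tuples on comparable scales."""
    return [
        min(math.dist(s, r) for r in real_rows)
        for s in synthetic_rows
    ]


# Hypothetical toy data, for illustration only.
real_column = ["benign", "benign", "malignant", "benign"]
synthetic_column = ["benign", "malignant", "malignant", "benign"]
print(hellinger_distance(real_column, synthetic_column))

real_rows = [(0.0, 0.0), (1.0, 1.0)]
synthetic_rows = [(0.0, 0.0), (1.0, 2.0)]
print(distance_to_closest_record(synthetic_rows, real_rows))
```

A distance of zero in the second output flags a synthetic row that exactly copies a real one, which is the kind of near-duplicate the reidentification metrics are meant to surface.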

Usage

Requirements

All required packages are listed in the requirements file. Clover has been tested on Linux with Python 3.8.10 and Python 3.10.

Installation

The package is available on PyPI. You can install it in a conda environment with:

pip install -i https://test.pypi.org/simple/ clover-synth
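
One way to confirm the installation from Python is to ask the standard library for the installed distribution's version. This is a small sketch, not part of Clover: the helper name `installed_version` is hypothetical, and the distribution name `clover-synth` is taken from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("clover-synth"))
```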

Quickstart

To get started, four notebooks guide you through generating synthetic data, producing the associated utility and privacy reports, and tuning hyperparameters.

To get the average summary metrics results for both utility and privacy at once, see the combined report notebook. To compare several synthetic datasets with respect to a list of metrics, see the metareport notebook.

The notebooks are based on the Wisconsin Breast Cancer Dataset (WBCD).

Join Our Community

If you have a question or a feature request, or if you have run into a problem, please open an issue on GitHub.

We also welcome contributions to the project. The packages required for development are listed in the dev-requirements file. The documentation is generated with Sphinx.
