
Synthetic data generator and evaluator!


synthcity BETA

A library for generating and evaluating synthetic tabular data.


Features:

  • :key: Easy-to-extend, pluggable architecture.
  • :cyclone: Several evaluation metrics for correctness and privacy.
  • :fire: Several reference models, by type:
    • General purpose: GAN-based (AdsGAN, CTGAN, PATEGAN, DP-GAN), VAE-based (TVAE, RTVAE), Normalizing flows, Bayesian Networks (PrivBayes, BN).
    • Time Series generators: TimeGAN, FourierFlows, Probabilistic autoregressive.
    • Survival Analysis: SurvivalGAN, SurVAE.
    • Privacy-focused: DECAF, DP-GAN, AdsGAN, PATEGAN, PrivBayes.
    • Domain adaptation: RadialGAN.
  • :book: Read the docs!
  • :airplane: Check out the tutorials!

:rotating_light: NOTE: Python 3.10 is NOT supported yet.

:rocket: Installation

The library can be installed from PyPI using

$ pip install synthcity

or from source, using

$ pip install .

:boom: Sample Usage

Generic data

  • List the available general-purpose generators
from synthcity.plugins import Plugins

Plugins(categories=["generic", "privacy"]).list()
  • Load and train a tabular generator
from sklearn.datasets import load_diabetes
from synthcity.plugins import Plugins

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y

syn_model = Plugins().get("adsgan")

syn_model.fit(X)
  • Generate new synthetic tabular data (a sketch for materializing the output as a pandas DataFrame follows this list)
syn_model.generate(count = 10)
  • Benchmark the quality of the plugins
# third party
from sklearn.datasets import load_diabetes

# synthcity absolute
from synthcity.benchmark import Benchmarks
from synthcity.plugins.core.constraints import Constraints
from synthcity.plugins.core.dataloader import GenericDataLoader

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y

loader = GenericDataLoader(X, target_column="target", sensitive_columns=["sex"])

score = Benchmarks.evaluate(
    [
        (f"example_{model}", model, {})  # testname, plugin name, plugin args
        for model in ["adsgan", "ctgan", "tvae"]
    ],
    loader,
    synthetic_size=1000,
    metrics={"performance": ["linear_model"]},
    repeats=3,
)
Benchmarks.print(score)
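
As promised above, a minimal sketch for materializing the generated output as a pandas DataFrame and for passing hyperparameters. The .dataframe() accessor on the returned object and the n_iter keyword argument are assumptions about the synthcity API, not calls shown on this page.

from sklearn.datasets import load_diabetes
from synthcity.plugins import Plugins

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y

# n_iter is an assumed hyperparameter name; plugins are expected to accept
# their hyperparameters as keyword arguments to get().
syn_model = Plugins().get("adsgan", n_iter=100)
syn_model.fit(X)

X_syn = syn_model.generate(count=100)
df_syn = X_syn.dataframe()  # assumed accessor returning a pandas DataFrame
print(df_syn.head())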

Survival analysis

  • List the available generators dedicated to survival analysis
from synthcity.plugins import Plugins

Plugins(categories=["generic", "privacy", "survival_analysis"]).list()
  • Generate new data
from lifelines.datasets import load_rossi
from synthcity.plugins.core.dataloader import SurvivalAnalysisDataLoader
from synthcity.plugins import Plugins

X = load_rossi()
data = SurvivalAnalysisDataLoader(
    X,
    target_column="arrest",
    time_to_event_column="week",
)

syn_model = Plugins().get("survival_gan")

syn_model.fit(data)

syn_model.generate(count=10)
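
The same loader can be reused to compare the other survival-analysis generators listed in the Methods section below. A minimal sketch that reuses only calls already shown on this page; the plugin names are taken from the Methods tables.

from lifelines.datasets import load_rossi
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import SurvivalAnalysisDataLoader

X = load_rossi()
data = SurvivalAnalysisDataLoader(
    X,
    target_column="arrest",
    time_to_event_column="week",
)

# Fit each survival-analysis generator on the same loader and draw a small sample.
for name in ["survival_gan", "survival_ctgan", "survae", "survival_nflow"]:
    model = Plugins().get(name)
    model.fit(data)
    print(name, model.generate(count=5))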

Time series

  • List the available generators
from synthcity.plugins import Plugins

Plugins(categories=["generic", "privacy", "time_series"]).list()
  • Generate new data
# synthcity absolute
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import TimeSeriesDataLoader
from synthcity.utils.datasets.time_series.google_stocks import GoogleStocksDataloader

static_data, temporal_data, horizons, outcome = GoogleStocksDataloader().load()
data = TimeSeriesDataLoader(
    temporal_data=temporal_data,
    observation_times=horizons,
    static_data=static_data,
    outcome=outcome,
)

syn_model = Plugins().get("timegan")

syn_model.fit(data)

syn_model.generate(count=10)
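
The GoogleStocks helper hides the expected input layout. Below is a minimal sketch for building a TimeSeriesDataLoader from your own data, assuming, based on the example above, that temporal_data is a list of per-sequence DataFrames, observation_times is a list of per-sequence time points, and static_data and outcome are DataFrames with one row per sequence.

import numpy as np
import pandas as pd

from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import TimeSeriesDataLoader

rng = np.random.default_rng(0)

# Hypothetical toy dataset: 100 sequences, 10 observations each, 3 temporal features.
n_seq, seq_len = 100, 10
temporal_data = [
    pd.DataFrame(rng.random((seq_len, 3)), columns=["f1", "f2", "f3"])
    for _ in range(n_seq)
]
observation_times = [list(range(seq_len)) for _ in range(n_seq)]
static_data = pd.DataFrame({"site": rng.integers(0, 3, n_seq)})
outcome = pd.DataFrame({"label": rng.integers(0, 2, n_seq)})

data = TimeSeriesDataLoader(
    temporal_data=temporal_data,
    observation_times=observation_times,
    static_data=static_data,
    outcome=outcome,
)

syn_model = Plugins().get("timegan")
syn_model.fit(data)
syn_model.generate(count=10)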

Serialization

  • Using save/load methods
from synthcity.utils.serialization import save, load
from synthcity.plugins import Plugins

syn_model = Plugins().get("adsgan")

buff = save(syn_model)
reloaded = load(buff)

assert syn_model.name() == reloaded.name()
  • Using the Serializable interface
from synthcity.plugins import Plugins

syn_model = Plugins().get("adsgan")

buff = syn_model.save()
reloaded = Plugins().load(buff)

assert syn_model.name() == reloaded.name()
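
To persist a model to disk, the buffer returned by save() can be written out with plain Python. This sketch assumes save() returns a bytes object, as the round trip above suggests.

from pathlib import Path

from synthcity.plugins import Plugins
from synthcity.utils.serialization import load, save

syn_model = Plugins().get("adsgan")

# Assumption: save() returns a bytes buffer that can be written to disk as-is.
Path("adsgan_model.bin").write_bytes(save(syn_model))

reloaded = load(Path("adsgan_model.bin").read_bytes())
assert syn_model.name() == reloaded.name()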

📓 Tutorials

🔑 Methods

Bayesian methods

Method Description Reference
bayesian_network The method represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG), and uses it to sample new data points pgmpy
privbayes A differentially private method for releasing high-dimensional data. PrivBayes: Private Data Release via Bayesian Networks

Generative adversarial networks (GANs)

Method Description Reference
adsgan A conditional GAN framework that generates synthetic data while minimizing patient identifiability, defined based on the probability of re-identification given the combination of all data on any individual patient. Anonymization Through Data Synthesis Using Generative Adversarial Networks (ADS-GAN)
pategan The method applies the Private Aggregation of Teacher Ensembles (PATE) framework to GANs, tightly bounding the influence of any individual sample on the model. This yields tight differential privacy guarantees and thus improved performance over models with the same guarantees. PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees
ctgan A conditional generative adversarial network which can handle tabular data. Modeling Tabular data using Conditional GAN

Variational autoencoders (VAE)

Method Description Reference
tvae A conditional VAE network which can handle tabular data. Modeling Tabular data using Conditional GAN
rtvae A robust variational autoencoder with β divergence for tabular data (RTVAE) with mixed categorical and continuous features. Robust Variational Autoencoder for Tabular Data with β Divergence

Normalizing Flows

Method Description Reference
nflow Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation can be efficient and exact. Neural Spline Flows

Static Survival analysis methods

Method Description Reference
survival_gan SurvivalGAN is a generative model that can handle survival data by addressing the imbalance in the censoring and time horizons, using a dedicated mechanism for approximating time to event/censoring from the input and survival function. ---
survival_ctgan SurvivalGAN version using CTGAN ---
survae SurvivalGAN version using VAE ---
survival_nflow SurvivalGAN version using normalizing flows ---

Time-Series and Time-Series Survival Analysis methods

Method Description Reference
timegan TimeGAN is a framework for generating realistic time-series data that combines the flexibility of the unsupervised paradigm with the control afforded by supervised training. Through a learned embedding space jointly optimized with both supervised and adversarial objectives, the network adheres to the dynamics of the training data during sampling. Time-series Generative Adversarial Networks
fflows FFlows is an explicit likelihood model based on a novel class of normalizing flows that view time-series data in the frequency-domain rather than the time-domain. The method uses a discrete Fourier transform (DFT) to convert variable-length time-series with arbitrary sampling periods into fixed-length spectral representations, then applies a (data-dependent) spectral filter to the frequency-transformed time-series. Generative Time-series Modeling with Fourier Flows
probabilistic_ar The Probabilistic AutoRegressive model learns multi-type, multivariate time-series data and then generates new synthetic data with the same format and properties as the training data. PAR model

Privacy & Fairness

Method Description Reference
decaf Machine learning models have been criticized for reflecting unfair biases in the training data. Instead of solving this by introducing fair learning algorithms directly, DECAF focuses on generating fair synthetic data, such that any downstream learner is fair. Generating fair synthetic data from unfair data, while remaining truthful to the underlying data-generating process (DGP), is non-trivial. DECAF is a GAN-based fair synthetic data generator for tabular data: it embeds the DGP explicitly as a structural causal model in the input layers of the generator, allowing each variable to be reconstructed conditioned on its causal parents. This enables inference-time debiasing, where biased edges can be strategically removed to satisfy user-defined fairness requirements. DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks
privbayes A differentially private method for releasing high-dimensional data. PrivBayes: Private Data Release via Bayesian Networks
dpgan Differentially Private GAN Differentially Private Generative Adversarial Network
adsgan A conditional GAN framework that generates synthetic data while minimizing patient identifiability, defined based on the probability of re-identification given the combination of all data on any individual patient. Anonymization Through Data Synthesis Using Generative Adversarial Networks (ADS-GAN)
pategan The method applies the Private Aggregation of Teacher Ensembles (PATE) framework to GANs, tightly bounding the influence of any individual sample on the model. This yields tight differential privacy guarantees and thus improved performance over models with the same guarantees. PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees
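
The privacy-focused generators above are exposed through the same "privacy" plugin category used in the listing calls earlier. A minimal sketch, reusing only calls from the Sample Usage section:

from sklearn.datasets import load_diabetes
from synthcity.plugins import Plugins

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y

# The registered plugin names match the Method column above.
print(Plugins(categories=["privacy"]).list())

syn_model = Plugins().get("dpgan")
syn_model.fit(X)
syn_model.generate(count=10)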

Domain adaptation

Method Description Reference
radialgan Training complex machine learning models for prediction often requires a large amount of data that is not always readily available. Leveraging external datasets from related but different sources is therefore essential if good predictive models are to be built for deployment in settings where data can be rare. RadialGAN uses multiple GAN architectures to learn to translate from one dataset to another, allowing the target dataset to be augmented effectively and better predictive models to be learned than from the target dataset alone. RadialGAN: Leveraging multiple datasets to improve target-specific predictive models using Generative Adversarial Networks

Debug methods

Method Description Reference
marginal_distributions A differentially private method that samples from the marginal distributions of the training set ---
uniform_sampler A differentially private method that uniformly samples from the [min, max] ranges of each column. ---
dummy_sampler Resample data points from the training set ---
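
The names in the Method columns above are the identifiers accepted by Plugins().get() and by the test tuples of Benchmarks.evaluate(). A minimal sketch, reusing only calls shown earlier, that benchmarks one of the debug baselines against a real generator:

from sklearn.datasets import load_diabetes

from synthcity.benchmark import Benchmarks
from synthcity.plugins.core.dataloader import GenericDataLoader

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y

loader = GenericDataLoader(X, target_column="target")

# dummy_sampler simply resamples the training set, so it is a useful
# reference point for the evaluation metrics below.
score = Benchmarks.evaluate(
    [
        ("baseline_dummy", "dummy_sampler", {}),
        ("example_ctgan", "ctgan", {}),
    ],
    loader,
    synthetic_size=1000,
    metrics={"performance": ["linear_model"]},
    repeats=1,
)
Benchmarks.print(score)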

:zap: Evaluation metrics

The following table contains the available evaluation metrics:

  • Sanity checks
Metric Description Values
data_mismatch Average number of columns with datatype (object, real, int) mismatch between the real and synthetic data. 0: no datatype mismatch.
1: complete data type mismatch between the datasets.
common_rows_proportion The proportion of rows in the real dataset leaked in the synthetic dataset. 0: there are no common rows between the real and synthetic datasets.
1: all the rows in the real dataset are leaked in the synthetic dataset.
nearest_syn_neighbor_distance Average distance from the real data to the closest neighbor in the synthetic data 0: all the real rows are leaked in the synthetic dataset.
1: all the synthetic rows are far away from the real dataset.
close_values_probability The probability of close values between the real and synthetic data. 0: there is no chance to have synthetic rows similar to the real.
1: all the synthetic rows are similar to some real rows.
distant_values_probability The probability of distant values between the real and synthetic data. 0: no chance to have rows in the synthetic far away from the real data.
1: all the synthetic datapoints are far away from the real data.
  • Statistical tests
Metric Description Values
inverse_kl_divergence The average inverse of the Kullback–Leibler Divergence 0: the datasets are from different distributions.
1: the datasets are from the same distribution.
ks_test The Kolmogorov-Smirnov test 0: the distributions are totally different.
1: the distributions are identical.
chi_squared_test The p-value. A small value indicates that we can reject the null hypothesis and that the distributions are different. 0: the distributions are different
1: the distributions are identical.
max_mean_discrepancy Empirical maximum mean discrepancy. 0: The distributions are the same.
1: The distributions are totally different.
inv_cdf_distance The total distance between continuous features. 0: The distributions are the same.
1: The distributions are totally different.
jensenshannon_dist The Jensen-Shannon distance (metric) between two probability arrays. This is the square root of the Jensen-Shannon divergence. 0: The distributions are the same.
1: The distributions are totally different.
feature_corr The correlation/strength-of-association of features in a dataset with both categorical and continuous features, using Pearson's R for continuous-continuous cases and Cramer's V or Theil's U for categorical-categorical cases. 0: The distributions are the same.
1: The distributions are totally different.
wasserstein_dist Wasserstein Distance is a measure of the distance between two probability distributions. 0: The distributions are the same.
prdc Computes precision, recall, density, and coverage given two manifolds. ---
alpha_precision Evaluate the alpha-precision, beta-recall, and authenticity scores. ---
survival_km_distance The distance between two Kaplan-Meier plots (survival analysis). ---
  • Synthetic Data quality
Metric Description Values
performance.xgb Train an XGBoost classifier/regressor/survival model on the real data (gt) and the synthetic data (syn), and evaluate the performance on the test set. 1 for ideal performance, 0 for worst performance
performance.linear Train a linear classifier/regressor/survival model on the real data (gt) and the synthetic data, and evaluate the performance on test data. 1 for ideal performance, 0 for worst performance
performance.mlp Train a neural net classifier/regressor/survival model on the real data and the synthetic data, and evaluate the performance on test data. 1 for ideal performance, 0 for worst performance
performance.feat_rank_distance Train a model on the synthetic data and a model on the real data. Compute the feature importance of both models on the same test data, and compute the rank distance between the importances (Kendall's tau or Spearman). 1: similar ranks in the feature importance. 0: uncorrelated feature importance
detection_gmm Train a GaussianMixture model to differentiate the synthetic data from the real data. 0: The datasets are indistinguishable.
1: The datasets are totally distinguishable.
detection_xgb Train an XGBoost model to differentiate the synthetic data from the real data. 0: The datasets are indistinguishable.
1: The datasets are totally distinguishable.
detection_mlp Train a Neural net to differentiate the synthetic data from the real data. 0: The datasets are indistinguishable.
1: The datasets are totally distinguishable.
detection_linear Train a Linear model to differentiate the synthetic data from the real data. 0: The datasets are indistinguishable.
1: The datasets are totally distinguishable.
  • Privacy metrics

Quasi-identifiers: pieces of information that are not themselves unique identifiers, but are sufficiently well correlated with an entity that they can be combined with other quasi-identifiers to create a unique identifier.

Metric Description Values
k_anonymization The minimum value k which satisfies the k-anonymity rule: each record is similar to at least k-1 other records on the potentially identifying variables. Reported on both the real and synthetic data.
l_diversity The minimum value l which satisfies the l-diversity rule: every generalized block has to contain at least l different sensitive values. Reported on both the real and synthetic data.
kmap The minimum value k which satisfies the k-map rule: every combination of values for the quasi-identifiers appears at least k times in the re-identification (synthetic) dataset. Reported on both the real and synthetic data.
delta_presence The maximum re-identification risk for the real dataset from the synthetic dataset. 0 for no risk.
identifiability_score The re-identification score on the real dataset from the synthetic dataset. ---
sensitive_data_reidentification_xgb Sensitive data prediction from the quasi-identifiers using an XGBoost model. 0 for no risk.
sensitive_data_reidentification_mlp Sensitive data prediction from the quasi-identifiers using a Neural Net. 0 for no risk.
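
The metrics above can be requested through the Benchmarks.evaluate() call shown earlier by listing them in the metrics dictionary. In the sketch below, the category keys "sanity", "stats", "detection" and "privacy" are assumptions that mirror the groupings above; only "performance" appears verbatim in the earlier benchmark example.

from sklearn.datasets import load_diabetes

from synthcity.benchmark import Benchmarks
from synthcity.plugins.core.dataloader import GenericDataLoader

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y

loader = GenericDataLoader(X, target_column="target", sensitive_columns=["sex"])

# NOTE: the category keys below are assumed to mirror the metric groups listed above.
score = Benchmarks.evaluate(
    [("example_adsgan", "adsgan", {})],
    loader,
    synthetic_size=1000,
    metrics={
        "sanity": ["common_rows_proportion"],
        "stats": ["jensenshannon_dist"],
        "performance": ["linear_model"],
        "detection": ["detection_mlp"],
        "privacy": ["k_anonymization"],
    },
    repeats=1,
)
Benchmarks.print(score)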

:hammer: Tests

Install the testing dependencies using

pip install .[testing]

The tests can be executed using

pytest -vsx
