Synthetic data generation and evaluation library

Welcome to the synthyverse!

The most extensive ecosystem for synthetic data generation and evaluation in Python.

The synthyverse is a work in progress. Please provide any suggestions through a GitHub Issue.

Features

  • 🔧 Highly modular installation. Install only the modules you need to keep your installation lightweight.
  • 📚 Most extensive library for synthetic data. Because of synthyverse's modular installation, any generator or metric can be added quickly without dependency conflicts. This allows the synthyverse to host more generators and evaluation metrics than any other synthetic data library.
  • ⚙️ Benchmarking module for simplified synthetic data pipelines. The benchmarking module executes a modular pipeline of synthetic data generation and evaluation. Choose a generator, a set of evaluation metrics, and pipeline parameters, and obtain results on synthetic data quality.
  • 👷 Minimal preprocessing required. All preprocessing is handled under the hood in the synthyverse, so there is no need for scaling, one-hot encoding, or handling missing values yourself.

Installation

The synthyverse is unique in its modular installation setup. To avoid conflicting dependencies, we provide various installation templates. Each template installs only the dependencies required to access certain modules.

Templates provide installation for specific generators, the evaluation module, and more. Install multiple templates to get access to multiple modules of the synthyverse, e.g., multiple generators and evaluation.

We strongly advise installing only the templates you need for a specific run. Installing multiple templates can cause dependency conflicts, so use separate virtual environments for separate installations. Note that the core installation without any template does not install any modules.

See the overview of templates.

General Installation Template

pip install synthyverse[template]

Installation Examples

pip install synthyverse[ctgan]
pip install synthyverse[arf,bn,ctgan,tvae]
pip install synthyverse[ctgan,eval]
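
Some shells (zsh, for example) treat square brackets as glob patterns, so the requirement may need to be quoted on the command line. As a minimal sketch, the same install can be driven from Python by passing the requirement as a single argument to the interpreter's own pip, which also ensures that the package lands in the currently active virtual environment. The helper name `pip_install_command` is hypothetical and not part of the synthyverse.

```python
import sys


def pip_install_command(templates):
    """Build a pip command that installs synthyverse with the given templates.

    `pip_install_command` is an illustrative helper, not a synthyverse function.
    """
    extras = ",".join(templates)
    # sys.executable is the interpreter of the active environment,
    # so the install targets that environment.
    return [sys.executable, "-m", "pip", "install", f"synthyverse[{extras}]"]


cmd = pip_install_command(["ctgan", "eval"])
print(" ".join(cmd[1:]))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Because the command is a list of arguments rather than a shell string, no quoting of the brackets is needed.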

Usage

Synthetic Data Generation

Import the desired generator. Note that you can only import generators that are covered by your installed synthyverse templates.

See all available generators.

from synthyverse.generators import ARFGenerator
generator = ARFGenerator(num_trees=20, random_state=0)
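
Because only generators from installed templates can be imported, it can help to check availability before importing. The following is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper, and the module path is taken from the import above. It only checks that the module can be located, so importing a specific generator may still fail if its template is missing.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the module has no spec.
        return False


if is_importable("synthyverse.generators"):
    print("synthyverse generators module found")
else:
    print("synthyverse is not installed in this environment")
```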

Fit the generator.

from sklearn.datasets import load_breast_cancer
X = load_breast_cancer(as_frame=True).frame
generator.fit(X, discrete_features=["target"])

Sample a synthetic dataset.

syn = generator.generate(len(X))

Synthetic Data Evaluation

Choose a set of metrics. Either pass a list of metric names to use their default settings, or pass a dictionary that maps each metric name to its hyperparameters. Append a dash and a suffix to a metric name to compute several configurations of the same metric.

See all available metrics.

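# Option 1: default settings, given as a list of metric names
metrics = ["mle", "dcr", "similarity"]

# Option 2: custom hyperparameters, given as a dictionary
metrics = {
    "mle-trts": {"train_set": "real"},
    "mle-tstr": {"train_set": "synthetic"},
    "dcr": {"estimates": ["mean", 0.01, 0.05]},
    "similarity": {},
}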
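
The two forms above can be read as the same structure: a list behaves like a dictionary whose entries carry no custom hyperparameters, and a dash-suffixed key names one configuration of a base metric. Below is a minimal sketch of that reading, assuming the text before the first dash is the base metric name. The helpers `normalize_metrics` and `base_metric` are hypothetical and not synthyverse functions.

```python
def normalize_metrics(metrics):
    """Turn a list of metric names into a dict with empty hyperparameters."""
    if isinstance(metrics, dict):
        return dict(metrics)
    return {name: {} for name in metrics}


def base_metric(name):
    """Return the metric name without its configuration suffix."""
    return name.split("-", 1)[0]


configured = normalize_metrics({
    "mle-trts": {"train_set": "real"},
    "mle-tstr": {"train_set": "synthetic"},
    "dcr": {"estimates": ["mean", 0.01, 0.05]},
    "similarity": {},
})

# Group configuration names by the metric they belong to.
by_metric = {}
for name in configured:
    by_metric.setdefault(base_metric(name), []).append(name)

print(by_metric)
```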

Set up a MetricEvaluator object.

from synthyverse.evaluation import MetricEvaluator

evaluator = MetricEvaluator(
    metrics=metrics,
    discrete_features=["target"],
    target_column="target",
    random_state=0,  # fixed seed for reproducibility
)

Evaluate the metrics with respect to the synthetic data, the training data used to fit the generator, and an independent holdout/test set of real data.

results = evaluator.evaluate(X_train, X_test, syn)
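
The call above assumes that X_train and X_test already exist. The following is a minimal sketch of a reproducible random split over row positions, using only the standard library. `split_indices` is a hypothetical helper; with a pandas DataFrame, the resulting positions could be passed to `X.iloc[...]`, and the generator would then be fitted on the training rows only.

```python
import random


def split_indices(n_rows, test_size, seed=0):
    """Shuffle row positions and split them into train and test lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = round(n_rows * test_size)
    # The first n_test shuffled positions form the test set, the rest the train set.
    return positions[n_test:], positions[:n_test]


train_idx, test_idx = split_indices(n_rows=10, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))  # 7 3
```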

Benchmarking

Set up a benchmarking object. Supply the generator name and its parameters, the evaluation metrics, the number of random train-test splits to fit the generator on, the number of random initializations of the generator, the number of synthetic datasets to sample from each fitted generator, and the size of the test set.

from synthyverse.benchmark import TabularBenchmark

benchmark = TabularBenchmark(
    generator_name="arf",
    generator_params={"num_trees": 20},
    n_random_splits=3,
    n_inits=3,
    n_generated_datasets=20,
    metrics=["classifier_test", "mle", "dcr"],
    test_size=0.3,
)

Run the benchmarking pipeline on a dataset.

results = benchmark.run(X, target_column="target", discrete_columns=["target"])
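
To get a feel for the cost of a benchmark run, it helps to count the work implied by these settings. The sketch below assumes that every train-test split is combined with every random initialization, and that each fitted generator is then sampled `n_generated_datasets` times. That multiplicative reading is an assumption about the pipeline, not something confirmed here.

```python
from itertools import product

n_random_splits = 3
n_inits = 3
n_generated_datasets = 20

# One generator fit per (split, initialization) pair, under the assumption above.
fits = list(product(range(n_random_splits), range(n_inits)))

# Each fitted generator is sampled n_generated_datasets times.
n_synthetic_sets = len(fits) * n_generated_datasets

print(len(fits), n_synthetic_sets)  # 9 180
```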

Tutorials
