Synthetic data generation and evaluation library

Project description

Welcome to the synthyverse!

An extensive ecosystem for synthetic data generation and evaluation in Python.

The synthyverse is a work in progress. Please provide any suggestions through a GitHub Issue.

Features

🔧 Highly modular installation. Install only the modules you need to keep your installation lightweight.
📚 Extensive library for synthetic data. Any generator or metric can be quickly added without dependency conflicts due to synthyverse's modular installation. This allows the synthyverse to host a large number of generators and evaluation metrics. It also allows the synthyverse to wrap around any existing synthetic data library.
⚙️ Benchmarking module for simplified synthetic data pipelines. The benchmarking module executes a modular pipeline of synthetic data generation and evaluation. Choose a generator, set of evaluation metrics, and pipeline parameters, and obtain results on synthetic data quality.
👷 Minimal preprocessing required. All preprocessing is handled by the synthyverse, so there is no need for scaling, one-hot encoding, or handling missing values yourself. Different preprocessing schemes can be used by setting simple parameters.
👍 Set constraints for your synthetic data. You can specify inter-column constraints which you want your synthetic data to follow. Constraints are modelled explicitly by the synthyverse, not through oversampling. This ensures efficient and reliable constraint setting.

Installation

The synthyverse is unique in its modular installation set-up. To avoid conflicting dependencies, we provide various installation templates. Each template installs only those dependencies which are required to access certain modules.

Templates provide installation for specific generators, the evaluation module, and more. Install multiple templates to get access to multiple modules of the synthyverse, e.g., multiple generators and evaluation.

We strongly advise installing only the templates you need for a specific run. Installing multiple templates together can cause dependency conflicts, so use a separate virtual environment for each installation.

Note that the core installation without any template doesn't install any modules.

See the overview of templates.

General Installation Template

pip install "synthyverse[template]"

Installation Examples

pip install "synthyverse[ctgan]"

pip install "synthyverse[arf,bn,ctgan,tvae]"

pip install "synthyverse[ctgan,eval]"
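When installations are scripted, the requirement string for a set of templates can be assembled programmatically. The sketch below is only an illustration: the helper name requirement_for is hypothetical and not part of the synthyverse, and the template names are the ones used in the examples above.

```python
def requirement_for(templates):
    """Build a pip requirement string such as 'synthyverse[ctgan,eval]'."""
    names = [t.strip() for t in templates if t.strip()]
    if not names:
        # The core installation ships no modules, so at least one template is needed.
        raise ValueError("select at least one template")
    # Drop duplicates while keeping the order given by the caller.
    unique = list(dict.fromkeys(names))
    return f"synthyverse[{','.join(unique)}]"


requirement = requirement_for(["ctgan", "eval"])
print(requirement)
```

The resulting string can be passed to pip inside a fresh virtual environment, for example with subprocess.run([sys.executable, "-m", "pip", "install", requirement]).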

Usage

Synthetic Data Generation

Import the desired generator. Note that you can only import generators that are covered by your installed synthyverse template(s).

See all available generators.

from synthyverse.generators import ARFGenerator
generator = ARFGenerator(num_trees=20, random_state=0)
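Because a generator is only importable when its template is installed, it can help to fail with a clear message. The following is a minimal sketch, assuming that a missing template surfaces as an ImportError or AttributeError at import time (the source does not state which). The helper load_generator is hypothetical and not part of the synthyverse.

```python
import importlib


def load_generator(class_name, module_name="synthyverse.generators"):
    """Return a generator class, or raise ImportError with an installation hint."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as exc:
        # Assumption: either exception means the required template is not installed.
        raise ImportError(
            f"Could not import {class_name} from {module_name}. "
            "Check that the matching synthyverse template is installed."
        ) from exc
```

With the arf template installed, load_generator("ARFGenerator") would return the same class as the import statement above.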

Fit the generator. For tabular data, also pass which columns are discrete, as these often need to be handled differently than numerical features. If the target column is discrete, it should also be included in the discrete features list.

from sklearn.datasets import load_breast_cancer
X = load_breast_cancer(as_frame=True).frame
generator.fit(X, discrete_features=["target"])
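For wider tables, listing the discrete columns by hand gets tedious. The sketch below shows one possible heuristic for proposing candidates with pandas. It is not part of the synthyverse: the function name guess_discrete_features and the max_unique threshold are assumptions made for illustration, and the result should always be reviewed manually.

```python
import pandas as pd


def guess_discrete_features(df, max_unique=10):
    """Propose discrete columns: non-numeric, boolean, or low-cardinality ones."""
    discrete = []
    for column in df.columns:
        series = df[column]
        is_numeric = pd.api.types.is_numeric_dtype(series)
        is_bool = pd.api.types.is_bool_dtype(series)
        few_values = series.nunique(dropna=True) <= max_unique
        if not is_numeric or is_bool or few_values:
            discrete.append(column)
    return discrete


# Toy data: one continuous column, one categorical column, one binary target.
toy = pd.DataFrame(
    {
        "age": [31.5, 42.0, 27.3, 55.1, 38.8, 61.2, 45.9, 29.4, 50.0, 33.3, 47.7, 36.6],
        "city": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C", "B", "A"],
        "target": [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    }
)
print(guess_discrete_features(toy))
```

The returned list can then be passed as discrete_features when fitting a generator.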

Sample a synthetic dataset.

syn = generator.generate(len(X))

Synthetic Data Evaluation

Choose a set of metrics. Either pass metric names as a list to use their default settings, or pass a dictionary that maps each metric name to its hyperparameters. To compute several configurations of the same evaluation metric, append a dash and a label to the metric name (e.g., "mle-trts" and "mle-tstr").

See all available metrics.

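# Option 1: default metrics, passed as a list
metrics = ["mle", "dcr", "similarity"]

# Option 2: metrics with explicit hyperparameters, passed as a dictionary
metrics = {
    "mle-trts": {"train_set": "real"},
    "mle-tstr": {"train_set": "synthetic"},
    "dcr": {"estimates": ["mean", 0.01, 0.05]},
    "similarity": {},
}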
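The dash convention can be pictured as a base metric name followed by a free-form label. The sketch below only illustrates that naming scheme on the dictionary above. How the synthyverse parses these keys internally is not shown in this description, so splitting at the first dash is an assumption, and split_metric_key is a hypothetical helper.

```python
metrics = {
    "mle-trts": {"train_set": "real"},
    "mle-tstr": {"train_set": "synthetic"},
    "dcr": {"estimates": ["mean", 0.01, 0.05]},
    "similarity": {},
}


def split_metric_key(key):
    """Split 'mle-trts' into ('mle', 'trts'); keys without a dash get an empty label."""
    base, _, label = key.partition("-")
    return base, label


# Group the configurations by their base metric name.
grouped = {}
for key, params in metrics.items():
    base, label = split_metric_key(key)
    grouped.setdefault(base, []).append((label, params))

for base, configs in grouped.items():
    print(base, len(configs))
```

Here the two "mle" entries end up in the same group, each with its own label and hyperparameters.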

Set up a metric evaluator object. See the API reference for in-depth usage.

from synthyverse.evaluation import TabularMetricEvaluator

evaluator = TabularMetricEvaluator(
    metrics=metrics,
    discrete_features=["target"],
    target_column="target",
    random_state=0,
)

Evaluate the metrics with respect to the synthetic data, the training data used to fit the generator (X_train), and an independent holdout/test set of real data (X_test).

results = evaluator.evaluate(X_train, X_test, syn)
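The call above expects X_train and X_test to exist already. One way to obtain them, sketched here with scikit-learn's train_test_split, is to split the real data before fitting the generator. The 70/30 split, the stratification on the target column, and the seed are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X = load_breast_cancer(as_frame=True).frame

# Hold out 30% of the real data; stratify to keep the class balance in both parts.
X_train, X_test = train_test_split(
    X, test_size=0.3, stratify=X["target"], random_state=0
)

# The generator should then be fitted on X_train only, e.g.
# generator.fit(X_train, discrete_features=["target"]),
# so that X_test stays independent of the synthetic data.
print(len(X_train), len(X_test))
```

Fitting on the full dataset and then evaluating on a subset of it would leak test records into the generator, which can make metrics look better than they are.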

Benchmarking

The benchmarking module performs synthetic data generation and evaluation in a single pipeline. See the API reference for in-depth usage.

Set up a benchmarking object. Supply the generator name and its parameters, the evaluation metrics, the number of random train-test splits and the number of random initializations to fit the generator on, the number of synthetic sets to sample from each fitted generator, and the size of the test set.

from synthyverse.benchmark import TabularSynthesisBenchmark

benchmark = TabularSynthesisBenchmark(
    generator_name="arf",
    generator_params={"num_trees": 20},
    n_random_splits=3,
    n_inits=3,
    n_generated_datasets=20,
    metrics=["classifier_test", "mle", "dcr"],
    test_size=0.3,
)

Run the benchmarking pipeline on a dataset.

results = benchmark.run(X, target_column="target", discrete_columns=["target"])
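To get a feel for the cost of a benchmark configuration, it helps to count the work it implies. The sketch below assumes that every random split is combined with every random initialization, which this description suggests but does not state explicitly. The numbers are the ones from the configuration above.

```python
from itertools import product

n_random_splits = 3
n_inits = 3
n_generated_datasets = 20

# Assumption: one generator is fitted per (split, initialization) pair.
fitted_runs = list(product(range(n_random_splits), range(n_inits)))
n_fits = len(fitted_runs)

# Each fitted generator samples n_generated_datasets synthetic sets.
n_synthetic_sets = n_fits * n_generated_datasets

print(n_fits, n_synthetic_sets)  # 9 180 under the assumption above
```

Under that assumption, every metric is computed on 180 synthetic datasets, so lowering n_generated_datasets is the cheapest way to shorten a run.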

Preprocessing and Constraints

The synthyverse allows for various preprocessing schemes, which can be easily adapted through parameters passed to the generator and/or benchmarking module.

Some of the options include:

enforcing constraints
imputing missing values
retaining missingness in the output synthetic dataset
encoding features which are a mix of discrete spikes and continuous numerical values (e.g., zero-inflated features)
normalizing numerical features through quantile transformation

The example below shows how to pass preprocessing parameters to the generator and/or benchmarking module. See the API reference for in-depth usage.

generator = ARFGenerator(
    constraints=["s1>=s2+s3"],  # enforce a constraint on the synthetic data
    missing_imputation_method="random",  # random imputation of missing values
    retain_missingness=True,  # retain missing values in the synthetic data
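    encode_mixed_numerical_features=True,  # encode features mixing discrete spikes and continuous values
    quantile_transform_numericals=True,  # normalize numerical features through quantile transformation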
)

generator.fit(X_train, discrete_features=["target"])

syn = generator.generate(len(X))


benchmark = TabularSynthesisBenchmark(
    generator_name="arf",
    generator_params={},
    n_random_splits=1,
    n_inits=1,
    n_generated_datasets=1,
    metrics=["mle", "similarity", "classifier_test"],
    test_size=0.2,
    val_size=0.1,
    missing_imputation_method="drop",
    retain_missingness=False,
    encode_mixed_numerical_features=False,
    quantile_transform_numericals=False,
    constraints=[],
)
results = benchmark.run(
    X, target_column="target", discrete_columns=["target"]
)
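After sampling, a constraint can be verified independently of the library. The sketch below uses pandas' DataFrame.eval on a small hand-made table. It assumes the constraint string happens to be a valid pandas expression, which holds for "s1>=s2+s3", but the synthyverse's own constraint syntax may be broader or different. The helper constraint_violations is hypothetical.

```python
import pandas as pd


def constraint_violations(df, constraint):
    """Return the rows of df for which the boolean expression does not hold."""
    satisfied = df.eval(constraint)
    return df[~satisfied]


# Toy stand-in for a synthetic dataset; the last row breaks the constraint.
toy_syn = pd.DataFrame(
    {
        "s1": [5.0, 3.0, 2.0],
        "s2": [1.0, 1.0, 1.5],
        "s3": [2.0, 2.0, 1.0],
    }
)

violations = constraint_violations(toy_syn, "s1>=s2+s3")
print(len(violations))
```

Note that comparisons involving missing values evaluate to False in pandas, so rows with retained missingness in the constrained columns would be reported as violations by this check.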

Standalone preprocessing module

You can also use the synthyverse's preprocessing module for other data science tasks. Simply install the base template of the synthyverse:

pip install "synthyverse[base]"

Now you can use the preprocessing class of the synthyverse:

from synthyverse.preprocessing import TabularPreprocessor

preprocessor = TabularPreprocessor(discrete_features=["target"], random_state=0)

X_preprocessed = preprocessor.scale(
    X,
    numerical_transformer="standard",
    categorical_transformer="one-hot",
    numerical_transformer_hparams={},
    categorical_transformer_hparams={},
)

X = preprocessor.inverse_scale(X_preprocessed)

Again, see the API reference for in-depth usage.

Tutorials

Tabular Synthetic Data with the synthyverse: Introduction

Release history

0.1.6 (Mar 23, 2026)
0.1.5 (Mar 6, 2026)
0.1.4 (Feb 25, 2026)
0.1.3 (Nov 12, 2025), this version
0.1.2 (Sep 16, 2025)
0.1.1 (Aug 28, 2025)
0.1.0 (Aug 19, 2025)

Download files

Source Distribution

synthyverse-0.1.3.tar.gz (2.5 MB)

Uploaded Nov 12, 2025 Source

Built Distribution

synthyverse-0.1.3-py3-none-any.whl (103.7 kB)

Uploaded Nov 12, 2025 Python 3

File details

Details for the file synthyverse-0.1.3.tar.gz.

File metadata

Download URL: synthyverse-0.1.3.tar.gz
Upload date: Nov 12, 2025
Size: 2.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.18

File hashes

Hashes for synthyverse-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`d4ed6f414aa79c24d75592e0560136e38d0c31303836485a88af6d76628e182d`
MD5	`ac6983fb304e549af0ef264db7e0a379`
BLAKE2b-256	`f62ffc297a252f8f963f4d0e4e7e4d8ee851ffe013c377bc36db613e9963081f`

File details

Details for the file synthyverse-0.1.3-py3-none-any.whl.

File metadata

Download URL: synthyverse-0.1.3-py3-none-any.whl
Upload date: Nov 12, 2025
Size: 103.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.18

File hashes

Hashes for synthyverse-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`93892282d3a570afc62525afb3ac2d059225887fcd2173b577a93165cf35e866`
MD5	`df0632e485ca154c5e0e242942c343c2`
BLAKE2b-256	`0bb6b86553aae3f3d06603c996176ad841efbfff44f2a1ecf87a22323b26537b`
