Skip to main content

Generate synthetic data for single table, multi table and sequential data

Project description


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Dev Status PyPi Shield Unit Tests Integration Tests Coverage Status Downloads Colab Forum

Overview

The Synthetic Data Vault (SDV) is a Python library designed to be your one-stop shop for creating tabular synthetic data. The SDV uses a variety of machine learning algorithms to learn patterns from your real data and emulate them in synthetic data.

Features

:brain: Create synthetic data using machine learning. The SDV offers multiple models, ranging from classical statistical methods (GaussianCopula) to deep learning methods (CTGAN). Generate data for single tables, multiple connected tables or sequential tables.

:bar_chart: Evaluate and visualize data. Compare the synthetic data to the real data against a variety of measures. Diagnose problems and generate a quality report to get more insights.

:arrows_counterclockwise: Preprocess, anonymize and define constraints. Control data processing to improve the quality of synthetic data, choose from different types of anonymization and define business rules in the form of logical constraints.

Important Links
Tutorials Get some hands-on experience with the SDV. Launch the tutorial notebooks and run the code yourself.
:book: Docs Learn how to use the SDV library with user guides and API references.
:orange_book: Blog Get more insights about using the SDV, deploying models and our synthetic data community.
:busts_in_silhouette: DataCebo Forum Discuss SDV features, ask questions, and receive help .
:computer: Website Check out the SDV website for more information about the project.

Install

The SDV is publicly available under the Business Source License. Install SDV using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install sdv
conda install -c pytorch -c conda-forge sdv

Getting Started

Load a demo dataset to get started. This dataset is a single table describing guests staying at a fictional hotel.

from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(modality='single_table', dataset_name='fake_hotel_guests')

Single Table Metadata Example

The demo also includes metadata, a description of the dataset, including the data types in each column and the primary key (guest_email).

Synthesizing Data

Next, we can create an SDV synthesizer, an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data. Let's use the GaussianCopulaSynthesizer.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)

And now the synthesizer is ready to create synthetic data!

synthetic_data = synthesizer.sample(num_rows=500)

The synthetic data will have the following properties:

  • Sensitive columns are fully anonymized. The email, billing address and credit card number columns contain new data so you don't expose the real values.
  • Other columns follow statistical patterns. For example, the proportion of room types, the distribution of check in dates and the correlations between room rate and room type are preserved.
  • Keys and other relationships are intact. The primary key (guest email) is unique for each row. If you have multiple tables, the connection between a primary and foreign keys makes sense.

Evaluating Synthetic Data

The SDV library allows you to evaluate the synthetic data by comparing it to the real data. Get started by generating a quality report.

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(real_data, synthetic_data, metadata)
Generating report ...

(1/2) Evaluating Column Shapes: |████████████████| 9/9 [00:00<00:00, 1133.09it/s]|
Column Shapes Score: 89.11%

(2/2) Evaluating Column Pair Trends: |██████████████████████████████████████████| 36/36 [00:00<00:00, 502.88it/s]|
Column Pair Trends Score: 88.3%

Overall Score (Average): 88.7%

This object computes an overall quality score on a scale of 0 to 100% (100 being the best) as well as detailed breakdowns. For more insights, you can also visualize the synthetic vs. real data.

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_name='amenities_fee',
    metadata=metadata,
)

fig.show()

Real vs. Synthetic Data

What's Next?

Using the SDV library, you can synthesize single table, multi table and sequential data. You can also customize the full synthetic data workflow, including preprocessing, anonymization and adding constraints.

To learn more, visit the SDV Demo page.

Credits

Thank you to our team of contributors who have built and maintained the SDV ecosystem over the years!

View Contributors

Citation

If you use SDV for your research, please cite the following paper:

Neha Patki, Roy Wedge, Kalyan Veeramachaneni. The Synthetic Data Vault. IEEE DSAA 2016.

@inproceedings{
    SDV,
    title={The Synthetic data vault},
    author={Patki, Neha and Wedge, Roy and Veeramachaneni, Kalyan},
    booktitle={IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
    year={2016},
    pages={399-410},
    doi={10.1109/DSAA.2016.49},
    month={Oct}
}



The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sdv-1.36.2.dev0.tar.gz (177.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sdv-1.36.2.dev0-py3-none-any.whl (204.8 kB view details)

Uploaded Python 3

File details

Details for the file sdv-1.36.2.dev0.tar.gz.

File metadata

  • Download URL: sdv-1.36.2.dev0.tar.gz
  • Upload date:
  • Size: 177.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sdv-1.36.2.dev0.tar.gz
Algorithm Hash digest
SHA256 36c8759220a480faa4a6def1542589ce71c89ce455d438cd7b659bfb8f717d2a
MD5 307e7ad54e27a423c1d67d506be2b868
BLAKE2b-256 f1d40f4c0127f73e6ad3b3e6f2a8874847551e9cb297467d51a61ba66827b784

See more details on using hashes here.

Provenance

The following attestation bundles were made for sdv-1.36.2.dev0.tar.gz:

Publisher: release.yml on sdv-dev/SDV

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sdv-1.36.2.dev0-py3-none-any.whl.

File metadata

  • Download URL: sdv-1.36.2.dev0-py3-none-any.whl
  • Upload date:
  • Size: 204.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sdv-1.36.2.dev0-py3-none-any.whl
Algorithm Hash digest
SHA256 a156b4f36d1e3f04cc7a94d081871c65ba27ce88ad4cf539559910e18baca226
MD5 b01ef06004018d6c5dc47ca41d484d12
BLAKE2b-256 75e65b593925c6f6094b117c1e8cec3910932d467d1098bcddeba8b7e7f466f0

See more details on using hashes here.

Provenance

The following attestation bundles were made for sdv-1.36.2.dev0-py3-none-any.whl:

Publisher: release.yml on sdv-dev/SDV

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page