A Python package for tabular synthetic data

Project description

copula-tabular

Generate tabular synthetic data using Gaussian copulas

Overview

Advances in synthetic data generation have made it a viable option in fields such as finance, biomedical research, and data science. Synthetic data is generated artificially, yet is designed to replicate the joint probability distribution of its real-world counterpart. Because it mimics the statistical behaviour of real data, it is useful for testing algorithms and systems and for training machine learning models. It can also serve as an economical substitute for real data that is unavailable, too sensitive to release, or too costly to acquire. Copula-based methods have been shown to produce reliable and accurate synthetic tabular data.

In this package, we present a tool for generating multivariate synthetic data using a Gaussian copula. The model incorporates conditional joint distributions, allowing a single variable to be split into multiple component marginal distributions. These conditional enhancements make it easier to synthesise complex, non-linear sample distributions, so a wider range of datasets can be replicated.

The tool is designed to work with a data dictionary, that is, a file describing the metadata of the input dataset. The package also includes class-based implementations of data cleaning, visualisation, transformation, and privacy leakage evaluation, along with sample wrapper scripts for generating synthetic data from start to finish.

Example Result:

Figure showing correlation plots of a simulated multivariate dataset, containing non-trivial, non-linear and non-monotonic relationships. The left plot shows the original Pearson correlation between variables, while the middle and right plots show the correlation for synthetic data generated using standard copula and conditional copula respectively.

Figure showing superimposed scatterplots of the same simulated multivariate dataset, containing non-trivial, non-linear and non-monotonic relationships. The training, synthetic (standard copula), and synthetic (conditional copula) data points are in blue, grey, and red respectively.

Documentation

For installation instructions, getting started guides and tutorials, background information, and API reference summaries, please see the website.

Contributing

Thank you for considering contributing to copula-tabular. Please follow this link for more details.

Development notes

Please visit the website for more details.

Copyright and license

Copyright 2023 BiomedDAR, BII, A*STAR. Licensed under MIT.

Download files

Source Distribution

bdarpack-0.1.6.tar.gz (68.4 kB view details)

Uploaded Source

Built Distribution

bdarpack-0.1.6-py3-none-any.whl (72.4 kB view details)

Uploaded Python 3

File details

Details for the file bdarpack-0.1.6.tar.gz.

File metadata

  • Download URL: bdarpack-0.1.6.tar.gz
  • Upload date:
  • Size: 68.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for bdarpack-0.1.6.tar.gz
Algorithm Hash digest
SHA256 027f888baabf043b2f3f8232b92eda9c8a70a40e5e6a76f4af61ce2b3e6e8326
MD5 d0a68eb646cef5fa1c79e570a2244cbd
BLAKE2b-256 d2589d3cafc72c0f100aaa3268fa36dd0854514ac6df35cc04c8c990d4a0dc78

File details

Details for the file bdarpack-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: bdarpack-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 72.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for bdarpack-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 3bc57f362d4970287dba98c9d4c0fcf04c4ce70e4416bf2cbc6d695a5e4b0a76
MD5 250104b97dfd916e5856bcf4391fd025
BLAKE2b-256 066743e4b7f2596ca61b4a88720e1955e6fb8aeeb191371b2bfc70fe636daf0d
