
Create tabular synthetic data using copula-based modeling.


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.



Copulas is a Python library for modeling multivariate distributions and sampling from them using copula functions. Given a table of numerical data, use Copulas to learn the distribution and generate new synthetic data following the same statistical properties.

Key Features:

  • Model multivariate data. Choose from a variety of univariate distributions and copulas – including Archimedean Copulas, Gaussian Copulas and Vine Copulas.

  • Compare real and synthetic data visually after building your model. Visualizations are available as 1D histograms, 2D scatterplots and 3D scatterplots.

  • Access & manipulate learned parameters. With complete access to the internals of the model, set or tune parameters to your choosing.


Install the Copulas library using pip or conda.

pip install copulas
conda install -c conda-forge copulas


Get started using a demo dataset. This dataset contains 3 numerical columns.

from copulas.datasets import sample_trivariate_xyz

real_data = sample_trivariate_xyz()

Model the data using a copula and use it to create synthetic data. The Copulas library offers many options including Gaussian Copula, Vine Copulas and Archimedean Copulas.

from copulas.multivariate import GaussianMultivariate

copula = GaussianMultivariate()
copula.fit(real_data)

synthetic_data = copula.sample(len(real_data))

Visualize the real and synthetic data side-by-side. Let's do this in 3D to see our full dataset.

from copulas.visualization import compare_3d

compare_3d(real_data, synthetic_data)



Click below to run the code yourself on a Colab Notebook and discover new features.

Tutorial Notebook

Community & Support

Learn more about the Copulas library from our documentation site.

Questions or issues? Join our Slack channel to discuss more about Copulas and synthetic data. If you find a bug or have a feature request, you can also open an issue on our GitHub.

Interested in contributing to Copulas? Read our Contribution Guide to get started.


The Copulas open source project first started at the Data to AI Lab at MIT in 2018. Thank you to our team of contributors who have built and maintained the library over the years!

View Contributors

The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.


v0.9.1 - 2023-08-10

This release fixes problems with the documentation site and drops support for Python 3.7.


  • Drop support for Python 3.7 - Issue #355 by @amontanez24


  • Formatting is broken on the main docs page - Issue #341 by @amontanez24

v0.9.0 - 2023-04-26

This release adds support for pandas 2.0 and above. It also adds functionality to find version add-ons and renames covariance to correlation.


  • Remove upper bound for pandas - Issue#349 by @pvk-developer
  • Rename covariance to correlation - PR#346 by @frances-h
  • Add functionality to find version add-on - Issue#349 by @frances-h

v0.8.0 - 2023-01-06

This release adds support for Python 3.10 and 3.11. Additionally, it drops support for Python 3.6.


  • Support python 3.10 and above - PR#338 by @pvk-developer
  • Copulas Package Maintenance Updates - Issue#336 by @pvk-developer
  • Add support for python 3.10 - PR#329 by @katxiao

v0.7.0 - 2022-05-10

This release adds Gaussian as a fallback distribution in case the user-specified one fails. It also improves the fit of the Beta distribution by properly estimating the loc and scale parameters.

General Improvements

  • Add gaussian as fallback - Issue#320 by @fealho
  • Improve the fit of the Beta distribution: Use the new loc and scale - Issue#317 by @pvk-developer

v0.6.1 - 2022-02-25

This release improves the random_state functionality by taking in RandomState objects in addition to random seeds.

General Improvements

  • Use random_state instead of random_seed - Issue#113 by @katxiao

v0.6.0 - 2021-05-13

This release makes Copulas compatible with Python 3.9! It also improves library maintenance by updating dependencies, reorganizing the CI workflows, adding pip check to the workflows and removing unused files.

General Improvements

  • Add support for Python 3.9 - Issue#282 by @amontanez24
  • Remove entry point in - Issue#280 by @amontanez24
  • Update pandas dependency range - Issue#266 by @katxiao
  • Fix repository language - Issue#272 by @pvk-developer
  • Add pip check to CI workflows - Issue#274 by @pvk-developer
  • Reorganize workflows and add codecov - PR#267 by @csala
  • Constrain jinja2 versions - PR#269 by @fealho

v0.5.1 - 2021-08-13

This release improves performance by changing the way scipy stats is used, calling their methods directly without creating intermediate instances.

It also fixes a bug introduced by the scipy 1.7.0 release where some distributions fail to fit because scipy validates the learned parameters.

Issues Closed

  • Exception: Optimization converged to parameters that are outside the range allowed by the distribution. - Issue #264 by @csala
  • Use scipy stats models directly without creating instances - Issue #261 by @csala

v0.5.0 - 2021-01-24

This release introduces conditional sampling for the GaussianMultivariate modeling. The new conditioning feature allows passing a dictionary with the values to use to condition the rest of the columns.

It also fixes a bug that prevented constant distributions from being restored from a dictionary and updates some dependencies.

New Features

  • Conditional sampling from Gaussian copula - Issue #154 by @csala

Bug Fixes

  • ScipyModel subclasses fail to restore constant values when using from_dict - Issue #212 by @csala

v0.4.0 - 2021-01-27

This release introduces a few changes to optimize processing speed by re-implementing the Gaussian KDE pdf to use vectorized root finding methods and also adding the option to subsample the data during univariate selection.

General Improvements

  • Make gaussian_kde faster - Issue #200 by @k15z and @fealho
  • Use sub-sampling in select_univariate - Issue #183 by @csala

v0.3.3 - 2020-09-18

General Improvements

  • Use corr instead of cov in the GaussianMultivariate - Issue #195 by @rollervan
  • Add arguments to GaussianKDE - Issue #181 by @rollervan

New Features

  • Log Laplace Distribution - Issue #188 by @rollervan

v0.3.2 - 2020-08-08

General Improvements

  • Support Python 3.8 - Issue #185 by @csala
  • Support scipy >1.3 - Issue #180 by @csala

New Features

  • Add Uniform Univariate - Issue #179 by @rollervan

v0.3.1 - 2020-07-09

General Improvements

  • Raise numpy version upper bound to 2 - Issue #178 by @csala

New Features

  • Add Student T Univariate - Issue #172 by @gbonomib

Bug Fixes

  • Error in Quickstarts : Unknown projection '3d' - Issue #174 by @csala

v0.3.0 - 2020-03-27

Important revamp of the internal implementation of the project, the testing infrastructure and the documentation by Kevin Alex Zhang @k15z, Carles Sala @csala and Kalyan Veeramachaneni @kveerama.


  • Reimplementation of the existing Univariate distributions.
  • Addition of new Beta and Gamma Univariates.
  • New Univariate API with automatic selection of the optimal distribution.
  • Several improvements and fixes on the Bivariate and Multivariate Copulas implementation.
  • New visualization module with simple plotting patterns to visualize probability distributions.
  • New datasets module with toy datasets sampling functions.
  • New testing infrastructure with end-to-end, numerical and large scale testing.
  • Improved tutorials and documentation.

v0.2.5 - 2020-01-17

General Improvements

  • Convert import_object to get_instance - Issue #114 by @JDTheRipperPC

v0.2.4 - 2019-12-23

New Features

  • Allow creating copula classes directly - Issue #117 by @csala

General Improvements

  • Remove select_copula from Bivariate - Issue #118 by @csala
  • Rename TruncNorm to TruncGaussian and make it non standard - Issue #102 by @csala @JDTheRipperPC

Bugs fixed

  • Error on Frank and Gumbel sampling - Issue #112 by @csala

v0.2.3 - 2019-09-17

New Features

  • Add support to Python 3.7 - Issue #53 by @JDTheRipperPC

General Improvements

  • Document RELEASE workflow - Issue #105 by @JDTheRipperPC
  • Improve serialization of univariate distributions - Issue #99 by @ManuelAlvarezC and @JDTheRipperPC

Bugs fixed

  • The method 'select_copula' of Bivariate return wrong CopulaType - Issue #101 by @JDTheRipperPC

v0.2.2 - 2019-07-31

New Features

  • truncnorm distribution and a generic wrapper for scipy.rv_continuous distributions - Issue #27 by @amontanez, @csala and @ManuelAlvarezC
  • Independence bivariate copulas - Issue #46 by @aliciasun, @csala and @ManuelAlvarezC
  • Option to select seed on random number generator - Issue #63 by @echo66 and @ManuelAlvarezC
  • Option on Vine copulas to select number of rows to sample - Issue #77 by @ManuelAlvarezC
  • Make copulas accept both scalars and arrays as arguments - Issues #85 and #90 by @ManuelAlvarezC

General Improvements

  • Ability to properly handle constant data - Issues #57 and #82 by @csala and @ManuelAlvarezC
  • Tests for analytics properties of copulas - Issue #61 by @ManuelAlvarezC
  • Improved documentation - Issue #96 by @ManuelAlvarezC

Bugs fixed

  • Fix bug on Vine copulas, that made it crash during the bivariate copula selection - Issue #64 by @echo66 and @ManuelAlvarezC

v0.2.1 - Vine serialization

  • Add serialization to Vine copulas.
  • Add distribution as argument for the Gaussian Copula.
  • Improve Bivariate Copulas code structure to remove code duplication.
  • Fix bug in Vine Copulas sampling: 'Edge' object has no attribute 'index'
  • Improve code documentation.
  • Improve code style and linting tools configuration.

v0.2.0 - Unified API

  • New API for stats methods.
  • Standardize input and output to numpy.ndarray.
  • Increase unittest coverage to 90%.
  • Add methods to load/save copulas.
  • Improve Gaussian copula sampling accuracy.

v0.1.1 - Minor Improvements

  • Different Copula types separated in subclasses
  • Extensive Unit Testing
  • More pythonic names in the public API.
  • Stop using third party elements that will be deprecated soon.
  • Add methods to sample new data on bivariate copulas.
  • New KDE Univariate copula
  • Improved examples with additional demo data.

v0.1.0 - First Release

  • First release on PyPI.
