A Python library for building different types of copulas and using them for sampling.

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

An open source project from the Data to AI Lab at MIT.

Copulas

License: MIT
Documentation: https://sdv-dev.github.io/Copulas
Homepage: https://github.com/sdv-dev/Copulas

Overview

Copulas is a Python library for building multivariate distributions using copulas and using them for sampling. In short, you give a table of numerical data without missing values as a 2-dimensional numpy.ndarray, and Copulas models its distribution and uses it to generate new records or to analyze its statistical properties.

This repository contains multiple implementations of bivariate and multivariate copulas. Further functionality includes:

Most usual statistical functions from the underlying distribution.
Built-in inverse-transform sampling method.
Easy save and load of models.
Create copulas directly from their parameters.
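
To illustrate the inverse-transform sampling idea mentioned in the list above, here is a minimal standalone sketch. It uses only the Python standard library and is not the Copulas implementation; the function name and the choice of an exponential target distribution are assumptions made for this example. The idea is to draw a uniform value in [0, 1) and push it through the inverse cdf of the target distribution.

```python
import math
import random


def sample_exponential(rate, size, seed=None):
    """Draw `size` samples from Exp(rate) by inverse-transform sampling."""
    rng = random.Random(seed)
    samples = []
    for _ in range(size):
        u = rng.random()  # uniform in [0, 1)
        # Inverse cdf of Exp(rate): F^-1(u) = -ln(1 - u) / rate
        samples.append(-math.log(1.0 - u) / rate)
    return samples


samples = sample_exponential(rate=2.0, size=10000, seed=0)
sample_mean = sum(samples) / len(samples)  # should be close to 1 / rate
```

The same recipe works for any distribution whose inverse cdf can be evaluated.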

Supported Copulas

Bivariate copulas

Clayton
Frank
Gumbel
Independence

Multivariate

Gaussian
D-Vine
C-Vine
R-Vine

Install

Requirements

Copulas has been developed and tested on Python 3.5, 3.6 and 3.7.

Also, although it is not strictly required, the usage of a virtualenv is highly recommended in order to avoid interfering with other software installed in the system where Copulas is run.

These are the minimum commands needed to create a virtualenv using python3.6 for Copulas:

pip install virtualenv
virtualenv -p $(which python3.6) copulas-venv

Afterwards, you have to execute this command to have the virtualenv activated:

source copulas-venv/bin/activate

Remember to execute it every time you start a new console to work on Copulas!

Install with pip

After creating the virtualenv and activating it, we recommend using pip in order to install Copulas:

pip install copulas

This will pull and install the latest stable release from PyPI.

Install from source

Alternatively, with your virtualenv activated, you can clone the repository and install it from source by running make install on the stable branch:

git clone git@github.com:sdv-dev/Copulas.git
cd Copulas
git checkout stable
make install

Install for Development

If you want to contribute to the project, a few more steps are required to make the project ready for development.

First, please head to the GitHub page of the project and make a fork of the project under your own username by clicking on the fork button in the upper right corner of the page.

Afterwards, clone your fork and create a branch from master with a descriptive name that includes the number of the issue that you are going to work on:

git clone git@github.com:{your username}/Copulas.git
cd Copulas
git branch issue-xx-cool-new-feature master
git checkout issue-xx-cool-new-feature

Finally, install the project with the following command, which will install some additional dependencies for code linting and testing.

make install-develop

Make sure to use them regularly while developing by running the commands make lint and make test.

Concepts

Probability

We call probability P the measure assigned to the chance of an event happening. For example, a dice has 6 sides, each with the same chance of landing on top.

If we consider 0 to be impossible and 1 to be absolutely certain, we can describe the probability of each side like this:

Table of values for probability P

 ·    -> 1/6
 :    -> 1/6
 :·   -> 1/6
 ::   -> 1/6
 :·:  -> 1/6
 :::  -> 1/6

Random variable

A random variable X is a function mapping elements from the sample space (in our case, the dice sides) into ℝ.

In our case we have:

Table of values for random variable X and their probability P
      X       P
 ·    ->   1  ->  1/6
 :    ->   2  ->  1/6
 :·   ->   3  ->  1/6
 ::   ->   4  ->  1/6
 :·:  ->   5  ->  1/6
 :::  ->   6  ->  1/6
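
The two tables above can be written down directly in code. This is a small sketch using only the standard library; the face names and the variable names are assumptions made for illustration. It maps each face to its value, assigns each face the probability 1/6, and computes the probability of an event (the roll being even).

```python
from fractions import Fraction

# Random variable X: maps each face of a fair dice to the number it shows.
faces = ["one", "two", "three", "four", "five", "six"]
X = {face: value for value, face in enumerate(faces, start=1)}

# Probability P: every face is equally likely.
P = {face: Fraction(1, 6) for face in faces}

# The probabilities of all possible outcomes add up to 1.
total_probability = sum(P.values())

# Probability of the event "X is even": add up the faces where it holds.
probability_even = sum(P[face] for face in faces if X[face] % 2 == 0)
```

Here total_probability is 1 and probability_even is 1/2.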

Distribution

A distribution is a function that describes the behavior of a random variable, like rolling a dice, and the probability of events related to it.

Usually a distribution is presented as a function F: ℝ -> [0, 1], called the cumulative distribution function or cdf, that has the following properties:

It is non-decreasing.
It is right-continuous.
Its limit at negative infinity exists and is 0.
Its limit at positive infinity exists and is 1.

The cdf of the distribution of rolling a standard six-sided dice is a step function: the cumulative probability rises in steps of 1/6 at each integer between 1 and 6, as those are the only values that can appear.
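
The step behavior of this cdf can be checked with a short sketch. The function name die_cdf is an assumption made for this example, and exact fractions are used so the steps of 1/6 are visible without rounding.

```python
from fractions import Fraction
import math


def die_cdf(x):
    """Return F(x) = P(X <= x) for a fair six-sided dice."""
    if x < 1:
        return Fraction(0)
    # Count how many of the values 1..6 are less than or equal to x.
    return Fraction(min(math.floor(x), 6), 6)


points = (0.5, 1, 1.5, 3, 5.9, 6, 10)
values = [die_cdf(x) for x in points]
```

The function is flat between integers, jumps by 1/6 at each of 1 to 6, is 0 below 1 and stays at 1 from 6 onwards, which matches the four properties listed above.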

Types of distributions

There are as many different distributions as there are random phenomena, but we usually classify them using these three aspects:

Continuity: We call a random variable continuous if its cdf is continuous, that is, if it has no steps. Otherwise, we call it a discrete random variable. In the dice example, we have a discrete random variable.
Dimensionality: When a distribution describes the behavior of a single random phenomenon, we call it a univariate distribution. Analogously, we define bivariate and multivariate distributions.
Type: Most distributions have a type, defined by their behavior. Some of the most common types are uniform, Gaussian, exponential, ...

Copulas

Copulas are multivariate distributions whose marginals are uniform. Combined with distributions that model the marginals, they allow us to generate multivariate random variables for any kind of phenomenon.
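
To make this concrete, here is a minimal sketch of a bivariate Gaussian copula built with the standard library only (statistics.NormalDist requires Python 3.8 or later). It is not the code used by this library; the function names, the correlation value and the chosen marginals are assumptions made for illustration. Correlated normal values are turned into uniform marginals with the normal cdf, and those uniforms are then pushed through the inverse cdfs of the marginals we want.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def gaussian_copula_pairs(rho, size, seed=None):
    """Draw (u1, u2) pairs with uniform marginals and Gaussian dependence."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(size):
        z1 = rng.gauss(0.0, 1.0)
        # z2 is standard normal and has correlation rho with z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The normal cdf turns each coordinate into a uniform marginal.
        pairs.append((STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)))
    return pairs


def exponential_ppf(u, rate):
    """Inverse cdf of the exponential distribution."""
    return -math.log(1.0 - u) / rate


uniform_pairs = gaussian_copula_pairs(rho=0.8, size=5000, seed=0)

# Plug the uniforms into the inverse cdfs of the desired marginals:
# X1 ~ Exp(1) and X2 ~ Normal(10, 2), which remain positively dependent.
target_normal = NormalDist(mu=10.0, sigma=2.0)
samples = [
    (exponential_ppf(u1, rate=1.0), target_normal.inv_cdf(u2))
    for u1, u2 in uniform_pairs
]
```

The copula carries only the dependence between the columns, while each marginal can be chosen independently.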

Quickstart

In this short tutorial we will guide you through a series of steps that will help you get started with the most basic usage of Copulas in order to generate samples from a simple dataset.

NOTE: To be able to run this demo you will need to install the package from its sources.

1. Load the data

The first step is to load the data we will use to fit Copulas. In order to do so, we will first import the module pandas and call its function read_csv with the path to our example dataset.

In this case, we will load the iris dataset into a pandas.DataFrame.

import pandas as pd
data = pd.read_csv('data/iris.data.csv')

This will return a DataFrame with 4 columns. Its first three rows, shown transposed, look like this:

              0    1    2
feature_01  5.1  4.9  4.7
feature_02  3.5  3.0  3.2
feature_03  1.4  1.4  1.3
feature_04  0.2  0.2  0.2
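
Since Copulas expects numerical data without missing values, it can be worth checking the table before fitting. The helper below is a hypothetical sketch that relies only on pandas; check_numeric_table is not part of the Copulas API, and the small DataFrame is a stand-in built from the rows shown above.

```python
import pandas as pd


def check_numeric_table(data):
    """Raise ValueError unless all columns are numeric and have no missing values."""
    non_numeric = [
        column for column in data.columns
        if not pd.api.types.is_numeric_dtype(data[column])
    ]
    if non_numeric:
        raise ValueError("Non-numeric columns: {}".format(non_numeric))
    if data.isnull().values.any():
        raise ValueError("The table contains missing values")
    return data


# Stand-in for the loaded dataset, using the three rows shown above.
example = pd.DataFrame({
    "feature_01": [5.1, 4.9, 4.7],
    "feature_02": [3.5, 3.0, 3.2],
    "feature_03": [1.4, 1.4, 1.3],
    "feature_04": [0.2, 0.2, 0.2],
})
checked = check_numeric_table(example)
```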

2. Create a Copula instance

The next step is to import Copulas and create an instance of the desired copula.

To do so, we need to import the copulas.multivariate.GaussianMultivariate and call it, in order to create a GaussianMultivariate instance with the default arguments:

from copulas.multivariate import GaussianMultivariate
copula = GaussianMultivariate()

3. Fit the model

Once we have a Copulas instance, we can proceed to call its fit method, passing the data that we loaded before in order to start the fitting process:

copula.fit(data)

4. Sample new data

After the model has been fitted, we are ready to generate new samples by calling the sample method of the Copulas instance passing it the desired amount of samples:

num_samples = 1000
samples = copula.sample(num_samples)

This will return a DataFrame with the same number of columns as the original data.

                   0         1         2
feature_01  7.534814  7.255292  5.723322
feature_02  2.723615  2.959855  3.282245
feature_03  6.465199  6.896618  2.658393
feature_04  2.267646  2.442479  1.109811

The returned object, samples, is a pandas.DataFrame containing a table of synthetic data with the same format as the input data and 1000 rows as we requested.
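
A quick way to sanity-check synthetic data is to compare its summary statistics with those of the original table. The sketch below uses only pandas and numpy; compare_summary is a hypothetical helper, and the two random tables are stand-ins so that the snippet runs on its own. In this tutorial you would pass data and samples instead.

```python
import numpy as np
import pandas as pd


def compare_summary(real, synthetic):
    """Return the per-column mean and std of both tables side by side."""
    return pd.DataFrame({
        "real_mean": real.mean(),
        "synthetic_mean": synthetic.mean(),
        "real_std": real.std(),
        "synthetic_std": synthetic.std(),
    })


# Stand-in tables with the same columns (not the real iris data).
rng = np.random.RandomState(0)
columns = ["feature_01", "feature_02"]
real = pd.DataFrame(rng.normal(5.0, 1.0, size=(150, 2)), columns=columns)
synthetic = pd.DataFrame(rng.normal(5.0, 1.0, size=(1000, 2)), columns=columns)

summary = compare_summary(real, synthetic)

# Largest absolute difference between the two correlation matrices.
max_correlation_gap = (real.corr() - synthetic.corr()).abs().values.max()
```

Small gaps in the means, standard deviations and correlations suggest that the samples follow the original data reasonably well, although they do not prove it.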

5. Load and save a model

For some copula models the fitting process can take a lot of time, so we would probably like to avoid having to fit every time we want to generate samples. Instead we can fit a model once, save it, and load it every time we want to sample new data.

If we have a fitted model, we can save it by calling its save method, which takes as its only argument the path where the model will be stored. Similarly, the load method loads a model stored on disk, taking as its argument the path where the model is stored.

model_path = 'mymodel.pkl'
copula.save(model_path)

Once the model is saved, it can be loaded back as a Copulas instance by using the load method:

NOTE: In order to load a saved model, you need to load it using the same class that was used to save it.

new_copula = GaussianMultivariate.load(model_path)

At this point we could use this model instance to generate more samples.

new_samples = new_copula.sample(num_samples)
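
The "fit once, then reuse" pattern can be wrapped in a small helper. The sketch below is generic: it uses the standard pickle module and a plain dictionary as a stand-in model, and load_or_fit and fit_stand_in are hypothetical names, not part of Copulas. With a real copula you would call the save and load methods shown above instead of pickle.

```python
import os
import pickle
import tempfile


def load_or_fit(path, fit_function):
    """Load a pickled model from `path`, or build it with `fit_function` and save it."""
    if os.path.exists(path):
        with open(path, "rb") as model_file:
            return pickle.load(model_file)
    model = fit_function()
    with open(path, "wb") as model_file:
        pickle.dump(model, model_file)
    return model


fit_calls = []


def fit_stand_in():
    """Hypothetical slow fit: records each call and returns a plain dict."""
    fit_calls.append(1)
    return {"mean": 5.84, "std": 0.83}


with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "mymodel.pkl")
    first = load_or_fit(model_path, fit_stand_in)   # fits and saves
    second = load_or_fit(model_path, fit_stand_in)  # loads from disk
```

After both calls the fit function has run only once. Only unpickle files that come from a source you trust.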

6. Extract and set parameters

In some cases it's more useful to obtain the parameters from a fitted copula than to save and load from disk.

Once our copula is fitted, we can extract its parameters using the to_dict method:

copula_params = copula.to_dict()

This will return a dictionary containing all the copula parameters:

{'covariance': [[1.006711409395973,
   -0.11010327176239859,
   0.877604856347186,
   0.8234432550696282],
  [-0.11010327176239859,
   1.006711409395972,
   -0.4233383520816992,
   -0.3589370029669185],
  [0.877604856347186,
   -0.4233383520816992,
   1.006711409395973,
   0.9692185540781538],
  [0.8234432550696282,
   -0.3589370029669185,
   0.9692185540781538,
   1.006711409395974]],
 'distribs': {'feature_01': {'type': 'copulas.univariate.gaussian.GaussianUnivariate',
   'fitted': True,
   'constant_value': None,
   'mean': 5.843333333333334,
   'std': 0.8253012917851409},
  'feature_02': {'type': 'copulas.univariate.gaussian.GaussianUnivariate',
   'fitted': True,
   'constant_value': None,
   'mean': 3.0540000000000003,
   'std': 0.4321465800705435},
  'feature_03': {'type': 'copulas.univariate.gaussian.GaussianUnivariate',
   'fitted': True,
   'constant_value': None,
   'mean': 3.758666666666666,
   'std': 1.7585291834055212},
  'feature_04': {'type': 'copulas.univariate.gaussian.GaussianUnivariate',
   'fitted': True,
   'constant_value': None,
   'mean': 1.1986666666666668,
   'std': 0.7606126185881716}},
 'type': 'copulas.multivariate.gaussian.GaussianMultivariate',
 'fitted': True,
 'distribution': 'copulas.univariate.gaussian.GaussianUnivariate'}

Once we have all the parameters we can create a new identical Copula instance by using the method from_dict:

new_copula = GaussianMultivariate.from_dict(copula_params)

At this point we could use this model instance to generate more samples.

new_samples = new_copula.sample(num_samples)
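
The dictionary shown above contains only lists, numbers, strings, booleans and None, so it can be stored as JSON. The sketch below uses a trimmed, hypothetical two-column dictionary with rounded values rather than real to_dict output, and it assumes that the values are plain Python types (NumPy scalars, for example, would need converting first).

```python
import json

# Trimmed stand-in with the same structure as the output shown above.
gaussian_univariate = "copulas.univariate.gaussian.GaussianUnivariate"
params = {
    "covariance": [[1.0, -0.11], [-0.11, 1.0]],
    "distribs": {
        "feature_01": {
            "type": gaussian_univariate,
            "fitted": True,
            "constant_value": None,
            "mean": 5.84,
            "std": 0.83,
        },
        "feature_02": {
            "type": gaussian_univariate,
            "fitted": True,
            "constant_value": None,
            "mean": 3.05,
            "std": 0.43,
        },
    },
    "type": "copulas.multivariate.gaussian.GaussianMultivariate",
    "fitted": True,
    "distribution": gaussian_univariate,
}

serialized = json.dumps(params, indent=2)  # text that can be written to a file
restored = json.loads(serialized)          # equal to the original dictionary
```

The restored dictionary could then be passed to from_dict in the same way as copula_params above.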

What's next?

For more details about Copulas and all its possibilities and features, please check the documentation site.

There you can learn more about how to contribute to Copulas in order to help us develop new features or cool ideas.

Credits

Copulas is an open source project from the Data to AI Lab at MIT which has been built and maintained over the years by the following team:

Manuel Alvarez manuel@pythiac.com
Carles Sala carles@pythiac.com
José David Pérez jose@pythiac.com
(Alicia) Yi Sun yis@mit.edu
Andrew Montanez amontane@mit.edu
Kalyan Veeramachaneni kalyan@csail.mit.edu
paulolimac paulolimac@gmail.com

Related Projects

SDV

SDV, for Synthetic Data Vault, is the end-user library for synthesizing data in development under the HDI Project. SDV allows you to easily model and sample relational datasets using Copulas through a simple API. Other features include anonymization of Personally Identifiable Information (PII) and preserving relational integrity on sampled records.

TGAN

TGAN is a GAN-based model for synthesizing tabular data. It's also developed by MIT's Data to AI Lab and is under active development.

History

0.2.5 (2020-01-17)

General Improvements

Convert import_object to get_instance - Issue #114 by @JDTheRipperPC

0.2.4 (2019-12-23)

New Features

Allow creating copula classes directly - Issue #117 by @csala

General Improvements

Remove select_copula from Bivariate - Issue #118 by @csala
Rename TruncNorm to TruncGaussian and make it non standard - Issue #102 by @csala @JDTheRipperPC

Bugs fixed

Error on Frank and Gumbel sampling - Issue #112 by @csala

0.2.3 (2019-09-17)

New Features

Add support to Python 3.7 - Issue #53 by @JDTheRipperPC

General Improvements

Document RELEASE workflow - Issue #105 by @JDTheRipperPC
Improve serialization of univariate distributions - Issue #99 by @ManuelAlvarezC and @JDTheRipperPC

Bugs fixed

The method 'select_copula' of Bivariate returns the wrong CopulaType - Issue #101 by @JDTheRipperPC

0.2.2 (2019-07-31)

New Features

truncnorm distribution and a generic wrapper for scipy.rv_continuous distributions - Issue #27 by @amontanez, @csala and @ManuelAlvarezC
Independence bivariate copulas - Issue #46 by @aliciasun, @csala and @ManuelAlvarezC
Option to select seed on random number generator - Issue #63 by @echo66 and @ManuelAlvarezC
Option on Vine copulas to select number of rows to sample - Issue #77 by @ManuelAlvarezC
Make copulas accept both scalars and arrays as arguments - Issues #85 and #90 by @ManuelAlvarezC

General Improvements

Ability to properly handle constant data - Issues #57 and #82 by @csala and @ManuelAlvarezC
Tests for analytics properties of copulas - Issue #61 by @ManuelAlvarezC
Improved documentation - Issue #96 by @ManuelAlvarezC

Bugs fixed

Fix bug on Vine copulas, that made it crash during the bivariate copula selection - Issue #64 by @echo66 and @ManuelAlvarezC

0.2.1 - Vine serialization

Add serialization to Vine copulas.
Add distribution as argument for the Gaussian Copula.
Improve Bivariate Copulas code structure to remove code duplication.
Fix bug in Vine Copulas sampling: 'Edge' object has no attribute 'index'
Improve code documentation.
Improve code style and linting tools configuration.

0.2.0 - Unified API

New API for stats methods.
Standardize input and output to numpy.ndarray.
Increase unittest coverage to 90%.
Add methods to load/save copulas.
Improve Gaussian copula sampling accuracy.

0.1.1 - Minor Improvements

Different Copula types separated in subclasses
Extensive Unit Testing
More pythonic names in the public API.
Stop using third party elements that will be deprecated soon.
Add methods to sample new data on bivariate copulas.
New KDE Univariate copula
Improved examples with additional demo data.

0.1.0 - First Release

First release on PyPI.

Release history

0.12.4.dev2 pre-release yanked

Apr 28, 2025

Reason this release was yanked:

this was a test rc

0.12.4.dev1 pre-release yanked

Apr 28, 2025

Reason this release was yanked:

This was a test rc

0.12.4.dev0 pre-release yanked

Apr 28, 2025

Reason this release was yanked:

This was a test rc

0.12.3

Jun 13, 2025

0.12.3.dev1 pre-release

Jun 13, 2025

0.12.3.dev0 pre-release

Apr 25, 2025

0.12.2

Apr 2, 2025

0.12.2.dev0 pre-release

Apr 2, 2025

0.12.1

Jan 15, 2025

0.12.1.dev0 pre-release

Jan 13, 2025

0.12.0

Nov 13, 2024

0.12.0.dev0 pre-release

Nov 12, 2024

0.11.1

Aug 21, 2024

0.11.1.dev0 pre-release

Aug 20, 2024

0.11.0

Apr 10, 2024

0.11.0.dev0 pre-release

Apr 9, 2024

0.10.1

Mar 13, 2024

0.10.1.dev0 pre-release

Mar 13, 2024

0.10.0

Nov 13, 2023

0.10.0.dev0 pre-release

Nov 13, 2023

0.9.2

Oct 12, 2023

0.9.2.dev0 pre-release

Oct 12, 2023

0.9.1

Aug 10, 2023

0.9.1.dev0 pre-release

Aug 10, 2023

0.9.0

Apr 26, 2023

0.9.0.dev0 pre-release

Apr 26, 2023

0.8.1.dev0 pre-release

Apr 25, 2023

0.8.0

Jan 6, 2023

0.8.0.dev0 pre-release

Jan 5, 2023

0.7.1.dev0 pre-release

Dec 26, 2022

0.7.0

May 10, 2022

0.7.0.dev0 pre-release

May 10, 2022

0.6.1

Feb 25, 2022

0.6.1.dev0 pre-release

Feb 18, 2022

0.6.0

Nov 5, 2021

0.6.0.dev0 pre-release

Nov 5, 2021

0.5.2.dev1 pre-release

Nov 4, 2021

0.5.2.dev0 pre-release

Nov 4, 2021

0.5.1

Aug 17, 2021

0.5.1.dev1 pre-release

Aug 12, 2021

0.5.1.dev0 pre-release

Jul 26, 2021

0.5.0

Feb 24, 2021

0.5.0.dev1 pre-release

Feb 23, 2021

0.5.0.dev0 pre-release

Feb 23, 2021

0.4.0

Jan 27, 2021

0.4.0.dev0 pre-release

Jan 27, 2021

0.3.3

Sep 18, 2020

0.3.3.dev0 pre-release

Sep 18, 2020

0.3.2

Aug 7, 2020

0.3.2.dev1 pre-release

Aug 7, 2020

0.3.2.dev0 pre-release

Aug 4, 2020

0.3.1

Jul 9, 2020

0.3.1.dev0 pre-release

Jul 9, 2020

0.3.0

Mar 28, 2020

0.3.0.dev0 pre-release

Mar 28, 2020

This version

0.2.5

Jan 17, 2020

0.2.4

Dec 23, 2019

0.2.3

Sep 17, 2019

0.2.1

Jan 17, 2019

0.2.0

Sep 14, 2018

0.1.1

Aug 23, 2018

0.1.0

Jun 26, 2018

0.0.0

Jun 12, 2018

Download files

Source Distribution

copulas-0.2.5.tar.gz (105.3 kB)

Uploaded Jan 17, 2020 Source

Built Distribution

copulas-0.2.5-py2.py3-none-any.whl (44.2 kB)

Uploaded Jan 17, 2020 Python 2, Python 3

File details

Details for the file copulas-0.2.5.tar.gz.

File metadata

Download URL: copulas-0.2.5.tar.gz
Upload date: Jan 17, 2020
Size: 105.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9

File hashes

Hashes for copulas-0.2.5.tar.gz
Algorithm	Hash digest
SHA256	`49cf10aa112fe815676600e339e68f24e64d93363f61737f960e199d00460a94`
MD5	`ed01e0b2f9372817d983b5d28b2dfad1`
BLAKE2b-256	`91b630cec515131b52890044391b713eadeb247b4dd5e123a18ed03cb84e9acc`

File details

Details for the file copulas-0.2.5-py2.py3-none-any.whl.

File metadata

Download URL: copulas-0.2.5-py2.py3-none-any.whl
Upload date: Jan 17, 2020
Size: 44.2 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9

File hashes

Hashes for copulas-0.2.5-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`9d6c2b769adeec98ef567bbe8c5d0368bb4a4be05bf0052ef7f8b8a16d7a0e77`
MD5	`8ef859386cca9d49e1ff11b5a59e32f8`
BLAKE2b-256	`685f290c5af85ce1d5126a7c6aaba6376837ff84b755be2b30ba216d3164ebe6`
