Skip to main content

Top-level package for statsmodels copula wrapper.

Project description

Suppose we want to specify a continuous n-dimensional joint probability distribution by giving:

  • n arbitrary marginal distributions
  • a set of pairwise correlations between the dimensions

Copulas are a good way to achieve this separate specification of marginal distribution and dependence.

This is a wrapper around the CopulaDistribution class from the statsmodels package, providing an interface that I prefer. (The rationale is disused in more detail below).

Installation

pip install copula_wrapper  # or `poetry add copula_wrapper`

Usage

Dictionary arguments and Kendall's tau:

from copula_wrapper import CopulaJoint
from scipy import stats
import seaborn as sns

marginals = {
    "consumption elasticity": stats.uniform(0.75, 3),
    "market return": stats.lognorm(0.05, 0.05),
    "risk of war": stats.beta(1, 50)
}

tau = {
    # Missing pairs are assumed to be independent
    ("risk of war", "market return"): -0.5,
}

# Instantiate the (frozen) joint distribution
dist = CopulaJoint(marginals, kendall_tau=tau)

# Query the CDF or PDF
p = dist.cdf({
    "consumption elasticity": 1.5,
    "market return": 1.5,
    "risk of war": 0.5
})

# Sample randomly
sample = dist.rvs(100_000)

# Plot
sns.pairplot(sample, plot_kws={"s": 1})

Array arguments and Spearman's rho:

import pandas as pd
import numpy as np
from copula_wrapper import CopulaJoint
from scipy import stats
import seaborn as sns

marginals = [
    stats.uniform(0.75, 3),
    stats.lognorm(0.05, 0.05),
    stats.beta(1, 50)
]

spearman = np.eye(3)
spearman[1, 2] = spearman[2, 1] = -0.5

# Instantiate the (frozen) joint distribution
dist = CopulaJoint(marginals, spearman_rho=spearman)

# Query the CDF or PDF
p = dist.cdf([1.5, 1.5, 0.5])

# Sample randomly
sample = dist.rvs(10_000)

# Plot
sns.pairplot(pd.DataFrame(sample), plot_kws={"s": 1})

img_2.png

Discussion of interface

CopulaJoint is a wrapper around CopulaDistribution from statsmodels.

The context that motivated this wrapper was:

  • A Monte Carlo simulation model with 10+ input dimensions
  • The correlations come from subjective judgments / expert elicitation
  • Many correlations are zero
  • Users may not be familiar with or care about the mathematics of copulas (I wasn't) and may not be familiar with API conventions of SciPy (that also influenced statsmodels).
  • In my opinion, the terminology in the area invites confusion between a copula and the joint distribution (with arbitrary marginals) that it induces.

How this wrapper changes the API

Abstracts away the underlying copula and takes in rank correlations

In statsmodels, you'd do something like the following

from statsmodels.distributions.copula.api import CopulaDistribution, GaussianCopula
from scipy import stats
corr = [
	[1.0, 0.5],
	[0.5, 1.0],
]
joint = CopulaDistribution(GaussianCopula(corr=corr), marginals=[stats.beta(2,2), stats.norm()])

where corr is the matrix of Pearson's rho correlations (the standard linear correlation coefficient most people are familiar with). IMO, this surfaces implementation details and requires the user to engage with the internals of their joint distribution, which they may not understand or care about. It also invites the misunderstanding that the joint distribution will have a Pearson's rho matching the corr we pass in. This is not the case, as the following example shows:

import numpy as np
from scipy import stats
from statsmodels.distributions.copula.api import CopulaDistribution, GaussianCopula

c = 0.8
corr = [
    [1.0, c],
    [c, 1.0],
]
joint = CopulaDistribution(GaussianCopula(corr=corr),
                           marginals=[stats.norm(2, 2), stats.lognorm(loc=0, s=1)])

samples = joint.rvs(int(1e6))
joint_r = stats.pearsonr(samples[:, 0], samples[:, 1])[0]  # about 0.6
assert not np.isclose(joint_r, c)

Therefore, this wrapper instead takes rank correlations as input, and these are satisfied by the resulting joint distribution. It converts these to the appropriate correlation measure for the underlying copula. You must explicitly provide kendall_tau or spearman_rho.

(For now, only the Gaussian copula is supported. But I've written the class to be extensible easily and even put some hints as code comments. PRs welcome. If more copulas are supported, there will be a parameter family with values like "Gaussian", "Clayton"...)

Allows dimension names

if dimensions are to be named, marginals must be a dict, and rank_corr must be a dict of tuples, e.g.

from scipy import stats
marginals = {
	"a": stats.uniform(0, 1),
	"b": stats.lognorm(2, 3),
	"c": stats.norm(4, 5),
}

rank_corr = {
	# Missing pairs are assumed to be independent
	("a", "b"): 0.5,
}

and the first argument to .pdf, .cdf, and .logpdf must be a dict with the same keys as marginals. (I originally had these methods support dimensions as keyword arguments, but I thought it was better not to modify the API even more, plus dictionary keys can be any hashable object, not just strings.).

If dimensions are named, .rvs returns a Pandas DataFrame with dimension names as column names.

Otherwise, marginals must be a list and rank_corr must be a square matrix, and .rvs returns a numpy array, just like in statsmodels.

Satisfies interface of SciPy multivariate frozen distribution

For background, the Scipy pattern is that there is e.g. a norm_gen subclass of rv_continuous that is used to create the norm instance, which can then optionally be 'frozen'.

The statsmodels CopulaDistribution does not follow this pattern. It acts 'frozen' in sofar as the underlying copula (e.g. Gaussian) and its correlation matrix are specified at initialization. However, it acts 'unfrozen' in that cop_args and marg_args can be passed to .pdf, .cdf, and .logpdf. As a result, CopulaDistribution does not subclass any of the scipy.stats classes.

I find this a bit inconsistent, and don't find cop_args and marg_args useful, so I drop them in this wrapper. The benefit is that the wrapper now satisfies the interface of a SciPy multivariate frozen distribution.

This means that the marginals provided to CopulaJoint must be frozen as well.

It also means I use size (SciPy) instead of nobs (statsmodels) for the call to rvs.

Tests

While the copula tests in statsmodels are likely to be good for their intended purpose, I couldn't quickly decipher exactly what behaviour these tests were checking for. In order to make sure this wrapper behaves in the way that its interface promises, I created a set of end-to-end tests. These tests that check that random samples from the joint distribution have:

  1. The expected rank correlation matrix
  2. The expected marginal distributions (using the Kolmogorov-Smirnov statistic)

The tests are parametrized (run for many inputs).

As they involve large numbers of random samples, they are slow.

There are also a few simple unit tests.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

copula_wrapper-0.1.6.tar.gz (23.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

copula_wrapper-0.1.6-py3-none-any.whl (8.2 kB view details)

Uploaded Python 3

File details

Details for the file copula_wrapper-0.1.6.tar.gz.

File metadata

  • Download URL: copula_wrapper-0.1.6.tar.gz
  • Upload date:
  • Size: 23.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/22.3.0

File hashes

Hashes for copula_wrapper-0.1.6.tar.gz
Algorithm Hash digest
SHA256 95141a71f8408d01f76e73347e635bd1763b1379020970d3b5cfa81059ab80ac
MD5 acbdf1f91fa4d112319c87d30c9bc040
BLAKE2b-256 c23c42c0cda4feb184b2ea630c4e9fa437c97899db4105542ae4d8fc79ab0dd3

See more details on using hashes here.

File details

Details for the file copula_wrapper-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: copula_wrapper-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 8.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/22.3.0

File hashes

Hashes for copula_wrapper-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 6c6f2d2b8f32cfe4bd00f6baabf7389eefb5f3f7d901be47229473f15060809a
MD5 d8e67a473838fbee140a939261e4a975
BLAKE2b-256 b8c61cc1e4728a9f58f7f5c14971bd5b1c09979196d25bdb9117f1123efc4a13

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page