Skip to main content

scmopy: Distribution-Agnostic Structural Causal Models Optimization in Python

Project description

scmopy: Distribution-Agnostic Structural Causal Models Optimization in Python

License PyPI - Version

scmopy is a composite package for causal discovery/analysis using several novel types of Structural Causal Models Optimization algorithms.

scmopy provides Distribution-Agnostic methods in identifying causality; in other words, distributional assumptions in typical SCM/SEM such as satisfying the normality conditions are not required.

The package is mainly structured in three parts:

  1. ESA-2SCM (Elastic Segment Allocation-based Two-Stage Least Squares SCM)
  1. Gradient Non-Gaussian SCM
  • Gradient Non-Gaussian SCM incorporates the information of higher order moment structures assuming non-gaussianity to determine the true causal direction.
  • Gradient Non-Gaussian SCM is a customized implementation of S.Shimizu and Y.Kano's conceptualization of nnSEM. Specifically, the quadratic objective function based on the difference between the sample moments and theoretical moments is optimized via gradient method (defaulting to BFGS) instead of performing GLS.
  • For details regarding the original nnSEM, please refer to:
  • S.Shimizu and Y.Kano (2008). Use of non-normality in structural equation modeling: Application to direction of causation, Journal of Statistical Planning and Inference, 138, 11, 3483-3491.

  1. Auto-SCM Selector for Optimization
  • The SCM Selector automatically determines the optimal model via pre-inspecting the dataset.
  • Internally, it utilizes voting methods in combination with multiple hypothesis testing techniques on the data's original distribution for the precision of model determination: ESA-2SCM is selected as the basemodel if the pre-inspection result suggests gaussianity, otherwise the Gradient Non-Gaussian SCM is selected.

For further details on each model's algorithm, refer to the Models Overview section below.

Requirements

  • Python3

  • numpy

  • pandas

  • scipy

Installation

To install the scmopy package, use pip as follows:

pip install scmopy

Example Usage

from scmopy.nongaussian_scm import GradientNonGaussianScm
from scmopy.gaussian_scm import Esa2Scm
from scmopy.model_selection import ScmSelector

import numpy as np # to generate sample data for demonstration

Gradient Non-Gaussian SCM

# Generate sample data for demonstration
N = 10000
np.random.seed(11)
x2 = np.random.gamma(shape=0.5, scale=0.5, size=N) # non-gaussian sample
noise = np.random.random(size=N) # non-gaussian noise
b12 = 1.8 # True Causal Coefficient set as 1.8
x1 = b12 * x2 + noise # True Causal Direction set as x2 -> x1

# Initialize GradientNonGaussianSCM with no prior knowledge on the causal direction
scm = GradientNonGaussianScm(x1, x2, prior_knowledge=None, unit_var=False)

# Fit the model
scm.fit(alpha=0.1) # Set alpha for chi2 test using the test statistic T2 for determination of causal direction

# To confirm the estimated True Causal Direction
print(scm.causal_direction)

# To confirm the estimated True Causal Coefficient
print(scm.causal_coef)

# To confirm the test statistic (T2) and p-value for hypothesis testing on the Causal Direction
print(scm.test_statistic)
print(scm.p_value)


# To confirm the fit score
print(scm.score)
# For result summary:
scm.summary()
x2->x1 x1->x2
Causal Direction Decisive True Decisive False
Causal Coefficient 1.8010473505277451 0.3574865003509795
Test Statistic 3.4739238216348447 646.2378958215542
P-value 0.6273369079064182 0.0
Reject H0 False True
Goodness of Fit 0.82038 -



ESA-2SCM

# Initialize ESA-2SCM with no prior knowledge on the causal direction
scm = Esa2Scm(x1, x2, prior_knowledge=None)

# Fit the model, using Synthetic IV generation method(syniv_method, default='esa') to estimate causality
# Adjust the parameter M(default=2) to manually manage the degree of correlation between the Synthetic IVs (2SLS-converted) and the respective endogenous variables
scm.fit(syniv_method="esa", M=3)

# To confirm the estimated True Causal Direction
print(scm.causal_direction)

# To confirm the estimated True Causal Coefficient
print(scm.causal_coef)

# To check the degree of correlation between the generated Synthetic IVs and the endogenous variables (x1 and x2, respectively):
print(scm.corr_x1_to_slsiv)
print(scm.corr_x2_to_slsiv)

# To confirm the true goodness of fit of the ESA-2SCM for determination of the causal direction:
print(scm.esa2scm_score)

# With causal direction determined via ESA-2SCM, to confirm the posthoc goodness of fit of the Regression Model using original variables:
print(scm.posthoc_score)
# For result summary:
scm.summary()
x2->x1 x1->x2
Causal Direction True False
Causal Coefficient 1.804851 0.385192
Goodness of Fit 0.39664 0.35318
Corr (2SLS_IV-Explanatory) 0.694676 0.774985
Posthoc Goodness of Fit 0.82038 -



Auto SCM Selector for Optimal SCM Selection

# Initialize Auto SCM Selector
selector = ScmSelector(x1, x2)

# Fit the selector
selector.fit(alpha=0.15, voting_strategy='strict')

# Confirm optimal model for the given dataset x1 and x2.
best_scm = selector.selected_scm

# Fit using the selected model
best_scm.fit()

# Confirm the estimated True Causal Direction
print(best_scm.causal_direction)

# Confirm the estimated True Causal Coefficient
print(best_scm.causal_coef)
# For result summary:
best_scm.summary()
x2->x1 x1->x2
Causal Direction Decisive True Decisive False
Causal Coefficient 1.8010473505277451 0.3574865003509795
Test Statistic 3.4739238216348447 646.2378958215542
P-value 0.6273369079064182 0.0
Reject H0 False True
Goodness of Fit 0.82038 -

Models Overview

Gradient Non-Gaussian SCM accounts for the case where the exogenous variable or the noise follows non-gaussian distribution.
ESA-2SCM, on the other hand, accounts for the case where the noise follows gaussian distribution.

In scmopy, these two models are deployed in a complementary manner, ultimately enabling Distribution-Agnostic SCM optimization for causal discovery.

Gradient Non-Gaussian SCM

Gradient Non-Gaussian SCM is a customized implementation of S.Shimizu and Y.Kano's conceptualization of nnSEM (2008). More specifically:

  • the quadratic objective function based on the difference between the sample moments and theoretical moments is optimized via gradient method (defaulting to BFGS) instead of performing GLS.
  • Weight matrix is defined as $\hat{\Sigma}$ and Pseudo-inverse matrix $\hat{\Sigma}^+$ is used instead if the inverse matrix of $\hat{\Sigma}$ cannot be obtained.

With $\xi$ and $\eta$ defined as exogenous and endogenous vectors in observable $x$ and latent $f$, the standard SEM is denoted as:

Reduced form of the above with respect to $x$ can be rewritten as follows, with $G$ as the selection matrix which selects only the observed variables:

With $H_i$ as the selection matrix which selects non-duplicated elements, the first and second to fourth order moment structures of the SEM can be denoted, respectively, as (S.Shimizu and Y.Kano, 2008):

Assumption that the SEM is identifiable using moment structures is equivalent of:

Denote sample counterparts to the first and second to fourth theoretical moment structures above as:

Then with $\tau_0$ as the true parameter, the following holds:

With $m$ and $\sigma(\tau)$ denoted as:

$\tau$ can be estimated with:

S.Shimizu and Y.Kano (2008) obtains $\hat{\tau}$ with GLS using $\hat{V}$ for $\hat{U}$.

Gradient Non-Gaussian SCM in scmopy adopts instead a gradient method (defaulting to BFGS) for solving the above with $\hat{\Sigma}$ defined as:

and with $\tau$ estimation rewritten as:

Assuming unit variance $(I)$ and single parameter/moment ($\tau$, $\sigma(\tau)$) for simplification:

so that

Applying chain rule,

Then, the basic form of gradient descent can be written as:

Generalizing,

Pseudo-inverse matrix of $\hat{\Sigma}$ is used instead if the inverse matrix cannot be obtained.

Application as regards to the determination of true causal direction is identical to the case of nnSEM (S.Shimizu and Y.Kano, 2008), as follows.

Suppose that we are interested in identifying the true causal direction between the two random variables $x_1$ and $x_2$ ($x_1$ -> $x_2$ vs. $x_2$ -> $x_1$). Then the SCM for testing can be denoted as (after mean-centering):

with,

The first- and second-order moment structures of $(1)$ can be obtained as:

As there are as many parameters as the sample moments, models $(1)$ and $(2)$ are saturated, and equivalent (as they provide the same moment structures thus disabling to determine which model is better based on these test statistics alone) to each other when using only up to second order moments.

Now, expanding up to third and fourth order moment structures, S.Shimizu and Y.Kano (2008) prove that under the satisfaction of the following three conditions:
(1) Either the exogenous variable $x_2$ or the noise $e_1$ is non-gaussian (when assuming that the model $(1)$ holds True)
(2) $corr(x_1, x_2) \neq 0$
(3) $-1 < corr(x_1, x_2) < 1$
models $(1)$ and $(2)$ can be differentiated.

That is to say,
with $ r = corr(x_1,x_2) $, and $i$-th order moment structure defined as:

and,

to get the isolated quantity from the fourth order moment ($E(z^4)$), especially the parts that are independent from the lower order moments, leaving a quantity that captures the "pure" fourth-order behavior of the distribution,

if (1) $0< |r| < 1$ and (2) either $E(x_2^3), E(e_1^3), cum_4(x_2)$, or $cum_4(e_1)$ is non-zero, the following holds:

so that the models $(1)$ and $(2)$ are distinguishable using the moments up to third or fourth moment orders.

More specifically,

For

to hold,

should hold. Solving for the matrix,

Similar derivation process can be applied for the fourth-order moment.

Thus, the two models $(1)$ and $(2)$ are distinguishable from each other using either third- and/or fourth-order moments, with each order moment structure (e.g., for model $(1)$) defined respectively as:

and with $H_0$ and $H_1$ defined as:

Test statistic $T_2$ to test $H_0$ is defined as (S.Shimizu and Y.Kano, 2008):

or, in case of Gradient Non-Gaussian SCM in scmopy,

where $T_2$ asymptotically follows chi-squared distribution with dof $u-v$ where $u$ is the total number of distinct moments employed and $v$ is the total number of parameters estimated.

Reference and the original conceptualization of nnSEM by S.Shimizu and Y.Kano (2008):

  • S.Shimizu and Y.Kano (2008). Use of non-normality in structural equation modeling: Application to direction of causation, Journal of Statistical Planning and Inference, 138, 11, 3483-3491.



ESA-2SCM

ESA-2SCM is a new method for detecting causality based on Elastic Segment Allocation-based synthetic instrumental variables with 2SLS application for estimating structural causal models.

Suppose that you are interested in discovering the causal relationship between $x_1$ and $x_2$ (e.g., determining the true causal direction: $x_1$ -> $x_2$ vs. $x_2$ -> $x_1$, measuring the magnitude of causal impact):

Estimation of the above equation under standard OLS is structurally biased and inconsistent due to endogeneity:

where

thus,

The estimators are also asymptotically inconsistent, as:

ESA-2SCM provides a countermeasure to such problem, enabling the determination of true causal direction and estimation of the true causal coefficient through the following procedures.

  1. Vector definition:

  1. Sorting:

  1. Set initial number of segments (M):

  1. Segment size allocation:

  1. Elastic adjustment algorithm for adjusting the number of segments:

  1. Grouping based on the adjusted sizes and number of segments:

  1. Segment value assignment:

  1. Apply 2SLS using the generated Synthetic IV vectors (Z):

    • Get $z_1$ and $z_2$ via applying the process (1) to (7) for $x_1$ and $x_2$, then perform 2SLS to estimate for:

Compare fits to determine the true causal direction, and estimate the true causal coefficient from the correctly identified model.


Reference and detailed documentation for the ESA-2SCM algorithm:

  • Lee, Sanghoon (2024). ESA-2SCM for Causal Discovery: Causal Modeling with Elastic Segmentation-based Synthetic Instrumental Variable, SnB Political and Economic Research Institute, 1, 21. <snbperi.org/article/230> [ARTICLE LINK]

Examples

Examples of running scmopy in Jupyter Notebook are included in scmopy/examples

License

scmopy package is licensed under the terms of the MIT license

References

scmopy Package

Should you use the scmopy package to perform causal discovery, please kindly cite my original article and the original article by S.Shimizu and Y.Kano:

  • Lee, Sanghoon (2024). ESA-2SCM for Causal Discovery: Causal Modeling with Elastic Segmentation-based Synthetic Instrumental Variable, SnB Political and Economic Research Institute, 1, 21. <snbperi.org/article/230> [ARTICLE LINK]

  • S.Shimizu and Y.Kano (2008). Use of non-normality in structural equation modeling: Application to direction of causation, Journal of Statistical Planning and Inference, 138, 11, 3483-3491.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scmopy-0.1.0.tar.gz (19.9 kB view details)

Uploaded Source

Built Distribution

scmopy-0.1.0-py3-none-any.whl (19.7 kB view details)

Uploaded Python 3

File details

Details for the file scmopy-0.1.0.tar.gz.

File metadata

  • Download URL: scmopy-0.1.0.tar.gz
  • Upload date:
  • Size: 19.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for scmopy-0.1.0.tar.gz
Algorithm Hash digest
SHA256 40496829f645023040150ad8cee0159faaec5c96b032bef47dd278db9369d47f
MD5 c7f47d3e0949b30506f7c42a26875e96
BLAKE2b-256 5e55518ebff8744c4d4fb921a15f8af134d8b573cc6d9690f22a802bb6815196

See more details on using hashes here.

File details

Details for the file scmopy-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: scmopy-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 19.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for scmopy-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c5c5a3ef2546d3299ee49f2845362149be7136477ac96f026eb2010c24670f04
MD5 09df3eb5bff45b4e9b0c2e2e05cfdd88
BLAKE2b-256 b8e8c02618a8d5cd6af588cbda40f6173a75bd4a6235e8386079a4f1819adedb

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page