scmopy: Distribution-Agnostic Structural Causal Models Optimization in Python
Project description
scmopy: Distribution-Agnostic Structural Causal Models Optimization in Python
scmopy is a composite package for causal discovery/analysis using several novel types of Structural Causal Models Optimization algorithms.
scmopy provides Distribution-Agnostic methods in identifying causality; in other words, distributional assumptions in typical SCM/SEM such as satisfying the normality conditions are not required.
The package is mainly structured in three parts:
- ESA-2SCM (Elastic Segment Allocation-based Two-Stage Least Squares SCM)
- ESA-2SCM is a new method for detecting causality based on the Elastic Segment Allocation-based synthetic instrumental variables with 2SLS application for estimating structural causal models.
- For details and documentation, please refer to my Original Article:
- Lee, Sanghoon (2024). ESA-2SCM for Causal Discovery: Causal Modeling with Elastic Segmentation-based Synthetic Instrumental Variable, SnB Political and Economic Research Institute, 1, 21. <snbperi.org/article/230>
- Gradient Non-Gaussian SCM
- Gradient Non-Gaussian SCM incorporates the information of higher order moment structures assuming non-gaussianity to determine the true causal direction.
- Gradient Non-Gaussian SCM is a customized implementation of S.Shimizu and Y.Kano's conceptualization of nnSEM. Specifically, the quadratic objective function based on the difference between the sample moments and theoretical moments is optimized via gradient method (defaulting to BFGS) instead of performing GLS.
- For details regarding the original nnSEM, please refer to:
- S.Shimizu and Y.Kano (2008). Use of non-normality in structural equation modeling: Application to direction of causation, Journal of Statistical Planning and Inference, 138, 11, 3483-3491.
- Auto-SCM Selector for Optimization
- The SCM Selector automatically determines the optimal model via pre-inspecting the dataset.
- Internally, it utilizes voting methods in combination with multiple hypothesis testing techniques on the data's original distribution for the precision of model determination: ESA-2SCM is selected as the basemodel if the pre-inspection result suggests gaussianity, otherwise the Gradient Non-Gaussian SCM is selected.
For further details on each model's algorithm, refer to the Models Overview section below.
Requirements
-
Python3
-
numpy
-
pandas
-
scipy
Installation
To install the scmopy package, use pip
as follows:
pip install scmopy
Example Usage
from scmopy.nongaussian_scm import GradientNonGaussianScm
from scmopy.gaussian_scm import Esa2Scm
from scmopy.model_selection import ScmSelector
import numpy as np # to generate sample data for demonstration
Gradient Non-Gaussian SCM
# Generate sample data for demonstration
N = 10000
np.random.seed(11)
x2 = np.random.gamma(shape=0.5, scale=0.5, size=N) # non-gaussian sample
noise = np.random.random(size=N) # non-gaussian noise
b12 = 1.8 # True Causal Coefficient set as 1.8
x1 = b12 * x2 + noise # True Causal Direction set as x2 -> x1
# Initialize GradientNonGaussianSCM with no prior knowledge on the causal direction
scm = GradientNonGaussianScm(x1, x2, prior_knowledge=None, unit_var=False)
# Fit the model
scm.fit(alpha=0.1) # Set alpha for chi2 test using the test statistic T2 for determination of causal direction
# To confirm the estimated True Causal Direction
print(scm.causal_direction)
# To confirm the estimated True Causal Coefficient
print(scm.causal_coef)
# To confirm the test statistic (T2) and p-value for hypothesis testing on the Causal Direction
print(scm.test_statistic)
print(scm.p_value)
# To confirm the fit score
print(scm.score)
# For result summary:
scm.summary()
x2->x1 | x1->x2 | |
---|---|---|
Causal Direction | Decisive True | Decisive False |
Causal Coefficient | 1.8010473505277451 | 0.3574865003509795 |
Test Statistic | 3.4739238216348447 | 646.2378958215542 |
P-value | 0.6273369079064182 | 0.0 |
Reject H0 | False | True |
Goodness of Fit | 0.82038 | - |
ESA-2SCM
# Initialize ESA-2SCM with no prior knowledge on the causal direction
scm = Esa2Scm(x1, x2, prior_knowledge=None)
# Fit the model, using Synthetic IV generation method(syniv_method, default='esa') to estimate causality
# Adjust the parameter M(default=2) to manually manage the degree of correlation between the Synthetic IVs (2SLS-converted) and the respective endogenous variables
scm.fit(syniv_method="esa", M=3)
# To confirm the estimated True Causal Direction
print(scm.causal_direction)
# To confirm the estimated True Causal Coefficient
print(scm.causal_coef)
# To check the degree of correlation between the generated Synthetic IVs and the endogenous variables (x1 and x2, respectively):
print(scm.corr_x1_to_slsiv)
print(scm.corr_x2_to_slsiv)
# To confirm the true goodness of fit of the ESA-2SCM for determination of the causal direction:
print(scm.esa2scm_score)
# With causal direction determined via ESA-2SCM, to confirm the posthoc goodness of fit of the Regression Model using original variables:
print(scm.posthoc_score)
# For result summary:
scm.summary()
x2->x1 | x1->x2 | |
---|---|---|
Causal Direction | True | False |
Causal Coefficient | 1.804851 | 0.385192 |
Goodness of Fit | 0.39664 | 0.35318 |
Corr (2SLS_IV-Explanatory) | 0.694676 | 0.774985 |
Posthoc Goodness of Fit | 0.82038 | - |
Auto SCM Selector for Optimal SCM Selection
# Initialize Auto SCM Selector
selector = ScmSelector(x1, x2)
# Fit the selector
selector.fit(alpha=0.15, voting_strategy='strict')
# Confirm optimal model for the given dataset x1 and x2.
best_scm = selector.selected_scm
# Fit using the selected model
best_scm.fit()
# Confirm the estimated True Causal Direction
print(best_scm.causal_direction)
# Confirm the estimated True Causal Coefficient
print(best_scm.causal_coef)
# For result summary:
best_scm.summary()
x2->x1 | x1->x2 | |
---|---|---|
Causal Direction | Decisive True | Decisive False |
Causal Coefficient | 1.8010473505277451 | 0.3574865003509795 |
Test Statistic | 3.4739238216348447 | 646.2378958215542 |
P-value | 0.6273369079064182 | 0.0 |
Reject H0 | False | True |
Goodness of Fit | 0.82038 | - |
Models Overview
Gradient Non-Gaussian SCM accounts for the case where the exogenous variable or the noise follows non-gaussian distribution.
ESA-2SCM, on the other hand, accounts for the case where the noise follows gaussian distribution.
In scmopy, these two models are deployed in a complementary manner, ultimately enabling Distribution-Agnostic SCM optimization for causal discovery.
Gradient Non-Gaussian SCM
Gradient Non-Gaussian SCM is a customized implementation of S.Shimizu and Y.Kano's conceptualization of nnSEM (2008). More specifically:
- the quadratic objective function based on the difference between the sample moments and theoretical moments is optimized via gradient method (defaulting to BFGS) instead of performing GLS.
- Weight matrix is defined as $\hat{\Sigma}$ and Pseudo-inverse matrix $\hat{\Sigma}^+$ is used instead if the inverse matrix of $\hat{\Sigma}$ cannot be obtained.
With $\xi$ and $\eta$ defined as exogenous and endogenous vectors in observable $x$ and latent $f$, the standard SEM is denoted as:
Reduced form of the above with respect to $x$ can be rewritten as follows, with $G$ as the selection matrix which selects only the observed variables:
With $H_i$ as the selection matrix which selects non-duplicated elements, the first and second to fourth order moment structures of the SEM can be denoted, respectively, as (S.Shimizu and Y.Kano, 2008):
Assumption that the SEM is identifiable using moment structures is equivalent of:
Denote sample counterparts to the first and second to fourth theoretical moment structures above as:
Then with $\tau_0$ as the true parameter, the following holds:
With $m$ and $\sigma(\tau)$ denoted as:
$\tau$ can be estimated with:
S.Shimizu and Y.Kano (2008) obtains $\hat{\tau}$ with GLS using $\hat{V}$ for $\hat{U}$.
Gradient Non-Gaussian SCM in scmopy adopts instead a gradient method (defaulting to BFGS) for solving the above with $\hat{\Sigma}$ defined as:
and with $\tau$ estimation rewritten as:
Assuming unit variance $(I)$ and single parameter/moment ($\tau$, $\sigma(\tau)$) for simplification:
so that
Applying chain rule,
Then, the basic form of gradient descent can be written as:
Generalizing,
Pseudo-inverse matrix of $\hat{\Sigma}$ is used instead if the inverse matrix cannot be obtained.
Application as regards to the determination of true causal direction is identical to the case of nnSEM (S.Shimizu and Y.Kano, 2008), as follows.
Suppose that we are interested in identifying the true causal direction between the two random variables $x_1$ and $x_2$ ($x_1$ -> $x_2$ vs. $x_2$ -> $x_1$). Then the SCM for testing can be denoted as (after mean-centering):
with,
The first- and second-order moment structures of $(1)$ can be obtained as:
As there are as many parameters as the sample moments, models $(1)$ and $(2)$ are saturated, and equivalent (as they provide the same moment structures thus disabling to determine which model is better based on these test statistics alone) to each other when using only up to second order moments.
Now, expanding up to third and fourth order moment structures, S.Shimizu and Y.Kano (2008) prove that under the satisfaction of the following three conditions:
(1) Either the exogenous variable $x_2$ or the noise $e_1$ is non-gaussian (when assuming that the model $(1)$ holds True)
(2) $corr(x_1, x_2) \neq 0$
(3) $-1 < corr(x_1, x_2) < 1$
models $(1)$ and $(2)$ can be differentiated.
That is to say,
with $ r = corr(x_1,x_2) $, and $i$-th order moment structure defined as:
and,
to get the isolated quantity from the fourth order moment ($E(z^4)$), especially the parts that are independent from the lower order moments, leaving a quantity that captures the "pure" fourth-order behavior of the distribution,
if (1) $0< |r| < 1$ and (2) either $E(x_2^3), E(e_1^3), cum_4(x_2)$, or $cum_4(e_1)$ is non-zero, the following holds:
so that the models $(1)$ and $(2)$ are distinguishable using the moments up to third or fourth moment orders.
More specifically,
For
to hold,
should hold. Solving for the matrix,
Similar derivation process can be applied for the fourth-order moment.
Thus, the two models $(1)$ and $(2)$ are distinguishable from each other using either third- and/or fourth-order moments, with each order moment structure (e.g., for model $(1)$) defined respectively as:
and with $H_0$ and $H_1$ defined as:
Test statistic $T_2$ to test $H_0$ is defined as (S.Shimizu and Y.Kano, 2008):
or, in case of Gradient Non-Gaussian SCM in scmopy,
where $T_2$ asymptotically follows chi-squared distribution with dof $u-v$ where $u$ is the total number of distinct moments employed and $v$ is the total number of parameters estimated.
Reference and the original conceptualization of nnSEM by S.Shimizu and Y.Kano (2008):
- S.Shimizu and Y.Kano (2008). Use of non-normality in structural equation modeling: Application to direction of causation, Journal of Statistical Planning and Inference, 138, 11, 3483-3491.
ESA-2SCM
ESA-2SCM is a new method for detecting causality based on Elastic Segment Allocation-based synthetic instrumental variables with 2SLS application for estimating structural causal models.
Suppose that you are interested in discovering the causal relationship between $x_1$ and $x_2$ (e.g., determining the true causal direction: $x_1$ -> $x_2$ vs. $x_2$ -> $x_1$, measuring the magnitude of causal impact):
Estimation of the above equation under standard OLS is structurally biased and inconsistent due to endogeneity:
where
thus,
The estimators are also asymptotically inconsistent, as:
ESA-2SCM provides a countermeasure to such problem, enabling the determination of true causal direction and estimation of the true causal coefficient through the following procedures.
- Vector definition:
- Sorting:
- Set initial number of segments (M):
- Segment size allocation:
- Elastic adjustment algorithm for adjusting the number of segments:
- Grouping based on the adjusted sizes and number of segments:
- Segment value assignment:
-
Apply 2SLS using the generated Synthetic IV vectors (Z):
- Get $z_1$ and $z_2$ via applying the process (1) to (7) for $x_1$ and $x_2$, then perform 2SLS to estimate for:
Compare fits to determine the true causal direction, and estimate the true causal coefficient from the correctly identified model.
Reference and detailed documentation for the ESA-2SCM algorithm:
- Lee, Sanghoon (2024). ESA-2SCM for Causal Discovery: Causal Modeling with Elastic Segmentation-based Synthetic Instrumental Variable, SnB Political and Economic Research Institute, 1, 21. <snbperi.org/article/230> [ARTICLE LINK]
Examples
Examples of running ESA-2SCM in Jupyter Notebook are included in esa_2scm/examples
License
scmopy package is licensed under the terms of the MIT license
References
scmopy Package
Should you use the scmopy package to perform causal discovery, please kindly cite my original article and the original article by S.Shimizu and Y.Kano:
-
Lee, Sanghoon (2024). ESA-2SCM for Causal Discovery: Causal Modeling with Elastic Segmentation-based Synthetic Instrumental Variable, SnB Political and Economic Research Institute, 1, 21. <snbperi.org/article/230> [ARTICLE LINK]
-
S.Shimizu and Y.Kano (2008). Use of non-normality in structural equation modeling: Application to direction of causation, Journal of Statistical Planning and Inference, 138, 11, 3483-3491.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file scmopy-0.0.9.tar.gz
.
File metadata
- Download URL: scmopy-0.0.9.tar.gz
- Upload date:
- Size: 20.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0eb7fff01d44db9610953f967e5e67733b63623e99d3930e15f1d68019121804 |
|
MD5 | 028181af3c59e73995cf4ee69426abae |
|
BLAKE2b-256 | d112af8961964cd53f9d262032af5d754949d83062216de9852b73ed1e54535f |
File details
Details for the file scmopy-0.0.9-py3-none-any.whl
.
File metadata
- Download URL: scmopy-0.0.9-py3-none-any.whl
- Upload date:
- Size: 19.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6b30de367835bf5ec372e8d19a7a95a1fcdb4d2072952b107172d9597070eec2 |
|
MD5 | 1903783b26e2672d7510e0d62b9fa2a3 |
|
BLAKE2b-256 | 7d267d47c7b44ee16a0a8a24891859f2870c51eb557d57bb40d1e9017dffef27 |