Causing: CAUSal INterpretation using Graphs

Causing is a multivariate graphical analysis tool helping you to interpret the causal effects of a given equation system. Get a nice colored graph and immediately understand the causal effects between the variables.

Input: You simply provide a dataset and an equation system in the form of a Python function. The endogenous variables on the left-hand side are assumed to be caused by the variables on the right-hand side of each equation. Thus, you provide the causal structure in the form of a directed acyclic graph (DAG).

Output: As an output you will get a colored graph of quantified effects acting between the model variables. You are able to immediately interpret mediation chains for every individual observation - even for highly complex nonlinear systems.

Further, the method enables model validation. The effects are estimated using a structural neural network. You can check whether your assumed model fits the data. Testing for significance of each individual effect guides you in how to modify and further develop the model. The method can be applied to highly latent models with many of the modeled endogenous variables being unobserved.

Here is a table relating Causing to other approaches:

Causing is	Causing is NOT
causal model given	causal search
DAG directed acyclic graph	cyclic, undirected or bidirected graph
latent variables	just observed / manifest variables
individual effects	just average effects
direct, total and mediation effects	just total effects
linear algebra effect formulas	iterative do-calculus rules
local identification via ridge regression	check of global identification rules
one regression for all effects	individual counterfactual analysis
structural model	reduced model
small data	big data requirement
supervised learning	unsupervised learning
minimizing sum of squared errors	fitting covariance matrix
model estimation plus validation	just model estimation
graphical results	just numerical results
XAI explainable AI	black box neural network

The Causing approach is quite flexible. The most severe restriction certainly is that you need to specify the causal model / causal ordering. If you know the causal ordering but not the specific equations, you can let the Causing model estimate a linear relationship. Just plug in sensible starting values.

Further, exogenous variables are assumed to be observed and deterministic. Endogenous variables, in contrast, may be manifest or latent, and they may have correlated error terms. Error terms are not modeled explicitly; they are dealt with automatically in the regression / backpropagation estimation.

Introduction Video

This 5 minute introductory video gives you a short overview and a real data example:

See Causing_Introduction_Video

Software

Causing is free software written in Python 3. It makes use of PyTorch for automatic computation of total derivatives and SymPy for partial algebraic derivatives. Graphs are generated using Graphviz.

See dependencies in setup.py.

Effects

Causing provides direct, total and mediation effects. Using the given equation system, they are computed for individual observations and their total. Also, the average effects are estimated by fitting to the observed data. The respective effects are abbreviated as:

Effects	Direct	Total	Mediation
Average effects	ADE	ATE	AME
Estimated effects	EDE	ETE	EME
Individual effects	IDE	ITE	IME

Model Validation

To evaluate estimation, t-values are reported. To evaluate the necessity of effects, t-values with respect to zero are shown (estimated effect divided by its standard deviation). These t-values are expected to be significant, i.e. larger than two in absolute value. Insignificant effects could indicate possible model simplifications.

To evaluate the validity of the hypothesized model, t-values with respect to the hypothesized average model effects are used (estimated effect minus average effect, divided by its standard deviation). In this case, significant deviations could suggest a model refinement.

t-values	Direct	Total	Mediation
t-values wrt. zero	ED0	ET0	EM0
t-values wrt. model	ED1	ET1	EM1
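
The two kinds of t-values described above can be sketched in a few lines. This is a minimal illustration, not Causing's own code: the function name `t_values` and the numbers are hypothetical, chosen only to show one effect that is significant with respect to zero while staying close to the hypothesized model.

```python
def t_values(estimated, hypothesized, std_dev):
    """Return (t-value wrt. zero, t-value wrt. model) for a single effect."""
    t_zero = estimated / std_dev                    # necessity of the effect
    t_model = (estimated - hypothesized) / std_dev  # deviation from the model
    return t_zero, t_model


# Hypothetical numbers, for illustration only.
t_zero, t_model = t_values(estimated=12.5, hypothesized=12.0, std_dev=2.0)

print(abs(t_zero) > 2)   # effect differs significantly from zero
print(abs(t_model) < 2)  # no significant deviation from the hypothesized effect
```

With these inputs the effect would be kept in the model (|t| wrt. zero above two) and would not suggest a refinement (|t| wrt. model below two).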

Finally, for every equation we separately estimate a constant / bias term to quickly find possibly misspecified equations. This could be the case for significant biases.

Abstract

We propose simple linear algebra formulas for the causal analysis of equation systems. The effect of one variable on another is the total derivative. We extend them to endogenous system variables. These total effects are identical to the effects used in graph theory and its do-calculus. Further, we define mediation effects, decomposing the total effect of one variable on a final variable of interest over all its directly caused variables. This allows for an easy but in-depth causal and mediation analysis.
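
To make the idea of total effects as a linear algebra formula concrete, here is a small NumPy sketch. It relies on a standard result for acyclic systems: if `direct_yy` holds the direct effects among the endogenous variables and `direct_yx` those of the exogenous on the endogenous variables, the total effects are `(I - direct_yy)^-1` and `(I - direct_yy)^-1 @ direct_yx`. The matrix names and numbers are assumptions made for illustration, and whether Causing uses exactly this form internally is not confirmed here.

```python
import numpy as np

# Hypothetical direct effects for a system with two exogenous (x) and three
# endogenous (y) variables, evaluated at one point. Row = affected variable,
# column = causing variable.
direct_yx = np.array([
    [1.0, 0.0],   # y1 depends directly on x1
    [0.0, 1.0],   # y2 depends directly on x2
    [0.0, 0.0],   # y3 has no direct x dependence
])
direct_yy = np.array([
    [0.0, 0.0, 0.0],
    [12.0, 0.0, 0.0],  # y2 depends on y1
    [1.0, 1.0, 0.0],   # y3 depends on y1 and y2
])

identity = np.eye(3)
# Total effects among endogenous variables: (I - direct_yy)^-1.
# For an acyclic system direct_yy is nilpotent, so the inverse exists.
total_yy = np.linalg.solve(identity - direct_yy, identity)
# Total effects of exogenous on endogenous variables.
total_yx = total_yy @ direct_yx

print(total_yx)
```

The last row of `total_yx` collects every path from each x variable to y3, which is what distinguishes a total effect from a direct one.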

To estimate the given theoretical model we define a structural neural network (SNN). The network's nodes are represented by the model variables and its edge weights are given by the direct effects. Identification could be given by zero restrictions on direct effects implied by the equation model provided. Otherwise, identification is automatically achieved via ridge regression / weight decay. We choose the regularization parameter minimizing out-of-sample sum of squared errors subject to at least yielding a well conditioned positive-definite Hessian, being evaluated at the estimated direct effects.
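
The selection rule for the regularization parameter can be illustrated on a plain ridge regression. The sketch below is not the structural neural network itself: it assumes a simple linear model with objective 0.5·||y − Xw||² + 0.5·alpha·||w||², whose Hessian is XᵀX + alpha·I, and the data, the alpha grid, and the condition-number threshold `max_cond` are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 200
x1 = rng.normal(size=n_obs)
X = np.column_stack([x1, x1])  # two identical columns: not identified without a penalty
y = 3.0 * x1 + rng.normal(scale=0.5, size=n_obs)

train, test = slice(0, 150), slice(150, n_obs)
max_cond = 1e6  # hypothetical threshold for a "well conditioned" Hessian

best = None
for alpha in np.logspace(-8, 2, 21):
    # Hessian of the ridge objective on the training data.
    hessian = X[train].T @ X[train] + alpha * np.eye(X.shape[1])
    if np.linalg.cond(hessian) > max_cond:
        continue  # reject alphas that leave the problem ill-conditioned
    weights = np.linalg.solve(hessian, X[train].T @ y[train])
    # Out-of-sample sum of squared errors on the held-out data.
    sse = float(np.sum((y[test] - X[test] @ weights) ** 2))
    if best is None or sse < best["sse"]:
        best = {"alpha": alpha, "sse": sse, "weights": weights, "hessian": hessian}

print(best["alpha"], best["sse"])
```

Because the two regressors are identical, the unpenalized problem has no unique solution; the penalty makes the Hessian positive-definite, and the out-of-sample error then picks among the admissible alphas.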

Unlike classical deep neural networks, we follow a sparse and 'small data' approach. Estimation of structural direct effects is done using PyTorch and automatic differentiation tailor-made for fast backpropagation. We make use of our closed form effect formulas in order to compute mediation effects. The gradient and Hessian are also given in analytic form.

Keywords: total derivative, graphical effect, graph theory, do-Calculus, structural neural network, linear Simultaneous Equations Model (SEM), Structural Causal Model (SCM), insurance rating

Citation

The Causing approach and its formulas together with an application are given in:

Bartel, Holger (2020), "Causal Analysis - With an Application to Insurance Ratings" DOI: 10.13140/RG.2.2.31524.83848 https://www.researchgate.net/publication/339091133

Note that in this paper the mediation effects on the final variable of interest are called final effects.

Example

Assume a model defined by the equation system:

Y₁ = X₁

Y₂ = X₂ + 2 * Y₁²

Y₃ = Y₁ + Y₂.

This gives the following graphs. Some notes are in order to understand them:

- The data used consist of 200 observations. They are available for the x variables X₁ and X₂ with mean(X₁) = 3 and mean(X₂) = 2. Variables Y₁ and Y₂ are assumed to be latent / unobserved. Y₃ is assumed to be manifest / observed. Therefore 200 observations are available for Y₃.
- Average effects are based on the hypothesized model. The median values of all exogenous data are put into the given graph function, giving the corresponding endogenous values. The effects are computed at this point.
- Individual effects are also based on the hypothesized model. For each individual, however, its own exogenous data is put into the given graph function to yield the corresponding endogenous values. The effects are computed at this individual point.
- Estimated effects are based on the hypothesized model: The zero restrictions (effects being always exactly zero by model construction) are carried over and the average hypothesized effects are used as starting values. However, effects are estimated by fitting a linearized approximate model using a structural neural network. Effects are fitted by minimizing squared errors of observed endogenous variables. This corresponds to a nonlinear structural regression of Y₃ on X₁, X₂ using all 200 observations.
- Mediation effects are shown, by way of example, for the final variable of interest, assumed here to be Y₃. In the mediation graph the nodes depict the total effect of that variable on Y₃. This effect is partitioned over all outgoing edges, representing the mediation effects and thus enabling path interpretation. Note, however, that incoming edges do not sum up to the node value.
- Individual effects are shown, by way of example, for individual no. 1 out of the 200 observations. To ease their interpretation, each individual effect is multiplied by the absolute difference of its causing variable to the median of all observations. Further, we color nodes and edges, showing positive (green) and negative (red) effects these deviations have on the final variable Y₃.
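
The notes above can be traced by hand for the example system using SymPy. This is a sketch under stated assumptions, not Causing's implementation: the evaluation point X₁ = 3, X₂ = 2 is taken from the stated means (the median of an actual sample would differ slightly), and the mediation effect of an edge is computed as its direct effect times the total effect of the edge's target on Y₃, which is one decomposition consistent with the description above.

```python
from sympy import diff, symbols

X1, X2, Y1, Y2, Y3 = symbols("X1 X2 Y1 Y2 Y3")

# Example equation system, in topological order.
equations = {Y1: X1, Y2: X2 + 2 * Y1**2, Y3: Y1 + Y2}

# Assumed evaluation point for the exogenous variables.
point = {X1: 3, X2: 2}
for yvar, rhs in equations.items():
    point[yvar] = rhs.subs(point)  # solve forward through the system

# Direct effects: partial derivative of each equation wrt. each variable in it.
direct = {
    (cause, yvar): diff(rhs, cause).subs(point)
    for yvar, rhs in equations.items()
    for cause in rhs.free_symbols
}

# Total effect of every variable on the final variable Y3 (chain rule),
# walking backwards through the topological order.
total = {Y3: 1}
mediation = {}
for var in [Y2, Y1, X2, X1]:
    total[var] = 0
    for (cause, target), effect in direct.items():
        if cause == var:
            mediation[(cause, target)] = effect * total[target]
            total[var] += mediation[(cause, target)]

print(direct)     # five direct effects
print(total)      # node values of the mediation graph
print(mediation)  # edge values of the mediation graph
```

In this sketch the outgoing edges of Y₁ (12 via Y₂ and 1 directly) add up to its node value of 13, whereas the two edges arriving at Y₃ add up to 2, not to the node value of 1.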

[Figure: 3 × 3 grid of effect graphs. Columns: direct, total, and mediation effects for Y₃. Rows: average effects, estimated effects, and individual effects for individual no. 1.]

As you can see in the bottom right graph for the individual mediation effects (IME), there is one green path starting at X₁ passing through Y₁, Y₂ and finally ending in Y₃. This means that X₁ is the main cause for Y₃ taking on a value above average with its effect being +37.44. However, this positive effect is slightly reduced by X₂. In total, accounting for all exogenous and endogenous effects, Y₃ is +29.34 above average. You can understand at one glance why Y₃ is above average for individual no. 1.

The t-values corresponding to the estimated effects are also given as graphs. To assess model validation using the t-value graphs note the following:

- Estimated standard errors for the effects are derived from the Hessian. Tests and t-values are asymptotically correct, but in small samples they suffer from the effects being biased in the case of regularization.
- In this example regularization is required. The minimal regularization parameter is 0.000950 to obtain a well-posed optimization problem with a positive-definite Hessian. The optimal regularization parameter minimizing out-of-sample squared errors is 0.001545.
- The t-values with respect to zero should be larger than two in absolute value, indicating that the specified model structure indeed yields significant effects.
- The t-values with respect to the hypothesized model effects should be smaller than two in absolute value, indicating that there is no severe deviation between model and data.
- For the mediation t-value graphs EM0 and EM1 the outgoing edges do not sum up to the node they leave. In the EM0 graph all outgoing edges are even identical to that node, because effects and standard deviations are partitioned in the same way over the outgoing edges and thus cancel out in the t-values. However, this is not true for the EM1 graph, since different partitioning schemes are used for the estimated and the subtracted hypothesized model effects.

[Figure: 2 × 3 grid of t-value graphs. Columns: direct, total, and mediation for Y₃. Rows: t-values wrt. zero and t-values wrt. model.]

The t-values with respect to zero show that just some of the estimated effects are significant. This could be due to the small sample size. In this example we estimate five direct effects from 200 observations with the only observable endogenous variable being Y₃.

None of the t-values with respect to the hypothesized model values is significant. This means that the specified model fits well to the observed data.

Biases are estimated for each endogenous variable. Estimation is done at the point of average effects implied by the specified model. That is, possible model misspecifications are captured by a single bias, one at a time. Biases therefore are just one simple way to detect wrong modeling assumptions.

Variable	Bias value	Bias t-value
Y₁	0.00	0.64
Y₂	0.06	0.55
Y₃	0.06	0.55

In our example none of the biases is significant, further supporting correctness of model specification.

A Real World Example

To dig a bit deeper, here we have a real world example from social sciences. We analyze how the wage earned by young American workers is determined by their educational attainment, family characteristics, and test scores.

See education.md

Start your own Model

After cloning or downloading the Causing repository, run python -m causing.examples example. You will then find the example results described above in the output folder. The results are saved in SVG files.

See causing/examples for the code generating these examples.

To run a model, you have to provide the following information, as done in the example code below:

Define all your model variables as SymPy symbols.
Note that in SymPy some operators are special, e.g. Max() instead of max().
Provide the model equations in topological order, that is, in order of computation.
Then the model is specified with:
- xvars: exogenous variables
- yvars: endogenous variables in topological order
- equations: previously defined equations
- final_var: the final variable of interest used for mediation effects
To simulate data, we have to provide simulation parameters as in:
- ymvars: manifest / observed endogenous variables
- xmean_true: mean of exogenous data
- sigx_theo: true scalar error variance of xvars
- sigym_theo: true scalar error variance of ymvars
- rho: true correlation within y and within x vars
- tau: no. of simulated observations
In estimate_input, the inputs to be used for estimation, further specify
- ymvars: manifest endogenous variables
- ymdat: manifest endogenous data
- estimate_bias: estimate equation biases, for model validation
- alpha: regularization parameter, is estimated if None
- dof: effective degrees of freedom, corresponding to alpha

In the example case the python SymPy function looks like this:

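# Import paths as used in the Causing repository's example code;
# check them against your installed version.
from sympy import symbols

from causing.model import Model
from causing.simulate import SimulationParams, simulate


def example():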
    """model example"""

    X1, X2, Y1, Y2, Y3 = symbols(["X1", "X2", "Y1", "Y2", "Y3"])
    equations = (               # equations in topological order (Y1, Y2, ...)
        X1,
        X2 + 2 * Y1 ** 2,
        Y1 + Y2,
    )
    m = Model(
        xvars=[X1, X2],         # exogenous variables in desired order
        yvars=[Y1, Y2, Y3],     # endogenous variables in topological order
        equations=equations,
        final_var=Y3,           # final variable of interest, for mediation analysis
    )

    ymvars = [Y3]               # manifest endogenous variables
    xdat, ymdat = simulate(
        m,
        SimulationParams(
            ymvars=ymvars,
            xmean_true=[3, 2],  # mean of exogenous data
            sigx_theo=1,        # true scalar error variance of xvars
            sigym_theo=1,       # true scalar error variance of ymvars
            rho=0.2,            # true correlation within y and within x vars
            tau=200,            # nr. of simulated observations
        ),
    )

    estimate_input = dict(
        ymvars=ymvars,
        ymdat=ymdat,
        estimate_bias=True,     # estimate equation biases, for model validation
        alpha=None,             # regularization parameter, is estimated if None
        dof=None,               # effective degrees of freedom, corresponding to alpha
    )

    return m, xdat, ymdat, estimate_input

The files ADE, ATE, AME contain the average effects based on the median xdat observation. The files EDE, ETE, EME contain the estimated effects, obtained using the observed endogenous data ymdat.

The graphs ED0, ET0, EM0 contain the t-value graphs with respect to zero and the graphs ED1, ET1, EM1 contain the t-value graphs with respect to the hypothesized model.

The files IDE, ITE, IME show the individual effects for the respective individual.

Award

RealRate's AI software Causing is a winner of PyTorch AI Hackathon.

October 2020: We are very happy to announce that the RealRate AI software was announced a winner of the PyTorch Summer Hackathon 2020 in the Responsible AI category. This is quite an honor given that more than 2500 teams submitted their projects.

devpost.com/software/realrate-explainable-ai-for-company-ratings.

Causing means CAUSal INterpretation using Graphs. Causing is a tool for Explainable AI (XAI). We explain causality and ensure fair treatment.

The software is developed by RealRate, an AI rating agency aiming to re-invent the ratings market by using AI, interpretability and avoiding any conflict of interest. See www.realrate.ai.

License

Causing is available under MIT license. See LICENSE.

Consulting

If you need help with your project, please contact me. I could perform the data analytics or adapt the software to your special needs.

Dr. Holger Bartel
RealRate GmbH
Cecilienstr. 14, D-12307 Berlin
holger.bartel@realrate.ai
Phone: +49 160 957 90
www.realrate.ai
