Tools for causal inference
# Causality
This package contains tools for causal analysis using observational (rather than experimental) datasets. The main documentation is on the GitHub page at [github.com/akelleh/causality](https://github.com/akelleh/causality).
## Installation
Assuming you have pip installed, just run
```
pip install causality
```
## [Causal Analysis](https://github.com/akelleh/causality/tree/master/causality/analysis)
The simplest interface to this package is probably through the `CausalDataFrame` object in [`causality.analysis.CausalDataFrame`](https://github.com/akelleh/causality/blob/master/causality/analysis/dataframe.py#L8). This is just an extension of the `pandas.DataFrame` object, and so it inherits the same methods.
The `CausalDataFrame` currently supports two kinds of causal analysis. First, it has a `CausalDataFrame.zmean` method. This method lets you control for a set of variables, `z`, when you're trying to estimate the effect of a discrete variable `x` on a continuous variable `y`. It can return the `y` estimate at each `x` value and can also provide bootstrap error bars. For more details, check out the readme [here](https://github.com/akelleh/causality/tree/master/causality/analysis).
The second kind of analysis supported is plotting to show the effect of a discrete or continuous `x` on a continuous `y` while controlling for `z`. You can do this with the `CausalDataFrame.zplot` method. For details, check out the readme [here](https://github.com/akelleh/causality/tree/master/causality/analysis).
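To make "controlling for `z`" concrete, here is a minimal sketch of the idea using only NumPy. It is not the package's `zmean` implementation: the helper names `adjusted_means` and `bootstrap_effect`, the toy data, and the restriction to a discrete `z` are all assumptions made for illustration. It averages `y` within each `(x, z)` cell and reweights by the overall distribution of `z`, then bootstraps the resulting effect estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Toy data (hypothetical): z confounds both x and y; the true effect of x on y is 1.0.
z = rng.integers(0, 3, size=n)                      # discrete confounder in {0, 1, 2}
x = (rng.random(n) < 0.2 + 0.3 * z).astype(int)     # treatment is more likely at high z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)

def adjusted_means(x, y, z):
    """Return {x value: sum_z E[y | x, z] * P(z)} for discrete x and z.

    Assumes every (x, z) cell contains at least one observation.
    """
    result = {}
    for xv in np.unique(x):
        total = 0.0
        for zv in np.unique(z):
            in_stratum = z == zv
            cell = in_stratum & (x == xv)
            total += y[cell].mean() * in_stratum.mean()
        result[int(xv)] = total
    return result

def bootstrap_effect(x, y, z, n_boot=200):
    """Percentile bootstrap interval (2.5%, 97.5%) for the adjusted x=1 vs x=0 effect."""
    effects = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))   # resample rows with replacement
        means = adjusted_means(x[idx], y[idx], z[idx])
        effects.append(means[1] - means[0])
    return np.percentile(effects, [2.5, 97.5])

means = adjusted_means(x, y, z)
print("naive difference:   ", y[x == 1].mean() - y[x == 0].mean())
print("adjusted difference:", means[1] - means[0])
print("bootstrap interval: ", bootstrap_effect(x, y, z))
```

In this toy setup the naive difference is inflated by the confounder, while the adjusted difference should land near the true effect of 1.0.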
## Measuring Causal Effects
The [`causality.estimation`](https://github.com/akelleh/causality/tree/master/causality/estimation) module contains tools for estimating causal effects from observational and experimental data. Most tools are parametric, like `PropensityScoreMatching`, and can be found in `causality.estimation.parametric`. Other models are nonparametric, and rely on directly estimating densities and using the g-estimation approach.
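For readers unfamiliar with propensity score matching, the following is a rough, self-contained sketch of the technique in NumPy. It does not use or reproduce the package's `PropensityScoreMatching` class. The toy data, the helper `fit_propensity`, and the choice of one-nearest-neighbor matching with replacement are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Toy data (hypothetical): z drives both treatment d and outcome y; true effect of d is 1.0.
z = rng.normal(size=n)
d = (rng.random(n) < 1.0 / (1.0 + np.exp(-z))).astype(float)
y = 1.0 * d + 2.0 * z + rng.normal(size=n)

def fit_propensity(z, d, n_iter=25):
    """Estimate P(d = 1 | z) with logistic regression fitted by Newton's method."""
    X = np.column_stack([np.ones_like(z), z])        # intercept + covariate
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (d - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hessian, gradient)
    return 1.0 / (1.0 + np.exp(-X @ beta))

scores = fit_propensity(z, d)
treated = d == 1
control = ~treated

# Match each treated unit to the control unit with the closest propensity score
# (one nearest neighbor, with replacement).
distance = np.abs(scores[treated][:, None] - scores[control][None, :])
matches = distance.argmin(axis=1)

# Average treatment effect on the treated, estimated from the matched pairs.
att = np.mean(y[treated] - y[control][matches])
naive = y[treated].mean() - y[control].mean()
print("naive difference:", naive)
print("matched estimate:", att)
```

The matched estimate should sit much closer to the true effect of 1.0 than the naive difference, which absorbs the confounding through `z`.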
## DAG Inference
The `causality.inference` module will contain various algorithms for inferring causal DAGs. Currently (2016/01/23), the only algorithm implemented is the IC\* algorithm from Pearl (2000). It has decent test coverage, but feel free to write some more! I've left some stubs in `tests/unit/test_IC.py`.
To run a graph search on a dataset, you can use the algorithms as follows (using IC\* as an example):
```python
import numpy
import pandas as pd
from causality.inference.search import IC
from causality.inference.independence_tests import RobustRegressionTest
# generate some toy data:
SIZE = 2000
x1 = numpy.random.normal(size=SIZE)
x2 = x1 + numpy.random.normal(size=SIZE)
x3 = x1 + numpy.random.normal(size=SIZE)
x4 = x2 + x3 + numpy.random.normal(size=SIZE)
x5 = x4 + numpy.random.normal(size=SIZE)
# load the data into a dataframe:
X = pd.DataFrame({'x1' : x1, 'x2' : x2, 'x3' : x3, 'x4' : x4, 'x5' : x5})
# define the variable types: 'c' is 'continuous'. The variables defined here
# are the ones the search is performed over, NOT all the variables defined
# in the data frame.
variable_types = {'x1' : 'c', 'x2' : 'c', 'x3' : 'c', 'x4' : 'c', 'x5' : 'c'}
# run the search
ic_algorithm = IC(RobustRegressionTest)
graph = ic_algorithm.search(X, variable_types)
```
Now, we have the inferred graph stored in `graph`. In this graph, each variable is a node (named from the DataFrame columns), and each edge represents statistical dependence between the nodes that can't be eliminated by conditioning on the variables specified for the search. If an edge can be oriented with the data available, the arrowhead is indicated in `'arrows'`. If the edge also satisfies the local criterion for genuine causation, then that directed edge will have `marked=True`. If we print the edges from the result of our search, we can see which edges are oriented, and which satisfy the local criterion for genuine causation:
```python
>>> graph.edges(data=True)
[('x2', 'x1', {'arrows': [], 'marked': False}),
('x2', 'x4', {'arrows': ['x4'], 'marked': False}),
('x3', 'x1', {'arrows': [], 'marked': False}),
('x3', 'x4', {'arrows': ['x4'], 'marked': False}),
('x4', 'x5', {'arrows': ['x5'], 'marked': True})]
```
We can see the edges from `'x2'` to `'x4'`, `'x3'` to `'x4'`, and `'x4'` to `'x5'` are all oriented toward the second of each pair. Additionally, we see that the edge from `'x4'` to `'x5'` satisfies the local criterion for genuine causation. This matches the structure given in figure `2.3(d)` in Pearl (2000).
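The search is driven by conditional independence tests. As a rough illustration of what such a test detects on this toy data, the sketch below uses plain partial correlations computed with NumPy. This is a simpler stand-in, not the package's `RobustRegressionTest`, and the helper name `partial_correlation` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE = 2000

# Same structure as the toy data above: x1 -> x2, x1 -> x3, and x2, x3 -> x4.
x1 = rng.normal(size=SIZE)
x2 = x1 + rng.normal(size=SIZE)
x3 = x1 + rng.normal(size=SIZE)
x4 = x2 + x3 + rng.normal(size=SIZE)

def partial_correlation(a, b, controls=()):
    """Correlation of a and b after linearly regressing the controls out of both."""
    design = np.column_stack([np.ones_like(a), *controls])
    resid_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    resid_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return np.corrcoef(resid_a, resid_b)[0, 1]

print("x2, x3 marginally:  ", partial_correlation(x2, x3))
print("x2, x3 given x1:    ", partial_correlation(x2, x3, [x1]))
print("x2, x3 given x1, x4:", partial_correlation(x2, x3, [x1, x4]))
```

`x2` and `x3` are correlated marginally, become approximately independent once `x1` is controlled for (so no edge is kept between them), and become dependent again when the common effect `x4` is also conditioned on. That collider pattern is what allows the edges into `'x4'` to be oriented.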
## Nonparametric Effects Estimation
The `causality.estimation.nonparametric` module contains a tool for nonparametrically estimating a causal distribution from an observational data set. You can supply an "admissible set" of variables to control for (spelled `admissable_set` in the API), and then measure either the causal effect distribution of an effect given the cause, or the expected value of the effect given the cause.
I've recently added adjustment for direct causes, where you can estimate the causal effect of fixing a set of X variables on a set of Y variables by adjusting for the parents of X in your graph. Using the dataset above, you can run this as follows:
```python
from causality.estimation.adjustments import AdjustForDirectCauses
from networkx import DiGraph
g = DiGraph()
g.add_nodes_from(['x1','x2','x3','x4', 'x5'])
g.add_edges_from([('x1','x2'),('x1','x3'),('x2','x4'),('x3','x4')])
adjustment = AdjustForDirectCauses()
```
Then, you can see the set of variables being adjusted for with:
```python
>>> print(adjustment.admissable_set(g, ['x2'], ['x3']))
{'x1'}
```
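The adjustment set here is just the set of direct causes (parents) of the intervened variables. A dependency-free sketch of that lookup, using a plain list of `(cause, effect)` pairs instead of a `networkx` graph, might look like this. The helper name `parents` is hypothetical and is not part of the package.

```python
def parents(edges, nodes):
    """Return the direct causes of `nodes` in a DAG given as (cause, effect) pairs."""
    targets = set(nodes)
    # Collect every cause that points into a target, excluding the targets themselves.
    return {cause for cause, effect in edges if effect in targets} - targets

edges = [('x1', 'x2'), ('x1', 'x3'), ('x2', 'x4'), ('x3', 'x4')]

print(parents(edges, ['x2']))           # {'x1'}
print(sorted(parents(edges, ['x4'])))   # ['x2', 'x3']
```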
If we hadn't adjusted for `'x1'`, we would have incorrectly found that `'x2'` had a causal effect on `'x3'` due to the confounding pathway `x2 <- x1 -> x3`. Adjustment for `'x1'` removes this bias.
You can see the causal effect of intervention, `P(x3|do(x2))`, using the admissible set computed by `adjustment`:
```python
>>> from causality.estimation.nonparametric import CausalEffect
>>> admissable_set = adjustment.admissable_set(g, ['x2'], ['x3'])
>>> effect = CausalEffect(X, ['x2'], ['x3'], variable_types=variable_types, admissable_set=list(admissable_set))
>>> x = pd.DataFrame({'x2' : [0.], 'x3' : [0.]})
>>> effect.pdf(x)
0.268915603296
```
This is close to the correct value of `0.282` for a Gaussian with mean 0 and variance 2. If you change the value of `'x2'`, you'll find that the probability of `'x3'` doesn't change. This is not true of the conditional distribution, `P(x3|x2)`, since in this case observation and intervention are not equivalent.
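The reference values can be checked analytically for this toy model. Under intervention, `x3 = x1 + noise` is Gaussian with mean 0 and variance 2 whatever `x2` is set to. Under observation, standard Gaussian conditioning gives `x3 | x2` a mean of `x2 / 2` and a variance of `3 / 2`. The sketch below evaluates both densities with the standard library only. The function names are hypothetical and it does not call the package.

```python
import math

def normal_pdf(value, mean, variance):
    """Density of a univariate Gaussian at `value`."""
    return math.exp(-(value - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def interventional_pdf(x3, x2):
    # P(x3 | do(x2)): setting x2 does not affect x1 or x3, so x3 ~ N(0, 2).
    return normal_pdf(x3, 0.0, 2.0)

def conditional_pdf(x3, x2):
    # P(x3 | x2): observing x2 is informative about x1, so x3 ~ N(x2 / 2, 3 / 2).
    return normal_pdf(x3, x2 / 2.0, 1.5)

for x2 in (0.0, 2.0):
    print(x2, round(interventional_pdf(0.0, x2), 3), round(conditional_pdf(0.0, x2), 3))
# 0.0 0.282 0.326
# 2.0 0.282 0.233
```

The interventional density at `x3 = 0` stays at `0.282` as `x2` changes, while the conditional density moves with `x2`.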
## Other Notes
This repository is in its early phases. The runtime for the tests is long. Many optimizations and additions are planned for the near future, including:
* Implement fast mutual information calculation, O(N log N)
* Speed up integrating out variables for controlling
* Take a user-supplied graph, and find the set of admissible sets
* Front-door criterion method for determining causal effects
Pearl, Judea. _Causality_. Cambridge University Press, 2000.