A package for causal discovery and causal inference algorithms

MagPy: Causal Discovery and Effect Estimation Framework

MagPy is a Python framework for causal discovery and effect estimation, aimed at uncovering causal relationships in observational data and estimating the impact of interventions and counterfactuals.

This is an experimental project, currently under active development.

Installation

pip install causal-magpy

or

git clone https://github.com/ergodic-ai/magpy.git
cd magpy
pip install .

or

pip install git+https://github.com/ergodic-ai/magpy.git

1. AStarSearch

The AStarSearch class implements the A* search algorithm for Bayesian network structure learning. The goal is to decouple the scoring function from the search algorithm.

A simple graph search using the BIC score:

import pandas as pd
from magpy.search.astar import AStarSearch, bic_score_node

data: pd.DataFrame = ...

astar = AStarSearch(data)
astar.run_scoring(func=bic_score_node, parallel=False)
result = astar.search()

Or a more complex search using a polynomial scoring function:

from typing import Optional
import numpy
from sklearn.preprocessing import PolynomialFeatures


def bic_score_node_poly(
    y: numpy.ndarray,
    X: Optional[numpy.ndarray] = None,
    node: str | None = None,
    parent_set: set | None = None,
    degree=3,
    include_bias=True,
):
    n = len(y)

    if X is None:
        residual = numpy.sum(y**2)
        dof = 0

    else:
        Xf = PolynomialFeatures(degree=degree, include_bias=include_bias).fit_transform(
            X
        )
        n, dof = Xf.shape
        _, residual, _, _ = numpy.linalg.lstsq(a=Xf, b=y, rcond=None)

    bic = n * numpy.log(residual / n) + dof * numpy.log(n)
    return bic.item()

astar = AStarSearch(data)
# Pass the polynomial scoring function defined above
astar.run_scoring(func=bic_score_node_poly, parallel=False)
result = astar.search()

The philosophy behind this algorithm is that:

  • You know the data! And that should be reflected in your choice of scoring function. As long as it adheres to the above interface, it will work.
  • The algorithm executes the search.

2. Skeleton Learning

As part of the causal discovery pipeline, we need to learn the skeleton of the graph. This is typically done using the PC algorithm. The PC algorithm is essentially a wrapper around a conditional independence test, which we'll call an oracle.

An oracle is an object that tests whether a certain variable X is independent of Y given a set of covariates Z: $(X \perp Y | Z)$.
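To make the idea concrete, here is a sketch of one classical conditional independence test: partial correlation with Fisher's z-transform. This is not necessarily the test MagPy's oracles run; the function name partial_corr_pvalue is hypothetical, and the test assumes linear relationships with roughly Gaussian noise.

```python
import math
import numpy


def partial_corr_pvalue(x, y, Z=None):
    """Two-sided p-value for the null hypothesis that x is independent of y given Z."""
    n = len(x)
    design = numpy.ones((n, 1))  # intercept column
    if Z is not None:
        design = numpy.column_stack([design, Z])

    # Remove the linear effect of Z from both variables
    res_x = x - design @ numpy.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ numpy.linalg.lstsq(design, y, rcond=None)[0]

    # Correlation of the residuals is the partial correlation
    r = float(numpy.corrcoef(res_x, res_y)[0, 1])
    r = min(max(r, -0.999999), 0.999999)  # keep the log finite

    k = design.shape[1] - 1  # number of conditioning variables
    stat = math.sqrt(n - k - 3) * 0.5 * math.log((1 + r) / (1 - r))
    return math.erfc(abs(stat) / math.sqrt(2))  # standard normal, two-sided


# z is a common cause of x and y
rng = numpy.random.default_rng(0)
z = rng.normal(size=1000)
x = z + rng.normal(size=1000)
y = z + rng.normal(size=1000)

p_marginal = partial_corr_pvalue(x, y)
p_conditional = partial_corr_pvalue(x, y, z.reshape(-1, 1))
print(p_marginal, p_conditional)
```

Here x and y are strongly dependent on their own, so the marginal p-value is tiny, while conditioning on the common cause z removes that dependence.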

We currently support two oracles:

  • MixedDataOracle, which can handle continuous, binary, and categorical variables, and assumes linear relationships.
  • BaseOracle, which is the base class to implement your own oracle.

2.1 BaseOracle

Similarly to the AStar algorithm, our objective with the BaseOracle is to decouple the underlying "learner" from the hypothesis testing.

from magpy.oracles.oracles import BaseOracle, linear
import numpy
import pandas

z = numpy.random.randn(1000)
x = z + numpy.random.randn(1000) * 0.1
y = x + numpy.random.randn(1000) * 0.1

df = pandas.DataFrame({"x": x, "y": y, "z": z})


oracle = BaseOracle(df, threshold=0.05, learner=linear)

print("linear: ")
print("independent: ", oracle("y", "x", ["z"]))
print("pvalue: ", oracle._run("y", "x", ["z"]))

The learner is a function that accepts X and y and returns the residual sum of squares (RSS) of a regression and the number of degrees of freedom in the model.
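To show why a learner returns both quantities, here is a sketch of one standard way to combine them: fit a model without the tested variable and a model with it, then compare the two RSS values with an F statistic. Whether BaseOracle uses this exact test is an assumption; linear_rss and f_statistic are hypothetical names, and the sketch works on NumPy arrays rather than pandas objects.

```python
import numpy


def linear_rss(X, y):
    """Return (rss, p) for a linear regression of y on X with an intercept."""
    n = len(y)
    design = numpy.ones((n, 1))
    if X is not None:
        design = numpy.column_stack([design, X])
    coef = numpy.linalg.lstsq(design, y, rcond=None)[0]
    rss = float(numpy.sum((y - design @ coef) ** 2))
    return rss, design.shape[1]


def f_statistic(rss_small, p_small, rss_full, p_full, n):
    """F statistic comparing a nested (small) model against the full model."""
    numerator = (rss_small - rss_full) / (p_full - p_small)
    denominator = rss_full / (n - p_full)
    return numerator / denominator


rng = numpy.random.default_rng(0)
n = 1000
z = rng.normal(size=n)
x = z + 0.1 * rng.normal(size=n)
y = x + 0.1 * rng.normal(size=n)

# Does x still explain y once z is accounted for?
rss_small, p_small = linear_rss(z.reshape(-1, 1), y)
rss_full, p_full = linear_rss(numpy.column_stack([z, x]), y)
f_stat = f_statistic(rss_small, p_small, rss_full, p_full, n)
print(f_stat)
```

A large F statistic means that adding x reduces the RSS by much more than the extra parameter would by chance, which is evidence against independence of y and x given z.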

Here's an example of how to implement a learner based on polynomial regression:

from typing import Optional, Union
import pandas
import numpy
from sklearn.preprocessing import PolynomialFeatures


def poly_rss(
    X: Union[pandas.DataFrame, None],
    y: pandas.Series,
    node: Optional[str] = None,
    parent_set: Optional[set] = None,
    degree: int = 3,
):
    """Perform polynomial regression and return residual sum of squares and degrees of freedom.

    Args:
        X (Union[pandas.DataFrame, None]): Feature matrix. If None, only intercept is used.
        y (pandas.Series): Target variable.
        node (Optional[str], optional): Node name, not used but included for API compatibility. Defaults to None.
        parent_set (Optional[set], optional): Parent set, not used but included for API compatibility. Defaults to None.
        degree (int, optional): Degree of polynomial features. Defaults to 3.

    Returns:
        tuple: (rss, p)
            rss (float): Residual sum of squares from polynomial regression
            p (int): Number of parameters (degrees of freedom) in the model
    """

    if X is None:
        X_values = numpy.ones(shape=(y.shape[0], 1))

    else:
        X_values = X.values
        X_values = PolynomialFeatures(degree=degree).fit_transform(X_values)

    y_values: numpy.ndarray = y.values  # type: ignore

    _, [rss], _, _ = numpy.linalg.lstsq(X_values, y_values, rcond=None)

    p = X_values.shape[1]

    return rss, p

Using this learner, we can now model more complex relationships:

from magpy.oracles.oracles import BaseOracle, linear
import numpy
import pandas

# poly_rss is the learner defined in the previous snippet

z = numpy.random.randn(1000)
x = z**2 + numpy.random.randn(1000) * 0.1
y = x**2 + numpy.random.randn(1000) * 0.1

df = pandas.DataFrame({"x": x, "y": y, "z": z})


oracle = BaseOracle(df, threshold=0.05, learner=linear)

print("linear: ")
print("independent: ", oracle("y", "x", ["z"]))
print("pvalue: ", oracle._run("y", "x", ["z"]))



oracle = BaseOracle(df, threshold=0.05, learner=poly_rss)

print("polynomial: ")
print("independent: ", oracle("y", "x", ["z"]))
print("pvalue: ", oracle._run("y", "x", ["z"]))

Again, our philosophy is that you know your data best, and you should be able to implement a learner that best captures the relationship you are interested in.

2.2 MixedDataOracle

We developed this oracle because dealing with mixed data types is a pain. One-hot encoding and praying isn't necessarily a good idea, and this provides a quick way to handle this with some science behind it.

This is loosely based on the work of Tsagris et al.

Here's a silly example:

from magpy.oracles.mixed import MixedDataOracle
import pandas
import numpy

z = numpy.random.randn(1000)
x = z + numpy.random.randn(1000)
y = z + numpy.random.randn(1000)
y_d = [str(int(elm.clip(-2, 2))) for elm in y]

df = pandas.DataFrame({"x": x, "y": y_d, "z": z})

oracle = MixedDataOracle(df, threshold=0.05)
print("Independent: ", oracle("y", "x", ["z"]))
print("pvalue: ", oracle._run("y", "x", ["z"]))

The oracle automatically tags variables as continuous or binary/categorical based on the data. If you want an integer column to be treated as categorical, make sure to cast it to a string or object dtype before creating the oracle.
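The following sketch illustrates how dtype-based tagging of this kind could work and why the cast matters. The rule used here (numeric, non-boolean columns are continuous; everything else is categorical) is an assumption for illustration and may differ from what MagPy actually does; tag_columns is a hypothetical helper.

```python
import pandas
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def tag_columns(df: pandas.DataFrame) -> dict:
    """Tag each column as 'continuous' or 'categorical' from its dtype."""
    tags = {}
    for column in df.columns:
        dtype = df[column].dtype
        if is_numeric_dtype(dtype) and not is_bool_dtype(dtype):
            tags[column] = "continuous"
        else:
            tags[column] = "categorical"
    return tags


df = pandas.DataFrame({"age": [31, 45, 27], "grade": [1, 2, 3]})
tags_before = tag_columns(df)

# Cast the integer-coded category to string so it is no longer numeric
df["grade"] = df["grade"].astype(str)
tags_after = tag_columns(df)

print(tags_before)
print(tags_after)
```

Before the cast both columns are tagged as continuous; after it, grade is tagged as categorical while age is unchanged.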

2.3 PC Algorithm

We haven't implemented the full PC algorithm yet. Our goal is to separate it into its components:

  1. Skeleton search
  2. V-structures detection
  3. Further edge orientation

For now let's stick to skeleton search:

from magpy.search.pcskeleton import pc_skeleton
from magpy.oracles.oracles import BaseOracle, linear, cubic
import pandas
from typing import Callable



def pc_skeleton_magpy(
    X: pandas.DataFrame,
    learner: Callable = linear,
    intersection_or_union: str = "union",
):
    oracle = BaseOracle(X, threshold=0.05, learner=learner)
    skeleton, sepsets = pc_skeleton(
        oracle, X.columns, intersection_or_union=intersection_or_union
    )
    return skeleton

There are a number of niceties inside the PC skeleton implementation; we'll update the documentation soon to expose them. If you are working with continuous data, we strongly recommend composing the PC skeleton algorithm with a direct search method for orientation.
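For readers unfamiliar with the skeleton phase, here is a simplified, library-independent sketch of it. It starts from a complete undirected graph and removes an edge as soon as the oracle finds a conditioning set that separates its endpoints. This is not MagPy's pc_skeleton (it has no intersection_or_union option, for instance), and simple_pc_skeleton and chain_oracle are hypothetical names.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Set, Tuple


def simple_pc_skeleton(
    nodes: List[str],
    is_independent: Callable[[str, str, List[str]], bool],
):
    """Return (edges, sepsets) found by a basic PC-style skeleton search."""
    # Start fully connected
    adjacency: Dict[str, Set[str]] = {n: set(nodes) - {n} for n in nodes}
    sepsets: Dict[FrozenSet[str], Tuple[str, ...]] = {}

    size = 0  # size of the conditioning sets tried in this pass
    while any(len(adjacency[n]) - 1 >= size for n in nodes):
        for x in nodes:
            for y in sorted(adjacency[x]):
                # Condition on subsets of x's other neighbours
                candidates = sorted(adjacency[x] - {y})
                for cond in combinations(candidates, size):
                    if is_independent(x, y, list(cond)):
                        adjacency[x].discard(y)
                        adjacency[y].discard(x)
                        sepsets[frozenset((x, y))] = cond
                        break
        size += 1

    edges = {frozenset((x, y)) for x in nodes for y in adjacency[x]}
    return edges, sepsets


def chain_oracle(x: str, y: str, cond: List[str]) -> bool:
    """Toy oracle for the chain a -> b -> c: a and c are independent given b."""
    return {x, y} == {"a", "c"} and "b" in cond


edges, sepsets = simple_pc_skeleton(["a", "b", "c"], chain_oracle)
print(sorted(tuple(sorted(edge)) for edge in edges))
print(sepsets)
```

With the toy oracle, the a-c edge is removed once b is tried as a conditioning set, leaving the a-b and b-c edges. In practice the oracle would be a statistical test such as the ones in section 2.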

Composing

Our goal here is to allow composition of the different parts of the causal discovery pipeline. For instance, this is how you can run a PC skeleton search and then restrict an A* search to that skeleton, with known edges passed in as priors:

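from magpy.search.pcskeleton import pc_skeleton
from magpy.search.astar import AStarSearch, bic_score_node_poly
from magpy.oracles.oracles import cubic
import pandas
from typing import Callable

# pc_skeleton_magpy is the helper defined in section 2.3.
# fix_colinearity is not defined in this README; supply your own helper.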


def full_composite_search(
    X: pandas.DataFrame,
    learner_pc: Callable = cubic,
    learner_astar: Callable = bic_score_node_poly,
    intersection_or_union: str = "union",
    force=True,
):
    # Fix colinearity
    fix_colinearity(X)

    skeleton = pc_skeleton_magpy(
        X, intersection_or_union=intersection_or_union, learner=learner_pc
    )

    priors = skeleton.copy() * 0
    priors.loc["known_parent", "known_child"] = 1

    astar = AStarSearch(X, super_graph=skeleton, include_graph=priors)
    astar.run_scoring(parallel=False, func=learner_astar, verbose=False)
    y_df = astar.search()

    return y_df
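The benefit of this composition is a smaller search space. The sketch below illustrates that idea under an assumption about how the two matrices are interpreted: rows are parents and columns are children (as the priors.loc["known_parent", "known_child"] line suggests), a non-zero skeleton entry allows an edge, and a non-zero prior entry forces one. The helper candidate_parent_sets is hypothetical and is not MagPy's internal logic.

```python
from itertools import combinations

import pandas


def candidate_parent_sets(skeleton, priors, node):
    """List the parent sets of `node` allowed by the skeleton and the priors."""
    allowed = [
        p for p in skeleton.index if p != node and skeleton.loc[p, node] != 0
    ]
    required = {p for p in priors.index if priors.loc[p, node] != 0}
    optional = [p for p in allowed if p not in required]

    parent_sets = []
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            parent_sets.append(required | set(extra))
    return parent_sets


nodes = ["a", "b", "c"]

# Undirected skeleton a - b - c, stored as a symmetric adjacency matrix
skeleton = pandas.DataFrame(0, index=nodes, columns=nodes)
for u, v in [("a", "b"), ("b", "c")]:
    skeleton.loc[u, v] = 1
    skeleton.loc[v, u] = 1

# Prior knowledge: a is a parent of b
priors = skeleton.copy() * 0
priors.loc["a", "b"] = 1

sets_for_b = candidate_parent_sets(skeleton, priors, "b")
sets_for_c = candidate_parent_sets(skeleton, priors, "c")
print(sets_for_b)
print(sets_for_c)
```

Without the skeleton or the prior, each node in this three-variable example would have four candidate parent sets to score; here b and c have two each.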

3. Effect Estimation

Under deep development. The SF and the Diabetes notebooks are good starting points.
