Generating production line data with available causal ground truth
causalAssembly
This repo provides details regarding $\texttt{causalAssembly}$, a causal discovery benchmark data tool based on complex production data. Theoretical details and information regarding construction are presented in the paper:
Göbler, K., Windisch, T., Pychynski, T., Sonntag, S., Roth, M., & Drton, M. (2023). causalAssembly: Generating Realistic Production Data for Benchmarking Causal Discovery. arXiv preprint arXiv:2306.10816.
Authors
Maintainer: Konstantin Goebler
How to install
The package can be installed as follows
pip install causalAssembly
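To confirm that the installation succeeded, the installed version can be queried from Python. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of $\texttt{causalAssembly}$.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("causalAssembly"))
```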
How to use
This section shows how $\texttt{causalAssembly}$ may be used. Be sure to read the documentation for more in-depth details regarding the available functions and classes.
In case you want to train a distributional random forest (DRF) [1] yourself (see How to semisynthesize), you need an R installation as well as the corresponding drf R package. Sampling was first proposed in [2].
Note: For Windows users, the Python package rpy2 might cause issues. Please consult its issue tracker on GitHub.
In order to sample semisynthetic data from $\texttt{causalAssembly}$, consider the following example:
import pandas as pd
from causalAssembly.models_dag import ProductionLineGraph
from causalAssembly.drf_fitting import fit_drf
seed = 2023
n_select = 500
assembly_line_data = ProductionLineGraph.get_data()
# take subsample for demonstration purposes
assembly_line_data = assembly_line_data.sample(
    n_select, random_state=seed, replace=False
)
# load in ground truth
assembly_line = ProductionLineGraph.get_ground_truth()
# fit drf and sample for entire line
assembly_line.drf = fit_drf(assembly_line, data=assembly_line_data)
assembly_line_sample = assembly_line.sample_from_drf(size=n_select)
# fit drf and sample for station3
assembly_line.Station3.drf = fit_drf(assembly_line.Station3, data=assembly_line_data)
station3_sample = assembly_line.Station3.sample_from_drf(size=n_select)
How to semisynthesize
In order to generate semisynthetic data for data sources outside the manufacturing context, the class DAG may be used. The example below showcases all necessary steps using the well-known Sachs [3] dataset.
Note that the cdt package is only needed to get easy access to the data and the corresponding ground truth.
import networkx as nx
from cdt.data import load_dataset
from causalAssembly.dag import DAG
from causalAssembly.drf_fitting import fit_drf
# load data set and available ground truth
s_data, s_graph = load_dataset("sachs")
# take subset for faster computation
s_data = s_data.sample(100, random_state=42)
# check acyclicity, then break a cycle by removing one of its edges
# (nx.find_cycle raises NetworkXNoCycle if the graph has no cycle)
print(nx.is_directed_acyclic_graph(s_graph))
cycles = nx.find_cycle(s_graph)
s_graph.remove_edge(*cycles[0])
if nx.is_directed_acyclic_graph(s_graph):
    # convert to DAG instance
    sachs_dag = DAG.from_nx(s_graph)
    # fit DRFs to the conditional distributions implied by
    # the factorization over <s_graph>
    sachs_dag.drf = fit_drf(graph=sachs_dag, data=s_data)
    # sample new data from the trained DRFs
    dream_benchmark_data = sachs_dag.sample_from_drf(size=50)
    print(dream_benchmark_data.head())
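Sampling from a factorization over a DAG proceeds node by node in topological order, so that every node is drawn only after its parents. The following is a minimal, standard-library sketch of this ancestral sampling idea. It is not the package's implementation: the toy graph, the function names, and the linear-Gaussian `sample_node` (a stand-in for a fitted DRF) are all assumptions made for illustration.

```python
import random
from graphlib import TopologicalSorter

# Toy DAG given as {node: set of parents}; node names are illustrative only.
parents = {
    "a": set(),
    "b": {"a"},
    "c": {"a", "b"},
}


def sample_node(parent_values, rng):
    """Stand-in conditional sampler: sum of parent values plus Gaussian noise.

    A fitted DRF would play this role in practice; the linear form is an
    assumption chosen only to keep the sketch short.
    """
    return sum(parent_values.values()) + rng.gauss(0.0, 1.0)


def ancestral_sample(parents, size, seed=0):
    """Draw `size` joint samples by visiting nodes in topological order."""
    rng = random.Random(seed)
    # static_order() yields each node only after all of its parents
    order = list(TopologicalSorter(parents).static_order())
    rows = []
    for _ in range(size):
        row = {}
        for node in order:
            parent_values = {p: row[p] for p in parents[node]}
            row[node] = sample_node(parent_values, rng)
        rows.append(row)
    return order, rows


order, rows = ancestral_sample(parents, size=5)
print(order)
print(rows[0])
```

Source nodes have no parents, so their values come from the noise term alone; this is the place where real data or a fitted marginal would enter.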
How to generate random production DAGs
The ProductionLineGraph class can further be used to generate completely random DAGs that follow an assembly line logic. Consider the following example:
from causalAssembly.models_dag import ProductionLineGraph
example_line = ProductionLineGraph()
example_line.new_cell(name='Station1')
example_line.Station1.add_random_module()
example_line.Station1.add_random_module()
example_line.new_cell(name='Station2')
example_line.Station2.add_random_module(n_nodes=5)
example_line.new_cell(name='Station3', is_eol=True)
example_line.Station3.add_random_module()
example_line.Station3.add_random_module()
example_line.connect_cells(forward_probs=[.1])
example_line.show()
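The idea behind an assembly line ordering can be illustrated without the package: if nodes are ordered by station and edges only ever point forward in that order, the resulting graph is acyclic by construction. The sketch below is a simplified illustration under that assumption and does not reproduce how ProductionLineGraph builds its graphs; `random_line_dag`, `within_prob`, and `forward_prob` are hypothetical names.

```python
import itertools
import random


def random_line_dag(cell_sizes, within_prob=0.3, forward_prob=0.1, seed=0):
    """Return (nodes, edges) of a random DAG with assembly-line ordering.

    cell_sizes: number of nodes per cell, listed in processing order.
    Nodes are (cell_index, node_index) tuples. Edges only point from an
    earlier node to a later one, so no cycle can arise.
    """
    rng = random.Random(seed)
    nodes = [(cell, i) for cell, n in enumerate(cell_sizes) for i in range(n)]
    edges = []
    # combinations() over the ordered list yields pairs (u, v) with u before v
    for u, v in itertools.combinations(nodes, 2):
        prob = within_prob if u[0] == v[0] else forward_prob
        if rng.random() < prob:
            edges.append((u, v))
    return nodes, edges


nodes, edges = random_line_dag([3, 5, 3])
print(len(nodes), "nodes,", len(edges), "edges")
```

Using a lower probability for edges between cells than within a cell mimics the intuition that most dependencies are local to a station, while a few carry over to later stations.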
How to generate FCMs
$\texttt{causalAssembly}$ also allows creating structural causal models (SCMs), also known as functional causal models (FCMs). In particular, we employ symbolic programming to allow for a seamless interplay between readability and performance. The FCM class is completely general and inherits no production data logic. See the example below for construction and usage.
import numpy as np
import pandas as pd
from sympy import Eq, Symbol, symbols
from sympy.stats import Gamma, Normal, Uniform, Exponential
from causalAssembly.models_fcm import FCM
# declare variables in FCM as symbols
v, w, x, y, z = symbols("v,w,x,y,z")
# declare symbol for the variance of a Gaussian
delta = Symbol("delta", positive=True)
# Set up FCM
# Each noise term requires a name, but the name mainly serves readability.
# Noise terms are evaluated equation by equation, so repeating names is completely fine.
eq_x = Eq(x, Exponential("source_distribution", 0.5))
eq_v = Eq(v, Gamma("source_distribution", 1, 1))
eq_y = Eq(y, 2 * x**2 - 7 * v + Normal("error", 0, delta))
eq_z = Eq(z, 9 * y * x * Gamma("noise", 0.5, 1))
eq_w = Eq(w, 7 * v - z + Uniform("error", left=-0.5, right=0.5))
# Collect in a list
eq_list = [eq_v, eq_w, eq_x, eq_y, eq_z]
# Create instance
test_fcm = FCM()
# Input the list of equations; this automatically
# induces the DAG etc.
test_fcm.input_fcm(eq_list)
# There is an option to use real data for source node samples
source_df = pd.DataFrame(
    {
        "v": np.random.uniform(low=-0.1, high=0.71, size=10),
    },
    columns=["v"],
)
# Sample from joint distribution
print(test_fcm.sample(size=8, source_df=source_df))
test_fcm.show(header="No Intervention")
# Multiple hard and soft interventions:
test_fcm.intervene_on(nodes_values={z: 2, w: Normal("noise", 3, 1)})
print(test_fcm.interventional_sample(size=8, source_df=source_df))
# Some plotting
test_fcm.show_mutilated_dag()
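Conceptually, a hard intervention replaces the structural equation of the intervened variable by a constant, while all other equations stay untouched. The sketch below illustrates this with a toy chain x -> y -> z in plain Python. It is independent of the FCM class; the function `sample_scm`, its `do` argument, and the chosen equations are assumptions made for illustration.

```python
import random


def sample_scm(size, do=None, seed=0):
    """Sample from a toy SCM x -> y -> z, optionally under hard interventions.

    do: dict mapping a variable name to a fixed value. The structural
    equation of that variable is then replaced by the constant.
    """
    do = do or {}
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        # Noise is always drawn, even for intervened variables, so that both
        # regimes share the same exogenous noise for a given seed.
        noise_x = rng.expovariate(0.5)
        noise_y = rng.gauss(0.0, 1.0)
        noise_z = rng.uniform(-0.5, 0.5)
        x = do.get("x", noise_x)
        y = do.get("y", 2 * x**2 + noise_y)
        z = do.get("z", y - x + noise_z)
        rows.append({"x": x, "y": y, "z": z})
    return rows


observational = sample_scm(size=5)
interventional = sample_scm(size=5, do={"y": 2.0})
print(observational[0])
print(interventional[0])
```

Because y is not an ancestor of x, the values of x coincide in both samples, whereas z changes through its dependence on y.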
References
[1] Ćevid, D., Michel, L., Näf, J., Bühlmann, P., & Meinshausen, N. (2022). Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression. Journal of Machine Learning Research, 23(333), 1-79.
[2] Gamella, J. L., Taeb, A., Heinze-Deml, C., & Bühlmann, P. (2022). Characterization and greedy learning of Gaussian structural causal models under unknown noise interventions. arXiv preprint arXiv:2211.14897.
[3] Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D. A., & Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721), 523-529.
How to test
We use pytest; the test suite can be executed locally via
python -m pytest
How to contribute?
Please feel free to contact one of the authors in case you wish to contribute.
Third-Party Licenses
Runtime dependencies
Name | License | Type |
---|---|---|
numpy | BSD-3-Clause License | Dependency |
scipy | BSD-3-Clause License | Dependency |
pandas | BSD-3-Clause License | Dependency |
networkx | BSD-3-Clause License | Dependency |
matplotlib | Other | Dependency |
sympy | BSD-3-Clause License | Dependency |
rpy2 | GNU General Public License v2.0 | Dependency |
Development dependencies
Name | License | Type |
---|---|---|
mike | BSD-3-Clause License | Dependency |
mkdocs | BSD-2-Clause License | Dependency |
mkdocs-material | MIT License | Dependency |
mkdocstrings[python] | ISC License | Dependency |
ruff | MIT License | Dependency |
pytest | MIT License | Dependency |
pip-tools | BSD-3-Clause License | Dependency |