Inga 因果
inga is a toolkit for generating and inspecting synthetic tabular datasets. It constructs arbitrarily complex Structural Causal Models (SCMs), draws samples from them, and computes causal effects and causal biases conditioned on observed variables and outcomes. All computed quantities are stored and made available for causally-informed pre-training of tabular models.
Causal Effect and Causal Bias
The current scope of this repository is restricted to SCMs with continuous variables. Let $V_i$ denote a generic scalar variable in the SCM, and let $U_{V_i} \sim 𝒩(0,1)$ be its corresponding exogenous noise, such that
$$ V_i := f_{V_i}(\mathrm{Pa}(V_i), U_{V_i}) := \bar{f}_{V_i}(\mathrm{Pa}(V_i)) + \sigma_{V_i} U_{V_i}. $$
Here, $\mathrm{Pa}(V_i)$ denotes the set of parents of $V_i$ in the DAG, $\bar f_{V_i}$ captures the deterministic structural component, and $\sigma_{V_i}$ controls the scale of the exogenous noise.
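To make the additive-noise form above concrete, here is a minimal sketch of how a single node could be sampled; the names sample_node and f_mean are illustrative assumptions, not inga's API:

import torch

def sample_node(f_mean, parents: dict[str, torch.Tensor], sigma: float, num_samples: int) -> torch.Tensor:
    # V_i := f_mean(Pa(V_i)) + sigma * U_i, with exogenous noise U_i ~ N(0, 1)
    u = torch.randn(num_samples)
    mean = f_mean(parents) if parents else torch.zeros(num_samples)
    return mean + sigma * u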
In particular, let $X$ denote a treatment variable, $Y$ an outcome, and $𝒪$ a set of observed variables. Under mild regularity assumptions (Detommaso et al.), the causal effect and causal bias for a given treatment value $x$ and observation vector $o$ are defined as
$$
\begin{aligned}
𝒞_X(x, o) &:= 𝔼\big[\nabla_x f_Y^x \,\big|\, x, o\big], \\
ℬ_X(x, o) &:= -\sum_{V_i \in \{X\}\cup 𝒪} \frac{1}{\sigma_{V_i}} 𝔼\Big[ \Big( \nabla_{V_i} f_Y^{x,o} - (f_Y^{x,o} - 𝔼[Y \mid x, o])\, U_{V_i} \Big) \nabla_x (f_{V_i}^{x,o} - v_i) \,\Big|\, x, o \Big].
\end{aligned}
$$
Here, $f_{V_i}^{a}$ denotes the structural function $f_{V_i}$ under intervention $A=a$. All expectations are taken with respect to the posterior distribution $p(U \mid x, o)$, where $U$ is the vector of all exogenous noise variables.
inga approximates this posterior using a robust Laplace approximation, enabling scalable computation in high-dimensional settings and across batches of observations $(x, o)$.
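For intuition, a Laplace approximation fits a Gaussian centered at the posterior mode, with covariance given by the inverse Hessian of the negative log joint at that mode. Below is a minimal PyTorch sketch of this idea; the function name and the neg_log_joint argument are assumptions for illustration, not inga's internal API:

import torch

def laplace_approximation(neg_log_joint, u_init, num_steps: int = 200, lr: float = 0.05):
    # Find the MAP of the exogenous noise U given (x, o) by gradient descent.
    u = u_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([u], lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        neg_log_joint(u).backward()
        optimizer.step()
    u_map = u.detach()
    # The Gaussian covariance is the inverse Hessian of the negative log joint at the MAP.
    hessian = torch.autograd.functional.hessian(neg_log_joint, u_map)
    covariance = torch.linalg.inv(hessian)
    return u_map, covariance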
One can show that the association between treatment $X$ and outcome $Y$ decomposes into causal effect and causal bias:
$$ 𝒜_X(x, o) := \nabla_x 𝔼[Y \mid x, o] = 𝒞_X(x, o) + ℬ_X(x, o). $$
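As a concrete illustration of this decomposition (a worked example, not taken from the repository), consider a linear Gaussian SCM with a confounder $Z$:

$$ Z := U_Z, \qquad X := aZ + \sigma_X U_X, \qquad Y := bX + cZ + \sigma_Y U_Y. $$

Observing only the treatment, $X = x$, the structural gradient gives $𝒞_X(x, o) = b$, while $𝔼[Z \mid x] = \tfrac{a}{a^2 + \sigma_X^2}\, x$ yields

$$ 𝒜_X(x, o) = b + \frac{ac}{a^2 + \sigma_X^2}, \qquad ℬ_X(x, o) = \frac{ac}{a^2 + \sigma_X^2}, $$

so the association differs from the causal effect by exactly the confounding term.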
Causally Consistent Pre-Training
Causal effect and causal bias provide a granular characterization of how information propagates from observed variables to the outcome within the DAG.
Standard point-estimation models aim to approximate the conditional expectation $𝔼[Y \mid x, o]$, but they do not distinguish between contributions arising from causal pathways and those arising from non-causal (e.g., confounding or purely statistical) dependencies. As a result, the underlying data-generating process is often unidentifiable, which can lead to suboptimal generalization and brittleness under distribution shift.
Consider an encoder model $z := h(o)$ and a prediction head $\hat{y}(z)$. Introduce two additional heads, $\hat{c}_j(z)$ and $\hat{b}_j(z)$, intended to learn the causal effect and causal bias from $O_j$ (treated as the treatment variable) to $Y$. We say that the model is causally consistent for $O_j$ if
$$ \begin{aligned} \nabla_{o_j} \hat{y} &= \hat{c}_j + \hat{b}_j, \\ \hat{c}_j &= 𝒞_{O_j}(o_j, o), \\ \hat{b}_j &= ℬ_{O_j}(o_j, o). \end{aligned} $$
inga enables causally consistent pre-training by generating synthetic datasets that include the full set of causal effects $𝒞_{O_j}(o_j, o)$ and causal biases $ℬ_{O_j}(o_j, o)$. These quantities can be incorporated directly into training objectives, encouraging models to learn representations that respect the causal structure of the data-generating process.
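As a rough sketch of how these targets could enter a training objective (hypothetical code, not inga's training loop; the model architecture, head names, and unit loss weights are illustrative assumptions):

import torch
from torch import nn

class CausallyConsistentModel(nn.Module):
    def __init__(self, num_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_features, hidden), nn.ReLU())
        self.y_head = nn.Linear(hidden, 1)             # predicts E[Y | o]
        self.c_head = nn.Linear(hidden, num_features)  # predicts causal effects C_{O_j}
        self.b_head = nn.Linear(hidden, num_features)  # predicts causal biases B_{O_j}

    def forward(self, o: torch.Tensor):
        z = self.encoder(o)
        return self.y_head(z).squeeze(-1), self.c_head(z), self.b_head(z)

def causal_consistency_loss(model, o, y, c, b):
    # c and b are the causal effects and biases stored in the synthetic dataset.
    o = o.requires_grad_(True)
    y_hat, c_hat, b_hat = model(o)
    # Gradient of the prediction with respect to each observed feature.
    dy_do = torch.autograd.grad(y_hat.sum(), o, create_graph=True)[0]
    return (
        nn.functional.mse_loss(y_hat, y)                 # standard point prediction
        + nn.functional.mse_loss(c_hat, c)               # match causal effects
        + nn.functional.mse_loss(b_hat, b)               # match causal biases
        + nn.functional.mse_loss(dy_do, c_hat + b_hat)   # consistency: gradient = effect + bias
    )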
A Small Benchmark
The small benchmark causal_consistency_benchmark.py demonstrates this intuition. A simple MLP encoder is attached to three linear heads that predict outcomes, causal effects, and causal biases, respectively. The model is trained and tested individually on splits of 30 randomly generated synthetic datasets.
+--------------------+----------------+-------------------+-------------------------+
| method_type        | prediction_mae | causal_effect_mae | prediction_win_fraction |
+--------------------+----------------+-------------------+-------------------------+
| standard           | 0.7909 [0.31]  | 0.3353 [0.45]     | 0.0667                  |
| l2                 | 0.7868 [0.31]  | 0.3141 [0.46]     | 0.0667                  |
| causal_consistency | 0.7694 [0.31]  | 0.0461 [0.21]     | 0.8667                  |
+--------------------+----------------+-------------------+-------------------------+
The table shows that the model trained with causal consistency not only provides far more reliable causal effect estimates, but also reduces the generalization error on ~87% of the datasets. Results can be reproduced by running uv run python examples/causal_consistency_benchmark.py.
How To:
Install
Clone the repository:
git clone https://github.com/gianlucadetommaso/inga.git
cd inga
Sync dependencies:
uv sync
Run scripts, for example:
uv run python -m examples.explore
Create a DAG
You can create and draw the DAG of an SCM as follows:
from inga.scm import SCM, Variable
scm = SCM(
    variables=[
        Variable(name="Z"),
        Variable(name="X", parent_names=["Z"]),
        Variable(name="Y", parent_names=["Z", "X"]),
    ]
)

scm.draw(output_path="YOUR_DAG.png")
Create an SCM
The class Variable defines a variable $V_i$ in the DAG, but leaves the mean function $\bar f_{V_i}$ unspecified. To complete the SCM and compute causal quantities, you must create a subclass that defines the mean function. For example:
import torch
from torch import Tensor
from inga.scm import Variable
class MyVariable(Variable):
    def f_mean(self, parents: dict[str, Tensor]) -> Tensor:
        f_mean: Tensor | float = 0.0
        for parent in parents.values():
            f_mean = f_mean + torch.sin(parent)
        return f_mean
An example of a built-in Variable with a defined mean function is LinearVariable. Now, let's update the SCM using our newly defined variable class!
from inga.scm import SCM
scm = SCM(
    variables=[
        MyVariable(name="Z", sigma=1.0),
        MyVariable(name="X", sigma=1.0, parent_names=["Z"]),
        MyVariable(name="Y", sigma=1.0, parent_names=["Z", "X"]),
    ]
)
Compute causal effect and causal bias
We are now ready to compute the causal effect and causal bias. We need to define the treatment variable, the outcome variable, and the observed variables. Note: the treatment should always be observed, while the outcome should never be. Here is an example:
from torch import Tensor
treatment_name, outcome_name = "X", "Y"
observed = {"X": Tensor([1.])}
scm.posterior.fit(observed)
causal_effect = scm.causal_effect(
    observed=observed,
    treatment_name=treatment_name,
    outcome_name=outcome_name
)

causal_bias = scm.causal_bias(
    observed=observed,
    treatment_name=treatment_name,
    outcome_name=outcome_name
)
Explore the dataset
You can investigate the dataset interactively by exporting the SCM to HTML:
scm.export_html(
    output_path="YOUR_SCM.html",
    observed_ranges={"X": (-2.0, 2.0)}
)
Run uv run python examples/explore.py to check out an example of this!
Generate, save and load SCM datasets
Now that we have constructed our SCM, let's generate, save, and load an SCM dataset.
from inga.scm import CausalQueryConfig, load_scm_dataset
dataset = scm.generate_dataset(
    num_samples=128,
    seed=123,
    queries=[
        CausalQueryConfig(
            treatment_name="X",
            outcome_name="Y",
            observed_names=["X"],
        ),
    ],
)
dataset_path = "YOUR_DATASET.json"
dataset.save(dataset_path)
loaded_dataset = load_scm_dataset(dataset_path)
Cite Inga
If you use inga in academic work, you can cite it with the following BibTeX entry (and optionally replace year and note with the exact release tag/commit and access date you used):
@software{detommaso_inga,
author = {Detommaso, Gianluca},
title = {Inga: Causal Synthetic Tabular Data Toolkit},
url = {https://github.com/gianlucadetommaso/inga},
year = {2026},
note = {GitHub repository}
}