A conditional independence test for missing data
citest: Conditional Independence Testing for Missing Data
A hypothesis test for whether an outcome variable is independent of missingness, conditional on the observed explanatory data. The test compares classifier performance in predicting missingness with and without the outcome variable, using multiple imputation and cross-fitting to produce a valid t-statistic and p-value.
Installation
pip install citest
Quick start
import pandas as pd
from citest import CIMissTest
from citest.data import Dataset

# Load your data
data = pd.read_csv("path/to/your/data.csv")

# Define the dataset
dataset = Dataset()
dataset.make(
    data,
    y="target_variable",
    expl_vars=["expl_var1", "expl_var2", ...]
)

# Run the test
test = CIMissTest(
    dataset,
    classifier_args={"n_estimators": 20, "target_n_jobs": 8},
)
test.run()

# Print results
test.summary()
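If you do not have a dataset at hand, a small synthetic one can be useful for trying out the quick start. The sketch below uses only NumPy and pandas and does not call citest. The helper name `make_toy_data`, the coefficients, and the missingness mechanism are illustrative assumptions; the column names simply mirror the placeholders used above.

```python
import numpy as np
import pandas as pd


def make_toy_data(n=500, seed=0):
    """Simulate data where missingness in one covariate depends on the outcome."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 0.5 * x1 + rng.normal(size=n)

    # Assumed mechanism: expl_var2 is more likely to be missing when y is large.
    p_miss = 1.0 / (1.0 + np.exp(-(y - 1.0)))
    is_missing = rng.random(n) < p_miss

    df = pd.DataFrame(
        {"target_variable": y, "expl_var1": x1, "expl_var2": x2}
    )
    df.loc[is_missing, "expl_var2"] = np.nan
    return df


toy = make_toy_data()
# Share of missing values per column
print(toy.isna().mean())
```

A frame like `toy` could then stand in for `data` in the quick start, with `y="target_variable"` and `expl_vars=["expl_var1", "expl_var2"]`.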
How the test works
- Multiple imputation -- the missing data are multiply imputed (default: MIDAS denoising autoencoder).
- Classifier comparison -- for each imputed dataset, two classifiers predict the missingness indicator R:
  - One using the outcome Y and covariates X
  - One using a permuted (uninformative) copy of Y and covariates X
- Cross-fitting -- predictions are made out-of-fold to avoid data leakage.
- Test statistic -- the weighted difference in binary cross-entropy between the two classifiers is combined across imputations using Rubin's rules, yielding a t-statistic and p-value.
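The classifier comparison and cross-fitting steps above can be sketched with scikit-learn for a single, already completed dataset. This is a simplified illustration, not citest's implementation: it uses one missingness indicator, an unweighted difference in binary cross-entropy, and no pooling across imputations. The function names `crossfit_bce` and `bce_difference`, the simulated data, and the hyperparameters are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def crossfit_bce(features, r, seed=0, n_folds=5):
    """Out-of-fold binary cross-entropy for predicting the indicator r."""
    clf = RandomForestClassifier(
        n_estimators=50, min_samples_leaf=5, random_state=seed
    )
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # Each row is predicted by a model that never saw it during fitting.
    proba = cross_val_predict(
        clf, features, r, cv=cv, method="predict_proba"
    )[:, 1]
    return log_loss(r, proba, labels=[0, 1])


def bce_difference(X, y, r, seed=0):
    """BCE with a permuted copy of y minus BCE with the real y."""
    rng = np.random.default_rng(seed)
    y_perm = rng.permutation(y)  # breaks any link between y and r
    bce_null = crossfit_bce(np.column_stack([X, y_perm]), r, seed)
    bce_real = crossfit_bce(np.column_stack([X, y]), r, seed)
    return bce_null - bce_real


# Simulated example in which missingness depends strongly on y.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = X[:, 0] + rng.normal(size=400)
r = (y + 0.3 * rng.normal(size=400) > 0).astype(int)

diff = bce_difference(X, y, r)
print(f"BCE(permuted Y) - BCE(real Y) = {diff:.3f}")
```

A positive difference means the real outcome improves out-of-fold prediction of missingness relative to the permuted copy.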
A significant result indicates that missingness depends on the outcome even after conditioning on the covariates (i.e. the data are not missing at random with respect to Y).
Customizing the pipeline
Imputers
| Class | Description |
|---|---|
| MidasImputer (default) | MIDAS denoising autoencoder (via midas2) |
| IterativeImputer | scikit-learn iterative imputer with posterior sampling |
| IterativeImputer2 | Robust variant with numerical guards for wide/sparse data |
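To illustrate what "iterative imputer with posterior sampling" means, the sketch below calls scikit-learn's own `IterativeImputer` directly, not the citest wrapper of the same name. Drawing from the posterior with a different seed per run yields several distinct completed datasets. The helper `multiply_impute` and the simulated matrix are assumptions for illustration; how citest configures the imputer internally is not shown here.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def multiply_impute(X, m=5, max_iter=10, seed=0):
    """Return m completed copies of X, each drawn with a different seed."""
    completed = []
    for i in range(m):
        imputer = IterativeImputer(
            sample_posterior=True,  # draw imputations instead of using the mean
            max_iter=max_iter,
            random_state=seed + i,
        )
        completed.append(imputer.fit_transform(X))
    return completed


# Simulated matrix with roughly 20% of entries in column 2 set to missing.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 3))
X_full[:, 2] += X_full[:, 0]
X_miss = X_full.copy()
mask = np.zeros_like(X_miss, dtype=bool)
mask[:, 2] = rng.random(200) < 0.2
X_miss[mask] = np.nan

draws = multiply_impute(X_miss, m=5)
print(len(draws), draws[0].shape)
```

Observed entries are identical across the completed datasets; only the imputed entries vary, and that variation is what the between-imputation variance captures.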
Classifiers
| Class | Description |
|---|---|
| RFClassifier (default) | Random forest with auto-tuned max_features and min_samples_leaf |
| ETClassifier | Extremely randomized trees |
| LogisticClassifier | Logistic regression |
Example with custom settings
from citest.imputer import IterativeImputer
from citest.classifier import RFClassifier

test = CIMissTest(
    dataset,
    imputer=IterativeImputer,
    classifier=RFClassifier,
    n_folds=10,
    m=10,
    classifier_args={"n_estimators": 100, "target_n_jobs": 8},
    imputer_args={"max_iter": 20},
)
Key parameters
| Parameter | Default | Description |
|---|---|---|
| m | 10 | Number of multiply imputed datasets |
| n_folds | 10 | Number of cross-validation folds |
| variance_method | "mi_crossfit" | Variance estimator |
| target_level | "variable" | Granularity of the missingness target: "variable" or "column" |
| random_state | 42 | Random seed for reproducibility |
Interpreting results
test.summary() prints the test output:
- Mean difference in BCE -- average reduction in cross-entropy when the real outcome is included. Positive values indicate the outcome helps predict missingness.
- t / p-value -- one-sided test of H0: the outcome does not improve missingness prediction. A small p-value provides evidence against conditional independence (i.e. evidence of MNAR-type missingness).
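For intuition about how a pooled t-statistic and one-sided p-value can arise from per-imputation results, here is a minimal sketch of the standard Rubin's rules with the classic degrees-of-freedom formula. It is not necessarily what citest's "mi_crossfit" variance estimator computes. The function name `pool_rubin` is hypothetical, and the input numbers are made up purely for illustration.

```python
import numpy as np
from scipy import stats


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and variances with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size

    q_bar = q.mean()                   # pooled point estimate
    u_bar = u.mean()                   # within-imputation variance
    b = q.var(ddof=1)                  # between-imputation variance
    total = u_bar + (1 + 1 / m) * b    # total variance

    t_stat = q_bar / np.sqrt(total)
    if b > 0:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
        p_value = stats.t.sf(t_stat, df)    # one-sided: H1 is q > 0
    else:
        df = np.inf
        p_value = stats.norm.sf(t_stat)     # no between-imputation variation
    return q_bar, t_stat, df, p_value


# Made-up BCE differences and their variances from m = 5 imputations.
bce_diffs = [0.020, 0.025, 0.018, 0.022, 0.021]
bce_vars = [1e-4] * 5

q_bar, t_stat, df, p_value = pool_rubin(bce_diffs, bce_vars)
print(f"mean diff={q_bar:.4f}, t={t_stat:.2f}, df={df:.0f}, p={p_value:.4f}")
```

The upper-tail probability is used because only a positive mean difference (the real outcome helping prediction) counts as evidence against the null.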
License
MIT