A conditional independence test for missing data

citest: Conditional Independence Testing for Missing Data

A hypothesis test for whether an outcome variable is independent of missingness, conditional on the observed explanatory data. The test compares classifier performance in predicting missingness with and without the outcome variable, using multiple imputation and cross-fitting to produce a valid t-statistic and p-value.

Installation

pip install citest

Quick start

import pandas as pd
from citest import CIMissTest
from citest.data import Dataset

# Load your data
data = pd.read_csv("path/to/your/data.csv")

# Define the dataset
dataset = Dataset()
dataset.make(
    data,
    y="target_variable",
    expl_vars=["expl_var1", "expl_var2", ...]
)

# Run the test
test = CIMissTest(
    dataset,
    classifier_args={"n_estimators": 20, "target_n_jobs": 8},
)
test.run()

# Print results
test.summary()

How the test works

  1. Multiple imputation -- the missing data are multiply imputed (default: MIDAS denoising autoencoder).
  2. Classifier comparison -- for each imputed dataset, two classifiers predict the missingness indicator R:
    • One using the outcome Y and covariates X
    • One using a permuted (uninformative) copy of Y and covariates X
  3. Cross-fitting -- predictions are made out-of-fold to avoid data leakage.
  4. Test statistic -- the weighted difference in binary cross-entropy between the two classifiers is combined across imputations using Rubin's rules, yielding a t-statistic and p-value.
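
Steps 2 and 3 can be sketched for a single completed dataset as follows. This is a minimal illustration, not the package's implementation: it uses scikit-learn's logistic regression rather than the default random forest to keep the example small, takes an unweighted mean difference, and the names `bce` and `crossfit_bce_gain` and the toy data are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def bce(r, p, eps=1e-12):
    """Per-observation binary cross-entropy of probabilities p against labels r."""
    p = np.clip(p, eps, 1 - eps)
    return -(r * np.log(p) + (1 - r) * np.log(1 - p))


def crossfit_bce_gain(X, y, r, n_folds=5, seed=0):
    """Mean out-of-fold BCE(permuted Y) minus BCE(real Y) for one dataset."""
    rng = np.random.default_rng(seed)
    y_perm = rng.permutation(y)  # uninformative copy of Y
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)

    def oof_prob(features):
        # Each row is predicted by a model that did not see it during fitting.
        clf = LogisticRegression()
        proba = cross_val_predict(clf, features, r, cv=folds, method="predict_proba")
        return proba[:, 1]

    p_real = oof_prob(np.column_stack([X, y]))
    p_perm = oof_prob(np.column_stack([X, y_perm]))
    return float(np.mean(bce(r, p_perm) - bce(r, p_real)))


# Toy data (assumed): the missingness indicator r depends on y
rng = np.random.default_rng(1)
n = 600
X = rng.normal(size=(n, 2))
y = X[:, 0] + rng.normal(size=n)
r = (rng.random(n) < 1 / (1 + np.exp(-2 * y))).astype(int)

gain = crossfit_bce_gain(X, y, r)
print(f"mean BCE reduction from including Y: {gain:.3f}")
```

In the full procedure this comparison would be repeated once per imputed dataset, and the per-imputation differences pooled as described in step 4.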

A significant result indicates that missingness depends on the outcome even after conditioning on the covariates (i.e. the data are not missing at random with respect to Y).
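
The pooling in step 4 can be illustrated with the textbook form of Rubin's rules. This is a sketch under stated assumptions: the package's "mi_crossfit" variance estimator may differ from it, the function name `pool_rubin` is hypothetical, and the input numbers are made up for illustration.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their sampling variances."""
    m = len(estimates)
    q_bar = mean(estimates)           # pooled point estimate
    u_bar = mean(variances)           # within-imputation variance
    b = variance(estimates)           # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b   # total variance
    t_stat = q_bar / math.sqrt(total)
    # Rubin's degrees of freedom; infinite if the estimates do not vary at all
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2 if b > 0 else math.inf
    return q_bar, total, t_stat, df


# Made-up BCE differences and their variances from m = 5 imputations
diffs = [0.021, 0.018, 0.025, 0.019, 0.022]
diff_vars = [0.00004, 0.00005, 0.00004, 0.00006, 0.00005]

q_bar, total, t_stat, df = pool_rubin(diffs, diff_vars)
print(f"pooled difference={q_bar:.4f}, t={t_stat:.2f}, df={df:.1f}")
```

A one-sided p-value would then come from the upper tail of a t distribution with `df` degrees of freedom, for example via `scipy.stats.t.sf(t_stat, df)`.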

Customizing the pipeline

Imputers

| Class | Description |
| --- | --- |
| MidasImputer (default) | MIDAS denoising autoencoder (via midas2) |
| IterativeImputer | scikit-learn iterative imputer with posterior sampling |
| IterativeImputer2 | Robust variant with numerical guards for wide/sparse data |

Classifiers

| Class | Description |
| --- | --- |
| RFClassifier (default) | Random forest with auto-tuned max_features and min_samples_leaf |
| ETClassifier | Extremely randomized trees |
| LogisticClassifier | Logistic regression |

Example with custom settings

from citest import CIMissTest
from citest.imputer import IterativeImputer
from citest.classifier import RFClassifier

# `dataset` is the Dataset object built in the Quick start
test = CIMissTest(
    dataset,
    imputer=IterativeImputer,
    classifier=RFClassifier,
    n_folds=10,
    m=10,
    classifier_args={"n_estimators": 100, "target_n_jobs": 8},
    imputer_args={"max_iter": 20},
)
test.run()
test.summary()

Key parameters

| Parameter | Default | Description |
| --- | --- | --- |
| m | 10 | Number of multiply imputed datasets |
| n_folds | 10 | Number of cross-validation folds |
| variance_method | "mi_crossfit" | Variance estimator |
| target_level | "variable" | Granularity of the missingness target: "variable" or "column" |
| random_state | 42 | Random seed for reproducibility |

Interpreting results

test.summary() prints the test output:

  • Mean difference in BCE -- average reduction in cross-entropy when the real outcome is included. Positive values indicate the outcome helps predict missingness.
  • t / p-value -- one-sided test of H0: the outcome does not improve missingness prediction. A small p-value provides evidence against conditional independence (i.e. evidence of MNAR-type missingness).
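
To make the sign convention concrete, here is a small worked example for one observation. The probabilities are made up, and `bce_one` is a hypothetical helper, not part of the package.

```python
import math


def bce_one(r, p):
    """Binary cross-entropy for one observation with label r and predicted P(R=1) = p."""
    return -(r * math.log(p) + (1 - r) * math.log(1 - p))


# Suppose the value is actually missing (R = 1)
loss_with_y = bce_one(1, 0.8)      # classifier that sees the real outcome
loss_permuted_y = bce_one(1, 0.6)  # classifier that sees the permuted outcome

# Positive: the real outcome moved the prediction toward the truth
diff = loss_permuted_y - loss_with_y
print(f"{loss_permuted_y:.3f} - {loss_with_y:.3f} = {diff:.3f}")
```

Averaged over observations, a difference near zero means the outcome adds nothing beyond the covariates, and a clearly positive difference is what the one-sided test looks for.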

License

MIT
