
Bayesian MI-LASSO for variable selection on multiply-imputed data.


BMIselect

This package provides five Bayesian MI-LASSO models for variable selection on multiply-imputed data. Inference for these five models is based on MCMC chains implemented in Python. They take a multiply-imputed dataset as input and output posterior samples of each parameter.

Multiple imputation (MI) is one of the major tools for handling missing data. However, if we run a variable selection method on each imputed dataset separately, we may obtain a different set of important variables from each dataset, and it is hard to decide which variables matter based on these conflicting sets. MI-LASSO, proposed in 2013, is one of the most popular solutions to this problem. It treats the estimated regression coefficients of the same variable across all imputed datasets as a group and applies the group LASSO penalty, yielding a consistent selection across the imputed datasets. In this package, we extend MI-LASSO into a Bayesian framework. To yield various selection effects, we use five different prior distributions as follows:

  • Ridge prior (ARD prior)
  • Horseshoe prior
  • Laplace prior
  • Spike-ridge prior
  • Spike-laplace prior

For more details on the algorithm and its applications, please consult the following paper: "Variable Selection for Multiply-imputed Data: A Bayesian Framework" (arXiv: https://arxiv.org/abs/2211.00114).

Installation

You can install this package from PyPI with pip:

pip install bmiselect

Requirements

  • Python >= 3.7
  • pymc3 >= 3.11.5
  • theano-pymc >= 1.1.2
  • mkl >= 2.4.0
  • mkl-service >= 2.4.0
  • numpy
  • matplotlib
  • sklearn
  • pandas
  • seaborn
  • arviz
  • xarray
  • statsmodels

Models

This package is based on the linear regression model fitted to each imputed dataset d = 1, ..., D:

Y_d = alpha_d + X_d beta_d + epsilon_d,    d = 1, ..., D,

where D is the number of imputed datasets. Different models impose different group-based prior distributions on beta_j = (beta_j1, ..., beta_jD), the coefficients of the j-th covariate grouped across all imputed datasets.

Type                     Group-based Prior    Hyper-parameters      Default value
Shrinkage Model          Multi-Laplace        r                     2
                                              s                     15
                         ARD (Ridge)          no hyper-parameters   -
                         Horseshoe            no hyper-parameters   -
Discrete Mixture Model   Spike-Ridge          p0                    0.5
                                              v0                    4
                         Spike-Laplace        lambda                6/11
                                              a                     1
                                              b                     1

Inference is performed by drawing posterior samples with MCMC.

Usage

Examples of how to use this library are included in bmiselect/main.py.

Input

After installation from pip, we can import the Bayesian MI-LASSO classes in a Python shell:

from bmiselect.models.ARD import ARD
from bmiselect.models.Horseshoe import Horseshoe
from bmiselect.models.Spike_ridge import Spike_ridge
from bmiselect.models.Laplace import Laplace
from bmiselect.models.Spike_laplace import Spike_laplace

Initialization

Then we can use the MI dataset to initialize the models:

# shrinkage models
model1 = Laplace(Y_array, X_array, standardize = True, r = 2, s = 15)
# model1 = Horseshoe(Y_array, X_array, standardize = True)
# model1 = ARD(Y_array, X_array, standardize = True)

# discrete mixture models
model2 = Spike_laplace(Y_array, X_array, standardize = True, lambda_ = 6/11, a = 1, b = 1)
# model2 = Spike_ridge(Y_array, X_array, standardize = True, p0 = 0.5, v0 = 4)

Here Y_array is a 2-d data array for the response variable, with dimension (n_imputations, n_samples). X_array is a 3-d data array for the explanatory variables, with dimension (n_imputations, n_samples, n_features). If the parameter standardize is True, X_array is standardized before the MCMC chains are run; if it is False, the original X_array is used. The remaining parameters are the hyper-parameters of each model.
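
For example, if the imputed datasets are stored as a list of pandas DataFrames, the two arrays can be stacked with numpy. This is a minimal sketch, not part of the package API; the list imputed_dfs and the outcome column name "y" are illustrative:

import numpy as np

# imputed_dfs: a list of D pandas DataFrames with identical rows/columns, one per imputation;
# "y" is a hypothetical name for the outcome column.
Y_array = np.stack([df["y"].to_numpy() for df in imputed_dfs])                    # (n_imputations, n_samples)
X_array = np.stack([df.drop(columns = ["y"]).to_numpy() for df in imputed_dfs])  # (n_imputations, n_samples, n_features)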

Posterior Sampling

After initialization, we can use sample function to run MCMC chains and get posterior samples:

model1.sample(n_post = 1000, n_burn = 500, target_accept = 0.9, n_chain = 2, n_thread = 4, max_treedepth = 10, seed = 123)
# model2.sample(n_post = 1000, n_burn = 500, target_accept = 0.9, n_chain = 2, n_thread = 4, max_treedepth = 10, seed = 123)

The parameters of the sample function are as follows:

  • n_post(required): number of posterior samples for each chain.
  • n_burn(required): number of burn-in samples for each chain.
  • target_accept(default 0.9): target acceptance probability for NUTS.
  • max_treedepth(default 10): maximum tree depth for NUTS.
  • n_chain(default 1): number of parallel chains to run.
  • n_thread(default 4): number of threads to run parallel computing.
  • seed(default None): random seed. If seed is None, the seed is set to the current time in seconds since the Epoch.

We can use get_posterior_samples function to get posterior samples:

model1.get_posterior_samples(var_name = "beta", rescale = True)   # get posterior samples for the coefficients vector
model2.get_posterior_samples(var_name = "alpha", rescale = True)  # get posterior samples for the intercept
model2.get_posterior_samples(var_name = "g", rescale = True)      # get posterior samples for the hidden binary variables in discrete mixture models

Here var_name is the name of the variable whose posterior samples we want to retrieve. rescale specifies whether to return coefficients on the original scale: if it is False, the coefficients corresponding to the standardized covariates are returned; if it is True, all coefficients are rescaled back to the original scale. If standardize = False at the initialization stage, rescale has no effect. For MI data, we simply pool the posterior samples of each grouped coefficient across all MI sets. So the posterior samples of the coefficient vector beta have dimension (n_chains, n_imputations * n_post, n_features), and those of the intercept alpha have dimension (n_chains, n_imputations * n_post).
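
As a minimal numpy sketch (not part of the package API), posterior means and equal-tailed 95% credible intervals of the coefficients can be computed directly from these arrays:

import numpy as np

beta = np.asarray(model1.get_posterior_samples(var_name = "beta", rescale = True))
# merge chains and draws before summarizing: (total_draws, n_features)
beta_flat = beta.reshape(-1, beta.shape[-1])
post_mean = beta_flat.mean(axis = 0)                                  # posterior mean of each coefficient
ci_lower, ci_upper = np.percentile(beta_flat, [2.5, 97.5], axis = 0)  # 95% equal-tailed credible interval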

Summary Statistics

Our library provides a summary function to generate summary statistics for all the variables in the hierarchical model:

summary_stats1 = model1.summary(rescale = True)
print(summary_stats1)

Here rescale has the same meaning as in the get_posterior_samples function.

Variable Selection

Users can use the select function to select important variables:

select1 = model1.select(value = 0.95, rescale = True) # Credible Interval Criterion for Shrinkage Models
select2 = model2.select(value = 0.5,  rescale = True) # Cut-off point for Discrete Mixture Models

The meaning of value depends on the type of model. For shrinkage models, value is the credible interval criterion for selection. For discrete mixture models, value is the cut-off point for selection. For more details, please consult Section 3.2 of the paper: "Variable Selection for Multiply-imputed Data: A Bayesian Framework" (arXiv: https://arxiv.org/abs/2211.00114).
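
As an illustration, the boolean selection vector can be mapped back to variable names. Here feature_names is a hypothetical, user-supplied list of column names in the same order as the columns of X_array; it is not part of the package:

import numpy as np

selected_idx = np.where(np.asarray(select1))[0]            # indices of the selected variables
selected_names = [feature_names[i] for i in selected_idx]  # feature_names is user-supplied (hypothetical)
print(selected_names)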

Evaluation

There are some evaluation functions in this library:

from bmiselect.utils.evaluation import sensitivity, specificity, f1_score, mse

sensitivity(select = select1, truth = truth)                                           # sensitivity
specificity(select = select2, truth = truth)                                           # specificity
f1_score(select = select2, truth = truth)                                              # f1 score
mse(beta, covariance, select = select1, X = X_array, Y = Y_array, intercept = True)    # mse, given coefficients and covariance matrix of ground truth

Here select and truth are binary vectors of length n_features. select[i] = True means the i-th variable is selected.
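
In a simulation study, truth is simply the indicator of which coefficients are truly non-zero. A minimal sketch, where the specific pattern below is illustrative:

import numpy as np

# Suppose the data were simulated with 10 covariates, of which only the first 3 have non-zero effects.
truth = np.array([True, True, True] + [False] * 7)
print(sensitivity(select = select1, truth = truth))
print(specificity(select = select1, truth = truth))
print(f1_score(select = select1, truth = truth))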

Refitting Linear Regression

After completing variable selection with Bayesian MI-LASSO, users can apply fit_lr to fit an ordinary linear regression separately on each imputed dataset. Alternatively, users can use pooled_coefficients and pooled_covariance to directly obtain the pooled coefficient and covariance estimates via Rubin's rule.

from bmiselect.utils.evaluation import fit_lr, pooled_coefficients, pooled_covariance

# refit linear regression by using selected variables
lr_models = fit_lr(select1, X_array, Y_array, intercept = True)
for lr in lr_models:
    print(lr.summary())

# get pooled coefficient estimates by using Rubin's rule
lr_coef = pooled_coefficients(select2, X_array, Y_array, intercept = True)
print(lr_coef)

# get pooled covariance estimates by using Rubin's rule
lr_covariance = pooled_covariance(select2, X_array, Y_array, intercept = True)
print(lr_covariance)

If intercept = True, an intercept is added to each ordinary linear regression; if intercept = False, no intercept is used.
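
For instance, approximate 95% Wald intervals for the pooled estimates can be formed from lr_coef and lr_covariance with plain numpy. This is a sketch rather than a package function, and Rubin's rule formally uses a t reference distribution, so the normal quantile 1.96 is only an approximation here:

import numpy as np

coef = np.asarray(lr_coef)
se = np.sqrt(np.diag(np.asarray(lr_covariance)))   # pooled standard errors from the covariance matrix
ci_lower = coef - 1.96 * se                        # normal-approximation 95% interval
ci_upper = coef + 1.96 * se
print(np.column_stack([coef, ci_lower, ci_upper]))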

Disclaimer

If you find any bugs, please contact me at jungang.zou@gmail.com.
