This package employs a boosting algorithm to perform variable selection and estimation for binary responses and high-dimensional covariates that are subject to measurement error.

BOOME

BOOME, short for BOOsting algorithm for Measurement Error in binary responses and high-dimensional covariates, is a Python package that selects informative variables and estimates the associated parameters while simultaneously correcting for misclassification in the responses and measurement error effects in the covariates. The key idea of the method is to develop an unbiased estimating function based on logistic regression or probit models by correcting for measurement error effects in responses (e.g., Yi 2017, ISBN: 978-1493966387) and covariates (e.g., Stefanski and Carroll 1987, doi:10.2307/2336464; Chen and Yi 2021, doi:10.1007/s10463-020-00755-2). The proposed boosting algorithm is then applied to the corrected estimating function to select variables and derive the corresponding estimators. The package can be used to model financial data (e.g., the bankruptcy data at https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction) or gene expression data (e.g., the Leukemia data in the R package "SIS").
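
As a hedged illustration of the response-correction idea (a generic sketch, not the package's exact estimating function): if p11 denotes the sensitivity P(observed 1 | true 1) and p00 the specificity P(observed 0 | true 0), then the pseudo-response (y* - (1 - p00)) / (p11 + p00 - 1) is unbiased for the true success probability, which a short Monte Carlo check verifies:

import numpy as np

# Hedged sketch: unbiasedness of the misclassification-corrected pseudo-response.
rng = np.random.default_rng(1)
n = 200000
p11, p00 = 0.9, 0.85                         # sensitivity and specificity (assumed known)
prob = 0.3                                   # true P(y = 1)
y = rng.binomial(1, prob, n)                 # true responses
flip1 = rng.binomial(1, 1 - p11, n)          # errors on true 1s
flip0 = rng.binomial(1, 1 - p00, n)          # errors on true 0s
y_star = np.where(y == 1, 1 - flip1, flip0)  # observed, misclassified responses
y_corr = (y_star - (1 - p00)) / (p11 + p00 - 1)
print(y_corr.mean())                         # approximately 0.3, the true probability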

The package provides three functions: ME_Generate, LR_Boost, and PM_Boost. ME_Generate generates artificial data subject to error-prone covariates and misclassified binary responses. LR_Boost and PM_Boost correct for measurement error effects in responses, covariates, or both, and then run the boosting procedure to perform variable selection and estimation for logistic regression models and probit models, respectively.

Installation

The package requires numpy, pandas, and scipy (it uses scipy.stats, numpy.linalg, and the standard-library math module). The package is tested on Python 3.5 and above.

To install the package from PyPI, execute:

pip install BOOME

Demonstration and Example Implementation

We describe three functions in the package and give a simple demonstration.

1. ME_Generate(n,beta,matrix,X,gamma)

This function is used to generate the error-prone data.

Arguments:

  • n: The number of observations.
  • beta: A p-dimensional vector of parameters in the regression model that characterizes the relationship between the response and covariates.
  • matrix: A user-specified covariance matrix in the classical measurement error model.
  • X: A user-specified matrix of predictors.
  • gamma: A vector of parameters, specified by the user, that models the misclassification probabilities.

Outcomes:

  • data: A dataset with error-prone predictors and responses. It is a dataframe in which the column labeled y is the response and the column labeled j is the jth predictor.
  • pr: The two misclassification probabilities (specificity and sensitivity) in the misclassification models.
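
A note on the matrix argument before the example: it is the covariance of the additive error in the classical measurement error model, in which the observed predictors equal the true predictors plus mean-zero noise. A minimal sketch of that mechanism follows (the exact generator inside ME_Generate may differ):

import numpy as np

# Hedged sketch of the classical additive measurement error model:
# observed W = true X + U, with U ~ N(0, Sigma).
rng = np.random.default_rng(0)
n, p = 100, 5                              # small dimensions for illustration
Sigma = np.identity(p) * 0.2               # plays the role of the matrix argument
X_true = rng.normal(size=(n, p))
U = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
W = X_true + U                             # error-prone predictors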

Example code:

Construction of X

import numpy as np

# 100 observations of 1000 independent N(0, 1) predictors
X=[]
for i in range(100):
    X.append(np.random.normal(0, 1, 1000))
X=np.array(X)

Parameters

n=100                              # number of observations
beta=[1]*3+[0]*997                 # only the first three predictors carry signal
cov=np.identity(1000)*0.2          # covariance matrix of the additive measurement error
gamma=[[1],[1]*1000,[1],[1]*1000]  # parameters of the misclassification models

Generation of error-prone data

from BOOME.algorithm import ME_Generate

ME=ME_Generate(n,beta,cov,X,gamma)
pr=ME[0]    # misclassification probabilities (specificity, sensitivity)
data=ME[1]  # dataframe of error-prone predictors and responses
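
Assuming the outcomes described above, the generated objects can be inspected directly:

# Quick inspection of the generated objects
print(pr)                        # specificity and sensitivity of the misclassification models
print(data.shape)                # 100 rows; 1000 predictor columns plus the y column
print(data['y'].value_counts())  # distribution of the observed binary response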

2. LR_Boost(X,Y,ite,thres,correct_X,correct_Y,pr,lr,matrix)

This function implements the BOOME method for logistic regression models.

Arguments:

  • X: A matrix of continuous predictors that are precisely measured or subject to measurement error.
  • Y: A vector of binary responses that are precisely measured or subject to misclassification.
  • ite: The number of iterations for the boosting algorithm.
  • thres: A threshold value used to retain variables.
  • correct_X: Determines whether measurement error in the predictors is corrected. Select "1" if correction is needed and "0" otherwise.
  • correct_Y: Determines whether misclassification in the response is corrected. Select "1" if correction is needed and "0" otherwise.
  • pr: The two misclassification probabilities (specificity and sensitivity) in the misclassification models.
  • lr: A learning rate used to update the estimators in the boosting algorithm (see the sketch after the outcome list).
  • matrix: A covariance matrix in the classical measurement error model.

Outcomes:

  • estimated coefficients: A vector of estimated coefficients.
  • predictors: Indices of the nonzero entries among the estimated coefficients.
  • number of predictors: The number of nonzero entries among the estimated coefficients.
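
To make the roles of ite and lr concrete, here is a hedged sketch of a generic componentwise boosting step on a logistic-regression estimating function; BOOME's actual update rule and its use of thres may differ in detail:

import numpy as np

# Generic componentwise boosting on a logistic score function (illustrative only).
# Each iteration updates only the coordinate with the largest score entry by lr,
# so after ite iterations most coordinates remain exactly zero.
def componentwise_boost(X, y, ite, lr):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(ite):
        resid = y - 1 / (1 + np.exp(-X @ beta))  # response minus fitted probability
        score = X.T @ resid / n                  # estimating function, one entry per coordinate
        j = np.argmax(np.abs(score))             # coordinate with the strongest signal
        beta[j] += lr * score[j]                 # small step on that coordinate only
    return beta                                  # nonzero entries are the selected predictors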

Example code:

Decompose data into x and y

x=data.drop(['y'],axis=1)
y=data[['y']]

Implementation of LR_Boost

from BOOME.algorithm import LR_Boost  # assuming LR_Boost lives in the same module as ME_Generate

ite=1000     # number of boosting iterations
thres=0.9    # threshold for retaining variables
lr=0.00015   # learning rate

LR_Boost(x,y,ite,thres,1,1,pr,lr,cov)
    estimated coefficient :[1.034212979725094, 1.1510170699913091, 1.0499444673865401, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.16574049090435286, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.11880684398132933, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.012977857969749283, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.010305349538670168, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05350619906941062, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0355414321875259, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.2542110310704013, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.08084583303060171, 0.0, 0.0, 0.0, ]
    predictors:[1, 2, 3, 26, 33, 52, 129, 148, 229, 300, 346, 421, 480, 520, 523, 543, 562, 589, 590, 592, 634, 639, 640, 650, 668, 739, 768, 774, 798, 851, 932, 936, 983]
    number of predictors:33

3. PM_Boost(X,Y,ite,thres,correct_X,correct_Y,pr,lr,matrix)

The arguments of PM_Boost and the outcomes it produces are the same as those of LR_Boost; the underlying model is a probit model rather than a logistic regression model.
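
The only modeling difference between the two functions is the link function: logistic regression models P(y = 1 | x) with the logistic function of the linear predictor, while the probit model uses the standard normal CDF. A minimal comparison of the two mean functions:

import numpy as np
from scipy.special import expit
from scipy.stats import norm

# The two mean functions behind LR_Boost and PM_Boost.
eta = np.linspace(-3, 3, 7)  # values of the linear predictor x @ beta
print(expit(eta))            # logistic model: P(y = 1 | x)
print(norm.cdf(eta))         # probit model:   P(y = 1 | x)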

Example code:

Construction of X

# 1000 observations of 100 independent N(0, 1) predictors
X=[]
for i in range(1000):
    X.append(np.random.normal(0, 1, 100))
X=np.array(X)

Parameters

n=1000                           # number of observations
beta=[1]*3+[0]*97                # only the first three predictors carry signal
cov=np.identity(100)*0.2         # covariance matrix of the additive measurement error
gamma=[[1],[1]*100,[1],[1]*100]  # parameters of the misclassification models

Generation of error-prone data

ME=ME_Generate(n,beta,cov,X,gamma)
data=ME[1]
pr=ME[0]
x=data.drop(['y'],axis=1)
y=data[['y']]

Implementation of PM_Boost

from BOOME.algorithm import PM_Boost  # assuming PM_Boost lives in the same module as ME_Generate

ite=1000      # number of boosting iterations
thres=0.9     # threshold for retaining variables
lr=0.000012   # learning rate

PM_Boost(x,y,ite,thres,1,1,pr,lr,cov)
    estimated coefficients:[0.9769572712462923, 1.0148426909295771, 0.9694489897164331, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.02976937695270201, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.02489111873996184, 0.0, 0.0, 0.010078027643568152, 0.0, 0.0, 0.0, 0.030030345938361926, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011519695016413343, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    predictors:[1, 2, 3, 10, 31, 34, 38, 92]
    number of predictors:8
