Harmonizing neuroimaging data across sites. Implementation of neurocombat using sklearn format
NeuroCombat-sklearn
Adjust for batch effects using an empirical Bayes framework
ComBat allows users to adjust for batch effects in datasets where the batch covariate is known, using methodology described in Johnson et al. 2007. It uses either parametric or non-parametric empirical Bayes frameworks for adjusting data for batch effects. Users are returned an expression matrix that has been corrected for batch effects. The input data are assumed to be cleaned and normalized before batch effect removal.
The ComBat function adjusts for known batches using an empirical Bayesian framework [1]. In order to use the function, you must have a known batch variable in your dataset.
This package provides ComBat in a scikit-learn compatible format.
The aim of the standardization procedure presented in section 3.1 is to reduce gene-to-gene variation in the data, because genes in the array are expected to have different expression profiles or distributions. However, we do expect phenomena that cause batch effects to affect many genes in similar ways. To more clearly extract the common batch biases from the data, the standardization procedure standardizes all genes to have similar overall mean and variance. On this scale, batch effect estimators can be compared and pooled across genes to create robust estimators for batch effects. Without standardization, the gene-specific variation increases the noise in the data and inflates the prior variance, decreasing the amount of shrinkage that occurs. Therefore, standardization is crucial for EB shrinkage methods. However, this feature is not present in many EB methods for Affymetrix arrays.
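As a rough illustration of the standardization idea, the sketch below rescales each feature (a gene, or an imaging measure) to zero mean and unit variance using only the Python standard library. It is a simplified stand-in: Johnson et al. standardize using model-based estimates of the feature mean and pooled variance, and the function name `standardize_features` and the toy data are made up for this example.

```python
from statistics import mean, pstdev


def standardize_features(data):
    """Standardize each feature (column) to zero mean and unit variance.

    `data` is a list of samples, each a list of feature values.
    Assumes every feature varies across samples (non-zero standard deviation).
    Returns (standardized_data, feature_means, feature_sds).
    """
    columns = list(zip(*data))              # one tuple of values per feature
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # population standard deviation
    standardized = [
        [(value - m) / s for value, m, s in zip(sample, means, sds)]
        for sample in data
    ]
    return standardized, means, sds


# Toy data (illustrative only): 4 samples x 2 features on very different scales.
toy = [[10.0, 1000.0], [12.0, 1100.0], [14.0, 900.0], [16.0, 1000.0]]
z, feature_means, feature_sds = standardize_features(toy)
```

After this step both features live on the same scale, which is what makes it reasonable to compare and pool batch effect estimates across features.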
ComBat is an empirical Bayes (EB) method that is robust for adjusting for batch effects in data whose batch sizes are small.
Location and scale (L/S) adjustments can be defined as a wide family of adjustments in which one assumes a model for the location (mean) and/or scale (variance) of the data within batches and then adjusts the batches to meet assumed model specifications. Therefore, L/S batch adjustments assume that the batch effects can be modeled out by standardizing means and variances across batches. These adjustments can range from simple gene-wise mean and variance standardization to complex linear or non-linear adjustments across the genes.

One straightforward L/S batch adjustment is to mean center and standardize the variance of each batch for each gene independently. Such a method is currently implemented in the dChip software (Li and Wong, 2003), designated as “using standardized separators” (see Figure 1(b)). In more complex situations such as unbalanced designs or when incorporating numerical covariates, a more general L/S framework must be used. For example, let $Y_{ijg}$ represent the expression value for gene $g$ for sample $j$ from batch $i$. Define an L/S model that assumes

$$Y_{ijg} = \alpha_g + X\beta_g + \gamma_{ig} + \delta_{ig}\varepsilon_{ijg}, \qquad (2.1)$$

where $\alpha_g$ is the overall gene expression, $X$ is a design matrix for sample conditions, and $\beta_g$ is the vector of regression coefficients corresponding to $X$. The error terms, $\varepsilon_{ijg}$, can be assumed to follow a Normal distribution with expected value of zero and variance $\sigma_g^2$. The $\gamma_{ig}$ and $\delta_{ig}$ represent the additive and multiplicative batch effects of batch $i$ for gene $g$, respectively. The batch-adjusted data, $Y^*_{ijg}$, are given by

$$Y^*_{ijg} = \frac{Y_{ijg} - \hat{\alpha}_g - X\hat{\beta}_g - \hat{\gamma}_{ig}}{\hat{\delta}_{ig}} + \hat{\alpha}_g + X\hat{\beta}_g, \qquad (2.2)$$

where $\hat{\alpha}_g$, $\hat{\beta}_g$, $\hat{\gamma}_{ig}$, and $\hat{\delta}_{ig}$ are estimators for the parameters $\alpha_g$, $\beta_g$, $\gamma_{ig}$, and $\delta_{ig}$ based on the model.
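The L/S model and adjustment above can be sketched for a single feature as follows. This is a minimal illustration under simplifying assumptions, not the package's implementation: there are no covariates (the $X\beta_g$ term is dropped), the estimators are plain sample statistics with no empirical Bayes shrinkage, and the function name `ls_adjust` and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def ls_adjust(values, batches):
    """Plain location/scale adjustment of one feature (no covariates, no EB).

    Returns (adjusted_values, sigma_hat).
    """
    alpha_hat = mean(values)                       # overall level of the feature
    groups = {}
    for y, b in zip(values, batches):
        groups.setdefault(b, []).append(y)

    # Additive batch effect: batch mean minus the overall level.
    gamma_hat = {b: mean(ys) - alpha_hat for b, ys in groups.items()}

    # Pooled residual standard deviation after removing alpha and gamma.
    residuals = [y - alpha_hat - gamma_hat[b] for y, b in zip(values, batches)]
    sigma_hat = sqrt(mean([r * r for r in residuals]))

    # Multiplicative batch effect: batch spread relative to the pooled spread.
    delta_hat = {b: pstdev(ys) / sigma_hat for b, ys in groups.items()}

    # Remove the batch effects, then restore the overall level.
    adjusted = [
        (y - alpha_hat - gamma_hat[b]) / delta_hat[b] + alpha_hat
        for y, b in zip(values, batches)
    ]
    return adjusted, sigma_hat


# Toy data (illustrative only): batch "B" is shifted and more spread out.
values = [1.0, 2.0, 3.0, 10.0, 14.0, 18.0]
batches = ["A", "A", "A", "B", "B", "B"]
adjusted, sigma_hat = ls_adjust(values, batches)
```

In this sketch every batch ends up with the same mean (`alpha_hat`) and the same standard deviation (`sigma_hat`). ComBat differs in that it shrinks the batch effect estimates with empirical Bayes before applying the adjustment.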
ComBat for correcting batch effects using the scikit-learn format
Check https://github.com/scikit-learn/scikit-learn/blob/1495f6924/sklearn/preprocessing/label.py
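For readers unfamiliar with the scikit-learn convention, the sketch below shows the general fit/transform pattern with a deliberately simple, hypothetical transformer. `BatchMeanCenterer` is not this package's class and does not implement ComBat; it only removes per-batch mean offsets. It is meant to show how parameters are learned in `fit`, stored in attributes with a trailing underscore, and reused in `transform`.

```python
from statistics import mean


class BatchMeanCenterer:
    """Hypothetical transformer that removes per-batch mean offsets.

    Illustrates the fit/transform pattern only; this is not ComBat.
    """

    def fit(self, X, batch):
        # Learn the overall mean of every feature (column).
        self.grand_mean_ = [mean(col) for col in zip(*X)]
        # Learn, for every batch, how far its feature means sit from the overall mean.
        self.batch_offset_ = {}
        for b in set(batch):
            rows = [row for row, rb in zip(X, batch) if rb == b]
            batch_means = [mean(col) for col in zip(*rows)]
            self.batch_offset_[b] = [
                bm - gm for bm, gm in zip(batch_means, self.grand_mean_)
            ]
        return self  # returning self allows fit(...).transform(...)

    def transform(self, X, batch):
        # Subtract the learned offset of each sample's batch.
        return [
            [value - offset for value, offset in zip(row, self.batch_offset_[b])]
            for row, b in zip(X, batch)
        ]

    def fit_transform(self, X, batch):
        return self.fit(X, batch).transform(X, batch)


# Toy data (illustrative only): 4 samples x 2 features from two sites.
X_train = [[1.0, 10.0], [3.0, 14.0], [5.0, 20.0], [7.0, 24.0]]
sites = ["a", "a", "b", "b"]
model = BatchMeanCenterer()
X_harmonized = model.fit_transform(X_train, sites)

# New samples from a known site reuse the offsets learned during fit.
X_new = model.transform([[2.0, 12.0]], ["a"])
```

Separating `fit` from `transform` is what allows harmonization parameters to be estimated on training data only and then applied unchanged to held-out data.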
Just as with sva, we then need to create a model matrix for the adjustment variables, including the variable of interest. Note that you do not include batch in creating this model matrix; it will be included later in the ComBat function. In this case there are no other adjustment variables, so we simply fit an intercept term.

> modcombat = model.matrix(~1, data=pheno)

Note that adjustment variables will be treated as given to the ComBat function. This means that if you are trying to adjust for a categorical variable with p different levels, you will need to give ComBat p-1 indicator variables for this covariate. We recommend using the model.matrix function to set these up. For continuous adjustment variables, just give a vector containing the covariate values in a single column of the model matrix.

We now apply the ComBat function to the data, using parametric empirical Bayesian adjustments.

> combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)
Standardizing Data across genes

This returns an expression matrix with the same dimensions as your original dataset. This new expression matrix has been adjusted for batch. Significance analysis can then be performed directly on the adjusted data using the model matrix and null model matrix as described before:

> pValuesComBat = f.pvalue(combat_edata,mod,mod0)
> qValuesComBat = p.adjust(pValuesComBat,method="BH")

These P-values and Q-values now account for the known batch effects included in the batch variable.

There are a few additional options for the ComBat function. By default, it performs parametric empirical Bayesian adjustments. If you would like to use non-parametric empirical Bayesian adjustments, use the par.prior=FALSE option (this will take longer). Additionally, use the prior.plots=TRUE option to give prior plots with black as a kernel estimate of the empirical batch effect density and red as the parametric estimate. For example, you might choose to use the parametric Bayesian adjustments for your data, but then check the plots to ensure that the estimates were reasonable.

Also, we have now added the mean.only=TRUE option, which only adjusts the mean of the batch effects across batches (the default adjusts the mean and variance). This option is recommended for cases where milder batch effects are expected (so there is no need to adjust the variance), or in cases where the variances are expected to be different across batches due to the biology. For example, suppose a researcher wanted to project a knockdown genomic signature into the TCGA data. In this case, the knockdown samples may be very similar to each other (low variance), whereas the signature will be at varying levels in the TCGA patient data. Thus the variances may be very different between the two batches (signature perturbation samples vs TCGA), so only adjusting the mean of the batch effect across the samples might be desired in this case.

Finally, we have now added a ref.batch parameter, which allows users to select one batch as a reference to which other batches will be adjusted. Specifically, the means and variances of the non-reference batches will be adjusted to match the mean/variance of the reference batch. This is a useful feature for cases where one batch is larger or of better quality. In addition, this will be useful in biomarker situations where the researcher wants to fix the training set/model and then adjust test sets to the reference/training batch. This avoids test-set bias in such studies.
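To make the difference between the mean-only and reference-batch options concrete, here is a small Python sketch for a single feature. It is an illustration under simplifying assumptions, not the R function's code: it uses plain sample means and standard deviations with no empirical Bayes shrinkage and no covariates, and the name `adjust_to_reference` and its `mean_only` argument are hypothetical, chosen only to mirror the `ref.batch` and `mean.only` options described above.

```python
from statistics import mean, pstdev


def adjust_to_reference(values, batches, ref_batch, mean_only=False):
    """Align non-reference batches of one feature to a reference batch.

    With mean_only=True only the batch mean is shifted; otherwise the
    batch spread is also rescaled to the reference spread.
    """
    ref = [y for y, b in zip(values, batches) if b == ref_batch]
    ref_mean, ref_sd = mean(ref), pstdev(ref)

    # Per-batch mean and standard deviation.
    stats = {}
    for b in set(batches):
        ys = [y for y, bb in zip(values, batches) if bb == b]
        stats[b] = (mean(ys), pstdev(ys))

    adjusted = []
    for y, b in zip(values, batches):
        if b == ref_batch:
            adjusted.append(y)          # the reference batch is left unchanged
            continue
        b_mean, b_sd = stats[b]
        scale = 1.0 if mean_only else ref_sd / b_sd
        adjusted.append((y - b_mean) * scale + ref_mean)
    return adjusted


# Toy data (illustrative only): the "test" batch is shifted and more variable.
values = [1.0, 2.0, 3.0, 10.0, 14.0, 18.0]
batches = ["train", "train", "train", "test", "test", "test"]

full = adjust_to_reference(values, batches, ref_batch="train")
shifted = adjust_to_reference(values, batches, ref_batch="train", mean_only=True)
```

In `shifted` the test batch keeps its original spread and only its mean moves to the training mean, while in `full` both the mean and the spread of the test batch are matched to the training batch. The training values are identical in both outputs, which is the property that keeps a fixed training set or model intact.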
References: If you are using ComBat for the harmonization of multisite imaging data, please cite the following papers:
| Citation | Paper | Link |
| --- | --- | --- |
| ComBat for multi-site DTI data | Jean-Philippe Fortin, Drew Parker, Birkan Tunc, Takanori Watanabe, Mark A Elliott, Kosha Ruparel, David R Roalf, Theodore D Satterthwaite, Ruben C Gur, Raquel E Gur, Robert T Schultz, Ragini Verma, Russell T Shinohara. Harmonization Of Multi-Site Diffusion Tensor Imaging Data. NeuroImage, 161, 149-170, 2017 | Link |
| ComBat for multi-site cortical thickness measurements | Jean-Philippe Fortin, Nicholas Cullen, Yvette I. Sheline, Warren D. Taylor, Irem Aselcioglu, Philip A. Cook, Phil Adams, Crystal Cooper, Maurizio Fava, Patrick J. McGrath, Melvin McInnis, Mary L. Phillips, Madhukar H. Trivedi, Myrna M. Weissman, Russell T. Shinohara. Harmonization of cortical thickness measurements across scanners and sites. NeuroImage, 167, 104-120, 2018 | Link |
| Original ComBat paper for gene expression array | W. Evan Johnson and Cheng Li. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118-127, 2007 | Link |