Algorithms for constrained Lasso problems
Project description
classo: a Python package for constrained sparse regression and classification
classo is a Python package that enables sparse and robust linear regression and classification with linear equality constraints on the model parameters. For detailed info, one can check the documentation.
The forward model is assumed to be:
Here, y and X are given outcome and predictor data. The vector y can be continuous (for regression) or binary (for classification). C is a general constraint matrix. The vector β comprises the unknown coefficients and σ an unknown scale.
The package handles several different estimators for inferring β (and σ), including the constrained Lasso, the constrained scaled Lasso, and sparse Huber Mestimation with linear equality constraints. Several different algorithmic strategies, including path and proximal splitting algorithms, are implemented to solve the underlying convex optimization problems.
We also include two model selection strategies for determining the sparsity of the model parameters: kfold crossvalidation and stability selection.
This package is intended to fill the gap between popular python tools such as scikitlearn which CANNOT solve sparse constrained problems and generalpurpose optimization solvers that do not scale well for the considered problems.
Below we show several use cases of the package, including an application of sparse logcontrast regression tasks for compositional microbiome data.
The code builds on results from several papers which can be found in the References. We also refer to the accompanying JOSS paper submission, also available on arXiv.
Table of Contents
Installation
classo is available on pip. You can install the package in the shell using
pip install classo
To use the classo package in Python, type
from classo import *
The classo
package depends on the following Python packages:
numpy
;matplotlib
;scipy
;pandas
;h5py
.
Regression and classification problems
The classo package can solve six different types of estimation problems: four regressiontype and two classificationtype formulations.
[R1] Standard constrained Lasso regression:
This is the standard Lasso problem with linear equality constraints on the β vector. The objective function combines LeastSquares for model fitting with l1 penalty for sparsity.
[R2] Contrained sparse Huber regression:
This regression problem uses the Huber loss as objective function for robust model fitting with l1 and linear equality constraints on the β vector. The parameter ρ=1.345.
[R3] Contrained scaled Lasso regression:
This formulation is similar to [R1] but allows for joint estimation of the (constrained) β vector and the standard deviation σ in a concomitant fashion (see References [4,5] for further info). This is the default problem formulation in classo.
[R4] Contrained sparse Huber regression with concomitant scale estimation:
This formulation combines [R2] and [R3] to allow robust joint estimation of the (constrained) β vector and the scale σ in a concomitant fashion (see References [4,5] for further info).
[C1] Contrained sparse classification with Square Hinge loss:
where the x_{i} are the rows of X and l is defined as:
This formulation is similar to [R1] but adapted for classification tasks using the Square Hinge loss with (constrained) sparse β vector estimation.
[C2] Contrained sparse classification with Huberized Square Hinge loss:
where the x_{i} are the rows of X and l_{ρ} is defined as:
This formulation is similar to [C1] but uses the Huberized Square Hinge loss for robust classification with (constrained) sparse β vector estimation.
Getting started
Basic example
We begin with a basic example that shows how to run classo on synthetic data. This example and the next one can be found on the notebook 'Synthetic data Notebook.ipynb'
The classo package includes
the routine random_data
that allows you to generate problem instances using normally distributed data.
m, d, d_nonzero, k, sigma = 100, 200, 5, 1, 0.5 (X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum=True, seed=1)
This code snippet generates a problem instance with sparse β in dimension
d=100 (sparsity d_nonzero=5). The design matrix X comprises n=100 samples generated from an i.i.d standard normal
distribution. The dimension of the constraint matrix C is d x k matrix. The noise level is σ=0.5.
The input zerosum=True
implies that C is the allones vector and Cβ=0. The ndimensional outcome vector y
and the regression vector β is then generated to satisfy the given constraints.
Next we can define a default classo problem instance with the generated data:
problem = classo_problem(X, y, C)
You can look at the generated problem instance by typing:
print(problem)
This gives you a summary of the form:
FORMULATION: R3
MODEL SELECTION COMPUTED:
Stability selection
STABILITY SELECTION PARAMETERS:
numerical_method : not specified
method : first
B = 50
q = 10
percent_nS = 0.5
threshold = 0.7
lamin = 0.01
Nlam = 50
As we have not specified any problem, algorithm, or model selection settings, this problem instance represents the default settings for a classo instance:
 The problem is of regression type and uses formulation [R3], i.e. with concomitant scale estimation.
 The default optimization scheme is the path algorithm (see Optimization schemes for further info).
 For model selection, stability selection at a theoretically derived λ value is used (see Reference [4] for details). Stability selection comprises a relatively large number of parameters. For a description of the settings, we refer to the more advanced examples below and the API.
You can solve the corresponding classo problem instance using
problem.solve()
After completion, the results of the optimization and model selection routines can be visualized using
print(problem.solution)
The command shows the running time(s) for the classo problem instance, and the selected variables for sability selection
STABILITY SELECTION :
Selected variables : 1 5 14 17 18
Running time : 0.663s
Here, we only used stability selection as default model selection strategy. The command also allows you to inspect the computed stability profile for all variables at the theoretical λ
The refitted β values on the selected support are also displayed in the next plot
Advanced example
In the next example, we show how one can specify different aspects of the problem formulation and model selection strategy.
m, d, d_nonzero, k, sigma = 100, 200, 5, 0, 0.5 (X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum = True, seed = 4) problem = classo_problem(X, y, C) problem.formulation.huber = True problem.formulation.concomitant = False problem.model_selection.CV = True problem.model_selection.LAMfixed = True problem.model_selection.PATH = True problem.model_selection.StabSelparameters.method = 'max' problem.model_selection.CVparameters.seed = 1 problem.model_selection.LAMfixedparameters.rescaled_lam = True problem.model_selection.LAMfixedparameters.lam = .1 problem.solve() print(problem) print(problem.solution)
Results :
FORMULATION: R2
MODEL SELECTION COMPUTED:
Lambda fixed
Path
Cross Validation
Stability selection
LAMBDA FIXED PARAMETERS:
numerical_method = PathAlg
rescaled lam : True
threshold = 0.106
lam = 0.1
theoretical_lam = 0.224
PATH PARAMETERS:
numerical_method : PathAlg
lamin = 0.001
Nlam = 80
CROSS VALIDATION PARAMETERS:
numerical_method : PathAlg
oneSE method : True
Nsubset = 5
lamin = 0.001
Nlam = 80
STABILITY SELECTION PARAMETERS:
numerical_method : PathAlg
method : max
B = 50
q = 10
percent_nS = 0.5
threshold = 0.7
lamin = 0.01
Nlam = 50
LAMBDA FIXED :
Selected variables : 17 59 76 123 137
Running time : 0.234s
PATH COMPUTATION :
Running time : 0.557s
CROSS VALIDATION :
Selected variables : 16 17 57 59 64 73 74 76 93 115 123 134 137 181
Running time : 1.751s
STABILITY SELECTION :
Selected variables : 1 3 7 12
Running time : 8.391s
Logcontrast regression for microbiome data
In the the accompanying notebook we study several microbiome data sets. We showcase two examples below.
BMI prediction using the COMBO dataset
We first consider the COMBO data set and show how to predict Body Mass Index (BMI) from microbial genus abundances and two noncompositional covariates using "filtered_data".
from classo import * # Load microbiome and covariate data X X0 = csv_to_np('COMBO_data/complete_data/GeneraCounts.csv', begin = 0).astype(float) X_C = csv_to_np('COMBO_data/CaloriData.csv', begin = 0).astype(float) X_F = csv_to_np('COMBO_data/FatData.csv', begin = 0).astype(float) # Load BMI measurements y y = csv_to_np('COMBO_data/BMI.csv', begin = 0).astype(float)[:, 0] labels = csv_to_np('COMBO_data/complete_data/GeneraPhylo.csv').astype(str)[:, 1] # Normalize/transform data y = y  np.mean(y) #BMI data (n = 96) X_C = X_C  np.mean(X_C, axis = 0) #Covariate data (Calorie) X_F = X_F  np.mean(X_F, axis = 0) #Covariate data (Fat) X0 = clr(X0, 1 / 2).T # Set up design matrix and zerosum constraints for 45 genera X = np.concatenate((X0, X_C, X_F, np.ones((len(X0), 1))), axis = 1) # Joint microbiome and covariate data and offset label = np.concatenate([labels, np.array(['Calorie', 'Fat', 'Bias'])]) C = np.ones((1, len(X[0]))) C[0, 1], C[0, 2], C[0, 3] = 0., 0., 0. # Set up classso problem problem = classo_problem(X, y, C, label = label) # Use stability selection with theoretical lambda [Combettes & Müller, 2020b] problem.model_selection.StabSelparameters.method = 'lam' problem.model_selection.StabSelparameters.threshold_label = 0.5 # Use formulation R3 problem.formulation.concomitant = True problem.solve() print(problem) print(problem.solution) # Use formulation R4 problem.formulation.huber = True problem.formulation.concomitant = True problem.solve() print(problem) print(problem.solution)
pH prediction using the Central Park soil dataset
The next microbiome example considers the Central Park Soil dataset from Ramirez et al.. The sample locations are shown in the Figure on the right. The task is to predict pH concentration in the soil from microbial abundance data. This task was also considered in TreeAggregated Predictive Modeling of Microbiome Data.
Code to run this application is available in the accompanying notebook under pH data
. Below is a summary of a classo problem instance (using the R3 formulation).
FORMULATION: R3
MODEL SELECTION COMPUTED:
Lambda fixed
Path
Stability selection
LAMBDA FIXED PARAMETERS:
numerical_method = PathAlg
rescaled lam : True
threshold = 0.004
lam : theoretical
theoretical_lam = 0.2182
PATH PARAMETERS:
numerical_method : PathAlg
lamin = 0.001
Nlam = 80
STABILITY SELECTION PARAMETERS:
numerical_method : PathAlg
method : lam
B = 50
q = 10
percent_nS = 0.5
threshold = 0.7
lam = theoretical
theoretical_lam = 0.3085
The classo estimation results are summarized below:
LAMBDA FIXED :
Sigma = 0.198
Selected variables : 14 18 19 39 43 57 62 85 93 94 104 107
Running time : 0.008s
PATH COMPUTATION :
Running time : 0.12s
STABILITY SELECTION :
Selected variables : 2 12 15
Running time : 0.287s
Optimization schemes
The available problem formulations [R1C2] require different algorithmic strategies for efficiently solving the underlying optimization problem. We have implemented four algorithms (with provable convergence guarantees) that vary in generality and are not necessarily applicable to all problems. For each problem type, classo has a default algorithm setting that proved to be the fastest in our numerical experiments.
Path algorithms (PathAlg)
This is the default algorithm for nonconcomitant problems [R1,R3,C1,C2]. The algorithm uses the fact that the solution path along λ is piecewise affine (as shown, e.g., in [1]). When LeastSquares is used as objective function, we derive a novel efficient procedure that allows us to also derive the solution for the concomitant problem [R2] along the path with little extra computational overhead.
Projected primaldual splitting method (PPDS):
This algorithm is derived from [2] and belongs to the class of proximal splitting algorithms. It extends the classical ForwardBackward (FB) (aka proximal gradient descent) algorithm to handle an additional linear equality constraint via projection. In the absence of a linear constraint, the method reduces to FB. This method can solve problem [R1]. For the Huber problem [R3], PPDS can solve the meanshift formulation of the problem (see [6]).
Projectionfree primaldual splitting method (PFPDS):
This algorithm is a special case of an algorithm proposed in [3] (Eq.4.5) and also belongs to the class of proximal splitting algorithms. The algorithm does not require projection operators which may be beneficial when C has a more complex structure. In the absence of a linear constraint, the method reduces to the ForwardBackwardForward scheme. This method can solve problem [R1]. For the Huber problem [R3], PFPDS can solve the meanshift formulation of the problem (see [6]).
DouglasRachfordtype splitting method (DR)
This algorithm is the most general algorithm and can solve all regression problems [R1R4]. It is based on Doulgas Rachford splitting in a higherdimensional product space. It makes use of the proximity operators of the perspective of the LS objective (see [4,5]) The Huber problem with concomitant scale [R4] is reformulated as scaled Lasso problem with the mean shift (see [6]) and thus solved in (n + d) dimensions.
Structure of the code
References

[1] B. R. Gaines, J. Kim, and H. Zhou, Algorithms for Fitting the Constrained Lasso, J. Comput. Graph. Stat., vol. 27, no. 4, pp. 861–871, 2018.

[2] L. BricenoArias and S.L. Rivera, A Projected Primal–Dual Method for Solving Constrained Monotone Inclusions, J. Optim. Theory Appl., vol. 180, Issue 3, March 2019.

[3] P. L. Combettes and J.C. Pesquet, PrimalDual Splitting Algorithm for Solving Inclusions with Mixtures of Composite, Lipschitzian, and ParallelSum Type Monotone Operators, SetValued and Variational Analysis, vol. 20, pp. 307330, 2012.

[4] P. L. Combettes and C. L. Müller, Perspective Mestimation via proximal decomposition, Electronic Journal of Statistics, 2020, Journal version

[5] P. L. Combettes and C. L. Müller, Regression models for compositional data: General logcontrast formulations, proximal optimization, and microbiome data applications, Statistics in Bioscience, 2020.

[6] A. Mishra and C. L. Müller, Robust regression with compositional covariates, arXiv, 2019.

[7] S. Rosset and J. Zhu, Piecewise linear regularized solution paths, Ann. Stat., vol. 35, no. 3, pp. 1012–1030, 2007.

[8] J. Bien, X. Yan, L. Simpson, and C. L. Müller, TreeAggregated Predictive Modeling of Microbiome Data, biorxiv, 2020.
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size  File type  Python version  Upload date  Hashes 

Filename, size c_lasso1.0.3py3noneany.whl (52.4 kB)  File type Wheel  Python version py3  Upload date  Hashes View 
Filename, size classo1.0.3.tar.gz (61.5 kB)  File type Source  Python version None  Upload date  Hashes View 