Cardinality- and budget-constrained feature selection for logistic regression using mixed-integer conic optimization
Project description
l0l2learn
Feature selection for logistic regression using mixed-integer conic optimization. Unlike Lasso-based approaches, l0l2learn directly optimizes feature subsets under explicit cardinality or budget constraints.
Overview
l0l2learn is a Python package that provides sklearn-style estimators for cardinality- and budget-constrained feature selection in logistic regression. The package currently includes:
- L0L2Classifier: L0-constrained L2-regularized logistic regression
- ResampledL0L2Classifier: resampling-based feature selection with frequency-based aggregation to improve selection stability
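For orientation, a cardinality- or budget-constrained, L2-regularized logistic regression is commonly written as the problem below. This is a sketch of the standard textbook formulation, not necessarily the exact model that l0l2learn passes to the solver: the label coding y_i in {-1, +1}, the scaling of the penalty, and the symbols n, p, and z are assumptions, while c, b, and lambd (written as \lambda) correspond to the hyperparameters of the same name.

```latex
\begin{aligned}
\min_{\beta_0,\,\beta,\,z}\quad
  & \sum_{i=1}^{n} \log\!\Bigl(1 + \exp\bigl(-y_i(\beta_0 + x_i^\top \beta)\bigr)\Bigr)
    + \lambda \lVert \beta \rVert_2^2 \\
\text{s.t.}\quad
  & \sum_{j=1}^{p} c_j z_j \le b, \\
  & z_j = 0 \;\Rightarrow\; \beta_j = 0, \qquad z_j \in \{0, 1\}, \qquad j = 1, \dots, p.
\end{aligned}
```

With all c_j equal to one, the first constraint reduces to a bound on the number of nonzero coefficients.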
Installation
To install the package, use the following command:
pip install l0l2learn
l0l2learn uses MOSEK as its conic solver, which requires a license. Please check the MOSEK website to request and set one up.
Quick Start
Feature Selection Without Resampling
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from l0l2learn import L0L2Classifier
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y
)
clf = L0L2Classifier(
    b=3,
    lambd=1.0
)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
print("ROC AUC: ", roc_auc_score(y_test, y_proba[:, 1]))
print("Coefficients: ", clf.coef_)
print("Intercept: ", clf.intercept_)
print("Support: ", clf.support_)
Feature Selection With Resampling
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from l0l2learn import ResampledL0L2Classifier
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y
)
clf = ResampledL0L2Classifier(
    b=3,
    param_grid={"lambd": [1.0]},
    n_resamples=3
)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
print("ROC AUC: ", roc_auc_score(y_test, y_proba[:, 1]))
print("Coefficients: ", clf.coef_)
print("Intercept: ", clf.intercept_)
print("Support: ", clf.support_)
print("VIFs: ", clf.variable_inclusion_frequencies_[clf.support_])
print("MSFs: ", clf.model_selection_frequencies_)
Hyperparameters
Feature Costs
Feature-specific costs can be supplied through c:
clf = L0L2Classifier(c=[1, 2, 5])
The optimization then takes into account that some features are more expensive to select than others.
Feature Selection Budget
The feature selection budget is controlled through b:
clf = L0L2Classifier(b=5)
When all feature costs are equal to one (the default), b directly controls the maximum number of selected features.
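The interaction between c and b can be illustrated with a small, solver-free sketch. It assumes that a feature subset is admissible when the summed costs of its features do not exceed the budget, which is consistent with the unit-cost case described above; the helper names total_cost and feasible_supports are hypothetical and not part of l0l2learn.

```python
from itertools import combinations


def total_cost(support, costs):
    """Sum the costs of the selected feature indices."""
    return sum(costs[j] for j in support)


def feasible_supports(costs, budget):
    """Enumerate every feature subset whose total cost fits the budget.

    Assumption: the budget constraint is sum(costs of selected) <= budget.
    """
    n = len(costs)
    return [
        subset
        for k in range(n + 1)
        for subset in combinations(range(n), k)
        if total_cost(subset, costs) <= budget
    ]


# Unit costs: the budget acts as a cap on the number of selected features.
unit = feasible_supports([1, 1, 1], budget=2)
largest_unit_support = max(len(s) for s in unit)

# Unequal costs: feature 2 (cost 5) uses up the whole budget on its own.
weighted = feasible_supports([1, 2, 5], budget=5)
print(largest_unit_support)
print(weighted)
```

Brute-force enumeration only works for a handful of features. It is shown here purely to make the constraint concrete, not as a replacement for the mixed-integer conic solver.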
L2 Regularization
The L2 regularization strength is given by lambd:
clf = L0L2Classifier(lambd=0.1)
Larger values shrink the coefficients more strongly, which can reduce overfitting and make the fitted model more robust.
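To make the role of lambd concrete, the following sketch evaluates a penalized logistic loss by hand. It assumes the common form "logistic loss plus lambd times the squared L2 norm of the coefficients" with labels in {0, 1}; the exact scaling used inside l0l2learn is not documented here, and penalized_log_loss is a hypothetical helper.

```python
import math


def penalized_log_loss(X, y, coef, intercept, lambd):
    """Logistic loss plus lambd * squared L2 norm of the coefficients.

    Assumptions: labels are in {0, 1}, the loss is summed (not averaged),
    and the intercept is not penalized.
    """
    loss = 0.0
    for row, label in zip(X, y):
        z = intercept + sum(w * v for w, v in zip(coef, row))
        # The per-sample loss is log(1 + exp(-z)) for label 1
        # and log(1 + exp(z)) for label 0.
        margin = z if label == 1 else -z
        loss += math.log1p(math.exp(-margin))
    penalty = lambd * sum(w * w for w in coef)
    return loss + penalty


X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, 0]
coef = [2.0, -2.0]

low = penalized_log_loss(X, y, coef, intercept=0.0, lambd=0.1)
high = penalized_log_loss(X, y, coef, intercept=0.0, lambd=1.0)
print(low, high)
```

For the same coefficients, the data-fit term is unchanged and only the penalty grows with lambd, so large coefficients become increasingly costly.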
Number of Resamples
n_resamples determines how many resampled models are fitted:
clf = ResampledL0L2Classifier(b=5, n_resamples=99)
Larger values can improve frequency estimates but increase runtime.
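The frequency estimates mentioned above can be pictured as simple counts over the feature sets selected in each resample. The sketch below is an illustration of that idea, not the package's implementation; inclusion_frequencies and model_frequencies are hypothetical helpers, and the supports are made-up index tuples.

```python
from collections import Counter


def inclusion_frequencies(supports, n_features):
    """Fraction of resamples in which each feature was selected."""
    counts = Counter(j for support in supports for j in set(support))
    return [counts[j] / len(supports) for j in range(n_features)]


def model_frequencies(supports):
    """Fraction of resamples in which each exact feature set was selected."""
    counts = Counter(frozenset(support) for support in supports)
    return {model: c / len(supports) for model, c in counts.items()}


# Selected feature indices from four hypothetical resamples.
supports = [(0, 2), (0, 2), (0, 1), (0, 2)]

vifs = inclusion_frequencies(supports, n_features=4)
msfs = model_frequencies(supports)
print(vifs)
print(msfs)
```

With n resamples, every frequency is a multiple of 1/n, so a small n_resamples gives only a coarse picture of how stable a feature or model is.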
Other Hyperparameters
L0L2Classifier
- fit_intercept: Whether an intercept term is included in the logistic regression model.
- time_limit: Maximum runtime in seconds for the optimization problem.
- mosek_log: Enables printing of MOSEK solver output.
ResampledL0L2Classifier
- resampling: Controls whether and how rows, columns, both, or neither are resampled.
- n_row_subsamples: Number or fraction of observations used during row subsampling.
- n_column_subsamples: Number or fraction of features used during column subsampling.
- aggregation: Whether model selection or variable inclusion frequencies are used for aggregation.
- vif_threshold: Minimum variable inclusion frequency required when using aggregation="VIF".
- estimator: Alternative base estimator used instead of the default L0L2Classifier.
- param_grid: Hyperparameter grid used for cross-validation when tuning lambd.
- cv: Cross-validation strategy used for hyperparameter tuning.
- scoring: Scoring metric used to select the best hyperparameter configuration.
- numerical_features: Specifies which DataFrame columns should be treated as numerical features.
- categorical_features: Specifies which DataFrame columns should be treated as categorical features.
- fit_intercept: Whether an intercept term is included in the logistic regression model.
- mosek_time_limit: Maximum runtime in seconds for each individual optimization problem.
- total_time_limit: Maximum runtime in seconds for the complete resampling procedure.
- max_consecutive_failures: Stops resampling if too many consecutive model fits fail.
- mosek_log: Enables printing of MOSEK solver output.
- n_jobs: Number of parallel workers used during resampling.
- random_state: Controls the randomness of resampling and cross-validation procedures.
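The two aggregation modes named in the list above can be sketched as follows. This is an assumption-laden illustration: whether the package compares against vif_threshold with ">=" or ">", and how it breaks ties between equally frequent models, is not stated here, and aggregate_by_vif and aggregate_by_msf are hypothetical helpers operating on made-up frequencies.

```python
def aggregate_by_vif(vifs, threshold):
    """Keep the features whose inclusion frequency reaches the threshold.

    Assumption: the comparison is inclusive (frequency >= threshold).
    """
    return [j for j, freq in enumerate(vifs) if freq >= threshold]


def aggregate_by_msf(msfs):
    """Return the feature set that was selected most often across resamples."""
    best_model = max(msfs, key=msfs.get)
    return sorted(best_model)


# Hypothetical frequencies for four features and two distinct models.
vifs = [1.0, 0.25, 0.75, 0.0]
msfs = {frozenset({0, 2}): 0.75, frozenset({0, 1}): 0.25}

vif_support = aggregate_by_vif(vifs, threshold=0.5)
msf_support = aggregate_by_msf(msfs)
print(vif_support)
print(msf_support)
```

The two rules agree in this example but need not in general: thresholding inclusion frequencies can yield a feature set that never appeared as a whole in any single resample.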
License
This project is licensed under the MIT License. See the LICENSE file for details.
Authors
- Ricardo Knauer (HTW Berlin)
Download files
File details
Details for the file l0l2learn-0.1.0.tar.gz.
File metadata
- Download URL: l0l2learn-0.1.0.tar.gz
- Upload date:
- Size: 4.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6d26aa7f0905884d64079545d4d4515857ad3d3de17670490069b709d55fa468 |
| MD5 | e379fece4906ad0dae2b41d7b46d081d |
| BLAKE2b-256 | 8a4f02aa11109725502d734cc246eb9901290ea53a14017ada381c16892bbf0e |
File details
Details for the file l0l2learn-0.1.0-py3-none-any.whl.
File metadata
- Download URL: l0l2learn-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6a07e5e37d17871d09cf88900f9c6f2af9099bc1b9fc523034768cad586bf013 |
| MD5 | 5b93c7be936216569652da939411360b |
| BLAKE2b-256 | da3f47ea2914452cd41f579f2b5e44832b108e2fbaa46b9093e99f42415f8b73 |