Python implementation of integrated path stability selection (IPSS)
Integrated path stability selection (IPSS)
Fast, flexible feature selection with false discovery control
Given an n-by-p feature matrix X (n = number of samples, p = number of features), and an n-dimensional
response variable y, IPSS applies a base selection algorithm to subsamples of the data to select features
(columns of X) that are related to the response. The final outputs are q-values and efp scores for
each feature.
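To make the subsampling idea concrete, here is a minimal pure-Python sketch of how selection frequencies can be estimated by repeatedly applying a base selector to random halves of the data. It is a generic illustration of subsampling-based selection, not the package's implementation: the names `abs_covariance_scores` and `selection_frequencies`, the single fixed `threshold`, and the half-sample size are all assumptions made for this example.

```python
import random


def abs_covariance_scores(X, y):
    """Toy importance score: absolute sample covariance of each column with y."""
    n = len(X)
    y_mean = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        col_mean = sum(col) / n
        cov = sum((c - col_mean) * (v - y_mean) for c, v in zip(col, y)) / n
        scores.append(abs(cov))
    return scores


def selection_frequencies(X, y, score_fn, threshold, n_subsamples=50, seed=0):
    """Fraction of random half-samples in which each feature scores >= threshold."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # subsample rows without replacement
        scores = score_fn([X[i] for i in idx], [y[i] for i in idx])
        for j, score in enumerate(scores):
            if score >= threshold:
                counts[j] += 1
    return [count / n_subsamples for count in counts]


# Hypothetical data: feature 0 equals the response, feature 1 is constant.
y = [float(i) for i in range(20)]
X = [[value, 0.0] for value in y]
frequencies = selection_frequencies(X, y, abs_covariance_scores, threshold=0.1)
print(frequencies)  # → [1.0, 0.0]
```

In this toy setup the informative feature is selected in every subsample and the constant feature never is. IPSS works with such selection frequencies across a whole grid of thresholds or regularization values rather than a single one.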
False discovery control
- The q-value of feature j is the smallest false discovery rate (FDR) when feature j is selected.
  - So to control the FDR at target_fdr, select the features with q-values at most target_fdr.
- The efp score of feature j is the expected number of false positives, E(FP), when j is selected.
  - So to control the E(FP) at target_fp, select the features with efp scores at most target_fp.
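Both rules above amount to the same thresholding step applied to different scores. The sketch below shows this with a small helper; `select_at_most` is a hypothetical name introduced here, and the q-values and efp scores are made-up numbers rather than real `ipss` output.

```python
def select_at_most(scores, threshold):
    """Return the sorted feature indices whose score is at most `threshold`."""
    return sorted(idx for idx, score in scores.items() if score <= threshold)


# Hypothetical values for four features, for illustration only
q_values = {0: 0.02, 1: 0.40, 2: 0.08, 3: 0.75}
efp_scores = {0: 0.3, 1: 6.0, 2: 1.5, 3: 12.0}

fdr_selected = select_at_most(q_values, threshold=0.1)  # control FDR at 0.1
efp_selected = select_at_most(efp_scores, threshold=1)  # control E(FP) at 1

print(fdr_selected)  # → [0, 2]
print(efp_selected)  # → [0]
```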
Flexible selection
IPSS applies to a wide range of base feature selection algorithms, including regularized models and any method that computes feature importance scores. This package includes three built-in base selection algorithms: IPSS for L1-regularized linear models (IPSSL1), IPSS for importance scores from gradient boosting (IPSSGB), and IPSS for importance scores from random forests (IPSSRF). It also allows users to seamlessly apply IPSS with their own customized feature importance scores.
Speed
For example, in simulation studies using real RNA-sequencing data from ovarian cancer patients, IPSSL1, IPSSGB,
and IPSSRF all run in under 20 seconds (without parallelization) when n=500 and p=5000.
Easy to use
The only required inputs are the feature matrix X and response vector y.
Associated papers
IPSS for regularized models: https://arxiv.org/abs/2403.15877
IPSS for arbitrary feature importance scores: https://arxiv.org/abs/2410.02208v1
Installation
Install from PyPI:
pip install ipss
Usage
from ipss import ipss
# load n-by-p feature matrix X and n-by-1 response vector y
# run ipss
ipss_output = ipss(X,y)
# select features based on target FDR
target_fdr = 0.1
q_values = ipss_output['q_values']
selected_features = [idx for idx, q_value in q_values.items() if q_value <= target_fdr]
print(f'Selected features (target FDR = {target_fdr}): {selected_features}')
Output
ipss_output = ipss(X,y) is a dictionary containing:
- efp_scores: Dictionary whose keys are feature indices and values are their efp scores (dict of length p).
- q_values: Dictionary whose keys are feature indices and values are their q-values (dict of length p).
- runtime: Runtime of the algorithm in seconds (float).
- selected_features: Indices of features selected by IPSS; empty list if target_fp and target_fdr are not specified (list of ints).
- stability_paths: Estimated selection probabilities at each parameter value (array of shape (n_alphas, p)).
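As a rough illustration of working with this dictionary, the sketch below ranks features by q-value and pairs each with its efp score. The `summarize` helper and the `mock_output` values are hypothetical; only the key names are taken from the list above, and a plain nested list stands in for the `stability_paths` array.

```python
def summarize(ipss_output, top_k=3):
    """Return (feature index, q-value, efp score) for the top_k smallest q-values."""
    q_values = ipss_output['q_values']
    efp_scores = ipss_output['efp_scores']
    # Sort by q-value, breaking ties by feature index
    ranked = sorted(q_values, key=lambda idx: (q_values[idx], idx))
    return [(idx, q_values[idx], efp_scores[idx]) for idx in ranked[:top_k]]


# Made-up stand-in for the dictionary returned by ipss(X, y), with p = 3
mock_output = {
    'efp_scores': {0: 0.3, 1: 6.0, 2: 1.5},
    'q_values': {0: 0.02, 1: 0.40, 2: 0.08},
    'runtime': 1.7,
    'selected_features': [],
    'stability_paths': [[0.9, 0.1, 0.6], [0.8, 0.0, 0.5]],
}

rows = summarize(mock_output, top_k=2)
for idx, q_value, efp_score in rows:
    print(f'feature {idx}: q-value = {q_value}, efp score = {efp_score}')
```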
Usage with custom feature importance scores
For custom feature importance scores, selector must be a function that takes X and y as inputs (as well as an optional
dictionary of arguments selector_args specific to the feature importance function), and returns a list or NumPy array of
importance scores, one per feature, that must align with the column order in X.
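from ipss import ipss
import numpy as np
from sklearn.linear_model import Ridge

# define custom feature importance function based on ridge regression
selector_args = {'alpha': 1}

def ridge_selector(X, y, alpha):
    # fit ridge regression and use absolute coefficients as importance scores
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    # ravel so that an n-by-1 response still yields one score per feature
    feature_importance_scores = np.abs(model.coef_).ravel()
    return feature_importance_scores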
# load n-by-p feature matrix X and n-by-1 response vector y
# run ipss
ipss_output = ipss(X, y, selector=ridge_selector, selector_args=selector_args)
# select features based on target FDR
target_fdr = 0.1
q_values = ipss_output['q_values']
selected_features = [idx for idx, q_value in q_values.items() if q_value <= target_fdr]
print(f'Selected features (target FDR = {target_fdr}): {selected_features}')
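Before handing a custom function to `ipss`, it can help to check that it returns one score per column of X. The sketch below is one way to do that in plain Python. `check_selector` and `centered_dot_selector` are hypothetical names, and passing `selector_args` as keyword arguments is an assumption inferred from the `ridge_selector(X, y, alpha)` signature above, not a documented guarantee.

```python
def check_selector(selector, X, y, selector_args=None):
    """Call selector on (X, y) and verify it returns one score per column of X."""
    # Assumption: selector_args are passed as keyword arguments
    scores = list(selector(X, y, **(selector_args or {})))
    n_features = len(X[0])
    if len(scores) != n_features:
        raise ValueError(
            f'selector returned {len(scores)} scores for {n_features} features'
        )
    return scores


def centered_dot_selector(X, y, scale=1.0):
    """Toy importance score: |sum_i x_ij * (y_i - mean(y))| for each column j."""
    y_mean = sum(y) / len(y)
    centered = [value - y_mean for value in y]
    columns = zip(*X)  # iterate over columns in the same order as in X
    return [scale * abs(sum(x * c for x, c in zip(col, centered))) for col in columns]


# Tiny made-up data set with n = 2 samples and p = 3 features
X = [[1.0, 5.0, 0.0], [2.0, 5.0, 4.0]]
y = [0.0, 1.0]
scores = check_selector(centered_dot_selector, X, y, {'scale': 2.0})
print(scores)  # → [1.0, 0.0, 4.0]
```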
Examples
The examples folder includes
- A simple simulation: simple_example.py (Open in Google Colab).
- Analyze cancer data: cancer.py (Open in Google Colab).
Full list of ipss arguments
Required arguments:
- X: Features (array of shape (n, p)), where n is the number of samples and p is the number of features.
- y: Response (array of shape (n,) or (n, 1)). ipss automatically detects if y is continuous or binary.
Optional arguments:
- selector: Base algorithm to use (str; default 'gb'). Options:
  - 'gb': Gradient boosting (uses XGBoost).
  - 'l1': L1-regularized linear or logistic regression (uses scikit-learn).
  - 'rf': Random forest (uses scikit-learn).
  - Custom function that computes feature importance scores (see usage example above).
- selector_args: Arguments for the base algorithm (dict; default None).
- preselect: Preselect/filter features prior to subsampling (bool; default True).
- preselect_args: Arguments for preselection algorithm (dict; default None).
- target_fp: Target number of false positives to control (positive float; default None).
- target_fdr: Target false discovery rate (FDR) (positive float; default None).
- B: Number of subsampling steps (int; default 100 for IPSSGB, 50 otherwise).
- n_alphas: Number of values in the regularization or threshold grid (int; default 25 if 'l1' else 100).
- ipss_function: Function to apply to selection probabilities (str; default 'h2' if 'l1' else 'h3'). Options:
  - 'h1': Linear function, h1(x) = 2x - 1 if x >= 0.5 else 0.
  - 'h2': Quadratic function, h2(x) = (2x - 1)**2 if x >= 0.5 else 0.
  - 'h3': Cubic function, h3(x) = (2x - 1)**3 if x >= 0.5 else 0.
- cutoff: Maximum value of the theoretical integral bound I(Lambda) (positive float; default 0.05).
- delta: Defines probability measure; see Associated papers (float; defaults depend on selector).
- standardize_X: Scale features to have mean 0, standard deviation 1 (bool; default None).
- center_y: Center response to have mean 0 (bool; default None).
- n_jobs: Number of jobs to run in parallel (int; default 1).
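The three `ipss_function` options can be written out directly from the formulas listed above. This is a sketch of the formulas only; how the package applies them internally to the selection probabilities is not shown here.

```python
def h1(x):
    """Linear function: 2x - 1 if x >= 0.5 else 0."""
    return 2 * x - 1 if x >= 0.5 else 0.0


def h2(x):
    """Quadratic function: (2x - 1)**2 if x >= 0.5 else 0."""
    return (2 * x - 1) ** 2 if x >= 0.5 else 0.0


def h3(x):
    """Cubic function: (2x - 1)**3 if x >= 0.5 else 0."""
    return (2 * x - 1) ** 3 if x >= 0.5 else 0.0


# Evaluate each function at a selection probability of 0.75
values = [h(0.75) for h in (h1, h2, h3)]
print(values)  # → [0.5, 0.25, 0.125]
```

For selection probabilities x in [0, 1], these satisfy h3(x) <= h2(x) <= h1(x), and all three are zero below 0.5.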
General observations/recommendations:
- IPSSGB is usually best for capturing nonlinear relationships between features and response.
- IPSSL1 is usually best for capturing linear relationships between features and response.
- For FDR control, it is usually best to compute q-values with ipss and then use them to select features at the desired FDR threshold (as in the Usage section above), rather than specify target_fdr, which should be left as None. This provides greater flexibility when selecting features.
- For E(FP) control, it is usually best to compute efp scores with ipss and then use them to select features at the desired false positive threshold, rather than specify target_fp, which should be left as None. This provides greater flexibility when selecting features.
- In general, all other parameters should not be changed.
  - selector_args include, e.g., decision tree parameters for tree-based models.
  - Results are robust to B provided it is greater than 25.
  - 'h3' is less conservative than 'h2', which is less conservative than 'h1'.
  - Preselection can significantly reduce computation time.
  - Results are robust to cutoff provided it is between 0.025 and 0.1.
  - Results are robust to delta provided it is between 0 and 1.5.
  - Standardization is automatically applied for IPSSL1. IPSSGB and IPSSRF are unaffected by this.
  - Centering y is automatically applied for IPSSL1. IPSSGB and IPSSRF are unaffected by this.