Python implementation of integrated path stability selection (IPSS)
Integrated path stability selection (IPSS)
Fast, flexible feature selection with false discovery control
Associated papers

- "Integrated path stability selection": accepted in the Journal of the American Statistical Association and available on arXiv.
- "Nonparametric IPSS: Fast, flexible feature selection with false discovery control": published in Bioinformatics and available on arXiv.
"Integrated path stability selection" introduces IPSS and applies it to regularized regression models like lasso.
"Nonparametric IPSS: Fast, flexible feature selection with false discovery control" extends IPSS to arbitrary feature importance scores, with a focus on scores from gradient boosting and random forests.
Installation
```shell
pip install ipss
```
Usage
```python
from ipss import ipss

# load n-by-p feature matrix X and n-by-1 response vector y

# run ipss
ipss_output = ipss(X, y)

# select features based on target FDR
target_fdr = 0.1
q_values = ipss_output['q_values']
selected_features = [idx for idx, q_value in q_values.items() if q_value <= target_fdr]
print(f'Selected features (target FDR = {target_fdr}): {selected_features}')
```
Outputs
`ipss_output = ipss(X, y)` is a dictionary containing:

- `efp_scores`: Dictionary whose keys are feature indices and values are their efp scores (dict of length `p`).
- `q_values`: Dictionary whose keys are feature indices and values are their q-values (dict of length `p`).
- `runtime`: Runtime of the algorithm in seconds (float).
- `selected_features`: Indices of features selected by IPSS; empty list if `target_fp` and `target_fdr` are not specified (list of ints).
- `stability_paths`: Estimated selection probabilities at each parameter value (array of shape `(n_alphas, p)`).
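To make the structure concrete, here is a minimal sketch that inspects a mock dictionary with the same fields (the values are made up for illustration; a real `ipss_output` comes from calling `ipss(X, y)`):

```python
import numpy as np

# mock output with the same structure as ipss_output (values are illustrative)
ipss_output = {
    'efp_scores': {0: 0.4, 1: 5.2, 2: 0.9},
    'q_values': {0: 0.01, 1: 0.60, 2: 0.05},
    'runtime': 1.7,
    'selected_features': [],
    'stability_paths': np.zeros((100, 3)),
}

# q_values and efp_scores are keyed by column index of X
for idx in sorted(ipss_output['q_values']):
    print(f"feature {idx}: q-value = {ipss_output['q_values'][idx]}, "
          f"efp score = {ipss_output['efp_scores'][idx]}")

# one row per regularization value, one column per feature
n_alphas, p = ipss_output['stability_paths'].shape
```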
Selecting features
Each feature (column of X) is assigned:
- a q-value: the minimum false discovery rate (FDR) at which the feature is selected
- an efp score: the minimum expected number of false positives (E(FP)) at which the feature is selected
To select features:
- Control FDR by choosing all features with `q_value ≤ target_fdr`. Example: selecting features with `q_value ≤ 0.1` controls the FDR at level 0.1.
- Control E(FP) by choosing all features with `efp_score ≤ target_fp`. Example: selecting features with `efp_score ≤ 2` controls the E(FP) at level 2.

In general, we recommend selecting features using `q_values` or `efp_scores` after running `ipss`, rather than specifying `target_fdr` or `target_fp` as arguments (see General observations/recommendations).
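E(FP)-based selection mirrors the FDR example in the Usage section; here is a minimal sketch with a hypothetical `efp_scores` dictionary:

```python
# hypothetical efp scores, as returned in ipss_output['efp_scores'];
# keys are feature indices, values are efp scores
efp_scores = {0: 0.3, 1: 1.8, 2: 7.5, 3: 0.9}

# selecting all features with efp_score <= target_fp controls E(FP) at target_fp
target_fp = 2
selected_features = [idx for idx, score in efp_scores.items() if score <= target_fp]
print(selected_features)  # [0, 1, 3]
```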
Usage with custom feature importance scores
For custom feature importance scores, `selector` must be a function that takes `X` and `y` as inputs (along with an optional dictionary of arguments, `selector_args`, specific to the feature importance function) and returns a list or NumPy array of importance scores, one per feature, aligned with the column order of `X`.
```python
import numpy as np
from sklearn.linear_model import Ridge

from ipss import ipss

# define custom feature importance function based on ridge regression
def ridge_selector(X, y, alpha):
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    feature_importance_scores = np.abs(model.coef_)
    return feature_importance_scores

selector_args = {'alpha': 1}

# load n-by-p feature matrix X and n-by-1 response vector y

# run ipss
ipss_output = ipss(X, y, selector=ridge_selector, selector_args=selector_args)

# select features based on target FDR
target_fdr = 0.1
q_values = ipss_output['q_values']
selected_features = [idx for idx, q_value in q_values.items() if q_value <= target_fdr]
print(f'Selected features (target FDR = {target_fdr}): {selected_features}')
```
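Any function with this signature works. For instance, a hypothetical selector built on random forest importances (purely illustrative, since `ipss` already provides a built-in random forest via `selector='rf'`; `RandomForestRegressor` is from scikit-learn):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_selector(X, y, n_estimators=100):
    # fit a random forest and return one importance score per column of X
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X, np.ravel(y))
    return model.feature_importances_

# would be passed to ipss as, e.g.:
# ipss_output = ipss(X, y, selector=rf_selector, selector_args={'n_estimators': 50})
```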
Examples
The examples folder includes analyses of:

- Simulated data: `simple_example.py` (Open in Google Colab).
- Cancer data: `cancer.py` (Open in Google Colab).
Full list of ipss arguments
Required arguments:
- `X`: Features (array of shape `(n, p)`), where `n` is the number of samples and `p` is the number of features.
- `y`: Response (array of shape `(n,)` or `(n, 1)`). `ipss` automatically detects if `y` is binary.

Optional arguments:

- `selector`: Base algorithm to use (str; default `'gb'`). Options:
  - `'gb'`: Gradient boosting (uses XGBoost).
  - `'l1'`: L1-regularized linear or logistic regression (uses scikit-learn).
  - `'rf'`: Random forest (uses scikit-learn).
  - Custom function that computes feature importance scores (see usage example above).
- `selector_args`: Arguments for the base algorithm (dict; default `None`).
- `preselect`: Preselect/filter features prior to subsampling (bool; default `True`).
- `preselect_args`: Arguments for the preselection algorithm (dict; default `None`).
- `target_fp`: Target number of false positives to control (positive float; default `None`).
- `target_fdr`: Target false discovery rate (FDR) (positive float; default `None`).
- `B`: Number of subsampling steps (int; default `100` for IPSSGB, `50` otherwise).
- `n_alphas`: Number of values in the regularization or threshold grid (int; default `25` if `'l1'` else `100`).
- `ipss_function`: Function to apply to selection probabilities (str; default `'h2'` if `'l1'` else `'h3'`). Options:
  - `'h1'`: Linear function, `h1(x) = 2x - 1 if x >= 0.5 else 0`.
  - `'h2'`: Quadratic function, `h2(x) = (2x - 1)**2 if x >= 0.5 else 0`.
  - `'h3'`: Cubic function, `h3(x) = (2x - 1)**3 if x >= 0.5 else 0`.
- `cutoff`: Maximum value of the theoretical integral bound `I(Lambda)` (positive float; default `0.05`).
- `delta`: Defines the probability measure; see Associated papers (float; defaults depend on `selector`).
- `standardize_X`: Scale features to have mean 0, standard deviation 1 (bool; default `None`).
- `center_y`: Center the response to have mean 0 (bool; default `None`).
- `n_jobs`: Number of jobs to run in parallel (int; default `1`).
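The three `ipss_function` options can be written out directly; this small sketch just restates the formulas above to make their ordering concrete:

```python
def h1(x):
    # linear
    return 2 * x - 1 if x >= 0.5 else 0

def h2(x):
    # quadratic
    return (2 * x - 1) ** 2 if x >= 0.5 else 0

def h3(x):
    # cubic
    return (2 * x - 1) ** 3 if x >= 0.5 else 0

# all three vanish below the 0.5 threshold
print(h1(0.4), h2(0.4), h3(0.4))  # 0 0 0

# for 0.5 < x < 1, h3(x) <= h2(x) <= h1(x); per the recommendations
# below, 'h3' is the least conservative of the three
print(h1(0.8), h2(0.8), h3(0.8))
```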
General observations/recommendations:
- IPSSGB is usually best for capturing nonlinear relationships between features and the response.
- IPSSL is usually best for capturing linear relationships between features and the response.
- For FDR control, we generally recommend computing q-values with `ipss` and then using them to select features at the desired FDR threshold (as in the Usage section above), rather than specifying `target_fdr`, which should be left as `None`. This provides greater flexibility when selecting features.
- For E(FP) control, we generally recommend computing efp scores with `ipss` and then using them to select features at the desired false positive threshold, rather than specifying `target_fp`, which should be left as `None`. This provides greater flexibility when selecting features.
- In general, all other parameters should not be changed. `selector_args` include, e.g., decision tree parameters for tree-based models.
- Results are robust to `B` provided it is greater than `25`.
- `'h3'` is less conservative than `'h2'`, which is less conservative than `'h1'`.
- Preselection can significantly reduce computation time.
- Results are robust to `cutoff` provided it is between `0.025` and `0.1`.
- Results are robust to `delta` provided it is between `0` and `1.5`.
- Standardization is automatically applied for IPSSL; IPSSGB and IPSSRF are unaffected by it.
- Centering `y` is automatically applied for IPSSL; IPSSGB and IPSSRF are unaffected by it.
File details
Details for the file ipss-1.1.2.tar.gz.
File metadata
- Download URL: ipss-1.1.2.tar.gz
- Upload date:
- Size: 15.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7dcffaf341495ae36a21dd60c4dea6ad3db42a0749c5d93b061834ea39c54f9f |
| MD5 | a5869f6155e747ebf71386c8e3918b81 |
| BLAKE2b-256 | 742c328ba3a36d2109d6fddbb9b7102448b8b14a64ca55124bfb6cf97fdc2a3d |
File details
Details for the file ipss-1.1.2-py3-none-any.whl.
File metadata
- Download URL: ipss-1.1.2-py3-none-any.whl
- Upload date:
- Size: 16.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 477bbb35f46b7c973344682a9caf0061fd9ed7fe29777e169fe54d7a512937f8 |
| MD5 | 339bacc354a02b97475941335eae214d |
| BLAKE2b-256 | 7801d39d4a2a2c2436dbfcfd24b971dab1be013539f7f45e7f9a1f5476ff2dc6 |