Python implementation of integrated path stability selection (IPSS)
Integrated path stability selection (IPSS)
Integrated path stability selection (IPSS) is a general method for improving feature selection algorithms. Given an n-by-p data matrix X (n = number of samples, p = number of features), and an n-dimensional response variable y, IPSS applies a base selection algorithm to subsamples of the data to select features (columns of X) that are most related to y. This package includes IPSS for gradient boosting (IPSSGB), random forests (IPSSRF), and L1-regularized linear models (IPSSL). The final outputs are efp scores and q-values for each feature in X.
- The efp score of feature j is the expected number of false positives, E(FP), selected when j is selected.
  - So to control the E(FP) at target_fp, select the features with efp scores at most target_fp.
- The q-value of feature j is the false discovery rate (FDR) when feature j is selected.
  - So to control the FDR at target_fdr, select the features with q-values at most target_fdr.
Key attributes
- Error control: IPSS controls the number of false positives and the FDR.
- Generality: IPSSGB and IPSSRF are nonlinear, nonparametric methods. IPSSL is linear.
- Speed: IPSS is efficient. For example, IPSSGB runs in <20 seconds when n = 500 and p = 5000.
- Simplicity: The only required inputs are X and y. Users can also specify the base method (IPSSGB, IPSSRF, or IPSSL), and the target number of false positives or the target FDR.
Associated papers
IPSSL: https://arxiv.org/abs/2403.15877
IPSSGB and IPSSRF: https://arxiv.org/abs/2410.02208v1
Installation
Install from PyPI:
pip install ipss
Tests
Basic test (test_basic.py)
- Run the test:
python3 test_basic.py
- Expected output: "All tests passed."
Ovarian cancer: microRNAs and tumor purity (test_oc.py)
- Identify microRNAs related to tumor purity in tumor samples from ovarian cancer patients
- Data are from LinkedOmics and located in examples/cancer/ovarian
- Inputs:
  - Features: matrix of microRNA expression levels for p = 585 microRNAs from n = 451 patients
  - Response: tumor purity (proportion of cancerous cells in a tissue sample) from n = 451 patients
- Run the test:
python3 test_oc.py
- Expected output: q-values and efp scores for the top ranked microRNAs
Usage
from ipss import ipss
# load n-by-p feature matrix X and n-by-1 response vector y
# run ipss:
ipss_output = ipss(X, y)
# select features based on target number of false positives
target_fp = 1
efp_scores = ipss_output['efp_scores']
selected_features = [idx for idx, efp_score in efp_scores.items() if efp_score <= target_fp]
print(f'Selected features (target E(FP) = {target_fp}): {selected_features}')
# select features based on target FDR
target_fdr = 0.1
q_values = ipss_output['q_values']
selected_features = [idx for idx, q_value in q_values.items() if q_value <= target_fdr]
print(f'Selected features (target FDR = {target_fdr}): {selected_features}')
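The usage snippet above assumes that X and y are already loaded. As a minimal sketch for experimenting, the following builds synthetic inputs of the required shapes with NumPy. The sizes, the seed, the noise level, and the choice of signal features are all hypothetical and are not taken from the ipss package or its papers.

```python
import numpy as np

# Hypothetical toy data for illustration only (not from the ipss package).
rng = np.random.default_rng(0)
n, p = 200, 50                 # number of samples, number of features
true_features = [0, 1, 2]      # indices of the features that actually drive y

# n-by-p feature matrix with independent standard normal entries
X = rng.standard_normal((n, p))

# Continuous response of shape (n,): sum of the true features plus noise
signal = X[:, true_features].sum(axis=1)
y = signal + 0.5 * rng.standard_normal(n)

# Binary variant of the response, for classification-style problems
y_binary = (signal > 0).astype(int)
```

Either (X, y) or (X, y_binary) can then take the place of the loaded data in the usage snippet, and true_features has the same form as the optional argument of that name.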
Results
ipss_output = ipss(X, y) is a dictionary containing:
- efp_scores: Dictionary where keys are feature indices and values are their efp scores (dict of length p).
- q_values: Dictionary where keys are feature indices and values are their q-values (dict of length p).
- runtime: Runtime of the algorithm in seconds (float).
- selected_features: List of indices of features selected by IPSS; empty list if target_fp and target_fdr are not specified (list of ints).
- stability_paths: Estimated selection probabilities at each parameter value (array of shape (n_alphas, p)).
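Because efp_scores and q_values are plain dictionaries keyed by feature index, ranking features is a matter of sorting by value. The sketch below uses a hand-made stand-in for ipss_output with invented numbers; rank_features is a hypothetical helper and not part of the ipss package.

```python
# Hypothetical stand-in for ipss_output; the numbers are invented for illustration.
ipss_output = {
    'efp_scores': {0: 0.2, 1: 3.5, 2: 0.9, 3: 12.0},
    'q_values': {0: 0.01, 1: 0.30, 2: 0.08, 3: 0.65},
}

def rank_features(scores, top_k=None):
    """Return (feature index, score) pairs sorted from smallest to largest score."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return ranked if top_k is None else ranked[:top_k]

# Smaller q-values and efp scores correspond to stronger evidence for a feature.
top_by_q = rank_features(ipss_output['q_values'], top_k=2)
top_by_efp = rank_features(ipss_output['efp_scores'], top_k=2)
print(top_by_q)
print(top_by_efp)
```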
Examples
Additional examples are available in the examples folder. These include
- A simple example, simple_example.py (Open in Google Colab).
- Various simulation experiments, simulation.py (Open in Google Colab).
- IPSS applied to cancer data, cancer.py (Open in Google Colab).
Full list of ipss arguments
Required arguments:
- X: Features (array of shape (n, p)), where n is the number of samples and p is the number of features.
- y: Response (array of shape (n,) or (n, 1)). ipss automatically detects if y is continuous or binary.
Optional arguments:
- selector: Base algorithm to use (str; default 'gb'). Options are:
  - 'gb': Gradient boosting with XGBoost.
  - 'l1': L1-regularized linear or logistic regression.
  - 'rf': Random forest with scikit-learn.
- selector_args: Arguments for the base algorithm (dict; default None).
- target_fp: Target number of false positives to control (positive float; default None).
- target_fdr: Target false discovery rate (FDR) (positive float; default None).
- B: Number of subsampling steps (int; default 100 for IPSSGB, 50 otherwise).
- n_alphas: Number of values in the regularization or threshold grid (int; default 100).
- ipss_function: Function to apply to selection probabilities (str; default 'h3'). Options are:
  - 'h1': Linear function, h1(x) = 2x - 1 if x >= 0.5 else 0.
  - 'h2': Quadratic function, h2(x) = (2x - 1)**2 if x >= 0.5 else 0.
  - 'h3': Cubic function, h3(x) = (2x - 1)**3 if x >= 0.5 else 0.
- preselect: Number (if int) or percentage (if float) of features to preselect. False for no preselection (default 0.05).
- preselect_min: Minimum number of features to keep in the preselection step (int; default 200).
- preselect_args: Arguments for the preselection algorithm (dict; default None).
- cutoff: Maximum value of the theoretical integral bound I(Lambda) (positive float; default 0.05).
- delta: Defines probability measure; see Associated papers (float; default 1).
- standardize_X: Scale features to have mean 0, standard deviation 1 (bool; default None).
- center_y: Center response to have mean 0 (bool; default None).
- true_features: List of true feature indices when known, e.g., in simulations (list; default None).
- n_jobs: Number of jobs to run in parallel (int; default 1).
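The three ipss_function options listed above can be written out directly. The following is a plain-Python transcription of those formulas for a single selection probability x, meant only to show how each function treats probabilities below and above 0.5. It is a sketch and not the package's internal code.

```python
def h1(x):
    """Linear: 2x - 1 for x >= 0.5, otherwise 0."""
    return 2 * x - 1 if x >= 0.5 else 0.0

def h2(x):
    """Quadratic: (2x - 1)**2 for x >= 0.5, otherwise 0."""
    return (2 * x - 1) ** 2 if x >= 0.5 else 0.0

def h3(x):
    """Cubic: (2x - 1)**3 for x >= 0.5, otherwise 0."""
    return (2 * x - 1) ** 3 if x >= 0.5 else 0.0

# Evaluate each function at a few selection probabilities.
for prob in (0.4, 0.5, 0.75, 1.0):
    print(prob, h1(prob), h2(prob), h3(prob))
```

All three functions vanish for probabilities below 0.5 and equal 1 at probability 1. In between, the higher powers give smaller values.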
General observations/recommendations:
- IPSSGB is usually best for capturing nonlinear relationships between features and response
- IPSSL is usually best for capturing linear relationships between features and response
- target_fp or target_fdr (at most one is specified) are problem specific/left to the user
- In general, all other parameters should not be changed
- selector_args include, e.g., decision tree parameters for tree-based models
- Results are robust to B provided it is bigger than 25
- Results are robust to n_alphas provided it is bigger than 50
- 'h3' yields the most true positives. 'h2' is more conservative, and 'h1' even more so.
- Preselection can significantly reduce computation time. Results are robust otherwise.
- Results are robust to cutoff provided it is between 0.025 and 0.1.
- Results are robust to delta provided it is between 0 and 1.5.
- Standardization is automatically applied for IPSSL. IPSSGB and IPSSRF are unaffected by this.
- Centering y is automatically applied for IPSSL. IPSSGB and IPSSRF are unaffected by this.
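The ranges recommended above can be checked before a run. The helper below, check_ipss_kwargs, is hypothetical and not part of the ipss package. It inspects a dictionary of keyword arguments, using the argument names from the list above, and reports any setting that falls outside the recommended ranges.

```python
def check_ipss_kwargs(**kwargs):
    """Return notes for settings outside the recommended ranges (hypothetical helper)."""
    notes = []

    # At most one of target_fp and target_fdr should be specified.
    if kwargs.get('target_fp') is not None and kwargs.get('target_fdr') is not None:
        notes.append('specify at most one of target_fp and target_fdr')

    # Results are described as robust when B > 25 and n_alphas > 50.
    B = kwargs.get('B')
    if B is not None and B <= 25:
        notes.append('B should be bigger than 25')
    n_alphas = kwargs.get('n_alphas')
    if n_alphas is not None and n_alphas <= 50:
        notes.append('n_alphas should be bigger than 50')

    # Recommended intervals for cutoff and delta.
    cutoff = kwargs.get('cutoff')
    if cutoff is not None and not 0.025 <= cutoff <= 0.1:
        notes.append('cutoff should be between 0.025 and 0.1')
    delta = kwargs.get('delta')
    if delta is not None and not 0 <= delta <= 1.5:
        notes.append('delta should be between 0 and 1.5')

    return notes

settings = {'selector': 'gb', 'target_fdr': 0.1, 'B': 100}
print(check_ipss_kwargs(**settings))
```

An empty list means every setting is within the recommended ranges. Since the keys match the argument names listed earlier, a settings dictionary like this one could presumably be forwarded as ipss(X, y, **settings).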