
Python implementation of integrated path stability selection (IPSS)


Integrated path stability selection (IPSS)

Fast, flexible feature selection with false discovery control

Associated papers

"Integrated path stability selection" introduces IPSS and applies it to penalized parametric models such as lasso, adaptive lasso, MCP, and SCAD (see Optional arguments below). "Nonparametric IPSS: Fast, flexible feature selection with false discovery control" extends IPSS to arbitrary feature importance scores, with a focus on scores from gradient boosting and random forests.

Installation

pip install ipss

Usage

from ipss import ipss

# load n-by-p feature matrix X and n-by-1 response vector y

# run ipss
ipss_output = ipss(X, y)

# select features based on target FDR
target_fdr = 0.1
q_values = ipss_output['q_values']
selected_features = [idx for idx, q_value in q_values.items() if q_value <= target_fdr]
print(f'Selected features (target FDR = {target_fdr}): {selected_features}')
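The example above assumes X and y are already loaded. For a self-contained run, synthetic data can be generated as in the sketch below; make_regression and the problem dimensions are illustrative choices, not part of ipss:

from sklearn.datasets import make_regression
from ipss import ipss

# generate a synthetic n-by-p feature matrix X and response vector y
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0)

# run ipss and select features at a target FDR of 0.1
ipss_output = ipss(X, y)
target_fdr = 0.1
q_values = ipss_output['q_values']
selected_features = [idx for idx, q_value in q_values.items() if q_value <= target_fdr]
print(f'Selected features (target FDR = {target_fdr}): {selected_features}')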

Outputs

ipss_output = ipss(X, y) is a dictionary containing:

  • efp_scores: Dictionary whose keys are feature indices and values are their efp scores (dict of length p).
  • q_values: Dictionary whose keys are feature indices and values are their q-values (dict of length p).
  • runtime: Runtime of the algorithm in seconds (float).
  • selected_features: Indices of features selected by IPSS; empty list if neither target_fp nor target_fdr is specified (list of ints).
  • stability_paths: Estimated selection probabilities at each parameter value (array of shape (n_alphas, p)).
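For example, the stability paths can be plotted directly from this output. The sketch below assumes matplotlib is installed; it is not a dependency of ipss:

import matplotlib.pyplot as plt

# stability_paths has shape (n_alphas, p): one selection probability curve per feature
stability_paths = ipss_output['stability_paths']
plt.plot(stability_paths)
plt.xlabel('grid index')
plt.ylabel('estimated selection probability')
plt.show()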

Selecting features

Each feature (column of X) is assigned:

  • a q-value: the minimum false discovery rate (FDR) at which the feature is selected
  • an efp score: the minimum expected number of false positives (E(FP)) at which the feature is selected

To select features:

  • Control FDR by choosing all features with q_value ≤ target_fdr
    Example: Selecting features with q_value ≤ 0.1 controls the FDR at level 0.1
  • Control E(FP) by choosing all features with efp_score ≤ target_fp
    Example: Selecting features with efp_score ≤ 2 controls the E(FP) at level 2 (see the sketch after this list)
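Both rules apply directly to the dictionaries returned by ipss. A minimal sketch of E(FP) control, reusing ipss_output from the Usage section:

# select features whose efp score is at most the target number of false positives
target_fp = 2
efp_scores = ipss_output['efp_scores']
selected_features = [idx for idx, efp_score in efp_scores.items() if efp_score <= target_fp]
print(f'Selected features (target E(FP) = {target_fp}): {selected_features}')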

In general, we recommend selecting features using q_values or efp_scores after running ipss, rather than specifying target_fdr or target_fp as arguments (see General observations/recommendations).

Usage with custom feature importance scores

For custom feature importance scores, selector must be a function that takes X and y as inputs, along with the entries of the optional dictionary selector_args as additional keyword arguments, and returns a list or NumPy array of importance scores, one per feature, in the same order as the columns of X.

import numpy as np
from sklearn.linear_model import Ridge

from ipss import ipss

# define custom feature importance function based on ridge regression
selector_args = {'alpha': 1}

def ridge_selector(X, y, alpha):
	# fit ridge regression and use absolute coefficients as importance scores
	model = Ridge(alpha=alpha)
	model.fit(X, y)
	return np.abs(model.coef_)

# load n-by-p feature matrix X and n-by-1 response vector y

# run ipss
ipss_output = ipss(X, y, selector=ridge_selector, selector_args=selector_args)

# select features based on target FDR
target_fdr = 0.1
q_values = ipss_output['q_values']
selected_features = [idx for idx, q_value in q_values.items() if q_value <= target_fdr]
print(f'Selected features (target FDR = {target_fdr}): {selected_features}')

Examples

The examples folder in the repository contains worked example analyses.

Full list of ipss arguments

Required arguments:

  • X: Features (array of shape (n,p)), where n is the number of samples and p is the number of features.
  • y: Response (array of shape (n,) or (n, 1)). ipss automatically detects if y is binary.
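Because ipss detects a binary response automatically, classification problems use the same call as regression. A minimal sketch with synthetic binary labels (make_classification is an illustrative choice):

from sklearn.datasets import make_classification
from ipss import ipss

# binary y is detected automatically; no extra arguments are needed
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
ipss_output = ipss(X, y)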

Optional arguments:

  • selector: Base algorithm to use (str; default 'gb').
    • Nonparametric methods
      • 'gb': Gradient boosting (XGBoost).
      • 'rf': Random forest (scikit-learn).
    • Penalized regression methods
      • 'adaptive_lasso': Adaptive lasso (skglm).
      • 'l1': L1-regularized linear or logistic regression (scikit-learn).
      • 'mcp': Minimax concave penalty (skglm).
      • 'scad': Smoothly clipped absolute deviation penalty (skglm).
    • Custom
      • Users can provide their own feature importance function (see example above).
  • selector_args: Arguments for the base algorithm (dict; default None).
  • preselect: Preselect/filter features prior to subsampling (bool; default True).
  • preselect_args: Arguments for preselection algorithm (dict; default None).
  • target_fp: Target number of false positives to control (positive float; default None).
  • target_fdr: Target false discovery rate (FDR) (positive float; default None).
  • B: Number of subsampling steps (int; default 100 for IPSSGB, 50 otherwise).
  • n_alphas: Number of values in the regularization or threshold grid (int; default 25 when selector='l1', otherwise 100).
  • ipss_function: Function to apply to selection probabilities (str; default 'h2' when selector='l1', otherwise 'h3'). Options:
    • 'h1': Linear function, h1(x) = 2x - 1 if x >= 0.5 else 0.
    • 'h2': Quadratic function, h2(x) = (2x - 1)**2 if x >= 0.5 else 0.
    • 'h3': Cubic function, h3(x) = (2x - 1)**3 if x >= 0.5 else 0.
  • cutoff: Maximum value of the theoretical integral bound I(Lambda) (positive float; default 0.05).
  • delta: Defines probability measure; see Associated papers (float; defaults depend on selector).
  • standardize_X: Scale features to have mean 0, standard deviation 1 (bool; default None).
  • center_y: Center response to have mean 0 (bool; default None).
  • n_jobs: Number of jobs to run in parallel (int; default 1).
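As an illustration, a call overriding a few of these defaults might look like the sketch below. The specific values are illustrative, not recommendations (see the observations that follow); max_depth stands in for a decision tree parameter assumed to be forwarded to the underlying random forest via selector_args:

from ipss import ipss

# random forest importance scores with shallow trees, run on 4 parallel jobs
ipss_output = ipss(
	X, y,
	selector='rf',
	selector_args={'max_depth': 3},
	n_jobs=4,
)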

General observations/recommendations:

  • selector = 'gb' is often best for capturing nonlinear relationships.
  • selector = 'l1' or 'adaptive_lasso' is often best for capturing linear relationships.
  • For FDR control, we generally recommend computing q-values with ipss and then using them to select features at the desired FDR threshold (as in the Usage section above), rather than specifying target_fdr, which should be left as None. This provides greater flexibility when selecting features.
  • For E(FP) control, we generally recommend computing efp scores with ipss and then using them to select features at the desired false positive threshold, rather than specifying target_fp, which should be left as None. This provides greater flexibility when selecting features.
  • In general, all other arguments should be left at their defaults:
    • selector_args includes, e.g., decision tree parameters for tree-based models.
    • Results are robust to B provided it is greater than 25.
    • 'h3' is less conservative than 'h2' which is less conservative than 'h1'.
    • Preselection can significantly reduce computation time.
    • Results are robust to cutoff provided it is between 0.025 and 0.1.
    • Results are robust to delta provided it is between 0 and 1.5.
    • Features are automatically standardized for the penalized regression methods.
    • The response is automatically centered for the penalized regression methods.
