A package for performing dynamic recursive feature elimination with sklearn.
dRFEtools - dynamic Recursive Feature Elimination
dRFEtools is a package for dynamic recursive feature elimination with
sklearn. It currently supports random forest classification and regression,
and linear models (linear, lasso, ridge, and elastic net).
Authors: Apuã Paquola, Kynon Jade Benjamin, and Tarun Katipalli
Package developed in Python 3.7+.
This package has several functions to run dynamic recursive feature elimination (dRFE) for random forest and linear model classification and regression models. For random forest, it assumes out-of-bag (OOB) scoring is set to True. For linear models, it generates a developmental set. For both classification and regression, three measurements are calculated for feature selection:
Classification:
- Normalized mutual information
- Accuracy
- Area under the ROC curve (AUC)
Regression:
- R2 (this can be negative if the model fits worse than always predicting the mean)
- Explained variance
- Mean squared error
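The three regression measurements can be illustrated with a minimal pure-Python sketch. The function names below (`mse`, `r2`, `explained_variance`) are hypothetical and are not part of dRFEtools; they follow the standard textbook definitions and show how R2 turns negative when predictions are worse than predicting the mean.

```python
from statistics import mean, pvariance

def mse(y_true, y_pred):
    """Mean squared error: average squared residual."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true), using population variance."""
    resid = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - pvariance(resid) / pvariance(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
reversed_pred = [4.0, 3.0, 2.0, 1.0]   # worse than predicting the mean
shifted_pred = [2.0, 3.0, 4.0, 5.0]    # right shape, constant offset

r2_bad = r2(y_true, reversed_pred)                     # negative R2
ev_shifted = explained_variance(y_true, shifted_pred)  # offset is ignored
r2_shifted = r2(y_true, shifted_pred)                  # offset is penalized
```

The shifted example shows why the two scores are reported separately: explained variance ignores a constant bias in the predictions, while R2 does not.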
The package has been split into four additional scripts for:
- Random forest feature elimination (AP)
- Linear model regression feature elimination (KJB)
- Rank features function (TK)
- Lowess redundant selection (KJB)
Citation
If you use this package, please cite: XXX.
Installation
pip install --user dRFEtools
Reference Manual
dRFEtools main functions
-
dRFE - Random Forest
feature_elimination
Runs the random forest feature elimination step over an iterator process.
Args:
- estimator: Random forest classifier object
- X: a data frame of training data
- Y: a vector of sample labels from training data set
- features: a vector of feature names
- fold: current fold
- out_dir: output directory. default '.'
- elimination_rate: percent rate to reduce feature list. default .2
- RANK: Output feature ranking. default=True (Boolean)
Yields:
- dict: a dictionary with number of features, normalized mutual information score, accuracy score, and array of the indexes for features to keep
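The general shape of the elimination loop can be sketched without any estimator. This is an illustration only, not the package's implementation: `eliminate` and `importance_fn` are hypothetical names, and `importance_fn` stands in for refitting the model and reading its feature importances at each step.

```python
def eliminate(features, importance_fn, elimination_rate=0.2):
    """Yield (n_features, kept_features) after each elimination step.

    importance_fn is a hypothetical callable: it receives the current
    feature names and returns one importance score per feature.
    """
    current = list(features)
    while len(current) > 1:
        scores = importance_fn(current)
        # Drop a fixed fraction of the remaining features, keeping at least one.
        n_keep = max(1, int(len(current) * (1 - elimination_rate)))
        ranked = sorted(zip(current, scores),
                        key=lambda pair: abs(pair[1]), reverse=True)
        current = [name for name, _ in ranked[:n_keep]]
        yield len(current), current

# Toy importances: feature "f9" matters most, "f0" least.
weights = {f"f{i}": float(i) for i in range(10)}
steps = list(eliminate(weights, lambda names: [weights[n] for n in names]))
counts = [n for n, _ in steps]   # [8, 6, 4, 3, 2, 1]
```

Because a fraction of the remaining features is removed at every step, the step size shrinks as the feature set shrinks, which is what makes the elimination "dynamic".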
-
dRFE - Linear Models
dev_rfe
Runs the recursive feature elimination step for linear models over an iterator process, assuming a developmental set is needed.
Args:
- estimator: regression linear model object
- X: a data frame of training data
- Y: a vector of sample labels from training data set
- features: a vector of feature names
- fold: current fold
- out_dir: output directory. default '.'
- elimination_rate: percent rate to reduce feature list. default .2
- dev_size: developmental set size. default '0.20'
- RANK: run feature ranking, default 'True'
- SEED: random state. default 'True'
Yields:
- dict: a dictionary with number of features, r2 score, mean square error, explained variance, and array of the indices for features to keep
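The developmental set is a held-out slice of the training data. A minimal sketch of such a split, assuming a seeded random partition of sample indices, is shown here; `dev_split` is a hypothetical helper, and whether dRFEtools performs the split this way (or through scikit-learn's `train_test_split`, which takes comparable `test_size` and `random_state` arguments) is an assumption.

```python
import random

def dev_split(n_samples, dev_size=0.20, seed=13):
    """Return (train_idx, dev_idx): a seeded random partition of indices."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # reproducible for a fixed seed
    n_dev = max(1, round(n_samples * dev_size))
    return sorted(idx[n_dev:]), sorted(idx[:n_dev])

train_idx, dev_idx = dev_split(50, dev_size=0.20, seed=13)
```

The model would be fitted on the `train_idx` rows and scored on the `dev_idx` rows, so the metrics used for elimination never come from samples the model was fitted on.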
-
Feature Rank Function
feature_rank_fnc
This function ranks features within the feature elimination loop.
Args:
- features: A vector of feature names
- rank: A vector with feature ranks based on absolute value of feature importance
- n_features_to_keep: Number of features to keep. (Int)
- fold: Fold to be analyzed. (Int)
- out_dir: Output directory for text file. Default '.'
- RANK: Boolean (True or False)
Yields:
- Text file: Ranked features by fold tab-delimited text file, only if RANK=True
-
N Feature Iterator
n_features_iter
Determines the features to keep.
Args:
- nf: current number of features
- keep_rate: percentage of features to keep
Yields:
- int: number of features to keep
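A minimal sketch of such an iterator follows, assuming the count is multiplied by `keep_rate` and truncated at each step until a single feature remains. The name `shrinking_counts` is hypothetical, and the stopping rule is an assumption rather than a statement about the package's code.

```python
def shrinking_counts(nf, keep_rate):
    """Yield successively smaller feature counts until one feature remains."""
    if not 0 < keep_rate < 1:
        # Checked on the first iteration, since this is a generator.
        raise ValueError("keep_rate must be between 0 and 1 (exclusive)")
    while nf > 1:
        nf = max(1, int(nf * keep_rate))
        yield nf

halving = list(shrinking_counts(16, 0.5))   # [8, 4, 2, 1]
```

With a keep rate of 0.8 (an elimination rate of 0.2), 100 features would shrink to 80, then 64, then 51, and so on.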
Redundant features functions
-
Run lowess
run_lowess
This function runs the lowess function and caches it to memory.
Args:
- x: the x-values of the observed points
- y: the y-values of the observed points
- frac: the fraction of the data used when estimating each y-value. default 3/10
Yields:
- z: 2D array of results
-
Convert array to tuple
array_to_tuple
This function attempts to convert a numpy array to a tuple.
Args:
- np_array: numpy array
Yields:
- tuple
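One plausible reason for this conversion is caching: memoization helpers such as `functools.lru_cache` need hashable arguments, and arrays or lists are not hashable. The sketch below uses nested lists in place of numpy arrays so that it has no dependencies; `to_tuple` and `moving_average` are hypothetical stand-ins, not the package's functions, and the moving average is not a lowess fit.

```python
from functools import lru_cache

def to_tuple(obj):
    """Recursively convert nested lists into nested tuples."""
    if isinstance(obj, (list, tuple)):
        return tuple(to_tuple(item) for item in obj)
    return obj  # scalars are returned unchanged

@lru_cache(maxsize=None)
def moving_average(y, window=3):
    """Stand-in for an expensive smoother; y must be a hashable tuple."""
    half = window // 2
    out = []
    for i in range(len(y)):
        chunk = y[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return tuple(out)

smoothed = moving_average(to_tuple([1.0, 2.0, 3.0]))   # (1.5, 2.0, 2.5)
```

Calling `moving_average` again with the same tuple returns the cached result instead of recomputing it, while passing a plain list raises `TypeError` because lists cannot be hashed.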
-
Extract dRFE as a dataframe
get_elim_df_ordered
This function converts the dRFE dictionary to a pandas dataframe.
Args:
- d: dRFE dictionary
- multi: is this for multiple classes. (True or False)
Yields:
- df_elim: dRFE as a dataframe with log10 transformed features
-
Calculate lowess curve
cal_lowess
This function calculates the lowess curve.
Args:
- d: dRFE dictionary
- frac: the fraction of the data used when estimating each y-value
- multi: is this for multiple classes. (True or False)
Yields:
- x: dRFE log10 transformed features
- y: dRFE metrics
- z: 2D numpy array with lowess curve
- xnew: increased intervals
- ynew: interpolated metrics for xnew
-
Calculate lowess curve for log10
cal_lowess
This function calculates the rate of change on the lowess fitted curve with log10-transformed input.
Args:
- d: dRFE dictionary
- frac: the fraction of the data used when estimating each y-value
- multi: is this for multiple classes. default False
Yields:
- data frame: dataframe with n_features, lowess value, and rate of change (DxDy)
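A rate of change on a log10 axis can be approximated with finite differences. The sketch below is a hedged illustration: `rate_of_change` is a hypothetical name, it computes the change in the smoothed metric per unit of log10(n_features), and the package's exact definition and ordering of the difference may differ.

```python
import math

def rate_of_change(n_features, smoothed):
    """Finite-difference slope of a smoothed metric against log10(n_features).

    Returns rows of (n_features, smoothed value, slope) starting at the
    second point, since a slope needs two neighbouring points.
    """
    x = [math.log10(n) for n in n_features]
    rows = []
    for i in range(1, len(x)):
        slope = (smoothed[i] - smoothed[i - 1]) / (x[i] - x[i - 1])
        rows.append((n_features[i], smoothed[i], slope))
    return rows

# Made-up smoothed scores for 1, 10, and 100 features.
rows = rate_of_change([1, 10, 100], [0.5, 0.7, 0.8])
```

In this toy input the slope falls from about 0.2 to about 0.1, meaning each tenfold increase in features buys less improvement; a flattening slope is the kind of signal a redundant-set cutoff can be based on.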
-
Extract max lowess
extract_max_lowess
This function extracts the max features based on rate of change of log10 transformed lowess fit curve.
Args:
- d: dRFE dictionary
- frac: the fraction of the data used when estimating each y-value. default 3/10
- multi: is this for multiple classes. default False
Yields:
- int: number of max features (smallest subset)
-
Extract redundant lowess
extract_redundant_lowess
This function extracts the redundant features based on rate of change of log10 transformed lowess fit curve.
Args:
- d: dRFE dictionary
- frac: the fraction of the data used when estimating each y-value. default 3/10
- step_size: rate of change step size to analyze for extraction. default 0.05
- multi: is this for multiple classes. default False
Yields:
- int: number of redundant features
-
Optimize lowess plot
optimize_lowess_plot
Redundant set selection optimization plot. This will be ROC AUC for multi-class classification (3+ classes), NMI for binary classification, or R2 for regression. The returned plot shows the fraction and step size, the lowess smoothed curve, and the predicted redundant set.
Args:
- d: feature elimination class dictionary
- fold: current fold
- out_dir: output directory. default '.'
- frac: the fraction of the data used when estimating each y-value. default 3/10
- step_size: rate of change step size to analyze for extraction. default 0.05
- classify: is this a classification algorithm. default True
- multi: does this have multiple (3+) classes. default True
Yields:
- graph: plot of dRFE with the estimated redundant set indicated, as well as the fraction and step size used. It automatically saves files as pdf, png, and svg
-
Plot lowess vline
plot_with_lowess_vline
Plot feature elimination results with the redundant set indicated. This will be ROC AUC for multi-class classification (3+ classes), NMI for binary classification, or R2 for regression.
Args:
- d: feature elimination class dictionary
- fold: current fold
- out_dir: output directory. default '.'
- frac: the fraction of the data used when estimating each y-value. default 3/10
- step_size: rate of change step size to analyze for extraction. default 0.05
- classify: is this a classification algorithm. default True
- multi: does this have multiple (3+) classes. default True
Yields:
- graph: plot of dRFE with estimated redundant set indicated, automatically saves files as pdf, png, and svg
Plotting functions
-
Save plots
save_plots
This function saves a plot as svg, png, and pdf with a specific label and dimensions.
Args:
- p: plotnine object
- fn: file name without extensions
- w: width, default 7
- h: height, default 7
Yields: SVG, PNG, and PDF of plotnine object
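Saving one plot in several formats amounts to a loop over file extensions. The sketch below assumes only that the plot object has a `save(filename, width=..., height=...)` method, as plotnine's `ggplot` objects do; `save_all_formats` is a hypothetical name and not the package's function.

```python
def save_all_formats(plot, fn, w=7, h=7, exts=("svg", "png", "pdf")):
    """Call plot.save once per extension and return the file names used."""
    written = []
    for ext in exts:
        path = f"{fn}.{ext}"
        plot.save(path, width=w, height=h)
        written.append(path)
    return written
```

Passing the file name without an extension, as the `fn` argument above describes, lets one call produce all three files with a shared base name.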
-
Plot dRFE Accuracy
plot_acc
Plot feature elimination results for accuracy.
Args:
- d: feature elimination class dictionary
- fold: current fold
- out_dir: output directory. default '.'
Yields:
- graph: plot of feature by accuracy, automatically saves files as pdf, png, and svg
-
Plot dRFE NMI
plot_nmi
Plot feature elimination results for normalized mutual information.
Args:
- d: feature elimination class dictionary
- fold: current fold
- out_dir: output directory. default '.'
Yields:
- graph: plot of feature by NMI, automatically saves files as pdf, png, and svg
-
Plot dRFE ROC AUC
plot_roc
Plot feature elimination results for AUC ROC curve.
Args:
- d: feature elimination class dictionary
- fold: current fold
- out_dir: output directory. default '.'
Yields:
- graph: plot of feature by AUC, automatically saves files as pdf, png, and svg
-
Plot dRFE R2
plot_r2
Plot feature elimination results for R2 score. Note that this can be negative if the model fits worse than always predicting the mean.
Args:
- d: feature elimination class dictionary
- fold: current fold
- out_dir: output directory. default '.'
Yields:
- graph: plot of feature by R2, automatically saves files as pdf, png, and svg
-
Plot dRFE MSE
plot_mse
Plot feature elimination results for mean squared error score.
Args:
- d: feature elimination class dictionary
- fold: current fold
- out_dir: output directory. default '.'
Yields:
- graph: plot of feature by mean squared error, automatically saves files as pdf, png, and svg
-
Plot dRFE Explained Variance
plot_evar
Plot feature elimination results for explained variance score.
Args:
- d: feature elimination class dictionary
- fold: current fold
- out_dir: output directory. default '.'
Yields:
- graph: plot of feature by explained variance, automatically saves files as pdf, png, and svg
Metric functions
-
OOB Prediction
oob_predictions
Extracts out-of-bag (OOB) predictions from random forest classifier classes.
Args:
- estimator: Random forest classifier object
Yields:
- vector: OOB predicted labels
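A scikit-learn random forest classifier fitted with `oob_score=True` exposes `oob_decision_function_` (one row of class vote fractions per training sample) and `classes_`. One way to turn those into predicted labels is to take the highest-scoring class per row. The sketch below uses plain lists in place of the estimator attributes; `labels_from_oob` is a hypothetical helper, and it is an assumption that the package derives labels this way.

```python
def labels_from_oob(oob_decision, classes):
    """Pick, for each sample, the class with the highest OOB vote fraction."""
    labels = []
    for row in oob_decision:
        best = max(range(len(row)), key=row.__getitem__)  # first max on ties
        labels.append(classes[best])
    return labels

# Toy vote fractions for three samples and two classes.
oob = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
predicted = labels_from_oob(oob, ["control", "case"])
```

Because each sample is scored only by trees that did not see it during fitting, these labels behave like held-out predictions without a separate validation split.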
-
OOB Accuracy Score
oob_score_accuracy
Calculates the accuracy score from the OOB predictions.
Args:
- estimator: Random forest classifier object
- Y: a vector of sample labels from training data set
Yields:
- float: accuracy score
-
OOB Normalized Mutual Information Score
oob_score_nmi
Calculates the normalized mutual information score from the OOB predictions.
Args:
- estimator: Random forest classifier object
- Y: a vector of sample labels from training data set
Yields:
- float: normalized mutual information score
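For readers unfamiliar with the metric, normalized mutual information can be computed directly from label counts. This pure-Python sketch normalizes by the arithmetic mean of the two entropies, which matches the default of scikit-learn's `normalized_mutual_info_score`; the function names are hypothetical and the code is an illustration, not the package's implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(y_true, y_pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    n_true, n_pred = Counter(y_true), Counter(y_pred)
    mi = 0.0
    for (t, p), c in joint.items():
        # p(t, p) * log( p(t, p) / (p(t) * p(p)) ), written with counts
        mi += (c / n) * math.log(c * n / (n_true[t] * n_pred[p]))
    denom = (entropy(y_true) + entropy(y_pred)) / 2
    return mi / denom if denom > 0 else 1.0

perfect = nmi([0, 0, 1, 1], [0, 0, 1, 1])
swapped = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The swapped case scores as high as the perfect one because NMI measures shared information rather than label agreement, which is why accuracy is reported alongside it.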
-
OOB Area Under ROC Curve Score
oob_score_roc
Calculates the area under the ROC curve score for the OOB predictions.
Args:
- estimator: Random forest classifier object
- Y: a vector of sample labels from training data set
Yields:
- float: AUC ROC score
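For the binary case, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by comparing all pairs; `roc_auc` is a hypothetical name, and the quadratic pairwise loop is chosen for clarity, not speed.

```python
def roc_auc(y_true, scores):
    """Binary ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # 0.75
```

In the example, three of the four positive/negative pairs are ordered correctly, giving 0.75.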
-
OOB R2 Score
oob_score_r2
Calculates the r2 score from the OOB predictions.
Args:
- estimator: Random forest regressor object
- Y: a vector of sample labels from training data set
Yields:
- float: r2 score
-
OOB Mean Squared Error Score
oob_score_mse
Calculates the mean squared error score from the OOB predictions.
Args:
- estimator: Random forest regressor object
- Y: a vector of sample labels from training data set
Yields:
- float: mean squared error score
-
OOB Explained Variance Score
oob_score_evar
Calculates the explained variance score for the OOB predictions.
Args:
- estimator: Random forest regressor object
- Y: a vector of sample labels from training data set
Yields:
- float: explained variance score
-
Developmental Test Set Predictions
dev_predictions
Extracts predictions using a development fold for a linear regressor.
Args:
- estimator: Linear model regressor object
- X: a data frame of normalized values from developmental dataset
Yields:
- vector: Development set predicted labels
-
Developmental Test Set R2 Score
dev_score_r2
Calculates the r2 score from the developmental dataset predictions.
Args:
- estimator: Linear model regressor object
- X: a data frame of normalized values from developmental dataset
- Y: a vector of sample labels from developmental dataset
Yields:
- float: r2 score
-
Developmental Test Set Mean Squared Error Score
dev_score_mse
Calculates the mean squared error score from the developmental dataset predictions.
Args:
- estimator: Linear model regressor object
- X: a data frame of normalized values from developmental dataset
- Y: a vector of sample labels from developmental dataset
Yields:
- float: mean squared error score
-
Developmental Test Set Explained Variance Score
dev_score_evar
Calculates the explained variance score for the developmental dataset predictions.
Args:
- estimator: Linear model regressor object
- X: a data frame of normalized values from developmental dataset
- Y: a vector of sample labels from developmental data set
Yields:
- float: explained variance score
Linear model classes for dRFE
-
Lasso Class
Lasso and LassoCV
Add feature importance to Lasso class similar to random forest output. LassoCV uses cross-validation for alpha tuning.
-
Ridge Class
Ridge and RidgeCV
Add feature importance to Ridge class similar to random forest output. RidgeCV uses cross-validation for alpha tuning.
-
ElasticNet Class
ElasticNet and ElasticNetCV
Add feature importance to ElasticNet class similar to random forest output. ElasticNetCV uses cross-validation to choose alpha.
-
LinearRegression Class
LinearRegression
Add feature importance to LinearRegression class similar to random forest output.
-
LogisticRegression
LogisticRegression
Adds feature importance to LogisticRegression class similar to random forest output. This was originally modified from Apuã Paquola's script.
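The idea behind these classes can be sketched without scikit-learn. Random forests in scikit-learn expose a `feature_importances_` attribute, while linear models expose `coef_`; a thin subclass can bridge the two so the same ranking code works for both. In the sketch below, `TinyLinearModel` is a stand-in for a fitted linear estimator, and using the absolute coefficient as the importance is an assumption, not a confirmed detail of the package.

```python
class TinyLinearModel:
    """Stand-in for a fitted linear estimator that exposes `coef_`."""
    def __init__(self, coef):
        self.coef_ = list(coef)

class WithImportance(TinyLinearModel):
    """Adds a random-forest-style `feature_importances_` attribute."""
    @property
    def feature_importances_(self):
        # Assumption: importance is the magnitude of each coefficient.
        return [abs(c) for c in self.coef_]

model = WithImportance([0.5, -2.0, 0.0])
importances = model.feature_importances_   # [0.5, 2.0, 0.0]
```

With real estimators, the coefficients are only comparable as importances when the input features are on a common scale.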
Random forest helper functions
-
dRFE Subfunction
rf_fe
Iterate over features to be eliminated by step.
Args:
- estimator: Random forest classifier object
- X: a data frame of training data
- Y: a vector of sample labels from training data set
- n_features_iter: iterator for number of features to keep loop
- features: a vector of feature names
- fold: current fold
- out_dir: output directory. default '.'
- RANK: Boolean (True or False)
Yields:
- list: a list with number of features, normalized mutual information score, accuracy score, and array of the indices for features to keep
-
dRFE Step function
rf_fe_step
Apply random forest to training data, rank features, conduct feature elimination.
Args:
- estimator: Random forest classifier object
- X: a data frame of training data
- Y: a vector of sample labels from training data set
- n_features_to_keep: number of features to keep
- features: a vector of feature names
- fold: current fold
- out_dir: output directory. default '.'
- RANK: Boolean (True or False)
Yields:
- dict: a dictionary with number of features, normalized mutual information score, accuracy score, and selected features
Linear model helper functions
-
dRFE Subfunction
regr_fe
Iterate over features to be eliminated by step.
Args:
- estimator: regression linear model object
- X: a data frame of training data
- Y: a vector of sample labels from training data set
- n_features_iter: iterator for number of features to keep loop
- features: a vector of feature names
- fold: current fold
- out_dir: output directory. default '.'
- dev_size: developmental test set proportion of training
- SEED: random state
- RANK: Boolean (True or False)
Yields:
- list: a list with number of features, r2 score, mean square error, explained variance, and array of the indices for features to keep
-
dRFE Step function
regr_fe_step
Split the training data into a developmental dataset, apply the estimator to the developmental dataset, rank features, and conduct a single step of feature elimination.
Args:
- estimator: regression linear model object
- X: a data frame of training data
- Y: a vector of sample labels from training data set
- n_features_to_keep: number of features to keep
- features: a vector of feature names
- fold: current fold
- out_dir: output directory. default '.'
- dev_size: developmental test set proportion of training
- SEED: random state
- RANK: Boolean (True or False)
Yields:
- dict: a dictionary with number of features, r2 score, mean square error, explained variance, and selected features