Feature ranking ensemble
FEATURE RANKER
featureranker is a lightweight Python package for the feature ranking ensemble developed by Logan Hallee, featured in the following works:
Machine learning classifiers predict key genomic and evolutionary traits across the kingdoms of life
cdsBERT - Extending Protein Language Models with Codon Awareness
Exploring Phylogenetic Classification and Further Applications of Codon Usage Frequencies
The ensemble uses L1 penalization, random forests, extreme gradient boosting (XGBoost), ANOVA F-values, and mutual information to rank the importance of features for regression and classification tasks. The scoring lists produced by each method are merged with a weighted voting scheme.
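To make the idea of a weighted vote over several scoring lists concrete, here is a minimal sketch in plain Python. It is not the package's implementation: the function name `weighted_vote`, the normalization step, the weights, and the example scores are all assumptions chosen for illustration.

```python
def weighted_vote(score_lists, weights=None):
    """Combine per-method importance scores into one ranking (illustrative only).

    score_lists: dict mapping method name -> {feature: non-negative score}
    weights:     dict mapping method name -> weight (defaults to 1.0 each)
    """
    weights = weights or {}
    totals = {}
    for method, scores in score_lists.items():
        norm = sum(scores.values())
        if norm == 0:
            continue  # a method that scored nothing casts no vote
        w = weights.get(method, 1.0)
        for feature, score in scores.items():
            # Normalize so every method distributes the same total vote.
            totals[feature] = totals.get(feature, 0.0) + w * score / norm
    # Highest combined score first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores from three of the methods named above.
scores = {
    "l1": {"bmi": 0.6, "age": 0.0, "bp": 0.4},
    "random_forest": {"bmi": 0.5, "age": 0.2, "bp": 0.3},
    "mutual_info": {"bmi": 0.3, "age": 0.3, "bp": 0.4},
}
ranking = weighted_vote(scores, weights={"random_forest": 2.0})
print(ranking)
```

Normalizing each list before weighting keeps methods with large raw scores (for example F-values) from dominating methods whose scores are bounded; the package's own voting scheme may handle this differently.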
Usage
Install
!pip install featureranker
Imports
from featureranker.utils import *
from featureranker.plots import *
from featureranker.rankers import *
import pandas as pd
from sklearn.datasets import load_diabetes, load_breast_cancer
import warnings
warnings.filterwarnings('ignore')
Regression example (diabetes dataset)
diabetes = load_diabetes(as_frame=True)
df = diabetes.data.merge(diabetes.target, left_index=True, right_index=True)
view_data(df)
X, y = get_data(df, labels='target')
hypers = regression_hyper_param_search(X, y, 3, 5)
xb_hypers = hypers[0]['best_params']
rf_hypers = hypers[1]['best_params']
ranking = regression_ranking(X, y, rf_hypers, xb_hypers)
scoring = voting(ranking)
plot_ranking(scoring, title='Regression example')
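One member of the ensemble, the ANOVA F-value, can be computed by hand for a single feature in a regression setting. The sketch below assumes the common univariate definition, F = r² / (1 − r²) × (n − 2), where r is the Pearson correlation between the feature and the target. The function name and the toy data are hypothetical, and the package itself may compute this differently.

```python
def f_statistic_regression(x, y):
    """Univariate F-statistic of one feature against a continuous target.

    Assumes len(x) == len(y) >= 3 and that neither x nor y is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    if r2 >= 1.0:
        return float("inf")  # perfectly linear relationship
    return r2 / (1.0 - r2) * (n - 2)


# Toy data: one informative feature and one noisy feature.
target = [2.1, 3.9, 6.2, 7.8, 10.1]
informative = [1, 2, 3, 4, 5]
noisy = [3, 1, 4, 1, 5]

f_informative = f_statistic_regression(informative, target)
f_noisy = f_statistic_regression(noisy, target)
print(f_informative > f_noisy)
```

A larger F-value means the feature explains more of the target's variance under a linear model, which is why it can serve as a ranking score.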
Classification example (breast cancer dataset)
cancer = load_breast_cancer(as_frame=True)
df = cancer.data.merge(cancer.target, left_index=True, right_index=True)
view_data(df)
X, y = get_data(df, labels='target')
hypers = classification_hyper_param_search(X, y, 3, 5)
xb_hypers = hypers[0]['best_params']
rf_hypers = hypers[1]['best_params']
ranking = classification_ranking(X, y, rf_hypers, xb_hypers)
scoring = voting(ranking)
plot_ranking(scoring, title='Classification example')
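For classification, the ANOVA F-score of a feature compares the variance between class means to the variance within classes. The following is a minimal sketch of the standard one-way ANOVA F-statistic in plain Python; the function name `anova_f` and the toy data are assumptions for illustration, not the package's code.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic for one feature across class labels.

    Assumes at least two classes and more samples than classes.
    """
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)

    n = len(values)
    k = len(groups)
    grand_mean = sum(values) / n

    ss_between = 0.0  # spread of class means around the grand mean
    ss_within = 0.0   # spread of samples around their own class mean
    for g in groups.values():
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((v - mean_g) ** 2 for v in g)

    if ss_within == 0:
        return float("inf")  # classes are perfectly separated with no spread
    return (ss_between / (k - 1)) / (ss_within / (n - k))


labels = [0, 0, 0, 1, 1, 1]
separating = [1, 2, 3, 5, 6, 7]   # class means differ clearly
overlapping = [1, 5, 3, 2, 6, 4]  # classes overlap heavily

print(anova_f(separating, labels), anova_f(overlapping, labels))
```

Features whose class means are far apart relative to their within-class spread receive a high F-score and therefore a high rank from this member of the ensemble.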
Documentation
See the documentation linked from the project page for more details.
Citation
Please cite Hallee, L., Khomtchouk, B.B. Machine learning classifiers predict key genomic and evolutionary traits across the kingdoms of life. Sci Rep 13, 2088 (2023). https://doi.org/10.1038/s41598-023-28965-7
and
Hallee, L., Rafailidis, N., Gleghorn, J.P. cdsBERT - Extending Protein Language Models with Codon Awareness. bioRxiv 2023.09.15.558027 (2023). https://doi.org/10.1101/2023.09.15.558027
News
- 7/21/2022: A preliminary version of this feature ranker, leveraging lasso and random forests, is published on bioRxiv for phylogenetic and organelle prediction.
- 2/6/2023: The preliminary work makes its way into Nature Scientific Reports!
- 9/17/2023: The feature ranker is now a proper ensemble, with a custom soft voting scheme. XGBoost, recursive feature elimination, and mutual information are also leveraged. The ensemble is used to unify the results of the previous papers in the cdsBERT paper.
- 10/15/2023: Separate classification and regression versions are developed for more reliable results. Logistic regression (one-vs-rest, OvR) with an L1 penalty takes the place of lasso for classification.
- 11/7/2023: Recursive feature elimination is replaced with ANOVA F-scores, which rank features based on modeled variance.
- 11/8/2023: Various utility helpers and plot functions are added for ease of use. The proper l1 penalty constant is now found automatically. The automatic hyperparameter search also returns the best metrics found via the methodologies.
- 11/9/2023: Version 1.0.0 of the package is published for testing on TestPyPI.
- 11/10/2023: Version 1.0.1 is published on PyPI under featureranker.
Hashes for featureranker-1.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0e6bc820f57505de7192a3da5a6fb1761beb565285844eca967a9be32ace4c01
MD5 | 5f88585cabb83ef8b529e706770cd0bf
BLAKE2b-256 | d1ed54677f060efa2060e0909896fa760c476a9f125c2981988b768dd6edbbc4