Feature selection based on a user-selected algorithm, loss function, and validation method
Project description
General feature selection built around a user-chosen machine learning algorithm and evaluation method
Diverse, flexible, and easy to use
More feature selection methods will be included in the future!
Quick Installation
pip3 install MLFeatureSelection
Modules in version 0.0.7
Module for selecting features based on a greedy algorithm (from MLFeatureSelection import sequence_selection)
Module for removing features based on feature importance (from MLFeatureSelection import importance_selection)
Module for removing features based on correlation coefficient (from MLFeatureSelection import coherence_selection)
Module for reading a feature combination from the log file (from MLFeatureSelection.tools import readlog)
Module Usage
sequence_selection
from MLFeatureSelection import sequence_selection
from sklearn.linear_model import LogisticRegression
sf = sequence_selection.Select(Sequence = True, Random = True, Cross = True)
sf.ImportDF(df,label = 'Label') #import dataframe and label
sf.ImportLossFunction(lossfunction, direction = 'ascend') #import loss function handle and optimization direction: 'ascend' for AUC, ACC; 'descend' for logloss, etc.
sf.InitialNonTrainableFeatures(notusable) #features in the dataframe that are not trainable: user_id, strings, etc.
sf.InitialFeatures(initialfeatures) #initial features as a list
sf.SelectRemoveMode(batch = 2)
sf.GenerateCol() #generate features for selection
sf.clf = LogisticRegression() #set the selected algorithm, can be any algorithm
sf.SetLogFile('record.log') #log file
sf.run(validate) #run with the validation function (validate is its function handle), returns the best feature combination
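To make the idea of greedy selection concrete, here is a minimal standalone sketch of forward selection written with scikit-learn only. It is not the implementation inside sequence_selection; the function name greedy_forward_selection, the 3-fold cross-validation, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def greedy_forward_selection(X, y, clf, max_features=None):
    """Hypothetical helper: add one feature at a time while the CV score improves."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_score = -np.inf
    while remaining and (max_features is None or len(selected) < max_features):
        # Score each remaining feature when added to the current selection.
        scores = {
            f: cross_val_score(clf, X[:, selected + [f]], y, cv=3).mean()
            for f in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_score:
            break  # no candidate improves the score, so stop searching
        best_score = scores[candidate]
        selected.append(candidate)
        remaining.remove(candidate)
    return selected, best_score


# Synthetic data, used only to exercise the sketch.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
selected, score = greedy_forward_selection(X, y, LogisticRegression(max_iter=1000))
print(selected, round(score, 3))
```

The loop stops as soon as no single added feature raises the mean accuracy, which is the usual stopping rule for a plain greedy search.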
importance_selection
from MLFeatureSelection import importance_selection
import xgboost as xgb
sf = importance_selection.Select()
sf.ImportDF(df,label = 'Label') #import dataframe and label
sf.ImportLossFunction(lossfunction, direction = 'ascend') #import loss function and optimize direction
sf.InitialFeatures() #initial features, input
sf.SelectRemoveMode(batch = 2)
sf.clf = xgb.XGBClassifier()
sf.SetLogFile('record.log') #log file
sf.run(validate) #run with validation function, return best features combination
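As a rough illustration of removing features by importance, the sketch below drops the least important features in batches for as long as the cross-validated score does not get worse. It is an assumption-laden stand-in, not the code of importance_selection: it uses scikit-learn's RandomForestClassifier (instead of xgboost) for its feature_importances_ attribute, and the helper name remove_by_importance is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def remove_by_importance(X, y, clf, batch=2, min_features=1):
    """Hypothetical helper: drop the `batch` least important features per iteration."""
    features = list(range(X.shape[1]))
    best_score = cross_val_score(clf, X[:, features], y, cv=3).mean()
    while len(features) - batch >= min_features:
        clf.fit(X[:, features], y)
        # Positions in `features`, ordered from least to most important.
        order = np.argsort(clf.feature_importances_)
        candidate = [features[i] for i in sorted(order[batch:])]
        score = cross_val_score(clf, X[:, candidate], y, cv=3).mean()
        if score < best_score:
            break  # removing this batch hurts the score, so keep the current set
        best_score = score
        features = candidate
    return features, best_score


# Synthetic data, used only to exercise the sketch.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
kept, score = remove_by_importance(X, y, clf, batch=2)
print(kept, round(score, 3))
```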
coherence_selection
from MLFeatureSelection import coherence_selection
import xgboost as xgb
sf = coherence_selection.Select()
sf.ImportDF(df,label = 'Label') #import dataframe and label
sf.ImportLossFunction(lossfunction, direction = 'ascend') #import loss function and optimize direction
sf.InitialFeatures() #initial features, input
sf.SelectRemoveMode(batch = 2)
sf.clf = xgb.XGBClassifier()
sf.SetLogFile('record.log') #log file
sf.run(validate) #run with validation function, return best features combination
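The idea of removing features by correlation coefficient can be sketched with pandas alone. The example below drops one column from every highly correlated pair; the threshold of 0.9, the helper name drop_correlated, and the toy data are assumptions, and coherence_selection itself may combine correlation with the model score in a different way.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Hypothetical helper: drop the later column of each highly correlated pair."""
    corr = df.corr().abs()
    # Keep only the upper triangle so every pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: "a_copy" is nearly collinear with "a", "b" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.01 * rng.normal(size=100),
    "b": rng.normal(size=100),
})
reduced, dropped = drop_correlated(df)
print(dropped)
```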
log reader
from MLFeatureSelection.tools import readlog
logfile = 'record.log'
logscore = 0.5 #any score in the logfile
features_combination = readlog(logfile, logscore)
Function Parameters
sf.ImportDF(df,label)
df: pd.DataFrame, includes all features
label: str, name of the label column
sf.ImportLossFunction(lossfunction,direction)
lossfunction: handle of the loss function; the function returns the score as a scalar value (logloss, AUC, etc.)
direction: 'ascend'/'descend', direction in which the score improves
sf.InitialFeatures(features)
features: list, initial feature combination. An empty list starts the search from nothing; a list with all trainable features starts with a backward search.
sf.InitialNonTrainableFeatures(features) #only for sequence selection
features: list of features that are not trainable (string, datetime, etc.)
sf.GenerateCol(key=None,selectstep=1) #only for sequence selection
key: str, keyword for the selected features; only features containing the keyword will be selected, default None
selectstep: int, feature selection step, default 1
sf.SelectRemoveMode(frac=1,batch=1,key='')
frac: float, fraction of all features to delete; the default of 1 means batch is used instead
batch: int, number of features to delete in each iteration
key: str, only delete features containing the keyword
sf.SetTimeLimit(TimeLimit)
TimeLimit: float, maximum running time in minutes
sf.SetFeaturesLimit(FeaturesLimit)
FeaturesLimit: int, maximum number of features
sf.SetClassifier(clf)
clf: classifier or estimator (sklearn, xgboost, lightgbm, etc.)
sf.SetLogFile(logfile)
logfile: str, log file name
sf.run(validate)
validate: function handle that returns the score and the trained classifier
import numpy as np

def validate(X, y, features, clf, lossfunction):
    """define your own validation function with 5 parameters
    input as X, y, features, clf, lossfunction
    clf is set by SetClassifier()
    lossfunction is imported earlier
    features will be generated automatically
    function returns the score and the trained classifier
    """
    clf.fit(X[features], y)
    y_pred = clf.predict(X[features])
    score = lossfunction(y_pred, y)
    return score, clf

def lossfunction(y_pred, y_test):
    """define your own loss function with y_pred and y_test
    return score
    """
    return np.mean(y_pred == y_test)
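The validation function above fits and predicts on the same rows, so its score reflects training fit. A possible variant, sketched below under stated assumptions, scores on a held-out split and pairs each loss function with its direction. The names auc_score, logloss_score, and validate_holdout, the 70/30 split, and the synthetic data are illustrative choices, not part of the package; only the five-parameter signature and the (y_pred, y_test) argument order follow the description above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split


def auc_score(y_pred, y_test):
    """AUC: higher is better, so it pairs with direction='ascend'."""
    return roc_auc_score(y_test, y_pred)  # scikit-learn expects (y_true, y_score)


def logloss_score(y_pred, y_test):
    """Log loss: lower is better, so it pairs with direction='descend'."""
    return log_loss(y_test, y_pred)


def validate_holdout(X, y, features, clf, lossfunction):
    """Fit on a training split and score probabilities on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[features], y, test_size=0.3, random_state=0, stratify=y
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
    return lossfunction(y_pred, y_test), clf


# Synthetic data, used only to exercise the sketch.
data, target = make_classification(n_samples=300, n_features=5, random_state=0)
X = pd.DataFrame(data, columns=[f"f{i}" for i in range(5)])
y = pd.Series(target)
score, model = validate_holdout(X, y, ["f0", "f1", "f2"], LogisticRegression(), auc_score)
print(round(score, 3))
```

Because both metrics here take predicted probabilities, the sketch calls predict_proba; an accuracy-style loss function like the one above would use predict instead.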
DEMO
More examples are available in the example folder.
PLAN
A better API introduction will be completed before the end of 06/2018.
This feature selection method achieved:
1st in Rong360
– https://github.com/duxuhao/rong360-season2
Temporary Top 10 in JData-2018 (Peter Du)
12th in IJCAI-2018 1st round