Feature selection algorithm based on a user-selected model, loss function, and validation method
Project description
This code performs general feature selection based on a chosen machine learning algorithm and evaluation method.
You can modify the validation method and the loss function yourself.
How to run
The demo is based on the IJCAI-2018 data mining competition.
Import the library from FeatureSelection.py, along with the other necessary libraries
from MLFeatureSelection import FeatureSelection as FS
from sklearn.metrics import log_loss
import lightgbm as lgbm
import pandas as pd
import numpy as np
Generate the dataset
def prepareData():
    df = pd.read_csv('IJCAI-2018/data/train/trainb.csv')
    # keep only the rows that have a label
    df = df[~pd.isnull(df.is_trade)]
    # encode each distinct category list as an integer code
    item_category_list_unique = list(np.unique(df.item_category_list))
    df.item_category_list.replace(item_category_list_unique, list(np.arange(len(item_category_list_unique))), inplace=True)
    return df
Define your loss function
def modelscore(y_test, y_pred):
    return log_loss(y_test, y_pred)
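To make explicit what this score measures, here is a minimal plain-Python sketch of binary log loss wrapped in the same (true labels, predicted probabilities) signature. The names `binary_log_loss` and `eps` are illustrative and are not part of MLFeatureSelection or scikit-learn; in practice `sklearn.metrics.log_loss` does this job.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood for binary labels (lower is better)."""
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        # Clip the probability so that log(0) never occurs.
        prob = min(max(prob, eps), 1.0 - eps)
        total += -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
    return total / len(y_true)

def modelscore(y_test, y_pred):
    # Same signature as the demo's loss function: labels first, predictions second.
    return binary_log_loss(y_test, y_pred)

score = modelscore([1, 0, 1], [0.9, 0.1, 0.8])
```

Because a lower value is better here, this kind of score pairs with direction = 'descend' as used later in the demo.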
Define the way to validate
def validation(X, y, features, clf, lossfunction):
    totaltest = 0
    for D in [24]:  # hold out day 24 as the test set
        T = (X.day != D)
        X_train, X_test = X[T], X[~T]
        X_train, X_test = X_train[features], X_test[features]
        y_train, y_test = y[T], y[~T]
        # the train method must match your selected algorithm
        clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric='logloss', verbose=False, early_stopping_rounds=200)
        totaltest += lossfunction(y_test, clf.predict_proba(X_test)[:, 1])
    totaltest /= 1.0  # divide by the number of held-out days
    return totaltest
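The same pattern extends to several held-out days by averaging the loss over them. The sketch below assumes a toy dataset and uses a hypothetical stand-in model (`ConstantRateClassifier`) and a simple illustrative loss (`mean_abs_error`) so that it runs without LightGBM; the `holdout_days` parameter is likewise an assumption, not part of the library.

```python
import numpy as np
import pandas as pd

class ConstantRateClassifier:
    """Hypothetical stand-in model: predicts the training positive rate for every row."""
    def fit(self, X, y):
        self.rate_ = float(np.mean(y))
        return self

    def predict_proba(self, X):
        p = np.full(len(X), self.rate_)
        return np.column_stack([1.0 - p, p])

def mean_abs_error(y_true, y_prob):
    # Illustrative loss with the (true labels, predictions) signature.
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_prob))))

def validation_multi_day(X, y, features, clf, lossfunction, holdout_days=(23, 24)):
    """Hold out each day in turn and average the loss over the held-out days."""
    total = 0.0
    for day in holdout_days:
        train_mask = X["day"] != day
        X_train, X_test = X.loc[train_mask, features], X.loc[~train_mask, features]
        y_train, y_test = y[train_mask], y[~train_mask]
        clf.fit(X_train, y_train)
        total += lossfunction(y_test, clf.predict_proba(X_test)[:, 1])
    return total / len(holdout_days)

X = pd.DataFrame({"day": [22, 22, 23, 23, 24, 24], "price": [1, 2, 3, 4, 5, 6]})
y = pd.Series([0, 1, 0, 1, 1, 1])
score = validation_multi_day(X, y, ["price"], ConstantRateClassifier(), mean_abs_error)
```

Dividing by the number of held-out days plays the role of the `totaltest /= 1.0` line in the single-day version.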
Define the cross method (required when Cross = True)
def add(x, y):
    return x + y

def substract(x, y):
    return x - y

def times(x, y):
    return x * y

def divide(x, y):
    # the 0.001 offsets avoid division by zero
    return (x + 0.001)/(y + 0.001)

CrossMethod = {'+': add,
               '-': substract,
               '*': times,
               '/': divide,}
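As a rough illustration of how such an operator dictionary could be used to build crossed features, the sketch below applies two of the operators to a pair of toy columns. The column values and the `(a op b)` naming scheme are assumptions made for this example; the names the library actually generates may differ.

```python
import numpy as np

def add(x, y):
    return x + y

def divide(x, y):
    # The 0.001 offsets keep the ratio finite when y is 0.
    return (x + 0.001) / (y + 0.001)

cross_methods = {"+": add, "/": divide}

# Toy columns standing in for two features of the dataset.
columns = {
    "item_price_level": np.array([1.0, 2.0, 4.0]),
    "item_sales_level": np.array([2.0, 0.0, 4.0]),
}

# Build one crossed column per operator; the naming is illustrative only.
crossed = {}
for symbol, func in cross_methods.items():
    name = f"(item_price_level{symbol}item_sales_level)"
    crossed[name] = func(columns["item_price_level"], columns["item_sales_level"])
```

Note that the second row has a zero denominator, and the offset in `divide` turns what would be a division by zero into a large but finite value.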
Initialize the searcher with a customized procedure (sequence + random + cross)
sf = FS.Select(Sequence = False, Random = True, Cross = False) #select the way you want to process searching
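For intuition about what a sequential search can look like, here is a generic greedy forward selection on a toy loss. This is a textbook technique shown under stated assumptions and is not claimed to be the procedure MLFeatureSelection runs internally; `greedy_forward_selection`, `toy_loss`, and the `GAIN` table are hypothetical names invented for the example.

```python
def greedy_forward_selection(initial, candidates, evaluate, max_features):
    """Repeatedly add the single candidate that lowers the loss the most."""
    selected = list(initial)
    best_loss = evaluate(selected)
    while len(selected) < max_features:
        trials = [(evaluate(selected + [f]), f) for f in candidates if f not in selected]
        if not trials:
            break
        loss, feature = min(trials)
        if loss >= best_loss:
            break  # no remaining candidate improves the score
        selected.append(feature)
        best_loss = loss
    return selected, best_loss

# Toy loss: each useful feature lowers the loss, while "noise" raises it.
GAIN = {"day": 0.10, "item_price_level": 0.05, "user_age_level": 0.02, "noise": -0.03}

def toy_loss(features):
    return 1.0 - sum(GAIN[f] for f in features)

selected, loss = greedy_forward_selection(
    ["day"], ["item_price_level", "user_age_level", "noise"], toy_loss, max_features=40
)
```

In a real run, `evaluate` would train the model on the candidate feature set and return the validation loss, which is why each step is expensive.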
Import loss function
sf.ImportLossFunction(modelscore,direction = 'descend')
Import dataset
sf.ImportDF(prepareData(),label = 'is_trade')
Import cross method (required when Cross = True)
sf.ImportCrossMethod(CrossMethod)
Define non-trainable features
sf.InitialNonTrainableFeatures(['used','instance_id', 'item_property_list', 'context_id', 'context_timestamp', 'predict_category_property', 'is_trade'])
Define the initial feature combination
sf.InitialFeatures(['item_category_list', 'item_price_level','item_sales_level','item_collected_level', 'item_pv_level','day'])
Define potential features that can be added later
sf.AddPotentialFeatures(['user_age_level'])
Define algorithm
sf.clf = lgbm.LGBMClassifier(random_state=1, num_leaves = 6, n_estimators=5000, max_depth=3, learning_rate = 0.05, n_jobs=8)
Define log file name
sf.SetLogFile('record.log')
Set the maximum number of features
sf.SetFeaturesLimit(40) #maximum number of features
Set maximum time limit (in minutes)
sf.SetTimeLimit(100) #maximum running time in minutes
Set the sample ratio of the total dataset. When samplemode is 0, the same subset is used every time; when samplemode is 1, a different subset is drawn each time.
sf.SetSample(0.1, samplemode = 0)
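The difference between the two sample modes can be pictured with a small sketch using Python's `random` module. This only mimics the behavior described above and makes no claim about how the library implements sampling; `draw_subset`, `call_number`, and `base_seed` are hypothetical names.

```python
import random

def draw_subset(indices, ratio, samplemode, call_number, base_seed=0):
    """Illustrative sampler: mode 0 reuses one seed, mode 1 changes the seed per call."""
    seed = base_seed if samplemode == 0 else base_seed + call_number
    rng = random.Random(seed)
    k = max(1, int(len(indices) * ratio))
    return sorted(rng.sample(indices, k))

rows = list(range(1000))

# samplemode = 0: every call sees the same 10% of the rows.
fixed_a = draw_subset(rows, 0.1, samplemode=0, call_number=1)
fixed_b = draw_subset(rows, 0.1, samplemode=0, call_number=2)

# samplemode = 1: each call draws a fresh 10% of the rows.
varying_a = draw_subset(rows, 0.1, samplemode=1, call_number=1)
varying_b = draw_subset(rows, 0.1, samplemode=1, call_number=2)
```

A fixed subset makes scores comparable between feature sets, while a varying subset reduces the risk of tuning the selection to one particular sample.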
Generate the feature library; you can specify a keyword and a selection step
sf.GenerateCol(key = 'mean', selectstep = 2) #can iterate different features set
Run with the self-defined validation method
sf.run(validation)
This code takes a while to run. You can stop it at any time and restart it by passing the best feature combination found so far to sf.InitialFeatures().
This feature selection method achieved:
1st in Rong360
– https://github.com/duxuhao/rong360-season2
Temporary Top 10 in JData-2018
12th in IJCAI-2018 1st round
Algorithm details