
Feature selection based on a user-selected algorithm, loss function, and validation method

Project description

This code performs general feature selection driven by a machine learning algorithm and an evaluation method of your choice.

You can define your own validation method and loss function.

How to run

The demo is based on the IJCAI-2018 data mining competition.

  • Import the selection module from FeatureSelection.py along with the other necessary libraries

from MLFeatureSelection import FeatureSelection as FS
from sklearn.metrics import log_loss
import lightgbm as lgbm
import pandas as pd
import numpy as np
  • Generate the dataset

def prepareData():
    df = pd.read_csv('IJCAI-2018/data/train/trainb.csv')
    df = df[~pd.isnull(df.is_trade)]  # keep only labeled rows
    # encode item_category_list as consecutive integer codes
    item_category_list_unique = list(np.unique(df.item_category_list))
    df.item_category_list.replace(item_category_list_unique, list(np.arange(len(item_category_list_unique))), inplace=True)
    return df
  • Define your loss function

def modelscore(y_test, y_pred):
    return log_loss(y_test, y_pred)
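As a quick sanity check with toy numbers (not from the competition data), log loss rewards confident, correct probability estimates:

```python
from sklearn.metrics import log_loss

# two samples, both predicted correctly with probability 0.9
y_test = [0, 1]
y_pred = [0.1, 0.9]  # predicted probability of class 1
score = log_loss(y_test, y_pred)  # -mean(log p(true class)), about 0.105
```

Since a lower log loss is better, the searcher is later told to minimize it via `direction = 'descend'`.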
  • Define the way to validate

def validation(X, y, features, clf, lossfunction):
    totaltest = 0
    days = [24]  # hold out each of these days in turn
    for D in days:
        T = (X.day != D)  # train on every other day
        X_train, X_test = X[T], X[~T]
        X_train, X_test = X_train[features], X_test[features]
        y_train, y_test = y[T], y[~T]
        # the fit call must match your selected algorithm
        clf.fit(X_train, y_train, eval_set = [(X_train, y_train), (X_test, y_test)],
                eval_metric='logloss', verbose=False, early_stopping_rounds=200)
        totaltest += lossfunction(y_test, clf.predict_proba(X_test)[:,1])
    totaltest /= len(days)  # average the loss over the held-out days
    return totaltest
  • Define the cross method (required when Cross = True)

def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

def times(x, y):
    return x * y

def divide(x, y):
    return (x + 0.001) / (y + 0.001)  # small constant avoids division by zero

CrossMethod = {'+': add,
               '-': subtract,
               '*': times,
               '/': divide}
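To picture what a cross operator produces, here is the divide operator applied to two demo-dataset columns; the pairing and the derived column name are an illustration of the idea, not the library's exact naming scheme:

```python
import pandas as pd

def divide(x, y):
    return (x + 0.001) / (y + 0.001)  # small constant avoids division by zero

df = pd.DataFrame({'item_price_level': [2.0, 4.0],
                   'item_sales_level': [1.0, 2.0]})
# the crossed feature combines the operator with its two operand columns
df['item_price_level/item_sales_level'] = divide(df['item_price_level'],
                                                 df['item_sales_level'])
```

The searcher can then evaluate such derived columns alongside the original features.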
  • Initialize the searcher with a customized procedure (any combination of sequence, random, and cross)

sf = FS.Select(Sequence = False, Random = True, Cross = False) #choose which search procedures to run
  • Import loss function

sf.ImportLossFunction(modelscore, direction = 'descend') #'descend': improvement means a lower score (e.g. log loss)
  • Import dataset

sf.ImportDF(prepareData(),label = 'is_trade')
  • Import cross method (required when Cross = True)

sf.ImportCrossMethod(CrossMethod)
  • Define non-trainable features

sf.InitialNonTrainableFeatures(['used','instance_id', 'item_property_list', 'context_id', 'context_timestamp', 'predict_category_property', 'is_trade'])
  • Define the initial feature combination

sf.InitialFeatures(['item_category_list', 'item_price_level','item_sales_level','item_collected_level', 'item_pv_level','day'])
  • Define potential features that can be added later

sf.AddPotentialFeatures(['user_age_level'])
  • Define algorithm

sf.clf = lgbm.LGBMClassifier(random_state=1, num_leaves = 6, n_estimators=5000, max_depth=3, learning_rate = 0.05, n_jobs=8)
  • Define log file name

sf.SetLogFile('record.log')
  • Set maximum features quantity

sf.SetFeaturesLimit(40) #maximum number of features
  • Set maximum time limit (in minutes)

sf.SetTimeLimit(100) #maximum running time in minutes
  • Set the sample ratio of the total dataset. When samplemode equals 0, the same subset is used on every run; when samplemode equals 1, a different subset is drawn each time

sf.SetSample(0.1, samplemode = 0)
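The two sample modes can be pictured with plain pandas sampling; this is an illustration of the idea, not the library's internal code. A fixed random state reproduces the same subset, while omitting it draws a fresh one each call:

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})

# samplemode = 0: fixed seed, so every run sees the same 10% subset
subset_a = df.sample(frac=0.1, random_state=1)
subset_b = df.sample(frac=0.1, random_state=1)
# subset_a and subset_b contain exactly the same rows

# samplemode = 1: no fixed seed, so each run generally draws a different subset
subset_c = df.sample(frac=0.1)
```

Using a fixed subset keeps scores comparable across iterations; varying the subset trades that for lower overfitting to one sample.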
  • Generate the feature library; you can specify a key word and a selection step

sf.GenerateCol(key = 'mean', selectstep = 2) #iterate over different feature sets
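Conceptually, the key word restricts the candidate pool to columns whose names contain it; the column names below are hypothetical, used only to show the filtering idea:

```python
# hypothetical pool of engineered columns
columns = ['user_age_level', 'price_mean_by_shop', 'sales_mean_by_user', 'day']
key = 'mean'
# keep only candidates whose names contain the key word
candidates = [c for c in columns if key in c]
```

This lets you focus a search pass on one family of engineered features (here, the mean-aggregation columns).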
  • Run with the self-defined validation method

sf.run(validation)
  • This code takes a while to run; you can stop it at any time and restart by placing the best feature combination found so far into sf.InitialFeatures()

This feature selection method achieved:

  • 1st in Rong360

https://github.com/duxuhao/rong360-season2

  • Temporarily top 10 in JData-2018

  • 12th in IJCAI-2018, 1st round

Algorithm details

Procedure

