General feature selection based on a selected algorithm
Features Selection
=====================================
This code performs general feature selection driven by a chosen machine learning algorithm and evaluation method.

How to run (see demo.py)
------------------------------------------------
The demo is based on the IJCAI-2018 data mining competition.

- Import the library from FeatureSelection.py along with the other required libraries

.. code-block:: python

    import MLFeaturesSelection as FS
    from sklearn.metrics import log_loss
    import lightgbm as lgbm
    import pandas as pd
    import numpy as np

- Prepare the dataset

.. code-block:: python

    def prepareData():
        df = pd.read_csv('IJCAI-2018/data/train/trainb.csv')
        # Keep only rows that have a label.
        df = df[~pd.isnull(df.is_trade)].copy()
        # Map each distinct category string to an integer code.
        item_category_list_unique = list(np.unique(df.item_category_list))
        df['item_category_list'] = df['item_category_list'].replace(
            item_category_list_unique,
            list(np.arange(len(item_category_list_unique))))
        return df

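The integer encoding in ``prepareData`` can also be written with ``pandas.factorize``. The following is a minimal sketch on a made-up column (the values are illustrative and not taken from the competition data):

.. code-block:: python

    import pandas as pd

    # Toy stand-in for the item_category_list column (values are made up).
    df = pd.DataFrame({'item_category_list': ['b;c', 'a', 'b;c', 'd', 'a']})

    # sort=True assigns codes in sorted order of the unique values, which is
    # the same position each value would have in np.unique(...).
    codes, uniques = pd.factorize(df['item_category_list'], sort=True)
    df['item_category_list'] = codes

Each distinct string ends up as one integer, so the column can be fed to a tree-based model directly.
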
- Define your loss function

.. code-block:: python

    def modelscore(y_test, y_pred):
        return log_loss(y_test, y_pred)

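For reference, the quantity that a binary log loss measures can be written out by hand. This is only an illustrative sketch in NumPy (``binary_log_loss`` is a hypothetical helper, not part of this package or of scikit-learn); lower values are better:

.. code-block:: python

    import numpy as np

    def binary_log_loss(y_true, y_prob, eps=1e-15):
        """Mean negative log-likelihood of binary labels under predicted probabilities."""
        y_true = np.asarray(y_true, dtype=float)
        # Clip so that log(0) never occurs.
        p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
        return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

    score = binary_log_loss([1, 0, 1], [0.9, 0.2, 0.6])
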
- Define the validation method

.. code-block:: python

    def validation(X, y, clf, lossfunction):
        totaltest = 0
        days = [24]  # days held out for validation, one fold per day
        for D in days:
            T = (X.day != D)
            X_train, X_test = X[T], X[~T]
            y_train, y_test = y[T], y[~T]
            # The fit call must match your selected algorithm.
            clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric='logloss', verbose=False, early_stopping_rounds=200)
            totaltest += lossfunction(y_test, clf.predict_proba(X_test)[:, 1])
        totaltest /= len(days)  # average the loss over the folds
        return totaltest

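The same hold-out-by-day idea extends to several validation days. The sketch below assumes a ``day`` column as in the demo; ``MeanRateClassifier`` and ``mean_abs_error`` are hypothetical stand-ins used only so the example runs without LightGBM:

.. code-block:: python

    import numpy as np
    import pandas as pd

    class MeanRateClassifier:
        """Hypothetical stand-in model: always predicts the training positive rate."""
        def fit(self, X, y):
            self.rate_ = float(np.mean(y))
            return self

        def predict_proba(self, X):
            p = np.full(len(X), self.rate_)
            return np.column_stack([1 - p, p])

    def mean_abs_error(y_true, y_prob):
        # Simple illustrative loss; any function of (labels, probabilities) works.
        return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_prob))))

    def validate_by_day(X, y, clf, lossfunction, days=(23, 24)):
        """Hold out each day in turn and average the loss over the folds."""
        total = 0.0
        for d in days:
            train = (X['day'] != d)
            clf.fit(X[train], y[train])
            total += lossfunction(y[~train], clf.predict_proba(X[~train])[:, 1])
        return total / len(days)

    X = pd.DataFrame({'day': [22, 22, 23, 23, 24, 24], 'f': [1, 2, 3, 4, 5, 6]})
    y = pd.Series([0, 1, 0, 0, 1, 1])
    score = validate_by_day(X, y, MeanRateClassifier(), mean_abs_error)

Dividing by the number of held-out days keeps the score comparable when the list of days changes.
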
- Define the cross methods (required when *Cross = True*)

.. code-block:: python

    def add(x, y):
        return x + y

    def subtract(x, y):
        return x - y

    def times(x, y):
        return x * y

    def divide(x, y):
        # The small offset avoids division by zero.
        return (x + 0.001) / (y + 0.001)

    CrossMethod = {'+': add,
                   '-': subtract,
                   '*': times,
                   '/': divide}

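To illustrate what a dictionary of cross methods can be used for, here is a sketch that applies every method to every pair of columns. It is an assumption about how such methods might be consumed, not the package's actual implementation; ``make_cross_features`` and the ``(a+b)`` naming scheme are hypothetical:

.. code-block:: python

    from itertools import combinations

    import pandas as pd

    def add(x, y):
        return x + y

    def times(x, y):
        return x * y

    cross_methods = {'+': add, '*': times}

    def make_cross_features(df, columns, methods):
        """Build one new column per (column pair, method) combination."""
        out = {}
        for a, b in combinations(columns, 2):
            for symbol, func in methods.items():
                out['({}{}{})'.format(a, symbol, b)] = func(df[a], df[b])
        return pd.DataFrame(out, index=df.index)

    df = pd.DataFrame({'p': [1, 2], 'q': [3, 4], 'r': [5, 6]})
    crossed = make_cross_features(df, ['p', 'q', 'r'], cross_methods)

Note that ``combinations`` yields each pair in one order only; for non-commutative methods such as subtraction or division, both orders may be worth generating.
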
- Initialize the searcher with a customized procedure (sequence + random + cross)

.. code:: python

    # Choose which search procedures to run.
    sf = FS.Select(Sequence = False, Random = True, Cross = False)

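As a rough picture of what a sequential search can look like, the sketch below greedily adds whichever feature lowers a score the most. It is a generic forward-selection example under stated assumptions, not a description of what ``FS.Select`` does internally; ``forward_select`` and ``toy_score`` are hypothetical names:

.. code-block:: python

    def forward_select(candidates, initial, score_fn):
        """Greedy forward search: keep adding the feature that lowers the score most."""
        selected = list(initial)
        best = score_fn(selected)
        improved = True
        while improved:
            improved = False
            for feat in [c for c in candidates if c not in selected]:
                trial = score_fn(selected + [feat])
                if trial < best:
                    best, best_feat, improved = trial, feat, True
            if improved:
                selected.append(best_feat)
        return selected, best

    # Toy score (lower is better): 'a' and 'c' help, 'b' hurts.
    weights = {'a': -0.3, 'b': 0.2, 'c': -0.1}

    def toy_score(features):
        return 1.0 + sum(weights[f] for f in features)

    selected, best = forward_select(['a', 'b', 'c'], [], toy_score)

In practice the score function would be a validation routine such as the one defined earlier, evaluated on the candidate feature subset.
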
- Import the dataset and name the label column

.. code:: python

    sf.ImportDF(prepareData(), label = 'is_trade')

- Import the cross methods (required when *Cross = True*)

.. code:: python

    sf.ImportCrossMethod(CrossMethod)

- Define non-trainable features

.. code:: python

    sf.NonTrainableFeatures = ['used', 'instance_id', 'item_property_list', 'context_id', 'context_timestamp', 'predict_category_property', 'is_trade']

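One way to think about the non-trainable list is as a filter on the DataFrame columns: everything not listed remains a candidate feature. A minimal sketch on a made-up frame (this illustrates the idea only and is not the package's internal code):

.. code-block:: python

    import pandas as pd

    # Toy frame with an identifier, two features and the label.
    df = pd.DataFrame({'instance_id': [1, 2], 'item_price_level': [3, 4],
                       'item_sales_level': [5, 6], 'is_trade': [0, 1]})
    non_trainable = ['instance_id', 'is_trade']

    # Columns that a search could still pick from.
    candidates = [c for c in df.columns if c not in non_trainable]

Identifiers, raw timestamps, and the label itself are typical entries, since training on them would either leak the target or add noise.
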
- Define the initial feature combination

.. code:: python

    sf.InitialFeatures(['item_category_list', 'item_price_level', 'item_sales_level', 'item_collected_level', 'item_pv_level'])

- Define the algorithm

.. code:: python

    sf.clf = lgbm.LGBMClassifier(random_state=1, num_leaves = 6, n_estimators=5000, max_depth=3, learning_rate = 0.05, n_jobs=8)

- Define the log file name

.. code:: python

    sf.logfile = 'record.log'

- Run with the self-defined validation method

.. code:: python

    sf.run(validation)

- This code takes a while to run. You can stop it at any time and restart by replacing the feature list passed to sf.InitialFeatures() with the best combination found so far.

This feature selection method achieved
------------------------------------------------------------------------------
- **1st** in Rong360
- https://github.com/duxuhao/rong360-season2
- **12th** in IJCAI-2018 1st round

Algorithm details
----------------------------------
.. image:: https://github.com/duxuhao/Feature-Selection/blob/master/Procedure.png