Feature Selection
=====================================
This code performs general feature selection based on a chosen machine learning algorithm and evaluation method.
How to run (see demo.py)
------------------------------------------------
The demo is based on the IJCAI-2018 data mining competition.
- Import the feature selection module (FeatureSelection.py) along with the other required libraries

.. code-block:: python

    import MLFeaturesSelection as FS
    from sklearn.metrics import log_loss
    import lightgbm as lgbm
    import pandas as pd
    import numpy as np
- Prepare the dataset

.. code-block:: python

    def prepareData():
        df = pd.read_csv('IJCAI-2018/data/train/trainb.csv')
        df = df[~pd.isnull(df.is_trade)]
        item_category_list_unique = list(np.unique(df.item_category_list))
        df.item_category_list.replace(item_category_list_unique, list(np.arange(len(item_category_list_unique))), inplace=True)
        return df

- Define your loss function

.. code-block:: python

    def modelscore(y_test, y_pred):
        return log_loss(y_test, y_pred)
- Define the validation method

.. code-block:: python

    def validation(X, y, clf, lossfunction):
        totaltest = 0
        validation_days = [24]
        for D in validation_days:
            T = (X.day != D)
            X_train, X_test = X[T], X[~T]
            y_train, y_test = y[T], y[~T]
            # the training call must match the algorithm you selected
            clf.fit(X_train, y_train,
                    eval_set=[(X_train, y_train), (X_test, y_test)],
                    eval_metric='logloss', verbose=False, early_stopping_rounds=200)
            totaltest += lossfunction(y_test, clf.predict_proba(X_test)[:, 1])
        totaltest /= len(validation_days)
        return totaltest

- Define the cross method (required when *Cross = True*)

.. code-block:: python

    def add(x, y):
        return x + y

    def subtract(x, y):
        return x - y

    def times(x, y):
        return x * y

    def divide(x, y):
        # small offset avoids division by zero
        return (x + 0.001) / (y + 0.001)

    # operators the searcher can apply to feature pairs when Cross = True
    CrossMethod = {'+': add,
                   '-': subtract,
                   '*': times,
                   '/': divide}

- Initialize the searcher with a customized procedure (sequence + random + cross)

.. code:: python

    sf = FS.Select(Sequence = False, Random = True, Cross = False)  # choose which search strategies to use
- Import the DataFrame and specify the label column

.. code:: python

    sf.ImportDF(prepareData(), label = 'is_trade')
- Import cross method (required when *Cross = True*)

.. code:: python

    sf.ImportCrossMethod(CrossMethod)
- Define non-trainable features

.. code:: python

    sf.NonTrainableFeatures = ['used', 'instance_id', 'item_property_list', 'context_id', 'context_timestamp', 'predict_category_property', 'is_trade']
- Define the initial feature combination

.. code:: python

    sf.InitialFeatures(['item_category_list', 'item_price_level', 'item_sales_level', 'item_collected_level', 'item_pv_level'])
- Define the algorithm

.. code:: python

    sf.clf = lgbm.LGBMClassifier(random_state=1, num_leaves=6, n_estimators=5000, max_depth=3, learning_rate=0.05, n_jobs=8)
- Define the log file name

.. code:: python

    sf.logfile = 'record.log'
- Run with the self-defined validation method

.. code:: python

    sf.run(validation)
- The search can take a while to run; you can stop it at any time and restart it by seeding ``sf.InitialFeatures()`` with the best feature combination found so far (see the sketch after this list).
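
A minimal sketch of such a restart, assuming the best combination so far has been copied out of the log file (the feature list below is just the one used earlier in the demo):

.. code:: python

    # Hypothetical restart: seed the searcher with the best combination recorded so far
    # (copy the actual list from your own record.log), then resume the search.
    best_so_far = ['item_category_list', 'item_price_level', 'item_sales_level',
                   'item_collected_level', 'item_pv_level']
    sf.InitialFeatures(best_so_far)
    sf.run(validation)
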
This feature selection method achieved
------------------------------------------------------------------------------
- **1st** in Rong360
- https://github.com/duxuhao/rong360-season2
- **12th** in the IJCAI-2018 1st round
Algorithm details
----------------------------------

.. image:: https://github.com/duxuhao/Feature-Selection/blob/master/Procedure.png
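
The diagram above summarizes the search procedure. As a rough, hedged illustration only (not the package's internal code), the sequence stage behaves like a greedy loop that tries adding or removing one feature at a time and keeps any change that lowers the validation loss; when ``Cross = True``, the candidate pool additionally includes pairwise combinations built with the ``CrossMethod`` operators. The helper below is hypothetical:

.. code:: python

    # Illustrative pseudocode of a greedy sequence search. 'evaluate' is assumed to be a
    # callable that trains the model on a feature list and returns the validation loss
    # (lower is better), e.g. a thin wrapper around the validation() function above.
    def greedy_sequence_search(candidates, selected, evaluate):
        best_score = evaluate(selected)
        improved = True
        while improved:
            improved = False
            # try adding each unused candidate feature
            for f in [c for c in candidates if c not in selected]:
                score = evaluate(selected + [f])
                if score < best_score:
                    selected, best_score, improved = selected + [f], score, True
            # try removing each currently selected feature
            for f in list(selected):
                trial = [s for s in selected if s != f]
                if not trial:
                    continue
                score = evaluate(trial)
                if score < best_score:
                    selected, best_score, improved = trial, score, True
        return selected, best_score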