
Autopredict is a package to automate machine learning model selection and feature selection tasks

Project description

autopredict


Autopredict is a simple yet powerful library that data scientists can use to create multiple prediction (regression, classification) models. Pass your data to autopredict and sit back as it does all the work for you. It is very effective at creating initial baseline models and also ships with pre-tuned parameters for multiple models to generate highly accurate predictions.

  • Automate Model Selection
  • Hyperparameter tuning
  • Feature selection/ranking
  • Feature Compression
  • Stacked Ensemble Models

This software has been designed with much joy by Sanchit Latawa and is protected by the Apache License v2.0.

New Features!

  • Added new classification models
  • Allow grid-tuning parameters to be passed in as an argument


Sample Usage

>>> from autopredict.classification import autoClassify
>>> model = autoClassify(encoder='label', scaler='minmax', useGridtuning=False)
>>> model.train(X, y)
>>> print(model.getModelScores())

Output

modelName                  score     roc_auc_score  f1_score
LogisticRegression         0.927464   0.639570      0.000000
DecisionTreeClassifier     0.937422   0.788967      0.285612
GaussianNB                 0.935352   0.760670      0.203207
RandomForestClassifier     0.937297   0.791552      0.248444
GradientBoostingClassifier 0.937472   0.792435      0.257557

Sample Run for Iris Dataset

The sample below shows the code flow for using autopredict; you can get this sample file from */autopredict/tests/sample_iris_classification.py

# Loading Libraries
import pandas as pd
from autopredict.classification import autoClassify
from autopredict.features import rankFeatures,reduce_memory

# Setting display options
pd.set_option('display.max_columns',50000)
pd.set_option('display.width', 500000)

# Load the data into a dataframe
df = pd.read_csv('./tests/iris.csv')

# Set target and feature values
X = df.drop('Species', axis=1)
y = df['Species']

# step 1: feature importance/evaluation
# rankFeatures is a function in autopredict which you can
# use for feature evaluation; it gives you a ranking of your features
# based on importance, with the most important feature starting from 1
print(rankFeatures(X,y))
## Sample output - showing features along with their relative rank ########
#     Column-name  Importance-Rank
# 0   Petal.Width              1.0
# 1  Petal.Length              1.5
# 2  Sepal.Length              3.0
# 3   Sepal.Width              3.0

## Once you have the importance ranking of the features,
## you can drop some features or add new ones before the
## prediction modeling; see the illustration below
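# (illustration) e.g. drop the lowest-ranked feature before modeling,
# using the ranking printed above (commented out so the full feature
# set is kept for the rest of this sample):
# X = X.drop('Sepal.Width', axis=1)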

## step 2 Train the model/evaluate

################ sample usage 2.1 ########################
# the below fit statement trains the model
model = autoClassify(scaler='standard', useGridtuning=False, gridDict=None).fit(X, y)
## get model scores 
print(model.getModelScores())
################ sample usage 2.2 ########################
## the below fit statement asks autopredict to perform
## hyperparameter tuning using grid search
model = autoClassify(scaler='standard', useGridtuning=True, gridDict=None).fit(X, y)
####### sample usage 2.3 ##################
##### the below fit uses grid tuning to fit models but overrides
### autopredict's base grid-search parameters and models

# Define the grid that you want to run 
grid = {'LogisticRegression':{'penalty':['l2']
                               ,'C':[0.001,0.1,1,10]}
        ,'DecisionTreeClassifier': {'max_depth':[4,5,6,7,8,9,10]}
        ,'RandomForestClassifier':{'n_estimators':[100,500,1000],
                                   'max_depth':[4,5]}
        ,'GradientBoostingClassifier':{'learning_rate':[0.01,0.1,0.2,0.3],
                                      'n_estimators':[1000]}
                           }
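# (note, inferred from the sample above) each key should match a model
# name exactly as it appears in getModelScores() output, and each value
# follows scikit-learn's GridSearchCV param_grid convention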

# train the model, passing useGridtuning=True, which tells
# the function to use grid tuning, and pass in the grid to be used;
# if you pass gridDict as None, the default options set in autopredict
# are used
model = autoClassify(scaler='standard', useGridtuning=True, gridDict=grid).fit(X, y)

# Step 3: get the scoreboard
print(model.getModelScores())
print(model._predict_df)

# step 4: if you want to get a model object back to predict output,
# the call below gets the best model object based on accuracy score;
# you can override the default scoring mechanism by using the
# score parameter of the getBestModel function
model_obj = model.getBestModel()
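# (illustration) override the default accuracy-based selection; the exact
# accepted values for score are defined by autopredict, and
# 'roc_auc_score' here simply mirrors a column name from getModelScores():
# model_obj = model.getBestModel(score='roc_auc_score')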

# Step 4.1: in case you want to select any other model,
# the model name is taken from the output
# you get when you print model.getModelScores()
model_obj = model.getModelObject('DecisionTreeClassifier')

# Step 5: to predict using the model object, use the statement below
y_predict = model_obj.predict(validTestData)
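# (illustration) validTestData stands in for any dataframe with the same
# feature columns as X, e.g. a hypothetical hold-out split:
# from sklearn.model_selection import train_test_split
# X_train, validTestData, y_train, y_valid = train_test_split(
#     X, y, test_size=0.2, random_state=42)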

# Other misc features

# 1 If you want to compress the memory usage of your dataframe, use the
# reduce_memory utility; this will compress your feature set and display
# the compression percentage
df = reduce_memory(df)
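# (illustration) you can check the effect yourself by comparing the
# frame's footprint before and after the call:
# print(df.memory_usage(deep=True).sum())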

# 2 Using stacked ensemble models is only supported for
# binary classification for now; below is sample usage
# where LightGBM and CatBoost are used as base models
# and their output is consumed by a LogisticRegression model
# to give the final output

from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from autopredict.stacking import stackClassify
from catboost import CatBoostClassifier

# LightGBM params
lgb_params = {}
lgb_params['learning_rate'] = 0.02
lgb_params['n_estimators'] = 650
lgb_params['max_bin'] = 10
lgb_params['subsample'] = 0.8
lgb_params['subsample_freq'] = 10
lgb_params['colsample_bytree'] = 0.8
lgb_params['min_child_samples'] = 500
lgb_params['seed'] = 99
lgmodel = LGBMClassifier(**lgb_params)

cat_params = {}
cat_params['iterations'] = 900
cat_params['depth'] = 8
cat_params['rsm'] = 0.95
cat_params['learning_rate'] = 0.03
cat_params['l2_leaf_reg'] = 3.5
cat_params['border_count'] = 8
catmodel = CatBoostClassifier(**cat_params)

logmodel = LogisticRegression()
tmp = stackClassify(splits=2, stackerModel=logmodel,
                    models=[lgmodel, catmodel], score='roc_auc', seed=100)

y_pred_prob = tmp.fit_predict(X=train, y=target_train, test=test)
# y_pred_prob holds the predicted probability of the positive class
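
Since stackClassify returns probabilities for the positive class, hard labels are one threshold away. A minimal sketch, assuming the usual 0.5 cutoff (an illustrative default, not an autopredict setting):

import numpy as np

# Convert positive-class probabilities into hard 0/1 labels
# (the 0.5 threshold is an assumption; tune it for your problem)
y_pred = (np.asarray(y_pred_prob) > 0.5).astype(int)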

Development

Want to contribute? Please reach out to me at slatawa@yahoo.in and we can go over the queue items planned for the next release.

Todos

  • Write MORE Tests
  • Build catboost, LGB, XGB as a separate feature

License

Apache v2.0

Free Software, Hell Yeah!



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

autopredict-1.0.9.tar.gz (16.4 kB)


File details

Details for the file autopredict-1.0.9.tar.gz.

File metadata

  • Download URL: autopredict-1.0.9.tar.gz
  • Upload date:
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.0

File hashes

Hashes for autopredict-1.0.9.tar.gz
Algorithm    Hash digest
SHA256       8d7d8d1270971df3b6d5db6890922aeff0cd3e4988648e3735d535ee0ce4c5c1
MD5          bf209f21c9ed4e12803e12f02b10dd6d
BLAKE2b-256  6fd7cb26ea7a70df3062bd3a415f466de395c3b5420beabb79f41e69fab6c3bf


