A light package for automatic model tuning and stacking
auto-modelling
Auto-modelling is a convenient library for training and tuning machine learning models automatically.
Its main features include the following:
- Preprocess columns of all data types (numeric, categorical, text).
- Train machine learning models and tune their parameters automatically.
- Return the top n best models with optimized parameters.
- Apply stacking to combine the n best models returned by the repo, or your own fitted models, to try to improve on the individual models' results.
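The "tune, then return the top n" feature can be illustrated with plain scikit-learn. This is a minimal sketch, not the package's own implementation: the function name `top_n_models`, the search spaces, and the use of `RandomizedSearchCV` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier


def top_n_models(x, y, n_best=3, n_iter=4, cv=3, random_state=0):
    """Tune each candidate and return the n_best (cv_score, fitted_model) pairs."""
    # Hypothetical search spaces, chosen only to keep the example small.
    candidates = [
        (ExtraTreesClassifier(random_state=random_state),
         {"n_estimators": [20, 50], "max_depth": [3, 5, None]}),
        (RandomForestClassifier(random_state=random_state),
         {"n_estimators": [20, 50], "max_depth": [3, 5, None]}),
        (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}),
        (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1.0, 10.0]}),
    ]
    results = []
    for estimator, space in candidates:
        search = RandomizedSearchCV(
            estimator, space, n_iter=n_iter, cv=cv, random_state=random_state
        )
        search.fit(x, y)
        # best_estimator_ is refit on all of (x, y) with the best parameters.
        results.append((search.best_score_, search.best_estimator_))
    # Rank by mean cross-validated score, best first.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results[:n_best]


# Synthetic data so the sketch runs on its own.
x, y = make_classification(n_samples=200, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

best = top_n_models(x_train, y_train, n_best=3)
for score, model in best:
    print(type(model).__name__, round(score, 3))
```

Each returned model is already fitted, so it can be used for prediction directly or passed on to a stacking step.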
The machine learning models include the following:
- Classification:
- ExtraTreesClassifier
- RandomForestClassifier
- KNeighborsClassifier
- LogisticRegression
- XGBClassifier
- Regression:
- ExtraTreesRegressor
- GradientBoostingRegressor
- AdaBoostRegressor
- DecisionTreeRegressor
- RandomForestRegressor
- XGBRegressor
- Stack:
- for classification: LogisticRegression
- for regression: LinearRegression
reference: https://github.com/EpistasisLab/tpot/blob
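To make the stacking idea concrete, here is a minimal sketch with scikit-learn's `StackingClassifier`, using three of the classifiers listed above as level-0 models and `LogisticRegression` as the level-1 model. It is not the package's `Stack` class; the data, hyperparameters, and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data so the sketch runs on its own.
x, y = make_classification(n_samples=300, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Level-0 models: their cross-validated predictions become the
# input features of the level-1 model.
level_0 = [
    ("extra_trees", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Level-1 model: LogisticRegression, as listed for classification above.
stacked = StackingClassifier(
    estimators=level_0, final_estimator=LogisticRegression(), cv=5
)
stacked.fit(x_train, y_train)

accuracy = stacked.score(x_test, y_test)
print("stacked accuracy:", round(accuracy, 3))
```

For regression, `StackingRegressor` with `LinearRegression` as the final estimator follows the same pattern.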
Installation
pip install auto-modelling
Usage Example
from auto_modelling.classification import GoClassify
from auto_modelling.regression import GoRegress
from auto_modelling.preprocess import DataManager
from auto_modelling.stack import Stack
# preprocessing data
dm = DataManager(directory = 'preprocess_tools')
train, test = dm.drop_sparse_columns(x_train, x_test)
# pass on the output of the previous step so the dropped columns stay dropped
train, test = dm.process_data(train, test)
# the encoders are stored in the directory passed to DataManager ('preprocess_tools' here).
# use the same processing tools to process new data
predict_data = dm.process_predict_data(predict_x)
# predict_x should have the same format as x_train/x_test
# classification
clf = GoClassify(n_best=1)
best = clf.train(x_train, y_train)
y_pred = best.predict(x_test)
# regression
reg = GoRegress(n_best=1)
best = reg.train(x_train, y_train)
y_pred = best.predict(x_test)
# get top 3 best models
clf = GoClassify(n_best=3)
bests = clf.train(x_train, y_train)
y_preds = [m.predict(x_test) for m in bests]
# Stack top 3 best models
stack = Stack(n_models = 3)
level_0_models, level_1_model = stack.train(x_train, y_train, x_test, y_test)
Two examples, test.py and sample.py, are in the root directory of this package. Run python test.py or python sample.py.
Development Guide
- Clone the repo
- Create the virtual environment
mkvirtualenv auto
workon auto
pip install -r requirements.txt
If you have issues installing xgboost, see these references:
https://xgboost.readthedocs.io/en/latest/build.html#
https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_on_Mac_OSX?lang=en
Note
- TODO: feature selection, evaluation metrics
Thoughts
- Ideally, any dataframe thrown into this repo should be processed automatically.
- pre-processing
  - drop columns that have too many nulls (Done)
  - fill NA for both numeric and non-numeric values (Done)
  - encode non-numeric values (Done)
  - scale values if needed
  - balance the dataset if needed
- model-training
  - mode = classification, regression, auto (Done)
  - split data-set
  - tuning parameters and model selection (Done)
  - feature selection
  - return a model with parameters, columns and a script to process x_test (Done)
  - stacking with customized fitted models (Done)
- model-evaluation
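The first three pre-processing steps in the list above (drop sparse columns, fill NA, encode non-numeric values) could look roughly like this in pandas. This is a sketch under stated assumptions, not the package's `DataManager`: the function name `preprocess`, the `max_null_ratio` threshold, the median and "missing" fill values, and the -1 code for unseen categories are all hypothetical choices.

```python
import pandas as pd


def preprocess(train, test, max_null_ratio=0.5):
    """Drop sparse columns, fill NA, and integer-encode non-numeric columns."""
    # Keep columns whose share of nulls in the training data is small enough.
    keep = [c for c in train.columns if train[c].isna().mean() <= max_null_ratio]
    train, test = train[keep].copy(), test[keep].copy()

    for col in keep:
        if pd.api.types.is_numeric_dtype(train[col]):
            # Numeric: fill with the training median (assumed strategy).
            fill = train[col].median()
            train[col] = train[col].fillna(fill)
            test[col] = test[col].fillna(fill)
        else:
            # Non-numeric: fill with a placeholder, then map to integer codes
            # learned from the training data only.
            train[col] = train[col].fillna("missing")
            test[col] = test[col].fillna("missing")
            codes = {v: i for i, v in enumerate(sorted(train[col].unique()))}
            train[col] = train[col].map(codes)
            # Categories never seen in training get the code -1.
            test[col] = test[col].map(codes).fillna(-1).astype(int)
    return train, test


train_df = pd.DataFrame({
    "age": [20.0, None, 40.0, 30.0],
    "city": ["a", "b", None, "a"],
    "sparse": [None, None, None, 1.0],
})
test_df = pd.DataFrame({
    "age": [None, 50.0],
    "city": ["b", "z"],
    "sparse": [None, None],
})

train_p, test_p = preprocess(train_df, test_df)
print(train_p)
print(test_p)
```

Fitting the fill values and category codes on the training data only, then reusing them on the test data, mirrors the idea of storing encoders and applying the same tools to new data.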