Auto machine learning with scikit-learn and TensorFlow framework.
auto-ml-cl
How can we create machine learning and deep learning models with just a few lines of code, by providing only the data and letting the framework find the best trained models for it? We don't need to care about Data Loading, Feature Engineering, Model Training, Model Selection, Model Evaluation, and Model Sink, or even the RESTful API that serves the best trained model. This is where auto_ml_cl comes in.
This repository is built on scikit-learn and TensorFlow to create both machine learning models and neural network models with a few lines of code, given just a training file. A separate test file is preferable, since it allows an unbiased evaluation of the trained models, but a single file also works. Currently this repository supports only Classification problems.
Key features:
- Both machine learning and neural network models are supported.
- Automatic data pre-processing for missing, unstable, and categorical data of various types.
- Ensemble logic to combine models into more powerful models.
- Neural network model search with kerastuner to find the best hyper-parameters for a specific type of algorithm.
- Cloud files are supported: Cloud Storage for GCP, as well as local files.
- Logging of processing information into one dated file for future reference.
- Processing monitoring of the training status of each algorithm.
- RESTful API to get predictions from the best trained model.
Sample code for using the auto_ml package with the Titanic dataset from the Kaggle competition. This dataset contains different kinds of data types, as well as missing values at different thresholds.
from auto_ml.automl import ClassificationAutoML, FileLoad
file_name = 'train.csv'
file_path = r"C:\auto_ml\test" # Absolute path
file_load = FileLoad(file_name, file_path=file_path, label_name='Survived')
auto_cl = ClassificationAutoML()
auto_cl.fit(file_load=file_load, val_split=0.2)
# Get prediction based on best trained models
test_file_name = 'test.csv'
file_load_test = FileLoad(test_file_name, file_path=file_path)
pred = auto_cl.predict(file_load=file_load_test)
After training, we get an evaluation score for every trained model and can pick the best one by validation score if we want to use it in production. Importantly, the models are stored on the local server, so we can use them at any time through RESTful API calls.
To use GCP Cloud Storage as the data source for training and test data, all that is needed is a service account file with the proper permissions. Pass the parameters service_account_file_name and service_account_file_path (the local path of that file) to the FileLoad object, and training then proceeds the same way as with local files.
file_name = "train.csv"
file_path = "gs://bucket_name"
service_account_name = "service_account.json"
service_account_file_path = r"C:\auto_ml\test"
file_load = FileLoad(file_name, file_path, label_name='Survived',
                     service_account_file_name=service_account_name,
                     service_account_file_path=service_account_file_path)
auto_cl = ClassificationAutoML()
auto_cl.fit(file_load=file_load)
If the data is already in memory, we can also train, test, and predict on in-memory objects with the auto_ml object, just as with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
x, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=.2)
auto_cl = ClassificationAutoML()
auto_cl.fit(xtrain, ytrain)
score = auto_cl.score(xtest, ytest)
pred = auto_cl.predict(xtest)
prob = auto_cl.predict_proba(xtest)
Currently supported algorithms:
- Logistic Regression
- Support vector machine
- Gradient boosting tree
- Random forest
- Decision Tree
- AdaBoost Tree
- K-neighbors
- XGBoost
- LightGBM
- Deep neural network
Ensemble logic is also supported, combining different models into a more powerful model by adding model diversity:
- Voting
- Stacking
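The two ensemble strategies can be illustrated with plain scikit-learn. This is a minimal sketch and not the package's internal implementation: the choice of base models, the soft-voting setting, and the logistic regression meta-learner are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y)


def base_models():
    """Return fresh (name, estimator) pairs so each ensemble fits its own copies."""
    return [
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ]


# Voting: average the predicted class probabilities of the base models.
voting = VotingClassifier(estimators=base_models(), voting="soft")

# Stacking: a meta-learner is trained on cross-validated base model predictions.
stacking = StackingClassifier(
    estimators=base_models(),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

scores = {}
for name, model in [("voting", voting), ("stacking", stacking)]:
    model.fit(xtrain, ytrain)
    scores[name] = model.score(xtest, ytest)

print(scores)
```

Voting is cheaper to train, while stacking lets the meta-learner weight the base models according to how reliable their predictions are.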
For raw data files, a set of common pre-processing steps is applied to build the dataset for the algorithms. Currently supported pre-processing steps:
- Imputation based on statistical analysis for continuous and categorical columns; KNN imputation is also supported for categorical columns.
- Standardization of data
- Normalization
- OneHot Encoding for categorical columns
- MinMax scaling for continuous columns to avoid data magnitude bias
- PCA for dimension reduction with a threshold
- Feature selection with variance, LinearRegression, or ExtraTree
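Several of these steps can be chained with scikit-learn's own pipeline tools. The sketch below only illustrates the idea and is not the package's actual pipeline: the toy data, the column names, the step order, and the 0.95 explained-variance threshold are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in both column types.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, np.nan, 51.86, 21.07],
    "sex": pd.Series(["male", "female", "female", np.nan, "male", "male"],
                     dtype=object),
})
numeric_cols = ["age", "fare"]
categorical_cols = ["sex"]

# Continuous columns: fill with the column mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = Pipeline([
    ("columns", ColumnTransformer(
        [("num", numeric_steps, numeric_cols),
         ("cat", categorical_steps, categorical_cols)],
        sparse_threshold=0.0,  # always return a dense array
    )),
    # Drop constant features, then keep components explaining 95% of variance.
    ("variance", VarianceThreshold(threshold=0.0)),
    ("pca", PCA(n_components=0.95)),
])

features = preprocess.fit_transform(df)
print(features.shape)
```

Because every step lives in a single Pipeline object, the same fitted transformations can later be applied to test data with preprocess.transform.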
Insight into the logic of the auto machine learning training steps:
- Load data from file or memory for both training and testing with the class FileLoad; GCP's GCS files are supported as source files.
- Build a processing pipeline object based on the data.
  (1). Imputation for both categorical and numerical data, with different logic for each. If the share of missing values in a column is over a threshold, that column is deleted. The KNNImputer algorithm can be used to impute data, or SimpleImputer to fill missing data.
  (2). OneHot Encoding for categorical columns; the created columns are added to the original data.
  (3). Standardize data to avoid data range effects, which also benefits some algorithms such as SVM.
  (4). MinMax scaling to keep data within a 0-1 range.
  (5). FeatureSelection to keep features based on a default threshold, or use ExtraTree or LinearRegression to select features.
  (6). PCA to reduce dimension if feature variance is over a threshold, keeping only the features that satisfy it.
- Build a Singleton backend object for file and data related functions.
- Build a training pipeline that instantiates each algorithm with a factory class, based on the pre-defined algorithms in use.
- Build a SearchModel class for each algorithm to find the best parameters with RandomSearch or GridSearch.
- fit and transform the pre-processing pipeline, and save the trained pipeline to disk for future use.
- Start training with the training pipeline on the processed data, running a parameter search to find the model with the best parameters, combined with neural network search to find the best neural models. If validation is needed, some data is held out for validation, which reduces the training data size; alternatively, the trained auto_ml object can be used for validation afterwards.
- Use Ensemble logic (voting or stacking) to combine trained models into a new, more diverse model based on the best trained models.
- Evaluate each trained model on validation data and return a dictionary with the training model name, training score, and validation score.
- Support exporting trained models into a new folder of our choice.
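The factory, parameter search, and evaluation steps can be sketched with plain scikit-learn. This is a hedged illustration, not the package's code: the ALGORITHMS registry, the create_search function, the parameter spaces, and the result keys are hypothetical names invented for this example, and RandomizedSearchCV stands in for the package's own SearchModel class.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical registry: algorithm name -> (constructor, parameter search space).
ALGORITHMS = {
    "LogisticRegression": (
        lambda: LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1.0, 10.0]},
    ),
    "DecisionTreeClassifier": (
        lambda: DecisionTreeClassifier(random_state=0),
        {"max_depth": [2, 3, 5, None]},
    ),
    "RandomForestClassifier": (
        lambda: RandomForestClassifier(random_state=0),
        {"n_estimators": [20, 50], "max_depth": [3, 5, None]},
    ),
}


def create_search(name, n_iter=4, random_state=0):
    """Factory: build a randomized parameter search for the named algorithm."""
    make_estimator, param_space = ALGORITHMS[name]
    return RandomizedSearchCV(make_estimator(), param_space, n_iter=n_iter,
                              cv=3, random_state=random_state)


x, y = load_iris(return_X_y=True)
xtrain, xval, ytrain, yval = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y)

# Search each algorithm, then record training and validation scores.
results = []
for name in ALGORITHMS:
    search = create_search(name).fit(xtrain, ytrain)
    best = search.best_estimator_
    results.append({
        "model": name,
        "train_score": best.score(xtrain, ytrain),
        "val_score": best.score(xval, yval),
    })

best_result = max(results, key=lambda r: r["val_score"])
print(best_result["model"])
```

Keeping the algorithms in a registry means a new model can be added with a single entry, without touching the training loop.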