Automated machine learning for classification and regression in 3 lines of code.
Project description
automl-engine
Get the best models for any type of data with only 3 lines of code using automl-engine.
How can we create machine learning and deep learning models with just a few lines of code, simply by providing data, and let the framework return the best trained models for that data? With automl-engine, we don't need to care about data loading, feature engineering, model training, model selection, model evaluation or model sink, or even about serving the best trained model over a RESTful API.
This is where automl-engine shows its power!
This repository builds on scikit-learn and TensorFlow to create both machine learning models and neural network models with 3 lines of code, given either a file or scikit-learn-style training data. If a test file is available, it is better to use it to evaluate the trained models without bias.
Happy to announce:
Both classification and regression problems are supported!
Installation
It's highly recommended to create a virtual environment to install automl-engine, as this has the least effect on the root user's path.
Linux
- Install virtual env: `sudo apt-get install python3-venv`
- Create virtual env folder: `python3 -m venv your_env_name`
- Activate your virtual env: `source your_env_name/bin/activate`
- Install the automl-engine package: `pip install automl-engine`
Windows
- Install virtual env: `python -m pip install virtualenv`
- Create virtual env folder: `python -m venv your_env_name`
- Activate your virtual env: `.\your_env_name\Scripts\activate`
- Install the automl-engine package: `pip install automl-engine`
Getting started
Classification
Sample code for the automl-engine package, using the Titanic dataset from a Kaggle competition. This dataset contains several kinds of data types, and its columns have different proportions of missing values.
from automl.estimator import ClassificationAutoML, FileLoad
file_load = FileLoad(file_name="train.csv", file_path = r"C:\auto_ml\test", label_name='Survived')
auto_est = ClassificationAutoML()
auto_est.fit(file_load=file_load, val_split=0.2)
That's all you need to get the best models for your dataset!
If you need predictions from the best trained model, just call the predict function on a test data file, as in the code below.
# Get prediction based on best trained models
file_load_test = FileLoad(file_name="test.csv", file_path = r"C:\auto_ml\test")
pred = auto_est.predict(file_load=file_load_test)
We can then get the evaluation scores of every trained model and pick the best one by validation score if we want to use a trained model in production. One important point is that these models are stored on the local server, so we can use them at any time through RESTful API calls.
Cloud file support
If we want to use GCP Cloud Storage as the data source for train and test data, all we need is a service account file with the proper permissions. Pass the service account file name (`service_account_file_name`) and its local path (`service_account_file_path`) to the FileLoad object, and training will start automatically.
file_name="train.csv"
file_path = "gs://bucket_name"
service_account_name = "service_account.json"
service_account_file_path = r"C:\auto_ml\test"
file_load = FileLoad(file_name, file_path, label_name='Survived',
service_account_file_name=service_account_name, service_account_file_path=service_account_file_path)
auto_est = ClassificationAutoML()
auto_est.fit(file_load=file_load)
sklearn style
If we have data in memory, we can also use in-memory objects to train, test and predict with the AutoML estimator object, just like in scikit-learn.
from automl.estimator import ClassificationAutoML
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
x, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=.2)
auto_est = ClassificationAutoML()
auto_est.fit(xtrain, ytrain)
score = auto_est.score(xtest, ytest)
pred = auto_est.predict(xtest)
prob = auto_est.predict_proba(xtest)
Regression
The full functionality is the same for both classification and regression, so the only difference is to change the imported class from ClassificationAutoML to RegressionAutoML, as in this snippet:
from automl.estimator import FileLoad, RegressionAutoML
file_load = FileLoad(file_name="train.csv", file_path = r"C:\auto_ml\test", label_name="label")
# Just change this class
auto_est = RegressionAutoML()
auto_est.fit(file_load=file_load, val_split=0.2)
Key features
- Machine learning and neural network models are supported.
- Automatic data pre-processing for missing, unstable, categorical and other data types.
- Ensemble logic to combine models into more powerful models.
- Neural network model search with keras-tuner to find the best hyper-parameters for a specific type of algorithm.
- Cloud files are supported, such as Cloud Storage for GCP, as well as local files.
- Logging of processing information into one dated file for future reference.
- Processing monitoring of each algorithm's training status.
- RESTful API to get predictions from the best trained model.
Algorithms supported
Currently supported algorithms:
- Logistic Regression
- Support vector machine
- Gradient boosting tree
- Random forest
- Decision Tree
- Adaboost Tree
- K-neighbors
- XGBoost
- LightGBM
- Deep neural network
Ensemble logic is also supported, combining different models into a more powerful model by adding model diversity:
- Voting
- Stacking
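The two ensemble strategies listed above can be sketched with scikit-learn's own `VotingClassifier` and `StackingClassifier`. This is a minimal illustration on the iris dataset, not automl-engine's internal code; the choice of base models, the `random_state` values and all variable names are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)

# Base models chosen only for illustration; diversity between them is what
# gives an ensemble a chance to beat each individual model.
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier()),
]

# Voting: average the predicted class probabilities of the base models.
voting = VotingClassifier(estimators=base_models, voting="soft")
voting.fit(xtrain, ytrain)

# Stacking: train a final estimator on the cross-validated predictions
# of the base models.
stacking = StackingClassifier(
    estimators=base_models, final_estimator=LogisticRegression(max_iter=1000)
)
stacking.fit(xtrain, ytrain)

voting_score = voting.score(xtest, ytest)
stacking_score = stacking.score(xtest, ytest)
print("voting accuracy:", voting_score)
print("stacking accuracy:", stacking_score)
```

Both ensemble classes clone the base estimators before fitting, so the same `base_models` list can safely be passed to each of them.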
For raw data files, some common pre-processing steps are applied to create a dataset for the algorithms. Currently supported pre-processing algorithms:
- Imputation with statistical analysis for continuous and categorical columns, with KNN imputation also supported
- Standardize
- Normalize
- OneHot Encoding
- MinMax
- PCA
- Feature selection with variance or LinearRegression or ExtraTree
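A few of the pre-processing steps above (imputation, standardization and one-hot encoding) can be sketched with plain scikit-learn transformers. This is only an illustration under assumptions: the toy arrays are hypothetical, the median and most-frequent strategies are chosen for the example, and automl-engine's own pipeline may order or configure these steps differently.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column,
# each with one missing value.
numeric = np.array([[22.0, 7.25], [38.0, np.nan], [np.nan, 8.05], [35.0, 53.1]])
categorical = np.array([["S"], ["C"], [np.nan], ["S"]], dtype=object)

# Numeric columns: fill gaps with the column median, then standardize.
numeric_steps = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

numeric_features = numeric_steps.fit_transform(numeric)
# OneHotEncoder returns a sparse matrix by default, so convert it to dense.
categorical_features = categorical_steps.fit_transform(categorical).toarray()

# Join the processed numeric and encoded categorical columns side by side.
features = np.hstack([numeric_features, categorical_features])
print(features.shape)
```

Keeping the numeric and categorical steps in separate pipelines makes it easy to swap `SimpleImputer` for `KNNImputer`, or `StandardScaler` for `MinMaxScaler`, without touching the other branch.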
Insights
Insight into the logic of automl-engine:
- Load data from file or memory for both training and testing with the class `FileLoad`, which also supports GCP's GCS files as a source.
- Build a processing pipeline object based on the data:
  (1). Imputation for both categorical and numerical data, with different logic for each. If the missing data in a column is over a threshold, that column is deleted. Missing data is filled with `KNNImputer` or `SimpleImputer`.
  (2). OneHot Encoding for categorical columns; the created columns are added to the original data.
  (3). Standardize data to remove differences in data range, which also benefits some algorithms such as SVM.
  (4). MinMax scaling to keep data in a 0-1 range.
  (5). FeatureSelection to keep features with a default threshold, or to select features with ExtraTree or LinearRegression.
  (6). PCA to reduce dimension if feature variance is over a threshold, keeping only the features that satisfy it.
- Build a Singleton backend object for file and data related functions.
- Build a training pipeline that instantiates each algorithm with a factory class, based on the pre-defined algorithms to use.
- Build a `SearchModel` class for each algorithm to find the best parameters with RandomSearch or GridSearch.
- Fit and transform with the pre-processing pipeline, then save the trained pipeline to disk for future use.
- Start training with the training pipeline on the processed data, running a parameter search to find the model with the best parameters, combined with a neural network search to find the best neural models. If validation is needed, part of the data is held out for validation, which reduces the training data size; alternatively, the trained auto_ml object can be used for validation afterwards.
- Use Ensemble logic (voting or stacking) to combine trained models into a new, more diverse model based on the best trained models.
- Evaluate each trained model on validation data and return a dictionary with training model name, training score and validation score.
- Support exporting trained models into a pre-defined folder of our choice.
- Support RESTful API calls to the best trained model, chosen by test score.
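The factory, parameter search and evaluation steps described above can be sketched with scikit-learn alone. The `model_factory` mapping, the parameter grids and the result keys below are hypothetical names chosen for this example; they are not automl-engine's `SearchModel` or factory classes, whose interfaces are not shown in this document.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
xtrain, xval, ytrain, yval = train_test_split(x, y, test_size=0.2, random_state=0)

# Hypothetical "factory": maps an algorithm name to a callable that builds a
# fresh estimator, together with the parameter grid to search.
model_factory = {
    "LogisticRegression": (
        lambda: LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "DecisionTree": (
        lambda: DecisionTreeClassifier(random_state=0),
        {"max_depth": [2, 4, None]},
    ),
}

results = {}
for name, (build, param_grid) in model_factory.items():
    # Search the grid with 3-fold cross-validation on the training split.
    search = GridSearchCV(build(), param_grid, cv=3)
    search.fit(xtrain, ytrain)
    best_model = search.best_estimator_
    # Record training and validation accuracy for the best model found.
    results[name] = {
        "train_score": best_model.score(xtrain, ytrain),
        "validation_score": best_model.score(xval, yval),
    }

# Pick the model with the highest validation score.
best_name = max(results, key=lambda n: results[n]["validation_score"])
print(results)
print("best model:", best_name)
```

Replacing `GridSearchCV` with `RandomizedSearchCV` gives the random-search variant; it samples a fixed number of parameter combinations instead of trying all of them.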
Project details
File details
Details for the file automl-engine-0.0.6.tar.gz.
File metadata
- Download URL: automl-engine-0.0.6.tar.gz
- Size: 57.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.4.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.26.0 CPython/3.7.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 170adb218c61d89c9f65b39c14b90e9be58b45cbecfdef08e8e411b9f9e5a76c |
| MD5 | b7b4aa4ff91cee2d3ce0764fede7f39c |
| BLAKE2b-256 | 976e8cefdc08cd8ea7380be86c015b6f6b3b4ce9d52a2b35de8ab2f64805de8d |