ML Automator
A fast, simple way to train machine learning algorithms
Author: Kevin Vecmanis
Machine Learning Automator (ML Automator) is an automation project that integrates Sequential Model-Based Optimization (SMBO) with the main learning algorithms from Python's scikit-learn library to produce a fast, automated tool for tuning machine learning algorithms. MLAutomator leverages the Hyperopt library to accomplish this.
What is SMBO?
SMBO is a form of hyperparameter tuning, like grid search and randomized search. In contrast to grid and randomized search, however, SMBO uses Bayesian optimization to build a probability model, through trial and error, that can better predict which hyperparameters might produce a better model. The "sequential" part means that multiple trials are run one after another, each testing new hyperparameters chosen by applying Bayesian reasoning to update the existing probability model.
The trade-off is that SMBO spends more time between iterations "selecting" the next choice of hyperparameters - but this is acceptable because the extra time taken to choose the next hyperparameters is typically significantly less than the cost of a training iteration. In other words, SMBO results in:
- Reduced time tuning hyperparameters compared to grid and random search methods.
- Better scores on the testing set.
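To make this concrete, here is a minimal, self-contained Hyperopt sketch of SMBO (a toy quadratic stands in for a real objective such as a cross-validated model score; this is not MLAutomator's internal code):
from hyperopt import Trials, fmin, hp, tpe

def objective(c):
    # Stand-in loss; in practice this would be e.g. 1 - cross-validated accuracy.
    return (c - 0.3) ** 2

trials = Trials()
best = fmin(
    fn=objective,
    space=hp.uniform('c', 0.0, 1.0),  # search space for a single hyperparameter
    algo=tpe.suggest,                 # Bayesian (TPE) choice of the next trial
    max_evals=50,                     # number of sequential trials
    trials=trials,                    # trial history the probability model learns from
)
print(best)  # e.g. {'c': 0.3012...}
Each evaluation of the objective is one "trial"; TPE uses the accumulating history in trials to propose hyperparameters that are progressively more likely to score well.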
Installation:
Installation is easy - of course, pip can be swapped for pip3 or pipenv (in a virtual environment):
pip install mlautomator
Key features:
- Optimizes across data pre-processing and feature selection in addition to hyperparameters.
- Fast, intelligent scan of parameter space using Hyperopt.
- Optimized parameter search permits scanning a larger cross-section of algorithms in the same period of time.
- An exceptional spot-checking tool for quickly ranking candidate algorithms.
Usage
MLAutomator accepts a training dataset X and a target Y. You can define your own functions for how these datasets are produced. Note that MLAutomator is designed to be a highly optimized spot-checking algorithm - take care to ensure that your data is free from errors and that any missing values have been dealt with.
MLAutomator will find ways of transforming and pre-processing your data to produce a superior model, but feel free to apply your own transformations before passing the data in, as sketched below.
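For example, a minimal pre-cleaning sketch (the log transform is just an illustration of a user-supplied step; x is your feature ndarray):
import numpy as np

x = np.nan_to_num(x)         # crude missing-value handling; impute more carefully in practice
x[:, 0] = np.log1p(x[:, 0])  # optional user transform, e.g. taming a right-skewed feature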
Optional data utilities
I'm building a suite of data utility functions which can prepare most classification and regression datasets. These, however, are optional - MLAutomator only requires X and Y inputs in the form of a numpy ndarray.
from data.utilities import clf_prep
x, y = clf_prep('pima-indians-diabetes.csv')
Once you have training and target data, this is the main call to use MLAutomator:
Classification Example: 2-class
from mlautomator.mlautomator import MLAutomator
automator = MLAutomator(x, y, iterations=25)  # number of SMBO trials to run
automator.find_best_algorithm()  # scan algorithms and tune their hyperparameters
automator.print_best_space()     # report the best configuration found
MLAutomator can typically find a ~98th-percentile solution in a fraction of the time of grid search or randomized search. Here it ran a comprehensive scan across all hyperparameters of 6 common machine learning algorithms and produced exceptional model performance on the classic Pima Indians Diabetes dataset.
Best Algorithm Configuration:
Best algorithm: Logistic Regression
Best accuracy : 77.73239917976761%
C : 0.02341
k_best : 6
penalty : l2
scaler : RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True, with_scaling=True)
solver : lbfgs
Found best solution on iteration 132 of 150
Validation used: 10-fold cross-validation
Classification Example: Multi-class
Here are the results from the classic iris dataset, a multi-class classification problem with three classes:
from data.utilities import from_sklearn
from mlautomator.mlautomator import MLAutomator
x, y = from_sklearn('iris')
automator = MLAutomator(x, y, iterations=30, algo_type='classifier', score_metric='accuracy')
automator.find_best_algorithm()
automator.print_best_space()
Best Algorithm Configuration:
Best algorithm: Bag of Support Vector Machine Classifiers
Best accuracy : 96.67%
C : 0.7064
degree : 2
gamma : auto
k_best : 2
kernel : rbf
n_estimators : 9
probability : True
scaler : None
Found best solution on iteration 3 of 30
Validation used: 10-fold cross-validation
Regression Example
ML Automator supports regression problems as well. In this example we load the Boston Housing dataset from sklearn.datasets using one of the utility functions.
from data.utilities import from_sklearn
x, y = from_sklearn('boston')
from mlautomator.mlautomator import MLAutomator
automator = MLAutomator(x, y, iterations=30, algo_type='regressor', score_metric='neg_mean_squared_error')
automator.find_best_algorithm()
automator.print_best_space()
Best Algorithm Configuration:
Best algorithm: K-Neighbor Regressor
Best neg_mean_squared_error : 10.41395782834094
algorithm : kd_tree
k_best : 11
n_neighbors : 2
scaler : StandardScaler(copy=True, with_mean=True, with_std=True)
weights : distance
Found best solution on iteration 24 of 30
Validation used: 10-fold cross-validation
Model Persistence
ML Automator allows you to fit, save, and load the optimal pipeline discovered by the find_best_algorithm() method. A complete workflow looks something like this:
from data.utilities import clf_prep
from mlautomator.mlautomator import MLAutomator
x, y = clf_prep('pima-indians-diabetes.csv')
automator = MLAutomator(x, y, iterations=30, algo_type='classifier', score_metric='accuracy')
automator.find_best_algorithm()                         # run the SMBO scan
automator.fit_best_pipeline()                           # fit the winning pipeline on x, y
automator.save_best_pipeline('Path/to/your/directory')  # persist it to disk
# some time later....
automator.load_best_pipeline('Path/to/your/directory')  # restore it for later use
Note that MLAutomator stores the entire transform/feature-selection/model pipeline for you, so none of the prerequisite processing needs to be repeated when you make predictions on out-of-sample data.
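Conceptually, what gets persisted resembles a fitted scikit-learn Pipeline saved with joblib (a minimal sketch of the idea, not MLAutomator's actual internals; the steps and file name here are assumptions):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
import joblib

x, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ('scaler', RobustScaler()),       # pre-processing
    ('k_best', SelectKBest(k=2)),     # feature selection
    ('model', LogisticRegression()),  # estimator
]).fit(x, y)

joblib.dump(pipe, 'best_pipeline.joblib')     # save the whole pipeline
loaded = joblib.load('best_pipeline.joblib')  # ...some time later
preds = loaded.predict(x)  # scaling and feature selection are re-applied automatically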
Existing Algorithm Support
MLAutomator currently supports the following algorithms:
Classification:
- XGBoost Classifier
- Random Forest Classifier
- Support Vector Machines
- Naive Bayes Classifier
- Stochastic Gradient Descent Classification (SGD)
- K-Nearest Neighbors Classification
- Logistic Regression
Regression:
- XGBoost Regressor
- Random Forest Regressor
- Support Vector Machine Regression
- SGD Regression
- K-Nearest Neighbors Regression
Unless otherwise declared using the specific_algos argument, MLAutomator will scan all algorithms to find the best performer.
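For instance, a hedged sketch of restricting the scan (the exact keys accepted by specific_algos are an assumption here - consult the package source for the real names):
# NOTE: the algorithm names below are illustrative assumptions, not confirmed keys.
automator = MLAutomator(x, y, iterations=30, specific_algos=['Logistic Regression', 'XGBoost Classifier'])
automator.find_best_algorithm()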