LazyGrid: memoization of ML models

LazyGrid

LazyGrid is a machine learning model comparator that follows the memoization paradigm: it saves fitted models and returns them when they are required again later.

Installation
How to use
Contributing to LazyGrid

Installation

You can install LazyGrid from PyPI:

$ pip install lazygrid

LazyGrid is known to work on Python 3.5 and above. The package is compatible with scikit-learn 0.21 and Keras 2.2.5.

How to use

LazyGrid has three main features:

it can generate all possible pipelines given a set of steps
it can compare the performance of a list of models using cross-validation and statistical tests
it follows the memoization paradigm, avoiding fitting a model or a pipeline step twice

Pipeline generation

To generate all possible pipelines from a set of steps, define a list whose elements are themselves lists of candidate pipeline steps (preprocessors, feature selectors, classifiers, etc.). Each step can be either a sklearn object or a Keras model.

Once you have defined the pipeline elements, the generate_grid method returns a list of models of type sklearn.pipeline.Pipeline.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import RobustScaler, StandardScaler
import lazygrid as lg

preprocessors = [StandardScaler(), RobustScaler()]
feature_selectors = [SelectKBest(score_func=f_classif, k=1), SelectKBest(score_func=f_classif, k=2)]
classifiers = [RandomForestClassifier(random_state=42), SVC(random_state=42)]

elements = [preprocessors, feature_selectors, classifiers]

list_of_models = lg.generate_grid(elements)
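
The idea of enumerating "all possible pipelines" can be illustrated without any third-party dependency. The snippet below is only a sketch: it assumes that a pipeline picks exactly one step from each list, in order, and it uses plain strings as stand-ins for estimator objects. The helper name enumerate_step_combinations is hypothetical and is not part of LazyGrid.

```python
from itertools import product

# Plain strings stand in for estimator objects (assumption for illustration).
preprocessors = ["StandardScaler", "RobustScaler"]
feature_selectors = ["SelectKBest(k=1)", "SelectKBest(k=2)"]
classifiers = ["RandomForestClassifier", "SVC"]

elements = [preprocessors, feature_selectors, classifiers]


def enumerate_step_combinations(elements):
    """Return every way of picking exactly one step from each list, in order."""
    return [list(steps) for steps in product(*elements)]


combinations = enumerate_step_combinations(elements)
print(len(combinations))  # 2 * 2 * 2 = 8
print(combinations[0])    # ['StandardScaler', 'SelectKBest(k=1)', 'RandomForestClassifier']
```

Under this assumption the number of pipelines is the product of the list lengths, which is why avoiding repeated fits of shared steps matters as the grid grows.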

Model comparison

Once you have generated a list of models (or pipelines), LazyGrid provides friendly APIs to compare their performance: models are scored with a cross-validation procedure, and the resulting scores are analyzed with statistical hypothesis tests.

First, define a classification task (e.g. x, y = make_classification(random_state=42)) and the set of models you would like to compare (e.g. model1 = LogisticRegression(random_state=42)), then call the cross_val_score function provided by sklearn for each model.

Finally, collect the cross-validation scores into a single list and call the find_best_solution method provided by LazyGrid. This method applies the following algorithm:

it looks for the model having the highest mean value over its cross-validation scores ("the best model");
it compares the distribution of the scores of each model against the distribution of the scores of the best model applying a statistical hypothesis test.

You can customize the comparison by modifying the statistical hypothesis test (it should be compatible with scipy.stats) or the significance level for the test.

from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
import lazygrid as lg
from scipy.stats import mannwhitneyu

x, y = make_classification(random_state=42)

model1 = LogisticRegression(random_state=42)
model2 = RandomForestClassifier(random_state=42)
model3 = RidgeClassifier(random_state=42)

score1 = cross_val_score(estimator=model1, X=x, y=y, cv=10)
score2 = cross_val_score(estimator=model2, X=x, y=y, cv=10)
score3 = cross_val_score(estimator=model3, X=x, y=y, cv=10)

scores = [score1, score2, score3]
best_idx, best_solutions_idx, pvalues = lg.find_best_solution(scores, test=mannwhitneyu, alpha=0.05)
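
The two-step algorithm described above can be sketched with the standard library alone. This is not LazyGrid's implementation: the function find_best_sketch, the permutation test used in place of a scipy.stats test, the rule "keep a model when its p-value is at least alpha", and the score values are all assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Stand-in for a scipy.stats test: returns (statistic, p-value).
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


def find_best_sketch(scores, test=permutation_test, alpha=0.05):
    # Step 1: the best model has the highest mean cross-validation score.
    means = [mean(s) for s in scores]
    best_idx = means.index(max(means))
    # Step 2: compare every model's scores against the best model's scores.
    pvalues = [test(scores[best_idx], s)[1] for s in scores]
    # Keep the models that are not significantly different from the best one.
    best_solutions_idx = [i for i, p in enumerate(pvalues) if p >= alpha]
    return best_idx, best_solutions_idx, pvalues


# Made-up scores for three hypothetical models (five folds each).
toy_scores = [
    [0.90, 0.92, 0.91, 0.93, 0.89],  # close to the best model
    [0.91, 0.90, 0.92, 0.94, 0.90],  # highest mean
    [0.60, 0.62, 0.58, 0.61, 0.59],  # clearly worse
]
best_idx, best_solutions_idx, pvalues = find_best_sketch(toy_scores)
print(best_idx)            # 1
print(best_solutions_idx)  # [0, 1]
```

In this toy run the first two models are statistically indistinguishable, so both are kept, while the third is rejected.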

Memoization: optimized cross-validation

LazyGrid includes an optimized implementation of cross-validation (cross_validation), specifically devised for cases in which a large number of machine learning pipelines need to be compared.

In fact, once a pipeline step has been fitted, LazyGrid saves the fitted model into a SQLite database. Therefore, should the step be required by another pipeline, LazyGrid fetches the model that has already been fitted from the database.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.datasets import make_classification
import lazygrid as lg

x, y = make_classification(random_state=42)

preprocessors = [StandardScaler(), RobustScaler()]
feature_selectors = [SelectKBest(score_func=f_classif, k=1), SelectKBest(score_func=f_classif, k=2)]
classifiers = [RandomForestClassifier(random_state=42), SVC(random_state=42)]

elements = [preprocessors, feature_selectors, classifiers]

models = lg.generate_grid(elements)

for model in models:
    score, fitted_models = lg.cross_validation(model=model, x=x, y=y, 
                                               db_name="database", dataset_id=1, 
                                               dataset_name="make-classification")
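
The "fit once, fetch later" behaviour can be sketched with sqlite3 and pickle from the standard library. Everything below is an assumption made for illustration, not LazyGrid's actual schema or key format: the table fitted_steps, the helper fit_or_fetch, the toy fitting function, and the choice of hashing the step name, its parameters, and the dataset id.

```python
import hashlib
import pickle
import sqlite3

fit_calls = {"count": 0}  # counts how many times a real fit happens


def fit_toy_scaler(x, factor):
    """Toy stand-in for fitting a pipeline step; returns the learned state."""
    fit_calls["count"] += 1
    return {"factor": factor, "mean": factor * sum(x) / len(x)}


def fit_or_fetch(conn, step_name, params, x, dataset_id):
    """Return the fitted state, reading it from SQLite if already stored."""
    raw = repr((step_name, sorted(params.items()), dataset_id))
    key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT payload FROM fitted_steps WHERE step_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return pickle.loads(row[0])  # cache hit: nothing is fitted
    state = fit_toy_scaler(x, **params)  # cache miss: fit, then store
    conn.execute(
        "INSERT INTO fitted_steps (step_key, payload) VALUES (?, ?)",
        (key, pickle.dumps(state)),
    )
    conn.commit()
    return state


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fitted_steps (step_key TEXT PRIMARY KEY, payload BLOB)")

data = [1.0, 2.0, 3.0]
first = fit_or_fetch(conn, "toy_scaler", {"factor": 2.0}, data, dataset_id=1)
second = fit_or_fetch(conn, "toy_scaler", {"factor": 2.0}, data, dataset_id=1)
print(fit_calls["count"])  # 1: the second call was served from the database
print(second["mean"])      # 4.0
```

A real cross-validation cache would also need the key to identify the training fold and the preceding pipeline steps; the sketch leaves that out for brevity.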

Plots

LazyGrid includes some standard features for presenting results as plots, including confusion matrices and box plots.

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import lazygrid as lg

x, y = make_classification(random_state=42)

model = LogisticRegression(random_state=42)
score, fitted_models = lg.cross_validation(model=model, x=x, y=y, 
                                           db_name="database", dataset_id=1, 
                                           dataset_name="make-classification")

conf_mat = lg.confusion_matrix_aggregate(fitted_models, x, y)
classes = ["P", "N"]
title = "Confusion matrix"
lg.plot_confusion_matrix(conf_mat, classes, "conf_mat.png", title)

Automatic comparison

The compare_models method provides a convenient way to compare a list of models:

it calls the cross_validation method for each model, automatically performing the optimized cross-validation using the memoization paradigm;
it calls the find_best_solution method, applying a statistical test on the cross-validation results;
it returns a pandas.DataFrame containing a summary of the results.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.datasets import make_classification
import lazygrid as lg

x, y = make_classification(random_state=42)

preprocessors = [StandardScaler(), RobustScaler()]
feature_selectors = [SelectKBest(score_func=f_classif, k=1), SelectKBest(score_func=f_classif, k=2)]
classifiers = [RandomForestClassifier(random_state=42), SVC(random_state=42)]

elements = [preprocessors, feature_selectors, classifiers]

models = lg.generate_grid(elements)

# one (empty) dictionary of fit parameters per model
fit_params = [{} for _ in models]

results = lg.compare_models(models=models, x_train=x, y_train=y, params=fit_params,
                            dataset_id=1, dataset_name="make-classification", n_splits=10)

Data sets APIs

LazyGrid includes a set of easy-to-use APIs to fetch OpenML data sets (NB: OpenML has a database of more than 20000 data sets).

The fetch_datasets method handles such data sets in three steps:

it looks for OpenML data sets compliant with the requirements specified;
for such data sets, it fetches the characteristics of their latest version;
it saves in a local cache file the properties of such data sets, so that experiments can be easily reproduced using the same data sets and versions.

The load_openml_dataset method can then be used to download the required data set version.

import lazygrid as lg

datasets = lg.fetch_datasets(task="classification", min_classes=2, 
                             max_samples=1000, max_features=10)

# get the latest (or cached) version of the iris data set
data_id = datasets.loc["iris"].did

x, y, n_classes = lg.load_openml_dataset(data_id)
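
The reproducibility idea behind the local cache can be sketched as follows. This is a guess at the general pattern, not LazyGrid's code: the helper fetch_with_cache, the JSON file format, the query_remote callback, and the rule that a version already pinned in the cache takes precedence over a newer remote version are all assumptions.

```python
import json
import os
import tempfile


def fetch_with_cache(cache_path, query_remote):
    """Return {name: version}, preferring versions pinned in the cache file."""
    cached = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            cached = json.load(fh)
    latest = query_remote()
    # Pinned versions win; only data sets not seen before take the latest one.
    merged = {**latest, **cached}
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(merged, fh, indent=2, sort_keys=True)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "datasets.json")
    # First experiment: the (hypothetical) remote reports iris at version 1.
    first_run = fetch_with_cache(cache_file, lambda: {"iris": 1})
    # Later experiment: the remote has moved on, but iris stays pinned.
    second_run = fetch_with_cache(cache_file, lambda: {"iris": 3, "wine": 1})

print(first_run)   # {'iris': 1}
print(second_run)  # {'iris': 1, 'wine': 1}
```

Pinning versions this way means a rerun of an experiment sees the same data even after a data set has been updated upstream.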

Licence

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and limitations under the License.
