An interface to scipy.optimize.minimize for training Keras models with batch optimization algorithms like L-BFGS.

Kormos

The kormos package provides an interface between scipy.optimize.minimize and Keras for training models with deterministic minimization algorithms like L-BFGS.

It provides Keras users with:

  • keras.Model subclasses that may be optimized without changes in the API—a model may be trained using either the built-in stochastic mini-batch algorithms or the deterministic batch algorithms from scipy.optimize

  • Out-of-the-box interoperability with the usual Keras utilities and workflows; e.g.:

    • fit() still returns a history object with optimization metadata and validation metrics from each iteration of the solver, and is usable by KerasTuner

    • Support for distributed training strategies (at least in principle—this has admittedly not been integration tested)

  • The ability to use second order optimization methods from scipy by evaluating Hessian-vector-products if you have a very specific need for this (spoiler: you almost certainly do not)

Motivation

Why would anyone want to go full batch in this day and age?

Keras is a powerful tool for developing predictive models and optimizing their parameters in a high-level language. While its primary use case is large-scale deep learning, Keras’s auto-differentiation utilities (from Tensorflow) that enable rapid prototyping and optimization with gradient-based minimization algorithms are great for other use cases in mathematical modelling and numerical optimization.

If you are working with models or datasets in which training data can reasonably fit in memory on a single machine, you may have situations in which deterministic algorithms like L-BFGS or Newton-CG are complementary or viable alternatives to the stochastic optimizers available in Keras, since:

  • deterministic algorithms don’t require additional hyperparameter tuning to ensure convergence; if you’re just prototyping something small and having trouble tuning learning rates, you may just want to crank out L-BFGS for a few minutes as a sanity check that your model can in fact be optimized

  • these algorithms may have faster convergence to accurate solutions of the optimization problem if the dataset is small enough that full batch gradient and Hessian-vector-product computation times aren’t prohibitive

So TL;DR: because Luddites exist even in the field of numerical optimization.

Why The Name Kormos?

Because Keras is a powerful and useful tool, and is named after the Greek word κέρας, which means horn.

Kormos is related to Keras, but it’s not very powerful or useful. It’s named after the Greek word κορμός, which means stump.

License

This project is released under the MIT license, and contains adaptations of other codes released under the Apache and MIT licenses. Please see the header in each source file for the applicable license and copyright notices therein.

Setup

Requirements

The kormos package is built for:

  • Python 3+

  • Tensorflow 2+ (and the respective tensorflow.keras module with it)

  • Scipy 0.1+ (any version really, since the scipy.optimize.minimize signature is stable)

Installation

Install via the PyPI package [kormos](https://pypi.org/project/kormos/) using pip:

pip3 install kormos

Alternatively, if you like your releases bloody rare you may install from git directly:

pip3 install git+https://github.com/mbhynes/kormos

Usage Examples

A kormos model is drag-and-drop replaceable with any Keras model. Below we provide some toy code examples, including Collaborative Filtering and MNIST classification examples adapted from the Code Examples section of keras.io.

Example: Linear Regression with Sequential API

import numpy as np
from tensorflow import keras

import kormos

rank = 50

# Define the model using the Keras Sequential API
model = kormos.models.BatchOptimizedSequentialModel()
model.add(
    keras.layers.Dense(
        units=1,
        input_shape=(rank,),
        activation=None,
        use_bias=False,
        kernel_regularizer=keras.regularizers.L2(1e-3),
        kernel_initializer="ones",
    )
)
loss = keras.losses.MeanSquaredError()
model.compile(loss=loss, optimizer='l-bfgs-b', metrics=['mean_absolute_error'])

# Generate samples of normally distributed random data
np.random.seed(1)
w = np.random.normal(size=rank)
X = np.random.normal(size=(1000, rank))
y = np.expand_dims(X.dot(w), axis=1)

Xval = np.random.normal(size=(1000, rank))
yval = np.expand_dims(Xval.dot(w), axis=1)

# Fit the model
history = model.fit(
    x=X,
    y=y,
    epochs=10,
    validation_data=(Xval, yval),
    options={"maxcors": 3}, # can pass options payload if so desired
)
best_fit_weights = np.reshape(model.trainable_weights[0].numpy(), (1, -1))
assert np.allclose(best_fit_weights, w, 1e-2)

We can now inspect the optimization metrics traced in the history object returned from fit(). The training metrics captured by kormos include the:

  • training loss function value (including regularization terms)

  • 2-norm of the batch gradient

  • number of evaluations of the loss/gradient function (equivalent to an epoch for a stochastic optimizer)

  • number of evaluations of the Hessian-vector-product function, if applicable (equivalent to an epoch for a stochastic optimizer)

>>> import pandas as pd; pd.DataFrame(history.history)
        loss       grad  fg_evals  hessp_evals   val_loss  val_mean_absolute_error
0  79.121972  17.946233         2            0  78.418121                 7.137860
1   0.192005   0.713242         3            0   0.232164                 0.344657
2   0.056429   0.186013         4            0   0.059140                 0.088700
3   0.047397   0.042760         5            0   0.047348                 0.015531
4   0.047006   0.008019         6            0   0.047006                 0.006401
5   0.046991   0.001854         7            0   0.046994                 0.005846
6   0.046990   0.000350         8            0   0.046992                 0.005675
7   0.046990   0.000073         9            0   0.046992                 0.005642
8   0.046990   0.000051        11            0   0.046992                 0.005642

We can now also recompile the model to use a stochastic optimizer; let’s refit the model using ADAM:

# Recompile the model to use a different optimizer (this doesn't change its weights)
model.compile(loss=model.loss, optimizer='adam', metrics=['mean_absolute_error'])

# Reset the weights
model.set_weights([np.random.random(size=(rank, 1))])

# Fit the model using ADAM
history = model.fit(
    x=X,
    y=y,
    epochs=150,
    validation_data=(Xval, yval),
)

This is a somewhat contrived example in modern machine learning (small dataset and simple model with very few parameters), but it’s the kind of classical use case in which a deterministic algorithm will converge faster than a stochastic algorithm. If you were interested in Keras primarily for the nice tensorflow API and autodifferentiation routines, but had unsexy, non-deep modelling goals, this bud’s for you:

>>> import pandas as pd; pd.DataFrame(history.history)
          loss  mean_absolute_error   val_loss  val_mean_absolute_error
0    59.751369             6.218111  52.518566                 5.756832
1    50.042812             5.688218  45.344589                 5.346300
2    43.674156             5.308869  40.368832                 5.043641
3    39.074280             5.021304  36.492527                 4.795147
4    35.389912             4.781666  33.423710                 4.588754
..         ...                  ...        ...                      ...
145   0.047031             0.008966   0.047031                 0.009047
146   0.047023             0.008606   0.047025                 0.008718
147   0.047017             0.008268   0.047019                 0.008344
148   0.047012             0.007934   0.047013                 0.007977
149   0.047008             0.007655   0.047009                 0.007717

[150 rows x 4 columns]

Example: Linear Regression using the Functional API

The same linear regression model as above may be expressed equivalently by the functional API. Here we specify a different scipy solver, the Newton-CG algorithm that uses Hessian-vector-products:

# Define the model using the Keras functional API
model_input = keras.Input(shape=(rank,), name="input")
model_output = keras.layers.Dense(
    units=1,
    input_shape=(rank,),
    activation=None,
    use_bias=False,
    kernel_regularizer=keras.regularizers.L2(1e-3),
    kernel_initializer="ones",
)(model_input)
model = kormos.models.BatchOptimizedModel(
    inputs=model_input,
    outputs=model_output,
)
loss = keras.losses.MeanSquaredError()
model.compile(loss=loss, optimizer='newton-cg', metrics=['mean_absolute_error'])

# Fit the model on the same data as previously
history = model.fit(
    x=X,
    y=y,
    epochs=10,
    validation_data=(Xval, yval),
)
best_fit_weights = np.reshape(model.trainable_weights[0].numpy(), (1, -1))
assert np.allclose(best_fit_weights, w, 1e-2)

The Newton-CG algorithm has second order convergence, so we should find that the gradient norm has decreased by several orders of magnitude more than with the L-BFGS-B algorithm. (Of course, practically speaking this is a moot point in the world of approximate parameter estimation due to the limitations of both imperfect models and sampling bias that exists in training datasets: the numerical error in the solution is orders of magnitude smaller than other errors…)
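
The gradient norm is recorded in the history under the grad key (the same column shown in the L-BFGS-B table above), so we can read off the final value directly:

# Inspect the final batch gradient 2-norm recorded by the solver
final_grad_norm = history.history["grad"][-1]
print("Final batch gradient 2-norm: {:.3e}".format(final_grad_norm))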

Example: Collaborative Filtering for Item Recommendation

We present a simple linear matrix factorization model for building a recommender system using the MovieLens dataset, and use the same preprocessing steps as in the Keras example, Collaborative Filtering for Movie Recommendations.

Define the Model

We define a simple matrix factorization model for factorizing the ratings matrix into the product of 2 latent feature matrices, represented by user and item embeddings:

import tensorflow as tf
from tensorflow import keras
import kormos

def build_model(rank, num_users, num_items, **kwargs):
    inputs = [
        keras.Input(shape=(1,), name="user", dtype=tf.int32),
        keras.Input(shape=(1,), name="item", dtype=tf.int32),
    ]
    user_embedding = keras.layers.Embedding(
        input_dim=(num_users + 1),
        output_dim=rank,
        mask_zero=True,
        embeddings_initializer="normal",
        embeddings_regularizer=keras.regularizers.L2(1e-5),
        name="user_embedding",
    )
    item_embedding = keras.layers.Embedding(
        input_dim=(num_items + 1),
        output_dim=rank,
        mask_zero=True,
        embeddings_initializer="normal",
        embeddings_regularizer=keras.regularizers.L2(1e-5),
        name="item_embedding",
    )
    features = [
        user_embedding(inputs[0]),
        item_embedding(inputs[1]),
    ]
    output = keras.layers.Dot(axes=2, normalize=False)(features)
    model = kormos.models.BatchOptimizedModel(
        inputs=inputs,
        outputs=output,
        **kwargs
    )
    return model

Prepare the Data

We run the same pre-processing steps as in the Keras example above. Please be aware that there are methodological errors in these steps that we have left unchanged: (1) it is not correct to split the training and testing data uniformly at random, since some movies have only 1 rating and hence should not be members of the testing set (a quick check of this is sketched after the preprocessing code below), and (2) it is not possible to meaningfully fit a factorization model that represents each user/item by a vector of rank k if k is greater than the number of observations (ratings) that user/item has in the training data, since such a user's/item's subproblem is underdetermined.

import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path

# Download the data from http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# Use the ratings.csv file
movielens_data_file_url = (
    "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
)
movielens_zipped_file = keras.utils.get_file(
    "ml-latest-small.zip", movielens_data_file_url, extract=False
)
keras_datasets_path = Path(movielens_zipped_file).parents[0]
movielens_dir = keras_datasets_path / "ml-latest-small"

# Only extract the data the first time the script is run.
if not movielens_dir.exists():
    with ZipFile(movielens_zipped_file, "r") as zip:
        # Extract files
        print("Extracting all the files now...")
        zip.extractall(path=keras_datasets_path)
        print("Done!")

ratings_file = movielens_dir / "ratings.csv"
df = pd.read_csv(ratings_file)

user_ids = df["userId"].unique().tolist()
user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}
movie_ids = df["movieId"].unique().tolist()
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}
df["user"] = df["userId"].map(user2user_encoded)
df["movie"] = df["movieId"].map(movie2movie_encoded)

num_users = len(user2user_encoded)
num_movies = len(movie_encoded2movie)
df["rating"] = df["rating"].values.astype(np.float32)
# min and max ratings will be used to normalize the ratings later
min_rating = min(df["rating"])
max_rating = max(df["rating"])

print(
    "Number of users: {}, Number of Movies: {}, Min rating: {}, Max rating: {}".format(
        num_users, num_movies, min_rating, max_rating
    )
)

df = df.sample(frac=1, random_state=42)
x = df[["user", "movie"]].values
# Normalize the targets between 0 and 1. Makes it easy to train.
y = df["rating"].apply(lambda x: (x - min_rating) / (max_rating - min_rating)).values
# Assuming training on 90% of the data and validating on 10%.
train_indices = int(0.9 * df.shape[0])
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)
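
As a quick illustration of caveat (1) above, we can count the movies that have exactly one rating; under a uniform random split, any such movie that lands in the testing set has no training observations at all. (This check is an illustrative addition and is not part of the original Keras example.)

# Movies with exactly one rating cannot appear in both the training
# and testing splits under a uniform random split
single_rating_movies = (df.groupby("movieId").size() == 1).sum()
print("Movies with exactly one rating: {}".format(single_rating_movies))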

Train the Model

We may now train our factorization model:

rank = 5
model = build_model(rank, num_users, num_movies)
model.compile(
    loss=tf.keras.losses.MeanSquaredError(),
    optimizer="l-bfgs-b",
)

history = model.fit(
  x=(x_train[:, 0], x_train[:, 1]),
  y=y_train,
  batch_size=2**14,
  epochs=10,
  verbose=1,
  validation_data=((x_val[:, 0], x_val[:, 1]), y_val),
)
>>> import pandas as pd; pd.DataFrame(history.history)
        loss      grad  fg_evals  hessp_evals  val_loss
0   0.499431  0.001055         2            0  0.497424
1   0.492091  0.010318         5            0  0.496749
2   0.491067  0.015367         7            0  0.499127
3   0.461140  0.012731         9            0  0.472772
4   0.271020  0.017515        12            0  0.327173
5   0.228658  0.021585        14            0  0.298120
6   0.156481  0.012698        16            0  0.226349
7   0.125350  0.007833        17            0  0.193145
8   0.101411  0.007957        18            0  0.169513
9   0.093375  0.013233        19            0  0.162208
10  0.082876  0.005307        20            0  0.152423
11  0.077789  0.004717        21            0  0.149731
12  0.072867  0.004420        22            0  0.144979
13  0.066927  0.006463        23            0  0.137852
14  0.063850  0.004983        24            0  0.136306
15  0.061897  0.002353        25            0  0.133633
16  0.060514  0.001867        26            0  0.132471
17  0.058629  0.002211        27            0  0.131402
18  0.057408  0.003710        28            0  0.130704
19  0.056111  0.001484        29            0  0.129850

Example: MNIST convnet

As a more realistic example of using kormos on a canonical dataset, we adapt the sample classification problem from the MNIST convnet example. Please note that this convolutional network model has a large number of highly correlated parameters to optimize, and stochastic algorithms like ADAM will generally perform better here. However, we provide it as an example of how stochastic and deterministic algorithms may be combined by recompiling a kormos model.

Prepare the Data

import numpy as np

from tensorflow import keras
from tensorflow.keras import layers

# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# Load the data and split it between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

Build the Model

import kormos.optimizers
from kormos.models import BatchOptimizedSequentialModel

def build_model():
    model = BatchOptimizedSequentialModel(
        [
            keras.Input(shape=input_shape),
            layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Flatten(),
            layers.Dropout(0.5),
            layers.Dense(num_classes, activation="softmax"),
        ]
    )
    return model

model = build_model()
model.summary()
Model: "batch_optimized_sequential_model"
_________________________________________________________________
 Layer (type)                   Output Shape              Param #
=================================================================
 conv2d (Conv2D)                (None, 26, 26, 32)        320

 max_pooling2d (MaxPooling2D)   (None, 13, 13, 32)        0

 conv2d_1 (Conv2D)              (None, 11, 11, 64)        18496

 max_pooling2d_1 (MaxPooling2D) (None, 5, 5, 64)          0

 flatten (Flatten)              (None, 1600)              0

 dropout (Dropout)              (None, 1600)              0

 dense (Dense)                  (None, 10)                16010

=================================================================
Total params: 34,826
Trainable params: 34,826
Non-trainable params: 0
_________________________________________________________________

Train the Model

We use this example to train the model with a combination of different algorithms. We start by running ADAM for 1 epoch, and then use this solution as a warm-start initial guess for a batch solver by recompiling the model:

loss = keras.losses.CategoricalCrossentropy()
# Train a model with ADAM
model = build_model()
model.compile(loss=loss, optimizer="adam", metrics=["accuracy"])
hist1 = model.fit(x_train, y_train, batch_size=2**5, epochs=1, validation_data=(x_test, y_test))

# Continue training the model with a batch algorithm.
# We can also instantiate the optimizer directly instead of using a string identifier
optimizer = kormos.optimizers.ScipyBatchOptimizer()
model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])

# We can specify the method and any options for it in fit as keyword args
hist2 = model.fit(
    x_train,
    y_train,
    batch_size=2**14, # this is much larger than for stochastic solvers!
    epochs=3,
    validation_data=(x_test, y_test),
    method='bfgs',
)
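
To compare the two training phases we can line up their history objects side by side. The batch phase records solver metadata (grad, fg_evals, hessp_evals) in addition to the usual loss and validation metrics, so columns present in only one phase appear as NaN in the other:

import pandas as pd

# Stack the ADAM and BFGS training histories for inspection
combined = pd.concat(
    [pd.DataFrame(hist1.history), pd.DataFrame(hist2.history)],
    keys=["adam", "bfgs"],
)
print(combined)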

Implementation Details

The kormos package implements an interface for batch optimization and wraps scipy.optimize.minimize behind that interface, in the following steps:

  • We create a subclass of keras.Model, BatchOptimizedModel (and BatchOptimizedSequentialModel to extend the Sequential API).

  • The subclass provides a fit_batch() method with a nearly identical signature to the parent fit(), but does not perform stochastic mini-batch optimization. Instead, this method offloads all optimization to the model’s optimizer attribute, which must implement the method minimize() to perform training by minimizing the loss function provided during model compilation.

  • When a BatchOptimizedModel is compiled with a BatchOptimizer (or a string identifier for one) as its optimizer argument, the fit() method inherited from keras.Model is overridden with a pointer to fit_batch() (such that a BatchOptimizedModel may be trained with either stochastic or deterministic solvers, depending on how it’s compiled).

  • The ScipyBatchOptimizer class extends the BatchOptimizer interface and uses the scipy.optimize.minimize routine to fit the model; a minimal sketch of this bridging is given below.
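
The core of this bridge can be illustrated with a minimal sketch (an illustration only, not kormos’s actual implementation): flatten the model’s trainable weights into a single parameter vector, expose a function that returns the full-batch loss and gradient for a given vector, and hand both to scipy.optimize.minimize:

import numpy as np
import tensorflow as tf
from scipy.optimize import minimize

def scipy_fit(model, loss_fn, x_data, y_data, method="L-BFGS-B", **options):
    # Illustration only: map the model's trainable weights to/from a 1-D vector
    shapes = [w.shape.as_list() for w in model.trainable_weights]
    sizes = [int(np.prod(s)) for s in shapes]

    def set_flat_weights(flat):
        # Unpack the flat parameter vector back into the model's variables
        offset = 0
        for var, size, shape in zip(model.trainable_weights, sizes, shapes):
            chunk = np.reshape(flat[offset:offset + size], shape)
            var.assign(chunk.astype(var.dtype.as_numpy_dtype))
            offset += size

    def func_and_grad(flat):
        # Full-batch loss (including regularization terms) and its gradient,
        # returned as float64 for scipy
        set_flat_weights(flat)
        with tf.GradientTape() as tape:
            loss = loss_fn(y_data, model(x_data, training=True))
            if model.losses:
                loss += tf.add_n(model.losses)
        grads = tape.gradient(loss, model.trainable_weights)
        flat_grad = np.concatenate([g.numpy().ravel() for g in grads])
        return float(loss), flat_grad.astype(np.float64)

    x0 = np.concatenate([w.numpy().ravel() for w in model.trainable_weights])
    result = minimize(func_and_grad, x0.astype(np.float64), jac=True, method=method, options=options)
    set_flat_weights(result.x)
    return result

In kormos this logic sits behind the BatchOptimizer.minimize() interface, so that fit() on a compiled BatchOptimizedModel can delegate to it while still populating the history object and validation metrics at each solver iteration.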

At first glance this is more complicated than the recommended way of extending Keras to perform custom training (i.e. by overriding keras.Model.train_step(), as in the article Customizing what happens in fit()). However, we found extending train_step() to be awkward or infeasible for implementing a batch optimization algorithm while still making use of the standard Keras utilities for computing validation metrics at the end of each iteration (epoch). Overriding the model’s train_step() (and putting the call to scipy.optimize.minimize inside it) would mean that, from the Keras model’s perspective, only a single epoch is performed, such that validation metrics would only be computed at the very end of the optimization routine.
