Lambda Learner
What is it
Lambda Learner is a library for iterative incremental training of a class of supervised machine learning models. Using the Generalized Additive Mixed-Effect (GAME) framework, one can divide a model into two components: (a) Fixed Effects - a typically large "fixed-effects" model (generalization) that is trained on the whole dataset to improve the model's performance on previously unseen user-item pairs, and (b) Random Effects - a series of simpler linear "random-effects" models (memorization) trained on data corresponding to each entity (e.g. user or article or ad) for more granular personalization.
The two main choices in defining a GAME architecture are 1) the model class for the fixed-effects model, and 2) which random effects to include. The fixed-effects model can be of any model class - typically TensorFlow, DeText, GDMix, or XGBoost. As for the random effects, this choice is framed by your training data, specifically by the keys/ids of your training examples. If your training examples are keyed by a single id space (say userId), then you will have one series of random effects keyed by userId (per-user random effects). If your data is keyed by multiple id spaces (say userId and movieId), then you can have up to one series of random effects per id type (per-user random effects and per-movie random effects). However, it is not necessary to have random effects for all ids; the choice is largely a modeling concern.
Lambda Learner currently supports using any fixed-effects model, but random effects for only a single id type.
Bringing these two pieces together, the score from the fixed-effects model is improved upon using a random-effects linear model, with the global model's output score acting as the bias/offset for the linear model (so the random effect effectively fits the residual). Once the fixed-effects model has been trained, the training of random effects can occur independently and in parallel. The library supports incremental updates to the random-effects components of a GAME model in response to mini-batches from data streams. Currently the following algorithms for updating a random effect are supported:
- Linear regression.
- Logistic regression.
- Sequential Bayesian logistic regression (as described in the Lambda Learner paper).
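Concretely, a random-effects model scores an example as a linear correction on top of the fixed-effects score. A minimal sketch for the logistic case (illustrative only, not the library's API; the `weights` and `features` dictionaries keyed by (name, term) pairs are assumptions for the example):

```python
import math

def random_effect_score(offset, weights, features):
    """Final score = sigmoid(fixed-effects offset + random-effect linear term).

    `weights` and `features` map (name, term) pairs to floats.
    """
    z = offset + sum(weights.get(key, 0.0) * value for key, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

score = random_effect_score(
    offset=0.7,                           # score produced by the global fixed-effects model
    weights={("season", "winter"): 0.4},  # this entity's random-effect coefficients
    features={("season", "winter"): 1.0},
)
```

The offset is held fixed during the random-effect update, which is what lets each entity's model be trained independently.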
The library supports maintaining a model-coefficient Hessian matrix, representing uncertainty about coefficient values, in addition to point estimates of the coefficients. This allows the random effects to be used as multi-armed bandits via techniques such as Thompson sampling.
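For intuition, with per-coefficient means and variances one can Thompson-sample a coefficient vector before scoring, rather than always using the posterior mean. A sketch assuming a diagonal Gaussian posterior (the function and argument names are illustrative, not the library's API):

```python
import random

def thompson_sample(means, variances, rng):
    """Draw one plausible coefficient vector from a diagonal Gaussian posterior."""
    return [rng.gauss(mean, var ** 0.5) for mean, var in zip(means, variances)]

rng = random.Random(7)
# Means/variances as might come from a trained Bayesian random-effect model.
sampled_coeffs = thompson_sample([0.423, 0.564], [0.01, 0.04], rng)
```

Scoring candidates with freshly sampled coefficients, instead of the mean, is what produces Thompson-sampling exploration: coefficients with high uncertainty occasionally get boosted scores.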
Why Lambda Learner
One of the most well-established applications of machine learning is deciding what content to show website visitors. When observation data comes from high-velocity, user-generated data streams, machine learning methods must balance model complexity, training time, and computational cost. Furthermore, when model freshness is critical, training becomes time-constrained. Parallelized offline batch training, although horizontally scalable, is often neither timely nor cost-effective.
Lambda Learner can incrementally train the memorization part of the model (the random-effects components) as a performance booster on top of the generalization part. Frequent updates to these booster models, over already-powerful fixed-effects models, improve personalization. It also enables applications that require online bandits that are updated quickly.
In the GAME paradigm, random effects components can be trained independently of each other. This means that their update can be easily parallelized across nodes in a distributed computation framework. For example, this library can be used on top of Python Beam or PySpark. The distributed compute framework is used for parallelization and data orchestration, while the Lambda Learner library implements the update of random effects in individual compute tasks (DoFns in Beam or Task closures in PySpark).
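Because each random effect depends only on its own entity's data, the per-key update is embarrassingly parallel. A framework-agnostic sketch of that shape, using a thread pool in place of Beam/Spark (`update_random_effect` is a stand-in for the library's per-entity trainer, not a real API):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def update_random_effect(entity_id, records):
    # Stand-in for a Lambda Learner per-entity training call; here we just
    # return the mini-batch size so the sketch stays self-contained.
    return entity_id, len(records)

# A mini-batch of (entity id, record) pairs from the stream.
records = [("user-1", 1.0), ("user-2", 0.0), ("user-1", 1.0)]

# Group the mini-batch by entity id; each group is an independent update task.
by_entity = defaultdict(list)
for entity_id, record in records:
    by_entity[entity_id].append(record)

# In Beam these tasks become DoFns, in PySpark Task closures; a thread pool
# shows the same fan-out/fan-in structure locally.
with ThreadPoolExecutor() as pool:
    updated = dict(pool.map(lambda item: update_random_effect(*item), by_entity.items()))
```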
Installation
```
pip install lambda-learner
```
Tutorial: How to use Lambda Learner
Prepare your dataset and initial model
Let's assume we have a minibatch of data, a random effects model for a specific key, and the already trained global fixed effects model. In order to use Lambda Learner, we need to format the data and model into appropriate data structures as follows:
```python
training_data: List[TrainingRecord] = ...
test_data: List[TrainingRecord] = ...
model_coefficients: List[Feature] = ...
model_coefficient_variances: List[Feature] = ...
```
A `TrainingRecord` represents a labeled example. The most important fields in this structure are:
- `label` => The datum label. For example, this could be binarized (0.0 or 1.0) for a classification task, or a real value for a regression task.
- `features` => A list of `Feature`s. `Feature` is a name-term-value representation, which we'll discuss next.
- `offset` => The score that the associated fixed-effects model produces for this datum. The score from a deep or non-linear fixed-effects model is captured in this one parameter, and the random-effects model is trained on the residual relative to it.
Both features (from the training data) and model coefficients are represented using the `Feature` class. `Feature` is a name-term-value (NTV) representation, where the name is the feature name, the term is a string index for the feature (supporting categorical and numerical vector features), and the value is the numerical value corresponding to a name-term pair. When a `Feature` is used to describe a model, the value is the coefficient weight.
Here's a toy example of data and a model using a single feature: a categorical representing a user's favorite season of the year. In actual practice, you would create these data structures by reading in external resources and wrangling them into this form.
```python
training_data = [
    TrainingRecord(
        label=1.0,
        weight=1.0,
        offset=0.6786987785,  # determined by scoring this example using your global model
        features=[
            # one feature with multiple terms, a categorical vector
            Feature("season", "winter", 1.0),
            Feature("season", "spring", 0.0),
            Feature("season", "summer", 0.0),
            Feature("season", "fall", 0.0),
        ],
    ),
    # more records...
]
```
```python
model_coefficients = [
    # All models need an intercept feature, corresponding to the `offset` field in the data.
    Feature("intercept", "intercept", 1.0),
    # one feature with multiple terms, a categorical vector
    Feature("season", "winter", 0.423),
    Feature("season", "spring", 0.564),
    Feature("season", "summer", 0.234),
    Feature("season", "fall", 0.0344),
]
```
In the future, other storage formats besides NTV may be supported.
Create an index map
NTV is a very human-readable format for representing model coefficients and data record features. However, in order to train the model, we need to transform both the model and the data into an indexed vector representation. An `IndexMap` is a bidirectional mapping between a name-term pair and an integer index, which we use to translate from the human-readable NTV representation to a trainable indexed representation.
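Conceptually, an index map is just a bidirectional dictionary between (name, term) pairs and dense integer indices. A minimal sketch of that idea (illustrative only; the library's `IndexMap` implementation and API differ):

```python
class SimpleIndexMap:
    """Bidirectional (name, term) <-> integer index mapping."""

    def __init__(self):
        self._to_index = {}       # (name, term) -> index
        self._to_name_term = []   # index -> (name, term)

    def index_of(self, name, term):
        # Assign the next free index on first sight of a (name, term) pair.
        key = (name, term)
        if key not in self._to_index:
            self._to_index[key] = len(self._to_name_term)
            self._to_name_term.append(key)
        return self._to_index[key]

    def name_term_of(self, index):
        return self._to_name_term[index]

imap = SimpleIndexMap()
winter_index = imap.index_of("season", "winter")
```

The reverse direction (`name_term_of`) is what later lets trained indexed coefficients be translated back into the human-readable NTV form.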
```python
index_map, index_map_metadata = IndexMap.from_records_means_and_variances(
    training_data, model_coefficients, model_coefficient_variances)
```
`index_map_metadata` contains index map statistics, which can be logged or used for monitoring.
Transform your model and data into an indexed representation
Now that we have an `index_map`, we can use helper functions from `representation_utils.py` to transform our data and model from NTV representations to indexed representations, as follows:
```python
indexed_training_data = nt_domain_data_to_index_domain_data(training_data, index_map)
indexed_test_data = nt_domain_data_to_index_domain_data(test_data, index_map)

regularization_penalty = 10.0
initial_model = nt_domain_coeffs_to_index_domain_coeffs(
    model_coefficients, model_coefficient_variances, index_map, regularization_penalty)
```
The data and model are now ready for training.
Perform training
To perform training, choose the `Trainer` subclass appropriate for your task:
- `TrainerSquareLossWithL2` for linear regression.
- `TrainerLogisticLossWithL2` or `TrainerSequentialBayesianLogisticLossWithL2` for classification.
```python
forgetting_factor = 0.8

lr_trainer = TrainerSequentialBayesianLogisticLossWithL2(
    training_data=indexed_training_data,
    initial_model=initial_model,
    penalty=regularization_penalty,
    delta=forgetting_factor)

updated_model, updated_model_loss, training_metadata = lr_trainer.train()
```
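The idea behind the sequential Bayesian update is that the previous coefficients act as a Gaussian prior whose influence is discounted by the forgetting factor, so older mini-batches gradually fade. A minimal sketch of that objective, assuming a diagonal prior precision and plain gradient descent (`prev_mean` and `prev_precision` are illustrative names, not the library's API; the library itself uses L-BFGS):

```python
import numpy as np

def sequential_bayesian_update(X, y, offsets, prev_mean, prev_precision,
                               delta=0.8, lr=0.5, steps=2000):
    """One incremental update: logistic loss on the new mini-batch plus a
    Gaussian prior centered at the previous coefficients, discounted by delta."""
    n = len(y)
    w = prev_mean.astype(float).copy()
    for _ in range(steps):
        # offsets carry the fixed-effects scores for each example
        p = 1.0 / (1.0 + np.exp(-(X @ w + offsets)))
        grad = X.T @ (p - y) / n + delta * prev_precision * (w - prev_mean) / n
        w -= lr * grad
    return w

# Tiny synthetic mini-batch: intercept + one feature, true slope is positive.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ np.array([0.0, 2.0]))))).astype(float)
w_new = sequential_bayesian_update(X, y, np.zeros(200), np.zeros(2), np.ones(2))
```

Setting delta below 1 weakens the prior from past batches, which is what keeps the model responsive to drift in the stream.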
`training_metadata` contains the metadata returned by the scipy `fmin_l_bfgs_b` optimizer, which can be logged or used when debugging; see the SciPy docs for more information.
`updated_model` is an `IndexedModel`, the result of this mini-batch training iteration.
Perform scoring and metric evaluation
Next we'll score our test set using `updated_model` and evaluate the model's performance. `evaluate` can compute several metrics in one go, but here we request just AUC (area under the ROC curve), a common binary classification metric.
```python
scores = score_linear_model(updated_model, indexed_test_data)
trained_model_metrics = evaluate(metric_list=['auc'], y_scores=scores, y_targets=indexed_test_data.y)
trained_model_auc = trained_model_metrics['auc']
```
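For intuition about the metric itself, AUC is the probability that a randomly chosen positive example outranks a randomly chosen negative one. A self-contained sketch (the library's `evaluate` presumably wraps a standard implementation; this pairwise version is just illustrative and is O(n^2)):

```python
def auc(y_scores, y_targets):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(y_scores, y_targets) if y == 1.0]
    neg = [s for s, y in zip(y_scores, y_targets) if y == 0.0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

example_auc = auc([0.9, 0.8, 0.3, 0.2], [1.0, 0.0, 1.0, 0.0])
```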
Transform your model back into a human-readable representation
Finally, we transform our model back into an NTV representation using another helper from `representation_utils.py`.
```python
means, variances = index_domain_coeffs_to_nt_domain_coeffs(updated_model, index_map)
```
`means` and `variances` represent the updated model coefficients and their variances. These can now be stored and subsequently used for inference, or further updated on the next data mini-batch.
Citing
Please cite Lambda Learner in your publications if it helps your research:
```bibtex
@misc{ramanath2020lambda,
    title={Lambda Learner: Fast Incremental Learning on Data Streams},
    author={Rohan Ramanath and Konstantin Salomatin and Jeffrey D. Gee and Kirill Talanine and Onkar Dalal and Gungor Polatkan and Sara Smoot and Deepak Kumar},
    year={2020},
    eprint={2010.05154},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```
Contributing
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
License
This project is licensed under the BSD 2-Clause License - see the LICENSE file for details.