MLPug

MLPug is a library for training and evaluating Machine Learning (ML) models, able to use different ML libraries as backends.

A lot of the functionality you need to train and evaluate your model is independent of the ML library you're using, such as PyTorch, Jax, Apple MLX or TinyGrad. MLPug aims to provide a single framework with a unified API for all such training and evaluation functionality, independent of the ML library you are using.

Thus, when switching ML libraries, you don't have to learn a new training API, and you can reuse your own training code with no, or minimal, changes! 🤩🎉

MLPug is at version 0.2!

MLPug is still in development. If you are having trouble using MLPug for your use case, or if you have found a bug, please file an issue.

Dive right in! Install MLPug as a Python package for your project:

pip install mlpug
# Using MLPug with PyTorch
import mlpug.pytorch as mlp
# Using MLPug with PyTorch/XLA (Training with PyTorch on TPUs)
import mlpug.pytorch.xla as mlp

Basic template of usage

# ################# SETUP ##################
# A TrainModel uses your model to calculate and return the training loss and other basic data
train_model = TrainModel(classifier)

# The trainer knows how to train the model on a batch
trainer = mlp.trainers.DefaultTrainer(
    optimizers=optimizer,
    model_components=classifier
)

# At minimum, you want to log the loss to track training progress
# By default the batch loss and the moving average of the loss are calculated and logged
loss_evaluator = mlp.evaluation.MetricEvaluator(trainer=trainer)
callbacks = [
    mlp.callbacks.TrainingMetricsLogger(metric_evaluator=loss_evaluator),
    # Calculate validation loss only once per epoch over the whole dataset
    mlp.callbacks.DatasetMetricsLogger(validation_dataset,
                                       'validation',
                                       metric_evaluator=loss_evaluator,
                                       batch_level=False),
    # Log the progress during training                                       
    mlp.callbacks.LogProgress(log_period=progress_log_period, set_names=['training', 'validation']),
]

manager = mlp.trainers.TrainingManager(trainer,
                                       training_dataset,
                                       num_epochs=num_epochs,
                                       callbacks=callbacks)

trainer.set_training_model(train_model)

# ################# START! #################
manager.start_training()
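
The basic template above assumes that the model, optimizer, datasets and a few settings have already been defined. As a minimal sketch of that setup, mirroring the PyTorch Hello World example explained later in this document (the dataset variables and the hyperparameter values are placeholders):

import torch

# The model to train and its optimizer (as in the Hello World example)
classifier = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(784, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10))

optimizer = torch.optim.Adam(classifier.parameters(), eps=1e-7)

# Iterable datasets that provide the training and validation batches
training_dataset = torch.utils.data.DataLoader(training_data, batch_size=64, num_workers=3)
validation_dataset = torch.utils.data.DataLoader(validation_data, batch_size=64, num_workers=3)

# Training settings used by the template above
num_epochs = 10
progress_log_period = 100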

Also see section Explaining the MLPug Hello World example.

Running the repository examples

You can find the example code in the examples directory of the MLPug repository.

In the MLPug examples package there are three examples:

  • A simple Hello World example
  • A Fashion MNIST example with multiple commandline options for common MLPug features
  • [PyTorch only] A complex example where we train a GPT2 chatbot based on the Persona dataset

To get started with the examples, clone the MLPug repo:

git clone https://github.com/nuhame/mlpug.git

To run the examples, be sure to be in the root of the MLPug repository.

cd mlpug

Ensure that the current directory (the MLPug repo root) is added to the Python path, so that Python can find the example scripts:

export PYTHONPATH=.:$PYTHONPATH

It is advised to run the examples in a Python virtual environment:

python3 -m venv ~/.virtualenvs/mlpug
# Activate the virtual environment
source ~/.virtualenvs/mlpug/bin/activate

In your Python >= 3.9 virtual environment, install the basic requirements:

pip install -r requirements.txt 

Examples of using MLPug with PyTorch

In addition to the PyTorch examples, there are equivalent examples for using MLPug with PyTorch/XLA; see the pytorch/xla directories of the examples. For XLA you need a TPU on Google Cloud, or you can use Google Colab. Here we focus on PyTorch.

Hello world

In addition to the default requirements, install the example-specific requirements for the Hello World example in your Python >= 3.9 virtual environment:

pip install -r requirements.txt
pip install -r examples/hello_world/pytorch/requirements.txt
# MLPug Hello World example
python examples/hello_world/pytorch/train.py

Fashion MNIST

For the Fashion MNIST example, install the following requirements:

pip install -r requirements.txt
pip install -r examples/fashion_mnist/pytorch/requirements.txt
# MLPug Fashion MNIST example
python examples/fashion_mnist/pytorch/train.py

# If you have multiple GPUs
python examples/fashion_mnist/pytorch/train.py --distributed

# Use a batch size of 32 with micro-batches of 8 for gradient accumulation (4 accumulation steps per optimizer update)
python examples/fashion_mnist/pytorch/train.py --batch-size 32 --micro-batch-size 8

Some other useful flags are:

  • --use-mixed-precision: When this flag is set, mixed precision will be applied during training
  • --eager-mode: When this flag is set, the forward and backward computation graphs will NOT be compiled (i.e. eager mode)

Run train.py -h for all options.
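
For example, combining these flags with the options shown earlier (an illustrative invocation, not a recommended configuration):

python examples/fashion_mnist/pytorch/train.py --distributed --batch-size 32 --micro-batch-size 8 --use-mixed-precision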

Training a Persona Chatbot based on GPT2

In addition to the default requirements, install the example-specific requirements in your Python >= 3.9 virtual environment:

pip install -r requirements.txt 
pip install -r examples/persona_chatbot/requirements.txt 
pip install -r examples/persona_chatbot/pytorch/requirements.txt 

What micro-batch size to use for gradient accumulation depends on your system. The following setup assumes multiple GPUs (--distributed).

python examples/persona_chatbot/pytorch/train.py \
                 --experiment-name persona-bot-experiment \
                 --num-dataloader-workers 2 \
                 --num-choices 8 \
                 --sequence-length-outlier-threshold 0.05 \
                 --batch-size 32 \
                 --micro-batch-size 8 \
                 --learning-rate 1e-4 \
                 --distributed \
                 --num-epochs 6 \
                 --progress-log-period 10

Run train.py -h to see all options.

What is MLPug?

MLPug is a library for training and evaluating Machine Learning (ML) models, able to use different ML libraries as backends.

A lot of the functionality you need to train and evaluate your machine learning model is independent of the machine learning library you're using, such as PyTorch, Jax, Apple MLX or TinyGrad. For instance,

  • checkpoint management,
  • evaluation of validation set loss and other custom metrics,
  • progress logging,
  • progress visualization using Tensorboard,
  • the use of gradient accumulation to train with large batch sizes using limited GPU memory, etc.

You need such functionality no matter what machine learning framework you are using.

MLPug aims to provide a single framework with a unified API for all such training and evaluation functionality, independent of the ML library you are using. This also implies that when you switch libraries, you can reuse your training code with no, or minimal, changes.
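
In practice, switching backends largely comes down to changing the backend import; the training code written against the mlp namespace stays the same. A sketch using the two currently supported backends (assuming the model and optimizer are defined as in the Hello World example):

# PyTorch backend
import mlpug.pytorch as mlp

# ... or, to train on TPUs, the PyTorch/XLA backend instead:
# import mlpug.pytorch.xla as mlp

# The trainer, callbacks and TrainingManager are created through the same
# unified mlp API, regardless of which backend was imported
trainer = mlp.trainers.DefaultTrainer(
    optimizers=optimizer,
    model_components=classifier
)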

Supported machine learning libraries

Currently, MLPug supports the following deep learning/machine learning libraries:

  • PyTorch
  • PyTorch/XLA (Training with PyTorch on TPUs)

The ambition is to also add Jax, Apple MLX and TinyGrad as backends.

MLPug focus

Although MLPug should be able to deal with any training job, its functionality is mostly focused on training large models on large datasets, using limited hardware (GPU or TPU) resources and memory.

Detailed documentation

The following sections are documentation ToDos, but they provide insight into MLPug's features:

  • Feature parity list
  • The logs object
  • Callbacks and the training life cycle
  • Progress Logging
  • Model components vs Training model
  • Distributed training
  • Checkpoint management
      • Using the CheckpointManager
      • Using training checkpoints
      • Using model checkpoints
      • Checkpointing on error or interrupt
  • MLPug metric evaluators
      • Auxiliary batch training results
      • Calculating custom metrics
      • Conditional computation of metrics
  • Gradient Accumulation with Micro-batches
  • Using Tensorboard
      • Tensorboard made easy with AutoTensorboard
      • More fine grained control
  • Learning Rate Scheduling
  • Multi GPU training
  • Mixed Precision Training
  • CUDA Memory tools
  • Using multiple optimizers

Explaining the MLPug Hello World example

Hello World with PyTorch

To use MLPug with PyTorch:

import mlpug.pytorch as mlp

Before we can start training, we need an iterable dataset that can provide our training batches.

training_dataset = torch.utils.data.DataLoader(training_data,
                                               batch_size=batch_size,
                                               shuffle=False,
                                               num_workers=3)

... and a model we want to train

classifier = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(784, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10))

MLPug needs a way to evaluate the loss of the model. One way to do that is to define a TrainModel that outputs the loss

class TrainModel(torch.nn.Module):
    def __init__(self, classifier):
        super(TrainModel, self).__init__()

        self.classifier = classifier
        self.loss_func = torch.nn.CrossEntropyLoss()

    def forward(self, batch_data, evaluate_settings, inference_mode=None):
        images, true_labels = batch_data

        logits = self.classifier(images)
        return self.loss_func(logits, true_labels)

train_model = TrainModel(classifier)

To train the model we will also need an optimizer

optimizer = torch.optim.Adam(classifier.parameters(), eps=1e-7)

To use MLPug to start training, we need to create a Trainer, which will be used by a TrainingManager.

trainer = mlp.trainers.DefaultTrainer(
    optimizers=optimizer, 
    model_components=classifier
)

MLPug uses a callback system that allows you to customize and extend the training functionality. The callback instances you provide to the TrainingManager will be called, through hooks, at different stages of the training process.

# At minimum, you want to log the loss to track training progress
# By default the batch loss and the moving average of the loss are calculated and logged
loss_evaluator = mlp.evaluation.MetricEvaluator(trainer=trainer)
callbacks = [
    mlp.callbacks.TrainingMetricsLogger(metric_evaluator=loss_evaluator),
    # Calculate validation loss only once per epoch over the whole dataset
    mlp.callbacks.DatasetMetricsLogger(validation_dataset,
                                       'validation',
                                       metric_evaluator=loss_evaluator,
                                       batch_level=False),
    # Log the progress during training
    mlp.callbacks.LogProgress(log_period=progress_log_period, set_names=['training', 'validation']),
]

The TrainingMetricsLogger and the DatasetMetricsLogger callback instances log training and validation set loss values in a logs object that is passed through all callbacks during training. The LogProgress callback instance logs the metric values stored in the received logs object to the console (stdout).

We can now instantiate the TrainingManager and pass it the trainer.

manager = mlp.trainers.TrainingManager(trainer,
                                       training_dataset,
                                       num_epochs=num_epochs,
                                       callbacks=callbacks)

Before we can start training, we still have to provide the train_model to the trainer.

trainer.set_training_model(train_model)

The final step is to actually start training:

manager.start_training()

Running examples/hello_world/pytorch/train.py finishes like this:

###############################################################################
Epoch 9/9	READY - Duration 0:00:08
Computed over sliding window:
training       : loss          0.239.

Computed over dataset:
validation     : loss          0.355.



INFO    : TrainingManager::_train : Training completed. All good! ❤️

Using the classifier ...
real label = 9, predicted label = 9
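
The "Using the classifier" lines at the end of the output are produced with plain PyTorch inference. A minimal sketch of such a step (the names image and real_label are hypothetical, standing for a single test sample and its label):

classifier.eval()
with torch.no_grad():
    # Add a batch dimension; the Flatten layer takes care of the rest
    logits = classifier(image.unsqueeze(0))
    predicted_label = torch.argmax(logits, dim=1).item()

print(f"real label = {real_label}, predicted label = {predicted_label}")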

Hello World with PyTorch/XLA

The Hello World example with PyTorch/XLA is largely the same as with PyTorch. There are only two small differences.

To use MLPug with PyTorch/XLA, load the correct backend:

import mlpug.pytorch.xla as mlp

Load your model on a TPU core:

import torch_xla.core.xla_model as xm

...

device = xm.xla_device()

train_model = TrainModel(classifier, device)
classifier.to(device)
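
Note that in the XLA variant, TrainModel receives the device, so it can move the batch data onto the TPU core before the forward pass. A sketch of how its constructor and forward might look, following the structure of the PyTorch TrainModel above (an illustration under that assumption, not the exact example code):

class TrainModel(torch.nn.Module):
    def __init__(self, classifier, device):
        super(TrainModel, self).__init__()

        self.classifier = classifier
        self.device = device
        self.loss_func = torch.nn.CrossEntropyLoss()

    def forward(self, batch_data, evaluate_settings, inference_mode=None):
        images, true_labels = batch_data

        # Move the batch to the TPU core the model was loaded on
        images = images.to(self.device)
        true_labels = true_labels.to(self.device)

        logits = self.classifier(images)
        return self.loss_func(logits, true_labels)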

Feature parity list

Feature                                  | PyTorch | PyTorch/XLA | JAX | Comments
-----------------------------------------|---------|-------------|-----|--------------------------------
Callbacks and training life cycle        |         |             |     |
Progress Logging                         |         |             |     |
Distributed training                     |         |             |     | Multi-GPU and multi-TPU support
Distributed evaluation                   |         |             |     | Multi-GPU and multi-TPU support
Model and training checkpoint management |         |             |     |
Custom metric evaluation                 |         |             |     |
Conditional evaluation of metrics        |         |             |     |
Gradient accumulation (micro-batches)    |         |             |     |
Tensorboard support                      |         |             |     | Might be refactored
Learning Rate scheduling                 |         |             |     | Might be refactored
Mixed Precision Training                 |         |             |     |
Using multiple optimizers                |         |             |     |
