Deep Learning and HPC Starter Pack

Project description

Aims of this library:

To be simple and easy to understand so that the focus is on the data science.
To reduce the time taken from implementation to results.
To promote rapid innovation of models via configuration files, class composition and/or class inheritance.
To reduce boilerplate code (sections of code that are repeated in multiple places with little to no variation).
To simplify cluster management and distributed computing with High Performance Computing (HPC).
To easily accommodate multiple research avenues simultaneously.
To cooperatively improve the functionality and documentation of this repository to make it better!

Features:

The PyTorch Lightning LightningModule and Trainer are used to implement, train, and test models. This accomplishes many of the above aims, such as simplified distributed computing and reduced boilerplate code. It also lets us use class inheritance and composition directly, allowing for rapid innovation.
The Compose API of Hydra is used to create a hierarchical configuration, allowing for rapid innovation.
Neptune.ai is used to track experiments; metric scores are automatically uploaded to Neptune.ai, allowing you to easily track your experiments from your browser.
Scripts for submission to a cluster manager, such as SLURM, are written for you. Cluster manager jobs are also automatically resubmitted and resumed if they have not finished before the time limit.

Installation

The Deep Learning and HPC starter pack is available on PyPI:

pip install dlhpcstarter

How to structure your project
Package map
Tasks
Models
Innovate via Model Composition and Inheritance
Configuration YAML files and argparse
Innovate via Configuration Files
Next level: Configuration composition via Hydra
Stages and Trainer
Tying it all together: dlhpcstarter
Cluster manager and distributed computing
Monitoring using Neptune.ai
Where all the outputs go: exp_dir
Repository Wish List

How to structure your project

There will be a task directory containing each of your tasks, e.g., cifar10. For each task, you will have a set of configurations and models, which are stored in the config and models directories, respectively. Each task will also have a stages module for each stage of model development.

├──  task  
│    │
│    └── TASK_NAME     - name of the task, e.g., cifar10.
│        └── config    - .yaml configuration files for a model.
│        └── models    - .py modules that contain pytorch_lightning.LightningModule definitions that represent models.
│        └── stages.py - training and testing stages for a task.
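
The layout above can be created by hand, but as a minimal sketch it could also be scaffolded with a few lines of Python. The helper name `scaffold_task` is hypothetical and is not part of dlhpcstarter; it only creates the empty directories and an empty stages.py shown in the tree.

```python
from pathlib import Path
import tempfile


def scaffold_task(root: Path, task_name: str) -> Path:
    """Create task/TASK_NAME with config/, models/ and an empty stages.py.

    Hypothetical helper for illustration; not provided by dlhpcstarter.
    """
    task_dir = root / "task" / task_name
    (task_dir / "config").mkdir(parents=True, exist_ok=True)  # .yaml configurations
    (task_dir / "models").mkdir(parents=True, exist_ok=True)  # LightningModule definitions
    (task_dir / "stages.py").touch(exist_ok=True)  # training and testing stages
    return task_dir


if __name__ == "__main__":
    # Scaffold into a temporary directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        task_dir = scaffold_task(Path(tmp), "cifar10")
        print(sorted(p.name for p in task_dir.iterdir()))
```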

Package map

The package is structured as follows:

├──  dlhpcstarter
│    │
│    ├── tools                     - for all other modules; tools that are repeatedly used.
│    ├──  __main__.py - __main__.py does the following:
│    │               1. Reads command line arguments using argparse.
│    │               2. Imports the 'stages' function for the task from task/TASK_NAME/stages.py.
│    │               3. Loads the specified configuration .yaml for the job from task/TASK_NAME/config.
│    │               4. Submits the job (the configuration + 'stages') to the cluster manager (or runs it locally if 'submit' is false).
│    └── cluster.py                - contains the cluster management object.
│    └── command_line_arguments.py - argparse for reading command line arguments.
│    └── trainer.py                - contains a wrapper for pytorch_lightning.Trainer.
│    └── utils.py                  - small utility definitions.

Tasks

Tasks are named based on the data and the type of prediction or inference being made. For example:

Two tasks may have the same data but require different names due to differing predictions, e.g., MS-COCO Detection and MS-COCO Caption.
Two tasks may have similar predictions but require different names due to differing data, e.g., MNIST and Chinese MNIST.

Some publicly available tasks include:

Image classification tasks, e.g., MNIST, CIFAR10, CIFAR100, ImageNet.
Object detection tasks, e.g., MS-COCO Detection.
Image captioning tasks, e.g., MS-COCO Caption.
Speech recognition tasks, e.g., LibriSpeech.
Chest X-Ray report generation, e.g., MIMIC-CXR.

How to add a task:

Adding a task is as simple as creating a directory with the name of the task in task. For example, if we choose CIFAR10 as the task, with the task name cifar10, then we would create the directory task/cifar10. The task directory will then house everything necessary for that task, for example, the models, the configurations for the models, the data pipeline, and the stages of development (training and testing).

Models

Please familiarise yourself with the pytorch_lightning.LightningModule in order to correctly implement a model: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html

Once we have created our task directory (e.g., task/cifar10), we now want to create a model using a pytorch_lightning.LightningModule. Everything we need for the model can be placed in the LightningModule, including commonly used libraries and objects, for example:

torch.nn.Module: base class for all neural networks in PyTorch.
transformers: a library containing pre-trained Transformer models.
torchvision: a library for image pre-processing and pre-trained computer vision models.
torch.utils.data.Dataset: an object that processes each instance of a dataset.
torch.utils.data.DataLoader: an object that samples mini-batches from a torch.utils.data.Dataset.

Note:

The data pipeline could be implemented within the LightningModule or separately from the model using a pytorch_lightning.LightningDataModule. The LightningDataModule instance would then have to be given separately to the pytorch_lightning.Trainer.

Example:

An example model for cifar10 is in task/cifar10/model/baseline.py.

Innovate via Model Composition and Inheritance

To promote rapid innovation of models, we recommend using class composition and/or inheritance. For example, we may have a baseline that not only includes a basic model, but also the data pipeline:

from pytorch_lightning import LightningModule
from torch.utils.data import DataLoader, random_split
import torchvision
import torch

class Baseline(LightningModule):
    def __init__(self, lr, ..., **kwargs):
        super(Baseline, self).__init__()
        self.save_hyperparameters()
        self.lr = lr
        self.model = torchvision.models.resnet18(...)

    def setup(self, stage=None):
        if stage == 'fit' or stage is None:
            train_set = torchvision.datasets.CIFAR10(...)
            self.train_set, self.val_set = random_split(train_set, [45000, 5000])

        if stage == 'test' or stage is None:
            self.test_set = torchvision.datasets.CIFAR10(...)

    def train_dataloader(self, shuffle=True):
        return DataLoader(self.train_set, ...)

    def val_dataloader(self):
        return DataLoader(self.val_set, ...)

    def test_dataloader(self):
        return DataLoader(self.test_set, ...)

    def configure_optimizers(self):     
        optimiser = {'optimizer': torch.optim.SGD(self.parameters(), lr=self.lr, momentum=0.9)}
        return optimiser

    def forward(self, images):
        return self.model(images)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        y_hat = self(images)
        loss = self.loss(y_hat, labels)
        self.log_dict({'train_loss': loss}, ...)
        return loss

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        y_hat = self(images)
        loss = self.loss(y_hat, labels)
        self.val_accuracy(torch.argmax(y_hat, dim=1), labels)  # torchvision models return a logits tensor.
        self.log_dict({'val_acc': self.val_accuracy, 'val_loss': loss}, ...)

    def test_step(self, batch, batch_idx):
        images, labels = batch
        y_hat = self(images)
        self.test_accuracy(torch.argmax(y_hat, dim=1), labels)  # torchvision models return a logits tensor.
        self.log_dict({'test_acc': self.test_accuracy}, ...)

After training and testing the baseline, we may want to improve upon its performance. For example, if we wanted to make the following modifications:

Use a DenseNet instead of a ResNet.
Use the AdamW optimiser.
Use a warmup learning rate scheduler.

All we would need to do is inherit the baseline and make our modifications:

from transformers import get_constant_schedule_with_warmup

class Inheritance(Baseline):

    def __init__(self, num_warmup_steps, **kwargs):
        super(Inheritance, self).__init__(**kwargs)
        self.save_hyperparameters()
        self.num_warmup_steps = num_warmup_steps
        self.model = torchvision.models.densenet121(...)

    def configure_optimizers(self):
        optimiser = {'optimizer': torch.optim.AdamW(self.parameters(), lr=self.lr)}
        optimiser['scheduler'] = {
                'scheduler': get_constant_schedule_with_warmup(optimiser['optimizer'], self.num_warmup_steps),
                'interval': 'step',
                'frequency': 1,
            }
        return optimiser
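
To give a feel for what a constant schedule with warmup does to the learning rate, here is a plain-Python sketch of the multiplier that is applied to the base learning rate at each optimiser step. It illustrates the shape of the schedule only (a linear ramp from 0 to 1 over the warmup steps, then constant) and is not the transformers implementation; the function name `constant_warmup_multiplier` is hypothetical.

```python
def constant_warmup_multiplier(step: int, num_warmup_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimiser step."""
    if step < num_warmup_steps:
        # Linear ramp during warmup; max() guards against num_warmup_steps == 0.
        return step / max(1, num_warmup_steps)
    # After warmup the learning rate stays at its base value.
    return 1.0


if __name__ == "__main__":
    base_lr = 1e-3
    for step in (0, 50, 100, 200):
        print(step, base_lr * constant_warmup_multiplier(step, num_warmup_steps=100))
```

Because 'interval' is set to 'step' in the scheduler configuration above, the multiplier would be updated after every optimiser step rather than once per epoch.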

We could also construct a model that is the combination of the two via composition. For example, we may want to use everything from Baseline, but the optimiser from Inheritance:

from pytorch_lightning import LightningModule

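class Composite(LightningModule):
    def __init__(self, num_warmup_steps, **kwargs):
        super().__init__()
        self.baseline = Baseline(**kwargs)  # lr and the other arguments are forwarded via kwargs.
        self.lr = self.baseline.lr  # required by Inheritance.configure_optimizers().
        self.num_warmup_steps = num_warmup_steps  # required by Inheritance.configure_optimizers().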

    def setup(self, stage=None):
        self.baseline.setup(stage)

    def train_dataloader(self, shuffle=True):
        return self.baseline.train_dataloader(shuffle)

    def val_dataloader(self):
        return self.baseline.val_dataloader()

    def test_dataloader(self):
        return self.baseline.test_dataloader()

    def configure_optimizers(self):     
        return Inheritance.configure_optimizers(self)  # Use configure_optimizers() from Inheritance.

    def forward(self, images):
        return self.baseline.forward(images)

    def training_step(self, batch, batch_idx):
        return self.baseline.training_step(batch, batch_idx)

    def validation_step(self, batch, batch_idx):
        return self.baseline.validation_step(batch, batch_idx)

    def test_step(self, batch, batch_idx):
        return self.baseline.test_step(batch, batch_idx)

Configuration YAML files and argparse

Currently, there are two methods for giving arguments:

Via command line arguments using the argparse module. argparse mainly handles paths, development stage flags (e.g., training and testing flags), and cluster manager arguments.
Via a configuration file stored in YAML format. This can handle all the arguments defined by argparse and more, including hyperparameters for the model.

The mandatory arguments include:

task, the name of the task.
config, the name of the configuration (no extension).
module, the name of the module in which the model definition is housed.
definition, the name of the class representing the model.
exp_dir, the experiment directory, i.e., where all outputs, including model checkpoints will be saved.
monitor, metric to monitor for ModelCheckpoint and EarlyStopping (optional), as well as test checkpoint loading (e.g., 'val_loss').
monitor_mode, whether the monitored metric is to be maximised or minimised ('max' or 'min').
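
As a rough illustration of how monitor and monitor_mode interact, the sketch below picks the best checkpoint from a set of monitored scores. It only mirrors the 'max'/'min' semantics described above; it is not how ModelCheckpoint is implemented, and the helper name `best_checkpoint` and the checkpoint names are made up for the example.

```python
def best_checkpoint(scores: dict, monitor_mode: str) -> str:
    """Return the checkpoint whose monitored score is best under monitor_mode.

    scores maps a checkpoint name to the value of the monitored metric.
    """
    if monitor_mode not in ("max", "min"):
        raise ValueError("monitor_mode must be 'max' or 'min'")
    pick = max if monitor_mode == "max" else min
    return pick(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical validation accuracies, so the metric is maximised.
    val_acc = {"epoch=0.ckpt": 0.61, "epoch=1.ckpt": 0.74, "epoch=2.ckpt": 0.70}
    print(best_checkpoint(val_acc, monitor_mode="max"))
```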

task and config must be given as command line arguments for argparse:

dlhpcstarter --config baseline --task cifar10

module, definition, and exp_dir can be given either as command line arguments, or be placed in the configuration file.

For each model of a task, we define a configuration. Hyperparameters, paths, as well as the device configuration can be stored in a configuration file. Configuration files have the following strict requirements:

They are stored in the config directory of a task, e.g., task/cifar10/config.
They are stored in YAML format, e.g., task/cifar10/config/baseline.yaml.
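
The two requirements above, together with the list of mandatory arguments, can be sketched as a small check. This is an illustration under stated assumptions rather than the package's own validation: the helper names `config_path` and `missing_arguments` are hypothetical, and the tuple of mandatory names is simply copied from the list in this section.

```python
from pathlib import Path

# Mandatory argument names as listed in this section.
MANDATORY = ("task", "config", "module", "definition", "exp_dir", "monitor", "monitor_mode")


def config_path(task: str, config: str, root: Path = Path("task")) -> Path:
    """Build the expected location of a configuration, e.g. task/cifar10/config/baseline.yaml."""
    return root / task / "config" / f"{config}.yaml"


def missing_arguments(args: dict) -> list:
    """Return the mandatory arguments that are absent or set to None."""
    return [name for name in MANDATORY if args.get(name) is None]


if __name__ == "__main__":
    print(config_path("cifar10", "baseline").as_posix())
    print(missing_arguments({"task": "cifar10", "config": "baseline"}))
```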

Innovate via Configuration Files

Suppose we have the following configuration file for the aforementioned CIFAR10 Baseline model, task/cifar10/config/baseline.yaml:

train: True
test: True
module: baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-3
max_epochs: 32
mbatch_size: 32
num_workers: 5
exp_dir: /my/experiment/directory
dataset_dir: /my/datasets/directory

Another way we can improve upon the baseline model, i.e., the baseline configuration, is by modifying its hyperparameters. For example, we can still use Baseline, but alter the learning rate in task/cifar10/config/baseline_rev_a.yaml:

train: True
test: True
module: baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-4  # modify this.
max_epochs: 32
mbatch_size: 32
num_workers: 5
exp_dir: /my/experiment/directory
dataset_dir: /my/datasets/directory

dlhpcstarter --config baseline_rev_a --task cifar10

Next level: Configuration composition via Hydra

If your new configuration only modifies a few arguments of another configuration file, you can take advantage of the composition feature of Hydra. This makes creating task/cifar10/config/baseline_rev_a.yaml from the previous section easy. We simply add the arguments from task/cifar10/config/baseline.yaml by adding its name to the defaults list:

defaults:
  - baseline
  - _self_

lr: 1e-4

Note that other configuration files are imported with reference to the current configuration path (not the working directory).
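
The override order implied by the defaults list can be pictured with plain dictionaries: entries are merged top to bottom, so placing _self_ last lets the current file override what it imports. The sketch below is a deliberately simplified, shallow-merge illustration of that ordering and is not Hydra's implementation; the function name `compose` and its arguments are hypothetical.

```python
def compose(defaults: list, own: dict, library: dict) -> dict:
    """Merge configurations in defaults-list order; later entries win.

    own is the content of the current file, library maps configuration
    names to their content. Shallow merge only, for illustration.
    """
    merged = {}
    for name in defaults:
        merged.update(own if name == "_self_" else library[name])
    return merged


if __name__ == "__main__":
    library = {"baseline": {"lr": 1e-3, "max_epochs": 32, "definition": "Baseline"}}
    rev_a = compose(["baseline", "_self_"], own={"lr": 1e-4}, library=library)
    print(rev_a)
```

With _self_ listed before baseline instead, the imported value of lr would win, which is why _self_ is placed last in the examples here.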

Please note that groups are not being used, and packages should be placed using @_global_ if the configurations being used for composition are not in the same directory. For example, the following would not work with this repository as the arguments in hpc_paths will be grouped under paths:

defaults:
  - paths/hpc_paths
  - _self_

train: True
test: True
resumable: True
module: baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-3
max_epochs: 3
mbatch_size: 32
num_workers: 5

To get around this, simply place @_global_ to remove the grouping:

defaults:
  - paths/hpc_paths@_global_  # changed here to remove "paths" grouping.
  - _self_

train: True
test: True
resumable: True
module: baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-3
max_epochs: 3
mbatch_size: 32
num_workers: 5
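
The effect of @_global_ can be sketched with dictionaries as well. By default, the content of a configuration in a sub-directory ends up nested under the directory name, whereas @_global_ places it at the top level, which is where this package expects its arguments. The helper `place` below is a hypothetical illustration of that difference for a single-level group, not Hydra code.

```python
from typing import Optional


def place(content: dict, group: str, package: Optional[str] = None) -> dict:
    """Return content as it would sit in the final configuration.

    Without a package override the content is nested under the group name;
    with package='_global_' it is placed at the top level.
    """
    target = group if package is None else package
    if target == "_global_":
        return dict(content)
    return {target: dict(content)}


if __name__ == "__main__":
    hpc_paths = {"exp_dir": "/path/to/my/experiments"}
    print(place(hpc_paths, group="paths"))                      # nested under "paths"
    print(place(hpc_paths, group="paths", package="_global_"))  # top level
```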

This also allows us to organise configurations easily. For example, if we have the following directory structure:

├── task
│   └──  cifar10          
│        └── config  
│            ├── cluster
│            │    ├── 2hr.yaml
│            │    └── 24hr.yaml
│            │
│            ├── distributed
│            │    ├── 1gpu.yaml
│            │    ├── 4gpu.yaml
│            │    └── 4gpu4node.yaml
│            │
│            ├── paths
│            │    ├── local.yaml
│            │    └── hpc.yaml
│            │
│            └── baseline.yaml

With task/cifar10/config/baseline.yaml as:

defaults:
  - cluster/2hr@_global_
  - distributed/4gpu@_global_
  - paths/hpc@_global_
  - _self_

train: True
test: True
resumable: True
module: baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-3
max_epochs: 3
mbatch_size: 32
num_workers: 5

Where task/cifar10/config/baseline.yaml will now include arguments from the following example sub-configurations:

task/cifar10/config/cluster/2hr.yaml:

memory: 32GB
time_limit: '02:00:00'
venv_path: /path/to/my/venv/bin/activate

task/cifar10/config/distributed/4gpu.yaml:
```
num_gpus: 4
strategy: ddp
```

task/cifar10/config/paths/hpc.yaml:

exp_dir: /path/to/my/experiments
dataset_dir: /path/to/my/dataset

See the Hydra documentation on the defaults list and packages for more information.

Stages and Trainer

In each task directory is a Python module called stages.py, which contains the stages definition. This definition takes an object as input that houses the configuration for a job.

Typically, the following things happen in stages():

The LightningModule model is imported via the model argument, e.g.,

from src import importer

Model = importer(definition=args.definition, module='.'.join(['task', args.task, 'model', args.module]))
model = Model(**vars(args))

See src.utils.importer for a handy function that imports based on strings.

A pytorch_lightning.Trainer instance is created, e.g., trainer = pytorch_lightning.Trainer(...).
The model is trained using trainer: trainer.fit(model).
The model is tested using trainer: trainer.test(model).

The stages definition thus handles the training and testing of a model for a task by using a pytorch_lightning.Trainer.
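
Importing a definition from strings, as the importer mentioned above does, can be done with the standard library. The following is a minimal sketch of the idea using importlib; it is not the package's own importer, and the name `import_definition` is hypothetical. The example imports `loads` from the standard `json` module only so that it runs anywhere.

```python
import importlib


def import_definition(module: str, definition: str):
    """Import `definition` (a class or function) from the dotted path `module`."""
    return getattr(importlib.import_module(module), definition)


if __name__ == "__main__":
    # For a task, module could be a dotted path built from the task and
    # module names, and definition the name of the model class.
    loads = import_definition(module="json", definition="loads")
    print(loads("[1, 2]"))
```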

A helpful wrapper at src/trainer.py exists that passes frequently used and useful callbacks, loggers, and plugins to a pytorch_lightning.Trainer instance:

from src.dlhpcstarter.trainer import trainer_instance

trainer = trainer_instance(**vars(args))

Place any of the parameters for the trainer detailed at https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#trainer-class-api in your configuration file, and they will be passed to the pytorch_lightning.Trainer instance.

Tying it all together: `dlhpcstarter`

This is an overview of what occurs when the entrypoint dlhpcstarter is executed. Understanding it is not necessary to use the package.

dlhpcstarter does the following:

Gets the command line arguments using argparse, e.g., arguments like this:
```
dlhpcstarter --config baseline --task cifar10
```
Imports the stages definition for the task using src.utils.importer.
Reads the configuration .yaml and combines it with the command line arguments.
Submits stages to the cluster manager if args.submit is True, or runs stages locally otherwise. The command line arguments and the configuration arguments are passed to stages in both cases.
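
The steps above can be sketched roughly as follows. This is an illustration only, not the code of the entrypoint: the function names are hypothetical, only a handful of arguments are shown, the configuration is passed in as an already-loaded dictionary, and letting command line values take precedence over the configuration file is an assumption.

```python
import argparse


def parse_command_line(argv=None) -> argparse.Namespace:
    """Step 1: read the command line arguments (only a few are sketched here)."""
    parser = argparse.ArgumentParser(prog="dlhpcstarter-sketch")
    parser.add_argument("--task", required=True)
    parser.add_argument("--config", required=True)
    parser.add_argument("--exp-dir", dest="exp_dir", default=None)
    parser.add_argument("--submit", default=None,
                        type=lambda s: s.lower() in ("1", "true"))
    return parser.parse_args(argv)


def combine(cmd_args: argparse.Namespace, config: dict) -> argparse.Namespace:
    """Step 3: combine the configuration with the command line arguments.

    Assumption: values given on the command line override the file.
    """
    merged = dict(config)
    merged.update({k: v for k, v in vars(cmd_args).items() if v is not None})
    return argparse.Namespace(**merged)


def run(args: argparse.Namespace, stages, submit):
    """Step 4: submit to the cluster manager, or run stages locally."""
    if getattr(args, "submit", False):
        return submit(args, stages)
    return stages(args)


if __name__ == "__main__":
    cmd_args = parse_command_line(["--config", "baseline", "--task", "cifar10"])
    args = combine(cmd_args, {"lr": 1e-3, "exp_dir": "/my/experiment/directory"})
    run(args, stages=lambda a: print("running locally:", a.task, a.config),
        submit=lambda a, s: print("submitting to the cluster manager"))
```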

Cluster manager and distributed computing

The following arguments are used for distributed computing:

Argument	Description	Default
`num_workers`	No. of workers per DataLoader & GPU.	`1`
`num_gpus`	Number of GPUs per node.	`None`
`num_nodes`	Number of nodes (should only be used with `submit = True`).	`1`

The following arguments are used to configure a job for a cluster manager (the default cluster manager is SLURM):

Argument	Description	Default
`memory`	Amount of memory per node.	`'16GB'`
`time_limit`	Job time limit.	`'02:00:00'`
`submit`	Submit job to the cluster manager.	`None`
`resumable`	Resumable training; Automatic resubmission to cluster manager.	`None`
`qos`	Quality of service.	`None`
`begin`	When to begin the Slurm job, e.g. `now+1hour`.	`None`
`email`	Email for cluster manager notifications.	`None`
`venv_path`	Path to the `bin/activate` script of a venv.	`None`

These can be given as command line arguments:

dlhpcstarter --config baseline --task cifar10 --submit 1 --num-gpus 4 --num-workers 5 --memory 32GB

Or they can be placed in the configuration .yaml file:

num_gpus: 4  # Added.
num_workers: 5  # Added.
memory: '32GB'  # Added.

train: True
test: True
module: baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-3
max_epochs: 32
mbatch_size: 32
exp_dir: /my/experiment/directory
dataset_dir: /my/datasets/directory

And executed with:

dlhpcstarter --config baseline --task cifar10 --submit True

If using a cluster manager, add the path to the bin/activate of your virtual environment:

...
venv_path: /my/env/name/bin/activate
...
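
To make the role of these arguments more concrete, the sketch below turns them into a SLURM batch script. It shows one plausible mapping onto standard sbatch options and is not the script that the package generates; the function name `slurm_script` is hypothetical, and values such as the memory string are passed through verbatim, so check that your cluster accepts the format you use.

```python
def slurm_script(command, memory="16GB", time_limit="02:00:00", num_nodes=1,
                 num_gpus=None, qos=None, begin=None, email=None, venv_path=None):
    """Build a SLURM batch script from the cluster arguments (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={num_nodes}",
        f"#SBATCH --mem={memory}",       # memory per node
        f"#SBATCH --time={time_limit}",  # job time limit
    ]
    if num_gpus:
        lines.append(f"#SBATCH --gres=gpu:{num_gpus}")  # GPUs per node
    if qos:
        lines.append(f"#SBATCH --qos={qos}")
    if begin:
        lines.append(f"#SBATCH --begin={begin}")
    if email:
        lines.append(f"#SBATCH --mail-user={email}")
    if venv_path:
        lines.append(f"source {venv_path}")  # activate the virtual environment
    lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(slurm_script("dlhpcstarter --config baseline --task cifar10",
                       memory="32GB", num_gpus=4,
                       venv_path="/my/env/name/bin/activate"))
```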

Monitoring using Neptune.ai

Simply sign up at https://neptune.ai/ and add your username and API token to your configuration file:

...
neptune_username: my_username
neptune_api_key: df987y94y2q9hoiusadhc9wy9tr82uq408rjw98ch987qwhtr093q4jfi9uwehc987wqhc9qw4uf9w3q4h897324th
...

The PyTorch Lightning Trainer will then automatically upload metrics using the Neptune Logger to Neptune.ai. Once logged in to https://neptune.ai/, you will be able to monitor your task. See here for information about using the online UI: https://docs.neptune.ai/you-should-know/displaying-metadata.

Where all the outputs go: `exp_dir`

The experiment directory is where all your outputs will be saved, including model checkpoints and metric scores. This is also where the cluster manager script, stderr, and stdout are saved.

Note: the trial number also sets the seed number for your experiment.

***Description to be finished.

Repository Wish List

Add description about how to use https://neptune.ai/.
Use https://hydra.cc/ instead of argparse (or have the option to use either).
https://docs.ray.io/en/latest/tune/index.html for hyperparameter optimisation.
Notebook examples.

Project details

Release history

0.1.8: Aug 26, 2024
0.1.7: Jun 19, 2024
0.1.6: Sep 27, 2023
0.1.4: Jun 12, 2023
0.1.3: May 8, 2023
0.1.2: Apr 26, 2023
0.0.8: Mar 14, 2023
0.0.7: Feb 17, 2023 (this version)
0.0.6: Feb 12, 2023
0.0.5: Jan 18, 2023
0.0.4: Jan 13, 2023
0.0.3: Nov 21, 2022
0.0.2: Sep 27, 2022
0.0.1: Sep 12, 2022
0.0.0: Sep 8, 2022

