PyTorch Model Training and Experiment Tracking Framework
AI Toolbox
AIToolbox is a framework which helps you train deep learning models in PyTorch and quickly iterate on experiments. It hides the repetitive technicalities of training neural nets and frees you to focus on the interesting part: devising new models. In essence, it offers a Keras-style train loop abstraction which can be used for the higher-level training process while still allowing manual control at the lower level when desired.
In addition to orchestrating the model training loop, the framework also helps you keep track of different experiments by automatically saving models in a structured, traceable way and creating performance reports. These can be stored either locally or on AWS S3 (Google Cloud Storage in beta), which makes the library very useful when training on a GPU instance on AWS. The instance can be automatically shut down when training is finished, with all the results safely stored on S3.
Installation
To install the AIToolbox package execute:
pip install aitoolbox
If you want to install the most recent version from the GitHub repository, first clone the package repository and then install it via the pip command:
git clone https://github.com/mv1388/aitoolbox.git
pip install ./aitoolbox
The AIToolbox package can also be provided as a dependency in the requirements.txt file. This can be done by just specifying the aitoolbox dependency. Alternatively, to automatically download the current master branch from GitHub, include the following dependency specification in requirements.txt:
git+https://github.com/mv1388/aitoolbox#egg=aitoolbox
TrainLoop
TrainLoop is the main abstraction for PyTorch neural net training. At its core it handles the batch feeding of data into the model, calculating the loss and updating the parameters for a specified number of epochs.
To learn how to define a PyTorch model supported by the TrainLoop, please look at the Model section below.
After the model is created, the simplest way to train it via the TrainLoop abstraction is by doing the following:
from aitoolbox.torchtrain.train_loop import *

tl = TrainLoop(model,
               train_loader, val_loader, test_loader,
               optimizer, criterion)
model = tl.fit(num_epochs=10)
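To make concrete what "batch feeding, calculating the loss and updating the parameters for a specified number of epochs" involves, here is a minimal, framework-free sketch of such a loop. It is not AIToolbox's implementation: it fits a one-variable linear model in plain Python, and the names make_batches and fit_linear are made up for this illustration.

```python
import random


def make_batches(data, batch_size):
    """Yield consecutive mini-batches from a list of (x, y) pairs."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def fit_linear(data, num_epochs=50, batch_size=4, lr=0.05, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent on the MSE loss."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    history = []
    for _ in range(num_epochs):
        rng.shuffle(data)  # reshuffle the training data every epoch
        epoch_loss = 0.0
        for batch in make_batches(data, batch_size):
            # forward pass: prediction errors for this batch
            errors = [(w * x + b) - y for x, y in batch]
            epoch_loss += sum(e * e for e in errors)
            # gradients of the batch mean squared error w.r.t. w and b
            grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            # parameter update
            w -= lr * grad_w
            b -= lr * grad_b
        history.append(epoch_loss / len(data))  # mean squared error seen in this epoch
    return w, b, history


# Toy data generated from y = 3x + 1
train_data = [(x / 10, 3.0 * (x / 10) + 1.0) for x in range(20)]
w, b, history = fit_linear(train_data, num_epochs=200)
```

The two nested loops (epochs, then batches) and the three steps inside the inner loop are the part that a train loop abstraction takes off the user's hands.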
AIToolbox includes a few more advanced derivations of the basic TrainLoop which automatically handle the experiment tracking by creating model checkpoints, performance reports, example predictions, etc. All of this can be saved just on the local drive or can also be automatically stored on AWS S3. The currently implemented advanced TrainLoops are TrainLoopCheckpoint, TrainLoopEndSave and TrainLoopCheckpointEndSave.
Here, 'Checkpoint' stands for checkpointing after each epoch, while 'EndSave' will only persist and evaluate at the very end of the training.
For the most complete experiment tracking it is recommended to use the TrainLoopCheckpointEndSave option.
The optional use of the result packages needed for the neural net performance evaluation is explained in the experiment section below.
from aitoolbox.torchtrain.train_loop import *

TrainLoopCheckpointEndSave(
    model,
    train_loader, validation_loader, test_loader,
    optimizer, criterion,
    project_name, experiment_name, local_model_result_folder_path,
    hyperparams, val_result_package=None, test_result_package=None,
    cloud_save_mode='s3', bucket_name='models', cloud_dir_prefix=''
)
Check out a full TrainLoop training & experiment tracking example.
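The difference between the 'Checkpoint' and 'EndSave' behaviours described above can be pictured with a small stand-alone sketch. It only demonstrates when results are written; the function names, file names and JSON format are hypothetical and are not what AIToolbox produces.

```python
import json
import tempfile
from pathlib import Path


def run_training(num_epochs, out_dir, checkpoint_every_epoch, save_at_end):
    """Toy loop that only demonstrates *when* results are persisted."""
    out_dir = Path(out_dir)
    state = {"epoch": 0, "loss": 1.0}
    for epoch in range(1, num_epochs + 1):
        # stand-in for one epoch of real training
        state = {"epoch": epoch, "loss": 1.0 / (epoch + 1)}
        if checkpoint_every_epoch:
            # 'Checkpoint' behaviour: persist after every epoch
            (out_dir / f"checkpoint_epoch_{epoch}.json").write_text(json.dumps(state))
    if save_at_end:
        # 'EndSave' behaviour: persist once, after the last epoch
        (out_dir / "final_model.json").write_text(json.dumps(state))
    return sorted(p.name for p in out_dir.iterdir())


def files_written(checkpoint_every_epoch, save_at_end, num_epochs=3):
    """Run the toy loop in a throwaway folder and list the files it created."""
    with tempfile.TemporaryDirectory() as out_dir:
        return run_training(num_epochs, out_dir, checkpoint_every_epoch, save_at_end)


checkpoint_files = files_written(checkpoint_every_epoch=True, save_at_end=False)
end_save_files = files_written(checkpoint_every_epoch=False, save_at_end=True)
both_files = files_written(checkpoint_every_epoch=True, save_at_end=True)
```

In this picture the three calls loosely correspond to TrainLoopCheckpoint, TrainLoopEndSave and TrainLoopCheckpointEndSave respectively.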
Multi-GPU training
All TrainLoop versions support multi-GPU training in addition to single-GPU training, which can speed up training further. Following the core PyTorch setup, two multi-GPU training approaches are available: DataParallel and DistributedDataParallel.
DataParallel (DP)
To use DataParallel-like multi-GPU training with TrainLoop, just set the TrainLoop's gpu_mode parameter to 'dp':
from aitoolbox.torchtrain.train_loop import *

model = ...  # TTModel

TrainLoop(
    model,
    train_loader, val_loader, test_loader,
    optimizer, criterion,
    gpu_mode='dp'
).fit(num_epochs=10)
Check out a full DataParallel training example.
DistributedDataParallel (DDP)
Distributed training on multiple GPUs via DistributedDataParallel is enabled by the TrainLoop itself under the hood by wrapping the model (TTModel, more in the Model section) into DistributedDataParallel.
TrainLoop also automatically spawns multiple processes and initializes them. Inside each spawned process the model and all other necessary training components are moved to the GPU belonging to that process. Lastly, TrainLoop also automatically adds the PyTorch DistributedSampler to each of the provided data loaders in order to ensure that different data batches go to different GPUs and there is no overlap.
To enable distributed training via DistributedDataParallel, the user has to set the TrainLoop's gpu_mode parameter to 'ddp'.
import torch
from aitoolbox.torchtrain.train_loop import *

model = ...  # TTModel

TrainLoop(
    model,
    train_loader, val_loader, test_loader,
    optimizer, criterion,
    gpu_mode='ddp'
).fit(num_epochs=10, callbacks=None,
      ddp_model_args=None,
      num_nodes=1, node_rank=0, num_gpus=torch.cuda.device_count())
Check out a full DistributedDataParallel training example.
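The idea behind the DistributedSampler mentioned above, where every process reads its own slice of the dataset, can be sketched in plain Python. This is a simplified illustration of the strided split with shuffling left out, not PyTorch's actual code. The function name shard_indices is made up, and the sketch assumes the dataset has at least as many samples as there are processes.

```python
import math


def shard_indices(dataset_size, world_size, rank):
    """Return the sample indices that one process (rank) would read in an epoch."""
    indices = list(range(dataset_size))
    per_rank = math.ceil(dataset_size / world_size)
    total_size = per_rank * world_size
    # pad by repeating the first indices so that every rank gets the same count
    indices += indices[: total_size - len(indices)]
    # strided split: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total_size:world_size]


# 10 samples split across 4 processes
shards = [shard_indices(dataset_size=10, world_size=4, rank=r) for r in range(4)]
```

When the dataset size is not divisible by the number of processes, a few padded samples are seen twice so that all processes run the same number of batches. Apart from that padding, the shards do not overlap.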
Automatic Mixed Precision training (AMP)
All the TrainLoop versions also support training with Automatic Mixed Precision (AMP). In the past this required using the Nvidia apex extension, but from PyTorch 1.6 onwards AMP functionality is built into core PyTorch and no separate installation is needed. The current version of AIToolbox already supports the use of built-in PyTorch AMP.
The user only has to set the TrainLoop parameter use_amp to use_amp=True in order to use the default AMP initialization and start training the model in the mixed precision mode. If the user wants to specify custom AMP GradScaler initialization parameters, these should be provided as a dict parameter use_amp={'init_scale': 2.**16, 'growth_factor': 2.0, ...} to the TrainLoop.
All AMP initializations and training related steps are then handled automatically by the TrainLoop.
You can read more about different AMP details in the PyTorch AMP documentation.
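To give a feel for what GradScaler parameters such as init_scale and growth_factor control, the following is a simplified, framework-free model of dynamic loss scaling. The class ToyLossScaler is invented for this illustration and only mimics the scale bookkeeping. The backoff_factor and growth_interval arguments mirror the corresponding PyTorch GradScaler options, but the real scaler also scales the loss and unscales the gradients, which is omitted here.

```python
class ToyLossScaler:
    """Tracks only the loss scale, following the usual dynamic loss scaling rule."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without inf/NaN gradients

    def update(self, found_inf):
        """Adjust the scale after one optimizer step attempt."""
        if found_inf:
            # overflow: the step would be skipped and the scale is reduced
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                # enough clean steps in a row: try a larger scale again
                self.scale *= self.growth_factor
                self._good_steps = 0
        return self.scale


scaler = ToyLossScaler(init_scale=2.0 ** 16, growth_factor=2.0, growth_interval=3)
scale_after_overflow = scaler.update(found_inf=True)
for _ in range(3):
    scaler.update(found_inf=False)
scale_after_growth = scaler.scale
```

A larger init_scale protects small gradients from underflow but makes early overflows more likely, while growth_factor decides how aggressively the scale recovers after a backoff.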
Single-GPU mixed precision training
Example of single-GPU AMP setup:
from aitoolbox.torchtrain.train_loop import *

model = ...  # TTModel

TrainLoop(
    model, ...,
    optimizer, criterion,
    use_amp=True
).fit(num_epochs=10)
Check out a full AMP single-GPU training example.
Multi-GPU DDP mixed precision training
When training in the multi-GPU setting, the setup is mostly the same as in the single-GPU case. All the user has to do is set the use_amp parameter of the TrainLoop accordingly and switch its gpu_mode parameter to 'ddp'.
Under the hood, TrainLoop will initialize the model and the optimizer for AMP and start training using the DistributedDataParallel approach.
Example of multi-GPU AMP setup:
from aitoolbox.torchtrain.train_loop import *

model = ...  # TTModel

TrainLoop(
    model, ...,
    optimizer, criterion,
    gpu_mode='ddp',
    use_amp=True
).fit(num_epochs=10)
Check out a full AMP multi-GPU DistributedDataParallel training example.
Model
To take advantage of the TrainLoop abstraction the user has to define their model as a class, which is the standard way in core PyTorch as well. The only difference is that for TrainLoop-supported training the model class has to inherit from the AIToolbox-specific TTModel base class instead of PyTorch's nn.Module.
TTModel itself inherits from the normally used nn.Module class, so models still retain all the expected PyTorch functionality. The reason for using the TTModel superclass is that TrainLoop requires users to implement two additional methods which describe how each batch of data is fed into the model when calculating the loss in the training mode and when making the predictions in the evaluation mode.
The code below shows the general skeleton all TTModels have to follow to be trainable with the TrainLoop:
from aitoolbox.torchtrain.model import TTModel


class MyNeuralModel(TTModel):
    def __init__(self):
        super().__init__()
        # model layers, etc.

    def forward(self, x_data_batch):
        # The same method as required in the base PyTorch nn.Module
        ...
        # return prediction

    def get_loss(self, batch_data, criterion, device):
        # Get loss during training stage, called from fit() in TrainLoop
        ...
        # return batch loss

    def get_predictions(self, batch_data, device):
        # Get predictions during evaluation stage
        # + return any metadata potentially needed for evaluation
        ...
        # return predictions, true_targets, metadata
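As a rough illustration of how the two extra methods divide the work, the sketch below uses a plain-Python stand-in instead of a real TTModel, so no PyTorch is involved. ToyThresholdModel, error_rate and evaluate are hypothetical names. Only the method names and return shapes of get_loss and get_predictions follow the skeleton above.

```python
class ToyThresholdModel:
    """Stand-in that only mirrors the two extra TTModel methods."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def forward(self, x_batch):
        # predict class 1 when the input exceeds the threshold
        return [1 if x > self.threshold else 0 for x in x_batch]

    def get_loss(self, batch_data, criterion, device):
        # training mode: unpack the batch, run the model, return the batch loss
        x_batch, y_batch = batch_data
        return criterion(self.forward(x_batch), y_batch)

    def get_predictions(self, batch_data, device):
        # evaluation mode: return predictions, true targets and optional metadata
        x_batch, y_batch = batch_data
        metadata = {"batch_size": len(x_batch)}
        return self.forward(x_batch), y_batch, metadata


def error_rate(predictions, targets):
    """Toy criterion: fraction of wrong predictions in the batch."""
    return sum(p != t for p, t in zip(predictions, targets)) / len(targets)


def evaluate(model, loader, device="cpu"):
    """Collect predictions and targets over all batches, as an evaluation pass would."""
    all_predictions, all_targets = [], []
    for batch in loader:
        predictions, targets, _metadata = model.get_predictions(batch, device)
        all_predictions.extend(predictions)
        all_targets.extend(targets)
    return all_predictions, all_targets


loader = [([0.1, 0.9], [0, 1]), ([0.7, 0.2], [1, 1])]
model = ToyThresholdModel()
batch_losses = [model.get_loss(batch, error_rate, "cpu") for batch in loader]
predictions, targets = evaluate(model, loader)
```

Because the model itself knows how to unpack a batch, the surrounding loop never needs to know what a batch contains.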
Callbacks
For advanced applications the basic logic offered in the different default TrainLoops might not be enough. Additional logic can be injected into the training procedure by using callbacks and providing them as a list to the TrainLoop's fit(callbacks=[callback_1, callback_2, ...]) function.
AIToolbox already offers a wide selection of useful callbacks by default. However, when some completely new functionality is desired, the user can also implement their own callbacks by inheriting from the base callback object AbstractCallback. All the user has to do is implement the corresponding methods to execute the new callback at the desired point in the train loop, such as the start or end of a batch, an epoch, or the whole training.
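The callback mechanism can be pictured with the following framework-free sketch of a loop that calls hooks at fixed points. ToyCallback, EventRecorder, toy_fit and the hook names are illustrative assumptions. The actual hook names and signatures are defined by AbstractCallback and should be looked up there.

```python
class ToyCallback:
    """Base class with no-op hooks; subclasses override only what they need."""

    def on_train_begin(self):
        pass

    def on_epoch_begin(self, epoch):
        pass

    def on_batch_end(self, epoch, batch_idx):
        pass

    def on_epoch_end(self, epoch):
        pass

    def on_train_end(self):
        pass


class EventRecorder(ToyCallback):
    """Example callback that records when selected hooks fire."""

    def __init__(self):
        self.events = []

    def on_train_begin(self):
        self.events.append("train_begin")

    def on_epoch_end(self, epoch):
        self.events.append(f"epoch_end:{epoch}")

    def on_train_end(self):
        self.events.append("train_end")


def toy_fit(num_epochs, num_batches, callbacks=None):
    """Skeleton loop that only dispatches the hooks in order."""
    callbacks = callbacks or []
    for cb in callbacks:
        cb.on_train_begin()
    for epoch in range(num_epochs):
        for cb in callbacks:
            cb.on_epoch_begin(epoch)
        for batch_idx in range(num_batches):
            # a real loop would compute the loss and update the parameters here
            for cb in callbacks:
                cb.on_batch_end(epoch, batch_idx)
        for cb in callbacks:
            cb.on_epoch_end(epoch)
    for cb in callbacks:
        cb.on_train_end()


recorder = EventRecorder()
toy_fit(num_epochs=2, num_batches=3, callbacks=[recorder])
```

Since the base class provides no-op defaults, a custom callback only has to override the hooks for the points in training it cares about.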
experiment
Result Package
This is the definition of the model evaluation procedure on the task we are experimenting with. Result packages available out of the box can be found in the result_package module, where we have implemented several basic, general result packages. Furthermore, for those dealing with NLP, result packages for several widely researched NLP tasks such as translation and QA can be found as part of the NLP module. Last but not least, the framework was built with extensibility in mind, so if needed users can easily define their own result packages with custom evaluations by extending the base AbstractResultPackage.
Under the hood the result package executes one or more metrics objects which actually calculate the performance metrics. The result package object is thus used as a wrapper around potentially multiple performance calculations which are needed for our task. The metrics which are part of the specified result package are calculated by calling the prepare_result_package() method of the result package which we are using to evaluate the model's performance.
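The wrapper idea can be sketched without the framework as follows. ToyResultPackage and the two metric functions are hypothetical, and the argument list of prepare_result_package() used here is an assumption made for the example. The real base class is AbstractResultPackage.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the targets."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred):
    """Fraction of predicted positives (label 1) that are truly positive."""
    predicted_positive = [t for t, p in zip(y_true, y_pred) if p == 1]
    if not predicted_positive:
        return 0.0
    return sum(t == 1 for t in predicted_positive) / len(predicted_positive)


class ToyResultPackage:
    """Bundles several metric calculations behind a single call."""

    def __init__(self, metrics):
        self.metrics = metrics  # mapping: metric name -> function(y_true, y_pred)
        self.results = None

    def prepare_result_package(self, y_true, y_pred):
        # run every wrapped metric and keep the results together in one dict
        self.results = {name: fn(y_true, y_pred) for name, fn in self.metrics.items()}
        return self.results


package = ToyResultPackage({"accuracy": accuracy, "precision": precision})
results = package.prepare_result_package(y_true=[1, 0, 1, 1], y_pred=[1, 0, 0, 1])
```

Keeping all task-specific metrics in one object means the training loop only has to hand over predictions and targets once.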
Experiment Saver
The experiment saver saves the model architecture as well as the model performance evaluation results and training history. This can be done at the end of each epoch as model checkpointing or at the end of training.
The experiment saver is normally not a point of great interest when using the TrainLoop interface, as it is hidden under the hood. However, as AIToolbox was designed to be modular, one can decide to write their own training loop logic and use just the provided experiment saver module to help with the experiment tracking and model saving.
For PyTorch users we recommend using the FullPyTorchExperimentS3Saver, which has also been the most thoroughly tested.
The experiment is saved by calling the save_experiment() function of the selected experiment saver and providing the trained model and the evaluated result package containing the calculated performance results.
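For readers writing their own training loop, the general shape of a structured experiment save can be sketched as below. This is a local-only toy: the function save_experiment_locally, the folder layout and the file names are assumptions made for illustration and do not reproduce the layout or the save_experiment() signature that AIToolbox uses.

```python
import json
import tempfile
from pathlib import Path


def save_experiment_locally(root, project_name, experiment_name, model_state, results):
    """Write model state and results into <root>/<project>/<experiment>/."""
    experiment_dir = Path(root) / project_name / experiment_name
    experiment_dir.mkdir(parents=True, exist_ok=True)
    model_path = experiment_dir / "model.json"
    results_path = experiment_dir / "results.json"
    model_path.write_text(json.dumps(model_state))
    results_path.write_text(json.dumps(results))
    return model_path, results_path


with tempfile.TemporaryDirectory() as tmp_root:
    model_path, results_path = save_experiment_locally(
        tmp_root, "demo_project", "run_001",
        model_state={"weights": [0.1, 0.2]},
        results={"accuracy": 0.75},
    )
    # path of the saved model relative to the root folder
    saved_layout = model_path.relative_to(tmp_root).as_posix()
    reloaded_results = json.loads(results_path.read_text())
```

Grouping the model and its evaluation results under one project/experiment folder is what keeps individual runs traceable later on.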
cloud
All of these modules are mainly hidden under the hood when using the different experiment tracking abstractions. However, if only the cloud saving functionality is needed, they are easy to use as standalone modules in a downstream application.
AWS
Functionality for saving the model architecture and training results to S3, either during training or at the end of training. The module also offers dataset downloading from an S3-based dataset store. This is useful when experimenting with datasets over a slow local connection, where transferring data via scp/FTP is impractical.
Google Cloud
The same functionality as for AWS S3, but for Google Cloud Storage. It is implemented, but not yet tested in practice.
nlp
Currently mainly used for the performance evaluation result packages needed for different NLP tasks, such as Q&A, summarization and machine translation.
For tasks such as NMT, the module also provides attention heatmap plotting, which is often helpful for gaining additional insights into the seq2seq model. The heatmap plotter creates attention heatmap plots for every validation example and saves them as pictures to disk (potentially also to the cloud).
Lastly, the nlp module also provides several rudimentary NLP data processing functions.
AWS GPU instance prep and management bash scripts
As some of the tasks when training models on AWS cloud GPU instances are quite repetitive, the package also includes several useful bash scripts to automate tasks such as instance initialization and bootstrapping, experiment file updating, remote AIToolbox installation updating, etc.
For further information look into the /bin/AWS folder and read the provided README.
Examples of package usage
Look into the /examples folder for starters.
More examples of different training scenarios will be added.
Project details
File details
Details for the file aitoolbox-1.7.0.tar.gz.
File metadata
- Download URL: aitoolbox-1.7.0.tar.gz
- Upload date:
- Size: 104.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2e0378eac8f340828250c65d92341ad3c10a55e33542d430eae51e2de6586b78
MD5 | 76b72d1acc08116140c922928dc5bd03
BLAKE2b-256 | ab03c618d3f282be50ddc0529cd9f791d4eb32eaf213b01166b610a0fabdfa93
File details
Details for the file aitoolbox-1.7.0-py3-none-any.whl.
File metadata
- Download URL: aitoolbox-1.7.0-py3-none-any.whl
- Upload date:
- Size: 261.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 460af5c67ddd96c29c1c5dce731362916cdb88f724208c6303f85983e1a2296a
MD5 | 402e62ebc2c739af3751299d6794f642
BLAKE2b-256 | 1ac5b88cec6877f3b02404f4ea1eeee705178c1f2fb16804f87312d647aae9b2