
Commons PyTorch Lightning Trainer for Hyperparameter Optimization


AnhaltAI Commons PL Hyper

AnhaltAI Commons PyTorch Lightning Trainer Framework for Hyperparameter Optimization

Summary

This framework provides a deep learning trainer based on PyTorch Lightning with a commonly usable setup for different deep learning tasks, and it supports k-fold cross-validation for automated hyperparameter optimization. Training runs are planned using sweeps from Weights and Biases (wandb) that are created from the supported configuration files.

The framework is based on the PyTorch Lightning Trainer and provides extended functionality. It allows models to be trained in a multi-GPU setup, both on a local machine and in distributed environments. Using the functionality of Weights and Biases enables planning, starting, and logging multiple experiments at once (called sweeps). Offering these features together forms the core of the framework.

The content provided by the framework is intended to simplify comparisons of AI training experiments.

Because the framework is built around abstract classes, the code is not usable in its base form: some functions have to be implemented by an inheriting class.

The package is available on PyPI and compatible with Python >= 3.10.


User Guide

This chapter is meant for the users of this framework and provides an introduction as well as further usage possibilities.

How to start

This section provides guidelines for setting up your AI training project using the framework.

You can create your own project for AI training with any directory structure, apart from a few requirements imposed by the framework.

In addition, similar to PyTorch and Lightning, the framework is expected to be installed as a Python package in order to use or extend its contents.

The framework is published on PyPI and can therefore be installed as a package using pip.

pip install anhaltai-commons-pl-hyper

The next section, Tutorial, will help you start a project that uses the anhaltai-commons-pl-hyper framework.

After completing the tutorial as a quick introduction, you can continue developing your project using the other sections.

Tutorial

The tutorial in this section shows how to set up a project using the example of a 2D image classification training, where either a model architecture from Hugging Face or a custom model implementation can be used. The dataset is also downloaded from Hugging Face.

You can work through the chapters of the tutorial in order or read them individually if your training setup deviates too much from the example.

Hint: The tutorial assumes that only one AI training task is prepared for each project. After you have familiarized yourself with the framework, you could later redesign the project structure to be able to train multiple tasks.

How to extend the Implementation for your Task

This section introduces how training for a specific task is intended to be implemented. Further implementation details follow in the next sections.

To use this framework for your specific task, you have to extend the provided abstract classes and functions.

First of all, there is an example project at examples/example_classification_project that shows a fully functional example of how this framework can be used.

The integration tests in the tests/integration directory also show examples of how to use different configurations, e.g. for different tasks and data splitting modes on the same code base.

As shown in the example project, you need to extend the following classes by implementing a subclass for each of them whose functionality you want to use.

Detailed information about the extendable classes and custom model architectures can be found in src/anhaltai_commons_pl_hyper/README.md.

Additionally, in your DataModule subclass you need to implement the data loading and preprocessing of your datasets for your specific AI learning task.
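A minimal sketch, assuming the framework's DataModule builds on Lightning's LightningDataModule; the import path and hook names are assumptions, so check src/anhaltai_commons_pl_hyper/README.md for the actual abstract methods:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical import path; see the framework README for the real one.
from anhaltai_commons_pl_hyper.data_module import DataModule


class ClassificationDataModule(DataModule):
    """Loads and preprocesses the data for the image classification example."""

    def setup(self, stage=None):
        # Placeholder data; replace with your real dataset loading and
        # preprocessing, e.g. a Hugging Face dataset.
        images = torch.randn(64, 1, 28, 28)
        labels = torch.randint(0, 10, (64,))
        self.train_dataset = TensorDataset(images, labels)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=32, shuffle=True)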

Proceed to the next section to see the example project setup.

Quick Start using the Example Project to initialize the Project Structure

This section explains how to start a project from the code base of the example project that uses the framework.

Follow these steps for a quick start:

  • Have a look at the example project

  • To make the example independent of this framework repository, create an empty directory for a new project on your device.

  • Copy the contents of example_classification_project into your project root; the project root can be named as you like. The src folder has to be in the project root.

  • Navigate to your project root and ensure that you have Python installed.

  • Optional: Create a virtual Python environment for your IDE and activate it for the next steps.

  • Install all needed Python packages:

    pip install -r requirements.txt
    
  • Because the files were copied, the imports in the Python source files may be broken. Correct them in each source file, using imports from the public anhaltai-commons-pl-hyper package instead of the local imports used in the example.

  • Optional: You can change the values of the environment variables inside the .env file, e.g. WANDB_MODE=online, and set the wandb credentials to be able to see the metrics online at https://wandb.ai (see the wandb environment variable docs).

  • More environment variables and configuration parameters are introduced in the next sections and can be skipped for now.

  • Important: If you are using Git, add the .env file to .gitignore because .env can contain your secrets. (.env was deliberately not added to .gitignore to provide the example.)

  • After this very basic setup you can start the training locally by using the entrypoints.

    • To start a wandb single run:
    python -u -m src.classification_training.classification_trainer
    
    • To start a wandb sweep (multiple runs), run these commands in separate terminals. You have to set your wandb credentials in the env variables first:
    python -u -m src.classification_training.wandb_utils.sweep_server
    python -u -m src.classification_training.wandb_utils.sweep_agent
    
  • Once this works, you can build Docker images for this example; see the section Build Docker Images. You can try out the wandb single-run and sweep-run options. You can also experiment with the different data splitting modes.

  • After you have understood the provided example you can start to adapt and extend the code for your own AI training task.

Add Models

This section explains how to add a model to the training setup.

The model architecture should be added in such a way that multiple models can be selected via the config for different runs on the same data, allowing clean comparisons across wandb sweep runs.

The model name and additional settings can be set in the config. You may add further control structures and settings. Refer to the config documentation for details about the parameters.

Custom Models

Custom models have to extend torch.nn.Module and implement the forward function. They can be implemented in any Python file and must be added to your self-implemented control structure so that the model can be selected via the config.
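For illustration, a minimal custom model; torch.nn.Module and forward() are standard PyTorch, while the class name and layer sizes are arbitrary:

import torch
from torch import nn


class SimpleClassifier(nn.Module):
    """A tiny example architecture for 2D image classification."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU())
        self.head = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))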

Import Models

The model import should be added directly to the control structure in the load_model() method of the TrainingModule class or outsourced to another Python file.
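A hedged sketch of such a control structure; load_model() and TrainingModule are named above, but the import paths, config keys, and the Hugging Face fallback are assumptions for illustration:

from transformers import AutoModelForImageClassification

# Hypothetical import paths; see the framework README and your own project.
from anhaltai_commons_pl_hyper.training_module import TrainingModule
from my_project.models import SimpleClassifier


class ClassificationTrainingModule(TrainingModule):
    def load_model(self):
        model_name = self.config["model_name"]  # hypothetical config key
        if model_name == "simple-classifier":
            # Select the custom architecture from your own code base.
            return SimpleClassifier(num_classes=self.config["num_classes"])
        # Otherwise interpret the name as a Hugging Face model ID.
        return AutoModelForImageClassification.from_pretrained(model_name)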

Entrypoints

This section explains what the entrypoints for a training are and how to modify them.

Implementing the entrypoints in your own files is essential for the later step Build Docker Images.

There are at least two entrypoints to start the training, which are explained in the subsections below.

For a setup like the example project, the entrypoints must be called with the project root as working directory, because the config file locations are referenced relative to it.

They correspond to the two ways wandb is used: single runs and sweeps.

Start of a Single Run

You have to call the train() method of an instance of your Trainer subclass to start a training.

In this case, the training is configured by the single-run config.
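A hedged entrypoint sketch; the ClassificationTrainer name and its module path are illustrative (the example project contains the real one):

import logging

# Hypothetical import path to your own Trainer subclass.
from src.classification_training.classification_trainer import ClassificationTrainer

# Configure logging first (see Configure Logging for Multiprocessing below).
log_format = "%(asctime)s %(name)s[%(process)d] %(levelname)s %(message)s"
logging.basicConfig(level=logging.INFO, format=log_format, datefmt="%Y-%m-%d %H:%M:%S")

if __name__ == "__main__":
    # Reads the config referenced by SINGLE_RUN_CONFIG_PATH and starts training.
    ClassificationTrainer().train()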

Start of a Sweep

Two components are needed when running a sweep on multiple devices: the sweep server and the agent.

The main() function is implemented in the SweepServer class.

You must implement an entrypoint that calls the main() method of an instance of SweepServer to start a sweep server, from which the sweep agents on other devices can get the sweep ID via a GET request.
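A hedged sketch of such an entrypoint; SweepServer and main() are named above, while the import path is an assumption:

import logging

# Hypothetical import path; check the framework package for the real one.
from anhaltai_commons_pl_hyper.wandb_utils.sweep_server import SweepServer

# Configure logging as in the single-run entrypoint above.
log_format = "%(asctime)s %(name)s[%(process)d] %(levelname)s %(message)s"
logging.basicConfig(level=logging.INFO, format=log_format, datefmt="%Y-%m-%d %H:%M:%S")

if __name__ == "__main__":
    SweepServer().main()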

Sweeps can be resumed, as explained in the section Resume Weights and Biases Sweeps.

Also, you need to implement an entrypoint that calls the create_agent() function to start a sweep agent that requests the sweep ID from the sweep server. The agent calls the train() method of an instance of your Trainer subclass. With the sweep ID, it can get the correct run config from the running wandb sweep.
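A hedged sketch of the agent entrypoint; create_agent() is named above, while its import path and exact signature are assumptions:

import logging

# Hypothetical import path; check the framework package for the real one.
from anhaltai_commons_pl_hyper.wandb_utils.sweep_agent import create_agent

log_format = "%(asctime)s %(name)s[%(process)d] %(levelname)s %(message)s"
logging.basicConfig(level=logging.INFO, format=log_format, datefmt="%Y-%m-%d %H:%M:%S")

if __name__ == "__main__":
    # Requests the sweep ID from the server at SWEEP_SERVER_ADDRESS and runs
    # the Trainer subclass referenced by TRAINER_PATH for each assigned run.
    create_agent()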

Configure Logging for Multiprocessing

This section shows how to set up console logging when using multiprocessing with PyTorch Lightning.

It is recommended to set custom logging options for logging to the console or log files at the very beginning of all entrypoints as shown by the code examples in the section Entrypoints.

Placing the logging configuration this way keeps logging working even when multiprocessing is used, i.e. when training with multiple devices and data loader workers. It prevents logs from being silently dropped in some cases.

It is possible to make console logs of multiple processes more readable. For example, the following logging setup provides important information for each line, such as timestamp, logger name, process ID, and log level:

import logging

log_format = "%(asctime)s %(name)s[%(process)d] %(levelname)s %(message)s"
logging.basicConfig(level=logging.INFO, format=log_format, datefmt="%Y-%m-%d %H:%M:%S")

These lines are used in the entrypoints of the example project.

The underlying framework Lightning provides further options to configure custom logging to the console: https://lightning.ai/docs/pytorch/stable/common/console_logs.html.

Setup Configs

This section gives an overview of the setup of the training configs: the wandb single-run config and the sweep config.

The location of the config files can be set with environment variables SINGLE_RUN_CONFIG_PATH and SWEEP_DIRECTORY as explained in Setup Environment Variables.

The config documentation can be found in the repository; example config files are in examples/example_classification_project/configs.

Data splitting is part of the training configuration and has three modes (an illustrative config sketch follows the list):

  • Train
  • Train + Test
  • Train + Test + Validation
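How the mode is selected is defined in the config documentation; as a purely illustrative sketch with a hypothetical key name:

data_split_mode: train_test_validation  # hypothetical key; other modes: train, train_test -- consult the config docs for the real parameter name and values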

Setup Environment Variables

This section explains all mandatory and optional environment variables.

Using an .env file locally in your project root is recommended.

Required Environment Variables for training

The most important environment variables for this training framework are listed here. Keep in mind that further variables, e.g. those of Weights and Biases, can be compatible with the framework. For the sake of simplicity, however, we refer to the original sources.

Variable | Mandatory, Condition | Examples | Source | Description
CHECKPOINT_DIRECTORY | No, for checkpointing models | models | — | Local directory to save the checkpoints.
HF_TOKEN | No, to upload to Hugging Face | (secret) | Hugging Face | Hugging Face token.
HF_USERNAME | No, to upload to Hugging Face | username | Hugging Face | Hugging Face username.
SINGLE_RUN_CONFIG_PATH | No, for wandb single runs | configs/single-run.yaml | — | Local file of the single-run config for your wandb run if not using a sweep.
SWEEP_DIRECTORY | No, for wandb sweeps | configs/sweep | — | Local directory of the configs for your wandb sweeps.
SWEEP_SERVER_ADDRESS | No, for wandb sweeps | http://localhost:5001 | — | The address of your hosted sweep server.
TRAINER_PATH | No, for wandb sweeps | classification.classification_trainer | — | Python module where your Trainer subclass for your learning task is implemented.
WANDB_ANONYMOUS | No, set to must for integration tests | allow, never, must | wandb | Lets users create anonymous runs with secret URLs.
WANDB_API_KEY | No, for wandb online and sweeps | (secret) | wandb | API key of your wandb account.
WANDB_DISABLE_GIT | No, set to true for integration tests | true, false | wandb | Prevents wandb from probing for a git repository and capturing the latest commit/diff.
WANDB_ENTITY | No, for wandb online and sweeps | my_company | wandb | Name of the wandb entity (https://docs.wandb.ai/ref/python/init/).
WANDB_MODE | No, set to offline for integration tests | online, offline | wandb | If set to "offline", wandb saves your run metadata locally and does not sync to the server; if set to "disabled", wandb turns off completely.
WANDB_PROJECT | No, for using wandb | my_project | wandb | Name of the wandb project (https://docs.wandb.ai/ref/python/init/).
WANDB_USERNAME | No, for wandb online and sweeps | my_name | wandb | Username of your wandb account.
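For local development, a minimal .env could look like this (values are placeholders based on the examples above; keep secrets out of version control):

WANDB_MODE=online
WANDB_ENTITY=my_company
WANDB_PROJECT=my_project
WANDB_USERNAME=my_name
WANDB_API_KEY=<your-secret-api-key>
SINGLE_RUN_CONFIG_PATH=configs/single-run.yaml
SWEEP_DIRECTORY=configs/sweep
SWEEP_SERVER_ADDRESS=http://localhost:5001
CHECKPOINT_DIRECTORY=models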

Additional Environment Variables for Docker

Condition: When building or pushing docker images.

Variable | Mandatory, Condition | Examples | Sources | Description
DOCKER_REPOSITORY_SERVER | No, to push images | gitlab.com:5050 | Docker, GitLab | Repository server of your Docker container registry.
DOCKER_REPOSITORY_PATH | No, to push images | myprojectGroup/myproject | Docker, GitLab | Repository path of your Docker container registry.
DOCKER_TRAINING_IMAGE_NAME | No, to build images | 2024.10.dev0 | — | Trainer image name for docker build (dev or release).
DOCKER_SWEEP_SERVER_IMAGE_NAME | No, to build images | sweep-server | — | Sweep server image name for docker build.
DOCKER_USERNAME | No, to push images | username | Docker, GitLab | Username of your Docker container registry. GitLab is also supported.
DOCKER_TOKEN | No, to push images | (secret) | Docker, GitLab | Token of your Docker container registry. GitLab is also supported.

Additional Environment Variables for Kubernetes

Condition: When running docker images in Kubernetes.

Environment variables can be provided as config-maps and secrets on Kubernetes or Rancher.

Variable | Mandatory, Condition | Examples | Source | Description
KUBE_NAMESPACE | Yes | my-training | Kubernetes | Your Kubernetes namespace.
KUBE_SWEEP_SERVER_ADDRESS | No, for wandb sweeps | http://sweep-server:5001 | — | The address of your hosted sweep server on Kubernetes.

Logging Metrics

This section explains the logging of metrics.

The metrics of the runs can be retrieved from the Weights and Biases website. You can configure the logging via the run/sweep config, and log in using the wandb environment variables.

Checkpoints

This section explains saving and loading of checkpoints.

The checkpoints are saved to the relative directory path given by the env variable CHECKPOINT_DIRECTORY, which defaults to models. Subfolders are created for the best and latest checkpoints (their existence depends on the run/sweep config). Inside these folders, subfolders named by the timestamp of their creation are created. There you will find the checkpoint directories for your runs, named by the wandb run ID that is logged on the Weights and Biases website.
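The resulting layout might look like this (the timestamp format and run ID are placeholders):

models/
  best/
    2025-02-04_12-00-00/
      <wandb-run-id>/
  latest/
    2025-02-04_12-00-00/
      <wandb-run-id>/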

The upload of the checkpoints of the trained model to Hugging Face can be configured in the run/sweep config.

When using Kubernetes, it is possible to mount this checkpoint folder as a volume, e.g. a Persistent Volume Claim (PVC), to be able to retrieve the checkpoints after training.

A training started with the framework can be resumed by setting the checkpoint path in the config, which loads that checkpoint. Important: For sweeps, all runs will start from this checkpoint.

Resume Weights and Biases Sweeps

Instructions on resuming sweeps are provided in this section.

When the sweep_id parameter in the sweep config is set to the sweep ID of a running sweep (refer to the config docs), the sweep is resumed and only the planned runs that have not been started yet are executed.
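In the sweep config this is a single entry; the ID value here is a placeholder:

sweep_id: abc123xy  # ID of the already running sweep to resume (placeholder)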

To be able to resume Weights and Biases (wandb) sweeps using SweepServer, you need to install wandb on your system interpreter! The sweep configs themselves are explained in the section Setup Configs.

pip install wandb

Build Docker Images

This section shows how to build Docker images to run your training.

This step also depends on your project-specific setup.

You can build Docker images to run the training: one image for single-run training and the sweep agent, another for the sweep server. You can build them with the example shell script provided in the example project:

  • Must be called from the example project's root
    ./scripts/build_images.sh
    

There is another script for building and pushing to a Docker image repository, provided the env variables for the Docker repository are set:

  • Must be called from the example project's root
    ./scripts/build_and_upload_images.sh
    


The configs explained in Setup Configs are baked into the Docker image by default, so you have to rebuild the sweep server image whenever you change the sweep config files, e.g. using this shell script:

  • Must be called from the example project's root
    ./scripts/build_and_upload_images_only_sweep_server.sh
    

Alternatively, you can mount the config files as volumes (see Example for Running Docker Images on Kubernetes for examples).

Example for Running Docker Images on Kubernetes

This section explains how you can run your built Docker images with Kubernetes, after you have read the section Build Docker Images.

Preparation for Kubernetes or Rancher

To be able to run the Docker images on Kubernetes as pods, you should prepare your setup first. The provided example shell script setup_kubernetes.sh does the following:

  • creates a Kubernetes namespace if it does not exist and uses it for the next steps
  • recreates a secret named wandb-access for the wandb credentials
  • recreates a docker-registry secret named gitlab-registry for the case that Docker images are stored on GitLab as Docker registry
  • recreates a configmap named sweep-config for further env variables that are necessary for the training
  • recreates the service for the sweep server to open the port for the requests by the agents

Alternatively, you can create your setup (configmaps, secrets, and services) manually or with an adapted shell script.

Working with the example shell script setup_kubernetes.sh:

  • When working on Rancher, you have to log in to your Kubernetes cluster and create your namespace in your project yourself, because kubectl cannot create a namespace inside a project. Projects only exist on Rancher, not in Kubernetes itself.

  • Next, set the name of the namespace as a local env variable so that the shell script can create the contents in the existing namespace.

  • Ensure that you have set all env variables that are required for the training locally: see Required Environment Variables for training

  • Additional environment variables for Kubernetes are needed as well: see Additional Environment Variables for Kubernetes

    You need to be connected to the Kubernetes cluster in your console, using the cluster's kubeconfig. The setup shell script will use these env variables to create the configmaps, secrets, and services for your AI training.

  • Then run the setup shell script to create the necessary Kubernetes resources:

    ./examples/example_classification_project/scripts/setup_kubernetes.sh
    

You can repeat this procedure for multiple clusters and namespaces if you need to run multiple AI training experiments in parallel.

You can define your .yaml files for Kubernetes to start single runs and sweeps. You will find more details about them in the next subsections:

For a Sweep:

There are templates for Kubernetes YAML files provided in the example project at examples/example_classification_project/configs/kubernetes/sweeps that fit the example Dockerfiles.

Important: You need to set the Docker image URLs in those .yaml files to point to your built images.

Example files explained:

File | Description
classification-training-pod.yaml | Kubernetes config to start a job that launches multiple wandb agents for a sweep (sweep server needed).
classification-training-with-sweep-server.yaml | Kubernetes config to start a job that launches multiple wandb agents for a sweep; additionally starts a sweep server, independent of the job, usable by the agents.
sweep-server-service.yaml | Kubernetes config to start a service so that the agents can send requests to the port of the sweep server.
sweep-server.yaml | Kubernetes config to start a sweep server.

To run a sweep, the SweepServer needs to run and the sweep configs must be located in the folder given by the environment variable SWEEP_DIRECTORY. In the example, the sweep config files model.yaml, logging.yaml, and dataset.yaml are given in addition to the mandatory sweep.yaml. The sweep config directory can include any number of YAML files in any number of subdirectories, offering the possibility to group the parameters. The only limitation is that sweep.yaml must lie directly in the sweep directory and include the general sweep settings. All other files contain the parameters. For example:

  • dataset.yaml - contains the dataset configuration
  • model.yaml - contains the model configuration
  • logging.yaml - contains the logging configuration
  • sweep.yaml - contains general sweep configuration

Those configs combined follow the structure of a wandb sweep config.
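For orientation, a minimal sketch of how the combined files could map onto the standard wandb sweep config format; the metric and parameter keys below are illustrative, not the framework's required names:

# sweep.yaml -- general sweep settings
method: grid
metric:
  name: validation/loss  # hypothetical metric name
  goal: minimize

# model.yaml / dataset.yaml -- parameter files (combined under "parameters")
parameters:
  model_name:
    values: ["simple-classifier", "google/vit-base-patch16-224"]
  batch_size:
    values: [16, 32]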

The configs are copied into the Docker image by default, so you have to rebuild the image on every change, which takes a lot of time. The next lines explain an optional alternative way to change the parameter values of the configs without rebuilding the image.

The configs can be set in a ConfigMap named sweep-config-yaml and mounted as a volume to replace the default config files.

Caution: The configs set in the ConfigMap fully replace the previously copied config files! You have to change the values inside the ConfigMap for them to take effect.

The ConfigMap must be created in the same Kubernetes namespace as used for the training.

To create the ConfigMap, add each config file as a key-value pair: the filename is used as key, and the file content is pasted as value.

For example, a ConfigMap for two files contains two key-value pairs, e.g. one entry for sweep.yaml and one for model.yaml, each holding the file content as value.

As shown in classification-training-with-sweep-server.yaml, the ConfigMap is provided as a volume for the sweep server, so that all files given in the ConfigMap fully replace the default configs/sweep folder:

Shown are the most important lines for the sweep server pod:

kind: Pod
metadata:
  name: sweep-server
[...]
spec:
[...]
  containers:
    - name: pytorch-model-sweep-agent
[...]
      volumeMounts:
        - name: sweep-config-yaml
          mountPath: /configs/sweep
[...]
  volumes:
    - name: sweep-config-yaml
      configMap:
        name: sweep-config-yaml

For more details, see the Kubernetes docs.

For a Single Run:

There is a template for a Kubernetes YAML file provided in the example project at examples/example_classification_project/configs/kubernetes/single-run that fits the example Dockerfiles.

Important: You need to set the Docker image URL in this .yaml file to point to your built image.

Example file explained:

File | Description
classification-training-pod-single-run.yaml | Kubernetes config to start a single wandb run (no sweep server needed).

The single-run config is copied into the Docker image by default, so you have to rebuild the image on every change, which takes a lot of time. The next lines explain an optional alternative way to change the parameter values of the config without rebuilding the image.

For the single run it is also possible to provide the single-run.yaml in a ConfigMap, e.g. one named single-run-yaml. The filename single-run.yaml is used as key and the file content as value inside the ConfigMap.

As visible in the example classification-training-pod-single-run.yaml, the ConfigMap is provided as a volume for the container in which the training runs (the sweep server is not needed). To replace only the default single-run.yaml, located at /workspace/configs/single-run.yaml in the Docker image, only the single-run.yaml key of the ConfigMap is mounted. The subPath parameter is necessary for the volume mount to replace just that single file.

Caution: The single run config set in the ConfigMap fully replaces the previously copied config file! You have to change the values inside the ConfigMap for them to take effect.

Shown are the most important lines for the single run job:

apiVersion: batch/v1
kind: Job
[...]
      containers:
        - name: pytorch-model-single-run
[...]
          volumeMounts:
            [...]
            - mountPath: /workspace/configs/single-run.yaml
              name: single-run-yaml
              subPath: single-run.yaml
[...]
      volumes:
        [...]
        - name: single-run-yaml
          configMap:
            name: single-run-yaml
            items:
              - key: single-run.yaml
                path: single-run.yaml

Start Training on Kubernetes or Rancher

If the hyperparameters are set in the config maps and the Kubernetes YAML files have been prepared (including selection of namespace, node, and computation resources), you can apply the YAML files to Kubernetes or paste them into Rancher.

The training will then run until it is finished.

Cleanup after Training on Kubernetes or Rancher

Finished jobs and pods can be deleted after you have checked the metrics logged by wandb.

It is recommended to keep the rest of the setup of the namespace as it is to be able to launch more trainings without repeating all preparation steps. In this case the reduced preparation time clearly shows the benefits of using the framework.

Development Setup

This chapter is meant for the developers of this framework and shows the basic project setup to be able to further develop or maintain the framework.

It is recommended to follow the sections in the given order.

Install Python Requirements

pip install -r requirements.txt
pip install -r requirements-tests.txt

Set src Folders:

  • src
  • examples/example_classification_project/src

Set test Folders:

  • tests

Unit Tests and Integration Tests

  • Caution: The example project is also used by the integration tests. Only change the .env and config files if you know what you are doing.

  • Test scripts directory: tests

  • Integration test scripts directory: tests/integration

  • The integration tests in tests/integration are used to show minimal example project setups

  • All tests have to be run with the project root directory as the working directory

  • Please do not mark the subdirectories named "src" as Python source folders, to avoid breaking the structure

  • To find all code modules during tests, the pythonpath is defined in the pyproject.toml file

This way all test functions (prefixed with "test") are found and executed from the project root:

pytest tests

Debug Entrypoints

For a setup like the example project, the entrypoints must be called with the project root as working directory, because the config file locations are referenced relative to it.

Append examples/example_classification_project/src as a value to the env variable PYTHONPATH.
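On Linux or macOS this can look as follows (adjust for your shell or IDE run configuration):

export PYTHONPATH="$PYTHONPATH:examples/example_classification_project/src"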

First change working directory:

cd examples/example_classification_project

You can run the trainer implementation of the provided example project in single run mode with this entrypoint:

python -u -m src.classification_training.classification_trainer

Alternatively, you can run a sweep of the provided example project with these entrypoints in two terminals:

You need to set the env variables first, but don't set those secrets directly in the example project! Instead, configure them in your operating system:

  • WANDB_USERNAME
  • WANDB_API_KEY
  • WANDB_ENTITY
python -u -m src.classification_training.wandb_utils.sweep_server
python -u -m src.classification_training.wandb_utils.sweep_agent

Build Package Locally

python -m build 
