Commons PyTorch Lightning Trainer for Hyperparameter Optimization

AnhaltAI Commons PL Hyper

AnhaltAI Commons PyTorch Lightning Trainer Framework for Hyperparameter Optimization

Summary

A deep learning trainer based on PyTorch Lightning with a common setup that is usable for different deep learning tasks. It supports k-fold cross-validation for automated hyperparameter optimization. Runs are planned with Weights and Biases (wandb) sweeps, which are created from the supported configuration files.

Building on the code of Weights and Biases and Lightning AI, training with multiple GPUs and multiple wandb agent processes is a main part of this framework.

The foundation provided by this framework must be extended with code parts for each AI learning task.

The package is available on PyPI and is compatible with Python >=3.10.

Usage

Install with pip

pip install anhaltai-pl-hyper

Extend the implementation for your task

To use this framework for your specific task, you have to extend the provided abstract classes and functions. You need to implement your Trainer, DataModule and TrainingModule as well as the preprocessing of your datasets for your specific AI learning task.

The integration tests in the tests/integration directory show how to use this framework for your AI training, e.g. for different tasks and data splitting modes.

You will find detailed information here: src/anhaltai_commons_pl_hyper/README.md

Extend sweep server and wandb agent

The package provides functions to run a sweep server that creates or resumes a Weights and Biases (wandb) sweep. Multiple agents can then be started. Each agent gets the sweep ID from the server via a REST request and starts an available run of the sweep.

To use them, create your own functions that call the provided functions create_agent() and SweepServer().main() from your code base. Feel free to extend or override these functions as needed. Having these entry points in your implementation enables the later step Build docker images.
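The REST handshake between agent and server can be sketched with the standard library alone. This is not the package's implementation: the endpoint path /sweep_id, the JSON field sweep_id and the stub server are assumptions made purely for illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical endpoint path and JSON shape; the real sweep server may differ.
SWEEP_ID_PATH = "/sweep_id"


class StubSweepHandler(BaseHTTPRequestHandler):
    """Stand-in for the sweep server: answers GET requests with a fixed sweep ID."""

    def do_GET(self):
        if self.path != SWEEP_ID_PATH:
            self.send_error(404)
            return
        body = json.dumps({"sweep_id": "abc123"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def start_stub_server():
    """Start the stub on a free local port and return (server, base_url)."""
    server = HTTPServer(("127.0.0.1", 0), StubSweepHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}"


def fetch_sweep_id(base_url, timeout=5.0):
    """Ask the (stub) sweep server which sweep the agent should join."""
    # Bypass any system proxy settings so local addresses are reached directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(base_url + SWEEP_ID_PATH, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["sweep_id"]


if __name__ == "__main__":
    server, base_url = start_stub_server()
    try:
        print(fetch_sweep_id(base_url))
    finally:
        server.shutdown()
```

In a real setup the base URL would come from SWEEP_SERVER_ADDRESS, and the returned ID would be handed to the wandb agent.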

Basic example:

.../wandb_utils/sweep_server.py

from anhaltai_commons_pl_hyper.wandb_utils.sweep_server import SweepServer

if __name__ == "__main__":
    # load your env variables here

    SweepServer().main()  # run

.../wandb_utils/sweep_agent.py

from anhaltai_commons_pl_hyper.wandb_utils.sweep_agent import create_agent

if __name__ == "__main__":
    # load your env variables here

    create_agent()  # run

To resume Weights and Biases (wandb) runs with SweepServer, you need to install wandb in your system interpreter. Resuming a sweep is explained in the section Setup Configs.

pip install wandb

Configure logging for multiprocessing:

It is recommended to configure logging before calling create_agent() and SweepServer().main() so that the logs of multiple processes are easier to read:

import logging

log_format = "%(asctime)s %(name)s[%(process)d] %(levelname)s %(message)s"
logging.basicConfig(level=logging.INFO, format=log_format, datefmt="%Y-%m-%d %H:%M:%S")
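To see what this format produces, the following sketch formats a single hand-built log record. The logger name sweep_agent and the message are made up for illustration; the point is the %(process)d field, which puts the process ID next to the logger name so that lines from different agent processes can be told apart.

```python
import logging
import os

log_format = "%(asctime)s %(name)s[%(process)d] %(levelname)s %(message)s"


def format_example(name, message):
    """Format one INFO record with the format above and return the line."""
    formatter = logging.Formatter(log_format, datefmt="%Y-%m-%d %H:%M:%S")
    record = logging.LogRecord(
        name=name,
        level=logging.INFO,
        pathname="example.py",  # placeholder, not used by this format
        lineno=0,
        msg=message,
        args=None,
        exc_info=None,
    )
    return formatter.format(record)


if __name__ == "__main__":
    # Prints a timestamp, then: sweep_agent[<pid>] INFO starting run
    print(format_example("sweep_agent", "starting run"))
```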

Setup Configs

Example config files are located in ./configs/

You will find their documentation here: docs/config-documentation.md

Supported data splitting modes are documented here: docs/data-splitting-documentation.md. In short, the supported modes are train, train+test, and train+test+val.
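Since the framework supports k-fold cross-validation, it can help to recall how a dataset is partitioned into folds. The following is a generic, dependency-free sketch of contiguous k-fold index splitting; it is not the package's own splitting code, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample so every sample is used.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


if __name__ == "__main__":
    for fold, (train, val) in enumerate(k_fold_indices(10, 3)):
        print(fold, train, val)
```

Each sample appears in exactly one validation fold, and a model is trained once per fold on the remaining samples.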

The location of the config files can be set with environment variables as explained in Setup Environment Variables.

Setup Environment Variables

Usage: First copy the .env-example file to your project root as .env and adjust its values as needed.

Required Environment Variables for training

| Variable | Example | Source | Description |
| --- | --- | --- | --- |
| WANDB_PROJECT | myproject | wandb | Name of the wandb project (https://docs.wandb.ai/ref/python/init/) |
| WANDB_API_KEY | | wandb | API key of your wandb account |
| WANDB_ENTITY | mycompany | wandb | Name of the wandb entity (https://docs.wandb.ai/ref/python/init/) |
| CHECKPOINT_DIRECTORY | models | | Local directory to save the checkpoints |
| SWEEP_DIRECTORY | configs/sweep | | Local directory of the configs for your wandb sweeps |
| SINGLE_RUN_CONFIG_PATH | configs/single-run.yaml | | Local file of the single run config for your single wandb run if not using a sweep |
| TRAINER_PATH | classification.classification_trainer | | Python file where your trainer subclass is implemented for your learning task |
| SWEEP_SERVER_ADDRESS | http://localhost:5001 | | The address of your hosted sweep server |
| HF_USERNAME | username | Hugging Face | Hugging Face username |
| HF_TOKEN | | Hugging Face | Hugging Face token |
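A small startup check can catch missing variables before a run begins. The sketch below is not part of the package: the function name is hypothetical, the variable names are copied from the table above, and treating every one of them as strictly required is an assumption.

```python
import os

# Names copied from the table above; whether each one is strictly required
# for your setup is an assumption of this sketch.
REQUIRED_VARIABLES = [
    "WANDB_PROJECT",
    "WANDB_API_KEY",
    "WANDB_ENTITY",
    "CHECKPOINT_DIRECTORY",
    "SWEEP_DIRECTORY",
    "SINGLE_RUN_CONFIG_PATH",
    "TRAINER_PATH",
    "SWEEP_SERVER_ADDRESS",
    "HF_USERNAME",
    "HF_TOKEN",
]


def missing_variables(environ=None):
    """Return the names of variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All environment variables are set.")
```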

Additional Environment Variables for Docker and Kubernetes

| Variable | Example | Source | Description |
| --- | --- | --- | --- |
| DOCKER_REPOSITORY_SERVER | gitlab.com:5050 | | Repository server of your docker container registry |
| DOCKER_REPOSITORY_PATH | myprojectGroup/myproject | | Repository path of your docker container registry |
| DOCKER_TRAINING_IMAGE_NAME | 2024.10.dev0 | | Trainer image name for docker build (dev or release) |
| DOCKER_SWEEP_SERVER_IMAGE_NAME | sweep-server | | Sweep server image name for docker build |
| DOCKER_USERNAME | username | | Username of your docker container registry |
| DOCKER_TOKEN | | | Token of your docker container registry |
| KUBE_NAMESPACE | my-training | | Your kubernetes namespace |
| KUBE_SWEEP_SERVER_ADDRESS | http://sweep-server:5001 | | The address of your hosted sweep server on kubernetes |
Run your training

This step depends on your project-specific setup, your hardware and your configuration. Start a wandb sweep by running the SweepServer or by using other options provided by Weights and Biases (wandb).

Then run one or more sweep agents with create_agent() or your specific implementation.
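One possible way to start several agents on a multi-GPU machine is one process per GPU, each restricted to its device via CUDA_VISIBLE_DEVICES. Whether this mapping suits your setup is an assumption, the helper name is hypothetical, and the script path is only a placeholder for your own agent entry point. The sketch just builds the launch plan and does not start anything.

```python
import os
import sys

# Placeholder path; point this at your own agent entry point.
AGENT_SCRIPT = "wandb_utils/sweep_agent.py"


def agent_launch_plan(gpu_ids, script=AGENT_SCRIPT, base_env=None):
    """Return one (command, environment) pair per GPU without starting anything."""
    base_env = dict(os.environ if base_env is None else base_env)
    plan = []
    for gpu_id in gpu_ids:
        env = dict(base_env)
        # Restrict each agent process to a single GPU.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        plan.append(([sys.executable, script], env))
    return plan


if __name__ == "__main__":
    for command, env in agent_launch_plan([0, 1]):
        print(command, "CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"])
```

Each pair can then be passed to subprocess.Popen(command, env=env) to actually start the agents.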

You can also start a single run by running your trainer subclass, e.g. trainer.py.

Metrics

The metrics of the runs can be retrieved from the Weights and Biases website. Their configuration is set in the run/sweep config, and the wandb login is set via the wandb environment variables.

Checkpoints

The checkpoints are saved to the relative directory path given by the env variable CHECKPOINT_DIRECTORY, which defaults to models. Subfolders are created for the best and latest checkpoints (whether they exist depends on the run/sweep config). Inside these folders, subfolders named after the timestamp of creation are created. There you will find the checkpoint directories of your runs, each named after the wandb run ID under which the run is logged on the Weights and Biases website.
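The resulting directory layout can be pictured with a short path-building sketch. The folder names best and latest and the timestamp format used here are assumptions for illustration only; check your own checkpoint directory for the names the framework actually writes.

```python
import os
from datetime import datetime
from pathlib import Path


def checkpoint_dir(kind, run_id, created, base=None):
    """Build <base>/<kind>/<timestamp>/<run_id> as described above.

    `kind` stands for the best or latest subfolder; the exact folder names
    and the timestamp format are assumptions of this sketch.
    """
    base = base or os.environ.get("CHECKPOINT_DIRECTORY", "models")
    return Path(base) / kind / created.strftime("%Y-%m-%d_%H-%M-%S") / run_id


if __name__ == "__main__":
    print(checkpoint_dir("best", "abc123", datetime.now()))
```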

The upload of the checkpoints of the trained model to Hugging Face can be configured in the run/sweep config.

When using Kubernetes, you can mount this checkpoint folder as a volume, e.g. via a PersistentVolumeClaim (PVC), to be able to retrieve the checkpoints after training.

Build docker images

This step also depends on your project-specific setup.

You can build docker images for the sweep agent and the sweep server.

Hint: For now, the configs explained in Setup Configs are baked into the docker image, so rebuild the sweep server image if you change the sweep config files. Alternatively, you can mount the config files as volumes (see Example for running on Kubernetes).

You will find example Dockerfiles in the root of this repository and example build scripts in the scripts directory.

Example for running on Kubernetes

You can run the (custom) built docker images with Kubernetes. The configs directory also contains templates for Kubernetes YAML files that match the example Dockerfiles.

For a sweep:
  • segmentation-training-with-sweep-server.yaml (cmd runs trainer from sweep agent)
  • sweep-server-service.yaml

Alternatively for a sweep:

  • segmentation-training-pod.yaml
  • sweep-server.yaml

In the example the sweep config files model.yaml, logging.yaml and dataset.yaml are given in addition to the mandatory sweep.yaml.

They are given in a ConfigMap named sweep-config-yaml and mounted by using a volume to replace the default config files. The ConfigMap must be created in the same Kubernetes namespace as used for the training.

To create the ConfigMap, add each config file as a key-value pair: the filename is the key and the pasted file content is the value.

The image in the original README shows an example for two files.
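The filename-as-key, content-as-value structure can also be generated instead of pasted by hand. The following sketch builds a ConfigMap manifest as a plain dictionary from all .yaml files in a directory; the function name is hypothetical, and the ConfigMap name and namespace in the demo are taken from the examples in this document.

```python
import json
import tempfile
from pathlib import Path


def build_config_map(name, namespace, config_dir):
    """Build a ConfigMap manifest with one key per .yaml file in config_dir."""
    data = {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(config_dir).glob("*.yaml"))
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": data,
    }


if __name__ == "__main__":
    # Demo with two throwaway files; use your real sweep config directory instead.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "sweep.yaml").write_text("method: grid\n", encoding="utf-8")
        Path(tmp, "model.yaml").write_text("name: demo\n", encoding="utf-8")
        manifest = build_config_map("sweep-config-yaml", "my-training", tmp)
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output can be saved to a file and applied with kubectl apply -f.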

As shown in classification-training-with-sweep-server.yaml, the ConfigMap is provided as a volume for the sweep server so that the files given in the ConfigMap fully replace the default configs/sweep folder.

Shown are the most important lines:

metadata:
  name: sweep-server
[...]
spec:
[...]
  containers:
[...]
      volumeMounts:
        - name: sweep-config-yaml
          mountPath: /configs/sweep
[...]
  volumes:
    - name: sweep-config-yaml
      configMap:
        name: sweep-config-yaml

For more details, see the Kubernetes docs.

For a single run:
  • classification-training-pod-single-run.yaml (cmd runs trainer directly)

For the single run it is also possible to provide the single-run.yaml in a ConfigMap e.g. single-run-yaml. The filename single-run.yaml is used as key and the file content as value inside the ConfigMap.

As visible in the example classification-training-pod-single-run.yaml, the ConfigMap is provided as a volume for the container in which the training runs (the sweep server is not needed). To replace only the default single-run.yaml, which is located at /workspace/configs/single-run.yaml in the docker image, only the single-run.yaml key of the ConfigMap is used as volume. The volume mount needs the subPath parameter so that only that single file is replaced.

Shown are the most important lines:

[...]
      containers:
        - name: pytorch-model-single-run
[...]
          volumeMounts:
            [...]
            - mountPath: /workspace/configs/single-run.yaml
              name: single-run-yaml
              subPath: single-run.yaml
[...]
      volumes:
        [...]
        - name: single-run-yaml
          configMap:
            name: single-run-yaml
            items:
              - key: single-run.yaml
                path: single-run.yaml

Development Setup

Install python requirements

pip install -r requirements.txt
pip install -r requirements-tests.txt

Entrypoints

The entry points below are just a demo for debugging and for writing tests. You need to complete all steps described in Usage to be able to run a minimal working example.

To start wandb single run without sweep:

python src/anhaltai_commons_pl_hyper/trainer.py

To start wandb sweep server:

python src/anhaltai_commons_pl_hyper/wandb_utils/sweep_server.py

To start a local sweep run that gets the sweep ID from the sweep server to execute its runs:

python src/anhaltai_commons_pl_hyper/wandb_utils/sweep_agent.py

Build package locally

python -m build 

Unit Tests and Integration Tests

  • Test scripts directory: tests
  • Integration test scripts directory: tests/integration
  • The integration tests in tests/integration are used to show minimal example project setups
  • All tests have to be run with the project root dir as working directory
  • Please do not mark the subdirectories named "src" as Python source folders, to avoid breaking the structure
  • To find all code modules during tests, the pythonpath is defined in the pyproject.toml file

This way all test functions (with prefix "test") are found and executed from the project root:

pytest tests
