Commons PyTorch Lightning Trainer for Hyperparameter Optimization
AnhaltAI Commons PL Hyper
AnhaltAI Commons PyTorch Lightning Trainer Framework for Hyperparameter Optimization
Summary
A deep learning trainer based on PyTorch Lightning with a commonly usable setup for different deep learning tasks, supporting k-fold cross-validation for automated hyperparameter optimization. Runs are planned via sweeps from Weights and Biases (wandb), which are created from the supported configuration files.

Training on multiple GPUs with multiple wandb agent processes, built on the code of Weights and Biases and Lightning AI, is a central part of this framework.

The foundation provided by this framework must be extended with code for each specific AI learning task.

The package is available on PyPI and compatible with Python >= 3.10.
Usage
Install with pip
```bash
pip install anhaltai-commons-pl-hyper
```
Extend the implementation for your task
To use this framework for your specific task you have to extend the provided abstract classes and functions: you need to implement your Trainer, DataModule, TrainingModule and the preprocessing of your datasets for your specific AI learning task.
There are multiple integration tests in the tests/integration directory showing examples of how to use this framework for your AI training, e.g. for different tasks and data splitting modes.
You will find detailed information here: src/anhaltai_commons_pl_hyper/README.md
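A minimal sketch of what such an extension could look like; the import path is taken from this document, but the base-class name and required hooks are placeholders, so check src/anhaltai_commons_pl_hyper/README.md for the actual abstract API:

```python
# my_task/my_task_trainer.py -- hypothetical module, referenced via TRAINER_PATH
from anhaltai_commons_pl_hyper.trainer import Trainer  # assumed base-class name


class MyTaskTrainer(Trainer):
    """Task-specific trainer.

    Implement the abstract hooks of the base class here, e.g. building your
    DataModule, TrainingModule and dataset preprocessing (the exact method
    names are defined in the package README).
    """
```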
Extend sweep server and wandb agent
The package provides functions to run a sweep server that creates or resumes a Weights and Biases (wandb) sweep. Multiple agents can then be started; they get the sweep IDs from the server via a REST request and start an available run of the sweep.

To use them, create your own functions in your code base that call the provided functions create_agent() and SweepServer().main(). Feel free to extend or override these functions as needed. Having these entry points in your implementation enables the later step Build docker images.
Basic example:
.../wandb_utils/sweep_server.py
```python
from anhaltai_commons_pl_hyper.wandb_utils.sweep_server import SweepServer

if __name__ == "__main__":
    # load your env variables here
    SweepServer().main()  # run
```
.../wandb_utils/sweep_agent.py
```python
from anhaltai_commons_pl_hyper.wandb_utils.sweep_agent import create_agent

if __name__ == "__main__":
    # load your env variables here
    create_agent()  # run
```
To resume Weights and Biases (wandb) runs via the SweepServer, you need wandb installed in your system interpreter:

```bash
pip install wandb
```

Resuming a sweep is explained in the later section Setup Configs.
Configure logging for multiprocessing:

It is recommended to set custom logging options before calling create_agent() and SweepServer().main(), so that the logs of multiple processes are easier to read:
```python
import logging

log_format = "%(asctime)s %(name)s[%(process)d] %(levelname)s %(message)s"
logging.basicConfig(level=logging.INFO, format=log_format, datefmt="%Y-%m-%d %H:%M:%S")
```
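Combined with the agent entry point from the basic example above, this could look like:

```python
# .../wandb_utils/sweep_agent.py -- logging configured before the agent starts
import logging

from anhaltai_commons_pl_hyper.wandb_utils.sweep_agent import create_agent

log_format = "%(asctime)s %(name)s[%(process)d] %(levelname)s %(message)s"
logging.basicConfig(level=logging.INFO, format=log_format,
                    datefmt="%Y-%m-%d %H:%M:%S")

if __name__ == "__main__":
    # load your env variables here
    create_agent()  # run
```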
Setup Configs
Example config files are located in ./configs/
You will find their documentation here: docs/config-documentation.md

Supported data splitting modes are documented here: docs/data-splitting-documentation.md (in short: train, train+test, or train+test+val).
The location of the config files can be set with environment variables as explained in Setup Environment Variables.
Setup Environment Variables
Usage: first copy the .env-example file as .env to your project root and change its values as needed.
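One way to load the .env file in your entry points (a sketch assuming the python-dotenv package, which is not required by this framework) is:

```python
# hypothetical: pip install python-dotenv
from dotenv import load_dotenv

from anhaltai_commons_pl_hyper.wandb_utils.sweep_server import SweepServer

if __name__ == "__main__":
    load_dotenv()  # reads .env from the current working directory into os.environ
    SweepServer().main()
```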
Required Environment Variables for training
| Variable | Example | Source | Description |
|---|---|---|---|
| WANDB_PROJECT | myproject | wandb | Name of the wandb project (https://docs.wandb.ai/ref/python/init/) |
| WANDB_API_KEY | | wandb | API key of your wandb account |
| WANDB_ENTITY | mycompany | wandb | Name of the wandb entity (https://docs.wandb.ai/ref/python/init/) |
| CHECKPOINT_DIRECTORY | models | | Local directory to save the checkpoints |
| SWEEP_DIRECTORY | configs/sweep | | Local directory of the configs for your wandb sweeps |
| SINGLE_RUN_CONFIG_PATH | configs/single-run.yaml | | Local path of the single-run config for your single wandb run if not using a sweep |
| TRAINER_PATH | classification.classification_trainer | | Python module in which your trainer subclass is implemented for your learning task |
| SWEEP_SERVER_ADDRESS | http://localhost:5001 | | The address of your hosted sweep server |
| HF_USERNAME | username | Hugging Face | Hugging Face username |
| HF_TOKEN | | Hugging Face | Hugging Face token |
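For illustration, a .env using the examples from the table could look like this (all values are placeholders):

```
WANDB_PROJECT=myproject
WANDB_API_KEY=<your-wandb-api-key>
WANDB_ENTITY=mycompany
CHECKPOINT_DIRECTORY=models
SWEEP_DIRECTORY=configs/sweep
SINGLE_RUN_CONFIG_PATH=configs/single-run.yaml
TRAINER_PATH=classification.classification_trainer
SWEEP_SERVER_ADDRESS=http://localhost:5001
HF_USERNAME=username
HF_TOKEN=<your-huggingface-token>
```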
Additional Environment Variables for Docker and Kubernetes
| Variable | Example | Source | Description |
|---|---|---|---|
| DOCKER_REPOSITORY_SERVER | gitlab.com:5050 | | Repository server of your docker container registry |
| DOCKER_REPOSITORY_PATH | myprojectGroup/myproject | | Repository path of your docker container registry |
| DOCKER_TRAINING_IMAGE_NAME | 2024.10.dev0 | | Trainer image name for docker build (dev or release) |
| DOCKER_SWEEP_SERVER_IMAGE_NAME | sweep-server | | Sweep server image name for docker build |
| DOCKER_USERNAME | username | | Username of your docker container registry |
| DOCKER_TOKEN | | | Token of your docker container registry |
| KUBE_NAMESPACE | my-training | | Your Kubernetes namespace |
| KUBE_SWEEP_SERVER_ADDRESS | http://sweep-server:5001 | | The address of your hosted sweep server on Kubernetes |
Run your training
This step depends on your project-specific setup, your hardware and your configuration.

Run a wandb sweep by running the SweepServer, or use the other options provided by Weights and Biases (wandb). Then run one or more sweep agents with create_agent() or your specific implementation.

You can also start a single run by starting your trainer subclass, e.g. trainer.py.
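Assuming the entry-point scripts from the basic example above, a possible local sequence is (paths are placeholders for your own project layout):

```bash
# start the sweep server, which creates or resumes the sweep
python my_project/wandb_utils/sweep_server.py &

# start one or more agents; each fetches the sweep ID via REST and runs trials
python my_project/wandb_utils/sweep_agent.py &
python my_project/wandb_utils/sweep_agent.py &
```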
Metrics
The metrics of the runs can be retrieved from the Weights and Biases website. Metric logging is configured via the run/sweep config; the login is configured via the wandb environment variables.
Checkpoints
The checkpoints are saved to the relative directory path given by the env variable CHECKPOINT_DIRECTORY, which is models by default. Subfolders are created for the best and latest checkpoints (their existence depends on the run/sweep config). Inside these folders, subfolders with the timestamp of creation are created. There you will find the checkpoint directories for your runs, named by the wandb run ID that is logged on the Weights and Biases website.
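Putting this together, the resulting layout could look like the following sketch (the timestamp format and run IDs are illustrative):

```
models/
├── best/
│   └── 2024-11-18_12-00-00/
│       └── <wandb-run-id>/   # checkpoints of the best model of this run
└── latest/
    └── 2024-11-18_12-00-00/
        └── <wandb-run-id>/   # most recent checkpoints of this run
```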
The upload of the checkpoints of the trained model to Hugging Face can be configured in the run/sweep config.
When using Kubernetes it is possible to mount this checkpoint folder as a volume, e.g. via a PersistentVolumeClaim (PVC), to be able to retrieve the checkpoints after a training.
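A minimal sketch of such a mount, assuming a pre-created PersistentVolumeClaim named checkpoint-pvc (the claim name and mount path are placeholders):

```yaml
containers:
  - name: pytorch-model
    volumeMounts:
      - name: checkpoints
        mountPath: /workspace/models  # matches CHECKPOINT_DIRECTORY=models
volumes:
  - name: checkpoints
    persistentVolumeClaim:
      claimName: checkpoint-pvc
```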
Build docker images
This step also depends on your project-specific setup.

You can build docker images for the sweep agent and the sweep server.

Hint: the configs explained in Setup Configs are baked into the docker image for now, so you have to rebuild the sweep server image whenever you change the sweep config files. Alternatively, you can mount the config files as volumes (see Example for running on Kubernetes below).
You will find example Dockerfiles in the root of this repository and example build scripts in the scripts directory.
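The actual scripts may differ, but conceptually a build and push using the environment variables from above could look like this sketch:

```bash
# hypothetical sketch; see the build scripts in the scripts directory
IMAGE="$DOCKER_REPOSITORY_SERVER/$DOCKER_REPOSITORY_PATH/$DOCKER_TRAINING_IMAGE_NAME"
docker login "$DOCKER_REPOSITORY_SERVER" -u "$DOCKER_USERNAME" -p "$DOCKER_TOKEN"
docker build -t "$IMAGE" .
docker push "$IMAGE"
```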
Example for running on Kubernetes
You can run the (custom-)built docker images with Kubernetes. There are also templates for Kubernetes YAML files in the configs dir that fit the example Dockerfiles.
For a sweep:

- segmentation-training-with-sweep-server.yaml (cmd runs the trainer from the sweep agent)
- sweep-server-service.yaml

Alternatively, for a sweep:

- segmentation-training-pod.yaml
- sweep-server.yaml
In the example, the sweep config files model.yaml, logging.yaml and dataset.yaml are given in addition to the mandatory sweep.yaml.
They are given in a ConfigMap named sweep-config-yaml and mounted via a volume to replace the default config files. The ConfigMap must be created in the same Kubernetes namespace as used for the training.
To create the ConfigMap, add the filename and the content of each config file as a key-value pair: the filename is used as the key and the file content as the value.
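One way to create it, assuming the sweep config files lie in configs/sweep locally and using the example namespace from above, is kubectl's --from-file option, which uses each filename as key and the file content as value:

```bash
kubectl create configmap sweep-config-yaml \
  --from-file=configs/sweep \
  --namespace=my-training
```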
As shown in the classification-training-with-sweep-server.yaml, the ConfigMap is provided as a volume for the sweep server so that all files given in the ConfigMap fully replace the default configs/sweep folder.
The most important lines:
```yaml
metadata:
  name: sweep-server
  [...]
spec:
  [...]
  containers:
    [...]
    volumeMounts:
      - name: sweep-config-yaml
        mountPath: /configs/sweep
  [...]
  volumes:
    - name: sweep-config-yaml
      configMap:
        name: sweep-config-yaml
```
For more details, see the Kubernetes docs.
For a single run:

- classification-training-pod-single-run.yaml (cmd runs the trainer directly)
For the single run it is also possible to provide the single-run.yaml in a ConfigMap, e.g. single-run-yaml. The filename single-run.yaml is used as the key and the file content as the value inside the ConfigMap.
As visible in the example classification-training-pod-single-run.yaml, the ConfigMap is provided as a volume for the container in which the training runs (a sweep server is not needed). To replace only the default single-run.yaml, located at /workspace/configs/single-run.yaml in the docker image, only the single-run.yaml key of the ConfigMap is used for the volume. The subPath parameter of the volume mount is necessary to replace just that single file.
The most important lines:
```yaml
[...]
containers:
  - name: pytorch-model-single-run
    [...]
    volumeMounts:
      [...]
      - mountPath: /workspace/configs/single-run.yaml
        name: single-run-yaml
        subPath: single-run.yaml
[...]
volumes:
  [...]
  - name: single-run-yaml
    configMap:
      name: single-run-yaml
      items:
        - key: single-run.yaml
          path: single-run.yaml
```
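This ConfigMap can again be created with kubectl, assuming the file lies at configs/single-run.yaml locally:

```bash
kubectl create configmap single-run-yaml \
  --from-file=single-run.yaml=configs/single-run.yaml \
  --namespace=my-training
```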
Development Setup
Install python requirements
```bash
pip install -r requirements.txt
pip install -r requirements-tests.txt
```
Entrypoints
The following entry points are just a demo for debugging and for writing tests. You will need to complete all steps described in Usage to be able to run a minimal working example.
To start a wandb single run without a sweep:

```bash
python src/anhaltai_commons_pl_hyper/trainer.py
```

To start the wandb sweep server:

```bash
python src/anhaltai_commons_pl_hyper/wandb_utils/sweep_server.py
```

To start a local sweep agent that gets the sweep ID from the sweep server and executes its runs:

```bash
python src/anhaltai_commons_pl_hyper/wandb_utils/sweep_agent.py
```
Build package locally
```bash
python -m build
```
Unit Tests and Integration Tests
- Test scripts directory: tests
- Integration test scripts directory: tests/integration
- The integration tests in tests/integration are used to show minimal example project setups
- All tests have to be run from the project root dir as workdir
- Please do not mark the subdirectories named "src" as Python source folders, to avoid breaking the structure
- To find all code modules during tests, the pythonpath is defined in the pyproject.toml file

This way all test functions (with prefix "test") are found and executed from the project root:

```bash
pytest tests
```