Evaluate Deep Reinforcement Learning policies for heterogeneous Workload Management
Framework for evaluating workload management policies based on deep reinforcement learning for heterogeneous clusters.
Overview
HDeepRM is a Python framework for evaluating workload management policies based on deep reinforcement learning for heterogeneous clusters. It leverages the Batsim ecosystem to simulate a heterogeneous workload management context. This ecosystem is composed of the simulator, Batsim, and the decision system, PyBatsim.
HDeepRM provides a heterogeneity layer on top of PyBatsim, which adds support for user-defined resource hierarchies. It also models memory capacity and bandwidth conflicts, along with the interdependence that arises when consolidating or scattering jobs across the data centre.
It offers a flexible API for developing deep reinforcement learning agents. These may be trained by providing real workload traces in SWF format along with platforms defined in the format specified in Platforms. They can be further evaluated and tested against classic policies.
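For readers unfamiliar with SWF, the following is a minimal sketch of reading such a trace in plain Python. It relies only on the general SWF layout (lines starting with ";" are comments, every other line is one job with whitespace-separated fields, the first five being job number, submit time, wait time, run time and allocated processors). The function name, the field labels and the sample data are assumptions made for illustration; this is not how HDeepRM itself parses workloads.

```python
from typing import Dict, List

# Labels for the first five SWF fields; the names are illustrative.
SWF_FIELDS = ["job_id", "submit_time", "wait_time", "run_time", "allocated_processors"]


def parse_swf(text: str) -> List[Dict[str, float]]:
    """Return one dict per job, keeping only the first five SWF fields."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and ';' comment/header lines.
        if not line or line.startswith(";"):
            continue
        values = line.split()
        jobs.append({name: float(value) for name, value in zip(SWF_FIELDS, values)})
    return jobs


# Invented two-job trace, only for demonstration.
sample = """; Comment header line
1 0 5 120 4 -1 -1 4 150 -1 1 1 1 -1 1 -1 -1 -1
2 10 0 60 2 -1 -1 2 60 -1 1 2 1 -1 1 -1 -1 -1
"""
jobs = parse_swf(sample)
print(len(jobs), jobs[0]["run_time"])
```

A quick inspection like this can help when choosing nb_jobs or checking that a trace is well formed before launching an experiment.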
Installation Prerequisites
HDeepRM is distributed as a Python package on PyPI. To download and install it, the following software is needed:
Python 3.6+: find your OS in this installation guide.
Pip, the Python package manager. If not already available with the Python installation, follow the official guide.
Installation
To install HDeepRM, download the package from PyPI:
pip install --upgrade --user hdeeprm
If pip is mapped to Python 2.x, try:
pip3 install --upgrade --user hdeeprm
When working with multiple Python versions, use:
python3.6 -m pip install --upgrade --user hdeeprm
This should download the hdeeprm package with all its dependencies, which are:
defusedxml >= 0.5.0: secure XML generation and parsing.
gym >= 0.12.0: environment, actions and observations definitions.
lxml >= 4.3.2: generation of the XML tree. Backend for defusedxml.
numpy >= 1.16.2: efficient data structure operations.
procset >= 1.0: closed-interval sets for resource selection.
pybatsim >= 3.1.0: decision system and main interface to interact with Batsim.
torch >= 1.0.1.post2: deep learning library for agent definition.
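To confirm that the dependencies above were actually installed, a small check along these lines may help. It is a sketch that assumes Python 3.8 or newer, because it uses the standard importlib.metadata module; the helper name installed_versions is hypothetical, and the script only reports versions without comparing them against the minimums listed above.

```python
from importlib import metadata  # standard library from Python 3.8 onwards

# Distribution names taken from the dependency list above.
REQUIRED = ["defusedxml", "gym", "lxml", "numpy", "procset", "pybatsim", "torch"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```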
Usage Prerequisites
Batsim handles the simulation side and is required to run HDeepRM experiments. Follow the official installation docs for instructions.
Launching experiments
In order to experiment with HDeepRM, an integrated launcher is provided:
hdeeprm-launch -a <agent.py> -cw <custom_workload.json> -im <saved_model.pt> -om <to_save_model.pt> <options.json>
The options.json specifies the experiment parameters. The JSON structure is as follows:
{
  "seed": 0,
  "nb_resources": 0,
  "nb_jobs": 0,
  "workload_file_path": "",
  "platform_file_path": "",
  "pybatsim": {
    "log_level": "",
    "env": {
      "objective": "",
      "actions": {
        "selection": [
          {"": []}
        ],
        "void": false
      },
      "observation": "",
      "queue_sensitivity": 0.0
    },
    "agent": {
      "type": "",
      "run": "",
      "hidden": 0,
      "lr": 0.0,
      "gamma": 0.0
    }
  }
}
Global options:
seed - The random seed for evaluation reproducibility.
nb_resources - Total number of cores in the simulated platform.
nb_jobs - Total number of jobs to generate in the workload.
workload_file_path - Location of the original SWF formatted workload.
platform_file_path - Location of the original HDeepRM JSON formatted platform.
PyBatsim options:
log_level - Logging level for showing insights from the simulation. See Logging for reference on possible values.
PyBatsim - Environment options:
objective - Metric to be optimised by the agent. See Objectives for an explanation and recognised values.
actions - Subset of actions for the simulation. If not specified, all 37 actions in HDeepRM are used.
observation - Type of observation to use, one of normal, small or minimal.
queue_sensitivity - Sensitivity of the observation to variations in job queue size. See Hyperparameters - Queue Sensitivity.
PyBatsim - Common agent options:
type - Type of the scheduling agent, one of CLASSIC or LEARNING.
PyBatsim - Learning agent options:
run - Type of run for the learning agent, one of train or test. When training, the agent’s inner model is updated, whereas testing is meant for evaluation purposes.
hidden - Number of units in each hidden layer from the agent’s inner model. See Hyperparameters - Hidden units.
lr - Learning rate for updating the agent’s inner model. See Hyperparameters - Learning rate.
gamma - Discount factor for rewards. See Hyperparameters - Reward Discount Factor.
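To give an intuition for the gamma option, the sketch below computes textbook discounted returns, where each step's return is its reward plus gamma times the return of the following step. This is the standard reinforcement learning definition and is assumed here for illustration; the function name is hypothetical and HDeepRM's agents may apply the discount differently internally.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every step t, working backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# With gamma = 0.5, later rewards count for less at earlier steps.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

A gamma close to 1 makes the agent weigh long-term outcomes almost as much as immediate ones, whereas a small gamma favours immediate rewards.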
This is an example of an options.json file for a classic agent:
{
  "seed": 2009,
  "nb_resources": 175,
  "nb_jobs": 1000,
  "workload_file_path": "/workspace/workloads/my_workload.swf",
  "platform_file_path": "/workspace/platforms/my_platform.json",
  "pybatsim": {
    "log_level": "DEBUG",
    "env": {
      "objective": "avg_utilization",
      "actions": {
        "selection": [
          {"shortest": ["high_mem_bw"]}
        ],
        "void": false
      },
      "observation": "normal",
      "queue_sensitivity": 0.05
    },
    "agent": {
      "type": "CLASSIC"
    }
  }
}
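Before launching, it can be useful to check that an options file is valid JSON and that the enumerated fields hold one of the values documented above. The following is a minimal sketch using only the standard library. The function name check_options is hypothetical, the checks cover only the observation, type and run fields, and the launcher itself may validate more or differently.

```python
import json


def check_options(options: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    pybatsim = options.get("pybatsim", {})
    env = pybatsim.get("env", {})
    agent = pybatsim.get("agent", {})

    # observation is only checked when present.
    observation = env.get("observation")
    if observation is not None and observation not in ("normal", "small", "minimal"):
        problems.append(f"env.observation has unexpected value {observation!r}")

    agent_type = agent.get("type")
    if agent_type not in ("CLASSIC", "LEARNING"):
        problems.append(f"agent.type must be CLASSIC or LEARNING, got {agent_type!r}")

    # run only applies to learning agents.
    if agent_type == "LEARNING" and agent.get("run") not in ("train", "test"):
        problems.append(f"agent.run must be train or test, got {agent.get('run')!r}")
    return problems


# json.loads raises json.JSONDecodeError on malformed input such as trailing commas.
text = '{"pybatsim": {"env": {"observation": "normal"}, "agent": {"type": "CLASSIC"}}}'
options = json.loads(text)
print(check_options(options))  # []
```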
This is another example of an options.json file, in this case for a learning agent:
{
  "seed": 1995,
  "nb_resources": 175,
  "nb_jobs": 1000,
  "workload_file_path": "/workspace/workloads/my_workload.swf",
  "platform_file_path": "/workspace/platforms/my_platform.json",
  "pybatsim": {
    "log_level": "WARNING",
    "env": {
      "objective": "makespan",
      "actions": {
        "selection": [
          {"first": ["high_gflops", "high_mem_bw"]},
          {"smallest": [""]}
        ],
        "void": false
      },
      "queue_sensitivity": 0.01
    },
    "agent": {
      "type": "LEARNING",
      "run": "train",
      "hidden": 128,
      "lr": 0.001,
      "gamma": 0.99
    }
  }
}
Optional command line arguments are available:
-a - The file containing your developed learning agent for evaluation. See agent examples for reference.
-cw - A custom workload defined in Batsim JSON format. This is useful for proof-of-concept experiments, where defining your own workload in SWF would be tedious.
-im - PyTorch trained models are usually saved in .pt files. This option allows for loading a previously trained model to bootstrap the agent.
-om - If you want to save the model after the simulation is finished, specify the output file in this option. This is usually combined with train runs.
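When running many experiments, the launcher invocation can be assembled programmatically. The sketch below builds the argument list from the flags described above. The helper build_launch_command and the file names my_agent.py and trained.pt are hypothetical, and the command is only printed here rather than executed.

```python
import shlex
from typing import List, Optional


def build_launch_command(
    options_path: str,
    agent: Optional[str] = None,
    custom_workload: Optional[str] = None,
    input_model: Optional[str] = None,
    output_model: Optional[str] = None,
) -> List[str]:
    """Assemble an hdeeprm-launch argument list, skipping flags that are not set."""
    command = ["hdeeprm-launch"]
    flags = (
        ("-a", agent),
        ("-cw", custom_workload),
        ("-im", input_model),
        ("-om", output_model),
    )
    for flag, value in flags:
        if value is not None:
            command += [flag, value]
    # The options file is the positional argument and goes last.
    command.append(options_path)
    return command


command = build_launch_command("options.json", agent="my_agent.py", output_model="trained.pt")
print(" ".join(shlex.quote(part) for part in command))
```

With HDeepRM and Batsim installed, the resulting list could be passed to subprocess.run(command, check=True) to start the experiment.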