Evaluate Deep Reinforcement Learning policies for heterogeneous Workload Management
Project description
Framework for evaluating workload management policies based on deep reinforcement learning for heterogeneous clusters.
Overview
HDeepRM is a Python framework for evaluating workload management policies based on deep reinforcement learning for heterogeneous clusters. It leverages the Batsim ecosystem to simulate a heterogeneous workload management context; the ecosystem is composed of the simulator, Batsim, and the decision system, PyBatsim.
HDeepRM provides a heterogeneity layer on top of PyBatsim which adds support for user-defined resource hierarchies. It models memory capacity and bandwidth conflicts, as well as the interdependence that arises when consolidating or scattering jobs across the data centre.
It offers a flexible API for developing deep reinforcement learning agents. These may be trained by providing real workload traces in SWF format along with platforms defined in the format specified in Platforms, and can then be evaluated and tested against classic policies.
Installation Prerequisites
HDeepRM is distributed as a Python package on PyPI. To download and install it, the following software is needed:
Python3.6+, find your OS in this installation guide.
Pip, the Python package manager. If not already available with the Python installation, follow the official guide.
Installation
To install HDeepRM, download the package from PyPI:
pip install --user hdeeprm
If pip is mapped to Python 2.x, try:
pip3 install --user hdeeprm
When working with multiple Python versions, use:
python3.6 -m pip install --user hdeeprm
This should download the hdeeprm package with all its dependencies, which are:
defusedxml >= 0.5.0: secure XML generation and parsing.
gym >= 0.12.0: environment, actions and observations definitions.
lxml >= 4.3.2: generation of the XML tree. Backend for defusedxml.
numpy >= 1.16.2: efficient data structure operations.
procset >= 1.0: closed-interval sets for resource selection.
pybatsim >= 3.1.0: decision system and main interface to interact with Batsim.
torch >= 1.0.1.post2: deep learning library for agent definition.
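After installation, the dependency versions above can be checked programmatically. The following is an illustrative sketch (not part of HDeepRM itself), using only the standard library; the package names and minimum versions are taken from the list above.

```python
from importlib.metadata import version, PackageNotFoundError

# Minimum versions from the dependency list above.
MINIMUMS = {
    "defusedxml": (0, 5, 0),
    "gym": (0, 12, 0),
    "lxml": (4, 3, 2),
    "numpy": (1, 16, 2),
    "procset": (1, 0),
    "pybatsim": (3, 1, 0),
    "torch": (1, 0, 1),
}

def parse_version(text):
    """Turn '1.16.2' into (1, 16, 2), stopping at suffixes like 'post2'."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def check_dependencies(minimums=MINIMUMS):
    """Return a dict mapping each package to 'ok', 'outdated' or 'missing'."""
    status = {}
    for name, minimum in minimums.items():
        try:
            installed = parse_version(version(name))
        except PackageNotFoundError:
            status[name] = "missing"
            continue
        status[name] = "ok" if installed >= minimum else "outdated"
    return status
```

Running `check_dependencies()` after `pip install` quickly reveals whether anything failed to resolve.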
Usage Prerequisites
Simulation is handled by Batsim, which is required to run HDeepRM experiments. Follow the official installation docs for instructions.
Usage
To experiment with HDeepRM, an integrated launcher is provided:
hdeeprm-launch -a <agent.py> -im <saved_model.pt> -om <to_save_model.pt> <options.json>
The options.json specifies the experiment parameters. The JSON structure is as follows:
{
    "seed": "",
    "nb_resources": "",
    "nb_jobs": "",
    "workload_file_path": "",
    "platform_file_path": "",
    "pybatsim": {
        "log_level": "",
        "env": {
            "objective": "",
            "queue_sensitivity": ""
        },
        "agent": {
            "type": "",
            "policy_pair": "",
            "run": "",
            "hidden": "",
            "lr": "",
            "gamma": ""
        }
    }
}
Global options:
seed - The random seed for evaluation reproducibility.
nb_resources - Total number of cores in the simulated platform.
nb_jobs - Total number of jobs to generate in the workload.
workload_file_path - Location of the original SWF formatted workload.
platform_file_path - Location of the original HDeepRM JSON formatted platform.
PyBatsim options:
log_level - Logging level for showing insights from the simulation. See Logging for reference on possible values.
PyBatsim - Environment options:
objective - Metric to be optimised by the agent. See Objectives for an explanation and recognised values.
queue_sensitivity - Sensitivity of the observation to variations in job queue size. See Hyperparameters - Queue Sensitivity.
PyBatsim - Common agent options:
type - Type of the scheduling agent, one of CLASSIC or LEARNING.
PyBatsim - Classic agent options:
policy_pair - The job and resource selection policies. Policy pairs are further described in Environment - Action Space.
PyBatsim - Learning agent options:
run - Type of run for the learning agent, one of train or test. When training, the agent’s inner model is updated, whereas testing is meant for evaluation purposes.
hidden - Number of units in each hidden layer from the agent’s inner model. See Hyperparameters - Hidden units.
lr - Learning rate for updating the agent’s inner model. See Hyperparameters - Learning rate.
gamma - Discount factor for rewards. See Hyperparameters - Reward Discount Factor.
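Before launching an experiment, the options file can be sanity-checked against the structure described above. Below is a minimal sketch (not part of the HDeepRM package; the helper names are illustrative) assuming the key layout documented here:

```python
import json

# Required top-level keys, taken from the option descriptions above.
GLOBAL_KEYS = {"seed", "nb_resources", "nb_jobs",
               "workload_file_path", "platform_file_path", "pybatsim"}
AGENT_TYPES = {"CLASSIC", "LEARNING"}

def validate_options(options):
    """Raise ValueError if the options dict misses required keys."""
    missing = GLOBAL_KEYS - options.keys()
    if missing:
        raise ValueError(f"missing global options: {sorted(missing)}")
    agent = options["pybatsim"]["agent"]
    if agent["type"] not in AGENT_TYPES:
        raise ValueError(f"unknown agent type: {agent['type']}")
    if agent["type"] == "CLASSIC" and "policy_pair" not in agent:
        raise ValueError("CLASSIC agents need a policy_pair")
    if agent["type"] == "LEARNING" and "run" not in agent:
        raise ValueError("LEARNING agents need a run mode")
    return True

def load_options(path):
    """Load and validate an options.json file."""
    with open(path) as handle:
        options = json.load(handle)
    validate_options(options)
    return options
```

A check like this catches misspelled keys before a long simulation starts, rather than partway through.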
This is an example of an options.json file for a classic agent:
{
    "seed": 2009,
    "nb_resources": 2280,
    "nb_jobs": 10000,
    "workload_file_path": "/workspace/workloads/my_workload.swf",
    "platform_file_path": "/workspace/platforms/my_platform.json",
    "pybatsim": {
        "log_level": "DEBUG",
        "env": {
            "objective": "avg_utilization",
            "queue_sensitivity": 0.05
        },
        "agent": {
            "type": "CLASSIC",
            "policy_pair": "shortest-high_flops"
        }
    }
}
This is another example of an options.json file, in this case for a learning agent:
{
    "seed": 1995,
    "nb_resources": 2280,
    "nb_jobs": 10000,
    "workload_file_path": "/workspace/workloads/my_workload.swf",
    "platform_file_path": "/workspace/platforms/my_platform.json",
    "pybatsim": {
        "log_level": "WARNING",
        "env": {
            "objective": "makespan",
            "queue_sensitivity": 0.01
        },
        "agent": {
            "type": "LEARNING",
            "run": "train",
            "hidden": 128,
            "lr": 0.001,
            "gamma": 0.99
        }
    }
}
Extra command line arguments are available for learning agent simulations:
The agent.py file contains your developed learning agent for evaluation. See agent examples for reference.
The input model argument (-im) may be used to provide a path to a previously trained and saved model. HDeepRM loads this model before starting the run.
The output model argument (-om) may be specified as a path for saving the model after the run finishes. If not provided, the model won't be saved. This is usually combined with train runs.
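Combining these arguments, a typical workflow first trains the agent and saves its model, then reloads that model for evaluation. A sketch of such a workflow (the agent file, model file, and the two options files are hypothetical names; the options files would set "run" to "train" and "test" respectively):

```shell
# Train: the agent's inner model is updated and saved at the end of the run.
hdeeprm-launch -a my_agent.py -om trained_model.pt train_options.json

# Test: the previously saved model is loaded before the evaluation run.
hdeeprm-launch -a my_agent.py -im trained_model.pt test_options.json
```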