Evaluate Deep Reinforcement Learning policies for heterogeneous Workload Management

Framework for evaluating workload management policies based on deep reinforcement learning for heterogeneous clusters.

Overview

HDeepRM is a Python framework for evaluating workload management policies based on deep reinforcement learning for heterogeneous clusters. It leverages the Batsim ecosystem to simulate a heterogeneous workload management context. This ecosystem is composed of the simulator, Batsim, and the decision system, PyBatsim.

HDeepRM provides a heterogeneity layer on top of PyBatsim, which adds support for user-defined resource hierarchies. It also models memory capacity and memory bandwidth conflicts, along with the interdependence that arises when jobs are consolidated or scattered across the data centre.

It offers a flexible API for developing deep reinforcement learning agents. These may be trained by providing real workload traces in SWF format along with platforms defined in the format specified in Platforms. They can be further evaluated and tested against classic policies.

Installation Prerequisites

HDeepRM is distributed as a Python package on PyPI. To download and install it, the following software is needed:

  • Python 3.6+. Find your OS in this installation guide.

  • Pip, the Python package manager. If not already available with the Python installation, follow the official guide.

Installation

To install HDeepRM, download the package from PyPI:

pip install --upgrade --user hdeeprm

If pip is mapped to Python 2.x, try:

pip3 install --upgrade --user hdeeprm

When working with multiple Python versions, use:

python3.6 -m pip install --upgrade --user hdeeprm

This should download the hdeeprm package with all its dependencies, which are:

  • defusedxml >= 0.5.0: secure XML generation and parsing.

  • gym >= 0.12.0: environment, actions and observations definitions.

  • lxml >= 4.3.2: generation of the XML tree. Backend for defusedxml.

  • numpy >= 1.16.2: efficient data structure operations.

  • procset >= 1.0: closed-interval sets for resource selection.

  • pybatsim >= 3.1.0: decision system and main interface to interact with Batsim.

  • torch >= 1.0.1.post2: deep learning library for agent definition.

Usage Prerequisites

Batsim handles the simulation side and is required to run HDeepRM experiments. Follow the official installation docs for instructions.
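Before launching anything, it can help to confirm that both executables are reachable from the shell. The following is a minimal sketch, not part of HDeepRM: it assumes the simulator binary is named batsim and that the launcher described under Launching experiments is installed as hdeeprm-launch. The helper find_missing is a hypothetical name used only for illustration.

```python
import shutil

# Executable names are assumptions: "batsim" for the simulator binary and
# "hdeeprm-launch" for the launcher described under "Launching experiments".
REQUIRED_TOOLS = ("batsim", "hdeeprm-launch")


def find_missing(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required executables were found.")
```

If hdeeprm-launch is reported missing after a pip install with --user, one likely cause is that the per-user scripts directory (for example ~/.local/bin on Linux) is not on PATH.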

Launching experiments

In order to experiment with HDeepRM, an integrated launcher is provided:

hdeeprm-launch -a <agent.py> -cw <custom_workload.json> -im <saved_model.pt> -om <to_save_model.pt> <options.json>

The options.json specifies the experiment parameters. The JSON structure is as follows:

{
  "seed": 0,
  "nb_resources": 0,
  "nb_jobs": 0,
  "workload_file_path": "",
  "platform_file_path": "",
  "pybatsim": {
    "log_level": "",
    "env": {
      "objective": "",
      "actions": {
        "selection": [
          {"": []}
        ],
        "void": false
      },
      "observation": "",
      "queue_sensitivity": 0.0
    },
    "agent": {
      "type": "",
      "run": "",
      "hidden": 0,
      "lr": 0.0,
      "gamma": 0.0
    }
  }
}

Global options:

  • seed - The random seed for evaluation reproducibility.

  • nb_resources - Total number of cores in the simulated platform.

  • nb_jobs - Total number of jobs to generate in the workload.

  • workload_file_path - Location of the original SWF formatted workload.

  • platform_file_path - Location of the original HDeepRM JSON formatted platform.

PyBatsim options:

  • log_level - Logging level for showing insights from the simulation. See Logging for reference on possible values.

PyBatsim - Environment options:

  • objective - Metric to be optimised by the agent. See Objectives for an explanation and recognised values.

  • actions - Subset of actions for the simulation. If not specified, all 37 actions in HDeepRM are used.

  • observation - Type of observation to use, one of normal, small or minimal.

  • queue_sensitivity - Sensitivity of the observation to variations in job queue size. See Hyperparameters - Queue Sensitivity.

PyBatsim - Common agent options:

  • type - Type of the scheduling agent, one of CLASSIC or LEARNING.

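PyBatsim - Learning agent options:

  • run, hidden, lr, gamma - Keys set only for LEARNING agents; see the JSON structure above for their expected types.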

This is an example of an options.json file for a classic agent:

{
  "seed": 2009,
  "nb_resources": 175,
  "nb_jobs": 1000,
  "workload_file_path": "/workspace/workloads/my_workload.swf",
  "platform_file_path": "/workspace/platforms/my_platform.json",
  "pybatsim": {
    "log_level": "DEBUG",
    "env": {
      "objective": "avg_utilization",
      "actions": {
        "selection": [
          {"shortest": ["high_mem_bw"]}
        ],
        "void": false
      },
      "observation": "normal",
      "queue_sensitivity": 0.05
    },
    "agent": {
      "type": "CLASSIC"
    }
  }
}

This is another example of an options.json file, in this case for a learning agent:

{
  "seed": 1995,
  "nb_resources": 175,
  "nb_jobs": 1000,
  "workload_file_path": "/workspace/workloads/my_workload.swf",
  "platform_file_path": "/workspace/platforms/my_platform.json",
  "pybatsim": {
    "log_level": "WARNING",
    "env": {
      "objective": "makespan",
      "actions": {
        "selection": [
          {"first": ["high_gflops", "high_mem_bw"]},
          {"smallest": [""]}
        ],
        "void": false
      },
      "queue_sensitivity": 0.01
    },
    "agent": {
      "type": "LEARNING",
      "run": "train",
      "hidden": 128,
      "lr": 0.001,
      "gamma": 0.99
    }
  }
}
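A small pre-flight check can catch malformed option files before a simulation starts. The sketch below is an illustration only: check_options and load_options are hypothetical helpers, and treating every key from the JSON structure above as mandatory is an assumption of this sketch rather than documented launcher behaviour.

```python
import json

# Key names are taken from the JSON structure shown above. Treating all of
# them as mandatory is an assumption of this sketch, not launcher behaviour.
TOP_LEVEL_KEYS = ("seed", "nb_resources", "nb_jobs",
                  "workload_file_path", "platform_file_path", "pybatsim")
LEARNING_KEYS = ("run", "hidden", "lr", "gamma")


def check_options(options):
    """Return a list of human-readable problems found in an options dict."""
    problems = [f"missing key: {key}" for key in TOP_LEVEL_KEYS
                if key not in options]
    agent = options.get("pybatsim", {}).get("agent", {})
    agent_type = agent.get("type")
    if agent_type not in ("CLASSIC", "LEARNING"):
        problems.append(
            f"agent type must be CLASSIC or LEARNING, got {agent_type!r}")
    if agent_type == "LEARNING":
        problems += [f"missing learning agent key: {key}"
                     for key in LEARNING_KEYS if key not in agent]
    return problems


def load_options(path):
    """Parse an options file; json.load rejects trailing commas and similar slips."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling check_options(load_options("options.json")) returns an empty list when nothing is flagged; the file name here is only a placeholder.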

Optional command line arguments are available:

  • -a - The file containing your developed learning agent for evaluation. See agent examples for reference.

  • -cw - A custom workload defined in Batsim JSON format. This is useful for proof-of-concept experiments, where defining your own workload in SWF is tedious.

  • -im - PyTorch trained models are usually saved in .pt files. This option allows for loading a previously trained model to bootstrap the agent.

  • -om - If you want to save the model after the simulation is finished, specify the output file in this option. This is usually combined with train runs.
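When experiments are scripted, the launcher invocation can be assembled from Python. This is a sketch under stated assumptions: build_launch_command is a hypothetical helper, the file names are placeholders, and only the flags listed above (-a, -cw, -im, -om) plus the positional options file are used.

```python
import shlex


def build_launch_command(options_path, agent=None, custom_workload=None,
                         input_model=None, output_model=None):
    """Assemble the argument list for hdeeprm-launch, skipping unset options."""
    command = ["hdeeprm-launch"]
    for flag, value in (("-a", agent), ("-cw", custom_workload),
                        ("-im", input_model), ("-om", output_model)):
        if value is not None:
            command += [flag, str(value)]
    # The options file is the final positional argument.
    command.append(str(options_path))
    return command


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = build_launch_command("options.json", agent="my_agent.py",
                               output_model="trained.pt")
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps paths with spaces intact when the command is later passed to subprocess.run.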
