Evaluate Deep Reinforcement Learning policies for heterogeneous Workload Management
Project description
HDeepRM
=======
Framework for evaluating Workload Management policies based on
Deep Reinforcement Learning for Heterogeneous Clusters.
.. include-overview-start
Overview
--------
HDeepRM is a Python framework for evaluating Workload Management policies
based on Deep Reinforcement Learning for Heterogeneous Clusters. It
leverages the `Batsim ecosystem <https://gitlab.inria.fr/batsim>`_
for simulating a heterogeneous Workload Management context. This is composed
of the Simulator, `Batsim <https://gitlab.inria.fr/batsim/batsim>`_ and the
Decision System, `PyBatsim <https://gitlab.inria.fr/batsim/pybatsim>`_.
HDeepRM provides a heterogeneity layer on top of PyBatsim, which adds support
for user-defined resource hierarchies. Memory and bandwidth conflicts are added
along with interdependence when consolidating or scattering jobs across the
data centre.
It offers a flexible API for developing Deep Reinforcement Learning agents.
These may be trained by providing real workload traces in
`SWF format <http://www.cs.huji.ac.il/labs/parallel/workload/swf.html>`_ along
with platforms defined in the format specified in `Platforms <TODO>`_. They can
be further evaluated and tested against traditional policies.
Installation Prerequisites
~~~~~~~~~~~~~~~~~~~~~~~~~~
HDeepRM is distributed as a Python package on
`PyPi <https://pypi.org/project/hdeeprm/>`_.
In order to download and install it, the following software is needed:
- Python3.6+, find your OS in this
`installation guide <https://realpython.com/installing-python/>`_.
- Pip, the Python package manager. If not already available with the Python
installation, follow the
`official guide <https://pip.pypa.io/en/stable/installing/>`_.
Installation
~~~~~~~~~~~~
In order to install HDeepRM, just download the package from PyPi:
.. code-block:: bash
pip install --user hdeeprm
If ``pip`` is mapped to Python 2.x, try:
.. code-block:: bash
pip3 install --user hdeeprm
When working with multiple Python versions, use:
.. code-block:: bash
python3.6 -m pip install --user hdeeprm
This should download the ``hdeeprm`` package with all its dependencies,
which are:
- ``defusedxml`` >= 0.5.0: secure XML generation and parsing.
- ``gym`` >= 0.11.0: environment, actions and observations definitions.
- ``lxml`` >= 4.3.1: generation of the XML tree. Backend for ``defusedxml``.
- ``numpy`` >= 1.16.1: efficient data structure operations.
- ``procset`` >= 1.0: closed-interval sets for resource selection.
- ``pybatsim`` >= 3.1.0: decision system and main interface to interact
with Batsim.
- ``torch`` >= 1.0.1: deep learning library for agent definition.
Usage Prerequisites
~~~~~~~~~~~~~~~~~~~
The simulation side is done by Batsim, which is needed in order to run
HDeepRM experiments. Follow the `official installation docs
<https://batsim.readthedocs.io/en/latest/installation.html>`_ for instructions.
Usage
~~~~~
In order to experiment with HDeepRM, an integrated launcher is provided:
.. code-block:: bash
hdeeprm-launch <agent.py> <options.json> --inmodel <saved_model.pt> --outmodel <to_save_model.pt>
The ``agent.py`` file contains your developed agent for evaluation.
See `agent examples <TODO>`_ for reference.
The ``options.json`` specifies the experiment parameters. The JSON structure
is as follows:
.. code-block:: json
{
"seed",
"nb_resources",
"nb_jobs",
"workload_file_path",
"platform_file_path,
"pybatsim": {
"log_level",
"env": {
"objective",
"queue_sensitivity"
},
"agent": {
"policy_pair",
"run",
"hidden",
"lr",
"gamma"
}
}
}
Global options:
* ``seed`` - The random seed for evaluation reproducibility.
* ``nb_resources`` - Total number of cores in the simulated platform.
* ``nb_jobs`` - Total number of jobs to generate in the workload.
* ``workload_file_path`` - Location of the original SWF formatted workload.
* ``platform_file_path`` - Location of the original
HDeepRM JSON formatted platform.
PyBatsim options:
* ``log_level`` - Logging level for showing insights from the simulation. See `Logging <https://docs.python.org/3.6/howto/logging.html>`_ for reference on possible values.
PyBatsim - Environment options:
* ``objective`` - Metric to be optimised by the agent. See `Objectives <TODO>`_ for an explanation and recognised values.
* ``queue_sensitivity`` - Sensitivity of the observation to variations in job queue size. See `Hyperparameters - Queue Sensitivity <TODO>`_.
PyBatsim - Agent options:
* ``policy_pair`` - For :class:`~hdeeprm.agent.ClassicAgent` and derived only. The job and resource selection policies. Policy pairs are further described in `Environment - Action Space <TODO>`_.
* ``run`` - For :class:`~hdeeprm.agent.LearningAgent` and derived only.
Type of run for the learning agent, can be *train* or *test*.
When training, the agent's inner model is updated,
whereas testing is meant for evaluation purposes.
* ``hidden`` - For :class:`~hdeeprm.agent.LearningAgent` and derived only. Number of units in each hidden layer from the agent's inner model. See `Hyperparameters - Hidden units <TODO>`_.
* ``lr`` - For :class:`~hdeeprm.agent.LearningAgent` and derived only. Learning rate for updating the agent's inner model. See `Hyperparameters - Learning rate <TODO>`_.
* ``gamma`` - For :class:`~hdeeprm.agent.LearningAgent` and derived only. Discount factor for rewards. See `Hyperparameters - Reward Discount Factor <TODO>`_.
This is an example of an ``options.json`` file
for a :class:`~hdeeprm.agent.ClassicAgent`:
.. code-block:: json
{
"seed": 2009,
"nb_resources": 2280,
"nb_jobs": 10000,
"workload_file_path": "/workspace/workloads/my_workload.swf",
"platform_file_path: "/workspace/platforms/my_platform.json",
"pybatsim": {
"log_level": "DEBUG",
"env": {
"objective": "avg_utilization",
"queue_sensitivity": 0.05
},
"agent": {
"policy_pair": "shortest-high_flops"
}
}
}
This is another example of an ``options.json`` file,
in this case for a :class:`~hdeeprm.agent.LearningAgent`:
.. code-block:: json
{
"seed": 1995,
"nb_resources": 2280,
"nb_jobs": 10000,
"workload_file_path": "/workspace/workloads/my_workload.swf",
"platform_file_path: "/workspace/platforms/my_platform.json",
"pybatsim": {
"log_level": "WARNING",
"env": {
"objective": "makespan",
"queue_sensitivity": 0.01
},
"agent": {
"run": "train",
"hidden": 128,
"lr": 0.001,
"gamma": 0.99
}
}
}
The ``inmodel`` optional argument may be used for providing a path
to a previously trained and saved model. HDeepRM will load this model
before starting the run.
The ``outmodel`` optional argument may be specified as a path for
saving the model after the run is finished. If not provide, the model
won't be saved. This is usually combined with *train* runs.
.. include-overview-end
=======
Framework for evaluating Workload Management policies based on
Deep Reinforcement Learning for Heterogeneous Clusters.
.. include-overview-start
Overview
--------
HDeepRM is a Python framework for evaluating Workload Management policies
based on Deep Reinforcement Learning for Heterogeneous Clusters. It
leverages the `Batsim ecosystem <https://gitlab.inria.fr/batsim>`_
for simulating a heterogeneous Workload Management context. This is composed
of the Simulator, `Batsim <https://gitlab.inria.fr/batsim/batsim>`_ and the
Decision System, `PyBatsim <https://gitlab.inria.fr/batsim/pybatsim>`_.
HDeepRM provides a heterogeneity layer on top of PyBatsim, which adds support
for user-defined resource hierarchies. Memory and bandwidth conflicts are added
along with interdependence when consolidating or scattering jobs across the
data centre.
It offers a flexible API for developing Deep Reinforcement Learning agents.
These may be trained by providing real workload traces in
`SWF format <http://www.cs.huji.ac.il/labs/parallel/workload/swf.html>`_ along
with platforms defined in the format specified in `Platforms <TODO>`_. They can
be further evaluated and tested against traditional policies.
Installation Prerequisites
~~~~~~~~~~~~~~~~~~~~~~~~~~
HDeepRM is distributed as a Python package on
`PyPi <https://pypi.org/project/hdeeprm/>`_.
In order to download and install it, the following software is needed:
- Python3.6+, find your OS in this
`installation guide <https://realpython.com/installing-python/>`_.
- Pip, the Python package manager. If not already available with the Python
installation, follow the
`official guide <https://pip.pypa.io/en/stable/installing/>`_.
Installation
~~~~~~~~~~~~
In order to install HDeepRM, just download the package from PyPi:
.. code-block:: bash
pip install --user hdeeprm
If ``pip`` is mapped to Python 2.x, try:
.. code-block:: bash
pip3 install --user hdeeprm
When working with multiple Python versions, use:
.. code-block:: bash
python3.6 -m pip install --user hdeeprm
This should download the ``hdeeprm`` package with all its dependencies,
which are:
- ``defusedxml`` >= 0.5.0: secure XML generation and parsing.
- ``gym`` >= 0.11.0: environment, actions and observations definitions.
- ``lxml`` >= 4.3.1: generation of the XML tree. Backend for ``defusedxml``.
- ``numpy`` >= 1.16.1: efficient data structure operations.
- ``procset`` >= 1.0: closed-interval sets for resource selection.
- ``pybatsim`` >= 3.1.0: decision system and main interface to interact
with Batsim.
- ``torch`` >= 1.0.1: deep learning library for agent definition.
Usage Prerequisites
~~~~~~~~~~~~~~~~~~~
The simulation side is done by Batsim, which is needed in order to run
HDeepRM experiments. Follow the `official installation docs
<https://batsim.readthedocs.io/en/latest/installation.html>`_ for instructions.
Usage
~~~~~
In order to experiment with HDeepRM, an integrated launcher is provided:
.. code-block:: bash
hdeeprm-launch <agent.py> <options.json> --inmodel <saved_model.pt> --outmodel <to_save_model.pt>
The ``agent.py`` file contains your developed agent for evaluation.
See `agent examples <TODO>`_ for reference.
The ``options.json`` specifies the experiment parameters. The JSON structure
is as follows:
.. code-block:: json
{
"seed",
"nb_resources",
"nb_jobs",
"workload_file_path",
"platform_file_path,
"pybatsim": {
"log_level",
"env": {
"objective",
"queue_sensitivity"
},
"agent": {
"policy_pair",
"run",
"hidden",
"lr",
"gamma"
}
}
}
Global options:
* ``seed`` - The random seed for evaluation reproducibility.
* ``nb_resources`` - Total number of cores in the simulated platform.
* ``nb_jobs`` - Total number of jobs to generate in the workload.
* ``workload_file_path`` - Location of the original SWF formatted workload.
* ``platform_file_path`` - Location of the original
HDeepRM JSON formatted platform.
PyBatsim options:
* ``log_level`` - Logging level for showing insights from the simulation. See `Logging <https://docs.python.org/3.6/howto/logging.html>`_ for reference on possible values.
PyBatsim - Environment options:
* ``objective`` - Metric to be optimised by the agent. See `Objectives <TODO>`_ for an explanation and recognised values.
* ``queue_sensitivity`` - Sensitivity of the observation to variations in job queue size. See `Hyperparameters - Queue Sensitivity <TODO>`_.
PyBatsim - Agent options:
* ``policy_pair`` - For :class:`~hdeeprm.agent.ClassicAgent` and derived only. The job and resource selection policies. Policy pairs are further described in `Environment - Action Space <TODO>`_.
* ``run`` - For :class:`~hdeeprm.agent.LearningAgent` and derived only.
Type of run for the learning agent, can be *train* or *test*.
When training, the agent's inner model is updated,
whereas testing is meant for evaluation purposes.
* ``hidden`` - For :class:`~hdeeprm.agent.LearningAgent` and derived only. Number of units in each hidden layer from the agent's inner model. See `Hyperparameters - Hidden units <TODO>`_.
* ``lr`` - For :class:`~hdeeprm.agent.LearningAgent` and derived only. Learning rate for updating the agent's inner model. See `Hyperparameters - Learning rate <TODO>`_.
* ``gamma`` - For :class:`~hdeeprm.agent.LearningAgent` and derived only. Discount factor for rewards. See `Hyperparameters - Reward Discount Factor <TODO>`_.
This is an example of an ``options.json`` file
for a :class:`~hdeeprm.agent.ClassicAgent`:
.. code-block:: json
{
"seed": 2009,
"nb_resources": 2280,
"nb_jobs": 10000,
"workload_file_path": "/workspace/workloads/my_workload.swf",
"platform_file_path: "/workspace/platforms/my_platform.json",
"pybatsim": {
"log_level": "DEBUG",
"env": {
"objective": "avg_utilization",
"queue_sensitivity": 0.05
},
"agent": {
"policy_pair": "shortest-high_flops"
}
}
}
This is another example of an ``options.json`` file,
in this case for a :class:`~hdeeprm.agent.LearningAgent`:
.. code-block:: json
{
"seed": 1995,
"nb_resources": 2280,
"nb_jobs": 10000,
"workload_file_path": "/workspace/workloads/my_workload.swf",
"platform_file_path: "/workspace/platforms/my_platform.json",
"pybatsim": {
"log_level": "WARNING",
"env": {
"objective": "makespan",
"queue_sensitivity": 0.01
},
"agent": {
"run": "train",
"hidden": 128,
"lr": 0.001,
"gamma": 0.99
}
}
}
The ``inmodel`` optional argument may be used for providing a path
to a previously trained and saved model. HDeepRM will load this model
before starting the run.
The ``outmodel`` optional argument may be specified as a path for
saving the model after the run is finished. If not provide, the model
won't be saved. This is usually combined with *train* runs.
.. include-overview-end
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
hdeeprm-0.1.0b1.tar.gz
(32.0 kB
view hashes)
Built Distribution
hdeeprm-0.1.0b1-py3-none-any.whl
(35.0 kB
view hashes)
Close
Hashes for hdeeprm-0.1.0b1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | cb16bf22eba7291103307f8e8426ef6a723dbc042bd30b6e32f4112aefa7b30b |
|
MD5 | a8260ebd68e6d1773bb8133d35057e03 |
|
BLAKE2b-256 | abd2d994abc7a0916f604b47c57e62a1cadff29945dc5d5dbe8b11fd545cb60c |