Covalent HPC Plugin
Covalent is a Pythonic workflow tool used to execute tasks on advanced computing hardware. This executor plugin uses PSI/J to allow Covalent to seamlessly interface with a variety of common high-performance computing job schedulers and pilot systems (e.g. Slurm, PBS, LSF, Flux, Cobalt, RADICAL-Pilot). For workflows to be deployable, users must have SSH access to the login node, access to the job scheduler, and write access to the remote filesystem.
Installation
Server Environment
To use this plugin with Covalent, install it using pip in whatever Python environment you use to run the Covalent server (your local machine by default):
pip install covalent-hpc-plugin
Run the following in Python to have Covalent automatically register the plugin:
import covalent
HPC Environment
Additionally, on the remote machine(s) where you plan to execute Covalent workflows with this plugin, ensure that the remote Python environment has Covalent and PSI/J installed:
pip install covalent psij-python
Note that the Python major and minor version numbers on both the local and remote machines must match to ensure reliable (un)pickling of the various objects.
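The version requirement above can be checked before dispatching anything. The following is a minimal sketch, not part of the plugin: `same_major_minor` is a hypothetical helper, and the remote version tuples are made-up values standing in for whatever `python --version` reports on your cluster.

```python
import sys


def same_major_minor(local, remote):
    """Return True when two version tuples share major and minor numbers."""
    return tuple(local[:2]) == tuple(remote[:2])


# Version of the interpreter running this script (the Covalent server side).
local_version = tuple(sys.version_info[:3])
print("local Python:", ".".join(str(part) for part in local_version))

# Hypothetical remote versions, used purely for illustration.
print(same_major_minor((3, 10, 4), (3, 10, 12)))  # only the patch number differs
print(same_major_minor((3, 10, 4), (3, 11, 0)))   # the minor number differs
```

Only the first two components are compared because the note above concerns major and minor numbers; patch-level differences are ignored.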
Usage
Default Configuration Parameters
By default, when you install the covalent-hpc-plugin and run import covalent for the first time, your Covalent configuration file (found at ~/.config/covalent/covalent.conf by default) will automatically be updated to include the following sections. These are the default values only; they do not cover every available parameter.
[executors.hpc]
address = ""
username = ""
ssh_key_file = "~/.ssh/id_rsa"
instance = "slurm"
launcher = "single"
inherit_environment = true
pre_launch_cmds = []
post_launch_cmds = []
shebang = "#!/bin/bash"
remote_python_exe = "python"
remote_workdir = "~/covalent-workdir"
create_unique_workdir = false
cache_dir = "~/.cache/covalent"
poll_freq = 60
[executors.hpc.environment]
[executors.hpc.resource_spec_kwargs]
node_count = 1
processes_per_node = 1
gpu_cores_per_process = 0
[executors.hpc.job_attributes_kwargs]
duration = 10
You can modify the parameters in the Covalent configuration file as needed, such as the address of the remote machine, the username to use when logging in, the ssh_key_file to use for authentication, the type of job scheduler (instance), and much more. Because PSI/J provides a common interface to many job schedulers, you only need to change instance to switch between them.
A full description of the input parameters is given in the docstrings of the HPCExecutor class.
Defining Resource Specifications and Job Attributes
Two of the most important sets of parameters are resource_spec_kwargs and job_attributes_kwargs, which specify the resources required for the job (e.g. number of nodes, number of processes per node) and the job attributes (e.g. duration, queue name), respectively.
resource_spec_kwargs is a dictionary of keyword arguments passed to PSI/J's ResourceSpecV1 class.
job_attributes_kwargs is a dictionary of keyword arguments passed to PSI/J's JobAttributes class.
The allowed types are listed in the PSI/J documentation for these classes.
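To make the two dictionaries concrete, the sketch below summarizes what a given pair of settings asks the scheduler for. It is an illustration only: `summarize_request` is a hypothetical helper, the values are examples, the fallback values mirror the defaults in the configuration file above, and the duration is assumed to be in minutes.

```python
from datetime import timedelta

# Example values; the key names are the ones used elsewhere in this document.
resource_spec_kwargs = {"node_count": 2, "processes_per_node": 24}
job_attributes_kwargs = {"duration": 30, "queue_name": "debug"}


def summarize_request(resource_spec, job_attributes):
    """Hypothetical helper: total process count and walltime of a request."""
    nodes = resource_spec.get("node_count", 1)
    per_node = resource_spec.get("processes_per_node", 1)
    # Assumption: "duration" is expressed in minutes.
    walltime = timedelta(minutes=job_attributes.get("duration", 10))
    return {"total_processes": nodes * per_node, "walltime": walltime}


summary = summarize_request(resource_spec_kwargs, job_attributes_kwargs)
print(summary["total_processes"], "processes for", summary["walltime"])
```

A quick check like this makes it easier to spot a request that exceeds a queue's limits before the scheduler rejects it.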
Using the Plugin in a Workflow: Approach 1
With the configuration file appropriately set up, one can run a workflow on the HPC machine as follows:
import covalent as ct

# Run this task with the executor settings from the configuration file.
@ct.electron(executor="HPCExecutor")
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)

dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)
Using the Plugin in a Workflow: Approach 2
If you wish to modify parameters within your Python script rather than relying solely on the Covalent configuration file, you can do so by instantiating a custom instance of the HPCExecutor class. An example with some commonly used parameters is shown below. By default, any parameters not specified in the HPCExecutor will be inherited from the configuration file.
import covalent as ct

executor = ct.executor.HPCExecutor(
    address="coolmachine.university.edu",
    username="UserName",
    ssh_key_file="~/.ssh/id_rsa",
    instance="slurm",
    remote_conda_env="myenv",
    environment={"HELLO": "WORLD"},
    resource_spec_kwargs={
        "node_count": 2,
        "processes_per_node": 24,
    },
    job_attributes_kwargs={
        "duration": 30,  # minutes
        "queue_name": "debug",
        "project_name": "AccountName",
    },
    launcher="single",
    remote_workdir="~/covalent-workdir",
)

@ct.electron(executor=executor)
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)

dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)
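The fallback to configuration-file values described in this section can be pictured as a dictionary merge in which explicitly passed arguments win. The sketch below illustrates that behavior only and is not the plugin's actual implementation: `resolve_parameters` is a hypothetical function, and `CONFIG_DEFAULTS` is a hand-copied subset of the defaults shown earlier.

```python
# Hand-copied subset of the [executors.hpc] defaults from the configuration file.
CONFIG_DEFAULTS = {
    "address": "",
    "username": "",
    "ssh_key_file": "~/.ssh/id_rsa",
    "instance": "slurm",
    "launcher": "single",
    "remote_workdir": "~/covalent-workdir",
}


def resolve_parameters(defaults, **overrides):
    """Hypothetical merge: explicit arguments override configuration values."""
    # Treat None as "not specified" so the configured value is kept.
    specified = {key: value for key, value in overrides.items() if value is not None}
    return {**defaults, **specified}


resolved = resolve_parameters(
    CONFIG_DEFAULTS,
    address="coolmachine.university.edu",
    username="UserName",
    instance=None,  # not specified, so "slurm" comes from the configuration
)
print(resolved["address"], resolved["instance"], resolved["launcher"])
```

In practice this means a script only needs to pass the parameters that differ from the configuration file.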
Working Example: Perlmutter
The following is a minimal working example to submit a Covalent job on NERSC's Perlmutter machine. It assumes that you have used the sshproxy utility to generate a certificate file in order to circumvent the need for multi-factor authentication for each login.
import covalent as ct

executor = ct.executor.HPCExecutor(
    address="perlmutter-p1.nersc.gov",
    username="UserName",
    ssh_key_file="~/.ssh/nersc",
    cert_file="~/.ssh/nersc-cert.pub",
    remote_conda_env="myenv",
    job_attributes_kwargs={
        "project_name": "ProjectName",
        "custom_attributes": {"slurm.constraint": "cpu", "slurm.qos": "debug"},
    },
)

@ct.electron(executor=executor)
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)

dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)
Troubleshooting
Most issues stem from the job scheduler details (i.e. the resource_spec_kwargs and the job_attributes_kwargs). If your job fails on the remote machine, set cleanup=False and then check the files left behind in the working directory, as well as the ~/.psij directory, for a history and various log files associated with your attempted job submissions.
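When digging through those directories, the most recently modified files are usually the relevant ones. The following is a small sketch using only the standard library, meant to be run on the machine where the directory lives. `newest_files` is a hypothetical helper, and nothing is assumed about the names or layout of the files PSI/J writes.

```python
from pathlib import Path


def newest_files(directory, limit=5):
    """Hypothetical helper: most recently modified files under a directory."""
    root = Path(directory).expanduser()
    if not root.is_dir():
        return []  # nothing to inspect, e.g. no job has been submitted yet
    files = [path for path in root.rglob("*") if path.is_file()]
    files.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    return files[:limit]


# List the latest files in the PSI/J directory mentioned above.
for path in newest_files("~/.psij"):
    print(path)
```

The same helper can be pointed at the remote working directory (~/covalent-workdir by default) to find the files left behind by a failed job.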
Release Notes
Release notes are available in the Changelog.
Credit
This plugin was developed by Andrew S. Rosen, building off of prior work by the Agnostiq team on the covalent-slurm-plugin.
If you use this plugin, be sure to cite Covalent as follows:
W. J. Cunningham, S. K. Radha, F. Hasan, J. Kanem, S. W. Neagle, and S. Sanand. Covalent. Zenodo, 2022. https://doi.org/10.5281/zenodo.5903364
License
Covalent is licensed under the Apache 2.0 License. Covalent may be distributed under other licenses upon request. See the LICENSE file or contact the support team for more details.