Covalent HPC Plugin

Covalent is a Pythonic workflow tool used to execute tasks on advanced computing hardware. This executor plugin uses PSI/J to allow Covalent to seamlessly interface with a variety of common high-performance computing job schedulers and pilot systems (e.g. Slurm, PBS, LSF, Flux, Cobalt, RADICAL-Pilot). For workflows to be deployable, users must have SSH access to the login node, access to the job scheduler, and write access to the remote filesystem.

Installation

Server Environment

To use this plugin with Covalent, simply install it using pip in whatever Python environment you use to run the Covalent server (your local machine by default):

pip install covalent-hpc-plugin

Run the following in Python to have Covalent automatically register the plugin:

import covalent

HPC Environment

Additionally, on the remote machine(s) where you plan to execute Covalent workflows with this plugin, ensure that the remote Python environment has Covalent and PSI/J installed:

pip install covalent psij-python

Note that the Python major and minor version numbers on both the local and remote machines must match to ensure reliable (un)pickling of the various objects.
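
As a quick sanity check, you can run the following in both the local and remote environments and confirm that the output matches:

python -c "import sys; print(sys.version_info[:2])"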

Usage

Default Configuration Parameters

By default, when you install the covalent-hpc-plugin and run import covalent for the first time, your Covalent configuration file (found at ~/.config/covalent/covalent.conf by default) will automatically be updated to include the following sections. These are not all of the available parameters; they are simply the ones with default values.

[executors.hpc]
address = ""
username = ""
ssh_key_file = "~/.ssh/id_rsa"
instance = "slurm"
launcher = "single"
inherit_environment = true
pre_launch_cmds = []
post_launch_cmds = []
shebang = "#!/bin/bash"
remote_python_exe = "python"
remote_workdir = "~/covalent-workdir"
create_unique_workdir = false
cache_dir = "~/.cache/covalent"
poll_freq = 60

[executors.hpc.environment]

[executors.hpc.resource_spec_kwargs]
node_count = 1
processes_per_node = 1
gpu_cores_per_process = 0

[executors.hpc.job_attributes_kwargs]
duration = 10

You can modify these parameters in the Covalent configuration file as needed, such as the address of the remote machine, the username to use when logging in, the ssh_key_file to use for authentication, the type of job scheduler (instance), and much more. Note that PSI/J provides a common interface to many job schedulers, so you only need to change the instance parameter to switch between them.
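
If you prefer to make these changes programmatically rather than editing the file by hand, Covalent's set_config utility can update individual fields (a minimal sketch; the values below are placeholders for your own machine and account):

import covalent as ct

# Update selected fields of the [executors.hpc] section
ct.set_config("executors.hpc.address", "coolmachine.university.edu")
ct.set_config("executors.hpc.username", "UserName")
ct.set_config("executors.hpc.instance", "slurm")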

A full description of the input parameters is provided in the docstrings of the HPCExecutor class, linked below:

https://github.com/Quantum-Accelerators/covalent-hpc-plugin/blob/25785d0c546851c4b11e5c227f2e7aebb12aba8c/covalent_hpc_plugin/hpc.py#L115-L159

Defining Resource Specifications and Job Attributes

Two of the most important sets of parameters are resource_spec_kwargs and job_attributes_kwargs, which specify the resources required for the job (e.g. number of nodes, number of processes per node, etc.) and the job attributes (e.g. duration, queue name, etc.), respectively.

  1. resource_spec_kwargs is a dictionary of keyword arguments passed to PSI/J's ResourceSpecV1 class
  2. job_attributes_kwargs is a dictionary of keyword arguments passed to PSI/J's JobAttributes class.

The allowed types for each keyword argument are listed in the PSI/J documentation.
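
For instance, a request for two nodes with 24 processes each and a 30-minute walltime in a debug queue might look like the following (illustrative values only):

resource_spec_kwargs = {"node_count": 2, "processes_per_node": 24}
job_attributes_kwargs = {"duration": 30, "queue_name": "debug"}  # duration in minutes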

Using the Plugin in a Workflow: Approach 1

With the configuration file appropriately set up, one can run a workflow on the HPC machine as follows:

import covalent as ct

@ct.electron(executor="HPCExecutor")
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)


dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)
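
Note that ct.get_result returns immediately by default; to block until the workflow has finished, pass wait=True:

result = ct.get_result(dispatch_id, wait=True)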

Using the Plugin in a Workflow: Approach 2

If you wish to modify the various parameters within your Python script rather than relying solely on the Covalent configuration file, you can instead instantiate the HPCExecutor class with custom parameters. An example with some commonly used parameters is shown below. By default, any parameters not specified in the HPCExecutor will be inherited from the configuration file.

import covalent as ct

executor = ct.executor.HPCExecutor(
    address="coolmachine.university.edu",
    username="UserName",
    ssh_key_file="~/.ssh/id_rsa",
    instance="slurm",
    remote_conda_env="myenv",
    environment={"HELLO": "WORLD"},
    resource_spec_kwargs={
        "node_count": 2,
        "processes_per_node": 24
    },
    job_attributes_kwargs={
        "duration": 30, # minutes
        "queue_name": "debug",
        "project_name": "AccountName",
    },
    launcher="single",
    remote_workdir="~/covalent-workdir",
)

@ct.electron(executor=executor)
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)


dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)

Working Example: Perlmutter

The following is a minimal working example to submit a Covalent job on NERSC's Perlmutter machine. It assumes that you have used the sshproxy utility to generate a certificate file in order to circumvent the need for multi-factor authentication for each login.

import covalent as ct

executor = ct.executor.HPCExecutor(
    address="perlmutter-p1.nersc.gov",
    username="UserName",
    ssh_key_file="~/.ssh/nersc",
    cert_file="~/.ssh/nersc-cert.pub",
    remote_conda_env="myenv",
    job_attributes_kwargs={
        "project_name": "ProjectName",
        "custom_attributes": {"slurm.constraint": "cpu", "slurm.qos": "debug"},
    },
)

@ct.electron(executor=executor)
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)


dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)

Troubleshooting

The most common cause of issues relates to the job scheduler details (i.e. the resource_spec_kwargs and the job_attributes_kwargs). If your job fails on the remote machine, set cleanup=False and then inspect the files left behind in the working directory, as well as the ~/.psij directory, which contains a history and various log files associated with your attempted job submissions.
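
For example, the executor from Approach 2 could be adapted for debugging as follows (a sketch; the address and username are placeholders):

import covalent as ct

executor = ct.executor.HPCExecutor(
    address="coolmachine.university.edu",
    username="UserName",
    cleanup=False,  # retain the remote job files for debugging
)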

Release Notes

Release notes are available in the Changelog.

Credit

This plugin was developed by Andrew S. Rosen, building on prior work by the Agnostiq team on the covalent-slurm-plugin.

If you use this plugin, be sure to cite Covalent as follows:

W. J. Cunningham, S. K. Radha, F. Hasan, J. Kanem, S. W. Neagle, and S. Sanand. Covalent. Zenodo, 2022. https://doi.org/10.5281/zenodo.5903364

License

Covalent is licensed under the Apache 2.0 License. Covalent may be distributed under other licenses upon request. See the LICENSE file or contact the support team for more details.
