FSL Cluster Submission Script
fsl_sub
Job submission to cluster queues.

Copyright 2018-2023, University of Oxford (Duncan Mortimer)
Introduction
fsl_sub provides a consistent interface to various cluster backends, falling back to running tasks locally where no cluster is available. If you wish to submit tasks to a cluster you will need to install and configure an appropriate grid backend plugin, two of which are provided alongside fsl_sub:
- fsl_sub_plugin_sge for Sun/Son of/Univa Grid Engine (Grid Engine)
- fsl_sub_plugin_slurm for Slurm
Installation
In addition to the main fsl_sub package, to submit to a cluster queueing system you need to install one of the cluster plugins. At present, plugins are available for Grid Engine (Sun/Son of or Univa/Altair) and SLURM. Please see the INSTALL.md file for details on installing fsl_sub and the relevant plugin.
Configuration
For instructions on how to configure fsl_sub once installed (essential if using a cluster plugin) see the CONFIGURATION.md file.
Usage
For detailed usage see:
fsl_sub --help
The options available depend on how fsl_sub has been configured for your particular backend; the help text reflects which options your chosen backend supports.
Basic Submission
Submitting or running a job on the default queue is as simple as:
fsl_sub <command line>
When running on a cluster it is recommended that you provide a job run time with -T <time in minutes> and a memory requirement with -R <memory in GB>; this allows fsl_sub to automatically select an appropriate queue. Alternatively, you can specify a queue with -q <queue name>. For example, if you have two queues, short and long, with maximum run-times of 60 minutes and 5 days respectively, then:
fsl_sub -T 50 myjob
is the equivalent of:
fsl_sub -q short myjob
but enables you to use this submission command (and any script based on this command) with any fsl_sub enabled cluster, regardless of queue names.
Providing the memory required is also advisable as some cluster setups enforce memory limits but provide for multi-slot reservations to allocate multiples of the RAM limit to your task. fsl_sub can be configured to automatically make these types of submission.
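For example, a job expected to run for about two hours and use 16GB of RAM (the command name myanalysis is illustrative) could be submitted as:
fsl_sub -T 120 -R 16 myanalysis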
After validation of your job command and settings, fsl_sub will either wait until job completion (with no cluster backend) or will return with the job ID of your submitted job. This number can then be used to monitor job progress.
Job Monitoring - fsl_sub_report
In addition to the native job monitoring tools, fsl_sub provides a cluster backend agnostic job monitoring tool, fsl_sub_report.
fsl_sub_report Usage
fsl_sub_report [job_id] {--subjob_id [sub_id]} {--parsable}
Reports on job job_id, optionally on subtask sub_id, and returns information on both queued/running and completed jobs. --parsable outputs machine-readable information.
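For example, to report on sub-task 3 of job 12345 in machine-readable form (job and sub-task IDs illustrative):
fsl_sub_report 12345 --subjob_id 3 --parsable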
Advanced Usage
Skipping Command Validation
By default fsl_sub checks to see if the submitted command can actually be run. Where the software isn't available on the submission computer, or you are prepending the command with some logic or setting an environment variable, this test will fail. You can disable validation with the -n (or --novalidation) option.
Array Tasks
Array tasks are independent tasks that can be run in parallel, as they do not need or generate data required by other members of the array. To create a simple array task, create a text file where each line contains a command to run, and submit this file as the argument to the --array_task option.
To control the number of array tasks run in parallel, use --array_limit. This is also useful for standalone installs as it will limit the number of threads used when running array tasks in parallel on your computer.
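A minimal sketch, assuming a task file mytasks.txt with one command per line (the process_subject script is illustrative):
# mytasks.txt contains, e.g.:
#   ./process_subject subj01
#   ./process_subject subj02
fsl_sub --array_task mytasks.txt --array_limit 4
This runs at most four sub-tasks concurrently.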
It is also possible to submit an array task where the submitted software works out which portion of the array it should process. This mode is enabled with --array_native, which takes an argument of the form n[:m[:s]] (n = number of array members, m = start index, s = step in index number between array members). The command will be launched multiple times, with environment variables populated with the information necessary to work out which part of the array the program should handle. As each cluster software suite sets different variables, fsl_sub sets the following variables to the name of the environment variable your software can query to get the information:
Variable | Points to the variable holding |
---|---|
FSLSUB_JOBID_VAR | The ID of the master job |
FSLSUB_ARRAYTASKID_VAR | The ID of the sub-task |
FSLSUB_ARRAYSTARTID_VAR | The ID of the first sub-task |
FSLSUB_ARRAYENDID_VAR | The ID of the last sub-task |
FSLSUB_ARRAYSTEPSIZE_VAR | The step between sub-task IDs (not available for all plugins) |
FSLSUB_ARRAYCOUNT_VAR | The number of tasks in the array (not available for all plugins) |
Not all variables are set by all queue backends so ensure your software can cope with missing variables.
For example, in BASH scripts you can get the ARRAYTASKID value with ${!FSLSUB_ARRAYTASKID_VAR}.
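A minimal BASH sketch of a natively-aware array script, guarding against variables a plugin may not set (script contents are illustrative):
#!/bin/bash
# fsl_sub sets e.g. FSLSUB_ARRAYTASKID_VAR to the *name* of the scheduler's
# own variable; BASH indirection (${!...}) retrieves its value.
task_id="${!FSLSUB_ARRAYTASKID_VAR}"
first_id="${!FSLSUB_ARRAYSTARTID_VAR}"
last_id="${!FSLSUB_ARRAYENDID_VAR}"
# Not every plugin sets every variable, so provide defaults
: "${task_id:=1}" "${first_id:=1}"
echo "Processing array member ${task_id} of range ${first_id}-${last_id}"
Such a script could be submitted with, for example, fsl_sub --array_native 10 ./myscript.sh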
Setting Environment Variables In Job Environments
Some cluster setups don't support passing all environment variables in your current shell session to your jobs. fsl_sub provides the --export option to allow you to choose which variables should be passed on, or to set environment variables only within the job (without affecting your running shell session). To set a variable use the syntax --export MYVAR=THEVALUE. This option can be repeated multiple times.
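For example, to set MYVAR within the job only, whilst also passing through LD_LIBRARY_PATH from your session (variable and command names illustrative):
fsl_sub --export MYVAR=THEVALUE --export LD_LIBRARY_PATH mycommand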
Multi-stage pipelines
Where you need to queue up a complex pipeline, you can use returned job IDs with the --job_hold option to request that a submitted task waits for completion of a predecessor task. In addition, multi-stage array tasks can utilise interleaved job-holds with the --array_hold option; see the sketch below.
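A two-stage sketch using the job ID printed by fsl_sub (script and file names illustrative):
jid=$(fsl_sub -T 60 ./stage1.sh)
fsl_sub -T 30 --job_hold "$jid" ./stage2.sh
and, for paired array tasks, an interleaved hold (assuming --array_hold behaves this way on your backend, with each stage-2 sub-task waiting on the matching stage-1 sub-task):
ajid=$(fsl_sub --array_task stage1_tasks.txt)
fsl_sub --array_hold "$ajid" --array_task stage2_tasks.txt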
Array Task Validation
Where you need to submit multiple stages in advance, with job holds on the previous step, but do not know in advance the command(s) you wish to run, you may create an array task file containing the text 'dummy'. Validation of the array task file will then be skipped, allowing the task to be submitted. You should then arrange for a predecessor job to populate the array task file with the relevant command(s) to run.
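A sketch of this pattern (file and script names illustrative):
echo dummy > stage2_tasks.txt
jid=$(fsl_sub ./stage1.sh)
fsl_sub --job_hold "$jid" --array_task stage2_tasks.txt
where stage1.sh writes the real command(s) into stage2_tasks.txt before the array task starts.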
Saving Submission Information
Under normal circumstances cluster backends generate a BASH script that describes your job's requirements to the cluster software and then calls your job (or array task file line). Using the --keep_jobscript option you can request that fsl_sub leaves a copy of this file in the current folder, named wrapper_<jobid>.sh. This file records the version of fsl_sub (and plugin) used along with the exact command line, and as such is very useful for recording analyses carried out.
Submitting Cluster Submission Scripts
If you have written your own cluster submission script, or wish to re-run a task for which you preserved the wrapper_<jobid>.sh file, then you can do so using the --usescript option, providing the script as the command to submit.
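For example, to re-submit a preserved job script (the job ID in the filename is illustrative):
fsl_sub --usescript wrapper_123456.sh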
Specifying Memory Requirements Without Using -R
If fsl_sub is being called from within a software package such that you have no ability to specify memory requirements (for example FEAT), then you can achieve this by setting the environment variable FSLSUB_MEMORY_REQUIRED, e.g.
FSLSUB_MEMORY_REQUIRED=32G myscript_that_submits
If units are not specified then they will default to those configured in the YAML file. If the memory is also specified in the fsl_sub arguments then the value provided in the argument will be used.
Multi-slot/thread tasks
If fsl_sub has a grid scheduler plugin installed then you can control the number of 'slots' your task will be allocated with the -s|--parallelenv argument. This would typically be used with multi-threaded software, for example software using OpenMP or similar libraries that allow parallel processing on a single computer, but can also often be used to request more memory than is allowed in a single slot. fsl_sub does not support the submission of multi-computer parallel (MPI) tasks.
Whilst parallel environments are specific to Grid Engine, SLURM has similar facilities for reserving resources. -s|--parallelenv takes a single argument, typically of the form <parallelenv>,<slots>, where <slots> is an integer giving the number of slots (threads or multiples of RAM per slot) you require. If your cluster queues support parallel environments these will be reported in the fsl_sub --help text.
If your cluster scheduler doesn't use parallel environments, fsl_sub also accepts ,<slots> or simply <slots>.
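For example, to request eight slots (the parallel environment name shmem is site-specific and illustrative):
fsl_sub -s shmem,8 my_threaded_program
or, on a scheduler without parallel environments:
fsl_sub -s 8 my_threaded_program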
Co-Processor Tasks
Where your software needs to use a co-processor, most commonly CUDA GPU cards, fsl_sub offers the --coprocessor options. To run CUDA software you would typically add --coprocessor=cuda to your fsl_sub command line. Assuming the queue configuration has been set up correctly, no other options are necessary as the correct queue/partition will be selected automatically. If your system has multiple versions of CUDA installed and selectable using shell modules (and everything is configured correctly), you can select the CUDA version with the --coprocessor_toolkit option. Where multiple hardware versions are available, your system may have been configured to allow you to select specific card generations with --coprocessor_class, with --coprocessor_class_strict forcing fsl_sub to only select the class of card you request (as opposed to this class and all superior devices).
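For example, to run a CUDA program against the 10.2 toolkit module (program name and toolkit version illustrative):
fsl_sub --coprocessor=cuda --coprocessor_toolkit 10.2 mycuda_program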
Shell Choice (Especially on Heterogeneous Clusters)
Where the submitted command is a shell command line, e.g. "command; command; command", fsl_sub needs to run this via a shell. This defaults to BASH on Linux hosts and macOS prior to 10.15 and zsh on macOS from 10.15 onwards. This can be overridden using the environment variable FSLSUB_SHELL, set to the path of your preferred Bourne shell compatible binary. This is particularly useful if your submission host differs from your execution host (e.g. macOS vs Linux), or the shell binary is in a different location on the execution host (e.g. /bin/bash locally, /usr/local/bin/bash remotely).
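For example, to use a BASH installed outside the default location on the execution hosts:
export FSLSUB_SHELL=/usr/local/bin/bash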
Specifying Accounting Project
On some clusters you may be required to submit jobs to different projects to ensure compute time is billed accordingly, or to gain access to restricted resources. You can specify a project with the --project option. If fsl_sub is being called from within a software package such that you have no ability to specify this option, then you can select a project by setting the environment variable FSLSUB_PROJECT, e.g.
FSLSUB_PROJECT=myproj myscript_that_submits
Submitting tasks from submitted tasks
Most clusters will not allow a running job to submit a sub-task, as this is fairly likely to result in deadlocks. Consequently, subsequent calls to fsl_sub will use the shell plugin to run the job. If this occurs from within a cluster job, the job .o and .e files will have filenames of the form <job name>.[o|e]<parent jobid>{.<parent taskid>}-<process id of fsl_sub>{.<taskid>}. Where allowed by thread 'slot' requests, array tasks in these sub-tasks will be parallelised as if running locally.
Native Resource Requests
Where your cluster system has a specific resource requirement that can't automatically be fulfilled by fsl_sub, you can use the -r option to pass through a native resource request string.
Scheduler Arguments
Where your cluster system requires additional arguments that aren't supported by fsl_sub's own options, for example SLURM QOS settings, these can be specified in two ways.
Command-line
Use --extra "<argument>" to specify these extra arguments, remembering to quote them to prevent fsl_sub from attempting to interpret them. This argument can be provided multiple times to allow more than one extra argument to be specified.
Example:
--extra "--qos=immediate"
Environment Variables
Where you do not have control of the fsl_sub command line (for example with FEAT), you can specify these additional arguments using environment variables. Define variables with names starting FSLSUB_EXTRA_ whose values are your extra arguments. Arguments specified with --extra will override equivalents set by environment variables.
Example:
export FSLSUB_EXTRA_QOS="--qos=immediate"
Deleting Jobs
fsl_sub --delete_job <jobID>
will enable you to delete a cluster job, assuming you have permission to do so.
Querying Capabilities
If you are writing non-Python software that needs to check on the availability of fsl_sub features, for example whether queues are configured or CUDA hardware is available, then you can use the following options:
Option | Use |
---|---|
--has_coprocessor | Takes the name of a co-processor; exits with code 1 if this co-processor is not available. Assuming everything is correctly configured, --has_coprocessor cuda should be a viable test for CUDA hardware both when running standalone and on a cluster system |
--has_queues | fsl_sub will exit with return code 1 if there are no queues configured, e.g. this is a standalone computer |
--show_config | Outputs the currently applicable configuration as a YAML file. The content of this file depends on the plugins installed and the configuration of your system, so is not guaranteed to be identical on all platforms |
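For example, a shell script can gate GPU-specific work on the exit code:
if fsl_sub --has_coprocessor cuda; then
    echo "CUDA hardware and queues are available"
fi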
Python interface
The fsl_sub package is available for use directly within Python scripts. Ensure that the fsl_sub folder is within your Python search path and import the appropriate module (e.g. fsl_sub or fsl_sub.config).
fsl_sub.config.has_queues
Import: from fsl_sub.config import has_queues
Arguments: None
This function takes no arguments and returns True or False depending on whether there are usable queues (current execution method supports queueing and there are configured queues).
fsl_sub.config.has_coprocessor
Import: from fsl_sub.config import has_coprocessor
Arguments: Name of co-processor
Takes the name of a coprocessor configuration key and returns True or False depending on whether the system is configured for or supports this coprocessor. A correctly configured fsl_sub + cluster + CUDA devices should have a coprocessor definition of 'cuda' (users will be warned if this is not the case).
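A minimal Python sketch combining the two checks:
from fsl_sub.config import has_coprocessor, has_queues

if has_queues() and has_coprocessor('cuda'):
    print('Cluster queues with CUDA support are available')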
fsl_sub.report
Import: fsl_sub, fsl_sub.consts
Arguments: job_id, subjob_id=None
This returns a dictionary describing the job (including completed tasks):
id: # job ID
name: # job 'name'
submission_time: # as a datetime object
tasks: # dict keyed on sub-task ID
    subtask_id:
        status: # One of:
                # fsl_sub.consts.QUEUED
                # fsl_sub.consts.RUNNING
                # fsl_sub.consts.FINISHED
                # fsl_sub.consts.FAILED
                # fsl_sub.consts.SUSPENDED
                # fsl_sub.consts.HELD
        start_time: # as a datetime object
        end_time: # as a datetime object
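A sketch of checking a job's sub-task states with this structure (job ID illustrative):
import fsl_sub
import fsl_sub.consts

job = fsl_sub.report(123456)
for task_id, task in job['tasks'].items():
    if task['status'] == fsl_sub.consts.FAILED:
        print(f"sub-task {task_id} failed at {task['end_time']}")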
fsl_sub.submit
Import: fsl_sub
Arguments:
Argument (*Required) | Default (type) | Purpose |
---|---|---|
command* | (list of strings or string) | Command line, job-script or array task file to submit |
architecture | None (string) | Select nodes of specific CPU architecture (where cluster consists of multiple types) |
array_task | False (boolean) | Whether this is an array task |
array_hold | None (integer, string or list of integers/strings) | Job ID(s) of a predecessor array task that must complete before this array task runs; used for interleaved holds between array tasks |
array_limit | None (integer) | Maximum array tasks to run concurrently |
array_specifier | None (string) | If not using an array task file, the definition of the array - n[-m[:s]]. In its simplest form, n is the number of sub-tasks (sub-task IDs start at 1); n-m starts at ID n and runs until sub-job ID m. Providing :s defines the step size between adjacent sub-job IDs |
as_tuple | False (boolean) | Return job ID as a single element tuple |
coprocessor | None (string) | The name of a co-processor your job requires - use has_coprocessor() to check for availability |
coprocessor_toolkit | None (string) | The name of the shell module variant to load to configure the environment for this co-processor task, e.g. if you have a shell module cuda/10.2 then this would be 10.2 (assuming that the co-processor configuration has cuda set as its module parent) |
coprocessor_class | None (string) | The name of the class (as defined in the configuration) of co-processor |
coprocessor_class_strict | False (boolean) | Only submit to this class of GPU excluding more capable devices |
coprocessor_multi | "1" (string) | Complex definition requesting multiple co-processors. At its most basic this is the number of co-processors per node you require but may take more complex values as required by your cluster setup |
export_vars | [] (list of string) | This is a list of environment variables to copy to your job's environment where your cluster is configured to not transfer your complete environment. This can be simple environment variable names or NAME=VALUE strings that will set the environment variable to the specified value for this job alone. |
jobhold | None (integer, string or list of integers/strings) | Job ID(s) that must complete before this job can run |
jobram | None (integer) | Amount of RAM required for your job in Gigabytes |
jobtime | None (integer) | Time required for your job in minutes |
keep_jobscript | False (boolean) | Whether to keep the generated job script as wrapper_<jobid>.sh |
logdir | None (string) | Path to the directory where log files should be created |
mail_on | None (string) | Mail user (if mail configured) on job 'a'bort/reschedule, 'b'egin, 'e'nd, 's'uspend or 'n'ever mail |
mailto | username@hostname (string) | Email address to send messages to |
name | None (string) | Name of job, defaults to first item in the command line |
parallel_env | None (string) | Name of parallel environment to request if the backend supports these, otherwise ignored |
priority | None (signed integer) | Priority of job within configured range - typically user can only lower priority |
project | None (string) | Project/Account name to use for job |
queue | None (string) | Rather than using jobram|jobtime|coprocessor to automatically select a queue specify a specific queue |
ramsplit | True (boolean) | Whether to enable requesting multiple slots in a parallel environment sufficient to provide the RAM requested, if your cluster backend/setup has this configured |
requeueable | True (boolean) | Can this job be safely restarted (rescheduled)? |
resources | None (string) | Cluster resource request strings, e.g. softwarelicense=1 |
threads | 1 (integer) | How many threads your software requires - attempts will be made to limit your task to this number of threads |
usescript | False (boolean) | Have you provided a job script in the command argument? If so all other options are ignored |
validate_command | True (boolean) | Whether to validate that the first item in the command line is an executable |
extra_args | None (list) | List of strings representing additional arguments to pass through to the scheduler |
Submit job(s) to a queue, returns the job id as an integer.
Single tasks require a command in the form of a list [command, arg1, arg2, ...] or a simple string "command arg1 arg2", which will be split with shlex.split.
Array tasks (array_task=True) require a file name of the array task table file unless array_specifier="n[-m[:s]]" is specified in which case command is as per a single task.
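A minimal submission sketch using the documented arguments (command and resource values illustrative):
import fsl_sub

# Submit a single task expected to need about two hours and 16GB of RAM
job_id = fsl_sub.submit(
    ['myanalysis', 'input.nii.gz'],
    jobtime=120,
    jobram=16,
    name='myanalysis',
)
print(job_id)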
fsl_sub.delete_job
Import: fsl_sub
Arguments: job_id, (sub_job_id)
You can request that a job is killed using the fsl_sub.delete_job function which takes the job ID (including task ID) and calls the appropriate cluster job deletion command. This returns a tuple, text output from the delete command and the return code from the command.
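For example (job ID illustrative):
import fsl_sub

output, exit_code = fsl_sub.delete_job(123456)
if exit_code != 0:
    print(output)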
Writing Plugins
Inside the plugins folder there is a template, template_plugin, that can be used as a basis to add support for different grid submission engines. This file should be renamed to fsl_sub_plugin_<method>.py and placed somewhere on the Python search path. Inside the plugin, change METHOD_NAME to <method> and then modify the functions appropriately. The submit function carries out the job submission, and aims to either generate a command line with all the job arguments or to build a job submission script. The arguments should be added to the command_args list in the form of option flags and lists of options with arguments.
Also provide a fsl_sub_<method>.yml file that provides the default configuration for the module.
To create an installable Conda/Pip package of this plugin look at the Grid Engine and SLURM plugins for example directory layouts and build scripts.
Building
Conda
The fsl_sub conda recipe is hosted in a separate repository at https://git.fmrib.ox.ac.uk/fsl/conda/fsl_sub. Conda packages for new releases are automatically built and published to the FSL conda channel at https://fsl.fmrib.ox.ac.uk/fsldownloads/fslconda/public/.
To build a Conda package by hand for the current fsl_sub release (denoted by the version field specified in the recipe meta.yaml file):
git clone https://git.fmrib.ox.ac.uk/fsl/conda/fsl_sub
cd fsl_sub
conda build
Refer to the FSL conda documentation for more information on FSL conda packages.
Pip
To build with PIP, prepare the source distribution:
python setup.py sdist
To build a wheel you need to install wheel into your Python build environment:
pip install wheel
fsl_sub is only compatible with Python 3, so you will be building a Pure Python Wheel:
python setup.py bdist_wheel
File details
Details for the file fsl_sub-2.8.3.tar.gz.
File metadata
- Download URL: fsl_sub-2.8.3.tar.gz
- Upload date:
- Size: 84.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | a6b115bce5f9e95c63df2e4a13a7d36c29e9e2d1ad5e8822bfe61dbce32b3fc4 |
MD5 | da4be2c3234df5929982d9fdc0e8020f |
BLAKE2b-256 | 40c08178b7fca82b2f9c7f484b0cf510e131c8d8346598dd10b75b2fb02adf7c |
File details
Details for the file fsl_sub-2.8.3-py3-none-any.whl.
File metadata
- Download URL: fsl_sub-2.8.3-py3-none-any.whl
- Upload date:
- Size: 86.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 31c3754953a6460ddd8a184c9dfe1a729e94e8089c2d61772bb26b136e266274 |
MD5 | f5f7eee1ceb1d52cef1dbb4413b08ceb |
BLAKE2b-256 | ce2d84e3cbff196cd73444e796df955b63870d869afa76b3ff974b1d11772f22 |