Skip to main content

A command called sbatch and foo for the cloudmesh shell

Project description

Documentation

Sample YAML File

This command requires a YAML file which is configured for the host and gpu. The YAML file also points to the desired slurm template.

slurm_template: 'slurm_template.slurm'

sbatch_setup:
  <hostname>-<gpu>:
    - card_name: "a100"
    - time: "05:00:00"
    - num_cpus: 6
    - num_gpus: 1

  rivanna-v100:
    - card_name: "v100"
    - time: "06:00:00"
    - num_cpus: 6
    - num_gpus: 1

example:

cms sbatch slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4  --noos --dir=example --experiment=\"epoch=[1-3] x=[1,4] y=[10,11]\"
sbatch slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4 --noos --dir=example --experiment="epoch=[1-3] x=[1,4] y=[10,11]"
# ERROR: Importing python not yet implemented
epoch=1 x=1 y=10  sbatch example/slurm.sh
epoch=1 x=1 y=11  sbatch example/slurm.sh
epoch=1 x=4 y=10  sbatch example/slurm.sh
epoch=1 x=4 y=11  sbatch example/slurm.sh
epoch=2 x=1 y=10  sbatch example/slurm.sh
epoch=2 x=1 y=11  sbatch example/slurm.sh
epoch=2 x=4 y=10  sbatch example/slurm.sh
epoch=2 x=4 y=11  sbatch example/slurm.sh
epoch=3 x=1 y=10  sbatch example/slurm.sh
epoch=3 x=1 y=11  sbatch example/slurm.sh
epoch=3 x=4 y=10  sbatch example/slurm.sh
epoch=3 x=4 y=11  sbatch example/slurm.sh
Timer: 0.0022s Load: 0.0013s sbatch slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4 --noos --dir=example --experiment="epoch=[1-3] x=[1,4] y=[10,11]"

Slurm on a single computer ubuntu 20.04

Install

see https://drtailor.medium.com/how-to-setup-slurm-on-ubuntu-20-04-for-single-node-work-scheduling-6cc909574365

32 Processors (threads)

sudo apt update -y
sudo apt install slurmd slurmctld -y

sudo chmod 777 /etc/slurm-llnl

# make sure to use the HOSTNAME

sudo cat << EOF > /etc/slurm-llnl/slurm.conf
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=localcluster
SlurmctldHost=$HOSTNAME
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES # THis machine has 128GB main memory
NodeName=$HOSTNAME CPUs=32 RealMemory==128762 State=UNKNOWN
PartitionName=local Nodes=ALL Default=YES MaxTime=INFINITE State=UP
EOF

sudo chmod 755 /etc/slurm-llnl/

Start

sudo systemctl start slurmctld
sudo systemctl start slurmd
sudo scontrol update nodename=localhost state=idle

Stop

sudo systemctl stop slurmd
sudo systemctl stop slurmctld

Info

sinfo

Job

save into gregor.slurm

#!/bin/bash

#SBATCH --job-name=gregors_test          # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=laszewski@gmail.com  # Where to send mail	
#SBATCH --ntasks=1                       # Run on a single CPU
####  XBATCH --mem=1gb                        # Job memory request
#SBATCH --time=00:05:00                  # Time limit hrs:min:sec
#SBATCH --output=sgregors_test_%j.log    # Standard output and error log

pwd; hostname; date

echo "Gregors Test"
date
sleep(30)
date

Run with

sbatch gregor.slurm
watch -n 1 squeue

BUG

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2    LocalQ gregors_    green PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

sbatch slurm manageement commands for localhost

start slurm deamons

cms sbatch slurm start

stop surm deamons

cms sbatch slurm stop

BUG:

srun gregor.slurm

srun: Required node not available (down, drained or reserved)
srun: job 7 queued and waiting for resources
sudo scontrol update nodename=localhost state=POWER_UP

Valid states are: NoResp DRAIN FAIL FUTURE RESUME POWER_DOWN POWER_UP UNDRAIN

Cheatsheet

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cloudmesh-sbatch-4.3.3.tar.gz (11.5 kB view hashes)

Uploaded Source

Built Distribution

cloudmesh_sbatch-4.3.3-py2.py3-none-any.whl (10.5 kB view hashes)

Uploaded Python 2 Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page