
A command called sbatch and foo for the cloudmesh shell


Documentation

Sample YAML File

This command requires a YAML file that is configured for the host and GPU. The YAML file also points to the desired Slurm template (a sketch of such a template follows the YAML example below).

slurm_template: 'slurm_template.slurm'

sbatch_setup:
  <hostname>-<gpu>:
    - card_name: "a100"
    - time: "05:00:00"
    - num_cpus: 6
    - num_gpus: 1

  rivanna-v100:
    - card_name: "v100"
    - time: "06:00:00"
    - num_cpus: 6
    - num_gpus: 1
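
The values in this YAML file are substituted into the Slurm template named by slurm_template. The placeholder syntax is not documented on this page, so the {name} markers in the sketch below are only an assumption; consult the cloudmesh-sbatch documentation for the exact syntax.

#!/bin/bash
# slurm_template.slurm -- illustrative sketch only; the {name} placeholders
# are assumed, not confirmed by this page.
#SBATCH --job-name=example
#SBATCH --time={time}
#SBATCH --cpus-per-task={num_cpus}
#SBATCH --gres=gpu:{card_name}:{num_gpus}

nvidia-smi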

example:

cms sbatch slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4  --noos --dir=example --experiment=\"epoch=[1-3] x=[1,4] y=[10,11]\"
sbatch slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4 --noos --dir=example --experiment="epoch=[1-3] x=[1,4] y=[10,11]"
# ERROR: Importing python not yet implemented
epoch=1 x=1 y=10  sbatch example/slurm.sh
epoch=1 x=1 y=11  sbatch example/slurm.sh
epoch=1 x=4 y=10  sbatch example/slurm.sh
epoch=1 x=4 y=11  sbatch example/slurm.sh
epoch=2 x=1 y=10  sbatch example/slurm.sh
epoch=2 x=1 y=11  sbatch example/slurm.sh
epoch=2 x=4 y=10  sbatch example/slurm.sh
epoch=2 x=4 y=11  sbatch example/slurm.sh
epoch=3 x=1 y=10  sbatch example/slurm.sh
epoch=3 x=1 y=11  sbatch example/slurm.sh
epoch=3 x=4 y=10  sbatch example/slurm.sh
epoch=3 x=4 y=11  sbatch example/slurm.sh
Timer: 0.0022s Load: 0.0013s sbatch slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4 --noos --dir=example --experiment="epoch=[1-3] x=[1,4] y=[10,11]"
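
The --experiment flag expands into the Cartesian product of all listed parameter values, as shown in the output above. A minimal plain-bash sketch of the equivalent expansion, assuming the generated example/slurm.sh reads epoch, x, and y from its environment:

# Hand-written equivalent of --experiment="epoch=[1-3] x=[1,4] y=[10,11]";
# cloudmesh-sbatch generates example/slurm.sh from the template and submits
# one job per parameter combination.
for epoch in 1 2 3; do
  for x in 1 4; do
    for y in 10 11; do
      epoch=$epoch x=$x y=$y sbatch example/slurm.sh
    done
  done
done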

Slurm on a single computer (Ubuntu 20.04)

Install

See https://drtailor.medium.com/how-to-setup-slurm-on-ubuntu-20-04-for-single-node-work-scheduling-6cc909574365

The following assumes a machine with 32 processors (threads).

sudo apt update -y
sudo apt install slurmd slurmctld -y
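
To verify that the packages were installed, print the versions (the exact version numbers will vary):

slurmd -V
sinfo --version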

sudo chmod 777 /etc/slurm-llnl

# make sure $HOSTNAME expands to this machine's hostname before running this

sudo cat << EOF > /etc/slurm-llnl/slurm.conf
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=localcluster
SlurmctldHost=$HOSTNAME
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES # This machine has 128GB main memory
NodeName=$HOSTNAME CPUs=32 RealMemory=128762 State=UNKNOWN
PartitionName=local Nodes=ALL Default=YES MaxTime=INFINITE State=UP
EOF

sudo chmod 755 /etc/slurm-llnl/
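
Before starting the daemons, it can help to compare the CPUs and RealMemory values in the NodeName line with what Slurm actually detects on the machine:

# Print the hardware configuration slurmd detects on this node and compare
# it with the NodeName line in /etc/slurm-llnl/slurm.conf
slurmd -C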

Start

sudo systemctl start slurmctld
sudo systemctl start slurmd
sudo scontrol update nodename=$HOSTNAME state=idle
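
To confirm that both daemons came up and that the node left the UNKNOWN state:

# Check daemon status and node state
systemctl status slurmctld --no-pager
systemctl status slurmd --no-pager
scontrol show node $HOSTNAME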

Stop

sudo systemctl stop slurmd
sudo systemctl stop slurmctld

Info

sinfo
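
sinfo only gives a summary of partitions and node states. For per-node details, queued jobs, and the partition defined in the slurm.conf above:

sinfo -N -l                      # one line per node with state and resources
squeue                           # queued and running jobs
scontrol show partition local    # the partition defined in slurm.conf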

Job

Save the following into gregor.slurm:

#!/bin/bash

#SBATCH --job-name=gregors_test          # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=laszewski@gmail.com  # Where to send mail	
#SBATCH --ntasks=1                       # Run on a single CPU
####  XBATCH --mem=1gb                        # Job memory request
#SBATCH --time=00:05:00                  # Time limit hrs:min:sec
#SBATCH --output=gregors_test_%j.log     # Standard output and error log

pwd; hostname; date

echo "Gregors Test"
date
sleep 30
date

Run with

sbatch gregor.slurm
watch -n 1 squeue
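
Once the job has run, its output is in the log file named by --output, with %j replaced by the job id (<jobid> below is a placeholder):

# Inspect the job output
cat gregors_test_<jobid>.log

# Cancel a queued or running job if needed
scancel <jobid>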

BUG

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2    LocalQ gregors_    green PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
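
On a fresh single-node installation the node frequently starts out down or drained. Inspecting the reason and resuming the node usually clears this state:

# Show the node state and the reason it is down or drained
scontrol show node $HOSTNAME | grep -i -E 'state|reason'

# Return the node to service
sudo scontrol update nodename=$HOSTNAME state=RESUME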

sbatch Slurm management commands for localhost

Start the Slurm daemons

cms sbatch slurm start

Stop the Slurm daemons

cms sbatch slurm stop

BUG:

srun gregor.slurm

srun: Required node not available (down, drained or reserved)
srun: job 7 queued and waiting for resources
sudo scontrol update nodename=localhost state=POWER_UP

Valid states are: NoResp DRAIN FAIL FUTURE RESUME POWER_DOWN POWER_UP UNDRAIN

Cheatsheet
