helper tools for the SLURM HPC workload manager used at Fred Hutch and elsewhere

A collection of SLURM command-line tools and wrappers, most of them found on GitHub

why slurm-toys

The purpose of slurm-toys is to package useful SLURM helper tools written in Python 3 or shell and publish them as a single package on PyPI.

currently integrated toys

slurm-limiter

HPC clusters are optimized to maximize utilization for batch jobs. FairShare helps ensure that all users get an appropriate share of resources over time. However, FairShare can only influence jobs that have not started yet. If a cluster is 100% occupied by “large” users, “small” users become unhappy because they may not be able to get even a single node ad hoc. Currently the only solution to this problem appears to be setting hard account limits. Unfortunately, these limits are often too high when a cluster is busy and too low when it is idle. slurm-limiter addresses this problem by dynamically adjusting the limits based on overall partition/queue load.

On a responsive HPC cluster, this should take no longer than 5 seconds:

~$ time srun hostname
srun: job 61004624 queued and waiting for resources
srun: job 61004624 has been allocated resources
gizmof171

real    0m1.668s
user    0m0.044s
sys     0m0.012s
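
The same check can be scripted, for example to raise an alert when scheduling latency creeps up. The following is only a minimal sketch and not part of slurm-toys: the function names time_command and is_responsive are hypothetical, and the 5 second threshold is simply the figure quoted above.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return the wall-clock seconds it took."""
    start = time.monotonic()
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.monotonic() - start


def is_responsive(elapsed, threshold=5.0):
    """Return True if elapsed (seconds) is within the threshold."""
    return elapsed <= threshold
```

On a cluster where srun is on the PATH, is_responsive(time_command(["srun", "hostname"])) would report whether a trivial one-task job was scheduled and finished within the threshold.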

example use in a cron job, running every 20 min:

*/20 * * * * root (  ml Python/3.6.4-foss-2016b-fh1; /app/bin/slurm-limiter -p campus \
                   --error-email=sysadmin\@institute.org --minaccountlimit=50 --maxaccountlimit=350 \
                   --slaaccountlimit=300 --changestep=50 --maxpercentuse=90 \
                   --minidlenodes=5 ) >>/var/tmp/slurm-limiter.log 2>&1

example output to syslog:

~$ grep slurm-limiter: /var/log/syslog
Apr 15 09:40:03 gizmo-ctld slurm-limiter: INFO:slurm-limiter.85: Cores: running=689, pending=3299, total=1180, Usage=58 %, Limits: 350 / 370, Nodes: idle=101
Apr 15 10:00:03 gizmo-ctld slurm-limiter: INFO:slurm-limiter.85: Cores: running=689, pending=3274, total=1180, Usage=58 %, Limits: 350 / 370, Nodes: idle=101
Apr 15 10:20:03 gizmo-ctld slurm-limiter: INFO:slurm-limiter.85: Cores: running=680, pending=3241, total=1180, Usage=57 %, Limits: 350 / 370, Nodes: idle=102
Apr 15 10:40:03 gizmo-ctld slurm-limiter: INFO:slurm-limiter.85: Cores: running=680, pending=3219, total=1180, Usage=57 %, Limits: 350 / 370, Nodes: idle=102
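
For monitoring or plotting, these lines can be turned into numbers. The sketch below assumes the layout seen in the sample lines above; other versions of slurm-limiter may log differently. The function name parse_limiter_line is hypothetical, and reading the two "Limits" values as account limit and user limit is an assumption (it is consistent with the default --userlimitoffset of 20).

```python
import re

# Pattern inferred from the sample syslog lines; not taken from the tool's source.
LOG_PATTERN = re.compile(
    r"Cores: running=(?P<running>\d+), pending=(?P<pending>\d+), "
    r"total=(?P<total>\d+), Usage=(?P<usage>\d+) %, "
    # Assumption: first value is the account limit, second the user limit.
    r"Limits: (?P<account_limit>\d+) / (?P<user_limit>\d+), "
    r"Nodes: idle=(?P<idle>\d+)"
)


def parse_limiter_line(line):
    """Return the numeric fields of one log line as a dict, or None if no match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "Apr 15 09:40:03 gizmo-ctld slurm-limiter: INFO:slurm-limiter.85: "
    "Cores: running=689, pending=3299, total=1180, Usage=58 %, "
    "Limits: 350 / 370, Nodes: idle=101"
)
fields = parse_limiter_line(sample)
# Integer percentage of running over total cores; 58 for this line.
print(fields["running"] * 100 // fields["total"])
```

In the sample lines the logged Usage value matches running cores divided by total cores, truncated to a whole percent (689/1180 gives 58, 680/1180 gives 57).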

output of slurm-limiter --help

~$ slurm-limiter --help
usage: slurm-limiter [-h] [--debug] [--error-email ERROREMAIL]
                     [--cluster CLUSTER] [--partition PARTITION]
                     [--feature FEATURE] [--qos QOS]
                     [--maxaccountlimit MAXLIMIT] [--minaccountlimit MINLIMIT]
                     [--slaaccountlimit SLALIMIT]
                     [--userlimitoffset USERLIMITOFFSET]
                     [--changestep CHANGESTEP] [--minpending MINPENDING]
                     [--maxpercentuse MAXPERCENTUSE]
                     [--minidlenodes MINIDLENODES]

slurm-limiter checks the current util of a slurm cluster and adjusts the
account and user limits dynamically within certain range

optional arguments:
  -h, --help            show this help message and exit
  --debug, -d           verbose output for all commands
  --error-email ERROREMAIL, -e ERROREMAIL
                        send errors to this email address.
  --cluster CLUSTER, -M CLUSTER
                        name of the slurm cluster, (default: current cluster)
  --partition PARTITION, -p PARTITION
                        partition of the slurm cluster (default: entire
                        cluster)
  --feature FEATURE, -f FEATURE
                        filter for only this slurm feature
  --qos QOS, -q QOS     slurm QOS to use for changing account limits (default:
                        public)
  --maxaccountlimit MAXLIMIT, -x MAXLIMIT
                        maximum account limit, never go above this (default:
                        300)
  --minaccountlimit MINLIMIT, -n MINLIMIT
                        minimum account limit, never go below this (default:
                        100)
  --slaaccountlimit SLALIMIT, -t SLALIMIT
                        min SLA limit that has been committed to customers,
                        notify via email if breached (default: 150)
  --userlimitoffset USERLIMITOFFSET, -o USERLIMITOFFSET
                        offset of userlimit from account limit, set a negative
                        number for a userlimit lower than account limit
                        (default: 20)
  --changestep CHANGESTEP, -s CHANGESTEP
                        increase or decrease the limit by this # of cores
                        (default: 10)
  --minpending MINPENDING, -i MINPENDING
                        minimum number of jobs that have to be pending to take
                        action (default: 50)
  --maxpercentuse MAXPERCENTUSE, -u MAXPERCENTUSE
                        maximum allowed % usage in this cluster or partition
                        Throttle QOS down by --changestep if exceeded.
                        (default: 90)
  --minidlenodes MINIDLENODES, -w MINIDLENODES
                        critical minimum number of idle nodes. Throttle QOS
                        down to --minaccountlimit if exceeded. (default: 5)
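
The options above suggest a simple decision rule. The following sketch is one possible reading of the help text, not the actual slurm-limiter implementation: all function names are hypothetical, and both the order of the checks and the "step up when there is headroom" branch are assumptions. Default values are copied from the help output.

```python
def next_account_limit(current, usage_pct, idle_nodes, pending_jobs, *,
                       min_limit=100, max_limit=300, change_step=10,
                       min_pending=50, max_percent_use=90, min_idle_nodes=5):
    """Return the account limit for the next run (hypothetical decision rule)."""
    # Assumption: with too few pending jobs nobody is waiting, so do nothing.
    if pending_jobs < min_pending:
        return current
    # Critically few idle nodes: drop straight to the minimum limit.
    if idle_nodes < min_idle_nodes:
        return min_limit
    # Partition too busy: step down, but never below the minimum.
    if usage_pct > max_percent_use:
        return max(min_limit, current - change_step)
    # Assumption: otherwise there is headroom, so step up toward the maximum.
    return min(max_limit, current + change_step)


def user_limit(account_limit, offset=20):
    """User limit derived from the account limit (see --userlimitoffset)."""
    return account_limit + offset


def sla_breached(account_limit, sla_limit=150):
    """True if the limit fell below the committed SLA limit."""
    return account_limit < sla_limit


# Values loosely based on the cron example and the first sample log line;
# the log reports pending cores, used here as a stand-in for pending jobs.
limit = next_account_limit(350, 58, 101, 3299,
                           min_limit=50, max_limit=350, change_step=50)
print(limit, user_limit(limit), sla_breached(limit, sla_limit=300))
```

With these inputs the limit stays at the configured maximum of 350 and the derived user limit is 370, which agrees with the "Limits: 350 / 370" shown in the sample syslog output.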

future toys

In the future we can integrate other tools, predominantly ones found on GitHub:

https://github.com/search?l=Python&p=1&q=slurm+&type=Repositories

https://github.com/search?l=Shell&q=slurm+&type=Repositories

new tool

