A fault-tolerant SLURM scheduler extension

NAME

decimate - a fault-tolerant SLURM scheduler extension

SYNOPSIS

dbatch [ Slurm options ] [ --check <user_script> ]
[ --max-retry=<number of restarts> ]
script [args...]

DESCRIPTION

Developed by the KAUST Supercomputing Laboratory (KSL),
decimate is a SLURM extension written in Python, designed to
handle dependent jobs more easily and efficiently.

Decimate transparently adds parameters to the SLURM sbatch
command to check the correctness of jobs, and automatically
reschedules jobs found faulty.

Using Decimate on Shaheen II, one can submit, run, monitor, or
terminate a workflow composed of dependent jobs. On request,
the user is informed by mail of the progress of their workflow
on the system, through standardized or customized messages.

If one part of the workflow fails, decimate automatically
detects the failure, signals it to the user, and relaunches the
failed part after fixing the job dependencies. By default, if
the same failure happens three consecutive times, decimate
cancels the whole workflow and removes all dependent jobs from
the scheduling queue. A future version of decimate will allow
the workflow to be restarted automatically once the problem
causing its failure has been fixed.
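
The retry policy described above can be illustrated with a
minimal Python sketch. This is not decimate's implementation:
the function name run_step_with_retries and the callable step
are hypothetical, and the sketch assumes that the workflow is
cancelled once max_retry consecutive attempts have failed.

```python
from typing import Callable


def run_step_with_retries(step: Callable[[], bool], max_retry: int = 3) -> bool:
    """Run `step` until it succeeds or has failed `max_retry` times in a row.

    Returns True if the step eventually succeeded, and False if the
    caller should cancel the rest of the workflow (an assumption of
    this sketch, modelled on the default of three failures).
    """
    for attempt in range(1, max_retry + 1):
        if step():
            return True
        # A real scheduler would notify the user and resubmit the job here.
        print(f"attempt {attempt}/{max_retry} failed")
    return False
```

A step that fails twice and then succeeds returns True on the
third attempt. A step that always fails returns False after
max_retry attempts, at which point the dependent jobs would be
removed.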

Decimate also allows users to define their own mail alerts,
which can be sent at any point of the workflow through a call
to a Python method. This feature will also be available from
bash in a future version.

Users can also design customized checking functions. Their
purpose is to validate whether a step of the workflow was
successful. This could involve checking for the presence of
some result files, grepping them for error or success
messages, or computing a ratio or a checksum. These
intermediate results can easily be transmitted to decimate to
validate or reject any step. They can also be forwarded by mail
to the user while the workflow is executing.
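
As an illustration, a checking function of this kind might look
like the following Python 3 sketch. The interface between
decimate and a check script is not described here, so the
sketch assumes that the script reports success through its exit
status (0 for success, non-zero for failure). The file name
output.log and the marker "RUN COMPLETED" are hypothetical.

```python
from pathlib import Path
from typing import List


def check_results(result_file: Path, success_marker: str = "RUN COMPLETED") -> bool:
    """Return True if `result_file` exists and contains `success_marker`."""
    if not result_file.is_file():
        return False
    # Tolerate undecodable bytes so a corrupted log cannot crash the check.
    text = result_file.read_text(errors="replace")
    return success_marker in text


def main(argv: List[str]) -> int:
    """Return an exit status: 0 if the step looks correct, 1 otherwise.

    Assumption: the exit status is how the result is reported.
    """
    path = Path(argv[0]) if argv else Path("output.log")
    return 0 if check_results(path) else 1
```

To use this as a standalone script, end the file with a call
such as sys.exit(main(sys.argv[1:])) after importing sys.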

USE

At this moment, jobs only need to be submitted through the
dbatch command, which accepts exactly the same parameters as
the original SLURM sbatch command, plus the following new
parameters:

--check=SCRIPT_FILE
    where SCRIPT_FILE is a Python or shell script
    that checks whether the results are correct.

--max-retry=MAX_RETRY
    number of times a step can fail and be
    restarted automatically before the whole
    workflow fails (3 by default)
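
For workflows driven from Python, a dbatch command line could
be assembled as in the sketch below. The helper
build_dbatch_command is hypothetical and not part of decimate.
It only uses the two options documented above and passes any
other option through unchanged, as dbatch is stated to accept
the same parameters as sbatch.

```python
from typing import List, Optional, Sequence


def build_dbatch_command(
    script: str,
    script_args: Sequence[str] = (),
    slurm_options: Sequence[str] = (),
    check: Optional[str] = None,
    max_retry: Optional[int] = None,
) -> List[str]:
    """Build an argument list following the SYNOPSIS order:
    dbatch [Slurm options] [--check] [--max-retry] script [args...]
    """
    cmd = ["dbatch", *slurm_options]
    if check is not None:
        cmd.append(f"--check={check}")
    if max_retry is not None:
        cmd.append(f"--max-retry={max_retry}")
    cmd.append(script)
    cmd.extend(script_args)
    return cmd
```

For example, build_dbatch_command("job.sh", slurm_options=["--ntasks=4"],
check="check_step.py", max_retry=5) yields a list that can be
passed to subprocess.run on a system where dbatch is installed.
The file names job.sh and check_step.py are placeholders.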

sslog tails the decimate log file attached to the current
directory, which tracks all the jobs that were launched with
dbatch from this directory.

sstatus gives the current status of the workflow executing
in the current directory.

Decimate is still in a beta phase and under test with some of
our KSL users. More documentation will be provided once the
stabilized and fully tested version is made available by the
end of June 2018.

If you are interested in testing decimate or contributing,
please send a mail to help@hpc.kaust.edu.sa

AUTHOR

Written by Samuel Kortas (samuel.kortas (at) kaust.edu.sa)

REPORTING BUGS

Report decimate bugs to help@hpc.kaust.edu.sa


COPYRIGHT
Copyright (c) 2017, KAUST Supercomputing Laboratory
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

SEE ALSO

decimate official documentation pages:
<http://decimate.readthedocs.io>

KAUST Supercomputing Laboratory: <http://hpc.kaust.edu.sa/>

