
Model Ensemble for batch workflows on HPCs


Model Ensembler

Introduction

This is a tool to assist users in running model ensembles on HPCs, potentially in conjunction with other external systems. It is designed to be easy to extend for other HPC backends (currently only SLURM is supported), as well as easy to extend in code with new tasks that support the ensemble workflows.

Installation

Adapt these instructions to however you like to create virtual environments! Python 3.8 is the development Python currently in use, but anything above that is likely to work, as is possibly 3.7; 3.6 won't.

python3.8 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install model-ensembler

Checking it works

You can run the sanity checker with the following command, choosing either the dummy executor or slurm as appropriate.

TODO: v0.5.3: in the meantime you can run the examples

model_ensemble_check [dummy|slurm]

Basic Usage

There are some examples under "example" that can be run on a local machine (you can switch off SLURM submission via the -s CLI switch).

The basic pattern for using this toolkit is

  1. Create the execution environment (see previous section)
  2. Adapt a job/model run
  3. Write the YAML configuration
  4. Run the job

Adapt a job/model run

The core component of any ensemble run is a working cluster job. If you're not designing from scratch, think about the following in order to adapt the job to a batch configuration:

  • What source data/processing do you need before the whole batch and how do you get/do it
  • What source data/processing do you need before each run and how do you get/do it
  • What checks are needed before each run is submitted to slurm
  • What needs to change in the job for each run
  • What happens afterwards: what needs checking, where is the data going, what cleanup is required

Breaking each activity down should allow you to consider what pre- and post-processing you need to implement as single activities.

Quite a common issue with jobs is that people have a monolithic script doing everything that doesn't lend itself to batching. This monolith should be broken down into activities that can be templated out (to provide per-run variance) and individually assessed prior to moving on.

These activities are then all stitched together with the configuration.

YAML Configuration

To make up a set of runs, we use a YAML configuration file, which is clear to read and simple to manage.

The idea is that you can define a batch, or set of batches, containing individual runs that are individually templated and run. These runs are done according to a common configuration defined for the batch.

The configuration is split up into the following sections:

  • vars: global configuration defaults
  • pre_process/post_process: tasks to be run before any batches commence, or after they've completed
  • batches: a list of batches to be run concurrently
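As a rough sketch only (the exact schema is described in the wiki, and the values here are purely illustrative), these top-level sections are laid out along these lines:

vars:                  # global configuration defaults, available to all templates
  output_root: /data/ensembles   # illustrative variable, not a required key
pre_process: []        # tasks run before any batch commences
post_process: []       # tasks run after all batches have completed
batches:               # a list of batches, each structured as described below
  - name: example_batch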

Each batch is then split into the following sections (please note this is likely to change during development):

  • configuration: there are numerous options that control how the batch operates
    • name: an identifier that's used as the prefix for the run ID
    • templatedir: a directory that will be copied to create each run directory; it can contain both templates and symlinks
    • templates: a list of templates to be processed by Jinja (can be any text file)
    • job_file: the file to be used to submit to SLURM
    • cluster/basedir/email/nodes/ntasks/length: job_file parameters for SLURM
    • maxruns: the maximum number of runs to be processing (pre_run, actual run and post_run activities) at once
    • maxjobs: the maximum number of jobs to have running in the HPC at once
  • pre_batch/post_batch: tasks to be run before or after the batch
  • pre_run/post_run: tasks to be run prior to or after each run within the batch
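Pulling those options together, a single batch entry might look something like the sketch below; the values, file names and empty task lists are illustrative assumptions rather than a verified schema:

batches:
  - name: wrfrun                  # prefix for the run IDs
    templatedir: template         # copied to create each run directory
    templates:
      - namelist.input.j2         # illustrative Jinja-templated files
      - run_model.sh.j2
    job_file: run_model.sh        # the job submitted to SLURM
    cluster: short                # job_file parameters for SLURM
    basedir: /work/ensembles
    email: someone@example.com
    nodes: 1
    ntasks: 8
    length: "01:00:00"
    maxruns: 4                    # runs being processed at once
    maxjobs: 2                    # jobs running in the HPC at once
    pre_batch: []                 # tasks around the batch and each run,
    post_batch: []                # as described in the Tasks section
    pre_run: []
    post_run: []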

Tasks

There are numerous tasks that can be defined within the pre_ and post_ sections, which allow you to specify actions to take place throughout the execution lifetime.

Tasks are either checks, which block until a condition is satisfied, or processing tasks, which perform an action and result in failure if they do not complete successfully.

  • jobs (check): allows you to manually check that there aren't too many jobs running in the HPC.
  • submit (processing): manually submit a task to the HPC backend - in addition to the core submission specified by the configuration.
  • quota (check): allows you to check that you have enough user quota space to progress.
  • check (check): run a script that returns a success/failure error code. This can be failure tolerant (check will be repeated) or intolerant (failure will cause the run to fail)
  • execute (processing): run a script until completion
  • move (processing): copy (using rsync) run directory contents to another destination
  • remove (processing): remove either the run directory or another (specified) directory
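For illustration only, and assuming each task is written as a mapping from its name to its arguments (the argument names here, such as cmd, dest and atleast, are placeholders; the wiki documents the real ones), a run's tasks might be sketched as:

pre_run:
  - quota:
      atleast: 1000000          # placeholder argument: minimum free quota
  - check:                      # blocks until this script reports success
      cmd: check_inputs.sh      # placeholder argument name
post_run:
  - execute:                    # run a processing script to completion
      cmd: postprocess.sh
  - move:                       # rsync the run directory contents elsewhere
      dest: /data/archive
  - remove: {}                  # clean up the run directory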

Variables

Variables are available in templates with increasing granularity, with more specific definitions overriding more general ones. Defaults are specified by vars at the top level; the batch-level configuration and then the run dictionaries override these, and all of them are available within the templates.
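As a small, purely illustrative sketch of that precedence (the variable names and the runs key are assumptions for this example): a value set in vars is the default, a batch-level setting overrides it for that batch, and a per-run dictionary overrides both for that run, with the winning value visible to the templates (e.g. as {{ experiment }}).

vars:
  experiment: control           # global default
  nodes: 1
batches:
  - name: sensitivity
    nodes: 2                    # batch-level value, overrides the default
    runs:                       # assumed name for the per-run dictionaries
      - experiment: warm_sst    # overrides the vars default for this run only
      - experiment: cold_sst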

Running the job / CLI reference

usage: model_ensemble [-h] [-n] [-v] [-c] [-s] [-p] [-k SKIPS] [-i INDEXES]
                   [-ct CHECK_TIMEOUT] [-st SUBMIT_TIMEOUT]
                   [-rt RUNNING_TIMEOUT] [-et ERROR_TIMEOUT]
                   configuration

Contributing

This program is still under development and in its infancy, though it has progressed from a one-off tool to something reusable (at the British Antarctic Survey it has been used for running WRF ensembles numerous times and will help power future IceNet and Digital Twin pipelines).

Now that this is publicly available, contributions are welcome!

I'm now trying to keep to the Google Style Guide for documentation.

Future plans

Current plans are now captured in the GitHub issues. There's nothing long-term that I'm focusing on for this tool, except to maintain it and see if I can promote its usage a bit more.

This tool started out merely to help with a single support ticket for a weather model run, but the concept had potential and it was easier than deploying something more substantial! If there are better approaches or tools that do something similar, I'm very keen to look at them!

Certainly, things like Airflow and job arrays have similar concepts, but they are, respectively, either more heavyweight and less suitable deployment-wise, or not abstracted enough to simplify lives!

Cylc

We recently noticed Cylc, so in the medium term it's worth evaluating it and comparing it to model_ensembler: it seems pretty lightweight (heaviness being the reason many other workflow tools are a pain to use) and could be a good tool to use in its place. model_ensembler is just quick and easy, so moving to a decent graph-based workflow executor is preferable if you're thinking about long-term implementation and education.

Further documentation

Wherever this repository is hosted, there should also be a wiki. This goes into further detail about the configuration structure and operation.

Copyright

MIT LICENSE

© British Antarctic Survey 2021

