A tool to orchestrate branch-based workflows and automate job submission for ACCESS experiments.

access-experiment-runner

About

The ACCESS experiment runner manages and monitors experiment job runs in a supercomputing environment (e.g., Gadi). It builds on Payu and handles the orchestration of multiple configuration branches, experiment setup, and the job lifecycle.

Key features

  • Leverages Payu to run multiple experiments from different configuration branches.
  • Supports updating parameters even after branches have been created, eliminating the need to delete and recreate entire branches when corrections are required.
  • Submits and tracks PBS jobs on Gadi; oversees job lifecycle from submission through completion.
    • When a job completes within expected run times, the tool prints a confirmation and stops further submissions.
    • If a job fails, the tool detects the failure and pauses further actions, giving the user control over whether to resubmit. Users can inspect the working directory to diagnose the root cause.
    • Detects jobs that are already running or queued and skips them with a user notification, avoiding redundant submissions.

Installation

User setup

The experiment-runner is installed in the payu-dev conda environment, so loading the payu/dev module makes experiment-runner available directly:

module use /g/data/vk83/prerelease/modules && module load payu/dev

Alternatively, create and activate a Python virtual environment, then install with pip:

python3 -m venv <path/to/venv> --system-site-packages
source <path/to/venv>/bin/activate

pip install experiment-runner

Development setup

For contributors and developers, set up a development environment:

git clone https://github.com/ACCESS-NRI/access-experiment-runner.git
cd access-experiment-runner

# under a virtual environment
pip install -e .

Usage

experiment-runner --help

usage: experiment-runner [-h] [-i INPUT_YAML_FILE]

Manage ACCESS experiments using configurable YAML input.
If no YAML file is specified, the tool will look for 'Experiment_runner.yaml' in the current directory.
If that file is missing, you must specify one with -i / --input-yaml-file.

options:
  -h, --help            show this help message and exit
  -i INPUT_YAML_FILE, --input-yaml-file INPUT_YAML_FILE
                        Path to the YAML file specifying parameter values for experiment runs.
                        Defaults to 'Experiment_runner.yaml' if present in the current directory.
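
The fallback rule described in the help text can be sketched with argparse. This is a minimal illustration, not the tool's actual source: the function name `resolve_input_yaml` and the `cwd` parameter are hypothetical, and only the `-i` / `--input-yaml-file` option and the 'Experiment_runner.yaml' default come from the help output above.

```python
import argparse
from pathlib import Path

# Default file name taken from the help text above.
DEFAULT_YAML = "Experiment_runner.yaml"


def resolve_input_yaml(argv, cwd="."):
    """Return the YAML path to use (hypothetical helper, for illustration only)."""
    parser = argparse.ArgumentParser(prog="experiment-runner")
    parser.add_argument("-i", "--input-yaml-file", default=None)
    args = parser.parse_args(argv)

    # An explicit -i / --input-yaml-file always wins.
    if args.input_yaml_file is not None:
        return Path(args.input_yaml_file)

    # Otherwise fall back to the default file in the current directory.
    default_path = Path(cwd) / DEFAULT_YAML
    if default_path.is_file():
        return default_path

    # Neither given nor found: exit with a usage error (argparse exit status 2).
    parser.error(
        f"'{DEFAULT_YAML}' not found in {cwd}; "
        "specify one with -i / --input-yaml-file."
    )
```

Calling `resolve_input_yaml(["-i", "my.yaml"])` returns the given path, while `resolve_input_yaml([])` succeeds only if the default file exists in `cwd`.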

An example YAML file is provided in example/Experiment_runner_example.yaml:

test_path: /g/data/{PROJECT}/{USER}/prototype-0.1.0
repository_directory: 1deg_jra55_ryf
keep_uuid: True
running_branches: # List of experiment branches to run.
  - ctrl
  - perturb_1
  - perturb_2

nruns: # Number of runs for each branch; must match the order of running_branches.
  - 2
  - 0
  - 0

# Starting point for each branch. Options include:
#   cold: start from scratch (cold start).
#   control/restartXXX: start from a specific control run restart index.
#   perturb/restartXXX: start from a specific perturbation run restart index.
startfrom_restart:
  - cold
  - cold
  - cold

where,

test_path: Directory that holds all control and perturbation experiment repositories.

repository_directory: Local directory name for the central repository, where the running_branches are forked from.

running_branches: A list of git branches representing experiments to run.

keep_uuid: Preserve unique identifiers (UUIDs) across runs.

nruns: A list indicating how many runs to perform for each branch listed in running_branches.

startfrom_restart: Starting point for each branch.
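
Because nruns must follow the order of running_branches, it can help to check a configuration before submitting anything. The sketch below is a hypothetical pre-flight check, not part of experiment-runner: the function `validate_config` is invented for illustration, it works on an already-parsed dictionary, and it assumes that startfrom_restart also has one entry per branch and that the XXX in restartXXX is a run of digits.

```python
import re

# Assumption: an entry is either "cold" or "<name>/restart<digits>",
# e.g. "control/restart000". The exact format is not stated in this README.
_RESTART_RE = re.compile(r"^(cold|[^/]+/restart\d+)$")


def validate_config(config):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    branches = config.get("running_branches", [])

    # Each per-branch list should line up with running_branches.
    for key in ("nruns", "startfrom_restart"):
        values = config.get(key, [])
        if len(values) != len(branches):
            problems.append(
                f"{key} has {len(values)} entries, "
                f"running_branches has {len(branches)}"
            )

    # Run counts should be non-negative integers.
    for n in config.get("nruns", []):
        if not isinstance(n, int) or n < 0:
            problems.append(f"nruns entry {n!r} is not a non-negative integer")

    # Starting points should match the assumed pattern.
    for start in config.get("startfrom_restart", []):
        if not _RESTART_RE.match(str(start)):
            problems.append(f"startfrom_restart entry {start!r} is not recognised")

    return problems


example = {
    "running_branches": ["ctrl", "perturb_1", "perturb_2"],
    "nruns": [2, 0, 0],
    "startfrom_restart": ["cold", "cold", "cold"],
}
print(validate_config(example))
```

For the example configuration above the function returns an empty list; dropping one entry from nruns would produce a single length-mismatch message.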

Workflow example

  1. Trigger the experiment
experiment-runner -i example/Experiment_runner_example.yaml
  2. The tool then checks status:
  • Completed:
... already completed " {doneruns}, hence no new runs.
  • Failed:
Clean up a failed job {work_dir} and prepare it for resubmission.
  • Running/Queued:
You have duplicated runs for in the same folder hence not submitting this job!
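
The status check above amounts to a small decision table. The following sketch is one way to express it, under stated assumptions: `JobStatus`, `next_action`, and the action names are hypothetical and do not reflect the tool's internal API. In particular, submitting the remaining runs when fewer than the requested number have completed is an assumption.

```python
from enum import Enum


class JobStatus(Enum):
    """Hypothetical job states, named after the cases listed above."""
    COMPLETED = "completed"
    FAILED = "failed"
    RUNNING = "running"
    QUEUED = "queued"
    NOT_SUBMITTED = "not_submitted"


def next_action(status, done_runs, requested_runs):
    """Map a job status to an action name (illustrative only)."""
    if status in (JobStatus.RUNNING, JobStatus.QUEUED):
        # A job already exists in this folder: do not submit a duplicate.
        return "skip_duplicate"
    if status is JobStatus.FAILED:
        # Clean up the work directory and leave resubmission to the user.
        return "clean_up_for_resubmission"
    if done_runs >= requested_runs:
        # All requested runs are done, hence no new runs.
        return "no_new_runs"
    # Assumption: any remaining runs are submitted.
    return "submit"


print(next_action(JobStatus.COMPLETED, done_runs=2, requested_runs=2))
print(next_action(JobStatus.QUEUED, done_runs=0, requested_runs=2))
```

Keeping the decision in one pure function makes each branch easy to test without touching PBS.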
