A node vetting CLI for distributed workloads

Node Vetting for Distributed Workloads

Ensure allocated nodes are vetted before executing a distributed workload through a series of configurable sanity checks. These checks are designed to detect highly dynamic issues (e.g., GPU temperature) and should be performed immediately before executing the main distributed job.

Features

  • Fast and lightweight
  • 🛠️ Modular and configurable
  • 🚀 Easy to extend

Getting Started

# Install
pip install vetnode

# Check dependencies and install requirements
vetnode setup ./examples/local-test/config.yaml

# Run the vetting process
vetnode diagnose ./examples/local-test/config.yaml

Workflow Usage Example

The vetnode CLI is intended to be embedded into your HPC workflow. The following example vets nodes for an ML (machine learning) workflow on a Slurm HPC cluster.

#!/bin/bash

#SBATCH --nodes=6
#SBATCH --time=0-00:15:00
#SBATCH --account=a-csstaff

REQUIRED_NODES=4

vetnode setup ../examples/slurm-job-with-vetting/config.yaml
srun vetnode diagnose ../examples/slurm-job-with-vetting/config.yaml >> results.txt

# Extract node lists
grep '^Cordon:' results.txt | awk '{print $2}' > cordoned-nodes.txt
grep '^Vetted:' results.txt | awk '{print $2}' > vetted-nodes.txt

# Build the main command only after vetted-nodes.txt exists;
# torchrun is the console entry point for the torch.distributed.run module
VETTED_COUNT=$(wc -l < vetted-nodes.txt)
MAIN_JOB_COMMAND="python -m torch.distributed.run --nproc_per_node=$VETTED_COUNT main.py"

# Run on healthy nodes only
if [ "$VETTED_COUNT" -ge "$REQUIRED_NODES" ]; then
    srun -N "$REQUIRED_NODES" --exclude=./cordoned-nodes.txt $MAIN_JOB_COMMAND
else
    echo "Job canceled!"
    echo "Reason: too few vetted nodes."
    exit 1
fi
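
The node-list extraction step can be tried without a cluster. The following is a minimal sketch that assumes each result line has the form "Vetted: <node>" or "Cordon: <node>", as the grep patterns above imply; the function name split_results and the node names are hypothetical and not part of vetnode.

```shell
#!/bin/bash
# Sketch: split vetting output into vetted and cordoned node lists.
# Assumption: result lines look like "Vetted: <node>" or "Cordon: <node>".

split_results() {
    local results="$1" outdir="$2"
    # Match on the first field and write the node name (second field)
    awk -v out="$outdir" '
        $1 == "Vetted:" { print $2 > (out "/vetted-nodes.txt") }
        $1 == "Cordon:" { print $2 > (out "/cordoned-nodes.txt") }
    ' "$results"
    # Make sure both files exist even if one category is empty
    touch "$outdir/vetted-nodes.txt" "$outdir/cordoned-nodes.txt"
}

# Made-up sample output for illustration only
workdir=$(mktemp -d)
cat > "$workdir/results.txt" <<'EOF'
Vetted: nid001
Cordon: nid002
Vetted: nid003
EOF

split_results "$workdir/results.txt" "$workdir"
echo "vetted nodes:"
cat "$workdir/vetted-nodes.txt"
echo "cordoned nodes:"
cat "$workdir/cordoned-nodes.txt"
```

Creating both files unconditionally keeps later steps such as the wc -l count from failing when no node falls into one of the categories.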

Quick Run

The following is a Slurm job example you can download and run as a test.

curl -o job.sh https://raw.githubusercontent.com/theely/vetnode/refs/heads/main/examples/slurm-ml-vetting/job.sh
sbatch --account=a-csstaff job.sh

# Check job status ({jobid} is the ID printed by sbatch)
squeue -j {jobid} --long

# Check vetting results
cat vetnode-{jobid}/results.txt
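
To avoid copying the job ID by hand, the submission can be scripted. This is a sketch, not part of vetnode: it relies on Slurm's sbatch --parsable option, which prints the job ID, optionally followed by a semicolon and the cluster name. The helper name job_id_from_parsable is hypothetical.

```shell
#!/bin/bash
# Sketch: extract the numeric job ID from `sbatch --parsable` output,
# which has the form "<jobid>" or "<jobid>;<cluster>".

job_id_from_parsable() {
    # Keep everything before the first ';'
    printf '%s\n' "${1%%;*}"
}

# On a cluster (not executed here), usage might look like:
#   jobid=$(job_id_from_parsable "$(sbatch --parsable --account=a-csstaff job.sh)")
#   squeue -j "$jobid" --long
#   cat "vetnode-$jobid/results.txt"

# Local demonstration with a made-up value
job_id_from_parsable "123456;mycluster"
```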

Development

Set up a Python virtual environment

Create a virtual environment:

python3.11 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
pip install -r requirements.txt

Run the CLI

cd src
python -m vetnode diagnose ../examples/local-test/config.yaml

Running Tests

From the project root folder, run pytest to execute all unit tests.

source .venv/bin/activate
pip install -r ./requirements.txt -r ./requirements-testing.txt
pytest

Distribute

pip install -r ./requirements-testing.txt
python3 -m build --wheel
twine upload dist/*
