Node Vetting for Distributed Workloads
Ensure allocated nodes are vetted before executing a distributed workload through a series of configurable sanity checks. These checks are designed to detect highly dynamic issues (e.g., GPU temperature) and should be performed immediately before executing the main distributed job.
Features
- ⚡ Fast and lightweight
- 🛠️ Modular and configurable
- 🚀 Easy to extend
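As a rough illustration of the "easy to extend" claim, a custom check might take the following shape. The class and method names here are purely illustrative assumptions, not vetnode's actual plugin API; see the configs under `examples/` in the repository for the real check interface.

```python
# Hypothetical sketch of a vetnode-style sanity check (illustrative names,
# NOT vetnode's real API). A check inspects one highly dynamic node property
# and reports pass/fail plus a human-readable reason.
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    reason: str


class GpuTemperatureCheck:
    """Fail the node if any GPU exceeds a configurable temperature limit."""

    def __init__(self, max_celsius: float = 80.0):
        self.max_celsius = max_celsius

    def run(self, gpu_temps: list) -> CheckResult:
        hottest = max(gpu_temps)
        if hottest > self.max_celsius:
            return CheckResult(False, f"GPU at {hottest:.0f}C exceeds {self.max_celsius:.0f}C limit")
        return CheckResult(True, "all GPUs within temperature limit")


check = GpuTemperatureCheck(max_celsius=80.0)
print(check.run([55.0, 61.0, 58.0]).passed)  # True: healthy node
print(check.run([55.0, 91.0, 58.0]).passed)  # False: overheating GPU
```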
Getting Started
```bash
# Install
pip install vetnode

# Check for dependencies and install requirements
vetnode setup ./examples/local-test/config.yaml

# Run the vetting process
vetnode diagnose ./examples/local-test/config.yaml
```
Workflow Usage Example
The vetnode CLI is intended to be embedded in your HPC workflow. The following is a node vetting example for a machine learning (ML) workflow on a Slurm HPC cluster.
```bash
#!/bin/bash
#SBATCH --nodes=6
#SBATCH --time=0-00:15:00
#SBATCH --account=a-csstaff

REQUIRED_NODES=4

vetnode setup ../examples/slurm-ml-vetting/config.yaml
srun vetnode diagnose ../examples/slurm-ml-vetting/config.yaml >> results.txt

# Extract node lists
grep '^Cordon:' results.txt | awk '{print $2}' > cordoned-nodes.txt
grep '^Vetted:' results.txt | awk '{print $2}' > vetted-nodes.txt

# Run on healthy nodes only
if [ "$(wc -l < vetted-nodes.txt)" -ge "$REQUIRED_NODES" ]; then
    # Define the command only after vetted-nodes.txt exists, so the
    # $(wc -l ...) substitution sees the actual vetted node count.
    MAIN_JOB_COMMAND="python -m torch.distributed.run --nproc_per_node=$(wc -l < vetted-nodes.txt) main.py"
    srun -N "$REQUIRED_NODES" --exclude=./cordoned-nodes.txt $MAIN_JOB_COMMAND
else
    echo "Job canceled!"
    echo "Reason: too few vetted nodes."
fi
```
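For clarity, the grep/awk extraction above is equivalent to the following Python sketch (the sample results text is made up for illustration; the real `results.txt` is produced by `vetnode diagnose`):

```python
# Parse vetting output lines of the form "Vetted: <node>" / "Cordon: <node>",
# mirroring the grep '^Vetted:' / awk '{print $2}' pipeline in the Slurm script.
sample = """\
Vetted: nid001
Cordon: nid002
Vetted: nid003
"""

vetted = [line.split()[1] for line in sample.splitlines() if line.startswith("Vetted:")]
cordoned = [line.split()[1] for line in sample.splitlines() if line.startswith("Cordon:")]

print(vetted)    # ['nid001', 'nid003']
print(cordoned)  # ['nid002']
```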
Quick Run
The following is a Slurm job example you can download and run as a test.
```bash
curl -o job.sh https://raw.githubusercontent.com/theely/vetnode/refs/heads/main/examples/slurm-ml-vetting/job.sh
sbatch --account=a-csstaff job.sh

# Check job status
squeue -j {jobid} --long

# Check vetting results
cat vetnode-{jobid}/results.txt
```
Development
Set up a Python virtual environment
Create a virtual environment:
```bash
python3.11 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
pip install -r requirements.txt
```
Run the CLI
```bash
cd src
python -m vetnode setup ../examples/local-test/config.yaml
python -m vetnode diagnose ../examples/local-test/config.yaml
```
Running Tests
From the project root folder, run pytest to execute all unit tests.
```bash
source .venv/bin/activate
pip install -r ./requirements.txt -r ./requirements-testing.txt
pytest
```
Distribute
```bash
pip install -r ./requirements-testing.txt
python3 -m build --wheel
twine upload dist/*
```
Info dump
Clariden Distro:
```
NAME="SLES"
VERSION="15-SP5"
VERSION_ID="15.5"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP5"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp5"
DOCUMENTATION_URL="https://documentation.suse.com/"
```
```bash
./configure --prefix=/users/palmee/aws-ofi-nccl/install \
    --disable-tests --without-mpi --enable-cudart-dynamic \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/
```
libhwloc5 RPMs:
- https://download.opensuse.org/repositories/home:/aeszter/openSUSE_Leap_15.3/x86_64/libhwloc5-1.11.8-lp153.1.1.x86_64.rpm
- https://download.opensuse.org/repositories/home:/aeszter/15.5/x86_64/libhwloc5-1.11.8-lp155.1.1.x86_64.rpm
Build plugin in image
```bash
export DOCKER_DEFAULT_PLATFORM=linux/amd64
#export DOCKER_DEFAULT_PLATFORM=linux/arm64

docker run -i -t registry.suse.com/suse/sle15:15.5

zypper install -y libtool git gcc awk make wget
zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/sles15/sbsa/cuda-sles15.repo
zypper --non-interactive --gpg-auto-import-keys refresh
zypper install -y cuda-toolkit-12-3
```
Add missing lib path required by hwloc
```bash
echo "/usr/local/cuda/targets/x86_64-linux/lib/stubs/" | tee /etc/ld.so.conf.d/nvidiaml-x86_64.conf
echo "/usr/local/cuda/targets/sbsa-linux/lib/stubs/" | tee /etc/ld.so.conf.d/nvidiaml-sbsa.conf
ldconfig
ldconfig -p | grep libnvidia
```
```bash
git clone -b v1.19.0 https://github.com/ofiwg/libfabric.git
cd libfabric
autoupdate
./autogen.sh
#CC=gcc ./configure --prefix=/users/palmee/libfabric/install
CC=gcc ./configure
make
make install
```
```bash
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
#CC=gcc ./configure --prefix=/users/palmee/hwloc-2.12.0/install
CC=gcc ./configure
make
make install
```
```bash
git clone -b v1.14.0 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install
GIT_COMMIT=$(git rev-parse --short HEAD)

./autogen.sh
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=./install/v1.14.0-${GIT_COMMIT}/x86_64/12.3/ \
    --with-cuda=/usr/local/cuda
```
TODO: consider building an rpm: https://www.redhat.com/en/blog/create-rpm-package
```bash
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install_2/ \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install

export LD_LIBRARY_PATH=/opt/cray/libfabric/1.15.2.0/lib64/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/lib64/:$LD_LIBRARY_PATH

# Check that the plugin's dynamic dependencies resolve
ldd /users/palmee/aws-ofi-nccl/install_2/lib/libnccl-net.so
```
Install NCCL
```bash
git clone https://github.com/NVIDIA/nccl.git
cd nccl
git checkout v2.20.3-1  # looks like this is the version compatible with cuda/12.3
make src.build CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/
```
WORKING LIB (job 327119)
| payload | busbw | algbw |
|---|---|---|
| 1GiB | 90.94GBps | 46.94GBps |
| 2GiB | 91.24GBps | 47.09GBps |
| 4GiB | 91.35GBps | 47.15GBps |
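Assuming these are NCCL all_reduce benchmark results, busbw and algbw should relate by the standard correction factor 2(n-1)/n for n ranks. The ~1.937 busbw/algbw ratio in the tables is consistent with n = 32 ranks; the rank count is an inference, not recorded in these notes. A quick arithmetic check:

```python
# NCCL all_reduce bus bandwidth: busbw = algbw * 2*(n-1)/n for n ranks.
# The measured busbw/algbw ratio of ~1.937 implies n = 32 ranks.
n = 32
algbw = 46.94  # GB/s, from the 1 GiB payload row above
busbw = algbw * 2 * (n - 1) / n
print(round(busbw, 2))  # close to the measured 90.94 GB/s
```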
```bash
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
./configure --prefix=/users/palmee/hwloc-2.12.0/install
make
make install

git clone -b v1.14.0 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install

./autogen.sh
./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install/ \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install
```
TEST with gcc (job 327124) - working
| payload | busbw | algbw |
|---|---|---|
| 1GiB | 91.06GBps | 47.00GBps |
| 2GiB | 91.24GBps | 47.09GBps |
| 4GiB | 91.34GBps | 47.15GBps |
```bash
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
CC=gcc ./configure --prefix=/users/palmee/hwloc-2.12.0/install
make
make install

git clone -b v1.14.0 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install

./autogen.sh
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install_3/ \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install
make
make install
```
TEST with gcc all (job 327130) - running
| payload | busbw | algbw |
|---|---|---|
| 1GiB | 91.05GBps | 46.99GBps |
| 2GiB | 91.24GBps | 47.09GBps |
| 4GiB | 91.34GBps | 47.14GBps |
```bash
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
CC=gcc ./configure --prefix=/users/palmee/hwloc-2.12.0/install
make
make install

git clone -b v1.14.0 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install

./autogen.sh
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install_3/ \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install
make
make install
```
TEST with local libfabric only for compile (job 327145) -
| payload | busbw | algbw |
|---|---|---|
| 1GiB | 91.08GBps | 47.01GBps |
| 2GiB | 91.21GBps | 47.08GBps |
| 4GiB | 91.34GBps | 47.14GBps |
```bash
git clone -b v1.19.0 https://github.com/ofiwg/libfabric.git
cd libfabric
autoupdate
./autogen.sh
CC=gcc ./configure --prefix=/users/palmee/libfabric/install
make
make install

wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
CC=gcc ./configure --prefix=/users/palmee/hwloc-2.12.0/install
make
make install

git clone -b v1.14.0 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install

./autogen.sh
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install_4/ \
    --with-libfabric=/users/palmee/libfabric/install \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install
make
make install
```
TEST with local libfabric compile and job run (job 327161) -
NOT WORKING! We need to use the Cray libfabric.
Build plugin in image
```bash
export DOCKER_DEFAULT_PLATFORM=linux/amd64

docker run -i -t registry.suse.com/suse/sle15:15.5

zypper install -y libtool git gcc awk make wget
zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
zypper refresh
zypper install -y cuda-toolkit-12-3
```
Build in Clariden
```bash
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
CC=gcc ./configure --prefix=/users/palmee/hwloc-2.12.0/install
#CC=gcc ./configure
make
make install

git clone -b v1.14.1 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install

./autogen.sh
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install/ \
    --with-libfabric=/opt/cray/libfabric/1.22.0/ \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install
make
make install

export LD_LIBRARY_PATH=/opt/cray/libfabric/1.15.2.0/lib64/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/lib64/:$LD_LIBRARY_PATH

# Check that the plugin's dynamic dependencies resolve
ldd /users/palmee/aws-ofi-nccl/install/lib/libnccl-net.so
```