Node Vetting for Distributed Workloads
Ensure allocated nodes are vetted before executing a distributed workload through a series of configurable sanity checks. These checks are designed to detect highly dynamic issues (e.g., GPU temperature) and should be performed immediately before executing the main distributed job.
Features
- ⚡ Fast and lightweight
- 🛠️ Modular and configurable
- 🚀 Easy to extend
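As a rough illustration of the "easy to extend" claim, a custom check might take the following shape. The class and method names here are purely illustrative assumptions, not vetnode's actual plugin API; see the configs under `examples/` in the repository for the real check interface.

```python
# Hypothetical sketch of a vetnode-style sanity check (illustrative names,
# NOT vetnode's real API). A check inspects one highly dynamic node property
# and reports pass/fail plus a human-readable reason.
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    reason: str


class GpuTemperatureCheck:
    """Fail the node if any GPU exceeds a configurable temperature limit."""

    def __init__(self, max_celsius: float = 80.0):
        self.max_celsius = max_celsius

    def run(self, gpu_temps: list) -> CheckResult:
        hottest = max(gpu_temps)
        if hottest > self.max_celsius:
            return CheckResult(False, f"GPU at {hottest:.0f}C exceeds {self.max_celsius:.0f}C limit")
        return CheckResult(True, "all GPUs within temperature limit")


check = GpuTemperatureCheck(max_celsius=80.0)
print(check.run([55.0, 61.0, 58.0]).passed)  # True: healthy node
print(check.run([55.0, 91.0, 58.0]).passed)  # False: overheating GPU
```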
Getting Started
```bash
# Install
pip install vetnode

# Check for dependencies and install requirements
vetnode setup ./examples/local-test/config.yaml

# Run the vetting process
vetnode diagnose ./examples/local-test/config.yaml
```
Workflow Usage Example
The vetnode CLI is intended to be embedded in your HPC workflow. The following is a node vetting example for a machine learning (ML) workflow on a Slurm HPC cluster.
```bash
#!/bin/bash
#SBATCH --nodes=6
#SBATCH --time=0-00:15:00
#SBATCH --account=a-csstaff

REQUIRED_NODES=4

vetnode setup ../examples/slurm-ml-vetting/config.yaml
srun vetnode diagnose ../examples/slurm-ml-vetting/config.yaml >> results.txt

# Extract node lists
grep '^Cordon:' results.txt | awk '{print $2}' > cordoned-nodes.txt
grep '^Vetted:' results.txt | awk '{print $2}' > vetted-nodes.txt

# Run on healthy nodes only
if [ "$(wc -l < vetted-nodes.txt)" -ge "$REQUIRED_NODES" ]; then
    # Define the command only after vetted-nodes.txt exists, so the
    # $(wc -l ...) substitution sees the actual vetted node count.
    MAIN_JOB_COMMAND="python -m torch.distributed.run --nproc_per_node=$(wc -l < vetted-nodes.txt) main.py"
    srun -N "$REQUIRED_NODES" --exclude=./cordoned-nodes.txt $MAIN_JOB_COMMAND
else
    echo "Job canceled!"
    echo "Reason: too few vetted nodes."
fi
```
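For clarity, the grep/awk extraction above is equivalent to the following Python sketch (the sample results text is made up for illustration; the real `results.txt` is produced by `vetnode diagnose`):

```python
# Parse vetting output lines of the form "Vetted: <node>" / "Cordon: <node>",
# mirroring the grep '^Vetted:' / awk '{print $2}' pipeline in the Slurm script.
sample = """\
Vetted: nid001
Cordon: nid002
Vetted: nid003
"""

vetted = [line.split()[1] for line in sample.splitlines() if line.startswith("Vetted:")]
cordoned = [line.split()[1] for line in sample.splitlines() if line.startswith("Cordon:")]

print(vetted)    # ['nid001', 'nid003']
print(cordoned)  # ['nid002']
```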
Quick Run
The following is a Slurm job example you can download and run as a test.
```bash
curl -o job.sh https://raw.githubusercontent.com/theely/vetnode/refs/heads/main/examples/slurm-ml-vetting/job.sh
sbatch --account=a-csstaff job.sh

# Check job status
squeue -j {jobid} --long

# Check vetting results
cat vetnode-{jobid}/results.txt
```
Development
Set up a Python virtual environment
Create a virtual environment:
```bash
python3.11 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
pip install -r requirements.txt
```
Run the CLI
```bash
cd src
python -m vetnode setup ../examples/local-test/config.yaml
python -m vetnode diagnose ../examples/local-test/config.yaml
```
Running Tests
From the project root folder, run pytest to execute all unit tests.
```bash
source .venv/bin/activate
pip install -r ./requirements.txt -r ./requirements-testing.txt
pytest
```
Distribute
```bash
pip install -r ./requirements-testing.txt
python3 -m build --wheel
twine upload dist/*
```
Info dump
Clariden Distro:
```
NAME="SLES"
VERSION="15-SP5"
VERSION_ID="15.5"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP5"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp5"
DOCUMENTATION_URL="https://documentation.suse.com/"
```
```bash
./configure --prefix=/users/palmee/aws-ofi-nccl/install \
    --disable-tests --without-mpi --enable-cudart-dynamic \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/
```
libhwloc5 RPMs:
- https://download.opensuse.org/repositories/home:/aeszter/openSUSE_Leap_15.3/x86_64/libhwloc5-1.11.8-lp153.1.1.x86_64.rpm
- https://download.opensuse.org/repositories/home:/aeszter/15.5/x86_64/libhwloc5-1.11.8-lp155.1.1.x86_64.rpm
Build plugin in image
```bash
export DOCKER_DEFAULT_PLATFORM=linux/amd64
#export DOCKER_DEFAULT_PLATFORM=linux/arm64

docker run -i -t registry.suse.com/suse/sle15:15.5

zypper install -y libtool git gcc awk make wget
zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/sles15/sbsa/cuda-sles15.repo
zypper --non-interactive --gpg-auto-import-keys refresh
zypper install -y cuda-toolkit-12-3
```
Add missing lib path required by hwloc
```bash
echo "/usr/local/cuda/targets/x86_64-linux/lib/stubs/" | tee /etc/ld.so.conf.d/nvidiaml-x86_64.conf
echo "/usr/local/cuda/targets/sbsa-linux/lib/stubs/" | tee /etc/ld.so.conf.d/nvidiaml-sbsa.conf
ldconfig
ldconfig -p | grep libnvidia
```
```bash
git clone -b v1.19.0 https://github.com/ofiwg/libfabric.git
cd libfabric
autoupdate
./autogen.sh
#CC=gcc ./configure --prefix=/users/palmee/libfabric/install
CC=gcc ./configure
make
make install
```
```bash
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
#CC=gcc ./configure --prefix=/users/palmee/hwloc-2.12.0/install
CC=gcc ./configure
make
make install
```
```bash
git clone -b v1.14.0 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install
GIT_COMMIT=$(git rev-parse --short HEAD)

./autogen.sh
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=./install/v1.14.0-${GIT_COMMIT}/x86_64/12.3/ \
    --with-cuda=/usr/local/cuda
```
TODO: consider building an rpm: https://www.redhat.com/en/blog/create-rpm-package
```bash
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install_2/ \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install

export LD_LIBRARY_PATH=/opt/cray/libfabric/1.15.2.0/lib64/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/lib64/:$LD_LIBRARY_PATH

# Check that the plugin's dynamic dependencies resolve
ldd /users/palmee/aws-ofi-nccl/install_2/lib/libnccl-net.so
```
Install NCCL
```bash
git clone https://github.com/NVIDIA/nccl.git
cd nccl
git checkout v2.20.3-1  # looks like this is the version compatible with cuda/12.3
make src.build CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/
```
WORKING LIB (job 327119)
| payload | busbw | algbw |
|---|---|---|
| 1GiB | 90.94GBps | 46.94GBps |
| 2GiB | 91.24GBps | 47.09GBps |
| 4GiB | 91.35GBps | 47.15GBps |
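Assuming these are NCCL all_reduce benchmark results, busbw and algbw should relate by the standard correction factor 2(n-1)/n for n ranks. The ~1.937 busbw/algbw ratio in the tables is consistent with n = 32 ranks; the rank count is an inference, not recorded in these notes. A quick arithmetic check:

```python
# NCCL all_reduce bus bandwidth: busbw = algbw * 2*(n-1)/n for n ranks.
# The measured busbw/algbw ratio of ~1.937 implies n = 32 ranks.
n = 32
algbw = 46.94  # GB/s, from the 1 GiB payload row above
busbw = algbw * 2 * (n - 1) / n
print(round(busbw, 2))  # close to the measured 90.94 GB/s
```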
```bash
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
./configure --prefix=/users/palmee/hwloc-2.12.0/install
make
make install

git clone -b v1.14.0 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install

./autogen.sh
./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install/ \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install
```
TEST with gcc (job 327124) - working
| payload | busbw | algbw |
|---|---|---|
| 1GiB | 91.06GBps | 47.00GBps |
| 2GiB | 91.24GBps | 47.09GBps |
| 4GiB | 91.34GBps | 47.15GBps |
```bash
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
CC=gcc ./configure --prefix=/users/palmee/hwloc-2.12.0/install
make
make install

git clone -b v1.14.0 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install

./autogen.sh
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install_3/ \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install
make
make install
```
TEST with gcc all (job 327130) - running
| payload | busbw | algbw |
|---|---|---|
| 1GiB | 91.05GBps | 46.99GBps |
| 2GiB | 91.24GBps | 47.09GBps |
| 4GiB | 91.34GBps | 47.14GBps |
```bash
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
CC=gcc ./configure --prefix=/users/palmee/hwloc-2.12.0/install
make
make install

git clone -b v1.14.0 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install

./autogen.sh
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install_3/ \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install
make
make install
```
TEST with local libfabric only for compile (job 327145) -
| payload | busbw | algbw |
|---|---|---|
| 1GiB | 91.08GBps | 47.01GBps |
| 2GiB | 91.21GBps | 47.08GBps |
| 4GiB | 91.34GBps | 47.14GBps |
```bash
git clone -b v1.19.0 https://github.com/ofiwg/libfabric.git
cd libfabric
autoupdate
./autogen.sh
CC=gcc ./configure --prefix=/users/palmee/libfabric/install
make
make install

wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
CC=gcc ./configure --prefix=/users/palmee/hwloc-2.12.0/install
make
make install

git clone -b v1.14.0 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install

./autogen.sh
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install_4/ \
    --with-libfabric=/users/palmee/libfabric/install \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install
make
make install
```
TEST with local libfabric compile and job run (job 327161) -
NOT WORKING! We need to use the Cray libfabric.
Build plugin in image
```bash
export DOCKER_DEFAULT_PLATFORM=linux/amd64

docker run -i -t registry.suse.com/suse/sle15:15.5

zypper install -y libtool git gcc awk make wget
zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
zypper refresh
zypper install -y cuda-toolkit-12-3
```
Build in Clariden
```bash
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.0.tar.gz
tar -xvzf hwloc-2.12.0.tar.gz
cd hwloc-2.12.0
CC=gcc ./configure --prefix=/users/palmee/hwloc-2.12.0/install
#CC=gcc ./configure
make
make install

git clone -b v1.14.1 https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
mkdir install

./autogen.sh
CC=gcc ./configure --disable-tests --without-mpi \
    --enable-cudart-dynamic \
    --prefix=/users/palmee/aws-ofi-nccl/install/ \
    --with-libfabric=/opt/cray/libfabric/1.22.0/ \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/ \
    --with-hwloc=/users/palmee/hwloc-2.12.0/install
make
make install

export LD_LIBRARY_PATH=/opt/cray/libfabric/1.15.2.0/lib64/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_aarch64/24.3/cuda/12.3/lib64/:$LD_LIBRARY_PATH

# Check that the plugin's dynamic dependencies resolve
ldd /users/palmee/aws-ofi-nccl/install/lib/libnccl-net.so
```