A node vetting CLI for distributed workloads
Node Vetting for Distributed Workloads
Vet allocated nodes with a series of configurable sanity checks before executing a distributed workload. These checks are designed to detect highly dynamic issues (e.g., GPU temperature) and should be run immediately before the main distributed job.
Features
- ⚡ Fast and lightweight
- 🛠️ Modular and configurable
- 🚀 Easy to extend
Getting Started
# Install
pip install vetnode
# checks for dependencies and installs requirements
vetnode setup ./examples/local-test/config.yaml
# runs the vetting process
vetnode diagnose ./examples/local-test/config.yaml
Workflow Usage Example
The vetnode CLI is intended to be embedded into your HPC workflow. The following example vets nodes for an ML (machine learning) workflow on a Slurm HPC cluster.
#!/bin/bash
#SBATCH --nodes=6
#SBATCH --time=0-00:15:00
#SBATCH --account=a-csstaff
REQUIRED_NODES=4
vetnode setup ../examples/slurm-job-with-vetting/config.yaml
srun vetnode diagnose ../examples/slurm-job-with-vetting/config.yaml >> results.txt
# Extract node lists
grep '^Cordon:' results.txt | awk '{print $2}' > cordoned-nodes.txt
grep '^Vetted:' results.txt | awk '{print $2}' > vetted-nodes.txt
# Build the main job command only after vetted-nodes.txt exists,
# because $(...) is expanded when the variable is assigned
MAIN_JOB_COMMAND="python -m torch.distributed.run --nproc_per_node=$(wc -l < vetted-nodes.txt) main.py"
# Run on healthy nodes only
if [ "$(wc -l < vetted-nodes.txt)" -ge "$REQUIRED_NODES" ]; then
    srun -N "$REQUIRED_NODES" --exclude=./cordoned-nodes.txt $MAIN_JOB_COMMAND
else
    echo "Job canceled!"
    echo "Reason: too few vetted nodes."
fi
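The extraction and gating steps of the script above can be tried without a Slurm allocation. The following is a minimal sketch, assuming `vetnode diagnose` emits one `Vetted: <node>` or `Cordon: <node>` line per node, as the `grep` patterns above imply. The helper names (`extract_nodes`, `enough_nodes`) and the sample node names are hypothetical and not part of vetnode.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of the vetnode CLI.

# Print the unique node names from lines whose first field is the given
# label, e.g. "Vetted:" or "Cordon:" (line format assumed from the script above).
extract_nodes() {
    local label="$1" results_file="$2"
    awk -v label="$label" '$1 == label {print $2}' "$results_file" | sort -u
}

# Succeed if the node list file has at least the required number of lines.
enough_nodes() {
    local nodes_file="$1" required="$2"
    [ "$(wc -l < "$nodes_file")" -ge "$required" ]
}

# Demo with made-up results for three nodes.
workdir="$(mktemp -d)"
printf 'Vetted: nid001\nCordon: nid002\nVetted: nid003\n' > "$workdir/results.txt"

extract_nodes "Vetted:" "$workdir/results.txt" > "$workdir/vetted-nodes.txt"
extract_nodes "Cordon:" "$workdir/results.txt" > "$workdir/cordoned-nodes.txt"

if enough_nodes "$workdir/vetted-nodes.txt" 2; then
    echo "Enough vetted nodes: $(paste -sd, "$workdir/vetted-nodes.txt")"
else
    echo "Too few vetted nodes."
fi
```

Deduplicating with `sort -u` is a design choice of this sketch: if a node were reported more than once, the line count would otherwise overstate the number of healthy nodes.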
Quick Run
The following is a Slurm job example you can download and run as a test.
curl -o job.sh https://raw.githubusercontent.com/theely/vetnode/refs/heads/main/examples/slurm-ml-vetting/job.sh
sbatch --account=a-csstaff job.sh
# Check job status
squeue -j {jobid} --long
# Check vetting results
cat vetnode-{jobid}/results.txt
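The `{jobid}` placeholder above is the ID that `sbatch` reports on submission. Below is a small sketch of capturing it in a script, assuming the default `Submitted batch job <id>` message. `parse_job_id` is a hypothetical helper, and the sample ID is made up.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from sbatch's
# default submission message ("Submitted batch job <id>").
parse_job_id() {
    awk '/^Submitted batch job/ {print $4}'
}

# Demo on a canned message so the sketch runs without Slurm.
# On a cluster this would be:
#   jobid="$(sbatch --account=a-csstaff job.sh | parse_job_id)"
jobid="$(echo 'Submitted batch job 123456' | parse_job_id)"

# Print the follow-up commands with the placeholder filled in.
echo "squeue -j $jobid --long"
echo "cat vetnode-$jobid/results.txt"
```

Alternatively, `sbatch --parsable` prints the job ID directly (followed by `;` and the cluster name on multi-cluster setups), which avoids parsing the message.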
Development
Set up Python virtual environment
Create a virtual environment:
python3.11 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
pip install -r requirements.txt
Run the CLI
cd src
python -m vetnode diagnose ../examples/local-test/config.yaml
Running Tests
From the project root folder, run pytest to execute all unit tests.
source .venv/bin/activate
pip install -r ./requirements.txt -r ./requirements-testing.txt
pytest
Distribute
pip install -r ./requirements-testing.txt
python3 -m build --wheel
twine upload dist/*
File details
Details for the file vetnode-0.0.3-py3-none-any.whl.
File metadata
- Download URL: vetnode-0.0.3-py3-none-any.whl
- Upload date:
- Size: 11.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 878f8121c35af0adeba4acf6d07b6b0a0598ac257d83689d8e711dc65d10fba9 |
| MD5 | 37baa0cf307ca4ff67896397f4455034 |
| BLAKE2b-256 | 834d210bea71bc85fc266f2604d9e3b50f227ed8f529e399858998cd70508b8a |