Identify and reduce instances of underutilization by the users of high-performance computing systems
Project description
Job Defense Shield
Job Defense Shield is a software tool for identifying and reducing instances of underutilization by the users of high-performance computing systems. The software can (1) send automated email alerts to users, (2) create reports for system administrators, and (3) automatically cancel GPU jobs at 0% utilization. Job Defense Shield is a component of the Jobstats job monitoring platform.
Below is an example report for 0% GPU utilization:
                     GPU-Hours at 0% Utilization
---------------------------------------------------------------------
    User   GPU-Hours-At-0%  Jobs             JobID             Emails
---------------------------------------------------------------------
1  u12998        308         39   62285369,62303767,62317153+   1 (7)
2  u9l487         84         14   62301737,62301738,62301742+   0
3  u39635         25          2   62184669,62187323             2 (4)
4  u24074         24         13   62303182,62303183,62303184+   0
---------------------------------------------------------------------
   Cluster: della
Partitions: gpu, llm
     Start: Wed Feb 12, 2025 at 09:50 AM
       End: Wed Feb 19, 2025 at 09:50 AM
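To make the shape of this report concrete, here is a minimal sketch of how such a ranking could be assembled from per-job records. It is not the tool's implementation. The `records` list, the per-job hour values, the job ID 62303185, and the `summarize` helper are all hypothetical, and the trailing `+` is assumed to mark a job list truncated to three IDs.

```python
from collections import defaultdict

# Hypothetical per-job records: (user, job_id, gpu_hours_at_zero_utilization).
# The hour values and job ID 62303185 are invented for illustration; the
# other users and job IDs are borrowed from the example report.
records = [
    ("u39635", 62184669, 15.0),
    ("u39635", 62187323, 10.0),
    ("u24074", 62303182, 2.0),
    ("u24074", 62303183, 2.0),
    ("u24074", 62303184, 2.0),
    ("u24074", 62303185, 2.0),
]


def summarize(records, max_ids=3):
    """Group records by user and rank users by GPU-hours at 0% utilization."""
    hours = defaultdict(float)
    job_ids = defaultdict(list)
    for user, job_id, gpu_hours in records:
        hours[user] += gpu_hours
        job_ids[user].append(job_id)

    rows = []
    # Users with the most idle GPU-hours come first.
    for user in sorted(hours, key=hours.get, reverse=True):
        ids = sorted(job_ids[user])
        shown = ",".join(str(j) for j in ids[:max_ids])
        if len(ids) > max_ids:
            shown += "+"  # assumed meaning: more job IDs exist than are shown
        rows.append((user, round(hours[user]), len(ids), shown))
    return rows


for rank, row in enumerate(summarize(records), start=1):
    print(rank, *row)
```

Sorting by total idle GPU-hours rather than by job count puts the users who waste the most capacity at the top of the report.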
Below is an example email to a user who is requesting too much CPU memory:
Hi Alan (u12345),
Below are your jobs that ran on the Stellar cluster in the past 7 days:
  JobID    Memory-Used  Memory-Allocated  Percent-Used  Cores  Hours
 5761066       2 GB          100 GB             2%         1      48
 5761091       4 GB          100 GB             4%         1      48
 5761092       3 GB          100 GB             3%         1      48
It appears that you are requesting too much CPU memory for your jobs since
you are only using on average 3% of the allocated memory. For help on
allocating CPU memory with Slurm, please see:
https://your-institution.edu/knowledge-base/memory
Replying to this automated email will open a support ticket with Research
Computing.
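The percentages in this email follow from simple arithmetic on the used and allocated memory. The sketch below reproduces them from the values in the example. The `jobs` list layout and the `percent_used` helper are assumptions made for illustration, not the tool's API.

```python
# Values copied from the example email:
# (job_id, memory_used_gb, memory_allocated_gb)
jobs = [
    (5761066, 2, 100),
    (5761091, 4, 100),
    (5761092, 3, 100),
]


def percent_used(used_gb, allocated_gb):
    """Return memory used as a percentage of memory allocated."""
    return 100.0 * used_gb / allocated_gb


# Per-job percentages, then their mean across the listed jobs.
per_job = [percent_used(used, allocated) for _, used, allocated in jobs]
mean_percent = sum(per_job) / len(per_job)

print([round(p) for p in per_job])  # [2, 4, 3]
print(round(mean_percent))          # 3
```

Because every job here requests the same 100 GB, the mean of the per-job percentages equals total used over total allocated (9 GB of 300 GB); with unequal allocations the two averages would differ.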
Getting Started
See the documentation for installing and running the software.
Download files
File details
Details for the file job_defense_shield-1.2.6.tar.gz.
File metadata
- Download URL: job_defense_shield-1.2.6.tar.gz
- Upload date:
- Size: 115.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 69aa666592c4bd6ab2eb343f7a51e50b0df79ad8be329cb15d94b61e8b5bf35b |
| MD5 | a74001697c4db2a169ea94a25cbec086 |
| BLAKE2b-256 | 4e1049159c56ff9c8c8ff1cbcaf37f13f2dc8efcd6820019c5464a6c0168ff3c |
File details
Details for the file job_defense_shield-1.2.6-py3-none-any.whl.
File metadata
- Download URL: job_defense_shield-1.2.6-py3-none-any.whl
- Upload date:
- Size: 90.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ce6da4a90b2c9eff5ede96a8acf17d56f5a6a21f4d7559b82a1abd0c0a75b70d |
| MD5 | 3b7a606467556b123f71f7ac6d6f019d |
| BLAKE2b-256 | 9e8ab5ed83185c18923515e22fd24548a4067f10d17f2b97b8b8a50e50284242 |
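The listed digests can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `sha256_of` and `matches` are hypothetical, and the wheel is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digest listed for the wheel in the file hashes table.
EXPECTED = "ce6da4a90b2c9eff5ede96a8acf17d56f5a6a21f4d7559b82a1abd0c0a75b70d"
wheel = Path("job_defense_shield-1.2.6-py3-none-any.whl")  # assumed location

if wheel.exists():
    print("OK" if matches(wheel, EXPECTED) else "MISMATCH")
```

Reading in chunks keeps memory use constant regardless of file size. SHA256 is the digest to prefer for this check; MD5 is no longer considered collision-resistant.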