Identify and reduce instances of underutilization by the users of high-performance computing systems

License: GPL v2

Job Defense Shield

Job Defense Shield is a software tool for identifying and reducing instances of underutilization by the users of high-performance computing systems. The software can (1) send automated email alerts to users, (2) create reports for system administrators, and (3) automatically cancel GPU jobs at 0% utilization. Job Defense Shield is a component of the Jobstats job monitoring platform.

Below is an example report for 0% GPU utilization:

                         GPU-Hours at 0% Utilization
---------------------------------------------------------------------
    User   GPU-Hours-At-0%  Jobs             JobID             Emails
---------------------------------------------------------------------
1  u12998        308         39   62285369,62303767,62317153+   1 (7)
2  u9l487         84         14   62301737,62301738,62301742+   0         
3  u39635         25          2            62184669,62187323    2 (4)         
4  u24074         24         13   62303182,62303183,62303184+   0         
---------------------------------------------------------------------
   Cluster: della
Partitions: gpu, llm
     Start: Wed Feb 12, 2025 at 09:50 AM
       End: Wed Feb 19, 2025 at 09:50 AM
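
The aggregation behind a report like the one above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the record layout, the `zero_util_report` function, and the sample values are all hypothetical, and the sketch assumes that a trailing `+` in the JobID column means more jobs exist than the three shown.

```python
from collections import defaultdict

# Hypothetical job records: (user, job_id, gpu_hours, mean_gpu_utilization_percent).
# These values are invented for illustration and are not taken from the report above.
jobs = [
    ("u00001", 101, 12.0, 0.0),
    ("u00001", 102, 8.0, 0.0),
    ("u00002", 103, 5.0, 87.5),
    ("u00001", 104, 4.0, 0.0),
    ("u00001", 105, 1.0, 0.0),
    ("u00003", 106, 3.0, 0.0),
]

def zero_util_report(records, max_ids=3):
    """Sum the GPU-hours of jobs at 0% utilization per user, largest total first."""
    hours = defaultdict(float)
    job_ids = defaultdict(list)
    for user, job_id, gpu_hours, util in records:
        if util == 0:
            hours[user] += gpu_hours
            job_ids[user].append(job_id)

    rows = []
    for user in sorted(hours, key=hours.get, reverse=True):
        ids = sorted(job_ids[user])
        shown = ",".join(str(i) for i in ids[:max_ids])
        # Assumption: "+" marks a list that was cut off after max_ids entries.
        if len(ids) > max_ids:
            shown += "+"
        rows.append((user, round(hours[user]), len(ids), shown))
    return rows

report = zero_util_report(jobs)
for rank, (user, gpu_hours, n_jobs, ids) in enumerate(report, start=1):
    print(f"{rank}  {user}  {gpu_hours:>5}  {n_jobs:>3}  {ids}")
```

Jobs with any nonzero utilization (here the one belonging to `u00002`) are left out of the totals entirely.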

Below is an example email to a user who is requesting too much CPU memory:

Hi Alan (u12345),

Below are your jobs that ran on the Stellar cluster in the past 7 days:

     JobID   Memory-Used  Memory-Allocated  Percent-Used  Cores  Hours
    5761066      2 GB          100 GB            2%         1     48
    5761091      4 GB          100 GB            4%         1     48
    5761092      3 GB          100 GB            3%         1     48

It appears that you are requesting too much CPU memory for your jobs since
you are only using on average 3% of the allocated memory. For help on
allocating CPU memory with Slurm, please see:

    https://your-institution.edu/knowledge-base/memory

Replying to this automated email will open a support ticket with Research
Computing.
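
The 3% figure in the email above can be reproduced from the table. The sketch below copies the three rows from the email and assumes the average is a plain mean of the per-job percentages; the tool may weight jobs differently, and the `percent_used` helper is hypothetical.

```python
# Rows copied from the example email: (job_id, memory_used_gb, memory_allocated_gb).
jobs = [
    (5761066, 2, 100),
    (5761091, 4, 100),
    (5761092, 3, 100),
]

def percent_used(used_gb, allocated_gb):
    """Return memory used as a percentage of memory allocated."""
    return 100 * used_gb / allocated_gb

# Per-job values for the Percent-Used column.
per_job = [percent_used(used, allocated) for _, used, allocated in jobs]

# Assumption: an unweighted mean across jobs.
mean_percent = sum(per_job) / len(per_job)

for (job_id, used, allocated), pct in zip(jobs, per_job):
    print(f"{job_id}  {used} GB of {allocated} GB  {pct:.0f}%")
print(f"Average: {mean_percent:.0f}% of the allocated memory")
```

Because all three jobs have the same allocation, core count, and run time, any reasonable weighting would give the same average here.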

Getting Started

See the documentation for installing and running the software.


