
A summary of GPU usage on a SLURM cluster


sgpustat

sgpustat is a simple command line utility that produces a summary of GPU usage on a SLURM cluster, following the naming convention of the other SLURM tools (squeue, sinfo, scontrol, ...). The tool can be used in two ways:

  1. To query the current usage of GPUs on the cluster.
  2. To launch a daemon which will log usage over time. This log can later be queried to provide usage statistics.

All data comes from exactly two scontrol calls per invocation, so the tool stays fast even on busy clusters. GPU accounting is exact, including on nodes with NVIDIA MIG instances and for jobs submitted with untyped --gres=gpu:N requests.
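The two-call collection pattern can be sketched as follows. This is a minimal illustration, not the code in collect.py: the function names and the exact node command line are assumptions, and only scontrol show job -dd is named elsewhere in this README.

```python
import subprocess
from typing import Callable, List, Tuple

# Assumed command lines; the real tool may pass different flags.
NODE_CMD = ["scontrol", "-o", "show", "node"]
JOB_CMD = ["scontrol", "show", "job", "-dd"]


def run_scontrol(cmd: List[str]) -> str:
    """Run one scontrol command and return its stdout as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def collect(runner: Callable[[List[str]], str] = run_scontrol) -> Tuple[str, str]:
    """Fetch node and job state with exactly two calls.

    The runner is injectable so that recorded output can stand in for a
    live cluster, for example in tests.
    """
    nodes_text = runner(NODE_CMD)
    jobs_text = runner(JOB_CMD)
    return nodes_text, jobs_text
```

Everything downstream (filtering by partition, per-user totals, per-node breakdowns) can then work on these two strings without contacting the scheduler again.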

This project began as a fork of albanie/slurm_gpustat; the implementation has since been rewritten.

Installation

Install from PyPI:

pip install sgpustat

The code is split across five modules:

  • core.py: parsing and accounting logic
  • collect.py: data collection
  • render.py: rendering
  • daemon.py: the logging daemon
  • cli.py: the CLI entry point

Usage

To print a summary of current activity:

sgpustat

To print a summary of current activity on particular partitions, e.g. debug and normal, pass a comma-separated list with -p or --partition:

sgpustat -p debug,normal

sgpustat --partition debug,normal

To include a per-node breakdown of available GPUs:

sgpustat --verbose

To output machine-readable CSV:

sgpustat --raw

To start the logging daemon:

sgpustat --action daemon-start

To view a summary of logged data:

sgpustat --action history
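When calling the tool from a script, the flags listed above can be assembled into an argument list. The helper below is a hypothetical sketch: build_command is not part of sgpustat, and it assumes the flags may be combined freely, which this README does not state.

```python
from typing import List, Optional, Sequence


def build_command(
    partitions: Optional[Sequence[str]] = None,
    verbose: bool = False,
    raw: bool = False,
    action: Optional[str] = None,
) -> List[str]:
    """Build an sgpustat argument list from the documented flags."""
    cmd = ["sgpustat"]
    if partitions:
        # Partitions are passed as one comma-separated value.
        cmd += ["--partition", ",".join(partitions)]
    if verbose:
        cmd.append("--verbose")
    if raw:
        cmd.append("--raw")
    if action:
        # Documented values are "daemon-start" and "history".
        cmd += ["--action", action]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True, text=True) on a machine where sgpustat is installed.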

Example output

SLURM Cluster GPU Status
========================

GPU Summary

+----------------------------+-------+----------+-------------+
| GPU model                  |   all |   online |   available |
+============================+=======+==========+=============+
| total                      |   214 |      193 |          51 |
+----------------------------+-------+----------+-------------+
| nvidia_geforce_rtx_3090    |    68 |       53 |          11 |
+----------------------------+-------+----------+-------------+
| nvidia_geforce_rtx_2080_ti |    54 |       54 |          22 |
+----------------------------+-------+----------+-------------+
| nvidia_a100-sxm4-80gb      |    36 |       32 |           0 |
+----------------------------+-------+----------+-------------+

----------------------------------------------------------------------

Usage by User

+---------+------------------------+-------------------------------+
| User    |   Total GPUs Allocated | Count per GPU Type            |
+=========+========================+===============================+
| user01  |                     24 | nvidia_geforce_rtx_2080_ti:24 |
+---------+------------------------+-------------------------------+

With --verbose, each GPU type is broken down per node:

nvidia_geforce_rtx_3090: 11 available
  -> gpunode14: 2 nvidia_geforce_rtx_3090 [cpu: 56/64, gpu: 6/8, mem: 376G/500G] [user02,user03]
  -> gpunode15: 4 nvidia_geforce_rtx_3090 [cpu: 56/64, gpu: 4/8, mem: 180G/500G] [user02]

Notes on accounting

  • "all" counts every configured GPU; "online" excludes GPUs on nodes whose state contains a flag such as DRAIN, DOWN, or MAINT; "available" counts unallocated GPUs on online nodes.
  • GPU inventory is read from each node's Gres= field (not CfgTRES, whose typed entries can be incomplete for MIG profiles).
  • Per-job allocations come from the per-node GRES=...(IDX:...) detail lines of scontrol show job -dd, falling back to the job's typed AllocTRES and then to TresPerNode.
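The accounting rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in core.py: all names are hypothetical, the offline flag list covers only the three states named above, and the Gres= and GRES=...(IDX:...) formats are assumed to look like the sample strings in the comments. The AllocTRES and TresPerNode fallbacks are not covered.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple

OFFLINE_FLAGS = ("DRAIN", "DOWN", "MAINT")  # assumed subset of the real list
UNTYPED = "untyped"  # hypothetical label for GPUs without a model name

# Matches e.g. "gpu:a100:2(IDX:0,3)", "gpu:2(IDX:0-1)" or "gpu(IDX:0)".
DETAIL_RE = re.compile(r"\bgpu(?P<spec>(?::[^:(,\s]+)*)\(IDX:(?P<idx>[0-9,\-]*)\)")


def parse_gres(gres: str) -> Counter:
    """Count GPUs per model in a Gres= value such as 'gpu:a100:4(S:0-1)'."""
    counts: Counter = Counter()
    if not gres or gres == "(null)":
        return counts
    # Drop parenthesised suffixes so commas inside them do not split entries.
    cleaned = re.sub(r"\([^)]*\)", "", gres)
    for entry in cleaned.split(","):
        fields = entry.strip().split(":")
        if fields[0] != "gpu":
            continue
        if len(fields) == 3:
            counts[fields[1]] += int(fields[2])
        elif len(fields) == 2 and fields[1].isdigit():
            counts[UNTYPED] += int(fields[1])
        elif len(fields) == 2:
            counts[fields[1]] += 1
        elif len(fields) == 1:
            counts[UNTYPED] += 1
    return counts


def count_idx(idx: str) -> int:
    """Count the indices in an IDX list such as '0-1,3'."""
    total = 0
    for part in idx.split(","):
        if "-" in part:
            low, high = part.split("-")
            total += int(high) - int(low) + 1
        elif part:
            total += 1
    return total


def allocations_in_line(line: str) -> Counter:
    """Count allocated GPUs per model in one per-node job detail line."""
    counts: Counter = Counter()
    for match in DETAIL_RE.finditer(line):
        fields = [f for f in match.group("spec").split(":") if f]
        typed = bool(fields) and not (len(fields) == 1 and fields[0].isdigit())
        counts[fields[0] if typed else UNTYPED] += count_idx(match.group("idx"))
    return counts


def summarize(
    nodes: Iterable[Tuple[str, str, str]],
    allocated: Mapping[str, Mapping[str, int]],
) -> Dict[str, Dict[str, int]]:
    """Build all/online/available counts per model.

    nodes yields (name, state, gres); allocated maps a node name to the
    number of GPUs in use per model on that node.
    """
    table: Dict[str, Dict[str, int]] = {}
    for name, state, gres in nodes:
        online = not any(flag in state.upper() for flag in OFFLINE_FLAGS)
        used = allocated.get(name, {})
        for model, total in parse_gres(gres).items():
            row = table.setdefault(model, {"all": 0, "online": 0, "available": 0})
            row["all"] += total
            if online:
                row["online"] += total
                row["available"] += max(total - used.get(model, 0), 0)
    return table
```

Counting allocations from the IDX list rather than from the requested count means the per-node figure reflects the GPU indices actually assigned to the job on that node.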

Dependencies

  • Python >= 3.8
  • tabulate
  • termcolor

Tests

Run the tests with python -m pytest tests/. No SLURM installation is required; the tests run against recorded scontrol fixtures.
