A summary of GPU usage on a SLURM cluster

sgpustat

sgpustat is a simple command line utility that produces a summary of GPU usage on a SLURM cluster, following the naming convention of the other SLURM tools (squeue, sinfo, scontrol, ...). The tool can be used in two ways:

  1. To query the current usage of GPUs on the cluster.
  2. To launch a daemon which will log usage over time. This log can later be queried to provide usage statistics.

All data comes from exactly two scontrol calls per invocation, so it is fast even on busy clusters. GPU accounting is exact, including on nodes with NVIDIA MIG instances and for jobs submitted with untyped --gres=gpu:N requests.
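
To illustrate the two-call pattern, here is a minimal sketch, not the project's actual collect.py. The accounting notes name scontrol show job -dd as the job query; the node query (scontrol show node) and the helper names run_scontrol and collect are assumptions made for this example.

```python
import subprocess


def run_scontrol(args):
    """Run scontrol with the given arguments and return its stdout as text."""
    result = subprocess.run(
        ["scontrol", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def collect(run=run_scontrol):
    """Gather cluster state with one scontrol call for nodes and one for jobs.

    `run` is injectable so the function can be exercised without SLURM.
    """
    # Assumed node query: one record per node, including its Gres= field.
    nodes_text = run(["show", "node"])
    # Job query named in the accounting notes; -dd adds per-node GRES detail.
    jobs_text = run(["show", "job", "-dd"])
    return nodes_text, jobs_text
```

Passing a fake `run` callable (for example one that returns recorded scontrol output) lets the same function be tested on a machine without SLURM.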

This project began as a fork of albanie/slurm_gpustat; the implementation has since been rewritten.

Installation

Install via pip install sgpustat. The pre-rename slurm_gpustat command is kept as an alias. The parsing/accounting logic lives in core.py, data collection in collect.py, rendering in render.py, the logging daemon in daemon.py, and the CLI entry point in cli.py.

Usage

To print a summary of current activity:

sgpustat

To print a summary of current activity on particular partitions, e.g. debug and normal:

sgpustat -p debug,normal or sgpustat --partition debug,normal

To include a per-node breakdown of available GPUs:

sgpustat --verbose

To output machine-readable CSV:

sgpustat --raw

Output is colorized when stdout is a terminal; --color 0 or the NO_COLOR environment variable disables it, --color 1 forces it (e.g. when piping to less -R).
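
The color rules above can be sketched as a small decision function. This is an illustrative sketch rather than the tool's own code: the name use_color is hypothetical, and the precedence (an explicit --color value wins over NO_COLOR, which wins over terminal detection) is an assumption, since the text does not say how the settings interact.

```python
import os
import sys


def use_color(color_flag=None, stream=None, environ=None):
    """Decide whether to emit ANSI colors.

    color_flag: 1 forces color, 0 disables it, None means "decide automatically".
    """
    stream = sys.stdout if stream is None else stream
    environ = os.environ if environ is None else environ
    # Assumed precedence: an explicit flag overrides everything else.
    if color_flag is not None:
        return bool(color_flag)
    # NO_COLOR convention: any non-empty value disables color.
    if environ.get("NO_COLOR"):
        return False
    # Otherwise colorize only when writing to a terminal.
    return stream.isatty()
```

With these assumptions, piping to less -R with --color 1 still yields colored output, because the explicit flag is checked before the terminal test.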

To start the logging daemon:

sgpustat --action daemon-start

To view a summary of logged data:

sgpustat --action history

Example output

SLURM Cluster GPU Status
========================

GPU Summary

+----------------------------+-------+----------+-------------+
| GPU model                  |   all |   online |   available |
+============================+=======+==========+=============+
| total                      |   214 |      193 |          51 |
+----------------------------+-------+----------+-------------+
| nvidia_geforce_rtx_3090    |    68 |       53 |          11 |
+----------------------------+-------+----------+-------------+
| nvidia_geforce_rtx_2080_ti |    54 |       54 |          22 |
+----------------------------+-------+----------+-------------+
| nvidia_a100-sxm4-80gb      |    36 |       32 |           0 |
+----------------------------+-------+----------+-------------+

----------------------------------------------------------------------

Usage by User

+---------+------------------------+-------------------------------+
| User    |   Total GPUs Allocated | Count per GPU Type            |
+=========+========================+===============================+
| user01  |                     24 | nvidia_geforce_rtx_2080_ti:24 |
+---------+------------------------+-------------------------------+

With --verbose, each GPU type is broken down per node:

nvidia_geforce_rtx_3090: 11 available
  -> gpunode14: 2 nvidia_geforce_rtx_3090 [cpu: 56/64, gpu: 6/8, mem: 376G/500G] [user02,user03]
  -> gpunode15: 4 nvidia_geforce_rtx_3090 [cpu: 56/64, gpu: 4/8, mem: 180G/500G] [user02]

Notes on accounting

  • "all" counts every configured GPU; "online" excludes nodes whose state contains DRAIN/DOWN/MAINT/etc.; "available" is unallocated GPUs on online nodes.
  • GPU inventory is read from each node's Gres= field (not CfgTRES, whose typed entries can be incomplete for MIG profiles).
  • Per-job allocations come from the per-node GRES=...(IDX:...) detail lines of scontrol show job -dd, falling back to the job's typed AllocTRES and then to TresPerNode.
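
The first two notes can be sketched as follows. This is a simplified illustration under stated assumptions, not the logic in core.py: the regular expression, the function names parse_gres and summarize, the input shapes, and the OFFLINE_MARKERS tuple (the text lists DRAIN/DOWN/MAINT "etc.", so the real list is likely longer) are all made up for this example. Untyped entries such as gpu:4 are filed under the placeholder type "gpu".

```python
import re
from collections import Counter

# Illustrative subset of offline state markers; the real list is longer.
OFFLINE_MARKERS = ("DRAIN", "DOWN", "MAINT")

# Matches "gpu:<type>:<count>" and untyped "gpu:<count>" entries in a Gres= value.
GRES_ENTRY = re.compile(r"\bgpu(?::(?P<type>[^:,()]+))?:(?P<count>\d+)")


def parse_gres(gres):
    """Count GPUs per type in a Gres= value such as 'gpu:a100:4(S:0-1)'."""
    counts = Counter()
    for match in GRES_ENTRY.finditer(gres):
        counts[match.group("type") or "gpu"] += int(match.group("count"))
    return counts


def summarize(nodes, allocated):
    """Compute the 'all', 'online' and 'available' columns.

    nodes:     {node_name: (state_string, gres_string)}
    allocated: {node_name: Counter of GPUs currently in use, keyed by type}
    """
    summary = {"all": Counter(), "online": Counter(), "available": Counter()}
    for name, (state, gres) in nodes.items():
        inventory = parse_gres(gres)
        summary["all"] += inventory
        # Nodes whose state contains an offline marker count only toward "all".
        if any(marker in state for marker in OFFLINE_MARKERS):
            continue
        summary["online"] += inventory
        # Counter subtraction keeps only positive remainders.
        summary["available"] += inventory - allocated.get(name, Counter())
    return summary
```

For instance, a node in state IDLE+DRAIN with Gres gpu:nvidia_geforce_rtx_3090:8 would add 8 to "all" but nothing to "online" or "available".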

Dependencies

  • Python >= 3.8
  • tabulate
  • termcolor >= 2.1

Tests

python -m pytest tests/ (no SLURM installation required; tests run against recorded scontrol fixtures).
