Skip to main content

Utilities for HPC clusters including PBS/Slurm job statistics and a psutil-based (CPU/GPU/Memory) resource monitor.

Project description

hpc-stats-scripts

Utilities for working with high-performance computing (HPC) environments. The scripts help inspect PBS/Slurm job efficiency and monitor CPU/GPU and memory usage on a running system or process tree.

Made with Codex :)

Dependencies

Install the required Python packages with pip:

Feature Packages Install command
Core utilities psutil pip install psutil
Plotting for psutil-monitor matplotlib, numpy pip install matplotlib numpy
GPU monitoring for psutil-monitor --gpu nvidia-ml-py3 (pynvml) pip install nvidia-ml-py3
Plot + GPU combo psutil, matplotlib, numpy, nvidia-ml-py3 pip install psutil matplotlib numpy nvidia-ml-py3
All extras via pip extras plot + GPU pip install .[all]

The pbs-bulk-user-stats command also expects the PBS qstat utility to be available in your environment. The slurm-bulk-user-stats command expects Slurm's sacct utility to be available in your environment.

Installation

Clone the repository and install with pip:

Install from PyPI (once published):
```bash
pip install hpc-stats-scripts
pip install hpc-stats-scripts[plot]
pip install hpc-stats-scripts[gpu]
pip install hpc-stats-scripts[all]

Or install directly from GitHub (the latest version):

pip install "hpc-stats-scripts @ git+https://github.com/avnikonenko/hpc-stats-scripts.git"
pip install "hpc-stats-scripts[plot] @ git+https://github.com/avnikonenko/hpc-stats-scripts.git"
pip install "hpc-stats-scripts[gpu] @ git+https://github.com/avnikonenko/hpc-stats-scripts.git"
pip install "hpc-stats-scripts[all] @ git+https://github.com/avnikonenko/hpc-stats-scripts.git"

The base installation depends on psutil. The plot extra pulls in matplotlib and numpy for the --plot feature of psutil-monitor. The gpu extra installs nvidia-ml-py3 to enable --gpu.

CLI tools

pbs-bulk-user-stats

Summarize CPU and memory usage for PBS jobs and show which nodes the jobs are allocated to. The command relies on qstat being available in your PATH. The table now includes NGPUS (requested GPUs) when present.

Examples:

# Summarize a specific job and write CSV output
pbs-bulk-user-stats --job 12345 --csv stats.csv

# Summarize all jobs for the current user (default) 
pbs-bulk-user-stats --include-finished

# Summarize all jobs for a specific user
pbs-bulk-user-stats --user myuser --include-finished

# Include finished jobs but only keep 20 finished entries
pbs-bulk-user-stats --user myuser --include-finished --finished-limit 20

# Faster best-effort mode: fetch active jobs + only N finished job details (use if default option takes too long)
pbs-bulk-user-stats --user myuser --include-finished --finished-limit 20 --finished-limit-strategy fetch

When invoked with no --user or --job options:

  • On a login node (no $PBS_JOBID present), it summarizes all jobs for the current user.
  • Inside a running PBS job (where $PBS_JOBID is set), it automatically summarizes that specific job.
pbs-bulk-user-stats

State codes (PBS):

  • R running, Q queued/waiting, X finished (requires --include-finished), other codes are printed under “other” in the summary.

Expected output (CPU/RAM only):

$ pbs-bulk-user-stats

JOBID    STATE   NAME       NODES    NCPUS  WALL(h)  CPUT(h)  avgCPU  CPUeff  memUsed   memReq   memEff
-------------------------------------------------------------------------------------------------------
0001      R      run1		pbs-1    176    38.55    3632.12  163.6  93.53%  207.4 GiB 256.00 GiB 81.10%
0002      R      run2		pbs-2    176    38.59    3589.72  93.13  52.91%  50.02 GiB 256.00 GiB 19.54%
...
Summary:
  jobs:         5
  unique nodes: 3
  states:       R=4  Q=1  X=0  other=0
  mean CPUeff:  75.20%
  mean avgCPU:  132.35
  mean memEff:  82.50%
  max memUsed:  230.16 GiB

or if run inside a running PBS:

JOBID  STATE  NAME   NODES  NCPUS  WALL(h)  CPUT(h)  avgCPU  CPUeff  memUsed     memReq     memEff
-----------------------------------------------------------------------------------------------------
0001   R      STDIN  pbs-5  100    0.03     0.01     0.22    0.22%   666.58 MiB  30.00 GiB  2.17% 

Summary:
  jobs:        1
  mean CPUeff: 0.22%
  mean avgCPU: 0.22
  mean memEff: 2.17%
  max memUsed: 666.58 MiB

After the table, a summary reports the job count, mean CPU efficiency, mean average CPU usage, mean memory efficiency, and the peak memory used across all listed jobs.

psutil-monitor

Real-time CPU and memory monitor for the system or a process tree. Use --gpu to also report aggregate GPU utilization and memory via NVML (requires nvidia-ml-py3). When --csv/--plot are used, metrics stream live to the terminal during the run; CSV/PNG files are written when the monitor exits (Ctrl+C, duration reached, or proc tree ends).

Example output files (generated with --plot and --csv):

  • Plot (CPU + GPU stacked):

    psutil-monitor example plot

  • CSV: docs/psutil-monitor-example.csv

GPU output fields (when --gpu is used):

  • GPU util: Average utilization across visible GPUs.
  • busyGPUs: Sum of utilization fractions (e.g., two GPUs at 50% each → 1.0).
  • GPU mem %: Aggregate GPU memory usage percentage.
  • Per-GPU (CSV gpu_pergpu): index:util%/used/total for each device.

Examples:

# System-wide (by default) monitoring with console output only
psutil-monitor

# System-wide monitoring with CSV and PNG output
psutil-monitor --mode system --csv node.csv --plot node.png

# Monitor the current process tree (useful inside a PBS job)
psutil-monitor --mode proc --pid $$ --include-children --csv job.csv

# For script.py resources monitoring:
python script.py &                   # launch the workload
target=$!                            # PID of script.py
echo $target
# psutil-monitor watches that PID and exits when the process tree is gone
psutil-monitor --mode proc --pid "$target" --include-children --csv stat.csv --plot plot.png

Expected output:

$ psutil-monitor

CPUs available (affinity): 384
Total memory available: 754.76 GiB
CPU basis for %: 384
Memory basis for %: 754.76 GiB
2025-08-14T15:20:14  CPU  79.67%  busyCPUs 305.93  (provided 384)  MEM   9.93%  used 74.96 GiB / total 754.76 GiB
2025-08-14T15:20:16  CPU  69.30%  busyCPUs 266.13  (provided 384)  MEM   9.95%  used 75.12 GiB / total 754.76 GiB
2025-08-14T15:20:18  CPU  61.34%  busyCPUs 235.53  (provided 384)  MEM  10.05%  used 75.82 GiB / total 754.76 GiB
2025-08-14T15:20:20  CPU  61.32%  busyCPUs 235.47  (provided 384)  MEM  10.09%  used 76.15 GiB / total 754.76 GiB
2025-08-14T15:20:22  CPU  74.57%  busyCPUs 286.33  (provided 384)  MEM   9.94%  used 74.99 GiB / total 754.76 GiB
2025-08-14T15:20:24  CPU  85.94%  busyCPUs 330.01  (provided 384)  MEM   9.86%  used 74.44 GiB / total 754.76 GiB
Average busy CPUs over run: 276.570
Peak memory (system): 76.15 GiB

With GPUs (--gpu and NVIDIA GPUs present):

$ psutil-monitor --gpu

CPUs available (affinity): 96
Total memory available: 503.70 GiB
CPU basis for %: 96
Memory basis for %: 503.70 GiB
GPUs detected (NVML): 4
2026-02-03T10:00:14  CPU  45.12%  busyCPUs 43.32  (provided 96)  MEM   8.10%  used 40.80 GiB / total 503.70 GiB  GPU util  57.5% busyGPUs 2.30 mem  42.0%
2026-02-03T10:00:16  CPU  48.33%  busyCPUs 46.39  (provided 96)  MEM   8.20%  used 41.30 GiB / total 503.70 GiB  GPU util  63.0% busyGPUs 2.52 mem  44.1%
2026-02-03T10:00:18  CPU  52.10%  busyCPUs 49.99  (provided 96)  MEM   8.25%  used 41.60 GiB / total 503.70 GiB  GPU util  68.7% busyGPUs 2.75 mem  45.3%
Average busy CPUs over run: 46.567
Average busy GPUs over run: 2.523
Peak memory (system): 41.60 GiB

Use the --help option of each command to see all available options.

slurm-bulk-user-stats

Summarize CPU and memory usage for Slurm jobs and show which nodes the jobs are allocated to. The command relies on sacct being available in your PATH. The table includes NGPUS based on AllocTres/AllocGRES when present. If TRES GPU usage metrics are available, the summary also reports mean GPU util and GPU hours (used/requested).

State codes (Slurm):

  • R/RUNNING, PD/PENDING, CD/COMPLETED; other states (e.g., F, CG, S, TO) are grouped under “other” in the summary and listed in the breakdown.

Examples:

# Summarize a specific job and write CSV output
slurm-bulk-user-stats --job 12345 --csv stats.csv

# Summarize all running jobs for the current user (default)
slurm-bulk-user-stats

# Summarize all jobs (including finished) for a specific user
slurm-bulk-user-stats --user myuser --include-finished

# Include finished jobs but only keep 20 finished entries
slurm-bulk-user-stats --user myuser --include-finished --finished-limit 20

# Faster best-effort mode: fetch active jobs + only N finished job details
slurm-bulk-user-stats --user myuser --include-finished --finished-limit 20 --finished-limit-strategy fetch

When invoked with no --user or --job options:

  • On a login node (no $SLURM_JOB_ID present), it summarizes pending/running jobs for the current user.
  • Inside a running Slurm job (where $SLURM_JOB_ID is set), it automatically summarizes that specific job.
slurm-bulk-user-stats

The output mirrors the PBS version, showing job state, node list, CPU/memory usage, efficiency metrics, and a summary block with job counts and averages.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hpc_stats_scripts-1.3.tar.gz (27.9 kB view details)

Uploaded Source

File details

Details for the file hpc_stats_scripts-1.3.tar.gz.

File metadata

  • Download URL: hpc_stats_scripts-1.3.tar.gz
  • Upload date:
  • Size: 27.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for hpc_stats_scripts-1.3.tar.gz
Algorithm Hash digest
SHA256 da17686ed80bd67c2f09a040f8889d274fab0d69b4ca41c23acd35a872759f51
MD5 d11e6205b945c40e111f6ca129997074
BLAKE2b-256 09c6b4abd2e21b6d31c96680d196afd20725e7ace605661eb472ee3094bbedc8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page