Command to list the current cluster usage per user.
Project description
SLURM Usage Monitor
A high-performance monitoring system that collects and analyzes SLURM job efficiency metrics, optimized for large-scale HPC environments.
Purpose
SLURM's accounting database purges detailed job metrics (CPU usage, memory usage) after 30 days. This tool captures and preserves that data in efficient Parquet format for long-term analysis of resource utilization patterns.
Key Features
- Captures comprehensive efficiency metrics from all job states
- Efficient Parquet storage - columnar format optimized for analytics
- Smart incremental processing - tracks completed dates to minimize re-processing
- Rich visualizations - bar charts for resource usage, efficiency, and node utilization
- Group-based analytics - track usage by research groups/teams
- Node utilization tracking - analyze per-node CPU and GPU usage
- Parallel collection - multi-threaded data collection by default
- Cron-ready - designed for automated daily collection
- Intelligent re-collection - only re-fetches incomplete job states
What It Collects
For each job:
- Job metadata: ID, user, name, partition, state, node list
- Time info: submit, start, end times, elapsed duration
- Allocated resources: CPUs, memory, GPUs, nodes
- Actual usage: CPU seconds used (TotalCPU), peak memory (MaxRSS)
- Calculated metrics:
- CPU efficiency % (actual CPU time / allocated CPU time)
- Memory efficiency % (peak memory / allocated memory)
- CPU hours wasted
- Memory GB-hours wasted
- Total reserved resources (CPU/GPU/memory hours)
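For concreteness, the derived metrics above can be computed from the raw sacct quantities roughly as follows; this is an illustrative sketch only, and the function and argument names are hypothetical rather than the tool's internal API:
# Illustrative sketch of the derived metrics; names are hypothetical, not the tool's API.
def cpu_efficiency(total_cpu_seconds: float, alloc_cpus: int, elapsed_seconds: int) -> float:
    """CPU efficiency % = actual CPU time / allocated CPU time."""
    allocated = alloc_cpus * elapsed_seconds
    return 100.0 * total_cpu_seconds / allocated if allocated else 0.0

def memory_efficiency(max_rss_mb: float, req_mem_mb: float) -> float:
    """Memory efficiency % = peak memory (MaxRSS) / requested memory."""
    return 100.0 * max_rss_mb / req_mem_mb if req_mem_mb else 0.0

def cpu_hours_wasted(total_cpu_seconds: float, alloc_cpus: int, elapsed_seconds: int) -> float:
    """Reserved CPU hours minus CPU hours actually used (never negative)."""
    return max(alloc_cpus * elapsed_seconds - total_cpu_seconds, 0.0) / 3600.0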
Requirements
- uv - Python package and project manager (will auto-install dependencies)
- SLURM with accounting enabled
- sacct command access
That's it! The script uses uv inline script dependencies, so all Python packages are automatically installed when you run the script.
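For reference, uv reads a script's dependencies from a PEP 723 inline-metadata block at the top of the file. The shape is shown below; the exact Python version and package list here are illustrative, not copied from slurm_usage.py:
# /// script
# requires-python = ">=3.11"   # illustrative constraint
# dependencies = [
#     "polars",                # example entries; see slurm_usage.py for the real list
#     "typer",
#     "rich",
# ]
# ///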
Installation
Quick Start (no installation needed)
# Run directly with uvx (uv tool run)
uvx slurm-usage --help
# Or for a specific command
uvx slurm-usage collect --days 7
Install as a Tool
# Install globally with uv
uv tool install slurm-usage
# Or with pip
pip install slurm-usage
# Then use directly
slurm-usage --help
Run from Source
# Clone the repository
git clone https://github.com/basnijholt/slurm-usage
cd slurm-usage
# Run the script directly (dependencies auto-installed by uv)
./slurm_usage.py --help
# Or with Python
python slurm_usage.py --help
Usage
CLI Commands
The following commands are available:
Usage: slurm_usage.py [OPTIONS] COMMAND [ARGS]...
SLURM Job Monitor - Collect and analyze job efficiency metrics
╭─ Options ──────────────────────────────────────────────────────────────────╮
│ --install-completion  Install completion for the current shell.            │
│ --show-completion     Show completion for the current shell, to copy       │
│                       it or customize the installation.                    │
│ --help                Show this message and exit.                          │
╰─────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────╮
│ collect   Collect job data from SLURM using parallel date-based queries.   │
│ analyze   Analyze collected job data.                                      │
│ status    Show monitoring system status.                                   │
│ current   Display current cluster usage statistics from squeue.            │
│ nodes     Display node information from SLURM.                             │
│ test      Run a quick test of the system.                                  │
╰─────────────────────────────────────────────────────────────────────────────╯
Example Commands
# Collect data (uses 4 parallel workers by default)
slurm-usage collect
# Collect last 7 days of data
slurm-usage collect --days 7
# Collect with more parallel workers
slurm-usage collect --n-parallel 8
# Analyze collected data
slurm-usage analyze --days 7
# Display current cluster usage
slurm-usage current
# Display node information
slurm-usage nodes
# Check system status
slurm-usage status
# Test system configuration
slurm-usage test
Note: If running from source, use ./slurm_usage.py instead of slurm-usage.
Command Options
collect - Gather job data from SLURM
- --days/-d: Days to look back (default: 1)
- --data-dir: Data directory location (default: ./data)
- --summary/--no-summary: Show analysis after collection (default: True)
- --n-parallel/-n: Number of parallel workers (default: 4)
analyze - Analyze collected data
- --days/-d: Days to analyze (default: 7)
- --data-dir: Data directory location
status - Show system status
- --data-dir: Data directory location
current - Display current cluster usage
Shows real-time cluster utilization from squeue, broken down by user and partition.
nodes - Display node information
Shows information about cluster nodes including CPU and GPU counts.
test - Test system configuration
Output Structure
Data Organization
data/
├── raw/                              # Raw SLURM data (archived)
│   ├── 2025-08-19.parquet            # Daily raw records
│   ├── 2025-08-20.parquet
│   └── ...
├── processed/                        # Processed job metrics
│   ├── 2025-08-19.parquet            # Daily processed data
│   ├── 2025-08-20.parquet
│   └── ...
└── .date_completion_tracker.json     # Tracks fully processed dates
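Because each day is a separate Parquet file under this layout, any subset can be loaded directly; for example, a Polars glob pattern reads every processed day at once (a small sketch, assuming the default ./data location):
import polars as pl

# Glob over the daily processed files shown in the tree above
df = pl.read_parquet("data/processed/*.parquet")
print(df.select("user", "cpu_efficiency").head())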
Sample Analysis Output
━━━ Resource Usage by User ━━━
┌───────────┬──────┬───────────┬───────────────┬───────────┬─────────┬─────────┐
│ User      │ Jobs │ CPU Hours │ Memory GB-hrs │ GPU Hours │ CPU Eff │ Mem Eff │
├───────────┼──────┼───────────┼───────────────┼───────────┼─────────┼─────────┤
│ alice     │ 124  │ 12,847    │ 48,291        │ 1,024     │ 45.2%   │ 23.7%   │
│ bob       │ 87   │ 8,234     │ 31,456        │ 512       │ 38.1%   │ 18.4%   │
└───────────┴──────┴───────────┴───────────────┴───────────┴─────────┴─────────┘
━━━ Node Usage Analysis ━━━
┌───────────┬──────┬───────────┬───────────┬───────────┐
│ Node      │ Jobs │ CPU Hours │ GPU Hours │ CPU Util% │
├───────────┼──────┼───────────┼───────────┼───────────┤
│ cluster-1 │ 234  │ 45,678    │ 2,048     │ 74.3%     │
│ cluster-2 │ 198  │ 41,234    │ 1,536     │ 67.1%     │
└───────────┴──────┴───────────┴───────────┴───────────┘
Smart Re-collection
The monitor intelligently handles job state transitions:
- Complete dates: Once all jobs for a date reach final states (COMPLETED, FAILED, CANCELLED, etc.), the date is marked complete and won't be re-queried
- Incomplete jobs: Jobs in states like RUNNING, PENDING, or SUSPENDED are automatically re-collected on subsequent runs
- Efficient updates: Only changed jobs are updated, minimizing processing time
Tracked Incomplete States
The following job states indicate a job may change and will trigger re-collection:
- Active: RUNNING, PENDING, SUSPENDED
- Transitional: COMPLETING, CONFIGURING, STAGE_OUT, SIGNALING
- Requeue: REQUEUED, REQUEUE_FED, REQUEUE_HOLD
- Other: RESIZING, REVOKED, SPECIAL_EXIT
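Conceptually, a date can be marked complete once none of its jobs remain in one of these states. A minimal sketch of that check (the state set simply mirrors the list above; the function name is hypothetical):
INCOMPLETE_STATES = {
    "RUNNING", "PENDING", "SUSPENDED",                       # active
    "COMPLETING", "CONFIGURING", "STAGE_OUT", "SIGNALING",   # transitional
    "REQUEUED", "REQUEUE_FED", "REQUEUE_HOLD",               # requeue
    "RESIZING", "REVOKED", "SPECIAL_EXIT",                   # other
}

def date_is_complete(job_states: list[str]) -> bool:
    """True once every job recorded for a date has reached a final state."""
    return not any(state in INCOMPLETE_STATES for state in job_states)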
Group Configuration
Create a configuration file to define your organization's research groups and optionally specify the data directory. The configuration file is searched in the following locations:
1. $XDG_CONFIG_HOME/slurm-usage/config.yaml
2. ~/.config/slurm-usage/config.yaml
3. /etc/slurm-usage/config.yaml
Data Directory
The data directory for storing collected metrics can be configured in three ways (in order of priority):
1. Command line: Use --data-dir /path/to/data with any command (highest priority)
2. Configuration file: Set data_dir: /path/to/data in the config file
3. Default: If not specified, data is stored in ./data (current working directory)
This allows flexible deployment:
- Default installation: Data is stored in the ./data subdirectory
- System-wide deployment: Set data_dir: /var/lib/slurm-usage in /etc/slurm-usage/config.yaml
- Shared installations: Use a network storage path in the config
- Per-run override: Use the --data-dir flag to override for specific commands
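A sketch of how that priority order can be resolved in code (illustrative only; the actual resolution happens inside slurm_usage.py):
from pathlib import Path

def resolve_data_dir(cli_value: str | None, config: dict) -> Path:
    """--data-dir flag wins, then data_dir from config.yaml, then ./data."""
    if cli_value:                  # 1. command-line flag (highest priority)
        return Path(cli_value)
    if config.get("data_dir"):     # 2. data_dir from the config file
        return Path(config["data_dir"])
    return Path("./data")          # 3. default: current working directory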
Example config.yaml:
# Example configuration file for slurm-usage
# Copy this file to one of the following locations:
# - $XDG_CONFIG_HOME/slurm-usage/config.yaml
# - ~/.config/slurm-usage/config.yaml
# - /etc/slurm-usage/config.yaml (for system-wide configuration)
# Group configuration - organize users into research groups
groups:
  physics:
    - alice
    - bob
    - charlie
  chemistry:
    - david
    - eve
    - frank
  biology:
    - grace
    - henry
    - irene
# Data directory configuration (optional)
# - If not specified or set to null, defaults to ./data (current working directory)
# - Set to an explicit path to use a custom location
# - Useful for shared installations where data should be stored centrally
#
# Examples:
# data_dir: null # Use default ./data directory
# data_dir: /var/lib/slurm-usage # System-wide data directory
# data_dir: /shared/slurm-data # Shared network location
Automated Collection
Using Cron
# Add to crontab (runs daily at 2 AM)
crontab -e
# If installed with uv tool or pip:
0 2 * * * /path/to/slurm-usage collect --days 2
# Or if running from source:
0 2 * * * /path/to/slurm-usage/slurm_usage.py collect --days 2
Data Schema
ProcessedJob Model
| Field | Type | Description |
|---|---|---|
| job_id | str | SLURM job ID |
| user | str | Username |
| job_name | str | Job name (max 50 chars) |
| partition | str | SLURM partition |
| state | str | Final job state |
| submit_time | datetime.datetime \| None | Job submission time |
| start_time | datetime.datetime \| None | Job start time |
| end_time | datetime.datetime \| None | Job end time |
| node_list | str | Nodes where job ran |
| elapsed_seconds | int | Runtime in seconds |
| alloc_cpus | int | CPUs allocated |
| req_mem_mb | float | Memory requested (MB) |
| max_rss_mb | float | Peak memory used (MB) |
| total_cpu_seconds | float | Actual CPU time used |
| alloc_gpus | int | GPUs allocated |
| cpu_efficiency | float | CPU efficiency % (0-100) |
| memory_efficiency | float | Memory efficiency % (0-100) |
| cpu_hours_wasted | float | Wasted CPU hours |
| memory_gb_hours_wasted | float | Wasted memory GB-hours |
| cpu_hours_reserved | float | Total CPU hours reserved |
| memory_gb_hours_reserved | float | Total memory GB-hours reserved |
| gpu_hours_reserved | float | Total GPU hours reserved |
| is_complete | bool | Whether job has reached final state |
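To verify that a collected file matches this schema, you can inspect it with Polars (the date in the path is just an example):
import polars as pl

df = pl.read_parquet("data/processed/2025-08-19.parquet")
print(df.schema)        # column name -> Polars dtype for every ProcessedJob field
print(df.null_count())  # quick sanity check for missing values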
Performance Optimizations
- Date completion tracking: Dates with only finished jobs are marked complete and skipped
- Parallel collection: Default 4 workers fetch different dates simultaneously
- Smart merging: Only updates changed jobs when re-collecting
- Efficient storage: Parquet format provides ~10x compression over CSV
- Date-based partitioning: Data organized by date for efficient queries
Important Notes
- 30-day window: SLURM purges detailed metrics after 30 days. Run collection at least weekly to ensure no data is lost.
- Batch steps: Actual usage metrics (TotalCPU, MaxRSS) are stored in the .batch step, not the parent job record.
- State normalization: All CANCELLED variants are normalized to "CANCELLED" for consistency.
- GPU tracking: GPU allocation is extracted from the AllocTRES field.
- Raw data archival: Raw SLURM records are preserved in case reprocessing is needed.
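To illustrate the GPU-tracking note: AllocTRES is a comma-separated string such as cpu=8,mem=32G,node=1,gres/gpu=2, so the GPU count can be pulled out with a small parser like the sketch below (an illustration, not the tool's exact implementation):
import re

def gpus_from_alloc_tres(alloc_tres: str) -> int:
    """Extract the GPU count from an AllocTRES string, e.g. 'cpu=8,mem=32G,node=1,gres/gpu=2' -> 2."""
    match = re.search(r"gres/gpu(?::[^=,]+)?=(\d+)", alloc_tres)
    return int(match.group(1)) if match else 0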
Post-Processing with Polars
You can use Polars to analyze the collected data. Here's an example:
from datetime import datetime, timedelta
from pathlib import Path
import polars as pl

# Load processed data for last 7 days
dfs = []
for i in range(7):
    date = (datetime.now() - timedelta(days=i)).strftime("%Y-%m-%d")
    file = Path(f"data/processed/{date}.parquet")
    if file.exists():
        dfs.append(pl.read_parquet(file))

if dfs:
    df = pl.concat(dfs)

    # Find users with worst CPU efficiency
    worst_users = (
        df.filter(pl.col("state") == "COMPLETED")
        .group_by("user")
        .agg(pl.col("cpu_efficiency").mean())
        .sort("cpu_efficiency")
        .head(5)
    )
    print("## Users with Worst CPU Efficiency")
    print(worst_users)

    # Find most wasted resources by partition
    waste_by_partition = (
        df.group_by("partition")
        .agg(pl.col("cpu_hours_wasted").sum())
        .sort("cpu_hours_wasted", descending=True)
    )
    print("\n## CPU Hours Wasted by Partition")
    print(waste_by_partition)
else:
    print("No data files found. Run `./slurm_usage.py collect` first.")
Troubleshooting
No efficiency data?
- Check if SLURM accounting is configured: scontrol show config | grep JobAcct
- Verify jobs have .batch steps: sacct -j JOBID
Collection is slow?
- Increase parallel workers: slurm-usage collect --n-parallel 8
- The first run processes historical data and will be slower
Missing user groups?
- Create or update the configuration file in ~/.config/slurm-usage/config.yaml
- Ungrouped users will appear as "ungrouped" in group statistics
Script won't run?
- Ensure uv is installed: curl -LsSf https://astral.sh/uv/install.sh | sh
- Check SLURM access: slurm-usage test (or ./slurm_usage.py test if running from source)
License
MIT
File details
Details for the file slurm_usage-3.1.0.tar.gz.
File metadata
- Download URL: slurm_usage-3.1.0.tar.gz
- Upload date:
- Size: 95.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | daca8bee664ebea2314adb437699e8b0b2af7065ee13fe73cb8ae6fd4e059043 |
| MD5 | ea7d0100cefffca11781254d3ea100ec |
| BLAKE2b-256 | bb54bebf6429e8858c32a8fc1fe7963db81e02277eb7cf01848ffb1cf02dfe70 |
Provenance
The following attestation bundles were made for slurm_usage-3.1.0.tar.gz:
Publisher: pythonpublish.yml on basnijholt/slurm-usage
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurm_usage-3.1.0.tar.gz
- Subject digest: daca8bee664ebea2314adb437699e8b0b2af7065ee13fe73cb8ae6fd4e059043
- Sigstore transparency entry: 548034997
- Sigstore integration time:
- Permalink: basnijholt/slurm-usage@f5a284f7c519adc65d1116815fa78dfef1ed0394
- Branch / Tag: refs/tags/v3.1.0
- Owner: https://github.com/basnijholt
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pythonpublish.yml@f5a284f7c519adc65d1116815fa78dfef1ed0394
- Trigger Event: release
File details
Details for the file slurm_usage-3.1.0-py3-none-any.whl.
File metadata
- Download URL: slurm_usage-3.1.0-py3-none-any.whl
- Upload date:
- Size: 32.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 701605307d3b31fa533cf6f91539918fb9affdaab7d65fa7c2637f0b716987ea |
| MD5 | 4083fdf221e9b51739568e3bda7abca2 |
| BLAKE2b-256 | 9090415a5d623ec7e9f0a33ff4a21455d50b4f409ce5cf48f6c5a3989013913d |
Provenance
The following attestation bundles were made for slurm_usage-3.1.0-py3-none-any.whl:
Publisher: pythonpublish.yml on basnijholt/slurm-usage
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurm_usage-3.1.0-py3-none-any.whl
- Subject digest: 701605307d3b31fa533cf6f91539918fb9affdaab7d65fa7c2637f0b716987ea
- Sigstore transparency entry: 548035029
- Sigstore integration time:
- Permalink: basnijholt/slurm-usage@f5a284f7c519adc65d1116815fa78dfef1ed0394
- Branch / Tag: refs/tags/v3.1.0
- Owner: https://github.com/basnijholt
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pythonpublish.yml@f5a284f7c519adc65d1116815fa78dfef1ed0394
- Trigger Event: release