Command to list the current cluster usage per user.
Project description
SLURM Usage Monitor
A high-performance monitoring system that collects and analyzes SLURM job efficiency metrics, optimized for large-scale HPC environments.
Purpose
SLURM's accounting database purges detailed job metrics (CPU usage, memory usage) after 30 days. This tool captures and preserves that data in efficient Parquet format for long-term analysis of resource utilization patterns.
Key Features
- Captures comprehensive efficiency metrics from all job states
- Efficient Parquet storage - columnar format optimized for analytics
- Smart incremental processing - tracks completed dates to minimize re-processing
- Rich visualizations - bar charts for resource usage, efficiency, and node utilization
- Group-based analytics - track usage by research groups/teams
- Node utilization tracking - analyze per-node CPU and GPU usage
- Parallel collection - multi-threaded data collection by default
- Cron-ready - designed for automated daily collection
- Intelligent re-collection - only re-fetches incomplete job states
What It Collects
For each job:
- Job metadata: ID, user, name, partition, state, node list
- Time info: submit, start, end times, elapsed duration
- Allocated resources: CPUs, memory, GPUs, nodes
- Actual usage: CPU seconds used (TotalCPU), peak memory (MaxRSS)
- Calculated metrics:
- CPU efficiency % (actual CPU time / allocated CPU time)
- Memory efficiency % (peak memory / allocated memory)
- CPU hours wasted
- Memory GB-hours wasted
- Total reserved resources (CPU/GPU/memory hours)
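For concreteness, the derived metrics above can be computed from the raw sacct quantities roughly as follows; this is an illustrative sketch only, and the function and argument names are hypothetical rather than the tool's internal API:
# Illustrative sketch of the derived metrics; names are hypothetical, not the tool's API.
def cpu_efficiency(total_cpu_seconds: float, alloc_cpus: int, elapsed_seconds: int) -> float:
    """CPU efficiency % = actual CPU time / allocated CPU time."""
    allocated = alloc_cpus * elapsed_seconds
    return 100.0 * total_cpu_seconds / allocated if allocated else 0.0

def memory_efficiency(max_rss_mb: float, req_mem_mb: float) -> float:
    """Memory efficiency % = peak memory (MaxRSS) / requested memory."""
    return 100.0 * max_rss_mb / req_mem_mb if req_mem_mb else 0.0

def cpu_hours_wasted(total_cpu_seconds: float, alloc_cpus: int, elapsed_seconds: int) -> float:
    """Reserved CPU hours minus CPU hours actually used (never negative)."""
    return max(alloc_cpus * elapsed_seconds - total_cpu_seconds, 0.0) / 3600.0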
Requirements
- uv - Python package and project manager (will auto-install dependencies)
- SLURM with accounting enabled
- sacct command access
That's it! The script uses uv inline script dependencies, so all Python packages are automatically installed when you run the script.
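For reference, uv reads a script's dependencies from a PEP 723 inline-metadata block at the top of the file. The shape is shown below; the exact Python version and package list here are illustrative, not copied from slurm_usage.py:
# /// script
# requires-python = ">=3.11"   # illustrative constraint
# dependencies = [
#     "polars",                # example entries; see slurm_usage.py for the real list
#     "typer",
#     "rich",
# ]
# ///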
Installation
Quick Start (no installation needed)
# Run directly with uvx (uv tool run)
uvx slurm-usage --help
# Or for a specific command
uvx slurm-usage collect --days 7
Install as a Tool
# Install globally with uv
uv tool install slurm-usage
# Or with pip
pip install slurm-usage
# Then use directly
slurm-usage --help
Run from Source
# Clone the repository
git clone https://github.com/basnijholt/slurm-usage
cd slurm-usage
# Run the script directly (dependencies auto-installed by uv)
./slurm_usage.py --help
# Or with Python
python slurm_usage.py --help
Usage
CLI Commands
The following commands are available:
Usage: slurm_usage.py [OPTIONS] COMMAND [ARGS]...
SLURM Job Monitor - Collect and analyze job efficiency metrics
╭─ Options ──────────────────────────────────────────────────────────────────╮
│ --install-completion  Install completion for the current shell.            │
│ --show-completion     Show completion for the current shell, to copy       │
│                       it or customize the installation.                    │
│ --help                Show this message and exit.                          │
╰─────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────╮
│ collect   Collect job data from SLURM using parallel date-based queries.   │
│ analyze   Analyze collected job data.                                      │
│ status    Show monitoring system status.                                   │
│ current   Display current cluster usage statistics from squeue.            │
│ nodes     Display node information from SLURM.                             │
│ test      Run a quick test of the system.                                  │
╰─────────────────────────────────────────────────────────────────────────────╯
Example Commands
# Collect data (uses 4 parallel workers by default)
slurm-usage collect
# Collect last 7 days of data
slurm-usage collect --days 7
# Collect with more parallel workers
slurm-usage collect --n-parallel 8
# Analyze collected data
slurm-usage analyze --days 7
# Display current cluster usage
slurm-usage current
# Display node information
slurm-usage nodes
# Check system status
slurm-usage status
# Test system configuration
slurm-usage test
Note: If running from source, use ./slurm_usage.py instead of slurm-usage.
Command Options
collect - Gather job data from SLURM
- --days/-d: Days to look back (default: 1)
- --data-dir: Data directory location (default: ./data)
- --summary/--no-summary: Show analysis after collection (default: True)
- --n-parallel/-n: Number of parallel workers (default: 4)
analyze - Analyze collected data
- --days/-d: Days to analyze (default: 7)
- --data-dir: Data directory location
status - Show system status
- --data-dir: Data directory location
current - Display current cluster usage
Shows real-time cluster utilization from squeue, broken down by user and partition.
nodes - Display node information
Shows information about cluster nodes including CPU and GPU counts.
test - Test system configuration
Output Structure
Data Organization
data/
├── raw/                              # Raw SLURM data (archived)
│   ├── 2025-08-19.parquet            # Daily raw records
│   ├── 2025-08-20.parquet
│   └── ...
├── processed/                        # Processed job metrics
│   ├── 2025-08-19.parquet            # Daily processed data
│   ├── 2025-08-20.parquet
│   └── ...
└── .date_completion_tracker.json     # Tracks fully processed dates
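Because each day is a separate Parquet file under this layout, any subset can be loaded directly; for example, a Polars glob pattern reads every processed day at once (a small sketch, assuming the default ./data location):
import polars as pl

# Glob over the daily processed files shown in the tree above
df = pl.read_parquet("data/processed/*.parquet")
print(df.select("user", "cpu_efficiency").head())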
Sample Analysis Output
━━━ Resource Usage by User ━━━
┌───────────┬──────┬───────────┬───────────────┬───────────┬─────────┬─────────┐
│ User      │ Jobs │ CPU Hours │ Memory GB-hrs │ GPU Hours │ CPU Eff │ Mem Eff │
├───────────┼──────┼───────────┼───────────────┼───────────┼─────────┼─────────┤
│ alice     │ 124  │ 12,847    │ 48,291        │ 1,024     │ 45.2%   │ 23.7%   │
│ bob       │ 87   │ 8,234     │ 31,456        │ 512       │ 38.1%   │ 18.4%   │
└───────────┴──────┴───────────┴───────────────┴───────────┴─────────┴─────────┘
━━━ Node Usage Analysis ━━━
┌───────────┬──────┬───────────┬───────────┬───────────┐
│ Node      │ Jobs │ CPU Hours │ GPU Hours │ CPU Util% │
├───────────┼──────┼───────────┼───────────┼───────────┤
│ cluster-1 │ 234  │ 45,678    │ 2,048     │ 74.3%     │
│ cluster-2 │ 198  │ 41,234    │ 1,536     │ 67.1%     │
└───────────┴──────┴───────────┴───────────┴───────────┘
Smart Re-collection
The monitor intelligently handles job state transitions:
- Complete dates: Once all jobs for a date reach final states (COMPLETED, FAILED, CANCELLED, etc.), the date is marked complete and won't be re-queried
- Incomplete jobs: Jobs in states like RUNNING, PENDING, or SUSPENDED are automatically re-collected on subsequent runs
- Efficient updates: Only changed jobs are updated, minimizing processing time
Tracked Incomplete States
The following job states indicate a job may change and will trigger re-collection:
- Active: RUNNING, PENDING, SUSPENDED
- Transitional: COMPLETING, CONFIGURING, STAGE_OUT, SIGNALING
- Requeue: REQUEUED, REQUEUE_FED, REQUEUE_HOLD
- Other: RESIZING, REVOKED, SPECIAL_EXIT
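Conceptually, a date can be marked complete once none of its jobs remain in one of these states. A minimal sketch of that check (the state set simply mirrors the list above; the function name is hypothetical):
INCOMPLETE_STATES = {
    "RUNNING", "PENDING", "SUSPENDED",                       # active
    "COMPLETING", "CONFIGURING", "STAGE_OUT", "SIGNALING",   # transitional
    "REQUEUED", "REQUEUE_FED", "REQUEUE_HOLD",               # requeue
    "RESIZING", "REVOKED", "SPECIAL_EXIT",                   # other
}

def date_is_complete(job_states: list[str]) -> bool:
    """True once every job recorded for a date has reached a final state."""
    return not any(state in INCOMPLETE_STATES for state in job_states)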
Group Configuration
Create a configuration file to define your organization's research groups and optionally specify the data directory. The configuration file is searched in the following locations:
1. $XDG_CONFIG_HOME/slurm-usage/config.yaml
2. ~/.config/slurm-usage/config.yaml
3. /etc/slurm-usage/config.yaml
Data Directory
The data directory for storing collected metrics can be configured in three ways (in order of priority):
1. Command line: Use --data-dir /path/to/data with any command (highest priority)
2. Configuration file: Set data_dir: /path/to/data in the config file
3. Default: If not specified, data is stored in ./data (current working directory)
This allows flexible deployment:
- Default installation: Data is stored in the ./data subdirectory
- System-wide deployment: Set data_dir: /var/lib/slurm-usage in /etc/slurm-usage/config.yaml
- Shared installations: Use a network storage path in the config
- Per-run override: Use the --data-dir flag to override for specific commands
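A sketch of how that priority order can be resolved in code (illustrative only; the actual resolution happens inside slurm_usage.py):
from pathlib import Path

def resolve_data_dir(cli_value: str | None, config: dict) -> Path:
    """--data-dir flag wins, then data_dir from config.yaml, then ./data."""
    if cli_value:                  # 1. command-line flag (highest priority)
        return Path(cli_value)
    if config.get("data_dir"):     # 2. data_dir from the config file
        return Path(config["data_dir"])
    return Path("./data")          # 3. default: current working directory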
Example config.yaml:
# Example configuration file for slurm-usage
# Copy this file to one of the following locations:
# - $XDG_CONFIG_HOME/slurm-usage/config.yaml
# - ~/.config/slurm-usage/config.yaml
# - /etc/slurm-usage/config.yaml (for system-wide configuration)
# Group configuration - organize users into research groups
groups:
  physics:
    - alice
    - bob
    - charlie
  chemistry:
    - david
    - eve
    - frank
  biology:
    - grace
    - henry
    - irene
# Data directory configuration (optional)
# - If not specified or set to null, defaults to ./data (current working directory)
# - Set to an explicit path to use a custom location
# - Useful for shared installations where data should be stored centrally
#
# Examples:
# data_dir: null # Use default ./data directory
# data_dir: /var/lib/slurm-usage # System-wide data directory
# data_dir: /shared/slurm-data # Shared network location
Automated Collection
Using Cron
# Add to crontab (runs daily at 2 AM)
crontab -e
# If installed with uv tool or pip:
0 2 * * * /path/to/slurm-usage collect --days 2
# Or if running from source:
0 2 * * * /path/to/slurm-usage/slurm_usage.py collect --days 2
Data Schema
ProcessedJob Model
| Field | Type | Description |
|---|---|---|
| job_id | str | SLURM job ID |
| user | str | Username |
| job_name | str | Job name (max 50 chars) |
| partition | str | SLURM partition |
| state | str | Final job state |
| submit_time | datetime.datetime \| None | Job submission time |
| start_time | datetime.datetime \| None | Job start time |
| end_time | datetime.datetime \| None | Job end time |
| node_list | str | Nodes where job ran |
| elapsed_seconds | int | Runtime in seconds |
| alloc_cpus | int | CPUs allocated |
| req_mem_mb | float | Memory requested (MB) |
| max_rss_mb | float | Peak memory used (MB) |
| total_cpu_seconds | float | Actual CPU time used |
| alloc_gpus | int | GPUs allocated |
| cpu_efficiency | float | CPU efficiency % (0-100) |
| memory_efficiency | float | Memory efficiency % (0-100) |
| cpu_hours_wasted | float | Wasted CPU hours |
| memory_gb_hours_wasted | float | Wasted memory GB-hours |
| cpu_hours_reserved | float | Total CPU hours reserved |
| memory_gb_hours_reserved | float | Total memory GB-hours reserved |
| gpu_hours_reserved | float | Total GPU hours reserved |
| is_complete | bool | Whether job has reached final state |
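To verify that a collected file matches this schema, you can inspect it with Polars (the date in the path is just an example):
import polars as pl

df = pl.read_parquet("data/processed/2025-08-19.parquet")
print(df.schema)        # column name -> Polars dtype for every ProcessedJob field
print(df.null_count())  # quick sanity check for missing values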
Performance Optimizations
- Date completion tracking: Dates with only finished jobs are marked complete and skipped
- Parallel collection: Default 4 workers fetch different dates simultaneously
- Smart merging: Only updates changed jobs when re-collecting
- Efficient storage: Parquet format provides ~10x compression over CSV
- Date-based partitioning: Data organized by date for efficient queries
Important Notes
- 30-day window: SLURM purges detailed metrics after 30 days. Run collection at least weekly to ensure no data is lost.
- Batch steps: Actual usage metrics (TotalCPU, MaxRSS) are stored in the .batch step, not the parent job record.
- State normalization: All CANCELLED variants are normalized to "CANCELLED" for consistency.
- GPU tracking: GPU allocation is extracted from the AllocTRES field.
- Raw data archival: Raw SLURM records are preserved in case reprocessing is needed.
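To illustrate the GPU-tracking note: AllocTRES is a comma-separated string such as cpu=8,mem=32G,node=1,gres/gpu=2, so the GPU count can be pulled out with a small parser like the sketch below (an illustration, not the tool's exact implementation):
import re

def gpus_from_alloc_tres(alloc_tres: str) -> int:
    """Extract the GPU count from an AllocTRES string, e.g. 'cpu=8,mem=32G,node=1,gres/gpu=2' -> 2."""
    match = re.search(r"gres/gpu(?::[^=,]+)?=(\d+)", alloc_tres)
    return int(match.group(1)) if match else 0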
Post-Processing with Polars
You can use Polars to analyze the collected data. Here's an example:
from datetime import datetime, timedelta
from pathlib import Path
import polars as pl

# Load processed data for last 7 days
dfs = []
for i in range(7):
    date = (datetime.now() - timedelta(days=i)).strftime("%Y-%m-%d")
    file = Path(f"data/processed/{date}.parquet")
    if file.exists():
        dfs.append(pl.read_parquet(file))

if dfs:
    df = pl.concat(dfs)

    # Find users with worst CPU efficiency
    worst_users = (
        df.filter(pl.col("state") == "COMPLETED")
        .group_by("user")
        .agg(pl.col("cpu_efficiency").mean())
        .sort("cpu_efficiency")
        .head(5)
    )
    print("## Users with Worst CPU Efficiency")
    print(worst_users)

    # Find most wasted resources by partition
    waste_by_partition = (
        df.group_by("partition")
        .agg(pl.col("cpu_hours_wasted").sum())
        .sort("cpu_hours_wasted", descending=True)
    )
    print("\n## CPU Hours Wasted by Partition")
    print(waste_by_partition)
else:
    print("No data files found. Run `./slurm_usage.py collect` first.")
Troubleshooting
No efficiency data?
- Check if SLURM accounting is configured: scontrol show config | grep JobAcct
- Verify jobs have .batch steps: sacct -j JOBID
Collection is slow?
- Increase parallel workers: slurm-usage collect --n-parallel 8
- The first run processes historical data and will be slower
Missing user groups?
- Create or update the configuration file in ~/.config/slurm-usage/config.yaml
- Ungrouped users will appear as "ungrouped" in group statistics
Script won't run?
- Ensure uv is installed: curl -LsSf https://astral.sh/uv/install.sh | sh
- Check SLURM access: slurm-usage test (or ./slurm_usage.py test if running from source)
License
MIT
File details
Details for the file slurm_usage-3.1.0.tar.gz.
File metadata
- Download URL: slurm_usage-3.1.0.tar.gz
- Upload date:
- Size: 95.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | daca8bee664ebea2314adb437699e8b0b2af7065ee13fe73cb8ae6fd4e059043 |
| MD5 | ea7d0100cefffca11781254d3ea100ec |
| BLAKE2b-256 | bb54bebf6429e8858c32a8fc1fe7963db81e02277eb7cf01848ffb1cf02dfe70 |
Provenance
The following attestation bundles were made for slurm_usage-3.1.0.tar.gz:
Publisher: pythonpublish.yml on basnijholt/slurm-usage
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurm_usage-3.1.0.tar.gz
- Subject digest: daca8bee664ebea2314adb437699e8b0b2af7065ee13fe73cb8ae6fd4e059043
- Sigstore transparency entry: 548034997
- Sigstore integration time:
- Permalink: basnijholt/slurm-usage@f5a284f7c519adc65d1116815fa78dfef1ed0394
- Branch / Tag: refs/tags/v3.1.0
- Owner: https://github.com/basnijholt
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pythonpublish.yml@f5a284f7c519adc65d1116815fa78dfef1ed0394
- Trigger Event: release
File details
Details for the file slurm_usage-3.1.0-py3-none-any.whl.
File metadata
- Download URL: slurm_usage-3.1.0-py3-none-any.whl
- Upload date:
- Size: 32.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 701605307d3b31fa533cf6f91539918fb9affdaab7d65fa7c2637f0b716987ea |
| MD5 | 4083fdf221e9b51739568e3bda7abca2 |
| BLAKE2b-256 | 9090415a5d623ec7e9f0a33ff4a21455d50b4f409ce5cf48f6c5a3989013913d |
Provenance
The following attestation bundles were made for slurm_usage-3.1.0-py3-none-any.whl:
Publisher: pythonpublish.yml on basnijholt/slurm-usage
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurm_usage-3.1.0-py3-none-any.whl
- Subject digest: 701605307d3b31fa533cf6f91539918fb9affdaab7d65fa7c2637f0b716987ea
- Sigstore transparency entry: 548035029
- Sigstore integration time:
- Permalink: basnijholt/slurm-usage@f5a284f7c519adc65d1116815fa78dfef1ed0394
- Branch / Tag: refs/tags/v3.1.0
- Owner: https://github.com/basnijholt
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pythonpublish.yml@f5a284f7c519adc65d1116815fa78dfef1ed0394
- Trigger Event: release