Slurm GPU quota monitoring and management

slurmq

GPU quota management for Slurm clusters.

$ slurmq check

╭──────────────────── GPU Quota Report ────────────────────╮
│                                                          │
│   User:     dedalus                                      │
│   QoS:      medium                                       │
│   Cluster:  Stella HPC                                   │
│                                                          │
│   ████████████████████░░░░░░░░░░ 68.5%                   │
│                                                          │
│   Used:      342.5 GPU-hours                             │
│   Remaining: 157.5 GPU-hours                             │
│   Quota:     500 GPU-hours (rolling 30 days)             │
│                                                          │
╰──────────────────────────────────────────────────────────╯
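
The figures in the report are related by simple arithmetic. The sketch below reproduces them from the used and quota values; the function name, the 30-cell bar width, and rounding the filled cells down are assumptions read off the sample output, not slurmq's actual implementation.

```python
def quota_summary(used_hours: float, quota_hours: float, bar_width: int = 30) -> dict:
    """Derive percent used, remaining hours, and a text bar from usage and quota."""
    percent = 100 * used_hours / quota_hours
    remaining = max(quota_hours - used_hours, 0.0)
    # Assumption: filled cells are rounded down and capped at the bar width.
    filled = min(int(bar_width * used_hours / quota_hours), bar_width)
    bar = "█" * filled + "░" * (bar_width - filled)
    return {"percent": round(percent, 1), "remaining": remaining, "bar": bar}


summary = quota_summary(342.5, 500)
print(f"{summary['bar']} {summary['percent']}%")
print(f"Remaining: {summary['remaining']} GPU-hours")
```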

Install

uv tool install slurmq

Setup

slurmq config init       # interactive wizard
slurmq config show       # verify settings
slurmq config validate   # check syntax before deploy

Config resolution order:

  1. SLURMQ_CONFIG env var
  2. ~/.config/slurmq/config.toml (user)
  3. /etc/slurmq/config.toml (system-wide)

Example config.toml:

default_cluster = "stella"

[clusters.stella]
name = "Stella HPC"
account = "research"
qos = ["low", "medium"]
quota_limit = 500        # GPU-hours
rolling_window_days = 30
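
The config resolution order listed above can be sketched as a first-match lookup. This is a minimal illustration, assuming slurmq uses the first location that exists; the function names and the injectable exists check are hypothetical and only there to make the sketch testable.

```python
import os
from pathlib import Path


def candidate_config_paths(env=None, home=None):
    """List config locations in the documented lookup order."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    paths = []
    # 1. Explicit override through the SLURMQ_CONFIG environment variable.
    if env.get("SLURMQ_CONFIG"):
        paths.append(Path(env["SLURMQ_CONFIG"]))
    # 2. Per-user config, then 3. system-wide config.
    paths.append(home / ".config" / "slurmq" / "config.toml")
    paths.append(Path("/etc/slurmq/config.toml"))
    return paths


def resolve_config(env=None, home=None, exists=Path.is_file):
    """Return the first candidate that exists, or None (assumed behaviour)."""
    for path in candidate_config_paths(env, home):
        if exists(path):
            return path
    return None
```

Running slurmq config show remains the reliable way to confirm which file is actually in effect.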

Commands

check

slurmq check                  # current user
slurmq check --user alice     # specific user
slurmq check --cluster other  # different cluster
slurmq check --forecast       # usage projection
slurmq --json check           # machine-readable
slurmq --quiet check          # silent on success (for scripts)

efficiency

Analyze job resource efficiency (like seff).

slurmq efficiency 12345

Flags a job as low-efficiency when CPU efficiency is below 30% or memory efficiency is below 20%.
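
To make the thresholds concrete, here is a rough sketch using seff-style definitions: CPU efficiency as CPU time used over cores times elapsed time, and memory efficiency as peak memory over requested memory. Whether slurmq computes the ratios exactly this way is an assumption, and the function and parameter names are hypothetical.

```python
CPU_THRESHOLD = 0.30  # flag when CPU efficiency is below 30%
MEM_THRESHOLD = 0.20  # flag when memory efficiency is below 20%


def efficiency_flags(cpu_seconds_used, cores, elapsed_seconds, mem_used_gb, mem_requested_gb):
    """Return (cpu_eff, mem_eff, flags) using seff-style ratios (assumed definitions)."""
    cpu_eff = cpu_seconds_used / (cores * elapsed_seconds)
    mem_eff = mem_used_gb / mem_requested_gb
    flags = []
    if cpu_eff < CPU_THRESHOLD:
        flags.append("low CPU")
    if mem_eff < MEM_THRESHOLD:
        flags.append("low memory")
    return cpu_eff, mem_eff, flags


# A 4-core job that ran for one hour but used only one core-hour of CPU time
# and peaked at 2 GB out of 16 GB requested.
print(efficiency_flags(3600, 4, 3600, 2, 16))
```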

report

Generate usage reports (admin).

slurmq report                          # table view
slurmq report --format csv -o out.csv

monitor

Real-time monitoring with optional enforcement (admin).

slurmq monitor                # live dashboard, 30s refresh
slurmq monitor --interval 10
slurmq monitor --once         # single check, for cron
slurmq monitor --enforce      # cancel jobs over quota

stats

Cluster-wide analytics with month-over-month comparison.

slurmq stats                          # GPU utilization + wait times
slurmq stats --days 14                # custom period
slurmq stats --no-compare             # skip MoM comparison
slurmq stats -p gpu -p gpu-large      # specific partitions
slurmq stats --small-threshold 25     # custom job size threshold
slurmq --json stats                   # machine-readable

Shows:

  • GPU utilization by partition/QoS
  • Wait time analysis (median, % jobs waiting > 6h)
  • Small vs large job breakdown
  • Month-over-month trends
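
As an illustration of the wait-time figures, the sketch below computes a median and the share of jobs that waited longer than a threshold from a list of queue wait times in hours. The input shape and the function name are assumptions; slurmq's own data source and calculation may differ.

```python
from statistics import median


def wait_time_summary(wait_hours, threshold_hours=6.0):
    """Median wait and percentage of jobs that waited longer than the threshold."""
    if not wait_hours:
        return {"median_hours": None, "pct_over_threshold": 0.0}
    over = sum(1 for w in wait_hours if w > threshold_hours)
    return {
        "median_hours": median(wait_hours),
        "pct_over_threshold": 100 * over / len(wait_hours),
    }


print(wait_time_summary([0.5, 1.0, 2.0, 7.0, 12.0]))
```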

Enforcement

Cancel jobs automatically when users exceed quota.

[enforcement]
enabled = true
dry_run = true            # preview mode
grace_period_hours = 24   # warn before cancel
exempt_users = ["admin"]
exempt_job_prefixes = ["checkpoint_"]

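Run enforcement with slurmq monitor --enforce. While dry_run = true, slurmq runs in preview mode; set dry_run = false when you are ready to cancel jobs for real.

Grace period: users who exceed their quota get a warning window (grace_period_hours, 24 hours in the example above) before their jobs are cancelled.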
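
One way to picture how the [enforcement] settings interact is the decision sketch below. It is a hypothetical model, not slurmq's code: the order of the checks, the action names, and the hours_since_exceeded input are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class EnforcementConfig:
    """Mirrors the [enforcement] keys shown above."""
    enabled: bool = True
    dry_run: bool = True
    grace_period_hours: float = 24
    exempt_users: list = field(default_factory=lambda: ["admin"])
    exempt_job_prefixes: list = field(default_factory=lambda: ["checkpoint_"])


def decide(cfg, user, job_name, hours_since_exceeded):
    """Pick an action for one job; hours_since_exceeded is None while within quota."""
    if not cfg.enabled or hours_since_exceeded is None:
        return "ignore"
    if user in cfg.exempt_users:
        return "ignore"
    if any(job_name.startswith(prefix) for prefix in cfg.exempt_job_prefixes):
        return "ignore"
    if hours_since_exceeded < cfg.grace_period_hours:
        return "warn"  # still inside the grace period
    # Assumption: dry-run mode reports the cancellation instead of performing it.
    return "would-cancel" if cfg.dry_run else "cancel"


cfg = EnforcementConfig()
print(decide(cfg, "alice", "train_model", 30))
```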

Job States

Problematic states are highlighted:

State   Meaning
OOM     Out of Memory
TO      Timeout
NF      Node Failure
F       Failed
PR      Preempted

Scripting

# check quota status
if slurmq --json check | jq -e '.status == "exceeded"' > /dev/null; then
  echo "Quota exceeded"
fi

# cron: enforce every 5 minutes (quiet mode)
*/5 * * * * slurmq --quiet monitor --once --enforce >> /var/log/slurmq.log 2>&1
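
The quota check above can also be done from Python instead of jq. This sketch relies only on the status field and the "exceeded" value used in the shell example; the rest of the JSON layout is unknown here, and the helper names are hypothetical.

```python
import json
import subprocess


def quota_exceeded(report_json: str) -> bool:
    """True when the JSON report's status field equals "exceeded"."""
    report = json.loads(report_json)
    return report.get("status") == "exceeded"


def check_quota() -> bool:
    """Run `slurmq --json check` and interpret its output."""
    result = subprocess.run(
        ["slurmq", "--json", "check"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if slurmq exits non-zero
    )
    return quota_exceeded(result.stdout)
```

Calling check_quota() requires slurmq on PATH; quota_exceeded() can be exercised on its own with any JSON string.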

Documentation

Online: dedalus-labs.github.io/slurmq

For LLMs: llms.txt | llms-full.txt

Locally:

uv sync --extra docs
uv run mkdocs serve

Development

git clone https://github.com/dedalus-labs/slurmq.git && cd slurmq
uv sync --all-extras
uv run pytest
uv run ruff check
uv run ty check

License

MIT
