Quota monitoring and management for Slurm.

slurmq

GPU quota management for Slurm clusters.

$ slurmq check

╭──────────────────── GPU Quota Report ────────────────────╮
│                                                          │
│   User:     dedalus                                      │
│   QoS:      medium                                       │
│   Cluster:  Stella HPC                                   │
│                                                          │
│   ████████████████████░░░░░░░░░░ 68.5%                   │
│                                                          │
│   Used:      342.5 GPU-hours                             │
│   Remaining: 157.5 GPU-hours                             │
│   Quota:     500 GPU-hours (rolling 30 days)             │
│                                                          │
╰──────────────────────────────────────────────────────────╯
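The numbers in the report come from summing GPU-hours (GPUs × wall-clock hours) for jobs that ended inside the rolling window. A minimal sketch of that arithmetic, using hypothetical job records rather than slurmq's internals:

```python
from datetime import datetime, timedelta

# Hypothetical job records: (end_time, gpus, wall_clock_hours)
jobs = [
    (datetime(2024, 5, 1), 4, 10.0),     # inside the 30-day window
    (datetime(2024, 5, 20), 2, 151.25),  # inside the window
    (datetime(2024, 3, 1), 8, 100.0),    # too old, excluded
]

def gpu_hours_used(jobs, now, window_days=30):
    """Sum GPU-hours for jobs ending inside the rolling window."""
    cutoff = now - timedelta(days=window_days)
    return sum(gpus * hours for end, gpus, hours in jobs if end >= cutoff)

now = datetime(2024, 5, 25)
quota = 500.0
used = gpu_hours_used(jobs, now)
print(used, quota - used, 100 * used / quota)  # → 342.5 157.5 68.5
```

With these sample records the totals match the report above: 342.5 GPU-hours used, 157.5 remaining, 68.5% of the 500 GPU-hour quota.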

Install

uv tool install slurmq

Setup

slurmq config init       # interactive wizard
slurmq config show       # verify settings
slurmq config validate   # check syntax before deploy

Config resolution order:

  1. SLURMQ_CONFIG env var
  2. ~/.config/slurmq/config.toml (user)
  3. /etc/slurmq/config.toml (system-wide)
Example config:

default_cluster = "stella"

[clusters.stella]
name = "Stella HPC"
account = "research"
qos = ["low", "medium"]
quota_limit = 500        # GPU-hours
rolling_window_days = 30

Commands

check

slurmq check                  # current user
slurmq check --user alice     # specific user
slurmq check --cluster other  # different cluster
slurmq check --forecast       # usage projection
slurmq --json check           # machine-readable
slurmq --quiet check          # silent on success (for scripts)

efficiency

Analyze job resource efficiency (like seff).

slurmq efficiency 12345

Flags low efficiency: CPU < 30%, Memory < 20%.
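The thresholds above can be expressed as a simple classifier. This function is illustrative, not slurmq's code; only the 30% CPU and 20% memory cutoffs come from the text:

```python
def efficiency_flags(cpu_pct, mem_pct, cpu_min=30.0, mem_min=20.0):
    """Return which resources fall below the low-efficiency thresholds."""
    flags = []
    if cpu_pct < cpu_min:
        flags.append("low CPU efficiency")
    if mem_pct < mem_min:
        flags.append("low memory efficiency")
    return flags

print(efficiency_flags(12.0, 55.0))  # → ['low CPU efficiency']
print(efficiency_flags(80.0, 10.0))  # → ['low memory efficiency']
```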

report

Generate usage reports (admin).

slurmq report                          # table view
slurmq report --format csv -o out.csv

monitor

Real-time monitoring with optional enforcement (admin).

slurmq monitor                # live dashboard, 30s refresh
slurmq monitor --interval 10
slurmq monitor --once         # single check, for cron
slurmq monitor --enforce      # cancel jobs over quota

stats

Cluster-wide analytics with month-over-month comparison.

slurmq stats                          # GPU utilization + wait times
slurmq stats --days 14                # custom period
slurmq stats --no-compare             # skip MoM comparison
slurmq stats -p gpu -p gpu-large      # specific partitions
slurmq stats --small-threshold 25     # custom job size threshold
slurmq --json stats                   # machine-readable

Shows:

  • GPU utilization by partition/QoS
  • Wait time analysis (median, % jobs waiting > 6h)
  • Small vs large job breakdown
  • Month-over-month trends
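The wait-time metrics listed above (median wait, share of jobs waiting longer than 6 hours) reduce to simple statistics over per-job queue times. A sketch with hypothetical data:

```python
from statistics import median

# Hypothetical queue wait times, in hours, for recent jobs.
waits = [0.1, 0.5, 1.0, 2.0, 7.0, 9.5]

def wait_stats(waits, threshold_h=6.0):
    """Median wait and the percentage of jobs waiting longer than threshold."""
    over = sum(1 for w in waits if w > threshold_h)
    return median(waits), 100.0 * over / len(waits)

med, pct_over = wait_stats(waits)
print(med, round(pct_over, 1))  # → 1.5 33.3
```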

Enforcement

Cancel jobs automatically when users exceed quota.

[enforcement]
enabled = true
dry_run = true            # preview mode
grace_period_hours = 24   # warn before cancel
exempt_users = ["admin"]
exempt_job_prefixes = ["checkpoint_"]

Run with slurmq monitor --enforce. Disable dry_run when ready.

Grace period: users exceeding quota get a warning window before jobs are cancelled.
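The decision flow described by the config keys above (dry_run, grace_period_hours, exempt_users) can be sketched as a single function. This is illustrative logic under those names, not slurmq's implementation:

```python
def enforcement_action(used, quota, hours_over_quota,
                       grace_period_hours=24, dry_run=True, exempt=False):
    """Decide what enforcement should do for one user (illustrative)."""
    if exempt or used <= quota:
        return "ok"
    if hours_over_quota < grace_period_hours:
        return "warn"  # still inside the grace window
    # Past the grace window: cancel, or just report in dry-run mode.
    return "would-cancel" if dry_run else "cancel"

print(enforcement_action(520, 500, hours_over_quota=3))    # → warn
print(enforcement_action(520, 500, hours_over_quota=30))   # → would-cancel
print(enforcement_action(520, 500, 30, dry_run=False))     # → cancel
```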

Job States

Problematic states are highlighted:

State  Meaning
-----  -------------
OOM    Out of Memory
TO     Timeout
NF     Node Failure
F      Failed
PR     Preempted

Scripting

# check quota status
if slurmq --json check | jq -e '.status == "exceeded"' > /dev/null; then
  echo "Quota exceeded"
fi

# cron: enforce every 5 minutes (quiet mode)
*/5 * * * * slurmq --quiet monitor --once --enforce >> /var/log/slurmq.log 2>&1
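The same check works from Python by parsing the --json output. The field name below (status) is an assumption about the payload shape, mirroring the jq example above; verify it against your installed version:

```python
import json

# Sample payload shaped like the assumed `slurmq --json check` output.
sample = '{"status": "exceeded", "used": 512.0, "quota": 500}'

def quota_exceeded(payload):
    """True if the (assumed) status field reports an exceeded quota."""
    return json.loads(payload).get("status") == "exceeded"

# In practice you would feed in the output of
# `subprocess.run(["slurmq", "--json", "check"], ...)` instead of a literal.
print(quota_exceeded(sample))  # → True
```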

Documentation

Online: dedalus-labs.github.io/slurmq

For LLMs: llms.txt | llms-full.txt

Locally:

uv sync --extra docs
uv run mkdocs serve

Development

git clone https://github.com/dedalus-labs/slurmq.git && cd slurmq
uv sync --all-extras
uv run pytest
uv run ruff check
uv run ty check

License

MIT
