Slurm GPU quota monitoring and management
slurmq
GPU quota management for Slurm clusters.
$ slurmq check
╭──────────────────── GPU Quota Report ────────────────────╮
│ │
│ User: dedalus │
│ QoS: medium │
│ Cluster: Stella HPC │
│ │
│ ████████████████████░░░░░░░░░░ 68.5% │
│ │
│ Used: 342.5 GPU-hours │
│ Remaining: 157.5 GPU-hours │
│ Quota: 500 GPU-hours (rolling 30 days) │
│ │
╰──────────────────────────────────────────────────────────╯
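The figures in the report are related by simple arithmetic: the percentage is used over quota, and remaining is quota minus used. The following is a minimal sketch of that arithmetic, not slurmq's actual code; the function name `quota_summary`, the 30-cell bar width, and the round-down rule for the bar are assumptions made for illustration.

```python
def quota_summary(used: float, quota: float, width: int = 30) -> dict:
    """Compute the numbers shown in a quota report (illustrative only)."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    fraction = min(used / quota, 1.0)   # cap the bar at 100%
    filled = int(fraction * width)      # assumption: round down to whole cells
    return {
        "percent": round(100 * used / quota, 1),
        "remaining": max(quota - used, 0.0),
        "bar": "█" * filled + "░" * (width - filled),
    }


summary = quota_summary(used=342.5, quota=500)
print(f"{summary['bar']} {summary['percent']}%")
print(f"Remaining: {summary['remaining']} GPU-hours")
```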
Install
uv tool install slurmq
Setup
slurmq config init # interactive wizard
slurmq config show # verify settings
slurmq config validate # check syntax before deploy
Config resolution order:
1. SLURMQ_CONFIG env var
2. ~/.config/slurmq/config.toml (user)
3. /etc/slurmq/config.toml (system-wide)
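A lookup with this precedence could be sketched as follows. This is an illustration of the order listed above, not slurmq's implementation: `resolve_config_path` is a hypothetical helper, and treating an explicit `SLURMQ_CONFIG` value as winning even when the file does not exist is an assumption.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_config_path(
    env: Mapping[str, str] = os.environ,
    exists: Callable[[Path], bool] = Path.is_file,
) -> Optional[Path]:
    """Return the first config path that applies, or None (hypothetical helper)."""
    override = env.get("SLURMQ_CONFIG")
    if override:
        # Assumption: an explicit override always wins.
        return Path(override)
    candidates = [
        Path.home() / ".config" / "slurmq" / "config.toml",  # user
        Path("/etc/slurmq/config.toml"),                      # system-wide
    ]
    for path in candidates:
        if exists(path):
            return path
    return None


print(resolve_config_path(env={"SLURMQ_CONFIG": "/tmp/custom.toml"}))
```

The `env` and `exists` parameters exist only so the lookup can be exercised without touching the real environment or filesystem.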
default_cluster = "stella"
[clusters.stella]
name = "Stella HPC"
account = "research"
qos = ["low", "medium"]
quota_limit = 500 # GPU-hours
rolling_window_days = 30
Commands
check
slurmq check # current user
slurmq check --user alice # specific user
slurmq check --cluster other # different cluster
slurmq check --forecast # usage projection
slurmq --json check # machine-readable
slurmq --quiet check # silent on success (for scripts)
efficiency
Analyze job resource efficiency (like seff).
slurmq efficiency 12345
Flags low efficiency when CPU efficiency is below 30% or memory efficiency is below 20%.
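The threshold check itself is simple. A minimal sketch, assuming efficiency is expressed as a percentage and that each resource is flagged independently; `low_efficiency_flags` is a hypothetical name, not part of slurmq.

```python
CPU_THRESHOLD = 30.0     # percent, from the thresholds stated above
MEMORY_THRESHOLD = 20.0  # percent, from the thresholds stated above


def low_efficiency_flags(cpu_pct: float, mem_pct: float) -> list[str]:
    """Return the resources whose efficiency falls below the threshold."""
    flags = []
    if cpu_pct < CPU_THRESHOLD:
        flags.append("cpu")
    if mem_pct < MEMORY_THRESHOLD:
        flags.append("memory")
    return flags


print(low_efficiency_flags(cpu_pct=12.0, mem_pct=55.0))  # ['cpu']
```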
report
Generate usage reports (admin).
slurmq report # table view
slurmq report --format csv -o out.csv
monitor
Real-time monitoring with optional enforcement (admin).
slurmq monitor # live dashboard, 30s refresh
slurmq monitor --interval 10
slurmq monitor --once # single check, for cron
slurmq monitor --enforce # cancel jobs over quota
stats
Cluster-wide analytics with month-over-month comparison.
slurmq stats # GPU utilization + wait times
slurmq stats --days 14 # custom period
slurmq stats --no-compare # skip MoM comparison
slurmq stats -p gpu -p gpu-large # specific partitions
slurmq stats --small-threshold 25 # custom job size threshold
slurmq --json stats # machine-readable
Shows:
- GPU utilization by partition/QoS
- Wait time analysis (median, % jobs waiting > 6h)
- Small vs large job breakdown
- Month-over-month trends
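The wait-time figures (median, and the share of jobs waiting more than 6 hours) can be reproduced from a list of per-job wait times. The sketch below assumes wait times are already available in hours; `wait_time_summary` is a hypothetical helper, and how slurmq gathers the underlying data is not covered here.

```python
from statistics import median


def wait_time_summary(wait_hours: list[float], long_wait_hours: float = 6.0) -> dict:
    """Summarize queue wait times: median and % of jobs above a cutoff."""
    if not wait_hours:
        return {"median_hours": None, "pct_long_wait": None}
    long_waits = sum(1 for w in wait_hours if w > long_wait_hours)
    return {
        "median_hours": median(wait_hours),
        "pct_long_wait": 100 * long_waits / len(wait_hours),
    }


stats = wait_time_summary([0.2, 1.5, 3.0, 7.5, 12.0])
print(stats)
```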
Enforcement
Cancel jobs automatically when users exceed quota.
[enforcement]
enabled = true
dry_run = true # preview mode
grace_period_hours = 24 # warn before cancel
exempt_users = ["admin"]
exempt_job_prefixes = ["checkpoint_"]
Run with slurmq monitor --enforce. Keep dry_run = true while previewing, then set dry_run = false when you are ready to cancel jobs.
Grace period: users exceeding quota get a warning window before jobs are cancelled.
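One way to read how these settings interact is as a per-job decision. The sketch below is an interpretation of the config keys above, not slurmq's actual logic: the `decide` function, its return values, and the assumption that the grace period is measured from the moment the user first exceeded quota are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class EnforcementConfig:
    enabled: bool = True
    dry_run: bool = True
    grace_period_hours: int = 24
    exempt_users: list[str] = field(default_factory=lambda: ["admin"])
    exempt_job_prefixes: list[str] = field(default_factory=lambda: ["checkpoint_"])


def decide(cfg: EnforcementConfig, user: str, job_name: str,
           exceeded_since: Optional[datetime], now: datetime) -> str:
    """Return 'skip', 'warn', 'would_cancel', or 'cancel' (hypothetical labels)."""
    if not cfg.enabled or exceeded_since is None:
        return "skip"  # enforcement off, or user is within quota
    if user in cfg.exempt_users:
        return "skip"
    if any(job_name.startswith(prefix) for prefix in cfg.exempt_job_prefixes):
        return "skip"
    if now - exceeded_since < timedelta(hours=cfg.grace_period_hours):
        return "warn"  # still inside the warning window
    return "would_cancel" if cfg.dry_run else "cancel"


now = datetime(2025, 1, 2, 12, 0)
print(decide(EnforcementConfig(), "alice", "train", datetime(2025, 1, 1), now))
```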
Job States
Problematic states are highlighted:
| State | Meaning |
|---|---|
| OOM | Out of Memory |
| TO | Timeout |
| NF | Node Failure |
| F | Failed |
| PR | Preempted |
Scripting
# check quota status
if slurmq --json check | jq -e '.status == "exceeded"' > /dev/null; then
echo "Quota exceeded"
fi
# cron: enforce every 5 minutes (quiet mode)
*/5 * * * * slurmq --quiet monitor --once --enforce >> /var/log/slurmq.log 2>&1
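The same quota check can be done from Python by parsing the `--json` output. This sketch relies only on the `status` field and the `"exceeded"` value used in the jq example above; any other fields or status values in the output are not assumed, and the `"ok"` value in the sample call is a placeholder. The helper names are hypothetical.

```python
import json
import subprocess


def quota_exceeded(report_json: str) -> bool:
    """Return True if the JSON report marks the quota as exceeded."""
    report = json.loads(report_json)
    return report.get("status") == "exceeded"


def check_quota() -> bool:
    """Run `slurmq --json check` and interpret its output."""
    result = subprocess.run(
        ["slurmq", "--json", "check"],
        capture_output=True, text=True, check=True,
    )
    return quota_exceeded(result.stdout)


# Exercise the parser with sample payloads ("ok" is a placeholder value).
print(quota_exceeded('{"status": "exceeded"}'))
print(quota_exceeded('{"status": "ok"}'))
```

Keeping the parsing separate from the subprocess call makes it possible to test the logic on machines where slurmq is not installed.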
Documentation
Online: dedalus-labs.github.io/slurmq
For LLMs: llms.txt | llms-full.txt
Locally:
uv sync --extra docs
uv run mkdocs serve
Development
git clone https://github.com/dedalus-labs/slurmq.git && cd slurmq
uv sync --all-extras
uv run pytest
uv run ruff check
uv run ty check
License
MIT