A utility for managing SLURM jobs and nodes with enhanced display features.

WrapSlurm

WrapSlurm is a powerful and user-friendly wrapper for SLURM job management, designed to simplify job submission, resource querying, log monitoring, and cancellation in SLURM environments. With a suite of commands like wrun, wlog, wqueue, winfo, and wk, WrapSlurm enhances productivity for researchers and engineers working in high-performance computing (HPC) clusters.


Features

  • Simplified Job Submission (wr):

    • Automatically detect optimal resources (nodes, partitions, CPUs, memory, GPUs) based on the cluster's configuration.
    • Friendly summaries before each run highlight auto-detected values and log locations.
    • Persist preferred defaults (e.g., partition, account, log directory) with --save-defaults.
    • Automatically use the partition's maximum runtime when no explicit --time is provided.
    • Support for interactive and non-interactive SLURM jobs, plus a convenient --dry-run preview mode.
    • Customizable SLURM settings like time, tasks per node, exclusions, job names, and output directories.
  • Log Monitoring (wl):

    • Watch real-time SLURM logs for specific job IDs or the latest job.
  • Job Cancellation (wk):

    • Quickly terminate jobs (optionally with a signal) using a friendly wrapper around scancel.
  • Queue Visualization (wq):

    • View and analyze job queues in a prettified table format with color-coded states.
  • Node Resource Querying (wi):

    • Display detailed SLURM node information, including memory, CPU, and GPU usage.
  • Help / Usage (ws):

    • Display a summary of all WrapSlurm commands and their usage.

Installation

WrapSlurm is available on PyPI and can be installed using pip:

pip install wrapslurm

Post-Installation Notes

If the scripts wrun, wlog, wqueue, winfo, and wk are installed in a directory not included in your system's PATH (e.g., ~/.local/bin), you may need to update your PATH environment variable:

  1. Add the following line to your shell configuration file (~/.bashrc or ~/.zshrc):

    export PATH="$PATH:$HOME/.local/bin"
    
  2. Reload your shell:

    source ~/.bashrc  # or source ~/.zshrc
    

Usage

1. Submit a Job (wrun)

Basic Usage:

Submit a script with auto-detected resources:

wr ./train_script.py --epochs 10

wr shows a colorized summary of the resources that will be requested, including values auto-detected from sinfo and those loaded from saved defaults.
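To illustrate the kind of auto-detection described above, here is a minimal Python sketch that reads partition time limits from text shaped like the output of `sinfo --noheader -o "%P %l"`. It is not WrapSlurm's actual implementation. The function names and the sample partitions other than gp4d are made up for illustration.

```python
from datetime import timedelta

# Hypothetical sample in the shape of `sinfo --noheader -o "%P %l"`:
# partition name (default partition marked with '*') and its time limit.
SAMPLE_SINFO = """\
gp4d* 4-00:00:00
debug 30:00
long infinite
"""


def parse_time_limit(text):
    """Convert a SLURM time string to a timedelta (None means no limit)."""
    text = text.strip()
    if text.lower() in ("infinite", "unlimited", "n/a"):
        return None
    days, has_days = 0, "-" in text
    if has_days:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    if has_days:
        # days-hours[:minutes[:seconds]]
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    elif len(parts) == 3:
        hours, minutes, seconds = parts
    elif len(parts) == 2:
        # minutes:seconds
        hours, minutes, seconds = 0, parts[0], parts[1]
    else:
        # a bare number is a count of minutes
        hours, minutes, seconds = 0, parts[0], 0
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def parse_partitions(sinfo_text):
    """Map each partition name to its default flag and maximum runtime."""
    partitions = {}
    for line in sinfo_text.splitlines():
        if not line.strip():
            continue
        name, limit = line.split()
        partitions[name.rstrip("*")] = {
            "default": name.endswith("*"),
            "max_time": parse_time_limit(limit),
        }
    return partitions
```

A wrapper could use the `max_time` of the chosen partition as the job's time limit when no explicit --time is given.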

Specify Resources:

Submit a job with explicit resources:

wr --nodes 2 --partition gp4d --account ENT212162 --cpus-per-task 8 --memory 200G --gpus 4 ./train_script.py

You can also name the job, change where helper scripts are stored, or choose a custom log directory:

wr --job-name my-training --script-dir ./sbatch --report-dir ./logs python train.py

Interactive Mode:

Start an interactive session:

wr

Use wr --interactive --nodes 2 to override the automatic detection while still launching an interactive shell.

Save Your Defaults:

You can persist frequently used settings (e.g., partition, account, log directory) so future runs pick them up automatically:

wr --save-defaults --partition gp4d --account ENT212162 --report-dir ./slurm-report

Defaults are stored in ~/.config/wrapslurm/defaults.json. Running wr --save-defaults stores the provided flags and exits without submitting a job.
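The sketch below shows one way saved defaults could be persisted and merged with other settings. It is an illustration only. The JSON layout, the function names, and the precedence order (explicit flags over saved defaults over auto-detected values) are assumptions, not documented WrapSlurm behavior.

```python
import json
from pathlib import Path

# Location taken from the text above; the file's schema is assumed.
DEFAULTS_PATH = Path.home() / ".config" / "wrapslurm" / "defaults.json"


def load_defaults(path=DEFAULTS_PATH):
    """Return saved defaults, or an empty dict if nothing was saved yet."""
    try:
        return json.loads(Path(path).read_text())
    except FileNotFoundError:
        return {}


def save_defaults(values, path=DEFAULTS_PATH):
    """Write defaults as JSON, creating the config directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(values, indent=2, sort_keys=True))


def resolve_options(cli_values, saved, detected):
    """Merge settings; assumed precedence: CLI flag > saved default > detected."""
    merged = dict(detected)
    merged.update(saved)
    # Flags the user did not pass are represented as None and are skipped.
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged
```

With this layout, a flag the user does not pass falls back to the saved default, and only then to whatever was auto-detected.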

Full Help:

View all available options:

wr --help

Preview the Generated Script:

wr --dry-run python train.py

Dry runs print the exact sbatch script so you can review the environment setup before submitting.
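For a rough idea of what such a generated script can look like, here is a hedged Python sketch that assembles an sbatch script from a handful of options. The function name, its parameters, and the use of srun to launch the command are assumptions made for illustration. The script WrapSlurm actually emits may differ.

```python
import shlex


def build_sbatch_script(command, *, job_name="wrapslurm", nodes=1,
                        partition=None, account=None, cpus_per_task=None,
                        memory=None, gpus=None, time=None,
                        report_dir="./slurm-report"):
    """Return the text of an sbatch script; options left as None are omitted."""
    directives = [
        ("job-name", job_name),
        ("nodes", nodes),
        ("partition", partition),
        ("account", account),
        ("cpus-per-task", cpus_per_task),
        ("mem", memory),
        ("gpus", gpus),
        ("time", time),
        # %j is replaced by the job ID when SLURM opens the log files.
        ("output", f"{report_dir}/%j.out"),
        ("error", f"{report_dir}/%j.err"),
    ]
    lines = ["#!/bin/bash"]
    lines += [f"#SBATCH --{key}={value}"
              for key, value in directives if value is not None]
    # Assumed launcher: run the user's command through srun.
    lines += ["", "srun " + shlex.join(command)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_sbatch_script(["python", "train.py"],
                              partition="gp4d", gpus=4, time="2-00:00:00"))
```

Printing the script instead of submitting it is all a dry-run mode needs to do, which makes it easy to review before a real submission.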


2. Monitor Logs (wlog)

wlog streams SLURM output with tail -n 20 -f so you can follow job progress without the extra load from watch.

Logs are written to ./slurm-report/%j.out and ./slurm-report/%j.err by default.

Watch the Latest Log File:

wl

Watch Logs for a Specific Job ID:

wl --job-id 12345678

To inspect stderr instead, open ./slurm-report/12345678.err with your preferred tool.
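The log lookup can be pictured with the following Python sketch. The helper names are hypothetical, and choosing the "latest" log by file modification time is an assumption. The real tool may instead pick the highest job ID or ask SLURM directly.

```python
from pathlib import Path


def log_path(job_id, report_dir="./slurm-report", stream="out"):
    """Path of a job's log file, following the <job id>.out / .err pattern."""
    return Path(report_dir) / f"{job_id}.{stream}"


def latest_log(report_dir="./slurm-report", stream="out"):
    """Most recently modified log in report_dir, or None if there is none."""
    candidates = list(Path(report_dir).glob(f"*.{stream}"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def tail_command(path, lines=20):
    """Argument list for following a log with tail, as described above."""
    return ["tail", "-n", str(lines), "-f", str(path)]
```

Passing `stream="err"` to `log_path` gives the stderr file for the same job, which can then be handed to `tail_command` in the same way.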


3. Cancel a Job (wk)

Send scancel commands without memorizing flags:

wk 12345678

Cancel multiple jobs in one go:

wk 12345678 12345679

Pass through additional options such as a signal or user scope:

wk 12345678 --signal SIGINT
wk --user alice 12345680

All options are forwarded to scancel, so you can combine them as needed.
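As a minimal sketch of this kind of forwarding, the Python function below turns job IDs and two optional settings into an scancel argument list. The function name and signature are hypothetical, and only the --signal and --user options shown above are covered.

```python
def build_scancel_command(job_ids, signal=None, user=None):
    """Build the scancel argument list for the given jobs and options."""
    cmd = ["scancel"]
    if signal:
        cmd.append(f"--signal={signal}")
    if user:
        cmd.append(f"--user={user}")
    # Job IDs go last, converted to strings for subprocess-style APIs.
    cmd.extend(str(job_id) for job_id in job_ids)
    return cmd


if __name__ == "__main__":
    print(build_scancel_command([12345678, 12345679], signal="SIGINT"))
```

The resulting list can be passed to `subprocess.run` on a machine where scancel is installed.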


4. View Job Queue (wqueue)

Display the job queue in a table format:

wqueue

Filter or split GPU vs. CPU-only jobs:

wq --gpu     # only jobs that request one or more GPUs
wq --cpu     # only CPU-only jobs (no GPUs)   (alias: --no-gpu)
wq --split   # GPU jobs and CPU-only jobs shown in two separate tables

Short flags -g (--gpu), -c (--cpu) and -s (--split) are also supported. Run wq --help for the full list of options.
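The GPU vs. CPU-only split can be sketched in Python as follows. This is an illustration under assumptions: each job is represented as a dict with a hypothetical `tres_per_node` field holding a string such as `gres/gpu:4`, and the helper names are invented here.

```python
import re

# Matches "gpu", an optional type such as ":a100", and an optional count.
_GPU_PATTERN = re.compile(r"gpu(?::[A-Za-z][\w.-]*)?(?:[:=](\d+))?")


def gpu_count(tres):
    """Number of GPUs named in a GRES/TRES string (0 if none are requested)."""
    match = _GPU_PATTERN.search(tres or "")
    if not match:
        return 0
    # A GPU entry without an explicit count is treated as one GPU.
    return int(match.group(1)) if match.group(1) else 1


def split_jobs(jobs):
    """Partition jobs into (gpu_jobs, cpu_only_jobs), preserving order."""
    gpu_jobs = [j for j in jobs if gpu_count(j.get("tres_per_node")) > 0]
    cpu_jobs = [j for j in jobs if gpu_count(j.get("tres_per_node")) == 0]
    return gpu_jobs, cpu_jobs
```

Showing only the first list corresponds to a GPU filter, only the second to a CPU-only filter, and both to a split view.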


5. Query Node Resources (winfo)

Basic Usage:

winfo

Include Down or Drained Nodes:

winfo --include-down

Display GPU Usage Graph:

winfo --graph
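A text-based usage graph of this kind can be approximated with a few lines of Python. The bar style, the function name, and the example node data below are invented for illustration. The actual winfo output format is not specified here.

```python
def usage_bar(used, total, width=20):
    """Render used/total as a fixed-width text bar."""
    if total <= 0:
        return "[" + " " * width + "] n/a"
    # Clamp so that over-reported usage never overflows the bar.
    filled = min(width, round(width * used / total))
    return "[" + "#" * filled + "." * (width - filled) + f"] {used}/{total}"


if __name__ == "__main__":
    # Hypothetical per-node GPU usage: node name -> (GPUs in use, GPUs total).
    nodes = {"node01": (4, 8), "node02": (8, 8), "node03": (0, 8)}
    for name, (used, total) in nodes.items():
        print(f"{name}  {usage_bar(used, total)}")
```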

Example Workflow

  1. Query available resources:

    wi
    
  2. Submit a job:

    wr --account xxxxxx --time 2-00:00:00 ./train_script.py
    
  3. Monitor job logs:

    wl
    
  4. Check the queue:

    wq
    

Development

Cloning the Repository

git clone https://github.com/yourusername/wrapslurm.git
cd wrapslurm

Install Dependencies

Install the required Python packages:

pip install -r requirements.txt

Run Tests

Execute unit tests:

pytest

Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository.
  2. Create a feature branch:
    git checkout -b feature-name
    
  3. Commit your changes:
    git commit -m "Add feature-name"
    
  4. Push to your fork:
    git push origin feature-name
    
  5. Submit a pull request.

License

This project is licensed under the MIT License.


Acknowledgments

Special thanks to the SLURM community for making HPC resource management accessible to researchers worldwide.

