
ΣΣJob


ΣΣJob, or SumsJob (Simple Utility for Multiple-Servers Job Submission), is a simple Linux command-line utility that submits a job to one of multiple servers, each with a limited number of GPUs. For a handful of GPU servers, ΣΣJob plays a role similar to that of the Slurm Workload Manager on supercomputers and computer clusters. It provides the following key functions:

  • report the state of GPUs on all servers,
  • submit a job to a server for execution in noninteractive mode, i.e., the job runs in the background on the server,
  • submit a job to a server for execution in interactive mode, as if the job were running on your local machine,
  • display all running jobs,
  • cancel running jobs.

Motivation

Assume you have a few GPU servers: server1, server2, ... When you need to run code from your computer, you will

  1. Select one server and log in

    $ ssh LAN (you may need to log in to a local area network first)
    $ ssh server1
    
  2. Check GPU status. If no GPU is free, go back to step 1

    $ nvidia-smi or $ gpustat

  3. Copy the code from your computer to the server

    $ scp -r codes server1:~/project/codes
    
  4. Run the code on the server

    $ cd ~/project/codes
    $ CUDA_VISIBLE_DEVICES=0 python main.py
    
  5. Transfer the results back

    $ scp server1:~/project/codes/results.dat .
    

These steps are tedious. ΣΣJob automates all of them.
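As a rough sketch, the manual workflow above boils down to composing a few ssh/scp commands. The function below is illustrative only (the names, paths, and structure are placeholders, not ΣΣJob's actual internals):

```python
import shlex

def build_job_commands(server, local_dir, remote_dir, script, gpu_id, result_file):
    """Compose the ssh/scp commands for the manual workflow above.
    All names here are illustrative placeholders."""
    copy_up = f"scp -r {local_dir} {server}:{remote_dir}"
    run = (f"ssh {server} "
           + shlex.quote(f"cd {remote_dir} && CUDA_VISIBLE_DEVICES={gpu_id} python {script}"))
    copy_back = f"scp {server}:{remote_dir}/{result_file} ."
    return [copy_up, run, copy_back]

for cmd in build_job_commands("server1", "codes", "~/project/codes",
                              "main.py", 0, "results.dat"):
    print(cmd)
```

ΣΣJob's value is that it also decides *which* server and GPU to use before running such commands, and monitors the job afterwards.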

Features

  • Simple to use
  • Two modes: noninteractive mode, and interactive mode
  • Noninteractive mode: the job runs in the background on the server
    • You can turn off your local machine
  • Interactive mode: as if the job were running on your local machine
    • Displays the program's output in your local terminal in real time
    • Kill the job with Ctrl-C

Commands

  • sinfo: Report the state of GPUs on all servers.
  • srun: Submit a job to GPU servers for execution.
  • sacct: Display all running jobs ordered by the start time.
  • scancel: Cancel a running job.

$ sinfo

Report the state of GPUs on all servers. For example,

$ sinfo
chitu                       Fri Dec 31 20:05:24 2021  470.74
[0] NVIDIA GeForce RTX 3080 | 27'C,   0 % |  2190 / 10018 MB | shuaim:python3/3589(2190M)
[1] NVIDIA GeForce RTX 3080 | 53'C,   7 % |  2159 / 10014 MB | lu:python/241697(2159M)

dilu                           Fri Dec 31 20:05:26 2021  470.74
[0] NVIDIA GeForce RTX 3080 Ti | 65'C,  73 % |  1672 / 12045 MB | chenxiwu:python/352456(1672M)
[1] NVIDIA GeForce RTX 3080 Ti | 54'C,  83 % |  1610 / 12053 MB | chenxiwu:python/352111(1610M)

Available GPU: chitu [0]
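The per-GPU lines in this report follow gpustat's output format. A minimal parser for the fields a scheduler would need might look like the following; this is a sketch of the idea, not ΣΣJob's actual parser:

```python
import re

# Pattern for report lines like:
# [0] NVIDIA GeForce RTX 3080 | 27'C,   0 % |  2190 / 10018 MB | shuaim:python3/3589(2190M)
GPU_LINE = re.compile(
    r"\[(?P<index>\d+)\]\s+(?P<name>.+?)\s+\|\s+"
    r"(?P<temp>\d+)'C,\s+(?P<util>\d+)\s*%\s+\|\s+"
    r"(?P<used>\d+)\s*/\s*(?P<total>\d+)\s*MB"
)

def parse_gpu_line(line):
    """Extract index, utilization (%) and free memory (MB) from one report line."""
    m = GPU_LINE.search(line)
    if not m:
        return None
    used, total = int(m["used"]), int(m["total"])
    return {
        "index": int(m["index"]),
        "name": m["name"],
        "util": int(m["util"]),
        "free_mb": total - used,
    }

info = parse_gpu_line("[0] NVIDIA GeForce RTX 3080 | 27'C,   0 % |  2190 / 10018 MB | shuaim:python3/3589(2190M)")
print(info)  # {'index': 0, 'name': 'NVIDIA GeForce RTX 3080', 'util': 0, 'free_mb': 7828}
```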

$ srun jobfile [jobname]

Submit a job to the GPU servers for execution. srun automatically performs the following steps:

  1. Find a GPU with low utilization and sufficient memory (the criterion is in the configuration file).
    • If no GPU is currently available, it waits for some time (-p PERIOD_RETRY) and tries again, up to the maximum number of retries (-n NUM_RETRY).
    • You can also specify the server and GPU by -s SERVER and --gpuid GPUID.
  2. Copy the code to the server.
  3. Run the job on it in noninteractive mode (default) or interactive mode (with -i).
  4. Save the output in a log file.
  5. In interactive mode, when the code finishes, transfer the result files and the log file back.
  • jobfile : File to be run
  • jobname : Job name, also used as the job's folder name. If not provided, a random number is used.
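The selection in step 1 could look roughly like this; the utilization and memory thresholds below are illustrative defaults, while the real criterion lives in the configuration file:

```python
def pick_gpu(gpus, max_util=20, min_free_mb=4000):
    """Return (server, gpu_index) of the first GPU meeting the criterion, or None.
    `gpus` maps server name -> list of (index, util_percent, free_mb) tuples.
    Thresholds are illustrative, not ΣΣJob's real configuration."""
    for server, entries in gpus.items():
        for index, util, free_mb in entries:
            if util <= max_util and free_mb >= min_free_mb:
                return server, index
    return None

# State matching the sinfo example above: chitu is idle, dilu is busy.
state = {
    "chitu": [(0, 0, 7828), (1, 7, 7855)],
    "dilu":  [(0, 73, 10373), (1, 83, 10443)],
}
print(pick_gpu(state))  # ('chitu', 0)
```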

Options:

  • -h, --help : Show this help message and exit
  • -i, --interact : Run the job in interactive mode
  • -s SERVER, --server SERVER : Server host name
  • --gpuid GPUID : GPU ID to be used; -1 to use CPU only
  • -n NUM_RETRY, --num_retry NUM_RETRY : Number of times to retry the submission (Default: 1000)
  • -p PERIOD_RETRY, --period_retry PERIOD_RETRY : Waiting time in seconds between two consecutive retries (Default: 1000)
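The retry behavior controlled by -n and -p can be sketched as follows; `find_gpu` is a placeholder callable, not part of ΣΣJob's API:

```python
import time

def submit_with_retry(find_gpu, num_retry=1000, period_retry=600, sleep=time.sleep):
    """Try to find a free GPU, waiting `period_retry` seconds between attempts.
    Mirrors the -n/--num_retry and -p/--period_retry options; `find_gpu` is a
    placeholder returning a (server, gpu_index) pair or None."""
    for attempt in range(num_retry):
        gpu = find_gpu()
        if gpu is not None:
            return gpu
        sleep(period_retry)
    raise RuntimeError(f"no free GPU after {num_retry} retries")

# Simulate a GPU becoming free on the third attempt (no real waiting).
attempts = iter([None, None, ("chitu", 0)])
print(submit_with_retry(lambda: next(attempts), num_retry=5, period_retry=0))
# ('chitu', 0)
```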

$ sacct

Display all running jobs ordered by the start time. For example,

$ sacct
Server   JobName          Start
-------- ---------------- ----------------------
chitu    job1             12/31/2021 07:41:08 PM
chitu    job2             12/31/2021 08:14:54 PM
dilu     job3             12/31/2021 08:15:23 PM

$ scancel jobname

Cancel a running job.

  • jobname : Job name.

Installation

ΣΣJob requires Python 3.7 or later. Install with pip:

$ pip install sumsjob

You also need to do the following:

  • Make sure you can ssh to each server, ideally password-free via SSH keys.
  • Install gpustat on each server.
  • Create a configuration file at ~/.sumsjob/config.py. Use config.py as a template and modify the values to match your setup.
  • Make sure ~/.local/bin is in your $PATH.
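For orientation, a ~/.sumsjob/config.py might look like the sketch below. The variable names and values here are guesses for illustration only; copy the actual keys from the config.py template shipped with the package:

```python
# ~/.sumsjob/config.py -- hypothetical example; use the shipped
# config.py template for the actual variable names.
servers = ["chitu", "dilu"]          # SSH host names (e.g., from ~/.ssh/config)
remote_dir = "~/project"             # where job folders are created on the server
exclude = ["*.dat", "__pycache__"]   # files not copied to the server
max_utilization = 20                 # a GPU counts as free below this load (%)
min_free_memory = 4000               # ...and with at least this much free memory (MB)
```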

Then run sinfo to check that everything works.

License

GNU GPLv3
