watchmen for GPU scheduling
Project description
watchmen
A simple and easy-to-use toolkit for GPU scheduling.
Dependencies
- Python >= 3.6
- requests >= 2.24.0
- pydantic >= 1.7.1
- gpustat >= 0.6.0
- flask >= 1.1.2
- apscheduler >= 3.6.3
Installation
- Install dependencies.
$ pip install -r requirements.txt
- Install watchmen.
Install from source code:
$ pip install -e .
Or install the stable release from PyPI:
$ pip install gpu-watchmen -i https://pypi.org/simple
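As an optional sanity check, you can confirm that the package and module are visible to your environment:
$ pip show gpu-watchmen
$ python -c "import watchmen"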
Quick Start
- Start the server
The default port of the server is 62333.
$ python -m watchmen.server
If you want the server to keep running in the background, try:
$ nohup python -m watchmen.server &
There are some configuration options for the server:
usage: server.py [-h] [--host HOST] [--port PORT]
[--queue_timeout QUEUE_TIMEOUT]
[--request_interval REQUEST_INTERVAL]
[--status_queue_keep_time STATUS_QUEUE_KEEP_TIME]
optional arguments:
-h, --help show this help message and exit
--host HOST host address for api server
--port PORT port for api server
--queue_timeout QUEUE_TIMEOUT
timeout for queue waiting (seconds)
--request_interval REQUEST_INTERVAL
interval for gpu status requesting (seconds)
--status_queue_keep_time STATUS_QUEUE_KEEP_TIME
hours for keeping the client status
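For example, to make the server reachable from other machines and poll GPU status every 10 seconds (the values here are only illustrative):
$ python -m watchmen.server --host 0.0.0.0 --port 62333 --request_interval 10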
- Modify the source code in your project:
from watchmen import WatchClient
client = WatchClient(id="short description of this running", gpus=[1],
server_host="127.0.0.1", server_port=62333)
client.wait()
Once the program reaches client.wait(), this run is placed in the queue; the call blocks and the program continues when the requested GPUs become available.
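As a minimal end-to-end sketch (the id string and GPU index are placeholders, and pinning the process with CUDA_VISIBLE_DEVICES afterwards is a common convention rather than something watchmen does for you):

import os

from watchmen import WatchClient

# Ask the watchmen server for GPU 1 and block until it is granted.
client = WatchClient(id="mnist baseline run", gpus=[1],
                     server_host="127.0.0.1", server_port=62333)
client.wait()  # returns once the requested GPU is available

# Pin this process to the granted GPU before any CUDA initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
# ... your training code here ...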
You can check the examples in example/ for further reading.
$ cd example && python single_card_mnist.py --id="single" --cuda=0 --wait
# queue mode
$ cd example && python multi_card_mnist.py --id="multi" --cuda=2,3 --wait
# schedule mode
$ cd example && python multi_card_mnist.py --id='multi card scheduling wait' --cuda=1,0,3 --req_gpu_num=2 --wait=schedule
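Schedule mode can presumably be driven from Python as well. The sketch below only mirrors the --req_gpu_num and --wait=schedule flags above; the req_gpu_num and mode arguments, the ClientMode import, and the return value of wait() are assumptions, so check watchmen/client.py and example/multi_card_mnist.py for the exact interface.

from watchmen import WatchClient
from watchmen.client import ClientMode  # assumed location of the mode enum

# Ask for any 2 of GPUs 0, 1 and 3; the server decides which ones to assign.
client = WatchClient(id="multi card scheduling wait", gpus=[1, 0, 3],
                     req_gpu_num=2, mode=ClientMode.SCHEDULE,  # assumed argument names
                     server_host="127.0.0.1", server_port=62333)
assigned_gpus = client.wait()  # assumed to return the GPUs picked by the server
print("assigned:", assigned_gpus)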
- Check the queue in the browser.
Open the following link in your browser: http://<server ip address>:<server port>, for example: http://192.168.126.143:62333.
You will get a result like the demos below. Please be aware that the page does not update dynamically, so refresh it manually to check the latest status.
New Demo (scheduling mode supported)
Old Demo (queue mode supported)
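The same status page can also be fetched from a terminal (handy on headless machines); it simply returns the HTML that the browser renders:
$ curl http://192.168.126.143:62333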
- Reminder when program is finished.
watchmen also supports email and other kinds of reminders for sending notifications.
For example, you can send yourself an email when the program is finished.
from watchmen.reminder import send_email
... # your code here
send_email(
host="smtp.163.com", # email host to login, like `smtp.163.com`
port=25, # email port to login, like `25`
user="***@163.com", # user email address for login, like `***@163.com`
password="***", # password or auth code for login
receiver="***@outlook.com", # receiver email address
html_message="<h1>Your program is finished!</h1>", # content, html format supported
subject="Proram Finished Notice" # email subject
)
To get more reminders, please check watchmen/reminder.py.
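One way to make sure the reminder always fires, even when the run crashes, is to call send_email from a finally block. This is just a usage sketch around the call shown above; train() and the credentials are placeholders.

from watchmen.reminder import send_email

try:
    train()  # placeholder for your training entry point
finally:
    # Runs on success and on failure, so a notification is always sent.
    send_email(
        host="smtp.163.com",
        port=25,
        user="***@163.com",
        password="***",
        receiver="***@outlook.com",
        html_message="<h1>Your program has exited!</h1>",
        subject="Program Finished Notice",
    )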
UPDATE
- v0.3.3: fix check_finished bug in the server end, quit the main thread if the sub-thread quits, and remove the backend cmd in the main thread
- v0.3.2: fix WatchClient bug
- v0.3.1: change Client into WatchClient, fix ClientCollection and send_email bugs
- v0.3.0: support GPU scheduling, fix blank input/output, fix check_gpus_existence
- v0.2.2: fix html package data, add multi-card example
TODO
- import user authentication modules to help with working queue delete operations
- read programs' pids to help report program working status and kill tasks remotely
- test and support distributed model parallel configurations (with python -m torch.distributed.launch)
- prettify the web page and divide functions into different tabs
- GPU usage stats for each user and process
- quit the main thread if the sub-thread quits
- change Client into WatchClient, in case of any ambiguity
- ClientCollection/__contains__ should not include finished_queue, so that ids can be released
- fix the subject bug in reminder/send_email()
- add a schedule feature, so clients only have to request a number of GPUs from a candidate range, and the server assigns the GPUs to clients
- add reminders
- add webui html support
- add examples
Download files
Source Distribution
Built Distribution
File details
Details for the file gpu-watchmen-0.3.3.tar.gz.
File metadata
- Download URL: gpu-watchmen-0.3.3.tar.gz
- Upload date:
- Size: 8.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.53.0 CPython/3.7.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0f315b21046f96237b69fd97ef6a7378ac6b2a975f83e7acc55c93ae33a6d0a0
MD5 | 4512f185cb7e99939a0ce03400a14d18
BLAKE2b-256 | 3aa909e57672df296782d6e3b3393320896b4d2a3094016f91da9064f8e9064d
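If you download the archive manually, you can verify it against the SHA256 digest above, e.g. with coreutils on Linux:
$ sha256sum gpu-watchmen-0.3.3.tar.gz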
File details
Details for the file gpu_watchmen-0.3.3-py3-none-any.whl.
File metadata
- Download URL: gpu_watchmen-0.3.3-py3-none-any.whl
- Upload date:
- Size: 11.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.53.0 CPython/3.7.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | d01a53e69f1ce11919d6e80a9a82e2a4ba92d2bf9e9dd9814736735f7f2e3167
MD5 | dc4d9f4120b53875d22397d4b277b357
BLAKE2b-256 | 4768731874fbf30602c8e03949c27c9725717dbd84eb83ae532b71cfd3f9d0e2