DL training on GPU management
GPU limit management
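Machine learning experiments often involve many parameters, so several groups of experiments usually need to be run with different parameter settings.
This project maintains a queue of tasks that use the GPU and schedules them dynamically, removing the tedium of launching experiments by hand.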
install
The code is still undergoing major changes and still has many bugs.
Install from source
git clone https://github.com/lichunown/gpu-limit.git
cd gpu-limit
python setup.py install
Install with pip
pip3 install gpulimit
usage
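This program communicates over a Linux socket: the background gpulimit_server schedules tasks dynamically, and the foreground gpulimit client sends commands and retrieves information.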
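The client/server split described above can be sketched as follows. This is only an illustration of a command sent over a Unix domain socket, not gpulimit's actual protocol: the socket path, the plain-text message format, and the names read_all, serve_once, send_command, and echo_handler are all assumptions made for the example.

```python
import os
import socket
import tempfile
import threading


def read_all(sock):
    """Read from the socket until the peer closes its sending side."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            return b"".join(chunks).decode("utf-8")
        chunks.append(data)


def serve_once(server_sock, handler):
    """Accept one connection, read one command, and send back the reply."""
    conn, _ = server_sock.accept()
    with conn:
        command = read_all(conn)
        conn.sendall(handler(command).encode("utf-8"))


def send_command(path, command):
    """Connect to the Unix socket at `path`, send a command, return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall(command.encode("utf-8"))
        client.shutdown(socket.SHUT_WR)  # tell the server the request is complete
        return read_all(client)


def echo_handler(command):
    """Toy handler; a real server would dispatch on the command instead."""
    return "received: " + command


def demo():
    """Run a toy server in a background thread and query it once."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")  # hypothetical socket path
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.settimeout(5.0)  # do not wait forever if no client connects
            server.bind(path)
            server.listen(1)
            worker = threading.Thread(target=serve_once, args=(server, echo_handler))
            worker.start()
            try:
                return send_command(path, "ls")
            finally:
                worker.join()
```

Calling demo() starts the toy server, sends the text "ls" as a client would, and returns the server's reply. The client half-closes the connection after sending so the server can tell where the request ends without a length prefix.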
Start the background service
gpulimit_server # start directly
nohup gpulimit_server & # run in the background
Client commands
$ gpulimit help
GPU Task Manage:
usage:
    client.py -h                       show help
    gpulimit add [cmds]                add task [cmds] to gpulimit queue.
other commands:
    help [cmd]                         show help
    add [cmds]                         ls GPU task queue status
    ls                                 ls GPU task queue status
    show [id]                          show task [id] details.
    rm [id]                            remove task [id] from manage,
                                       if task is running, kill it.
    kill [id]                          kill task [id]
    move [id] [index(default=0)]       move [id] to [index]
    set [name] [value]                 set some property.
    start [id defalut=None]            Force start task(s).
    log [id]                           show [id] output.
    status                             show System status.
    debug [id]                         if task [id] is `CMD_ERROR`,
                                       use this show error traceback.
Add a task
gpulimit add [cmds]
# for example
# gpulimit add python3 main.py --lambda=12 --alpha=1
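Since the point of the queue is to run many parameter combinations, the add command lends itself to scripting. The sketch below only builds the argument lists for gpulimit add; the helper name build_add_commands and the --name=value flag style (copied from the example above) are assumptions, and the training script is expected to parse those flags itself.

```python
import itertools
import shlex


def build_add_commands(script, grid):
    """Return one `gpulimit add ...` argument list per parameter combination."""
    names = sorted(grid)  # fixed order so the output is reproducible
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        flags = ["--{}={}".format(name, value) for name, value in zip(names, values)]
        commands.append(["gpulimit", "add", "python3", script] + flags)
    return commands


def format_command(argv):
    """Render an argument list as a shell-safe command line for inspection."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

For example, build_add_commands("main.py", {"lambda": [12, 24], "alpha": [1, 2]}) yields four argument lists. Printing them with format_command first is a cheap way to check the sweep; once gpulimit_server is running, each list could be passed to subprocess.run(argv, check=True) to submit it.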
List tasks
gpulimit ls
Show task details
gpulimit show [task id]
View a task's output log
gpulimit log [task id]
Likewise, you can view the background output of gpulimit_server:
gpulimit log main
scheduling
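Scheduling in the system is abstracted into the following four kinds of calls:
- timer_call: a timer that runs at a fixed interval
- callback_process_end: callback invoked when a single task ends
- callback_add_process: callback invoked when the user adds a task
- user_start_scheduling: invoked when the user forces a task to run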
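To make the four entry points concrete, here is a toy scheduler that exposes methods with the same names. It is a sketch under stated assumptions, not the project's scheduler: the class name SchedulerSketch, the max_running limit (standing in for a real check of free GPU memory), and the plain lists used as queues are all hypothetical.

```python
class SchedulerSketch:
    """Toy scheduler exposing the four entry points named above."""

    def __init__(self, max_running=1):
        self.max_running = max_running  # hypothetical stand-in for a GPU memory check
        self.waiting = []               # task ids not yet started, in queue order
        self.running = []               # task ids currently running

    def _fill_slots(self):
        # Start waiting tasks while there is spare capacity.
        while self.waiting and len(self.running) < self.max_running:
            self.running.append(self.waiting.pop(0))

    def timer_call(self):
        """Periodic tick: retry starting tasks that could not start earlier."""
        self._fill_slots()

    def callback_process_end(self, task_id):
        """A task finished: free its slot and start the next waiting task."""
        self.running.remove(task_id)
        self._fill_slots()

    def callback_add_process(self, task_id):
        """The user added a task: queue it and start it if capacity allows."""
        self.waiting.append(task_id)
        self._fill_slots()

    def user_start_scheduling(self, task_id):
        """The user forces a task to start, bypassing the capacity limit."""
        if task_id in self.waiting:
            self.waiting.remove(task_id)
            self.running.append(task_id)
```

All four entry points funnel into the same slot-filling step except the forced start, which deliberately ignores the limit.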
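Task information:
- priority: default=5; a smaller value means higher priority
- status
  - 'CMD_ERROR': the command itself is invalid and Python raises an error (Windows only)
  - 'complete': the task finished
  - 'waiting': waiting to be scheduled
  - 'running': currently running
  - 'runtime_error': the task failed while running, possibly because GPU memory was exhausted or because of a bug in the program
  - 'killed': a running process that was killed by the user; it can be restarted with the start command
  - 'paused': a paused process (it still occupies GPU memory while paused)
- run_times: task errors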
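The priority and status fields can be illustrated with a small record type and a selection function. This is a minimal sketch, assuming that only tasks in the 'waiting' state are eligible to start; the Task class, its field names other than priority and status, and next_task are hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass

# Status strings as listed above.
STATUSES = {
    "CMD_ERROR", "complete", "waiting", "running",
    "runtime_error", "killed", "paused",
}


@dataclass
class Task:
    task_id: int             # hypothetical field name
    cmds: str                # hypothetical field name: the command line to run
    priority: int = 5        # smaller value means higher priority
    status: str = "waiting"


def next_task(tasks):
    """Return the waiting task with the smallest priority value, or None."""
    waiting = [task for task in tasks if task.status == "waiting"]
    if not waiting:
        return None
    # min() returns the first minimum, so equal priorities keep queue order.
    return min(waiting, key=lambda task: task.priority)
```

With this rule, a task added with priority 1 is picked ahead of tasks left at the default of 5, while running, killed, or paused tasks are skipped.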
V0.2.0
- Rewrote the status states
- Adjusted task scheduling
- Simplified the scheduling algorithm
TODO list
- change raise type, and add try except for exception break.
- __doc__
- kill all, range
- add commits
- use priority queue as task_manage.queue
- Improve scheduling algorithm
- catch memory error in cmds, when cmds is python ... and use tf or pytorch.
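For the priority-queue item in the list above, one possible shape is a heap keyed on the priority value. This is a sketch of the general technique using the standard heapq module, not a planned design from the project; the TaskQueue class and its methods are hypothetical.

```python
import heapq
import itertools


class TaskQueue:
    """Priority queue: smallest priority value first, FIFO among equals."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # insertion order breaks ties

    def push(self, task_id, priority=5):
        """Add a task id with the given priority (default 5)."""
        heapq.heappush(self._heap, (priority, next(self._counter), task_id))

    def pop(self):
        """Remove and return the id of the next task to run."""
        _, _, task_id = heapq.heappop(self._heap)
        return task_id

    def __len__(self):
        return len(self._heap)
```

One trade-off to keep in mind: the existing move [id] [index] command assumes positional ordering, so a heap-based queue would need another way to express manual reordering, for instance by changing a task's priority.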