DL training on GPU management

GPU limit management

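Machine learning experiments often involve many parameters, so you typically need to run many experiments with different parameter settings.

This project maintains a queue of tasks that use the GPU and schedules them dynamically, so you do not have to launch every experiment by hand.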

install

The code is still undergoing major changes and has many bugs.

Install from source

git clone https://github.com/lichunown/gpu-limit.git
cd gpu-limit
python setup.py install

Install with pip

pip3 install gpulimit

usage

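The program communicates over a Linux socket: the background process gpulimit_server schedules tasks dynamically, and the foreground client gpulimit sends commands and retrieves information.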
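The client/server split described above can be sketched as follows. This is only an illustration, not the project's actual protocol: it assumes a Unix domain socket with plain-text commands, and the names `serve_once`, `send_command`, and `demo` are hypothetical.

```python
import os
import socket
import tempfile
import threading


def serve_once(server_sock, queue):
    """Server side: accept one connection, handle one command, reply."""
    conn, _ = server_sock.accept()
    with conn:
        request = conn.recv(4096).decode("utf-8")
        cmd, _, arg = request.partition(" ")
        if cmd == "add":
            queue.append(arg)
            reply = f"added task {len(queue) - 1}"
        elif cmd == "ls":
            reply = "\n".join(f"{i}: {c}" for i, c in enumerate(queue)) or "(empty)"
        else:
            reply = f"unknown command: {cmd}"
        conn.sendall(reply.encode("utf-8"))


def send_command(path, text):
    """Client side: connect, send one command, return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall(text.encode("utf-8"))
        client.shutdown(socket.SHUT_WR)  # signal that the request is complete
        return client.recv(4096).decode("utf-8")


def demo():
    """Run two request/reply round trips over a temporary socket file."""
    queue = []  # hypothetical in-memory task queue held by the server
    replies = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(path)
            server.listen(1)
            for text in ("add python3 main.py --lambda=12 --alpha=1", "ls"):
                worker = threading.Thread(target=serve_once, args=(server, queue))
                worker.start()
                replies.append(send_command(path, text))
                worker.join()
    return replies


if __name__ == "__main__":
    for reply in demo():
        print(reply)
```

The server keeps the queue in memory and the client is a short-lived process, which is why a task survives after the command that submitted it has exited.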

Start the background server

gpulimit_server # start in the foreground
nohup gpulimit_server & # run in the background

Client commands

$ gpulimit help

GPU Task Manage:
    usage:

        client.py -h                  show help
        gpulimit add [cmds]           add task [cmds] to gpulimit queue.


    other commands:

        help [cmd]                    show help
        add [cmds]                    add task [cmds] to gpulimit queue.
        ls                            ls GPU task queue status
        show [id]                     show task [id] details.
        rm [id]                       remove task [id] from manage,
                                      if task is running, kill it.

        kill [id]                     kill task [id]
        move [id] [index(default=0)]  move [id] to [index]
        set [name] [value]            set some property.
        start [id default=None]       Force start task(s).
        log [id]                      show [id] output.
        status                        show System status.
        debug [id]                    if task [id] is `CMD_ERROR`,
                                      use this to show the error traceback.

Add a task

gpulimit add [cmds]
# for example
# gpulimit add python3 main.py --lambda=12 --alpha=1
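Since the goal is to run many experiments with different parameters, a small script can generate one `gpulimit add` command per combination. The sketch below reuses the `main.py`, `--lambda`, and `--alpha` names from the example above; the helpers `build_add_commands` and `submit` and the grid values are hypothetical. It only prints the commands unless `dry_run=False` is passed.

```python
import itertools
import shlex
import subprocess


def build_add_commands(script, grid):
    """Return one `gpulimit add ...` argv list per parameter combination."""
    names = list(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        flags = [f"--{name}={value}" for name, value in zip(names, values)]
        commands.append(["gpulimit", "add", "python3", script, *flags])
    return commands


def submit(commands, dry_run=True):
    """Print each command; actually run it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Hypothetical grid: 2 values of lambda x 2 values of alpha = 4 tasks.
grid = {"lambda": [1, 12], "alpha": [0.1, 1]}
commands = build_add_commands("main.py", grid)
submit(commands)
```

Running with `dry_run=False` requires gpulimit_server to be running already.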

List tasks

gpulimit ls

Show task details

gpulimit show [task id]

View a task's output log

gpulimit log [task id]

You can also view the background output of gpulimit_server itself:

gpulimit log main

scheduling

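Scheduling for the whole system is abstracted into the following four entry points:

  • timer_call: a timer that runs at a fixed interval
  • callback_process_end: callback invoked when a single task finishes
  • callback_add_process: callback invoked when the user adds a task
  • user_start_scheduling: invoked when the user force-starts a task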
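One way to picture these four entry points is as thin triggers around a single scheduling pass. The sketch below is not the project's implementation: the class `SchedulerSketch`, its attributes, and the fixed `max_running` slot count are all assumptions, with the slot count standing in for whatever GPU resource check the real scheduler performs.

```python
import threading


class SchedulerSketch:
    """Hypothetical scheduler: most triggers funnel into one schedule() pass."""

    def __init__(self, max_running=1):
        self.max_running = max_running  # assumed stand-in for a GPU capacity check
        self.waiting = []   # commands not yet started, in queue order
        self.running = []   # commands currently running
        self.lock = threading.Lock()

    def schedule(self):
        """Start waiting tasks while there is free capacity."""
        with self.lock:
            while self.waiting and len(self.running) < self.max_running:
                self.running.append(self.waiting.pop(0))

    def timer_call(self):
        """Periodic trigger: re-run the scheduling pass."""
        self.schedule()

    def callback_add_process(self, cmd):
        """The user added a task: enqueue it, then try to start work."""
        with self.lock:
            self.waiting.append(cmd)
        self.schedule()

    def callback_process_end(self, cmd):
        """A task finished: free its slot, then try to start the next one."""
        with self.lock:
            self.running.remove(cmd)
        self.schedule()

    def user_start_scheduling(self, cmd):
        """The user forced a start: bypass the capacity limit for this task."""
        with self.lock:
            self.waiting.remove(cmd)
            self.running.append(cmd)


sched = SchedulerSketch(max_running=1)
sched.callback_add_process("python3 a.py")
sched.callback_add_process("python3 b.py")
print(sched.running, sched.waiting)
sched.callback_process_end("python3 a.py")
print(sched.running, sched.waiting)
```

Keeping the decision logic in one method means a new trigger only has to update the queue state and call `schedule()`.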

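Task information:

  • priority: default=5; a smaller value means higher priority
  • status
    • 'CMD_ERROR': the command itself is invalid and Python raises an error (Windows only)
    • 'complete': the task has finished
    • 'waiting': waiting to be scheduled
    • 'running': currently running
    • 'runtime_error': the task failed while running, possibly because it ran out of GPU memory or because of a bug in the program
    • 'killed': a running process that was killed by the user; it can be restarted with the start command
    • 'paused': a paused process (it still occupies GPU memory while paused)
  • run_times: task errors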
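The priority rule (smaller value first) and the status values listed above can be modelled with a small record type and a heap. This is a sketch under assumptions, not the project's data model: the `Task` class, the `seq` tie-breaker that keeps submission order among equal priorities, and the `pop_next` helper are hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Status strings taken from the list above.
STATUSES = {
    "CMD_ERROR", "complete", "waiting", "running",
    "runtime_error", "killed", "paused",
}

_seq = itertools.count()  # assumed tie-breaker: earlier submissions win


@dataclass(order=True)
class Task:
    priority: int = 5
    seq: int = field(default_factory=lambda: next(_seq))
    cmd: str = field(default="", compare=False)
    status: str = field(default="waiting", compare=False)


def pop_next(heap):
    """Pop the waiting task with the smallest priority value, or None.

    Tasks that are no longer waiting are discarded from the heap.
    """
    while heap:
        task = heapq.heappop(heap)
        if task.status == "waiting":
            return task
    return None


heap = []
heapq.heappush(heap, Task(cmd="python3 a.py"))
heapq.heappush(heap, Task(cmd="python3 b.py", priority=1))
heapq.heappush(heap, Task(cmd="python3 c.py"))

order = []
while (task := pop_next(heap)) is not None:
    task.status = "running"
    order.append(task.cmd)
print(order)
```

Because only `priority` and `seq` take part in comparisons, two tasks with the same priority are started in the order they were added.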

TODO list

  • change raise type, and add try except for exception break.
  • __doc__
  • kill all, range
  • add commits
  • use priority queue as task_manage.queue
  • Improve scheduling algorithm
  • catch memory errors in cmds, when cmds is python ... and uses tf or pytorch.
