DL training on GPU management

GPU limit management

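Machine learning experiments often have many parameters, so the same experiment usually has to be run many times with different parameter settings.

This project maintains a queue of tasks that use the GPU and schedules them dynamically, which removes the tedium of launching each experiment by hand.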

install

The code is still undergoing major changes and still has many bugs.

Install from source

git clone https://github.com/lichunown/gpu-limit.git
cd gpu-limit
python setup.py install

Install with pip

pip3 install gpulimit

usage

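The program communicates over a Linux socket: gpulimit_server runs in the background and schedules tasks dynamically, while the gpulimit client in the foreground sends commands and retrieves information.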
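The client/server split can be illustrated with a minimal sketch. It assumes a Unix domain socket and a plain-text request/reply exchange; the socket path, the message format, and the function names (`serve_once`, `send_command`, `demo`) are all hypothetical and are not taken from gpulimit itself.

```python
import os
import socket
import tempfile
import threading


def recv_all(sock):
    """Read from the socket until the peer closes its sending side."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def serve_once(server_sock, handler):
    """Server side: accept one connection, read one command, reply."""
    conn, _ = server_sock.accept()
    with conn:
        command = recv_all(conn).decode("utf-8")
        conn.sendall(handler(command).encode("utf-8"))


def send_command(socket_path, command):
    """Client side: connect, send one command, return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(socket_path)
        client.sendall(command.encode("utf-8"))
        client.shutdown(socket.SHUT_WR)  # tell the server the request is complete
        return recv_all(client).decode("utf-8")


def demo():
    """Run a one-shot server in a thread and send it a single command."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Hypothetical socket path; the real server picks its own location.
        socket_path = os.path.join(tmp_dir, "demo.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(socket_path)
            server.listen(1)
            worker = threading.Thread(
                target=serve_once,
                args=(server, lambda cmd: "received: " + cmd),
            )
            worker.start()
            reply = send_command(socket_path, "ls")
            worker.join()
    return reply
```

Calling `demo()` starts the one-shot server, sends the text `ls`, and returns whatever the handler produced. A long-running server would loop over `accept()` instead of handling a single connection.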

Start the background server

gpulimit_server # start directly
nohup gpulimit_server & # run in the background

Client commands

$ gpulimit help

GPU Task Manage:
    usage:

        client.py -h                  show help
        gpulimit add [cmds]           add task [cmds] to gpulimit queue.


    other commands:

        help [cmd]                    show help
        add [cmds]                    add task [cmds] to gpulimit queue.
        ls                            ls GPU task queue status
        show [id]                     show task [id] details.
        rm [id]                       remove task [id] from manage,
                                      if task is running, kill it.

        kill [id]                     kill task [id]
        move [id] [index(default=0)]  move [id] to [index]
        set [name] [value]            set some property.
        start [id default=None]       Force start task(s).
        log [id]                      show [id] output.
        status                        show System status.
        debug [id]                    if task [id] is `CMD_ERROR`,
                                      use this to show the error traceback.

Add a task

gpulimit add [cmds]
# for example
# gpulimit add python3 main.py --lambda=12 --alpha=1
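Since the queue exists to run one experiment under many parameter settings, a small driver script can generate the `gpulimit add` calls. The sketch below only assumes the `gpulimit add [cmds]` form shown above; the helper names (`build_commands`, `submit`), the parameter grid, and the `main.py` script are illustrative. By default it only prints the commands and runs nothing.

```python
import itertools
import subprocess


def build_commands(script, grid):
    """Return one `gpulimit add ...` argv list per parameter combination."""
    names = sorted(grid)  # fixed flag order so the output is reproducible
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        flags = ["--{}={}".format(n, v) for n, v in zip(names, values)]
        commands.append(["gpulimit", "add", "python3", script] + flags)
    return commands


def submit(commands, dry_run=True):
    """Print each command; actually run it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # Requires gpulimit to be installed and gpulimit_server running.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical grid: 2 x 2 = 4 queued experiments.
    grid = {"lambda": [1, 12], "alpha": [0.1, 1]}
    submit(build_commands("main.py", grid))
```

Passing `dry_run=False` hands each command to the queue. Keeping the dry run as the default makes it easy to check the generated commands first.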

List tasks

gpulimit ls

Show task details

gpulimit show [task id]

View a task's output log

gpulimit log [task id]

The background output of gpulimit_server can be viewed the same way:

gpulimit log main

scheduling

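System scheduling is abstracted into the following 4 entry points:

  • timer_call: a timer that runs at a fixed interval
  • callback_process_end: callback invoked when a single task ends
  • callback_add_process: callback invoked when the user adds a task
  • user_start_scheduling: invoked when the user forces a task to run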
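One way to picture the four entry points is as thin wrappers around a shared scheduling step. The sketch below reuses the four names from the list. Everything else is an assumption made for illustration and is not gpulimit's actual algorithm: the class name, the `max_running` limit standing in for a real GPU availability check, and the rule that a forced start bypasses that limit.

```python
class SchedulerSketch:
    """Routes the four trigger types to one shared scheduling step."""

    def __init__(self, max_running=1):
        self.max_running = max_running  # hypothetical stand-in for a GPU check
        self.waiting = []               # task ids in queue order
        self.running = set()            # task ids currently running
        self.events = []                # which trigger caused each step

    def _schedule(self, trigger):
        """Start waiting tasks in queue order while capacity remains."""
        self.events.append(trigger)
        while self.waiting and len(self.running) < self.max_running:
            self.running.add(self.waiting.pop(0))

    def timer_call(self):
        """Meant to be invoked periodically, e.g. from a timer thread."""
        self._schedule("timer_call")

    def callback_process_end(self, task_id):
        """A task finished: free its slot, then try to start the next one."""
        self.running.discard(task_id)
        self._schedule("callback_process_end")

    def callback_add_process(self, task_id):
        """The user added a task: enqueue it, then try to start it."""
        self.waiting.append(task_id)
        self._schedule("callback_add_process")

    def user_start_scheduling(self, task_id):
        """Forced start: assumed here to bypass the capacity limit."""
        self.events.append("user_start_scheduling")
        if task_id in self.waiting:
            self.waiting.remove(task_id)
            self.running.add(task_id)
```

With `max_running=1`, adding two tasks starts only the first. The second starts either when the first ends or when it is forced with `user_start_scheduling`.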

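Task information:

  • priority: default=5; a smaller value means higher priority
  • status
    • 'CMD_ERROR': the command itself is invalid and Python raises an error (Windows only)
    • 'complete': the task finished
    • 'waiting': waiting to be scheduled
    • 'running': currently running
    • 'runtime_error': the task failed while running, possibly because GPU memory ran out or because of a bug in the program
    • 'killed': a running process that was killed by the user; it can be restarted with the start command
    • 'paused': a paused process (a paused process still occupies GPU memory)
  • run_times: task errors (presumably the number of failed runs)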
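The task fields above map naturally onto a small record type. The following is a sketch only: the status strings and the default priority of 5 come from the list, while the class names, the `task_id` and `cmds` fields, the interpretation of `run_times` as a counter, and the `next_task` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    CMD_ERROR = "CMD_ERROR"
    COMPLETE = "complete"
    WAITING = "waiting"
    RUNNING = "running"
    RUNTIME_ERROR = "runtime_error"
    KILLED = "killed"
    PAUSED = "paused"


@dataclass
class Task:
    task_id: int                     # hypothetical field name
    cmds: str                        # hypothetical field name
    priority: int = 5                # smaller value means higher priority
    status: Status = Status.WAITING
    run_times: int = 0               # assumed here to be a counter


def next_task(tasks):
    """Pick the waiting task with the smallest priority value, or None."""
    waiting = [t for t in tasks if t.status is Status.WAITING]
    # min() returns the first minimal element, so ties keep queue order.
    return min(waiting, key=lambda t: t.priority, default=None)
```

For example, given a default-priority task and a waiting task with `priority=1`, `next_task` returns the latter. Tasks in any status other than 'waiting' are ignored.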

V0.2.0

  • Rewrote the status states
  • Adjusted task scheduling
  • Simplified the scheduling algorithm

TODO list

  • change raise type, and add try except for exception break.
  • __doc__
  • kill all, range
  • add commits
  • use priority queue as task_manage.queue
  • Improve scheduling algorithm
  • catch memory error in cmds, when cmds is python ... and use tf or pytorch.
