CUDA Management - multi-process, scheduled jobs, distributed processing

Project description

cudam

CUDA Management - multi-process, scheduled jobs, distributed processing

command to check the status of all CUDA servers

date >> cuda_status.txt
for s in cuda1 cuda2 cuda3 cuda4 cuda5 cuda6 cuda11; do
    echo "$s" >> cuda_status.txt
    ssh "$s" 'nvidia-smi' >> cuda_status.txt
done

server-client mode to utilize multiple GPUs across multiple machines

server side - develop the code that runs on a single GPU

# a dummy function to evaluate DenseNet;
# replace it with the actual evaluation code
def evaluate_densenet(model):
    acc = 0.99
    return acc
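
In practice, evaluate_densenet would run real inference on the GPU. A minimal sketch of what that could look like, assuming PyTorch and a test_loader defined elsewhere in the server-side module (both are assumptions for illustration, not part of cudam):

import torch

# test_loader is assumed to be a DataLoader defined elsewhere in this
# module; it is an assumption for illustration, not provided by cudam
def evaluate_densenet(model):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device).eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total  # accuracy in [0, 1]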

client side - develop the code that sends the models to the server for evaluation

  • Add the available GPU servers to the server list configuration file; each line is a host,port pair (see the parsing sketch after the client code below)
# server list configuration: one host,port pair per line
cuda4,8000
cuda4,8001
cuda5,8000
cuda5,8001
cuda5,8002
  • The client code that concurrently evaluates models
from cudam.cudam_socket.client import GPUClientPool
DEFAULT_RUN_CODE_WORK_DIRECTORY = "/home/www/server" # the folder where the server side code resides 
DEFAULT_RUN_CODE_PATH = "server_file" # the file name of the server side code
SERVER_LIST_CONFIG = 'config/server_list.txt' # the configuration file of the server list
def pool_evaluate_densenet(model_list):
    # generate the arguments that will be passed to the client pool
    arr_args = []
    for m in model_list:
        single_args = {'model': m}
        arr_args.append({
            'path': DEFAULT_RUN_CODE_PATH,
            'entry': "evaluate_densenet",
            'work_directory': DEFAULT_RUN_CODE_WORK_DIRECTORY,
            'args': single_args,
            'use_cuda': True  # whether to use the GPU or not
        })
    # init client pool
    server_list = GPUClientPool.load_server_list_from_file(SERVER_LIST_CONFIG)
    pool = GPUClientPool(server_list)
    # perform evaluation
    eval_result = pool.run_code_batch(arr_args)
    return eval_result
# main entrance
if __name__ == '__main__':
    model_list = []  # dummy model list; replace it with real models
    pool_evaluate_densenet(model_list)
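
For reference, each line of the server list pairs a host with a port. cudam ships its own loader (GPUClientPool.load_server_list_from_file); the following is a rough, purely illustrative sketch of parsing that format, not the library's actual implementation:

# illustrative parser for the 'host,port' format shown above;
# not the library's actual implementation
def load_server_list(path):
    servers = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue  # skip blank lines and comments
            host, port = line.split(',')
            servers.append((host.strip(), int(port)))
    return servers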

start the server

  • After installing this package, cudam_server.py should be copied to the bin path automatically; if not, copy the file manually to the root folder of the project. The server can then be started with the following command, where the -i, -p and -g flags appear to give the host, port and GPU index, matching the host,port pairs in the server list:
nohup python cudam_server.py -s 1 -i cuda1 -p 8000 -g 0 >& log/nohup_cuda_1_8000_0.log &

run the client-side Python code to evaluate a batch of models

task manager

task template

#!/usr/bin/env bash

while getopts g: option; do
    case "${option}" in
    g) GPU_ID=${OPTARG};;
    esac
done

print_help(){
    printf "Parameter g (GPU ID) is mandatory\n"
    printf "g value - GPU ID\n"
    exit 1
}

if [ -z "${GPU_ID}" ]; then
    print_help
fi

echo "start task on GPU: $GPU_ID"

# the root directory of your python script
cd ~/code/psocnn/
# the main python script accepting the gpu ID in -g argument
python3 main.py -g ${GPU_ID}
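
The template hands the GPU ID to main.py through -g; how main.py consumes it depends on your project. A minimal sketch of that argument handling, assuming argparse and PyTorch (both assumptions, not part of cudam):

import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument('-g', '--gpu', type=int, required=True,
                    help='GPU ID assigned by the task template')
args = parser.parse_args()

# pin this process to the GPU handed over by the task template
torch.cuda.set_device(args.gpu)
device = torch.device(f'cuda:{args.gpu}')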

task folder structure

task manager

# start task manager
nohup cudam_task_manager.py -n 2 -s 2 -i 60 -f 300 &
# snap gpu
cudam_snap_gpu.py -s 2 -l 60 -g 1

install cudam for a specific user when the local path cannot be added to the executable PATH

  • Switch to the root folder of your project

  • Install cudam package

pip install --user cudam
  • Create soft links to the executable files
ln -s /home/{YOURUSER}/.local/bin/cudam_task_manager.py cudam_task_manager.py
ln -s /home/{YOURUSER}/.local/bin/cudam_snap_gpu.py cudam_snap_gpu.py
  • Run the task manager
# run interactively
python cudam_task_manager.py -n 2 -s 2 -i 60 -f 300
# run in background
nohup python cudam_task_manager.py -n 2 -s 2 -i 60 -f 300 &

Download files

Download the file for your platform. If you're not sure which to choose, see the PyPI guide on installing packages.

Source Distribution

cudam-0.0.6.tar.gz (27.2 kB)

Uploaded Source

Built Distribution

cudam-0.0.6-py3-none-any.whl (53.8 kB)

Uploaded Python 3

File details

Details for the file cudam-0.0.6.tar.gz.

File metadata

  • Download URL: cudam-0.0.6.tar.gz
  • Upload date:
  • Size: 27.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.8.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.6.7

File hashes

Hashes for cudam-0.0.6.tar.gz
  • SHA256: e87b137d1cf70e817a9781f1ca194865bfe6e37bee4323e6d93f883d7bbfda07
  • MD5: 2a7ee4388b7d9b4734bf54db9a1ceaf7
  • BLAKE2b-256: cecbe3c028d59eeab9853ba73fb12c35b10e114917b5bdb57b1c6209546b39d2

See the PyPI documentation for more details on using hashes.
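
To check a downloaded file against the digests above, a minimal sketch using Python's standard hashlib module (the file name is taken from the listing above):

import hashlib

# compare the printed value against the SHA256 digest listed above
with open('cudam-0.0.6.tar.gz', 'rb') as f:
    print(hashlib.sha256(f.read()).hexdigest())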

File details

Details for the file cudam-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: cudam-0.0.6-py3-none-any.whl
  • Upload date:
  • Size: 53.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.8.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.6.7

File hashes

Hashes for cudam-0.0.6-py3-none-any.whl
  • SHA256: 1358914da93359e1c01727855d18ad68e438713a8f7ffd40ef64d16149949961
  • MD5: cfa6915ec31e997f874b4e7e6e5ba1dd
  • BLAKE2b-256: 7834e13d6e69ef1a1fab8f5c4344e01cde055c09e5ee8af511fbbc240d8d8711

See the PyPI documentation for more details on using hashes.
