CUDA Management - multi-process, scheduled jobs, distributed processing


command to check the status of all CUDA servers

date >> cuda_status.txt &&
for host in cuda1 cuda2 cuda3 cuda4 cuda5 cuda6 cuda11; do
    echo "$host" >> cuda_status.txt && ssh "$host" 'nvidia-smi' >> cuda_status.txt || break
done
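The same status check can be scripted in Python when the host list changes often. The following is a minimal sketch and not part of cudam: the names `ssh_nvidia_smi` and `collect_gpu_status` and the injectable `run` parameter are illustrative assumptions, and it presumes passwordless SSH to each host.

```python
import subprocess
from datetime import datetime

# Host names taken from the shell command above.
HOSTS = ["cuda1", "cuda2", "cuda3", "cuda4", "cuda5", "cuda6", "cuda11"]


def ssh_nvidia_smi(host):
    """Run nvidia-smi on a remote host over SSH and return its text output."""
    result = subprocess.run(
        ["ssh", host, "nvidia-smi"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def collect_gpu_status(hosts, run=ssh_nvidia_smi):
    """Build one report: a timestamp, then each host name followed by its output.

    `run` is injectable (an assumption of this sketch) so the function can be
    exercised without real SSH access.
    """
    lines = [datetime.now().isoformat(timespec="seconds")]
    for host in hosts:
        lines.append(host)
        lines.append(run(host).rstrip())
    return "\n".join(lines) + "\n"
```

Appending the return value of `collect_gpu_status(HOSTS)` to `cuda_status.txt` would give roughly the same report as the shell command. Because `check=True` is set, an unreachable host raises an exception and stops the run.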

server-client mode to utilize multiple GPUs across multiple machines

server side - develop the code that runs on a single GPU

# here is a dummy function to evaluate densenet
# it should be replaced by the actual evaluation code
def evaluate_densenet(model):
    acc = 0.99
    return acc

client side - develop the code to send the models to the server for evaluation

  • Add available GPU servers in the server list configuration file
# configuration of server list
  • The client code that concurrently evaluates models
from cudam.socket.client import GPUClientPool

DEFAULT_RUN_CODE_WORK_DIRECTORY = "/home/www/server"  # the folder where the server side code resides
DEFAULT_RUN_CODE_PATH = "server_file"  # the file name of the server side code
SERVER_LIST_CONFIG = 'config/server_list.txt'  # the configuration file of the server list


def pool_evaluate_densenet(model_list):
    # generate the arguments which will be passed to the client pool
    arr_args = []
    for m in model_list:
        single_args = {'model': m}
        args = {
            'path': DEFAULT_RUN_CODE_PATH,
            'entry': "evaluate_densenet",
            'work_directory': DEFAULT_RUN_CODE_WORK_DIRECTORY,
            'args': single_args,
            'use_cuda': True  # whether to use GPU or not
        }
        arr_args.append(args)
    # init client pool
    server_list = GPUClientPool.load_server_list_from_file(SERVER_LIST_CONFIG)
    pool = GPUClientPool(server_list)
    # perform evaluation
    eval_result = pool.run_code_batch(arr_args)
    return eval_result


# main entrance
if __name__ == '__main__':
    model_list = []  # dummy model list which needs to be replaced by real models
    print(pool_evaluate_densenet(model_list))
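The client hands `arr_args` to `pool.run_code_batch`, which spreads the work over the configured servers. To illustrate the general idea of that kind of dispatch, here is a minimal sketch using a thread pool in place of `GPUClientPool`. It is not cudam's implementation: `run_batch`, the `evaluate` callback and the server identifiers are hypothetical names introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue


def run_batch(servers, arr_args, evaluate):
    """Evaluate every item of arr_args, using each server for one task at a time.

    servers  -- list of server identifiers (hypothetical)
    arr_args -- list of argument dicts, one per model
    evaluate -- callable(server, args) that performs one evaluation
    Results are returned in the same order as arr_args.
    """
    free = Queue()
    for s in servers:
        free.put(s)

    def worker(args):
        server = free.get()       # block until a server is idle
        try:
            return evaluate(server, args)
        finally:
            free.put(server)      # hand the server back to the pool

    # One worker thread per server, so every server can be busy at once.
    with ThreadPoolExecutor(max_workers=len(servers)) as executor:
        return list(executor.map(worker, arr_args))
```

In this sketch a model waits until some server is free, so a batch larger than the server list is still processed, a few models at a time.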

start the server

  • After this package is installed, the server script should be copied to the bin path automatically; if it is not, copy the file manually to the root folder of the project. The server can be started by running the following command:

nohup python -s 1 -i cuda1 -p 8000 -g 0 >& log/nohup_cuda_1_8000_0.log &

run the client side python code to evaluate a batch of models

task manager

task template

#!/usr/bin/env bash

# read the GPU ID from the -g option
while getopts g: option; do
    case "${option}" in
        g) GPU_ID=${OPTARG};;
    esac
done

# abort with a usage hint when -g is missing
if [ -z "${GPU_ID}" ]; then
    printf "Parameter g (GPU ID) is mandatory\n"
    printf "g values - GPU ID\n"
    exit 1
fi

echo "start task on GPU: $GPU_ID"

# the root directory of your python script
cd ~/code/psocnn/
# the main python script accepting the gpu ID in -g argument
python3 -g ${GPU_ID}
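The template passes the GPU ID to the main Python script through `-g`. How that script consumes the value is not shown here; a minimal sketch, assuming the script pins itself to one device by setting the standard `CUDA_VISIBLE_DEVICES` environment variable, could look as follows. The function names `parse_gpu_id` and `select_gpu` are hypothetical.

```python
import argparse
import os


def parse_gpu_id(argv=None):
    """Parse the mandatory -g option and return the GPU ID as a string."""
    parser = argparse.ArgumentParser(description="Run a task on one GPU")
    parser.add_argument("-g", dest="gpu_id", required=True, help="GPU ID")
    return parser.parse_args(argv).gpu_id


def select_gpu(gpu_id):
    """Restrict CUDA to a single device.

    Assumption of this sketch: this runs before any CUDA library is
    initialised, since the variable is only read at initialisation.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```

A script would typically call `select_gpu(parse_gpu_id())` at the very top, before importing the deep learning framework. Because `-g` is declared as required, argparse exits with a usage message when it is missing, mirroring the check in the shell template.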

task folder structure


task manager

# start task manager
nohup -n 2 -s 2 -i 60 -f 300 &
# snap gpu -s 2 -l 60 -g 1

install cudam for a specific user when the local path cannot be added to the executable PATH

  • Switch to the root folder of your project

  • Install cudam package

pip install --user cudam
  • Create a soft link of the executable file
ln -s /home/{YOURUSER}/.local/bin/
ln -s /home/{YOURUSER}/.local/bin/
  • Run the task manager
# run interactively
python -n 2 -s 2 -i 60 -f 300
# run in background
nohup python -n 2 -s 2 -i 60 -f 300 &
