CUDA Management - multi-process, scheduled jobs, distributed processing
cudam
command to check the status of all CUDA servers
date >> cuda_status.txt && echo 'cuda1' >> cuda_status.txt && ssh cuda1 'nvidia-smi' >> cuda_status.txt && echo 'cuda2' >> cuda_status.txt && ssh cuda2 'nvidia-smi' >> cuda_status.txt && echo 'cuda3' >> cuda_status.txt && ssh cuda3 'nvidia-smi' >> cuda_status.txt && echo 'cuda4' >> cuda_status.txt && ssh cuda4 'nvidia-smi' >> cuda_status.txt && echo 'cuda5' >> cuda_status.txt && ssh cuda5 'nvidia-smi' >> cuda_status.txt && echo 'cuda6' >> cuda_status.txt && ssh cuda6 'nvidia-smi' >> cuda_status.txt && echo 'cuda11' >> cuda_status.txt && ssh cuda11 'nvidia-smi' >> cuda_status.txt
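The one-liner above repeats the same steps for every host: write the host name, then append the output of nvidia-smi run over ssh. A minimal Python sketch of the same idea follows, assuming passwordless ssh to each host. The names build_status_command and collect_status are illustrative and are not part of cudam. Unlike the && chain, this version keeps going when one host is unreachable and records the error text instead.

```python
import subprocess
from datetime import datetime

# Host names taken from the one-liner above.
HOSTS = ["cuda1", "cuda2", "cuda3", "cuda4", "cuda5", "cuda6", "cuda11"]


def build_status_command(host):
    """Return the argv list that runs nvidia-smi on a remote host via ssh."""
    return ["ssh", host, "nvidia-smi"]


def collect_status(hosts, out_path="cuda_status.txt", runner=subprocess.run):
    """Append a timestamp, then each host name followed by its nvidia-smi output."""
    with open(out_path, "a") as out:
        out.write(datetime.now().isoformat() + "\n")
        for host in hosts:
            out.write(host + "\n")
            result = runner(build_status_command(host), capture_output=True, text=True)
            # On failure, record stderr so the unreachable host is visible in the log.
            out.write(result.stdout if result.returncode == 0 else result.stderr)
```

Calling collect_status(HOSTS) appends one snapshot to cuda_status.txt in the current directory.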
server-client mode to use multiple GPUs across multiple machines
server side - develop the code that runs on a single GPU
# a dummy function standing in for DenseNet evaluation
# replace the body with the actual evaluation code
def evaluate_densenet(model):
    acc = 0.99
    return acc
client side - develop the code to send the models to the server for evaluation
- Add available GPU servers in the server list configuration file
# configuration of server list
cuda4,8000
cuda4,8001
cuda5,8000
cuda5,8001
cuda5,8002
- The client code that concurrently evaluates models
from cudam.cudam_socket.client import GPUClientPool

DEFAULT_RUN_CODE_WORK_DIRECTORY = "/home/www/server"  # the folder where the server side code resides
DEFAULT_RUN_CODE_PATH = "server_file"  # the file name of the server side code
SERVER_LIST_CONFIG = 'config/server_list.txt'  # the configuration file of the server list


def pool_evaluate_densenet(model_list):
    # generate the arguments that will be passed to the client pool
    arr_args = []
    for m in model_list:
        single_args = {'model': m}
        arr_args.append({
            'path': DEFAULT_RUN_CODE_PATH,
            'entry': "evaluate_densenet",
            'work_directory': DEFAULT_RUN_CODE_WORK_DIRECTORY,
            'args': single_args,
            'use_cuda': True  # whether to use GPU or not
        })
    # init client pool
    server_list = GPUClientPool.load_server_list_from_file(SERVER_LIST_CONFIG)
    pool = GPUClientPool(server_list)
    # perform evaluation
    eval_result = pool.run_code_batch(arr_args)
    return eval_result


# main entrance
if __name__ == '__main__':
    model_list = []  # dummy model list which needs to be replaced by real models
    pool_evaluate_densenet(model_list)
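The server list file read by the client holds one host,port pair per line, with the same host repeated once per port. The sketch below shows one way such a file could be parsed, to make the format concrete. It is not cudam's loader: parse_server_list is a hypothetical helper, and treating lines that start with # as comments is an assumption.

```python
def parse_server_list(lines):
    """Parse 'host,port' lines into (host, port) tuples.

    Blank lines and lines starting with '#' are skipped (an assumption
    about the file format, not confirmed by cudam).
    """
    servers = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        host, port = line.split(",")
        servers.append((host.strip(), int(port)))
    return servers


example = """# configuration of server list
cuda4,8000
cuda4,8001
cuda5,8000
"""
servers = parse_server_list(example.splitlines())
print(servers)  # [('cuda4', 8000), ('cuda4', 8001), ('cuda5', 8000)]
```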
start the server
- After installation of this package, cudam_server.py should be automatically copied to the bin path; if not, manually copy this file to the root folder of the project. The server can be started by running the following command:
nohup python cudam_server.py -s 1 -i cuda1 -p 8000 -g 0 >& log/nohup_cuda_1_8000_0.log &
run the client side python code to evaluate a batch of models
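To illustrate what evaluating a batch across several server endpoints can look like, here is a minimal sketch in which each endpoint handles one task at a time and pulls the next task from a shared queue. This is an assumption about the general technique only; the source does not describe how GPUClientPool works internally. run_batch, fake_send and the task dictionaries are hypothetical, and fake_send stands in for a real network call.

```python
import queue
import threading


def run_batch(servers, tasks, send):
    """Run tasks across servers, one task at a time per server.

    send(server, task) performs the remote call and returns its result.
    The returned list keeps the same order as tasks.
    """
    pending = queue.Queue()
    for index, task in enumerate(tasks):
        pending.put((index, task))
    results = [None] * len(tasks)

    def worker(server):
        # Each worker owns one server endpoint and drains the shared queue.
        while True:
            try:
                index, task = pending.get_nowait()
            except queue.Empty:
                return
            results[index] = send(server, task)

    threads = [threading.Thread(target=worker, args=(s,)) for s in servers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_send(server, task):
    """Stand-in for a real remote evaluation; returns a fixed accuracy."""
    return {"model": task["args"]["model"], "acc": 0.99}


servers = [("cuda4", 8000), ("cuda5", 8000)]
tasks = [{"entry": "evaluate_densenet", "args": {"model": m}} for m in ["a", "b", "c"]]
print(run_batch(servers, tasks, fake_send))
```

Pulling from a shared queue, rather than splitting the task list up front, lets a faster GPU take on more models when evaluation times differ.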
task manager
task template
#!/usr/bin/env bash

while getopts g: option; do
    case "${option}" in
        g) GPU_ID=${OPTARG};;
    esac
done

print_help() {
    printf "Parameter g (GPU ID) is mandatory\n"
    printf "g values - GPU ID\n"
    exit 1
}

if [ -z "${GPU_ID}" ]; then
    print_help
fi

echo "start task on GPU: $GPU_ID"

# the root directory of your python script
cd ~/code/psocnn/
# the main python script accepting the gpu ID in -g argument
python3 main.py -g "${GPU_ID}"
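The template passes the GPU ID to main.py through -g. A minimal sketch of how a main.py might accept that argument is shown below. It assumes the script pins itself to one device by setting the CUDA_VISIBLE_DEVICES environment variable before any CUDA library is initialized. The long option --gpu and the function names are illustrative; the template itself only requires -g.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the -g argument supplied by the task template."""
    parser = argparse.ArgumentParser(description="Run a task on one GPU")
    parser.add_argument("-g", "--gpu", type=int, required=True, help="GPU ID")
    return parser.parse_args(argv)


def select_gpu(gpu_id):
    """Restrict CUDA-aware libraries in this process to a single device."""
    # This must happen before the CUDA runtime is initialized,
    # i.e. before the framework makes its first GPU call.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)


def main(argv=None):
    args = parse_args(argv)
    select_gpu(args.gpu)
    print(f"running on GPU {args.gpu}")
    # ... the actual training or evaluation code goes here ...
```

In a real script, main() would be called under the usual if __name__ == "__main__": guard so that python3 main.py -g 1 runs it.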
task folder structure
task manager
# start task manager
nohup cudam_task_manager.py -n 2 -s 2 -i 60 -f 300 &
# snap gpu
cudam_snap_gpu.py -s 2 -l 60 -g 1
install cudam for a specific user when the local bin path cannot be added to the executable PATH
- Switch to the root folder of your project
- Install cudam package
pip install --user cudam
- Create a soft link of the executable file
ln -s /home/{YOURUSER}/.local/bin/cudam_task_manager.py cudam_task_manager.py
ln -s /home/{YOURUSER}/.local/bin/cudam_snap_gpu.py cudam_snap_gpu.py
- Run the task manager
# run interactively
python cudam_task_manager.py -n 2 -s 2 -i 60 -f 300
# run in background
nohup python cudam_task_manager.py -n 2 -s 2 -i 60 -f 300 &
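The soft-link step above can also be scripted. The sketch below assumes a POSIX system, where pip install --user places scripts in the bin folder under the user base directory reported by site.getuserbase() (typically ~/.local). link_scripts is a hypothetical helper and is not shipped with cudam.

```python
import os
import site
from pathlib import Path

# Script names taken from the ln -s commands above.
SCRIPTS = ["cudam_task_manager.py", "cudam_snap_gpu.py"]


def link_scripts(project_root, bin_dir=None, scripts=SCRIPTS):
    """Symlink installed cudam scripts into project_root; return the links created."""
    bin_dir = Path(bin_dir) if bin_dir else Path(site.getuserbase()) / "bin"
    project_root = Path(project_root)
    created = []
    for name in scripts:
        target = bin_dir / name
        link = project_root / name
        # Skip scripts that are not installed, and never overwrite an existing path
        # (lexists also catches dangling symlinks).
        if target.exists() and not os.path.lexists(link):
            link.symlink_to(target)
            created.append(link)
    return created
```

Running link_scripts(".") from the project root would mirror the two ln -s commands.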