Monitor system load of the server running the nvidia/cuda docker containers.
Project description
aidockermon
Monitor system load of the server running the nvidia/cuda docker containers.
Feature
- sysinfo: system static info
- sysload: system cpu/memory load
- gpu: nvidia gpu load
- disk: disk load
- containers: containers' load that based on the nvidia/cuda image
Prerequisite
Python >= 3
Installation
pip install aidockermon
Or use setuptools
python setup.py install
Usage
$ aidockermon -h
usage: aidockermon [-h] [-v] {query,create-esindex,delete-esindex} ...
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
command:
{query,create-esindex,delete-esindex}
query Query system info, log them via syslog protocol
create-esindex Create elasticsearch index
delete-esindex Delete elasticsearch index
$ aidockermon query -h
usage: aidockermon query [-h] [-l] [-r REPEAT] [-f FILTERS [FILTERS ...]] type
positional arguments:
type info type: sysinfo, sysload, gpu, disk, containers
optional arguments:
-h, --help show this help message and exit
-l, --stdout Print pretty json to console instead of send a log
-r REPEAT, --repeat REPEAT
n/i repeat n times every i seconds
-f FILTERS [FILTERS ...], --filters FILTERS [FILTERS ...]
Filter the disk paths for disk type; filter the
container names for containers type
For example:
Show sysinfo
$ aidockermon query -l sysinfo { "gpu": { "gpu_num": 2, "driver_version": "410.104", "cuda_version": "10.0" }, "mem_tot": 67405533184, "kernel": "4.4.0-142-generic", "cpu_num": 12, "docker": { "version": "18.09.3" }, "system": "Linux" }
Show sys load
$ aidockermon query -l sysload { "mem_free": 11866185728, "mem_used": 8023793664, "cpu_perc": 57.1, "mem_perc": 12.8, "mem_avail": 58803163136, "mem_tot": 67405533184 }
Show gpu load
$ aidockermon query -l gpu { "mem_tot": 11177, "gpu_temperature": 76.0, "mem_free": 1047, "mem_used": 10130, "gpu_perc": 98.0, "gpu_id": 0, "mem_perc": 46.0 } { "mem_tot": 11178, "gpu_temperature": 66.0, "mem_free": 3737, "mem_used": 7441, "gpu_perc": 95.0, "gpu_id": 1, "mem_perc": 44.0 }
Show disk usage
$ aidockermon query disk -l -f / { "path": "/", "device": "/dev/nvme0n1p3", "total": 250702176256, "used": 21078355968, "free": 216865271808, "percent": 8.9 } $ aidockermon query disk -l -f / /disk { "path": "/", "device": "/dev/nvme0n1p3", "total": 250702176256, "used": 21078355968, "free": 216865271808, "percent": 8.9 } { "path": "/disk", "device": "/dev/sda1", "total": 1968874311680, "used": 1551374692352, "free": 317462949888, "percent": 83.0 }
Show containers' load
Note that the app_name
would be read from environment variable APP_NAME
, which
is a short description for this training program.
$ aidockermon query containers -l -f DianAI { "proc_name": "python3 test_run.py", "app_name": "测试程序", "pid": 13540, "container": "DianAI", "started_time": 1554698236, "running_time": 9343, "mem_used": 9757 } { "proc_name": "python train.py", "app_name": "", "pid": 15721, "container": "DianAI", "started_time": 1554698236, "running_time": 19343, "mem_used": 1497 } { "mem_limit": 67481047040, "net_output": 47863240948, "block_read": 1327175626752, "net_input": 18802869033, "mem_perc": 14.637655604461704, "block_write": 132278439936, "name": "DianAI", "cpu_perc": 0.0, "mem_used": 9877643264 }
Config
logging
debug: false log: version: 1 # This is the default level, which could be ignored. # CRITICAL = 50 # FATAL = CRITICAL # ERROR = 40 # WARNING = 30 # WARN = WARNING # INFO = 20 # DEBUG = 10 # NOTSET = 0 #level: 20 disable_existing_loggers: false formatters: simple: format: '%(levelname)s %(message)s' monitor: format: '%(message)s' filters: require_debug_true: (): 'aidockermon.handlers.RequireDebugTrue' handlers: console: level: DEBUG class: logging.StreamHandler formatter: simple filters: [require_debug_true] monitor: level: INFO class: rfc5424logging.handler.Rfc5424SysLogHandler address: [127.0.0.1, 1514] enterprise_id: 1 loggers: runtime: handlers: [console] level: DEBUG propagate: false monitor: handlers: [monitor, console] level: INFO propagate: false
This is the default config, which should be located at /etc/aidockermon/config.yml
.
You can modify the address
value to specify the logging target.
address: [127.0.0.1, 1514]
: UDP to 127.0.0.1:1514address: /var/log/aidockermon
: unix domain datagram socket
If you add an socktype
argument, you can specify whether to use UDP or TCP as transport protocol.
socktype: 1
: TCPsocktype: 2
: UDP
Enable TLS/SSL:
tls_enable: true tls_verify: true tls_ca_bundle: /path/to/ca-bundle.pem
Set debug
as true
, you can see message output in the console.
Cronjob
sudo cp etc/cron.d/aidockermon /etc/cron.d sudo systemctl restart cron
syslog-ng
Using syslog-ng to collect logs and send them to elasticsearch for future use such as visualization with kibana.
cp etc/syslog-ng/syslog-ng.conf /etc/syslog-ng/ sudo systemctl restart syslog-ng
Sample config:
@version: 3.20 destination d_elastic { elasticsearch2( index("syslog-ng") type("${.SDATA.meta.type}") flush-limit("0") cluster("es-syslog-ng") cluster-url("http://localhost:9200") client-mode("http") client-lib-dir(/usr/share/elasticsearch/lib) template("${MESSAGE}\n") ); }; source s_python { #unix-dgram("/var/log/aidockermon"); syslog(ip(127.0.0.1) port(1514) transport("udp") flags(no-parse)); }; log { source (s_python); parser { syslog-parser(flags(syslog-protocol)); }; destination (d_elastic); };
Modify it to specify the elasticsearch server and the log source's port and protocol.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for aidockermon-0.10.3-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | d04ef92a1b808d2aebb812b30a091647b21c51aa18004a580fe4326c21b28850 |
|
MD5 | 2c38e5b5979cfbdc48f08aff20550587 |
|
BLAKE2-256 | a2414559f69c81d5de7ff4d689a73d16271be7c8098eb0c872d2d12521a767ad |