A utility to monitor job resources in an HPC environment, especially OAR
Colmet - Collecting metrics about jobs running in a distributed environment
Introduction:
Colmet is a monitoring tool that collects metrics about jobs running in a distributed environment, especially on clusters and grids. It currently provides several backends:
- Input backends:
- taskstats: fetch task metrics from the linux kernel
- rapl: real-time power consumption metrics from Intel processors
- perfhw: perf_event counters
- jobproc: get job information from /proc
- ipmipower: get power metrics from IPMI
- temperature: get temperatures from /sys/class/thermal
- infiniband: get infiniband/omnipath network metrics
- lustre: get lustre FS stats
- Output backends:
- elasticsearch: store the metrics in Elasticsearch indices
- hdf5: store the metrics in HDF5 files on the filesystem
- stdout: display the metrics on the terminal
It uses ZeroMQ to transport the metrics across the network.
It is currently bound to the OAR RJMS.
A sample Grafana dashboard is provided for the Elasticsearch backend.
Installation:
Requirements
- a Linux kernel that supports:
  - Taskstats
  - intel_rapl (for the RAPL backend)
  - perf_event (for the perfhw backend)
  - ipmi_devintf (for the ipmipower backend)
- Python 2.7 or newer, with:
  - python-zmq 2.2.0 or newer
  - python-tables 3.3.0 or newer
  - python-pyinotify 0.9.3-2 or newer
  - python-requests
- for the Elasticsearch output backend (recommended for sites with more than 50 nodes):
  - an Elasticsearch server
  - a Grafana server (for visualization)
- for the RAPL input backend:
  - libpowercap, powercap-utils (https://github.com/powercap/powercap)
- for the infiniband backend: the perfquery command line tool
- for the ipmipower backend: the ipmi-oem command line tool (freeipmi) or another configurable command (a quick availability check is sketched below)
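Before enabling the optional backends, a quick way to check that a node provides what they need is to look for the corresponding kernel modules and command line tools (a rough sanity check; module names can differ between kernel versions and distributions):
# kernel modules used by the rapl and ipmipower backends
lsmod | grep -e intel_rapl -e ipmi_devintf
# command line tools used by the infiniband and ipmipower backends
command -v perfquery ipmi-oem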
Installation
You can install, upgrade, or uninstall colmet with these commands::
$ pip install [--user] colmet
$ pip install [--user] --upgrade colmet
$ pip uninstall colmet
Or from git (last development version)::
$ pip install [--user] git+https://github.com/oar-team/colmet.git
Or if you already pulled the sources::
$ pip install [--user] path/to/sources
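After installation, you can check that the two entry points are available by asking for their help (the same --help mentioned in the usage section below):
colmet-node --help
colmet-collector --help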
Usage:
For the nodes:
sudo colmet-node -vvv --zeromq-uri tcp://127.0.0.1:5556
For the collector:
# Simple local HDF5 file collect:
colmet-collector -vvv --zeromq-bind-uri tcp://127.0.0.1:5556 --hdf5-filepath /data/colmet.hdf5 --hdf5-complevel 9
# Collector with an Elasticsearch backend:
colmet-collector -vvv \
--zeromq-bind-uri tcp://192.168.0.1:5556 \
--buffer-size 5000 \
--sample-period 3 \
--elastic-host http://192.168.0.2:9200 \
--elastic-index-prefix colmet_dahu_ 2>>/var/log/colmet_err.log >> /var/log/colmet.log
You will see the number of counters retrieved in the debug log.
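To take a quick look at what was stored in the HDF5 file, the ptdump utility shipped with python-tables can list its structure (the file path is the one from the example above; the exact group layout depends on the enabled backends and on the jobs that ran):
ptdump /data/colmet.hdf5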
For more information, please refer to the help of these scripts (--help).
Notes about backends
Some input backends need external libraries that must be compiled and installed beforehand:
# For the perfhw backend:
cd colmet/node/backends/lib_perf_hw/ && make && cp lib_perf_hw.so /usr/local/lib/
# For the rapl backend:
cd colmet/node/backends/lib_rapl/ && make && cp lib_rapl.so /usr/local/lib/
Here's a complete colmet-node start-up command, with the perfhw, rapl and other backends enabled:
export LIB_PERFHW_PATH=/usr/local/lib/lib_perf_hw.so
export LIB_RAPL_PATH=/applis/site/colmet/lib_rapl.so
colmet-node -vvv --zeromq-uri tcp://192.168.0.1:5556 \
--cpuset_rootpath /dev/cpuset/oar \
--enable-infiniband --omnipath \
--enable-lustre \
--enable-perfhw --perfhw-list instructions cache_misses page_faults cpu_cycles cache_references \
--enable-RAPL \
--enable-jobproc \
--enable-ipmipower >> /var/log/colmet.log 2>&1
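To start colmet-node automatically at boot, one common approach is to wrap a command like the one above in a systemd unit (this unit is only a sketch and is not shipped with colmet; adapt the binary path, options and log handling to your site):
cat > /etc/systemd/system/colmet-node.service <<'EOF'
[Unit]
Description=Colmet node agent
After=network.target

[Service]
Environment=LIB_PERFHW_PATH=/usr/local/lib/lib_perf_hw.so
ExecStart=/usr/local/bin/colmet-node -vvv --zeromq-uri tcp://192.168.0.1:5556 --cpuset_rootpath /dev/cpuset/oar
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now colmet-node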
RAPL - Running Average Power Limit (Intel)
RAPL is a feature of recent Intel processors that makes it possible to measure the power consumption of the CPU in real time.
Usage: start colmet-node with the --enable-RAPL option.
A file named RAPL_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc. in the collected data and the actual name of each metric, as well as the package and zone (core / uncore / dram) of the processor the metric refers to.
If a given counter is not supported by the hardware, the metric name will be "counter_not_supported_by_hardware" and 0 values will appear in the collected data; -1 values in the collected data mean there is no counter mapped to the column.
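If all RAPL counters show up as unsupported, you can check whether the kernel exposes RAPL zones through the powercap sysfs interface used by libpowercap (standard powercap paths; zone names vary with the CPU model):
ls /sys/class/powercap/
cat /sys/class/powercap/intel-rapl:0/name    # e.g. package-0, if RAPL is available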
Perfhw
This backend provides metrics collected using the perf_event_open interface.
Usage: start colmet-node with the --enable-perfhw option.
Optionally, choose the metrics you want (5 metrics max) using the --perfhw-list option followed by a space-separated list of metrics.
Example: --enable-perfhw --perfhw-list instructions cpu_cycles cache_misses
A file named perfhw_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc. from the collected data and the actual name of each metric.
Available metrics (refer to the perf_event_open documentation for their meaning):
cpu_cycles
instructions
cache_references
cache_misses
branch_instructions
branch_misses
bus_cycles
ref_cpu_cycles
cache_l1d
cache_ll
cache_dtlb
cache_itlb
cache_bpu
cache_node
cache_op_read
cache_op_prefetch
cache_result_access
cpu_clock
task_clock
page_faults
context_switches
cpu_migrations
page_faults_min
page_faults_maj
alignment_faults
emulation_faults
dummy
bpf_output
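If the perfhw counters come back empty, a first thing to check is whether perf events are usable at all on the node; the kernel exposes its restriction level in a sysctl (a generic perf_event check, not specific to colmet):
cat /proc/sys/kernel/perf_event_paranoid    # lower values allow more measurements; colmet-node runs as root in the examples above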
Temperature
This backend gets temperatures from /sys/class/thermal/thermal_zone*/temp
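To see which thermal zones a node exposes and the raw values the backend will read (the values are in millidegrees Celsius):
for z in /sys/class/thermal/thermal_zone*; do
    echo "$z: $(cat "$z"/type) $(cat "$z"/temp)"
done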
Usage: start colmet-node with the --enable-temperature option.
A file named temperature_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc. from the collected data and the actual name of each metric.
Colmet CHANGELOG
Version 0.6.9
- Fix for newer pyzmq versions
Version 0.6.8
- Added nvidia GPU support
Version 0.6.7
- bugfix: glob import missing into procstats
Version 0.6.6
- Added --no-check-certificates option for elastic backend
- Added involved jobs and new metrics into jobprocstats
Version 0.6.4
- Added http auth support for elasticsearch backend
Version 0.6.3
Released on September 4th 2020
- Bugfixes into lustrestats and jobprocstats backend
Version 0.6.2
Released on September 3rd 2020
- Python package fix
Version 0.6.1
Released on September 3rd 2020
- New input backends: lustre, infiniband, temperature, rapl, perfhw, ipmipower, jobproc
- New output backend: elasticsearch
- Example Grafana dashboard for Elasticsearch backend
- Added "involved_jobs" value for metrics that are global to a node (job 0)
- Bugfix for "dictionary changed size during iteration"
Version 0.5.4
Released on January 19th 2018
- hdf5 extractor script for OAR RESTFUL API
- Added infiniband backend
- Added lustre backend
- Fixed cpuset_rootpath default always appended
Version 0.5.3
Released on April 29th 2015
- Removed unnecessary lock from the collector to avoid colmet waiting forever
- Removed (async) zmq eventloop and added --sample-period to the collector
- Fixed some bugs with the hdf5 file
Version 0.5.2
Released on Apr 2nd 2015
- Fixed python syntax error
Version 0.5.1
Released on Apr 2nd 2015
- Fixed error about missing requirements.txt file in the sdist package
Version 0.5.0
Released on Apr 2nd 2015
- Don't run colmet as a daemon anymore
- Maintained compatibility with zmq 3.x/4.x
- Dropped --zeromq-swap (swap was dropped from zmq 3.x)
- Handled zmq name change from HWM to SNDHWM and RCVHWM
- Fixed requirements
- Dropped python 2.6 support
Version 0.4.0
- Saved metrics in a new HDF5 file when colmet is reloaded, in order to avoid HDF5 data corruption
- Handled HUP signal to reload colmet-collector
- Removed hiwater_rss and hiwater_vm collected metrics
Version 0.3.1
- New metrics hiwater_rss and hiwater_vm for taskstats
- Worked with pyinotify 0.8
- Added --disable-procstats option to disable the procstats backend
Version 0.3.0
- Divided colmet package into three parts:
  - colmet-node: retrieve data from taskstats and procstats and send it to collectors with ZeroMQ
  - colmet-collector: a collector that stores data received via ZeroMQ in an hdf5 file
  - colmet-common: common colmet part
- Added some parameters of the ZeroMQ backend to prevent a memory overflow
- Simplified the command line interface
- Dropped the rrd backend because it was not yet working
- Added --buffer-size option for the collector to define the maximum number of counters that colmet should queue in memory before pushing them to the output backend
- Handled SIGTERM and SIGINT to terminate colmet properly
Version 0.2.0
- Added options to enable hdf5 compression
- Support for multiple jobs by cgroup path scanning
- Used Inotify events for job list update
- Don't filter packets if no job_id range was specified, especially with zeromq backend
- Waited for the cgroup_path folder creation before scanning the list of jobs
- Added procstat for node monitoring through a fictive job with 0 as identifier
- Used absolute measurement times rather than delays between measures, to avoid measurement time drift
- Added a workaround for when a new cgroup is created without any process in it (monitoring is suspended until a process is launched)
Version 0.0.1
- Conception