A utility to monitor job resources in an HPC environment, especially OAR
Colmet - Collecting metrics about jobs running in a distributed environment
Introduction:
Colmet is a monitoring tool that collects metrics about jobs running in a distributed environment, especially on clusters and grids. It currently provides several backends:
- Input backends:
  - taskstats: fetch task metrics from the Linux kernel
  - rapl: Intel processors real-time consumption metrics
  - perfhw: perf_event counters
  - jobproc: get information from /proc
  - ipmipower: get power metrics from IPMI
  - temperature: get temperatures from /sys/class/thermal
  - infiniband: get InfiniBand/Omni-Path network metrics
  - lustre: get Lustre FS stats
- Output backends:
  - elasticsearch: store the metrics in Elasticsearch indexes
  - hdf5: store the metrics on the filesystem
  - stdout: display the metrics on the terminal
It uses zeromq to transport the metrics across the network.
It is currently bound to the OAR RJMS.
A sample Grafana dashboard is provided for the elasticsearch backend.
Installation:
Requirements
- a Linux kernel that supports
  - Taskstats
  - intel_rapl (for RAPL backend)
  - perf_event (for perfhw backend)
  - ipmi_devintf (for ipmi backend)
- Python Version 2.7 or newer
  - python-zmq 2.2.0 or newer
  - python-tables 3.3.0 or newer
  - python-pyinotify 0.9.3-2 or newer
  - python-requests
- For the Elasticsearch output backend (recommended for sites with > 50 nodes)
  - An Elasticsearch server
  - A Grafana server (for visualization)
- For the RAPL input backend:
  - libpowercap, powercap-utils (https://github.com/powercap/powercap)
- For the infiniband backend:
  - the perfquery command line tool
- For the ipmipower backend:
  - the ipmi-oem command line tool (freeipmi) or another configurable command
Installation
You can install, upgrade, or uninstall colmet with these commands:
$ pip install [--user] colmet
$ pip install [--user] --upgrade colmet
$ pip uninstall colmet
Or from git (latest development version):
$ pip install [--user] git+https://github.com/oar-team/colmet.git
Or if you already pulled the sources:
$ pip install [--user] path/to/sources
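After installing, one way to confirm which version the current interpreter sees is to query the package metadata. The following is a minimal sketch using the standard-library importlib.metadata module (Python 3.8 or newer); the helper name installed_version is ours and is not part of colmet.

```python
from importlib import metadata


def installed_version(package="colmet"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The distribution is not visible to this interpreter.
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("colmet is not installed for this interpreter")
    else:
        print("colmet " + version)
```

If pip was run with --user, run this check with the same interpreter so that the user site directory is searched.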
Usage:
For the nodes:
sudo colmet-node -vvv --zeromq-uri tcp://127.0.0.1:5556
For the collector:
# Simple local HDF5 file collect:
colmet-collector -vvv --zeromq-bind-uri tcp://127.0.0.1:5556 --hdf5-filepath /data/colmet.hdf5 --hdf5-complevel 9
# Collector with an Elasticsearch backend:
colmet-collector -vvv \
--zeromq-bind-uri tcp://192.168.0.1:5556 \
--buffer-size 5000 \
--sample-period 3 \
--elastic-host http://192.168.0.2:9200 \
--elastic-index-prefix colmet_dahu_ 2>>/var/log/colmet_err.log >> /var/log/colmet.log
You will see the number of counters retrieved in the debug log.
For more information, refer to the help of these scripts (--help).
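When the collector is launched from a deployment script, the command line can be assembled programmatically. The sketch below only uses the options shown in the examples above; the function build_collector_command and its parameter names are hypothetical helpers for illustration, not part of colmet.

```python
import shlex


def build_collector_command(bind_uri, hdf5_filepath=None, hdf5_complevel=None,
                            elastic_host=None, elastic_index_prefix=None,
                            buffer_size=None, sample_period=None, verbosity=3):
    """Assemble a colmet-collector argument list from the options shown above."""
    cmd = ["colmet-collector"]
    if verbosity > 0:
        cmd.append("-" + "v" * verbosity)  # e.g. -vvv
    cmd += ["--zeromq-bind-uri", bind_uri]
    # Optional flags are only emitted when a value was given.
    optional = [
        ("--buffer-size", buffer_size),
        ("--sample-period", sample_period),
        ("--hdf5-filepath", hdf5_filepath),
        ("--hdf5-complevel", hdf5_complevel),
        ("--elastic-host", elastic_host),
        ("--elastic-index-prefix", elastic_index_prefix),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    command = build_collector_command(
        "tcp://127.0.0.1:5556",
        hdf5_filepath="/data/colmet.hdf5",
        hdf5_complevel=9,
    )
    # Print a shell-quoted version; pass `command` to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in command))
```

Returning a list rather than a string keeps the arguments safe to hand to subprocess.run without shell quoting issues.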
Notes about backends
Some input backends need external libraries that must be compiled and installed beforehand:
# For the perfhw backend:
cd colmet/node/backends/lib_perf_hw/ && make && cp lib_perf_hw.so /usr/local/lib/
# For the rapl backend:
cd colmet/node/backends/lib_rapl/ && make && cp lib_rapl.so /usr/local/lib/
Here's a complete colmet-node start-up process, with perfhw, rapl and more backends:
export LIB_PERFHW_PATH=/usr/local/lib/lib_perf_hw.so
export LIB_RAPL_PATH=/applis/site/colmet/lib_rapl.so
colmet-node -vvv --zeromq-uri tcp://192.168.0.1:5556 \
--cpuset_rootpath /dev/cpuset/oar \
--enable-infiniband --omnipath \
--enable-lustre \
--enable-perfhw --perfhw-list instructions cache_misses page_faults cpu_cycles cache_references \
--enable-RAPL \
--enable-jobproc \
--enable-ipmipower >> /var/log/colmet.log 2>&1
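Because the start-up above relies on LIB_PERFHW_PATH and LIB_RAPL_PATH pointing at compiled libraries, a wrapper script may want to check them before launching colmet-node. This is a small sketch under that assumption; missing_backend_libraries is a hypothetical helper, and how colmet itself reacts to a missing library is not covered here.

```python
import os

# Environment variables used in the start-up example above.
LIBRARY_VARIABLES = ("LIB_PERFHW_PATH", "LIB_RAPL_PATH")


def missing_backend_libraries(env=None):
    """Return the variables that are unset or do not point to an existing file."""
    env = os.environ if env is None else env
    problems = []
    for variable in LIBRARY_VARIABLES:
        path = env.get(variable)
        if not path or not os.path.isfile(path):
            problems.append(variable)
    return problems


if __name__ == "__main__":
    for variable in missing_backend_libraries():
        print("warning: " + variable + " is unset or does not point to a file")
```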
RAPL - Running Average Power Limit (Intel)
RAPL is a feature of recent Intel processors that makes it possible to know the power consumption of the CPU in real time.
Usage: start colmet-node with the option --enable-RAPL
A file named RAPL_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc. from the collected data and the actual name of the metric, as well as the package and zone (core / uncore / dram) of the processor the metric refers to.
If a given counter is not supported by the hardware, the metric name will be "counter_not_supported_by_hardware" and 0 values will appear in the collected data; -1 values in the collected data mean there is no counter mapped to the column.
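When post-processing collected RAPL data, the two sentinel conventions described above can be separated from real readings. The sketch below encodes only those two rules; the function name describe_rapl_sample and the label strings are illustrative choices, and the metric name "package_energy" in the usage lines is a made-up placeholder.

```python
UNSUPPORTED_NAME = "counter_not_supported_by_hardware"


def describe_rapl_sample(metric_name, value):
    """Classify one RAPL sample according to the conventions described above."""
    if value == -1:
        return "unmapped"      # no counter is mapped to this column
    if metric_name == UNSUPPORTED_NAME:
        return "unsupported"   # counter missing in hardware; the value is 0
    return "valid"


if __name__ == "__main__":
    # "package_energy" is a hypothetical metric name used only for illustration.
    samples = [("package_energy", 1234), (UNSUPPORTED_NAME, 0), (None, -1)]
    for name, value in samples:
        print(name, value, describe_rapl_sample(name, value))
```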
Perfhw
This backend provides metrics collected using the perf_event_open interface.
Usage: start colmet-node with the option --enable-perfhw
Optionally choose the metrics you want (max 5 metrics) using the option --perfhw-list followed by a space-separated list of the metrics.
Example: --enable-perfhw --perfhw-list instructions cpu_cycles cache_misses
A file named perfhw_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc. from the collected data and the actual name of the metric.
Available metrics (refer to the perf_event_open documentation for their meaning):
cpu_cycles
instructions
cache_references
cache_misses
branch_instructions
branch_misses
bus_cycles
ref_cpu_cycles
cache_l1d
cache_ll
cache_dtlb
cache_itlb
cache_bpu
cache_node
cache_op_read
cache_op_prefetch
cache_result_access
cpu_clock
task_clock
page_faults
context_switches
cpu_migrations
page_faults_min
page_faults_maj
alignment_faults
emulation_faults
dummy
bpf_output
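A launcher script can check a candidate --perfhw-list value against the list above and the 5-metric limit before starting colmet-node. This is a minimal sketch; validate_perfhw_list is a hypothetical helper, and whether colmet-node performs the same checks itself is not stated here.

```python
# Metric names copied from the list above.
AVAILABLE_METRICS = frozenset("""
cpu_cycles instructions cache_references cache_misses branch_instructions
branch_misses bus_cycles ref_cpu_cycles cache_l1d cache_ll cache_dtlb
cache_itlb cache_bpu cache_node cache_op_read cache_op_prefetch
cache_result_access cpu_clock task_clock page_faults context_switches
cpu_migrations page_faults_min page_faults_maj alignment_faults
emulation_faults dummy bpf_output
""".split())

MAX_METRICS = 5  # limit stated in the text above


def validate_perfhw_list(metrics):
    """Return a list of problems found in a candidate --perfhw-list value."""
    problems = []
    if len(metrics) > MAX_METRICS:
        problems.append("too many metrics: %d > %d" % (len(metrics), MAX_METRICS))
    unknown = [name for name in metrics if name not in AVAILABLE_METRICS]
    if unknown:
        problems.append("unknown metrics: " + ", ".join(unknown))
    return problems


if __name__ == "__main__":
    print(validate_perfhw_list("instructions cpu_cycles cache_misses".split()))
```

An empty result means the list respects both constraints.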
Temperature
This backend gets temperatures from /sys/class/thermal/thermal_zone*/temp
Usage: start colmet-node with the option --enable-temperature
A file named temperature_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc. from the collected data and the actual name of the metric.
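To see what the kernel exposes on a given node before enabling the backend, the same sysfs files can be read directly. The sketch below is an independent illustration, not colmet's implementation; it relies on the Linux sysfs convention that thermal_zone*/temp reports millidegrees Celsius, and the function name read_thermal_zones is ours.

```python
import glob
import os


def read_thermal_zones(root="/sys/class/thermal"):
    """Return {zone_name: temperature in degrees Celsius} for each readable zone."""
    readings = {}
    pattern = os.path.join(root, "thermal_zone*", "temp")
    for temp_file in sorted(glob.glob(pattern)):
        zone = os.path.basename(os.path.dirname(temp_file))
        try:
            with open(temp_file) as handle:
                millidegrees = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or empty zone: skip it
        readings[zone] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for zone, celsius in read_thermal_zones().items():
        print(zone, celsius)
```

The root parameter exists so the function can be pointed at a test directory; on a node without thermal zones it simply returns an empty dictionary.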
Colmet CHANGELOG
Version 0.6.11
Unreleased
Version 0.6.10
- Fixed missing exception handling in the elasticsearch backend (collector)
- ZMQ: Prefer SNDHWM and RCVHWM to HWM
- Fixed: taskstats data could block other data from being collected when the cpuset is empty (node)
Version 0.6.9
- Fix for newer pyzmq versions
Version 0.6.8
- Added nvidia GPU support
Version 0.6.7
- Bugfix: glob import missing in procstats
Version 0.6.6
- Added --no-check-certificates option for elastic backend
- Added involved jobs and new metrics to jobprocstats
Version 0.6.4
- Added http auth support for elasticsearch backend
Version 0.6.3
Released on September 4th 2020
- Bugfixes in the lustrestats and jobprocstats backends
Version 0.6.2
Released on September 3rd 2020
- Python package fix
Version 0.6.1
Released on September 3rd 2020
- New input backends: lustre, infiniband, temperature, rapl, perfhw, ipmipower, jobproc
- New output backend: elasticsearch
- Example Grafana dashboard for Elasticsearch backend
- Added "involved_jobs" value for metrics that are global to a node (job 0)
- Bugfix for "dictionary changed size during iteration"
Version 0.5.4
Released on January 19th 2018
- hdf5 extractor script for OAR RESTFUL API
- Added infiniband backend
- Added lustre backend
- Fixed cpuset_rootpath default always appended
Version 0.5.3
Released on April 29th 2015
- Removed unnecessary lock from the collector to prevent colmet from waiting forever
- Removed (async) zmq eventloop and added --sample-period to the collector.
- Fixed some bugs about hdf file
Version 0.5.2
Released on Apr 2nd 2015
- Fixed python syntax error
Version 0.5.1
Released on Apr 2nd 2015
- Fixed error about missing requirements.txt file in the sdist package
Version 0.5.0
Released on Apr 2nd 2015
- Don't run colmet as a daemon anymore
- Maintained compatibility with zmq 3.x/4.x
- Dropped --zeromq-swap (swap was dropped from zmq 3.x)
- Handled zmq name change from HWM to SNDHWM and RCVHWM
- Fixed requirements
- Dropped python 2.6 support
Version 0.4.0
- Saved metrics in new HDF5 file if colmet is reloaded in order to avoid HDF5 data corruption
- Handled HUP signal to reload colmet-collector
- Removed hiwater_rss and hiwater_vm collected metrics.
Version 0.3.1
- New metrics hiwater_rss and hiwater_vm for taskstats
- Worked with pyinotify 0.8
- Added --disable-procstats option to disable procstats backend.
Version 0.3.0
- Divided colmet package into three parts
  - colmet-node: Retrieve data from taskstats and procstats and send to collectors with ZeroMQ
  - colmet-collector: A collector that stores data received by ZeroMQ in a hdf5 file
  - colmet-common: Common colmet part.
- Added some parameters of ZeroMQ backend to prevent a memory overflow
- Simplified the command line interface
- Dropped rrd backend because it is not yet working
- Added --buffer-size option for collector to define the maximum number of counters that colmet should queue in memory before pushing it to output backend
- Handled SIGTERM and SIGINT to terminate colmet properly
Version 0.2.0
- Added options to enable hdf5 compression
- Support for multiple jobs by cgroup path scanning
- Used Inotify events for job list update
- Don't filter packets if no job_id range was specified, especially with zeromq backend
- Waited for the cgroup_path folder creation before scanning the list of jobs
- Added procstat for node monitoring through a fictitious job with 0 as identifier
- Used absolute time to take measurements rather than the delay between measurements, to avoid drift of the measurement time
- Added a workaround for when a new cgroup is created without any process in it (monitoring is suspended until a process is launched)
Version 0.0.1
- Conception