
A utility to monitor job resources in an HPC environment, especially with OAR

Project description

Colmet - Collecting metrics about jobs running in a distributed environment

Introduction:

Colmet is a monitoring tool that collects metrics about jobs running in a distributed environment, especially on clusters and grids. It currently provides several backends:

  • Input backends:
    • taskstats: fetch task metrics from the Linux kernel
    • rapl: real-time power consumption metrics for Intel processors
    • perfhw: perf_event counters
    • jobproc: get information from /proc
    • ipmipower: get power metrics from IPMI
    • temperature: get temperatures from /sys/class/thermal
    • infiniband: get InfiniBand/Omni-Path network metrics
    • lustre: get Lustre filesystem stats
  • Output backends:
    • elasticsearch: store the metrics in Elasticsearch indexes
    • hdf5: store the metrics on the filesystem
    • stdout: display the metrics on the terminal

It uses ZeroMQ to transport the metrics across the network.

It is currently bound to the OAR RJMS.

A sample Grafana dashboard is provided for the elasticsearch backend.

Installation:

Requirements

  • a Linux kernel that supports

    • Taskstats
    • intel_rapl (for RAPL backend)
    • perf_event (for perfhw backend)
    • ipmi_devintf (for ipmi backend)
  • Python version 2.7 or newer

    • python-zmq 2.2.0 or newer
    • python-tables 3.3.0 or newer
    • python-pyinotify 0.9.3-2 or newer
    • python-requests
  • For the Elasticsearch output backend (recommended for sites with > 50 nodes)

    • An Elasticsearch server
    • A Grafana server (for visualization)
  • For the RAPL input backend:

    • a recent Intel processor with the intel_rapl kernel module (see above)
  • For the infiniband backend:

    • perfquery command line tool
  • For the ipmipower backend:

    • ipmi-oem command line tool (freeipmi) or other configurable command

Installation

You can install, upgrade, and uninstall colmet with these commands:

$ pip install [--user] colmet
$ pip install [--user] --upgrade colmet
$ pip uninstall colmet

Or from git (latest development version):

$ pip install [--user] git+https://github.com/oar-team/colmet.git

Or, if you have already pulled the sources:

$ pip install [--user] path/to/sources

Usage:

For the nodes:

sudo colmet-node -vvv --zeromq-uri tcp://127.0.0.1:5556

For the collector:

# Simple local HDF5 file collect:
colmet-collector -vvv --zeromq-bind-uri tcp://127.0.0.1:5556 --hdf5-filepath /data/colmet.hdf5 --hdf5-complevel 9

# Collector with an Elasticsearch backend:
colmet-collector -vvv \
    --zeromq-bind-uri tcp://192.168.0.1:5556 \
    --buffer-size 5000 \
    --sample-period 3 \
    --elastic-host http://192.168.0.2:9200 \
    --elastic-index-prefix colmet_dahu_ 2>>/var/log/colmet_err.log >> /var/log/colmet.log

You will see the number of counters retrieved in the debug log.

For more information, refer to the help of these scripts (--help).

Notes about backends

Some input backends may need external libraries that must be compiled and installed first:

# For the perfhw backend:
cd colmet/node/backends/lib_perf_hw/ && make && cp lib_perf_hw.so /usr/local/lib/
# For the rapl backend:
cd colmet/node/backends/lib_rapl/ && make && cp lib_rapl.so /usr/local/lib/

Here's a complete colmet-node start-up command, with the perfhw, rapl, and other backends enabled:

export LIB_PERFHW_PATH=/usr/local/lib/lib_perf_hw.so
export LIB_RAPL_PATH=/applis/site/colmet/lib_rapl.so

colmet-node -vvv --zeromq-uri tcp://192.168.0.1:5556 \
   --cpuset_rootpath /dev/cpuset/oar \
   --enable-infiniband --omnipath \
   --enable-lustre \
   --enable-perfhw --perfhw-list instructions cache_misses page_faults cpu_cycles cache_references \
   --enable-RAPL \
   --enable-jobproc \
   --enable-ipmipower >> /var/log/colmet.log 2>&1

RAPL - Running Average Power Limit (Intel)

RAPL is a feature of recent Intel processors that makes it possible to measure the power consumption of the CPU in real time.
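Since RAPL exposes cumulative energy counters (in microjoules, e.g. via the kernel's intel_rapl powercap interface under /sys/class/powercap/), power is derived from the difference between two samples divided by the elapsed time. A minimal sketch of that arithmetic (not colmet's actual code; the counter wrap size is hardware dependent):

```python
def average_power_watts(energy_uj_start, energy_uj_end, elapsed_seconds,
                        max_energy_uj=2**32):
    """Return mean power in watts between two cumulative energy samples.

    Counters are in microjoules and wrap around at max_energy_uj
    (hardware dependent), so a negative delta is corrected by one wrap.
    """
    delta = energy_uj_end - energy_uj_start
    if delta < 0:  # the counter wrapped between the two samples
        delta += max_energy_uj
    return (delta / 1_000_000) / elapsed_seconds  # uJ -> J -> W

# Example: 30 J consumed over 3 s
print(average_power_watts(5_000_000, 35_000_000, 3.0))  # 10.0
```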

Usage: start colmet-node with the option --enable-RAPL

A file named RAPL_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc. from the collected data and the actual name of the metric, as well as the package and zone (core / uncore / dram) of the processor the metric refers to.

If a given counter is not supported by the hardware, the metric name will be "counter_not_supported_by_hardware" and 0 values will appear in the collected data; -1 values in the collected data mean there is no counter mapped to the column.
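A sketch of loading such a mapping file back into a dict, skipping unsupported counters. The two-column layout (counter name, metric name) is an assumption for illustration; check the actual columns of your RAPL_mapping.[timestamp].csv:

```python
import csv

def load_rapl_mapping(path):
    """Map counter_N column names to metric names, skipping counters
    the hardware does not support (assumed CSV layout: counter name
    in the first column, metric name in the second)."""
    mapping = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue
            counter, metric = row[0].strip(), row[1].strip()
            if metric == "counter_not_supported_by_hardware":
                continue  # this column only ever contains 0 values
            mapping[counter] = metric
    return mapping
```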

Perfhw

This backend provides metrics collected via the perf_event_open interface.

Usage: start colmet-node with the option --enable-perfhw

Optionally, choose the metrics you want (5 metrics max) using the --perfhw-list option followed by a space-separated list of metrics.

Example: --enable-perfhw --perfhw-list instructions cpu_cycles cache_misses

A file named perfhw_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc from collected data and the actual name of the metric.
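Once the mapping is known, the generic counter_N columns of a collected sample can be renamed to their real metric names. A small illustrative helper (the sample layout is assumed, not colmet's actual API):

```python
def rename_counters(sample, mapping):
    """Replace counter_N keys in one collected sample with metric names
    taken from a mapping file; keys without a mapping are kept as-is."""
    return {mapping.get(key, key): value for key, value in sample.items()}

sample = {"counter_1": 1234, "counter_2": 56, "timestamp": 1599100000}
mapping = {"counter_1": "instructions", "counter_2": "cache_misses"}
print(rename_counters(sample, mapping))
# {'instructions': 1234, 'cache_misses': 56, 'timestamp': 1599100000}
```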

Available metrics (refer to the perf_event_open documentation for their meaning):

cpu_cycles 
instructions 
cache_references 
cache_misses 
branch_instructions
branch_misses
bus_cycles 
ref_cpu_cycles 
cache_l1d 
cache_ll
cache_dtlb 
cache_itlb 
cache_bpu 
cache_node 
cache_op_read 
cache_op_prefetch 
cache_result_access 
cpu_clock 
task_clock 
page_faults 
context_switches 
cpu_migrations
page_faults_min
page_faults_maj
alignment_faults 
emulation_faults
dummy
bpf_output

Temperature

This backend gets temperatures from /sys/class/thermal/thermal_zone*/temp

Usage: start colmet-node with the option --enable-temperature

A file named temperature_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc from collected data and the actual name of the metric.
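Conceptually, reading these files is simple: each thermal_zone*/temp file holds an integer in millidegrees Celsius. A sketch of that read (not colmet's actual code; the base path is parameterized for testing):

```python
import glob
import os

def read_temperatures(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every thermal zone.

    Each thermal_zone*/temp file holds an integer in millidegrees C.
    """
    temps = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                temps[os.path.basename(zone)] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # zone disappeared or held an unreadable value
    return temps
```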

Colmet CHANGELOG

Version 0.6.9

  • Fix for newer pyzmq versions

Version 0.6.8

  • Added nvidia GPU support

Version 0.6.7

  • Bugfix: missing glob import in procstats

Version 0.6.6

  • Added --no-check-certificates option for elastic backend
  • Added involved jobs and new metrics into jobprocstats

Version 0.6.4

  • Added http auth support for elasticsearch backend

Version 0.6.3

Released on September 4th 2020

  • Bugfixes into lustrestats and jobprocstats backend

Version 0.6.2

Released on September 3rd 2020

  • Python package fix

Version 0.6.1

Released on September 3rd 2020

  • New input backends: lustre, infiniband, temperature, rapl, perfhw, ipmipower, jobproc
  • New output backend: elasticsearch
  • Example Grafana dashboard for the Elasticsearch backend
  • Added "involved_jobs" value for metrics that are global to a node (job 0)
  • Bugfix for "dictionary changed size during iteration"

Version 0.5.4

Released on January 19th 2018

  • hdf5 extractor script for OAR RESTFUL API
  • Added infiniband backend
  • Added lustre backend
  • Fixed cpuset_rootpath default always appended

Version 0.5.3

Released on April 29th 2015

  • Removed an unnecessary lock from the collector to avoid colmet waiting forever
  • Removed (async) zmq eventloop and added --sample-period to the collector.
  • Fixed some bugs about hdf file

Version 0.5.2

Released on Apr 2nd 2015

  • Fixed python syntax error

Version 0.5.1

Released on Apr 2nd 2015

  • Fixed error about missing requirements.txt file in the sdist package

Version 0.5.0

Released on Apr 2nd 2015

  • Don't run colmet as a daemon anymore
  • Maintained compatibility with zmq 3.x/4.x
  • Dropped --zeromq-swap (swap was dropped from zmq 3.x)
  • Handled zmq name change from HWM to SNDHWM and RCVHWM
  • Fixed requirements
  • Dropped python 2.6 support

Version 0.4.0

  • Saved metrics in new HDF5 file if colmet is reloaded in order to avoid HDF5 data corruption
  • Handled HUP signal to reload colmet-collector
  • Removed hiwater_rss and hiwater_vm collected metrics.

Version 0.3.1

  • New metrics hiwater_rss and hiwater_vm for taskstats
  • Worked with pyinotify 0.8
  • Added --disable-procstats option to disable procstats backend.

Version 0.3.0

  • Divided colmet package into three parts

    • colmet-node : Retrieve data from taskstats and procstats and send to collectors with ZeroMQ
    • colmet-collector : A collector that stores data received by ZeroMQ in a hdf5 file
    • colmet-common : Common colmet part.
  • Added some parameters of ZeroMQ backend to prevent a memory overflow

  • Simplified the command line interface

  • Dropped rrd backend because it is not yet working

  • Added --buffer-size option for collector to define the maximum number of counters that colmet should queue in memory before pushing it to output backend

  • Handled SIGTERM and SIGINT to terminate colmet properly

Version 0.2.0

  • Added options to enable hdf5 compression
  • Support for multiple jobs via cgroup path scanning
  • Used inotify events to update the job list
  • Don't filter packets if no job_id range was specified, especially with the zeromq backend
  • Waited for the cgroup_path folder creation before scanning the list of jobs
  • Added procstat for node monitoring through a fictive job with 0 as identifier
  • Used absolute measurement times rather than delays between measures, to avoid measurement-time drift
  • Added a workaround for when a new cgroup is created without any process in it (monitoring is suspended until a process is launched)

Version 0.0.1

  • Conception
