
A utility to monitor job resources in an HPC environment, especially OAR

Project description

Colmet - Collecting metrics about jobs running in a distributed environment


Colmet is a monitoring tool that collects metrics about jobs running in a distributed environment, especially on clusters and grids. It currently provides several backends:

  • Input backends:
    • taskstats: fetch task metrics from the Linux kernel
    • rapl: real-time power consumption metrics for Intel processors
    • perfhw: perf_event counters
    • jobproc: get job info from /proc
    • ipmipower: get power metrics from IPMI
    • temperature: get temperatures from /sys/class/thermal
    • infiniband: get InfiniBand/Omni-Path network metrics
    • lustre: get Lustre FS stats
  • Output backends:
    • elasticsearch: store the metrics in Elasticsearch indexes
    • hdf5: store the metrics on the filesystem
    • stdout: display the metrics on the terminal

It uses ZeroMQ to transport the metrics across the network.

It is currently bound to the OAR RJMS.
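The split described above (input backends collect samples, output backends store or display them) can be sketched as follows. This is only an illustration of the data flow; the class names and interfaces here are hypothetical and do not reflect colmet's actual code.

```python
import json
import time

class StdoutOutputBackend:
    """Mimics the 'stdout' output backend: print each metric sample."""
    def push(self, counters):
        for sample in counters:
            print(json.dumps(sample))

class FakeInputBackend:
    """Stands in for an input backend such as taskstats or jobproc."""
    def get_counters(self):
        # A real backend would read kernel interfaces here.
        return [{"job_id": 42, "timestamp": time.time(),
                 "metric": "cpu_cycles", "value": 123456}]

# One collection cycle: pull from every input, push to every output.
inputs = [FakeInputBackend()]
outputs = [StdoutOutputBackend()]
for backend in inputs:
    samples = backend.get_counters()
    for out in outputs:
        out.push(samples)
```

In the real deployment the two halves run on different machines, with ZeroMQ carrying the samples from colmet-node to colmet-collector.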

A sample Grafana dashboard is provided for the elasticsearch backend.



Requirements

  • a Linux kernel that supports

    • Taskstats
    • intel_rapl (for RAPL backend)
    • perf_event (for perfhw backend)
    • ipmi_devintf (for ipmi backend)
  • Python Version 2.7 or newer

    • python-zmq 2.2.0 or newer
    • python-tables 3.3.0 or newer
    • python-pyinotify 0.9.3-2 or newer
    • python-requests
  • For the Elasticsearch output backend (recommended for sites with > 50 nodes)

    • An Elasticsearch server
    • A Grafana server (for visualization)
  • For the RAPL input backend:

  • For the infiniband backend:

    • perfquery command line tool
  • For the ipmipower backend:

    • ipmi-oem command line tool (freeipmi) or other configurable command


You can install, upgrade, or uninstall colmet with these commands::

$ pip install [--user] colmet
$ pip install [--user] --upgrade colmet
$ pip uninstall colmet

Or from git (latest development version)::

$ pip install [--user] git+

Or if you already pulled the sources::

$ pip install [--user] path/to/sources


Usage

For the nodes:

sudo colmet-node -vvv --zeromq-uri tcp://

For the collector:

# Simple local HDF5 file collect:
colmet-collector -vvv --zeromq-bind-uri tcp:// --hdf5-filepath /data/colmet.hdf5 --hdf5-complevel 9
# Collector with an Elasticsearch backend:
colmet-collector -vvv \
    --zeromq-bind-uri tcp:// \
    --buffer-size 5000 \
    --sample-period 3 \
    --elastic-host \
    --elastic-index-prefix colmet_dahu_ 2>>/var/log/colmet_err.log >> /var/log/colmet.log

You will see the number of counters retrieved in the debug log.
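The --buffer-size option caps how many counters the collector queues in memory before pushing them to the output backend. The batching logic behind that option can be sketched roughly as follows; this is a hypothetical illustration, not colmet-collector's actual implementation.

```python
class BufferedCollector:
    """Illustrative sketch: queue incoming counters and flush them to an
    output backend once buffer_size samples have accumulated, limiting
    memory use the way colmet-collector's --buffer-size option does."""
    def __init__(self, output_backend, buffer_size=5000):
        self.output_backend = output_backend
        self.buffer_size = buffer_size
        self.buffer = []

    def receive(self, counters):
        self.buffer.extend(counters)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.output_backend.push(self.buffer)
            self.buffer = []

# Toy output backend that just records the batches it was given.
class ListBackend:
    def __init__(self):
        self.batches = []
    def push(self, counters):
        self.batches.append(list(counters))

backend = ListBackend()
collector = BufferedCollector(backend, buffer_size=3)
for i in range(7):
    collector.receive([{"counter": i}])
collector.flush()  # flush the remainder on shutdown
```

With a buffer size of 3 and 7 incoming samples, the backend receives two full batches of 3 plus a final partial batch on shutdown.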

For more information, please refer to the help of these scripts (--help).

Notes about backends

Some input backends may need external libraries that need to be previously compiled and installed:

# For the perfhw backend:
cd colmet/node/backends/lib_perf_hw/ && make && cp /usr/local/lib/
# For the rapl backend:
cd colmet/node/backends/lib_rapl/ && make && cp /usr/local/lib/

Here's a complete colmet-node start-up example, with the perfhw, rapl, and other backends enabled:

export LIB_PERFHW_PATH=/usr/local/lib/
export LIB_RAPL_PATH=/applis/site/colmet/

colmet-node -vvv --zeromq-uri tcp:// \
   --cpuset_rootpath /dev/cpuset/oar \
   --enable-infiniband --omnipath \
   --enable-lustre \
   --enable-perfhw --perfhw-list instructions cache_misses page_faults cpu_cycles cache_references \
   --enable-RAPL \
   --enable-jobproc \
   --enable-ipmipower >> /var/log/colmet.log 2>&1

RAPL - Running Average Power Limit (Intel)

RAPL is a feature of recent Intel processors that makes it possible to measure the power consumption of the CPU in real time.

Usage: start colmet-node with the option --enable-RAPL

A file named RAPL_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc. from the collected data and the actual name of each metric, as well as the package and zone (core / uncore / dram) of the processor the metric refers to.

If a given counter is not supported by the hardware, the metric name will be "counter_not_supported_by_hardware" and 0 values will appear in the collected data; -1 values in the collected data mean there is no counter mapped to the column.
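A consumer of the collected data has to combine the mapping file with the value conventions above. The sketch below shows one way to do that; the two-column CSV layout (counter name, metric name) is an assumption for illustration, so check an actual RAPL_mapping.[timestamp].csv for the real format.

```python
import csv

def load_rapl_mapping(path):
    """Map counter column names to metric names from a RAPL mapping CSV.
    NOTE: the (counter, metric_name) two-column layout assumed here is a
    guess; verify it against a real RAPL_mapping.[timestamp].csv file."""
    mapping = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                mapping[row[0]] = row[1]
    return mapping

def interpret(mapping, counter, value):
    """Apply the conventions described above: discard columns with no
    mapped counter (-1 values) and metrics the hardware lacks."""
    metric = mapping.get(counter)
    if metric is None or value == -1:
        return None  # no counter mapped to this column
    if metric == "counter_not_supported_by_hardware":
        return None  # hardware lacks this counter; its values are 0
    return (metric, value)
```

For example, `interpret(mapping, "counter_3", -1)` returns None, signalling that the column carries no real measurement.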


Perfhw

This backend provides metrics collected using the perf_event_open interface.

Usage: start colmet-node with the option --enable-perfhw

Optionally, choose the metrics you want (max 5 metrics) using the option --perfhw-list followed by a space-separated list of metrics.

Example: --enable-perfhw --perfhw-list instructions cpu_cycles cache_misses

A file named perfhw_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc from collected data and the actual name of the metric.

Available metrics (refer to the perf_event_open documentation for their meaning):



Temperature

This backend gets temperatures from /sys/class/thermal/thermal_zone*/temp.

Usage: start colmet-node with the option --enable-temperature

A file named temperature_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc from collected data and the actual name of the metric.
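The sysfs files this backend polls can be read directly; the Linux thermal sysfs interface exposes each zone's temperature in millidegrees Celsius. A minimal stand-alone reader (not colmet's code, just the same data source) might look like:

```python
import glob
import os

def read_temperatures(base="/sys/class/thermal"):
    """Read every thermal_zone*/temp file under base, the same sysfs
    files the temperature backend polls, and return degrees Celsius.
    Sysfs reports values in millidegrees, hence the division by 1000."""
    temps = {}
    for temp_file in glob.glob(os.path.join(base, "thermal_zone*", "temp")):
        zone = os.path.basename(os.path.dirname(temp_file))
        try:
            with open(temp_file) as f:
                temps[zone] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            pass  # zone unreadable or malformed; skip it
    return temps

print(read_temperatures())
```

On a machine without readable thermal zones this simply returns an empty dict.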


Changelog

Version 0.6.9

  • Fix for newer pyzmq versions

Version 0.6.8

  • Added nvidia GPU support

Version 0.6.7

  • Bugfix: missing glob import in procstats

Version 0.6.6

  • Added --no-check-certificates option for elastic backend
  • Added involved jobs and new metrics into jobprocstats

Version 0.6.4

  • Added http auth support for elasticsearch backend

Version 0.6.3

Released on September 4th 2020

  • Bugfixes into lustrestats and jobprocstats backend

Version 0.6.2

Released on September 3rd 2020

  • Python package fix

Version 0.6.1

Released on September 3rd 2020

  • New input backends: lustre, infiniband, temperature, rapl, perfhw, ipmipower, jobproc
  • New output backend: elasticsearch
  • Example Grafana dashboard for Elasticsearch backend
  • Added "involved_jobs" value for metrics that are global to a node (job 0)
  • Bugfix for "dictionary changed size during iteration"

Version 0.5.4

Released on January 19th 2018

  • hdf5 extractor script for OAR RESTFUL API
  • Added infiniband backend
  • Added lustre backend
  • Fixed cpuset_rootpath default always appended

Version 0.5.3

Released on April 29th 2015

  • Removed an unnecessary lock from the collector to avoid colmet waiting forever
  • Removed (async) zmq eventloop and added --sample-period to the collector.
  • Fixed some bugs about hdf file

Version 0.5.2

Released on Apr 2nd 2015

  • Fixed python syntax error

Version 0.5.1

Released on Apr 2nd 2015

  • Fixed error about missing requirements.txt file in the sdist package

Version 0.5.0

Released on Apr 2nd 2015

  • Don't run colmet as a daemon anymore
  • Maintained compatibility with zmq 3.x/4.x
  • Dropped --zeromq-swap (swap was dropped from zmq 3.x)
  • Handled zmq name change from HWM to SNDHWM and RCVHWM
  • Fixed requirements
  • Dropped python 2.6 support

Version 0.4.0

  • Saved metrics in new HDF5 file if colmet is reloaded in order to avoid HDF5 data corruption
  • Handled HUP signal to reload colmet-collector
  • Removed hiwater_rss and hiwater_vm collected metrics.

Version 0.3.1

  • New metrics hiwater_rss and hiwater_vm for taskstats
  • Worked with pyinotify 0.8
  • Added --disable-procstats option to disable procstats backend.

Version 0.3.0

  • Divided colmet package into three parts

    • colmet-node : Retrieve data from taskstats and procstats and send to collectors with ZeroMQ
    • colmet-collector : A collector that stores data received by ZeroMQ in a hdf5 file
    • colmet-common : Common colmet part.
  • Added some parameters of ZeroMQ backend to prevent a memory overflow

  • Simplified the command line interface

  • Dropped rrd backend because it is not yet working

  • Added --buffer-size option for collector to define the maximum number of counters that colmet should queue in memory before pushing it to output backend

  • Handled SIGTERM and SIGINT to terminate colmet properly

Version 0.2.0

  • Added options to enable hdf5 compression
  • Support for multiple jobs via cgroup path scanning
  • Used inotify events for job list updates
  • Don't filter packets if no job_id range was specified, especially with the zeromq backend
  • Waited for the cgroup_path folder to be created before scanning the list of jobs
  • Added procstat for node monitoring through a fictive job with 0 as identifier
  • Used absolute times instead of the delay between measures, to avoid drift of the measure time
  • Added a workaround for when a new cgroup is created without any process in it (monitoring is suspended until a process is launched)

Version 0.0.1

  • Conception

