A utility to monitor job resources in an HPC environment, especially OAR
Colmet - Collecting metrics about jobs running in a distributed environment
Introduction:
-------------
Colmet is a monitoring tool to collect metrics about jobs running in a
distributed environment, especially for gathering metrics on clusters and
grids. It currently provides several backends:
- taskstats: fetch task metrics from the linux kernel
- stdout: display the metrics on the terminal
- zeromq: transport the metrics across the network
- hdf5: store the metrics on the filesystem
Installation:
-------------
Requirements
~~~~~~~~~~~~
* a Linux kernel that supports

  - Taskstats

* Python version 2.7 or newer

  - python-zmq 2.2.0 or newer
  - python-tables 2.3.1 or newer
  - python-pyinotify 0.9.3-2 or newer
Installation
~~~~~~~~~~~~
You can install, upgrade, or uninstall colmet with these commands::

    $ pip install [--user] colmet
    $ pip install [--user] --upgrade colmet
    $ pip uninstall colmet

Or from git (latest development version)::

    $ pip install [--user] git+https://github.com/oar-team/colmet.git

Or if you have already pulled the sources::

    $ pip install [--user] path/to/sources

Or if you don't have pip::

    $ easy_install colmet
Usage:
------
On the nodes::

    sudo colmet-node -vvv --zeromq-uri tcp://127.0.0.1:5556

On the collector::

    colmet-collector -vvv --zeromq-bind-uri tcp://127.0.0.1:5556 --hdf5-filepath /data/colmet.hdf5 --hdf5-complevel 9

You will see the number of counters retrieved in the debug log.

For more information, run either script with ``--help``.
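
The two commands above only work together when the node's ``--zeromq-uri``
and the collector's ``--zeromq-bind-uri`` point at the same endpoint. The
following is a minimal sketch of a helper that builds both command lines from
one set of values. The function name ``colmet_commands`` and its parameters
are hypothetical and not part of colmet; only the flags shown in the usage
above are used.

```python
import shlex


def colmet_commands(host="127.0.0.1", port=5556,
                    hdf5_path="/data/colmet.hdf5", complevel=9):
    """Build argv lists for colmet-node and colmet-collector.

    Hypothetical helper: it only assembles the flags shown in the usage
    section, so both sides are guaranteed to share the same ZeroMQ URI.
    """
    uri = "tcp://{}:{}".format(host, port)
    node = ["sudo", "colmet-node", "-vvv", "--zeromq-uri", uri]
    collector = [
        "colmet-collector", "-vvv",
        "--zeromq-bind-uri", uri,
        "--hdf5-filepath", hdf5_path,
        "--hdf5-complevel", str(complevel),
    ]
    return node, collector


if __name__ == "__main__":
    # Print shell-ready command lines; nothing is executed here.
    for argv in colmet_commands():
        print(" ".join(shlex.quote(arg) for arg in argv))
```

The lists can be passed to ``subprocess`` as they are, or printed and pasted
into a shell on the relevant machine.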
Licensing:
----------
This product is distributed under the GNU General Public License, version 2.
See the LICENSE file for more information about the license.
Colmet CHANGELOG
================
version 0.5.2
-------------
Released on Apr 2nd 2015
- Fixed python syntax error
version 0.5.1
-------------
Released on Apr 2nd 2015
- Fixed error about missing ``requirements.txt`` file in the sdist package
version 0.5.0
-------------
Released on Apr 2nd 2015
- Don't run colmet as a daemon anymore
- Maintained compatibility with zmq 3.x/4.x
- Dropped ``--zeromq-swap`` (swap was dropped from zmq 3.x)
- Handled zmq name change from HWM to SNDHWM and RCVHWM
- Fixed requirements
- Dropped python 2.6 support
version 0.4.0
-------------
- Saved metrics in new HDF5 file if colmet is reloaded in order to avoid HDF5 data corruption
- Handled HUP signal to reload ``colmet-collector``
- Removed ``hiwater_rss`` and ``hiwater_vm`` collected metrics.
version 0.3.1
-------------
- New metrics ``hiwater_rss`` and ``hiwater_vm`` for taskstats
- Worked with pyinotify 0.8
- Added ``--disable-procstats`` option to disable procstats backend.
version 0.3.0
-------------
- Divided colmet package into three parts

  - colmet-node : Retrieve data from taskstats and procstats and send to
    collectors with ZeroMQ
  - colmet-collector : A collector that stores data received by ZeroMQ in a
    hdf5 file
  - colmet-common : Common colmet part.

- Added some parameters of ZeroMQ backend to prevent a memory overflow
- Simplified the command line interface
- Dropped rrd backend because it is not yet working
- Added ``--buffer-size`` option for collector to define the maximum number of
  counters that colmet should queue in memory before pushing them to the
  output backend
- Handled SIGTERM and SIGINT to terminate colmet properly
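
As an illustration of the ``--buffer-size`` entry above, here is a minimal
sketch of a size-bounded buffer that hands its contents to an output callback
once the limit is reached. It is an assumption about the general idea, not
colmet's actual code; the class ``CounterBuffer`` and the ``push`` callback
are hypothetical names.

```python
class CounterBuffer(object):
    """Queue counters in memory and push them out in batches.

    Hypothetical sketch: `push` stands in for an output backend and is
    called with a list of at most `max_size` counters.
    """

    def __init__(self, max_size, push):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self.push = push
        self._pending = []

    def add(self, counter):
        # Queue the counter, then flush as soon as the limit is reached.
        self._pending.append(counter)
        if len(self._pending) >= self.max_size:
            self.flush()

    def flush(self):
        # Hand over whatever is queued; do nothing when the buffer is empty.
        if self._pending:
            batch, self._pending = self._pending, []
            self.push(batch)
```

Calling ``flush()`` once more at shutdown avoids losing a partially filled
batch.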
version 0.2.0
-------------
- Added options to enable hdf5 compression
- Added support for multiple jobs by scanning cgroup paths
- Used Inotify events for job list update
- Don't filter packets if no job_id range was specified, especially with zeromq
  backend
- Waited for the cgroup_path folder to be created before scanning the list of
  jobs
- Added procstat for node monitoring through a fictitious job with 0 as its
  identifier
- Took measurements at absolute times rather than after a fixed delay between
  measurements, to avoid drift in the measurement times
- Added a workaround for when a new cgroup is created without any process in
  it (monitoring is suspended until a process is launched)
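
The absolute-time entry above can be illustrated with a short sketch. Sleeping
for a fixed delay after each measurement adds the cost of the measurement to
every cycle, so sample times drift; sleeping until a precomputed deadline does
not. This is a generic illustration of the technique, not colmet's code, and
the function names and the injectable ``clock`` and ``sleep`` parameters are
assumptions made to keep it testable.

```python
import time


def tick_times(start, period, count):
    """Return the absolute deadlines start + k * period for k in 0..count-1."""
    return [start + k * period for k in range(count)]


def sample_at_absolute_times(sample, period, count,
                             clock=time.monotonic, sleep=time.sleep):
    """Call `sample()` once per deadline and return the collected results.

    Each wait is computed from the absolute deadline, so the time spent
    inside `sample()` does not accumulate from one cycle to the next.
    """
    start = clock()
    results = []
    for deadline in tick_times(start, period, count):
        remaining = deadline - clock()
        if remaining > 0:
            sleep(remaining)
        results.append(sample())
    return results
```

With a period of 1 second and a measurement that takes 0.25 seconds, the
samples start at 0, 1, 2 and 3 seconds after the first one, whereas a fixed
1-second sleep after each measurement would start them at 0, 1.25, 2.5 and
3.75 seconds.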
version 0.0.1
-------------
- Conception