Schniepel

Measures the resource utilization of a specific process over time.

This program also measures the utilization and saturation of system-wide resources, which makes it straightforward to put the process-specific metrics into context.

Built for Linux. Windows and Mac OS support might come.

Highlights:

  • High sampling rate: by default, Schniepel uses a sampling interval of 0.5 seconds to make narrow spikes visible.
  • Schniepel is built for monitoring a program subject to process ID changes. This is useful for longevity experiments in which the monitored process occasionally restarts (for instance because of fail-over scenarios).
  • Schniepel can run unsupervised and indefinitely with predictable disk space requirements (it applies an output file rotation and retention policy).
  • Schniepel helps keep data organized: time series data is written into HDF5 files and annotated with relevant metadata such as the program invocation time, system hostname, and Schniepel software version.
  • Schniepel comes with a data plotting tool (separate from the data acquisition program).
  • Schniepel values measurement correctness very highly. The core sampling loop does little work besides the measurement itself: it writes each sample to a queue. A separate process consumes this queue and persists the time series data to disk for later inspection. This keeps the sampling rate predictable during disk write latency spikes, and more generally under backpressure. This matters especially in cloud environments, where we sometimes see fsync latencies of multiple seconds.
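
The decoupling of sampling and persistence described in the last highlight can be sketched as follows. This is a minimal illustration and not Schniepel's actual code: the function names are hypothetical, the "measurement" is just a timestamp, and a thread stands in for the separate consumer process to keep the sketch short (a real implementation would more likely use a multiprocessing queue so that the writer cannot compete with the sampler inside one interpreter).

```python
import queue
import threading
import time


def sample_loop(out_queue, interval_s, num_samples):
    """Take samples at a fixed cadence and hand each one to the queue."""
    next_tick = time.monotonic()
    for _ in range(num_samples):
        # Placeholder measurement: a wall-clock timestamp (assumption).
        out_queue.put({"unixtime": time.time()})
        # Schedule against absolute deadlines so that delays do not accumulate.
        next_tick += interval_s
        time.sleep(max(0.0, next_tick - time.monotonic()))
    out_queue.put(None)  # sentinel: tells the consumer to stop


def consume(in_queue, sink):
    """Drain the queue; a real consumer would write to disk here."""
    while True:
        sample = in_queue.get()
        if sample is None:
            break
        sink.append(sample)  # slow I/O would happen here, off the sampling path


def run_demo(interval_s=0.01, num_samples=5):
    q = queue.Queue()
    persisted = []
    consumer = threading.Thread(target=consume, args=(q, persisted))
    consumer.start()
    sample_loop(q, interval_s, num_samples)
    consumer.join()
    return persisted
```

The sampling loop only ever calls put() on an unbounded queue, so a stalled writer delays persistence but not the next measurement.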

Motivation

This was born out of a need for solid tooling. We started with pidstat from sysstat, launched as pidstat -hud -p $PID 1 1. We found that it does not properly account for multiple threads running in the same process, and that various issues in that regard exist in this program across various versions (see here, here, and here).

The program cpustat open-sourced by Uber has a delightful README about the general measurement methodology and overall seems to be a great tool. However, it seems to be optimized for interactive usage, whereas we were looking for a robust measurement program which can be pointed at a process and then be left unattended for a significant while. There also does not seem to be a decent approach to persisting the collected time series data on disk for later inspection: it seems to be able to write a binary file when using -cpuprofile, but it is a little unclear what this file contains and how to analyze the data.

The program psrecord (which effectively wraps psutil) shares Schniepel's fundamental idea. However, it does not clearly separate the concerns of performing the measurement, persisting the data to disk, and plotting the data, which makes it too error-prone and not production-ready.

Usage

Hints and tricks

Convert an HDF5 file to a CSV file

I recommend de-serializing and re-serializing using pandas. Example one-liner:

python -c 'import sys; import pandas as pd; df = pd.read_hdf(sys.argv[1], key="schniepel_timeseries"); df.to_csv(sys.argv[2], index=False)' messer_20190718_213115.hdf5.0001 /tmp/hdf5-as-csv.csv

Note that this significantly inflates the file size (e.g., from 50 MiB to 300 MiB).

Notes

  • Schniepel tries not to hide measurement uncertainty asymmetrically. For example, you might see it report a CPU utilization slightly larger than 100 % for a single-threaded process. That is simply the measurement error. Other tooling such as sysstat commonly hides measurement uncertainty asymmetrically by capping values at a threshold that they cannot exceed in theory (example).

  • Must be run with root privileges.

  • The value -1 has a special meaning for some metrics (NaN, which cannot be represented properly in HDF5). Example: A disk write latency of -1 ms means that no write happened in the corresponding time interval.

  • The highest meaningful sampling rate is limited by the kernel's timer and bookkeeping system.
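
When analyzing the data, the -1 placeholder mentioned above should be mapped back to NaN before computing statistics, otherwise it skews averages. A minimal sketch using only the standard library; the helper names and the example latency values are made up for illustration, and the mapping should only be applied to metrics for which -1 is documented as a placeholder:

```python
import math


def sentinel_to_nan(values, sentinel=-1):
    """Replace the placeholder value with NaN (hypothetical helper)."""
    return [math.nan if v == sentinel else v for v in values]


def mean_ignoring_nan(values):
    """Mean over the valid samples only; NaN if there are none."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


# Made-up write latencies in ms; -1 marks an interval without any write.
latencies_ms = [2.0, -1, 4.0]
cleaned = sentinel_to_nan(latencies_ms)
mean_latency_ms = mean_ignoring_nan(cleaned)
```

Averaging the raw list instead would mix the placeholder into the result and understate the latency.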

Measurands (columns, and their units)

The quantities intended to be measured.

proc_cpu_id

The ID of the CPU that this process is currently running on.

Momentary state at sampling time.

proc_cpu_util_percent_total

The CPU utilization of the process in percent.

Mean over the past sampling interval.

If the inspected process is known to contain just a single thread, this value can still occasionally exceed 100 % because of measurement error. If the process contains more than one thread, it can go far beyond 100 %.

This is based on the sum of the time spent in user space and in kernel space. For a more fine-grained picture the following two metrics are also available: proc_cpu_util_percent_user, and proc_cpu_util_percent_system.
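
As an illustration of how such a value can be derived, the following sketch divides the CPU time consumed during one sampling interval by the wall-clock length of that interval. It is an assumption about the general method, not a copy of Schniepel's code; the function name and the input format (start/end pairs in seconds) are hypothetical.

```python
def cpu_util_percent(user_s, system_s, wall_s):
    """Mean CPU utilization over one interval, in percent.

    Each argument is a (start, end) pair in seconds: cumulative user-space
    CPU time, cumulative kernel-space CPU time, and wall-clock time.
    """
    wall_delta = wall_s[1] - wall_s[0]
    if wall_delta <= 0:
        raise ValueError("wall-clock interval must be positive")
    user = 100.0 * (user_s[1] - user_s[0]) / wall_delta
    system = 100.0 * (system_s[1] - system_s[0]) / wall_delta
    # No capping at 100 %: the result is reported as measured.
    return {"user": user, "system": system, "total": user + system}
```

For a process that consumed 0.3 s of user time and 0.1 s of system time within a 0.5 s interval, the total comes out at about 80 %. Two busy threads can each contribute up to roughly the full interval, which is why multi-threaded processes can exceed 100 %.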

proc_num_threads

The number of threads in the process.

Momentary state at sampling time.

proc_num_ip_sockets_open

The number of sockets currently open. This includes IPv4 and IPv6 sockets, does not distinguish between TCP and UDP, and counts sockets regardless of their connection state.

Momentary state at sampling time.

proc_num_fds

The number of file descriptors currently opened by this process.

Momentary state at sampling time.

proc_disk_read_throughput_mibps and proc_disk_write_throughput_mibps

The disk I/O throughput of the inspected process, in MiB/s.

Based on the rchar and wchar counters in Linux's /proc/<pid>/io. A highly relevant piece of documentation, here for rchar (emphasis mine):

The number of bytes which this task has caused to be read from storage. This is simply the sum of bytes which this process passed to read() and pread(). It includes things like tty IO and it is unaffected by whether or not actual physical disk IO was required (the read might have been satisfied from pagecache)

Mean over the past sampling interval.
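
A minimal sketch of how a throughput value could be derived from two snapshots of /proc/<pid>/io, under the assumption that the counter difference is simply divided by the interval length. The function names are hypothetical; the snapshots are passed in as text so that the arithmetic can be tried without reading /proc.

```python
MIB = 1024 * 1024


def parse_proc_io(text):
    """Parse 'name: value' lines as found in /proc/<pid>/io."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(":")
        if value.strip():
            counters[name.strip()] = int(value)
    return counters


def throughput_mibps(start_text, end_text, interval_s, counter="rchar"):
    """Mean throughput in MiB/s between two snapshots of one counter."""
    start = parse_proc_io(start_text)[counter]
    end = parse_proc_io(end_text)[counter]
    return (end - start) / MIB / interval_s


# Made-up snapshots taken 0.5 s apart.
before = "rchar: 0\nwchar: 0\n"
after = "rchar: 1048576\nwchar: 524288\n"
read_rate = throughput_mibps(before, after, 0.5, "rchar")
write_rate = throughput_mibps(before, after, 0.5, "wchar")
```

As the quoted documentation points out, these counters include reads served from the page cache, so the result is not a measure of physical disk traffic.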

proc_mem_rss_percent

The resident set size (RSS) of the process as a percentage of the machine's physical memory size.

Momentary state at sampling time.
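
One way to obtain such a value on Linux, sketched here as an assumption rather than as Schniepel's implementation, is to relate the VmRSS field of /proc/<pid>/status to the MemTotal field of /proc/meminfo. Both are reported in kB. The helper names are hypothetical and the file contents are passed in as text.

```python
def read_kb_field(text, field):
    """Return the integer kB value of a 'Field:   123 kB' line."""
    for line in text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    raise KeyError(field)


def rss_percent(status_text, meminfo_text):
    """RSS of the process as a percentage of physical memory."""
    rss_kb = read_kb_field(status_text, "VmRSS")
    total_kb = read_kb_field(meminfo_text, "MemTotal")
    return 100.0 * rss_kb / total_kb


# Made-up excerpts of /proc/<pid>/status and /proc/meminfo.
status_excerpt = "Name:\tdemo\nVmRSS:\t    2048 kB\n"
meminfo_excerpt = "MemTotal:        8192 kB\nMemFree:         4096 kB\n"
percent = rss_percent(status_excerpt, meminfo_excerpt)
```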

proc_ctx_switch_rate_hz

The rate of (voluntary and involuntary) context switches in Hz.

Mean over the past sampling interval.
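
On Linux, per-process context switch counters are exposed in /proc/<pid>/status as voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The sketch below assumes the rate is the difference of their sum between two snapshots, divided by the interval length; the function names are hypothetical and this is not necessarily how Schniepel computes the value.

```python
def total_ctx_switches(status_text):
    """Sum of voluntary and involuntary context switches in one snapshot."""
    total = 0
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"):
            total += int(value)
    return total


def ctx_switch_rate_hz(start_text, end_text, interval_s):
    """Mean context switch rate in Hz between two snapshots."""
    delta = total_ctx_switches(end_text) - total_ctx_switches(start_text)
    return delta / interval_s


# Made-up snapshots of /proc/<pid>/status taken 0.5 s apart.
before = "voluntary_ctxt_switches:\t100\nnonvoluntary_ctxt_switches:\t10\n"
after = "voluntary_ctxt_switches:\t150\nnonvoluntary_ctxt_switches:\t20\n"
rate = ctx_switch_rate_hz(before, after, 0.5)
```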

(list incomplete)

