Schniepel

Measures the resource utilization of a specific process over time.

This program also measures the utilization / saturation of system-wide resources making it straightforward to put the process-specific metrics into context.

Built for Linux. Windows and Mac OS support might come.

Highlights:

High sampling rate: by default, Schniepel uses a sampling interval of 0.5 seconds for making narrow spikes visible.
Schniepel is built for monitoring a program subject to process ID changes. This is useful for longevity experiments when the monitored process occasionally restarts (for instance because of fail-over scenarios).
Schniepel can run unsupervised and infinitely long with predictable disk space requirements (it applies an output file rotation and retention policy).
Schniepel helps keep data organized: time series data is written into HDF5 files and annotated with relevant metadata such as the program invocation time, system hostname, and Schniepel software version.
Schniepel comes with a data plotting tool (separate from the data acquisition program).
Schniepel values measurement correctness very highly. The core sampling loop does little work besides the measurement itself: it writes each sample to a queue. A separate process consumes this queue and persists the time series data to disk, for later inspection. This keeps the sampling rate predictable during disk write latency spikes and, more generally, under backpressure. This matters especially in cloud environments where we sometimes see fsync latencies of multiple seconds.
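
The split between the sampling loop and the persisting process described above can be sketched as follows. This is a minimal illustration and not Schniepel's actual code: the names `sample_loop`, `consume` and `run` are made up, the measurement is a placeholder, and the output is a plain text file instead of HDF5. The `fork` start method is an assumption that fits Linux.

```python
import multiprocessing as mp
import time


def sample_loop(queue, interval_s, n_samples):
    """Take n_samples placeholder measurements and enqueue each one."""
    next_tick = time.monotonic()
    for i in range(n_samples):
        # Placeholder measurement: a real sampler would read /proc here.
        queue.put((time.time(), i))
        # Schedule against absolute ticks so that delays do not accumulate.
        next_tick += interval_s
        time.sleep(max(0.0, next_tick - time.monotonic()))
    queue.put(None)  # sentinel: tells the consumer that sampling is over


def consume(queue, out_path):
    """Drain the queue and persist samples; slow writes only stall this process."""
    with open(out_path, "w") as f:
        while True:
            sample = queue.get()
            if sample is None:
                break
            f.write(f"{sample[0]:.3f},{sample[1]}\n")


def run(out_path, interval_s=0.5, n_samples=4):
    ctx = mp.get_context("fork")  # assumption: Linux, where fork is available
    queue = ctx.Queue()
    writer = ctx.Process(target=consume, args=(queue, out_path))
    writer.start()
    sample_loop(queue, interval_s, n_samples)
    writer.join()
```

Because the sampling loop only ever calls `queue.put()`, a stalled disk delays the writer process but not the next measurement.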

Motivation

This was born out of a need for solid tooling. We started with pidstat from sysstat, launched as pidstat -hud -p $PID 1 1. We found that it does not properly account for multiple threads running in the same process, and that various issues in that regard exist across various versions of this program.

The program cpustat, open-sourced by Uber, has a delightful README about the general measurement methodology and overall seems to be a great tool. However, it seems to be optimized for interactive usage, whereas we were looking for a robust measurement program which can be pointed at a process and then be left unattended for a significant while. There also does not seem to be a decent approach towards persisting the collected time series data on disk for later inspection: it seems to be able to write a binary file when using -cpuprofile, but it is a little unclear what this file contains and how to analyze the data.

The program psrecord (which effectively wraps psutil) follows a similar fundamental idea as Schniepel. However, it does not clearly separate the concerns of performing the measurement itself, persisting the data to disk, and plotting the data, which makes it too error-prone and not production-ready.

Usage

Hints and tricks

Convert an HDF5 file to a CSV file

I recommend de-serializing and re-serializing using pandas. Example one-liner:

python -c 'import sys; import pandas as pd; df = pd.read_hdf(sys.argv[1], key="schniepel_timeseries"); df.to_csv(sys.argv[2], index=False)' messer_20190718_213115.hdf5.0001 /tmp/hdf5-as-csv.csv

Note that this significantly inflates the file size (e.g., from 50 MiB to 300 MiB).

Notes

Schniepel tries to not asymmetrically hide measurement uncertainty. For example, you might see it measure a CPU utilization of slightly more than 100 % for a single-threaded process. That is simply the measurement error. In other tooling such as sysstat it seems to be common practice to asymmetrically hide measurement uncertainty by capping values at a threshold that they cannot exceed in theory.
Must be run with root privileges.
The value -1 has a special meaning for some metrics: it stands in for NaN, which cannot be represented properly in HDF5. Example: a disk write latency of -1 ms means that no write happened in the corresponding time interval.
The highest meaningful sampling rate is limited by the kernel's timer and bookkeeping system.
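
When analyzing the data it can help to map the -1 placeholder back to NaN first, so that it does not skew averages. A minimal sketch using only the standard library follows. The helper names are made up for illustration, and which columns use the -1 convention has to be checked per metric.

```python
import math


def sentinel_to_nan(values, sentinel=-1):
    """Replace the 'no data' placeholder with NaN."""
    return [math.nan if v == sentinel else v for v in values]


def mean_ignoring_nan(values):
    """Average only the intervals in which a value was actually measured."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


# Hypothetical write latencies in ms; no write happened in the second interval.
latencies_ms = sentinel_to_nan([2.0, -1, 4.0])
print(mean_ignoring_nan(latencies_ms))
```

Averaging the raw values instead would pull the result down, because -1 would be counted as a real latency.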

Measurands (columns, and their units)

The quantities intended to be measured.

`proc_cpu_id`

The ID of the CPU that this process is currently running on.

Momentary state at sampling time.

`proc_cpu_util_percent_total`

The CPU utilization of the process in percent.

Mean over the past sampling interval.

If the inspected process is known to contain just a single thread then this can still sometimes be larger than 100 % because of measurement errors. If the process contains more than one thread then this can go far beyond 100 %.

This is based on the sum of the time spent in user space and in kernel space. For a more fine-grained picture the following two metrics are also available: proc_cpu_util_percent_user, and proc_cpu_util_percent_system.
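
The arithmetic behind these three metrics can be sketched as the CPU time consumed between two samples divided by the wall time elapsed between them. The `CpuSample` type and the function name are assumptions made for this illustration and are not taken from Schniepel's source.

```python
from typing import NamedTuple


class CpuSample(NamedTuple):
    wall_s: float    # wall-clock time of the sample, in seconds
    user_s: float    # cumulative CPU time spent in user space, in seconds
    system_s: float  # cumulative CPU time spent in kernel space, in seconds


def cpu_util_percent(prev, cur):
    """Mean CPU utilization over the interval between two samples."""
    dt = cur.wall_s - prev.wall_s
    if dt <= 0:
        raise ValueError("samples must be ordered in time")
    user = 100.0 * (cur.user_s - prev.user_s) / dt
    system = 100.0 * (cur.system_s - prev.system_s) / dt
    # Deliberately not capped at 100: measurement error stays visible.
    return {"user": user, "system": system, "total": user + system}


prev = CpuSample(wall_s=0.0, user_s=1.0, system_s=0.5)
cur = CpuSample(wall_s=0.5, user_s=1.25, system_s=0.75)
print(cpu_util_percent(prev, cur))
```

In this example the process consumed 0.25 s of user time and 0.25 s of system time within 0.5 s of wall time, which is 50 % each and 100 % in total.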

`proc_num_threads`

The number of threads in the process.

Momentary state at sampling time.
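
On Linux, both the thread count and the ID of the CPU that a process last ran on are exposed in /proc/<pid>/stat (fields 20 and 39 according to proc(5)). The sketch below shows one way to read them. It is an illustration under that assumption and not necessarily how Schniepel obtains these values. The function name is made up.

```python
def parse_stat_line(line):
    """Extract a few fields from the content of /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split at the LAST closing parenthesis.
    rest = line[line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field N of proc(5) is rest[N - 3].
    return {
        "num_threads": int(rest[20 - 3]),
        "cpu_id": int(rest[39 - 3]),
    }


def read_stat(pid):
    """Read and parse the stat file of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat_line(f.read())
```

Splitting the line naively on whitespace would shift all field positions for a process whose name contains a space.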

`proc_num_ip_sockets_open`

The number of sockets currently open. This includes IPv4 and IPv6 and does not distinguish between TCP and UDP, and the connection state also does not matter.

Momentary state at sampling time.

`proc_num_fds`

The number of file descriptors currently opened by this process.

Momentary state at sampling time.

`proc_disk_read_throughput_mibps` and `proc_disk_write_throughput_mibps`

The disk I/O throughput of the inspected process, in MiB/s.

Based on the rchar and wchar fields of Linux's /proc/<pid>/io. A highly relevant piece of the kernel documentation about rchar:

The number of bytes which this task has caused to be read from storage. This is simply the sum of bytes which this process passed to read() and pread(). It includes things like tty IO and it is unaffected by whether or not actual physical disk IO was required (the read might have been satisfied from pagecache)

Mean over the past sampling interval.
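
As a rough sketch of this calculation: read the cumulative rchar (or wchar) counter at two points in time, then divide the difference by the elapsed time. The function names are made up for illustration, and the sample text mimics the `key: value` layout of /proc/<pid>/io.

```python
def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def throughput_mibps(prev_bytes, cur_bytes, interval_s):
    """Mean throughput in MiB/s between two cumulative byte counters."""
    return (cur_bytes - prev_bytes) / (1024 * 1024) / interval_s


before = parse_proc_io("rchar: 1048576\nwchar: 0\n")
after = parse_proc_io("rchar: 3145728\nwchar: 524288\n")
print(throughput_mibps(before["rchar"], after["rchar"], interval_s=0.5))
print(throughput_mibps(before["wchar"], after["wchar"], interval_s=0.5))
```

Here 2 MiB were read and 0.5 MiB were written within 0.5 s, which corresponds to 4 MiB/s and 1 MiB/s. As the quoted documentation points out, this counts bytes passed to read() and write() calls, not physical disk I/O.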

`proc_mem_rss_percent`

The resident set size (RSS) of the process as a percentage of the machine's physical memory size.

Momentary state at sampling time.
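
One way to arrive at such a number on Linux is to divide the VmRSS value from /proc/<pid>/status by the MemTotal value from /proc/meminfo, both of which are reported in kB. This is a sketch under that assumption. Schniepel may obtain the values differently, and the function names are made up.

```python
def parse_kb_field(text, name):
    """Return the value of a 'Name:   12345 kB' line as an integer (kB)."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == name:
            return int(value.split()[0])
    raise KeyError(name)


def rss_percent(status_text, meminfo_text):
    """RSS of a process as a percentage of the physical memory size."""
    rss_kb = parse_kb_field(status_text, "VmRSS")
    total_kb = parse_kb_field(meminfo_text, "MemTotal")
    return 100.0 * rss_kb / total_kb


status = "Name:\tfoo\nVmRSS:\t    2048 kB\n"
meminfo = "MemTotal:        8192 kB\nMemFree:         1024 kB\n"
print(rss_percent(status, meminfo))
```

With an RSS of 2048 kB on a machine with 8192 kB of memory this yields 25 %.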

`proc_ctx_switch_rate_hz`

The rate of (voluntary and involuntary) context switches in Hz.

Mean over the past sampling interval.
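
The kernel exposes cumulative context switch counters in /proc/<pid>/status as voluntary_ctxt_switches and nonvoluntary_ctxt_switches. A rate in Hz can be derived from two readings, as sketched below. The function names are made up, and whether Schniepel reads these exact counters is an assumption.

```python
def total_ctx_switches(status_text):
    """Sum the voluntary and involuntary counters from /proc/<pid>/status."""
    wanted = {"voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"}
    total = 0
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            total += int(value)
    return total


def ctx_switch_rate_hz(prev_total, cur_total, interval_s):
    """Mean number of context switches per second over the interval."""
    return (cur_total - prev_total) / interval_s


prev = total_ctx_switches("voluntary_ctxt_switches:\t100\nnonvoluntary_ctxt_switches:\t20\n")
cur = total_ctx_switches("voluntary_ctxt_switches:\t150\nnonvoluntary_ctxt_switches:\t30\n")
print(ctx_switch_rate_hz(prev, cur, interval_s=0.5))
```

In this example 60 context switches happened within 0.5 s, which is a rate of 120 Hz.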

(list incomplete)

Valuable references

External references on the subject matter that I found useful during development.

About system performance measurement, and kernel time bookkeeping:

About disk I/O statistics:

https://www.xaprb.com/blog/2010/01/09/how-linux-iostat-computes-its-results/
https://www.kernel.org/doc/Documentation/iostats.txt
https://blog.serverfault.com/2010/07/06/777852755/ (interpreting iostat output)
https://unix.stackexchange.com/a/462732 (What are merged writes?)
https://stackoverflow.com/a/8512978 (what is %util in iostat?)
https://coderwall.com/p/utc42q/understanding-iostat
https://www.percona.com/doc/percona-toolkit/LATEST/pt-diskstats.html

Others:

https://serverfault.com/a/85481/121951 (about system memory statistics)

Musings about HDF5:

This version: 0.1.0 (Jul 31, 2019)
