A CLI tool that helps capture metrics from the Operating System
Project description
pmeter
A Python tool that measures TCP and UDP network metrics
CSE-603 PDP Project
##### Contributors
Deepika Ghodki, Aman Harsh, Neha Mishra, Jacob Goldverg
Links to Relevant Papers
- Historical Analysis and Real-Time Tuning
- Cheng, Liang and Marsic, Ivan. ‘Java-based Tools for Accurate Bandwidth Measurement of Digital Subscriber Line Networks’. 1 Jan. 2002: 333–344.
- Energy-saving Cross-layer Optimization of Big Data Transfer Based on Historical Log Analysis
- Cross-layer Optimization of Big Data Transfer Throughput and Energy Consumption
- HARP: Predictive Transfer Optimization Based on Historical Analysis and Real-time Probing
The Problem
Currently, the OneDataShare Transfer-Services do not collect or report (to the AWS deployment) the network state they experience. Tools such as "sar" and "ethtool" report metrics like ping, bandwidth, latency, link capacity, RTT, etc. to the user, allowing them to understand bottlenecks in their network.
Metrics we collect
- Kernel Level:
  - active cores
  - CPU frequency
  - energy consumption
  - CPU architecture
- Application Level:
  - pipelining
  - concurrency
  - parallelism
  - chunk size
- Network Level:
  - RTT
  - bandwidth
  - BDP (link capacity * RTT)
  - packet loss rate
  - link capacity
- Data Characteristics:
  - number of files
  - total size of the transfer
  - average file size
  - standard deviation of file sizes
  - file types in the transfer
- End System Resource Usage:
  - % of CPUs used
  - % of NIC used
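As a rough illustration (not the actual pmeter implementation), the kernel-level and end-system metrics above can be probed with psutil; the dictionary keys below are made up for this example and are not the real ODS_Metrics field names.

```python
import psutil

def collect_kernel_metrics():
    """Snapshot of kernel/end-system state via psutil (illustrative keys only)."""
    freq = psutil.cpu_freq()            # current/min/max MHz; may be None on some platforms
    nic = psutil.net_io_counters()      # aggregate NIC byte/packet counters
    return {
        "active_cores": psutil.cpu_count(logical=False),
        "logical_cores": psutil.cpu_count(logical=True),
        "cpu_frequency_mhz": freq.current if freq else None,
        "cpu_percent_used": psutil.cpu_percent(interval=1),
        "nic_bytes_sent": nic.bytes_sent,
        "nic_bytes_recv": nic.bytes_recv,
    }

if __name__ == "__main__":
    print(collect_kernel_metrics())
```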
Solution
We initially explored three solutions and decided that Solution 1 would be sufficient and provide accurate enough metrics.
Solution 1. Write a Python script that the Transfer-Service runs as a cron job to collect the network conditions periodically. The script creates a file containing a formatted metric report; the Transfer-Service then reads and parses that file and sends the metrics to CockroachDB/Prometheus running on the AWS backend.
The current state of the project: we have a Python script that supports kernel and some network-level metrics. The script generates a file in the user's home directory at ~/.pmeter/pmeter_measure.txt; every "row" of this file is a JSON dump of an ODS_Metrics object holding one measurement. The CLI can run for a fixed number of measurements or for a certain amount of time. The Transfer-Service reads and parses the file, appends its own data to each object (file count, types of files, etc.), and then stores the data in InfluxDB/CockroachDB.
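Since every row of ~/.pmeter/pmeter_measure.txt is one JSON dump of an ODS_Metrics object, the read/parse step on the Transfer-Service side can be sketched roughly as follows (an illustrative Python sketch, not the actual Java Transfer-Service code):

```python
import json
from pathlib import Path

# Default location written by the CLI (per the description above).
METRICS_FILE = Path.home() / ".pmeter" / "pmeter_measure.txt"

def read_measurements(path=METRICS_FILE):
    """Yield each non-empty line of the file as a parsed measurement dict.

    Every "row" is assumed to be one JSON dump of an ODS_Metrics object;
    malformed lines are skipped instead of aborting the whole parse.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue

if __name__ == "__main__":
    for measurement in read_measurements():
        print(measurement)
```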
The aggregator service is a publisher to InfluxDB running in the OneDataShare (ODS) VPC; it summarizes and computes the data so we can perform some visualization. We currently generate a graph for one metric (latency), and we will now begin to explore ML.
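For illustration, a latency graph like the one described could be produced with the influxdb 1.x Python client and matplotlib; the database, measurement, and field names here are assumptions, not the aggregator's real schema.

```python
import matplotlib.pyplot as plt
from influxdb import InfluxDBClient

# Assumed database/measurement/field names; the real aggregator schema may differ.
client = InfluxDBClient(host="localhost", port=8086, database="pmeter")
result = client.query("SELECT latency FROM network_metrics WHERE time > now() - 1d")

points = list(result.get_points())
times = [p["time"] for p in points]
latencies = [p["latency"] for p in points]

plt.plot(times, latencies)
plt.xlabel("time")
plt.ylabel("latency (ms)")
plt.title("Latency over the last 24 hours")
plt.show()
```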
Recap
Before exploring ML models, we started by breaking down the problems we are attempting to solve:
- What parameters (concurrency, pipelining, parallelism, and chunk size) are optimal for performing a big data file transfer?
- What network condition is a given host experiencing?
The Data:
The data we are generating is what is commonly referred to as "time-series data". InfluxDB defines it as: "Time series data, also referred to as time-stamped data, is a sequence of data points indexed in time order. Time-stamped data is collected at different points in time." It is essentially data that represents a snapshot of something at a point in time; in our case that something is the kernel/network conditions the operating system is experiencing.
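Concretely, each snapshot becomes one time-stamped point. A minimal sketch of writing such a point with the influxdb 1.x Python client, assuming hypothetical database, measurement, and field names:

```python
from datetime import datetime, timezone
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="pmeter")

# One time-stamped data point per measurement; field names are illustrative.
point = {
    "measurement": "network_metrics",
    "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "fields": {
        "rtt_ms": 12.4,
        "latency_ms": 6.1,
        "bandwidth_mbps": 94.0,
    },
}
client.write_points([point])
```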
Example Graphs
Challenges per Solution
Solution 1: We expect the metrics to be less accurate than a manual implementation inside the Java application. Because UDP/TCP behavior is dynamic, using separate connections (Python sockets vs. Java sockets) will create variability in the measurements. Another source of variability is that measuring from another language only approximates what the Transfer-Service actually experiences, since the Java application is fully virtualized. The benefit of this approach is that Python has many more network-measurement libraries.
- Bandwidth is still only the realizable bandwidth for the ODS Transfer-Service.
- Ping traditionally uses ICMP, which requires elevated permissions, so if the process cannot run an ICMP ping we fall back to a TCP ping, which is less accurate but better than nothing (a short sketch of this fallback appears after this list).
- We are still evaluating the differences between CockroachDB (CDB) and InfluxDB in terms of extrapolating data. We currently fully support both databases and are now attempting to grow the data volume to see how performance holds up.
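A minimal sketch of the TCP-ping fallback mentioned above: time a TCP handshake to a known port instead of sending an ICMP echo, which avoids the elevated permissions ICMP requires (host and port below are placeholders).

```python
import socket
import time

def tcp_ping(host, port=80, timeout=2.0):
    """Estimate RTT by timing a TCP handshake; no ICMP privileges needed."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        return None                                      # unreachable or filtered
    return (time.perf_counter() - start) * 1000.0        # milliseconds

if __name__ == "__main__":
    print(tcp_ping("example.com", 443))
```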
TO-DO
- Run the CLI on a DTN at CCR for 1 week to gather some data.
- Explore various regressions that would let us extrapolate relationships in values.
- Create a set of graphs (with types) that summarize the conditions the host experiences over time.
Libraries to be used per solution
- ping: allows measurement of packet loss and latency.
- psutil: a library that exposes kernel/OS-level metrics.
- statsd: a library that lets us construct concise reports for sending to AWS.
- influxdb: a time-series database that allows us to store data and generate trivial graphs.
Solution 1. tcp-latency, udp-latency, ping, and psutil (exposes CPU and NIC metrics) allow us to compute RTT, bandwidth, and estimated link capacity.
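A short sketch of how these libraries fit together: RTT sampled with tcp-latency and a BDP estimate computed as link capacity × RTT; the link-capacity value is an assumed constant here, not a measured one.

```python
from statistics import mean
from tcp_latency import measure_latency

# RTT samples in milliseconds; measure_latency times TCP connections to the host.
samples = [s for s in measure_latency(host="example.com", port=443, runs=5) if s is not None]
rtt_ms = mean(samples)

# BDP = link capacity * RTT. The capacity below is an assumed 1 Gbit/s link.
link_capacity_bps = 1_000_000_000
bdp_bits = link_capacity_bps * (rtt_ms / 1000.0)
print(f"RTT: {rtt_ms:.2f} ms, BDP: {bdp_bits / 8 / 1024:.1f} KiB")
```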
List of Technologies
Tools: ping, psutil. Technologies: Java, Python, CockroachDB, Prometheus, Grafana.
What we will Accomplish
By the end of the semester we would like the Transfer-Service to fully monitor its network conditions and report them periodically back to the ODS backend. We will use either CockroachDB or Prometheus to store the time-series data, allowing the ODS deployment to optimize transfers based on the papers above. For extra brownie points, we would like to implement a Grafana dashboard so every user can be aware of the network conditions around their transfer.
What we have accomplished
- We have a CLI that captures the kernel and network parameters that the OS exposes to the application layer.
- We have a time-series DB (InfluxDB) which allows us to store and manipulate the time-series data.