
A CLI tool that helps capture metrics from the Operating System

Project description

pmeter

A Python tool that measures TCP and UDP network metrics

CSE-603 PDP Project

Contributors

Deepika Ghodki, Aman Harsh, Neha Mishra, Jacob Goldverg

Links to Relevant Papers

  1. Historical Analysis and Real-Time Tuning
  2. Cheng, Liang and Marsic, Ivan. "Java-based Tools for Accurate Bandwidth Measurement of Digital Subscriber Line Networks." 1 Jan. 2002: 333–344.
  3. Energy-saving Cross-layer Optimization of Big Data Transfer Based on Historical Log Analysis
  4. Cross-layer Optimization of Big Data Transfer Throughput and Energy Consumption
  5. HARP: Predictive Transfer Optimization Based on Historical Analysis and Real-time Probing

The Problem

Currently, the OneDataShare Transfer-Services do not collect or report (to the AWS deployment) the network state they experience. Tools such as "sar" and "ethtool" report metrics such as ping, bandwidth, latency, link capacity, and RTT, which let users understand bottlenecks in their network.

Metrics we collect

  1. Kernel level:
    • Active cores
    • CPU frequency
    • Energy consumption
    • CPU architecture
  2. Application level:
    • Pipelining
    • Concurrency
    • Parallelism
    • Chunk size
  3. Network level:
    • RTT
    • Bandwidth
    • BDP (link capacity * RTT; see the sketch after these lists)
    • Packet loss rate
    • Link capacity
  4. Data characteristics:
    • Number of files
    • Total size of transfer
    • Average file size
    • Standard deviation of file sizes
    • File types in the transfer

End System Resource Usage:

  • % of CPUs used
  • % of NIC used
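To make two of the derived metrics concrete, here is a minimal Python sketch that computes BDP and the file-size statistics; the sample values and variable names are illustrative assumptions, not pmeter's actual code.

```python
import statistics

# BDP = link capacity * RTT. Assumed sample values: 1 Gbit/s link, 40 ms RTT.
link_capacity_bps = 1_000_000_000   # link capacity in bits per second
rtt_s = 0.040                       # round-trip time in seconds
bdp_bits = link_capacity_bps * rtt_s
print(f"BDP: {bdp_bits / 8 / 2**20:.2f} MiB in flight")

# Data characteristics for a hypothetical batch of files (sizes in bytes).
file_sizes = [10_485_760, 52_428_800, 1_048_576, 209_715_200]
print("Number of files:", len(file_sizes))
print("Total transfer size:", sum(file_sizes))
print("Average file size:", statistics.mean(file_sizes))
print("Std deviation of file sizes:", statistics.stdev(file_sizes))
```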

Solution

We initially explored three solutions and decided that solution 1 would be sufficient and provide accurate enough metrics.

Solution 1: Write a Python script that the Transfer-Service runs as a CRON job to collect the network conditions periodically. The script creates a formatted metric-report file; the Transfer-Service then reads/parses that file and sends the data to CockroachDB/Prometheus running on the AWS backend.

The current state of the project: we have a Python script that supports kernel and some network-level metrics. The script generates a file in the user's home directory under ~/.pmeter/pmeter_measure.txt, which stores a JSON dump of the ODS_Metrics object. The CLI can run for a set number of measurements or for a certain amount of time. Every "row" of the file is a new ODS_Metrics object storing a new measurement. The Transfer-Service then parses and cleans up this file, appends its own data to the object (file count, types of files, etc.), and stores the data in InfluxDB/CockroachDB.
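Since every row of the file is a standalone JSON dump, the consuming side can treat it as newline-delimited JSON. Below is a minimal sketch of that parsing step; the field names ("rtt", "latency") are assumptions for illustration, not the actual ODS_Metrics schema.

```python
import json
from pathlib import Path

METRICS_FILE = Path.home() / ".pmeter" / "pmeter_measure.txt"

def read_measurements(path=METRICS_FILE):
    """Yield one dict per line; each line is a JSON dump of an ODS_Metrics object."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:                      # skip blank lines
                yield json.loads(line)

for metrics in read_measurements():
    # Field names here are hypothetical; inspect a real dump for the actual keys.
    print(metrics.get("rtt"), metrics.get("latency"))
```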

The aggregator service is a publisher to InfluxDB, running in the OneDataShare (ODS) VPC, which summarizes and computes the data so we can perform some visualization. We currently have a graph being generated from one metric (latency), and we will now begin to explore ML.
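For context, writing a latency point with the Python influxdb client looks roughly like the following; the host, database name, and tag/field names are placeholders, not our deployment's actual configuration.

```python
from datetime import datetime, timezone
from influxdb import InfluxDBClient

# Placeholder connection details; the real aggregator runs inside the ODS VPC.
client = InfluxDBClient(host="localhost", port=8086, database="ods_metrics")

point = {
    "measurement": "network",
    "tags": {"host": "transfer-node-1"},           # hypothetical tag
    "time": datetime.now(timezone.utc).isoformat(),
    "fields": {"latency_ms": 12.7},                # hypothetical field
}
client.write_points([point])
```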

Recap

Before exploring ML models, we revisited the problems we are attempting to solve:

  1. What parameters (concurrency, pipelining, parallelism, and chunk size) are optimal for performing a big data file transfer?
  2. What network conditions is a given host experiencing?

The Data:

The data we are generating is what is commonly referred to as "time-series data". InfluxDB defines this type of data as: "Time series data, also referred to as time-stamped data, is a sequence of data points indexed in time order. Time-stamped data is data collected at different points in time." It is essentially data that represents a snapshot of something at a moment in time; in our case, that something is the kernel/network conditions the Operating System is experiencing.

Example Graphs

[Example graphs: Latency and CPU Frequency; RTT; RTT Over Time; Example Query]

Challenges per Solution

Solution 1: We expect the metrics to be less accurate than a manual implementation inside the Java application. Since UDP/TCP behavior is dynamic, using separate connections (Python sockets vs. Java sockets) will create variability in the measurements. Another source of variability is that measuring from another language only provides an estimate of the performance the Transfer-Service experiences, as the Java side is completely virtualized. The benefit of this approach is that Python has many more network-measurement libraries.

  1. Bandwidth is still only realizable bandwidth for the ODS Transfer-Service.
  2. Ping traditionally uses ICMP, which requires elevated permissions; if the process cannot send ICMP pings, we fall back to TCP ping, which is less accurate but better than nothing (see the sketch after this list).
  3. We are still observing the difference between CDB and InfluxDB in terms of extrapolating data. We currently fully support both database types and are now load-testing each to see how performance holds up.
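A TCP "ping" simply times a TCP handshake to a known open port. The sketch below is a minimal illustration of the idea, not pmeter's actual implementation; the target host and port are placeholders.

```python
import socket
import time

def tcp_ping(host, port=443, timeout=2.0):
    """Return the TCP connect time in milliseconds (a rough RTT proxy)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0

# Placeholder target; any host with an open TCP port works.
print(f"TCP ping: {tcp_ping('example.com'):.1f} ms")
```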

TO-DO

  1. Run the CLI on a DTN at CCR for 1 week to gather some data.
  2. Explore various regressions that would let us extrapolate relationships between values (see the sketch after this list).
  3. Create a set of graphs (with types) to summarize the conditions the host experiences over time.
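As a starting point for item 2, a simple least-squares fit over one metric is enough to expose a trend. The sketch below fits a line to hypothetical RTT samples with numpy; real input would come from our InfluxDB data.

```python
import numpy as np

# Hypothetical RTT samples (ms) taken at one-minute intervals.
minutes = np.arange(10)
rtt_ms = np.array([40.1, 41.3, 39.8, 42.0, 43.5, 42.9, 44.2, 45.0, 44.7, 46.1])

# Fit a degree-1 polynomial (a line): rtt ~ slope * t + intercept.
slope, intercept = np.polyfit(minutes, rtt_ms, 1)
print(f"RTT trend: {slope:.2f} ms per minute (intercept {intercept:.1f} ms)")
```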

Libraries to be used per solution

  • ping: allows measurement of packet loss and latency.
  • psutil: a library that exposes kernel/OS-level metrics.
  • statsd: a library that allows us to construct concise reports for sending to AWS.
  • influxdb: a client for a time-series database that allows us to store the data and generate simple graphs.

Solution 1: tcp-latency, udp-latency, ping, and psutil (exposes CPU and NIC metrics) allow us to compute RTT, bandwidth, and estimated link capacity.
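To show what psutil exposes at the kernel/NIC level, here is a minimal sketch using calls from psutil's documented API; exactly which of these pmeter records is not shown here.

```python
import psutil

# CPU-side metrics: core count, current frequency, utilization.
print("Active (logical) cores:", psutil.cpu_count())
print("CPU frequency (MHz):", psutil.cpu_freq().current)  # may be None on some platforms
print("CPU utilization (%):", psutil.cpu_percent(interval=1))

# NIC-side counters, per interface: bytes sent and received.
for nic, counters in psutil.net_io_counters(pernic=True).items():
    print(nic, counters.bytes_sent, counters.bytes_recv)
```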

List of Technologies

Tools: ping, psutil
Technologies: Java, Python, CockroachDB, Prometheus, Grafana

What we will Accomplish

By the end of the semester we would like the Transfer-Service to fully monitor its network conditions and report them periodically back to the ODS backend. We will use either CockroachDB or Prometheus to store the time-series data, allowing the ODS deployment to optimize transfers based on the papers above. For extra brownie points, we would like to implement a Grafana dashboard so every user can be aware of the network conditions around their transfer.

What we have accomplished

  1. We have a CLI that captures the kernel and network parameters that the OS exposes to the application layer.
  2. We have a time-series DB (InfluxDB) which allows us to store and manipulate the time-series data.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pmeter_ods-1.0.17.tar.gz (1.4 MB)

Uploaded Source

Built Distribution

pmeter_ods-1.0.17-py3-none-any.whl (12.0 kB)

Uploaded Python 3

File details

Details for the file pmeter_ods-1.0.17.tar.gz.

File metadata

  • Download URL: pmeter_ods-1.0.17.tar.gz
  • Upload date:
  • Size: 1.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for pmeter_ods-1.0.17.tar.gz

  • SHA256: c657f1c05fa7555070d4b6e85b35b502826799c2c43301bac567f46366700bfc
  • MD5: cb2c96607eabd032f292f88b354ee1ca
  • BLAKE2b-256: c70f8d38ef24044e4355012e6e94c48fa1e0600261740196aea15f4007cfb24c


File details

Details for the file pmeter_ods-1.0.17-py3-none-any.whl.

File metadata

  • Download URL: pmeter_ods-1.0.17-py3-none-any.whl
  • Upload date:
  • Size: 12.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for pmeter_ods-1.0.17-py3-none-any.whl

  • SHA256: 7a7526b6bd22c028a3268e04cec99c6712d0c7e4dbffc132a4776110c7759c90
  • MD5: aeee3a252de3f6fda4a82fd4b09b5c03
  • BLAKE2b-256: e47d2aa7349113d3bbfc9a7b5e7512053f74e82bbfe976a1ac6cf241b31c8a55

