Datacentertracesdatasets-cli is an open-source tool designed to facilitate the usage of Datacentertracesdatasets package from a command line interface



datacentertracesdatasets-cli: Command-line interface for the datacentertracesdatasets package

Description

Datacentertracesdatasets-cli is a command-line interface for the datacentertracesdatasets package, which provides access to three data center trace datasets: Alibaba2018, Azure_v2 and Google2019.

This command-line interface is OS independent and can be easily installed and used.

Available original datasets

Public datasets organized for machine learning or artificial intelligence usage. The following datasets can be used:

Alibaba 2018 machine usage

Processed from the original files found at: https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2018

This repository's machine usage dataset includes the following columns:

+--------------------------------------------------------------------------------------------+
| Field            | Type       | Label | Comment                                            |
+--------------------------------------------------------------------------------------------+
| cpu_util_percent | bigint     |       | [0, 100]                                           |
| mem_util_percent | bigint     |       | [0, 100]                                           |
| net_in           | double     |       | normalized incoming network traffic, [0, 100]      |
| net_out          | double     |       | normalized outgoing network traffic, [0, 100]      |
| disk_io_percent  | double     |       | [0, 100], abnormal values are of -1 or 101         |
+--------------------------------------------------------------------------------------------+

Three sampled versions are provided: the average value of each column grouped every 10 seconds (the original granularity), plus downsampled versions at 30 seconds and 300 seconds. Every column reports the average utilization of the whole data center.
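The 30-second and 300-second versions are plain average downsamples of the 10-second trace. A minimal sketch of that aggregation with pandas (the values below are illustrative, not taken from the actual trace):

```python
import pandas as pd

# Illustrative 10-second machine usage rows with two of the columns above.
df = pd.DataFrame({
    "cpu_util_percent": [20, 22, 24, 30, 28, 32],
    "mem_util_percent": [50, 50, 52, 54, 54, 56],
})

# Average every 3 consecutive rows: 3 x 10 s = one 30-second bucket.
df_30s = df.groupby(df.index // 3).mean()
print(df_30s["cpu_util_percent"].tolist())  # [22.0, 30.0]
```

The same grouping with `df.index // 30` would produce the 300-second version.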

Figures

Some figures were generated from these datasets:

Figure: CPU utilization sampled every 10 seconds
Figure: Memory utilization sampled every 300 seconds
Figure: Net in sampled every 300 seconds
Figure: Net out sampled every 300 seconds
Figure: Disk I/O sampled every 300 seconds

Google 2019 instance usage

Processed from the original dataset and queried using BigQuery. More information is available at: https://research.google/tools/datasets/google-cluster-workload-traces-2019/

This repository's instance usage dataset includes the following columns:

+--------------------------------------------------------------------------------------------+
| Field                         | Type       | Label | Comment                               |
+--------------------------------------------------------------------------------------------+
| avg_cpu                       | double     |       | [0, 1]                                |
| avg_mem                       | double     |       | [0, 1]                                |
| avg_assigned_mem              | double     |       | [0, 1]                                |
| avg_cycles_per_instruction    | double     |       | [0, _]                                |
+--------------------------------------------------------------------------------------------+

One sampled version is provided: the average value of each column grouped every 300 seconds (the original granularity). Every column reports the average utilization of the whole data center.
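The value ranges in the Comment column above can serve as a quick sanity check after loading a trace; a sketch with made-up values:

```python
import pandas as pd

# Made-up slice with the Google 2019 columns listed above.
df = pd.DataFrame({
    "avg_cpu": [0.31, 0.28, 0.35],
    "avg_mem": [0.45, 0.47, 0.44],
    "avg_assigned_mem": [0.50, 0.52, 0.49],
    "avg_cycles_per_instruction": [1.2, 1.4, 1.1],
})

# avg_cpu, avg_mem and avg_assigned_mem are normalized to [0, 1];
# cycles per instruction is only bounded below by 0.
normalized = ["avg_cpu", "avg_mem", "avg_assigned_mem"]
assert df[normalized].ge(0).all().all() and df[normalized].le(1).all().all()
assert df["avg_cycles_per_instruction"].ge(0).all()
```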

Figures

Some figures were generated from these datasets:

Figure: CPU usage, day 26, sampled every 300 seconds
Figure: Memory usage, day 26, sampled every 300 seconds
Figure: Assigned memory, day 26, sampled every 300 seconds
Figure: Cycles per instruction, day 26, sampled every 300 seconds

Azure v2 virtual machine workload

Processed from the original dataset. More information available at: https://github.com/Azure/AzurePublicDataset/blob/master/AzurePublicDatasetV2.md

This repository's virtual machine workload dataset includes the following columns:

+--------------------------------------------------------------------------------------------+
| Field                         | Type       | Label | Comment                               |
+--------------------------------------------------------------------------------------------+
| cpu_usage                     | double     |       | [0, _]                                |
| assigned_mem                  | double     |       | [0, _]                                |
+--------------------------------------------------------------------------------------------+

One sampled version is provided: the sum of each column grouped every 300 seconds (the original granularity). cpu_usage is computed from the core_count usage of each virtual machine. Every column reports the total consumption of all virtual machines in the data center. Each file comes in two versions: one including a timestamp column (from 0 to 2591700, in 300-second steps) and one without it.
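Since timestamps run from 0 to 2591700 in 300-second steps, the trace covers 8640 rows, i.e. exactly 30 days, and the timestamp column can be rebuilt for the file version that omits it. A quick check:

```python
# Timestamps run from 0 to 2591700 inclusive, in 300-second steps.
timestamps = list(range(0, 2591700 + 300, 300))

assert len(timestamps) == 8640
assert len(timestamps) * 300 == 30 * 24 * 3600  # exactly 30 days of samples
assert timestamps[-1] == 2591700
```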

Figures

Some figures were generated from these datasets:

Figure: Total CPU usage by virtual machines, sampled every 300 seconds
Figure: Total assigned memory for virtual machines, sampled every 300 seconds

Available synthetic datasets

Moreover, for dataset augmentation and deep learning purposes, the datasets have been augmented using models trained with TimeGAN (https://github.com/DamianUS/timegan-pytorch).

The augmented datasets are composed of time series of length 288, sampled every 300 seconds, which corresponds to one operational day of the data center.
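288 samples at a 300-second stride cover 288 x 300 = 86400 seconds, i.e. one day. A longer original trace can be cut into windows of that shape for TimeGAN-style training; a sketch with random data standing in for a real trace:

```python
import numpy as np

stride_seconds = 300
window = 24 * 3600 // stride_seconds  # 288 samples per operational day
assert window == 288

# Random data standing in for a multivariate trace: 5 days, 4 metrics.
trace = np.random.rand(window * 5, 4)

# Cut into non-overlapping one-day windows of shape (days, 288, metrics).
days = trace.reshape(-1, window, trace.shape[1])
assert days.shape == (5, 288, 4)
```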

Installation

To install the tool in your local environment, run the following command:

pip install datacentertracesdatasets-cli

Basic usage examples:

Some examples for obtaining the datasets are shown below.

  1. Full Azure_V2 original dataset sampled at 300 seconds:

    datacentertracesdatasets-cli -trace azure_v2
    

    The resulting file will be found at:

  2. Full Alibaba2018 original dataset sampled at 10 seconds:

    datacentertracesdatasets-cli -trace alibaba2018 -stride 10
    
  3. A synthetic sample of 1 day for Google2019 sampled at 300 seconds:

    datacentertracesdatasets-cli -trace google2019 -generation synthetic
    
  4. A synthetic sample of 1 day for Google2019 sampled at 300 seconds providing a filename:

    datacentertracesdatasets-cli -trace google2019 -generation synthetic -file my_dataset.csv
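The CLI can also be driven from a script; a sketch using the flags shown in the examples above (the tool is only invoked if it is actually installed on PATH):

```python
import shutil
import subprocess

# Flags as documented in the usage examples; the output name is our choice.
cmd = [
    "datacentertracesdatasets-cli",
    "-trace", "google2019",
    "-generation", "synthetic",
    "-file", "my_dataset.csv",
]

# Invoke the CLI only when it is available on PATH.
if shutil.which(cmd[0]) is not None:
    subprocess.run(cmd, check=True)
```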
    

License

Datacentertracesdatasets-cli is free and open-source software licensed under the MIT license.

Acknowledgements

Projects PID2021-122208OB-I00, PROYEXCEL_00286 and TED2021-132695B-I00, funded by MCIN / AEI / 10.13039 / 501100011033, by the Andalusian Regional Government, and by the European Union - NextGenerationEU.
