Datacentertracesdatasets-cli is an open-source tool designed to facilitate the use of the datacentertracesdatasets package from a command-line interface.
Project description
Datacentertracesdatasets-cli: Command-line interface for the datacentertracesdatasets package
Description
Datacentertracesdatasets-cli is a command-line tool that acts as an interface to the datacentertracesdatasets package, which provides access to three dataset traces: Alibaba2018, Azure_v2 and Google2019.
This command-line interface is OS independent and can be easily installed and used.
Available original datasets
Public datasets organized for machine learning or artificial intelligence usage. The following datasets are available:
Alibaba 2018 machine usage
Processed from the original files found at: https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2018
This repository's dataset of machine usage includes the following columns:
+--------------------------------------------------------------------------------------------+
| Field | Type | Label | Comment |
+--------------------------------------------------------------------------------------------+
| cpu_util_percent | bigint | | [0, 100] |
| mem_util_percent | bigint | | [0, 100] |
| net_in | double | | normalized incoming network traffic, [0, 100] |
| net_out | double | | normalized outgoing network traffic, [0, 100] |
| disk_io_percent | double | | [0, 100], abnormal values are -1 or 101 |
+--------------------------------------------------------------------------------------------+
Three sampled datasets are provided: the original trace with the average value of each column grouped every 10 seconds, plus versions downsampled to 30 and 300 seconds. Every column reports the average utilization of the whole data center.
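Once a trace has been exported with the CLI (see the usage examples below), it can be inspected with standard tools. The following is a minimal pandas sketch, assuming the Alibaba2018 export was saved as alibaba2018.csv (a hypothetical filename) and keeps the column names from the table above:

    # Minimal sketch: inspect an exported Alibaba2018 machine-usage trace.
    # Assumptions: the CLI output was saved as alibaba2018.csv and the CSV
    # header uses the column names from the table above.
    import pandas as pd

    df = pd.read_csv("alibaba2018.csv")
    cols = ["cpu_util_percent", "mem_util_percent", "net_in", "net_out", "disk_io_percent"]

    # All columns are documented to lie in [0, 100]; disk_io_percent may also
    # contain the abnormal values -1 and 101.
    print(df[cols].describe())
    print((~df["disk_io_percent"].between(0, 100)).sum(), "abnormal disk_io_percent rows")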
Figures
Some figures were generated from these datasets:
- CPU utilization sampled every 10 seconds
- Memory utilization sampled every 300 seconds
- Net in sampled every 300 seconds
- Net out sampled every 300 seconds
- Disk I/O sampled every 300 seconds
Google 2019 instance usage
Processed from the original dataset and queried using BigQuery. More information available at: https://research.google/tools/datasets/google-cluster-workload-traces-2019/
This repository's dataset of instance usage includes the following columns:
+--------------------------------------------------------------------------------------------+
| Field | Type | Label | Comment |
+--------------------------------------------------------------------------------------------+
| avg_cpu | double | | [0, 1] |
| avg_mem | double | | [0, 1] |
| avg_assigned_mem | double | | [0, 1] |
| avg_cycles_per_instruction | double | | [0, _] |
+--------------------------------------------------------------------------------------------+
One sampled dataset is provided: the original trace with the average value of each column grouped every 300 seconds. Every column reports the average utilization of the whole data center.
Figures
Some figures were generated from these datasets:
- CPU usage, day 26, sampled every 300 seconds
- Memory usage, day 26, sampled every 300 seconds
- Assigned memory, day 26, sampled every 300 seconds
- Cycles per instruction, day 26, sampled every 300 seconds
Azure v2 virtual machine workload
Processed from the original dataset. More information available at: https://github.com/Azure/AzurePublicDataset/blob/master/AzurePublicDatasetV2.md
This repository's dataset of instance usage includes the following columns:
+--------------------------------------------------------------------------------------------+
| Field | Type | Label | Comment |
+--------------------------------------------------------------------------------------------+
| cpu_usage | double | | [0, _] |
| assigned_mem | double | | [0, _] |
+--------------------------------------------------------------------------------------------+
One sampled dataset is provided: the original trace with the sum of each column grouped every 300 seconds. For computing cpu_usage, we used the core_count usage of each virtual machine. Every column reports the total consumption of all the data center's virtual machines. Each file comes in two versions: one with a timestamp column (from 0 to 2591700, in 300-second steps) and one without.
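For the timestamped version, the timestamp column can be turned into an elapsed-time index. Below is a minimal sketch, assuming the export was saved as azure_v2.csv (a hypothetical filename) and the column is literally named timestamp:

    # Minimal sketch: index the Azure_v2 trace by elapsed time.
    # Assumptions: the export is named azure_v2.csv and its timestamp column is
    # called "timestamp" (0 to 2591700 seconds in 300-second steps).
    import pandas as pd

    df = pd.read_csv("azure_v2.csv")
    df.index = pd.to_timedelta(df["timestamp"], unit="s")
    df = df.drop(columns=["timestamp"])

    # (2591700 / 300) + 1 = 8640 rows, i.e. 30 days of 300-second samples.
    print(len(df), "rows, last sample at", df.index.max())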
Figures
Some figures were generated from these datasets:
- CPU total usage by virtual machines sampled every 300 seconds
- Total assigned memory for virtual machines sampled every 300 seconds
Available synthetic datasets
Moreover, for dataset augmentation and deep learning purposes, the datasets have been augmented using models trained with TimeGAN (https://github.com/DamianUS/timegan-pytorch).
The augmented datasets are composed of time series of length 288, sampled every 300 seconds, which corresponds to one operational day of the data center (288 × 300 s = 86,400 s).
Installation
To install the tool in your local environment, just run the following command:
pip install datacentertracesdatasets-cli
Basic usage examples
Some examples of how to obtain the datasets are shown below.
- Full Azure_V2 original dataset sampled at 300 seconds:
      datacentertracesdatasets-cli -trace azure_v2
  The resulting file will be found at:
- Full Alibaba2018 original dataset sampled at 10 seconds:
      datacentertracesdatasets-cli -trace alibaba2018 -stride 10
- A synthetic sample of 1 day for Google2019 sampled at 300 seconds:
      datacentertracesdatasets-cli -trace google2019 -generation synthetic
- A synthetic sample of 1 day for Google2019 sampled at 300 seconds, providing a filename:
      datacentertracesdatasets-cli -trace google2019 -generation synthetic -file my_dataset.csv
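As a quick check of the last example, the exported my_dataset.csv can be loaded with pandas. This is a minimal sketch, assuming the synthetic Google2019 CSV keeps the column names from the Google 2019 table above:

    # Minimal sketch: validate the synthetic Google2019 sample produced above.
    # Assumptions: the file is the my_dataset.csv written by the last command and
    # it keeps the column names from the Google 2019 table (avg_cpu, avg_mem, ...).
    import pandas as pd

    df = pd.read_csv("my_dataset.csv")

    # A synthetic sample covers one operational day: 288 rows at a 300-second stride.
    print(len(df), "rows (288 expected for one day)")

    # avg_cpu, avg_mem and avg_assigned_mem are documented to lie in [0, 1].
    for col in ["avg_cpu", "avg_mem", "avg_assigned_mem"]:
        print(col, "range:", df[col].min(), "-", df[col].max())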
License
Datacentertracesdatasets-cli is free and open-source software licensed under the MIT license.
Acknowledgements
Projects PID2021-122208OB-I00, PROYEXCEL_00286 and TED2021-132695B-I00, funded by MCIN/AEI/10.13039/501100011033, by the Andalusian Regional Government, and by the European Union - NextGenerationEU.
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file datacentertracesdatasets-cli-1.0.tar.gz.
File metadata
- Download URL: datacentertracesdatasets-cli-1.0.tar.gz
- Upload date:
- Size: 6.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4fd139645cb12e2ad1cf8ad835e3cea43817460312b1f45f3c43e772f086e976
MD5 | 45723eda5d5c95218b9156678bae5d34
BLAKE2b-256 | 889ad535504dd7584dbc5a5febc3b4990a64a9e7db0cea16543a4a03ed98c8c7
File details
Details for the file datacentertracesdatasets_cli-1.0-py3-none-any.whl.
File metadata
- Download URL: datacentertracesdatasets_cli-1.0-py3-none-any.whl
- Upload date:
- Size: 6.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 397e86f02c8c841df8458a43230fd23083ee51f145255b594355927c9601b061
MD5 | 34a9f6498cffc19719ba1f6275888dbc
BLAKE2b-256 | 80351278c667162c98005fb0d7e4dd1f128e1c788488a75069fee1be941653fe