Data center traces for machine learning tasks

Project description

DataCenter-Traces-Datasets

A pip package that makes available the datasets published at https://github.com/alejandrofdez-us/DataCenter-Traces-Datasets. Please check that repository for a deeper understanding.

Alibaba 2018 machine usage

Processed from the original files found at: https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2018

The machine usage dataset in this repository includes the following columns:

+--------------------------------------------------------------------------------------------+
| Field            | Type       | Label | Comment                                            |
+--------------------------------------------------------------------------------------------+
| cpu_util_percent | bigint     |       | [0, 100]                                           |
| mem_util_percent | bigint     |       | [0, 100]                                           |
| net_in           | double     |       | normalized incoming network traffic, [0, 100]      |
| net_out          | double     |       | normalized outgoing network traffic, [0, 100]      |
| disk_io_percent  | double     |       | [0, 100]; abnormal values are -1 or 101            |
+--------------------------------------------------------------------------------------------+

Three sampled datasets are provided: the original, with the average value of each column grouped every 10 seconds, plus versions downsampled to 30 and 300 seconds. Every column contains the average utilization of the whole data center.
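
The downsampled versions can be reproduced from the 10-second trace with pandas. A minimal sketch, assuming the trace is held in a DataFrame with a time-based index (the synthetic df_10s below is a stand-in, not the published data):

import numpy as np
import pandas as pd

# Stand-in for the original 10-second trace: 8 days of synthetic values with
# the same columns as the table above.
n_rows = 8 * 24 * 360  # 8 days, one row per 10 seconds
rng = np.random.default_rng(0)
df_10s = pd.DataFrame(
    rng.uniform(0, 100, size=(n_rows, 5)),
    columns=["cpu_util_percent", "mem_util_percent", "net_in", "net_out", "disk_io_percent"],
    index=pd.timedelta_range(start="0s", periods=n_rows, freq="10s"),
)

# Downsample by averaging every 30- and 300-second window.
df_30s = df_10s.resample("30s").mean()
df_300s = df_10s.resample("300s").mean()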

Figures

The following figures were generated from these datasets:

  • Figure: CPU utilization sampled every 10 seconds
  • Figure: Memory utilization sampled every 300 seconds
  • Figure: Incoming network traffic sampled every 300 seconds
  • Figure: Outgoing network traffic sampled every 300 seconds
  • Figure: Disk I/O sampled every 300 seconds

Google 2019 instance usage

Processed from the original dataset and queried using BigQuery. More information available at: https://research.google/tools/datasets/google-cluster-workload-traces-2019/

The instance usage dataset in this repository includes the following columns:

+--------------------------------------------------------------------------------------------+
| Field                         | Type       | Label | Comment                               |
+--------------------------------------------------------------------------------------------+
| cpu_util_percent              | double     |       | [0, 100]                              |
| mem_util_percent              | double     |       | [0, 100]                              |
| assigned_mem_percent          | double     |       | [0, 100]                              |
| avg_cycles_per_instruction    | double     |       | [0, _]                                |
+--------------------------------------------------------------------------------------------+

One sampled dataset is provided: the average value of each column grouped every 300 seconds, as in the original. Every column contains the average utilization of the whole data center.

Figures

The following figures were generated from these datasets:

  • Figure: CPU usage on day 26, sampled every 300 seconds
  • Figure: Memory usage on day 26, sampled every 300 seconds
  • Figure: Assigned memory on day 26, sampled every 300 seconds
  • Figure: Cycles per instruction on day 26, sampled every 300 seconds

Azure v2 virtual machine workload

Processed from the original dataset. More information available at: https://github.com/Azure/AzurePublicDataset/blob/master/AzurePublicDatasetV2.md

The virtual machine workload dataset in this repository includes the following columns:

+--------------------------------------------------------------------------------------------+
| Field                         | Type       | Label | Comment                               |
+--------------------------------------------------------------------------------------------+
| cpu_usage                     | double     |       | [0, _]                                |
| assigned_mem                  | double     |       | [0, _]                                |
+--------------------------------------------------------------------------------------------+

One sampled dataset is provided: the sum of each column grouped every 300 seconds, as in the original. Every column contains the total consumption of all virtual machines in the data center.

Figures

The following figures were generated from these datasets:

  • Figure: Total CPU usage of all virtual machines, sampled every 300 seconds
  • Figure: Total assigned memory of all virtual machines, sampled every 300 seconds

Installation

pip install datacentertracesdatasets
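
After installation, a quick import check confirms the package is available (importing the loadtraces module is enough):

python -c "from datacentertracesdatasets import loadtraces"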

Usage

To load the original Alibaba 2018 machine usage trace, with the mean usage of all machines at each timestamp (8 days, 10-second timestep), as a Pandas DataFrame:

from datacentertracesdatasets import loadtraces
alibaba_2018_original_machine_usage_df = loadtraces.get_trace(trace_name='alibaba2018', trace_type='machine_usage', stride_seconds=10)
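
The returned DataFrame can then be inspected as usual; for example (the comments are only orientative, the exact output depends on the trace):

print(alibaba_2018_original_machine_usage_df.shape)       # one row per 10-second timestamp
print(alibaba_2018_original_machine_usage_df.columns)     # the five columns from the table above
print(alibaba_2018_original_machine_usage_df.describe())  # per-column summary statistics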

If a NumPy ndarray is needed instead of a Pandas DataFrame, the format parameter can be used:

azure_v2_machine_usage_ndarray = loadtraces.get_trace(trace_name='azure_v2', trace_type='machine_usage', stride_seconds=300, format='ndarray')

Or, for Google 2019 machine usage:

google_2019_machine_usage_ndarray = loadtraces.get_trace(trace_name='google2019', trace_type='machine_usage', stride_seconds=300, format='ndarray')

In addition to the original Alibaba 2018 machine usage dataset, which has a 10-second timestep, two downsampled versions with 30- and 300-second timesteps are provided; they can be retrieved using the stride_seconds argument:

alibaba_2018_machine_usage_300_timestep_df = loadtraces.get_trace(trace_name='alibaba2018', trace_type='machine_usage', stride_seconds=300)
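
For instance, the three strides can be compared at a glance with a short loop (a sketch using get_trace exactly as above):

for stride_seconds in (10, 30, 300):
    df = loadtraces.get_trace(trace_name='alibaba2018', trace_type='machine_usage', stride_seconds=stride_seconds)
    print(stride_seconds, df.shape)  # fewer rows as the stride grows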

Dataset metadata

The dataset structure and metadata can be retrieved with the get_dataset_info function:

dataset_info = loadtraces.get_dataset_info(trace_name='alibaba2018', trace_type='machine_usage', stride_seconds=300)

Which returns:

dataset_info = {
                "timestamp_frequency_secs": 300,
                "column_config": {
                    "cpu_util_percent": {
                        "column_index": 0,
                        "y_axis_min": 0,
                        "y_axis_max": 100
                    },
                    "mem_util_percent": {
                        "column_index": 1,
                        "y_axis_min": 0,
                        "y_axis_max": 100
                    },
                    "net_in": {
                        "column_index": 2,
                        "y_axis_min": 0,
                        "y_axis_max": 100
                    },
                    "net_out": {
                        "column_index": 3,
                        "y_axis_min": 0,
                        "y_axis_max": 100
                    },
                    "disk_io_percent": {
                        "column_index": 4,
                        "y_axis_min": 0,
                        "y_axis_max": 100
                    }
                },
                "metadata": {
                    "fields": {
                        "cpu_util_percent": {
                            "type": "numerical",
                            "subtype": "float"
                        },
                        "mem_util_percent": {
                            "type": "numerical",
                            "subtype": "float"
                        },
                        "net_in": {
                            "type": "numerical",
                            "subtype": "float"
                        },
                        "net_out": {
                            "type": "numerical",
                            "subtype": "float"
                        },
                        "disk_io_percent": {
                            "type": "numerical",
                            "subtype": "float"
                        }
                    }
                }
            }
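
This metadata is enough to drive simple visualizations. A minimal sketch with matplotlib (matplotlib is an assumption of this example, not a dependency of the package):

import matplotlib.pyplot as plt

from datacentertracesdatasets import loadtraces

df = loadtraces.get_trace(trace_name='alibaba2018', trace_type='machine_usage', stride_seconds=300)
dataset_info = loadtraces.get_dataset_info(trace_name='alibaba2018', trace_type='machine_usage', stride_seconds=300)

# One subplot per column, with y-axis limits taken from column_config.
column_config = dataset_info['column_config']
fig, axes = plt.subplots(len(column_config), 1, figsize=(10, 12), sharex=True)
for ax, (column_name, config) in zip(axes, column_config.items()):
    ax.plot(df.iloc[:, config['column_index']])
    ax.set_ylim(config['y_axis_min'], config['y_axis_max'])
    ax.set_title(column_name)
fig.tight_layout()
plt.show()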

Currently, only the Alibaba 2018, Google 2019, and Azure v2 (virtual) machine usage traces are available. In the future, we plan to add the following:

  • Alibaba's 2018 batch_task workload trace.
  • Google's 2019 batch_task workload trace.

