Skip to main content

A Toolkit for Multivariate Time Series Data

Project description

MVTS Data Toolkit

A Toolkit for Pre-processing Multivariate Time Series Data

  • Title: MVTS Data Toolkit: A Toolkit for Pre-processing Multivariate Time Series Data
  • Journal: SoftwareX Journal (Elsevier) -- [under-review]
  • Authors: Azim Ahmadzadeh >, Kankana Sinha >, Berkay Aydin >, Rafal A. Angryk >
  • Demo Author: Azim Ahmadzadeh
  • Last Modified: Jan 24, 2020

MVTS_Date_Toolkit Icon

Abstract: We developed a domain-independent Python package to facilitate the preprocessing routines required in preparation of any multi-class, multivariate time series data. It provides a comprehensive set of 48 statistical features for extracting the important characteristics of time series. The feature extraction process is automated in a sequential and parallel fashion, and is supplemented with an extensive summary report about the data. Using other modules, different data normalization methods and imputation are at users' disposal. To cater the class-imbalance issue, that is often intrinsic to real-world datasets, a set of generic but user-friendly, sampling methods are also developed.

This package provides:

  • Feature Collection: A collection of 48 statistical features useful for analysis of time series,
  • Feature Extraction: An automated feature-extraction process, with both parallel and sequential execution capabilities,
  • Visualization: Quick and easy visualization for analysis of the extracted features,
  • Data Analysis: A quick analysis of the mvts data and the extracted features, in tabular and illustrative modes,
  • Normalization: A set of data transformation tools for normalization of the extracted features,
  • Sampling: A set of generic methods to provide an array of undersampling and oversampling remedies for balancing the class-imbalance datasets.

PyPI license PyPI license PyPI license

Requirements


Try it online

Click on the badge below to try the demo provided in the notebook demo.ipynb, online:

Binder


Install it from PyPI

You can install this package, directly from Python Package Index (PyPI), using pip as follows:

pip install mvtsdata-toolkit

PyPI license


See Documentation

Check out the documentation of the project here:

PyPI license

(http://dmlab.cs.gsu.edu/docs/mvtsdata_toolkit/)


Data Rules:

MVTS Files

It is assumed that the input dataset is a collection of multivariate time series (mvts), following these assumptions:

  1. Each mvts is stored in a tab-delimited, csv file. Each column represents either the time series or some metadata such as timestamp. An mvts data with t time series and k metadata columns, each of length d, has a dimension of d * (t + k)).

  2. File names can also be used to have some metadata encoded using a tag followed by [], for each piece of info. The tag indicates what that piece of info is about, and the actual information should be stored inside the proceeding square brackets. For instance, A_id[123]_lab[1].csv indicates that this mvts is assigned the id 123 and the label 1. If tags are used, the metadata will be extracted and added to the extracted features automatically. To learn more about how the tags can be used see the documentation in features.feature_extractor.py .

  3. If the embedded values contain paired braces within [], (e.g. for id, id[123[001]]), then the metadata extractor would still be able to extract the info correctly, however for unpaired braces (e.g. for id, id[123[001]) it will raise an exception.


Main Components:


Demo

The Jupyer notebook demo is designed to give a tour of the main functionalities of MVTS Data Toolkit. Users can either click on the binder badge and run it online, or clone this project and run it on their local machine.

A dataset of 2000 mvts files can be downloaded within the steps of this demo.


Example Usage

In following examples, the string '/PATH/TO/CONFIG.YML' points to the user's configuration file.


Data Analysis

This package allows analysis of both raw mvts data and the extracted features.

Using mvts_data_analysis module users can easily get a glimpse of their raw data.

from data_analysis.mvts_data_analysis import MVTSDataAnalysis
mda = MVTSDataAnalysis('/PATH/TO/CONFIG.YML')
mda.compute_summary(first_k=50,
                    params_name=['TOTUSJH', 'TOTBSQ', 'TOTPOT'])

Then, mda.print_stat_of_directory() gives the size of the data, in total and on average, and mda.summary returns a dataframe with several statistics on each of the time series. The statistics are Val-Count, Null-Count, mean, min, 25th (percentile), 50th (= median), 75th, and max.

For large datasets, it is recommended to use the parallel version of this method, as follows:

mda.compute_summary_in_parallel(first_k=50,
                                n_jobs=4,
                                params_name=['TOTUSJH', 'TOTBSQ', 'TOTPOT'],)

which utilizes 4 processes to extract the summary statistics in parallel. We explained in our paper in more details, about computing the statistics in parallel.

Using extracted_features_analysis module users can also get some analysis from the extracted features (see Section Feature Extraction). Suppose the dataframe of the extracted features is loaded as a pandas dataframe into a variable called extracted_features_df. Then,

from data_analysis.extracted_features_analysis import ExtractedFeaturesAnalysis
efa = ExtractedFeaturesAnalysis(extracted_features_df, excluded_col=['id'])
efa.compute_summary()

that excludes the column id of the extracted features from the analysis. After the summary is computed, the following methods can be used:

efa.get_class_population(label='lab')
efa.get_missing_values()
efa.get_five_num_summary()

Feature Extraction

This snippet shows how feature_extractor module can be used, for extracting 4 statistics (i.e., min, max, median, and mean), from 3 time series parameteres (i.e., TOTUSJH, TOTBSQ, and TOTPOT).

from features.feature_extractor import FeatureExtractor

fe = FeatureExtractor(path_to_config='/PATH/TO/CONFIG.YML')
fe.do_extraction(features_name=['get_min', 'get_max', 'get_median', 'get_mean'],
                 params_name=['TOTUSJH', 'TOTBSQ', 'TOTPOT'], first_k=50)

Note that user's configuration file must have the path to the raw mvts using the key PATH_TO_MVTS.

To benefit from the parallel execution, do:

fe.do_extraction_in_parallel(n_jobs=4,
                             features_index=[0, 1, 2, 3],
                             params_index=[0, 1, 2], first_k=50)

Here, for the sake of providing a richer example, we used features_index and params_index instead of their names. This numeric mapping of features and parameters makes it easier to deal with a numerous lengthy names. These lists will be mapped to the list of parameters and features provided in the user's configuration file, under the keys MVTS_PARAMETERS and STATISTICAL_FEATURES, respectively.

In FeatureExtractor class, several plotting functionalities are implemented that can be easily used as follows:

params = ['TOTUSJH_median', 'TOTUSJH_mean', 'TOTBSQ_median', 'TOTBSQ_mean']
fe.plot_boxplot(params)
fe.plot_violinplot(params)
fe.plot_correlation_heatmap(params)
fe.plot_covariance_heatmap(params)
fe.plot_splom(params)

Sampling

After the statistical features are extracted from the mvts data, to remedy the class-imbalance issue (if exists) a set of generic sampling methods are provided in sampler module.

from sampling.sampler import Sampler

sampler = Sampler(extracted_features_df, label_col_name='lab')
sampler.sample(desired_populations={'N': 100, 'Y': 100})

which randomly samples 100 instances of the N class and 100 instances of the Y class. If either of the classes does not have enough samples, then after the entire samples are taken, the remaining needed instances will be sampled with replacement. Depending on the provided populations, this method could be an undersampling or an oversampling technique.

Users can use ratios instead of size as follows:

sampler.sample(desrired_ratios = {'N': 0.50, 'Y': -1})

which means take 50% of N-class instances, and all of Y-class instances.

For other approaches, see the /demo.


Normalizing

The extracted features often require normalization. Using normalizer module, it can be easily normalized as follows:

from normalizing import normalizer
normalizer.zero_one_normalize()
df_normalized = normalizer.zero_one_normalize(extracted_features_df)

that again, extracted_features_df is assumed to be a pandas dataframe of the extracted features.

In this module, the following four normalizers are provided:

  • zero_one_normalizer()
  • negativeone_one_normalize()
  • standardize()
  • robust_standardize()

Extra files:

  • bitbucket-pipelines.yml is a configuration file for pipelining the deployment steps before each release.
  • CONSTANTS.py keeps track of the root directory, and a few other pieces of information that are needed for the demo.
  • demo.ipynb is the demo Jupyter notebook that can walk the interested users through the functionalities this toolkit provides.
  • README.md has the content of this very manual.
  • requirements.txt keeps track of all dependencies.
  • setup.py is used to generate the binary files needed for generating the pip-installble version of this package.

Citation

Currently, this package is under review in SoftwareX journal. If you are interested in using this, I can share the manuscrip with you. Till it is published, it can be cited as follows:

@article{ahmadzadeh2020mvts,
  title={MVTS-Data Toolkit: A Python Package for Preprocessing Multivariate Time Series Data}},
  author={Azim Ahmadzadeh, Kankana Sinha, Berkay Aydin, Rafal A. Angryk},
  journal={SoftwareX},
  volume={},
  pages={},
  year={under-review},
  publisher={Elsevier}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for mvtsdata-toolkit, version 0.1.5
Filename, size File type Python version Upload date Hashes
Filename, size mvtsdata_toolkit-0.1.5-py3-none-any.whl (7.0 kB) File type Wheel Python version py3 Upload date Hashes View hashes
Filename, size mvtsdata_toolkit-0.1.5.tar.gz (9.0 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page