Kernel similarity for classification and clustering of multi-variate time series with missing values.

The Time Series Cluster Kernel (TCK) is a kernel similarity for multivariate time series with missing values. The kernel can be used to perform tasks such as classification, clustering, and dimensionality reduction.


TCK is based on an ensemble of Gaussian Mixture Models for time series that use informative Bayesian priors robust to missing values. The similarity between two time series is proportional to the number of times the two time series are assigned to the same mixtures.
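The co-assignment idea can be illustrated with a minimal sketch. It assumes each ensemble member produces a hard component label for every time series; the function name `coassignment_kernel` and the toy labels are hypothetical, and the library's actual computation may differ (for instance, by using soft posterior assignments).

```python
import numpy as np

def coassignment_kernel(assignments):
    """Count how often each pair of series lands in the same component.

    assignments: integer array of shape [G, N]; entry [g, n] is the
    component that ensemble member g assigns to time series n.
    Returns an [N, N] matrix of co-assignment counts.
    """
    assignments = np.asarray(assignments)
    # same[g, i, j] is True when member g puts series i and j together
    same = assignments[:, :, None] == assignments[:, None, :]
    return same.sum(axis=0)

# Toy example: 3 ensemble members, 3 time series (hypothetical labels)
labels = np.array([[0, 0, 1],
                   [1, 1, 1],
                   [0, 2, 2]])
K = coassignment_kernel(labels)
print(K)
```

Series 0 and 1 share a component in two of the three members, so `K[0, 1]` is 2, while the diagonal equals the ensemble size.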

Installation

The recommended installation is with pip:

pip install tck

Alternatively, you can install the library from source:

git clone https://github.com/FilippoMB/Time-Series-Cluster-Kernel.git
cd Time-Series-Cluster-Kernel
pip install -e .

Quick start

The following scripts provide minimalistic examples that illustrate how to use the library for different tasks.

To run them, download the project and cd to the root folder:

git clone https://github.com/FilippoMB/Time-Series-Cluster-Kernel.git
cd Time-Series-Cluster-Kernel

Classification

python examples/classification.py

Clustering

python examples/clustering.py

The following notebooks illustrate more advanced use-cases.

  • Perform time series dimensionality reduction, cluster analysis, and visualize the results: view or Open In Colab

Running on Windows

TCK uses multiprocessing. On Windows, Python multiprocessing requires the entry point of the program to be protected with the following guard:

if __name__ == '__main__':

Please refer to the following examples.

Classification

python examples/classification_windows.py

Clustering

python examples/clustering_windows.py
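The entry-point guard can be sketched with the standard library alone. The example below does not use TCK; `square` and `main` are hypothetical names that only show where the guard goes when worker processes are started.

```python
import multiprocessing as mp

def square(x):
    # Work executed in the worker processes
    return x * x

def main():
    # Start a small pool and distribute the inputs across it
    with mp.Pool(processes=2) as pool:
        return pool.map(square, range(5))

if __name__ == '__main__':
    # Without this guard, each worker spawned on Windows would re-import
    # the script and try to start its own pool
    print(main())
```

In the same way, any script that builds and fits a TCK model on Windows would place those calls inside a function invoked under the guard.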

Datasets

Data format

  • TCK works with both univariate and multivariate time series. The dataset must be stored in a numpy array of shape [N, T, V], where N is the number of time series (samples), T is the number of time steps, and V is the number of variables (V=1 in the univariate case).
  • If the time series in the same dataset have a different number of time steps, T corresponds to the maximum length of the time series in the dataset. All the time series shorter than T should be padded with trailing zeros to match the dimension T. Alternatively, one can use interpolation to stretch the shorter time series up to length T.
  • The time series can contain missing data. Missing data are indicated by np.nan entries in the data array.
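The data format above can be sketched with numpy. The helpers `pad_with_zeros` and `stretch` are hypothetical and not part of the library; `stretch` assumes a series without missing values, since linear interpolation would propagate np.nan.

```python
import numpy as np

def pad_with_zeros(series_list):
    """Stack series of shape [T_i, V] into one [N, T, V] array, padding with trailing zeros."""
    T = max(s.shape[0] for s in series_list)
    V = series_list[0].shape[1]
    X = np.zeros((len(series_list), T, V))
    for n, s in enumerate(series_list):
        X[n, :s.shape[0], :] = s
    return X

def stretch(s, T):
    """Linearly interpolate a fully observed series of shape [T_i, V] to length T."""
    old = np.linspace(0.0, 1.0, s.shape[0])
    new = np.linspace(0.0, 1.0, T)
    return np.column_stack([np.interp(new, old, s[:, v]) for v in range(s.shape[1])])

long_series = np.ones((5, 2))
long_series[1, 0] = np.nan                      # a missing observation
short_series = np.array([[0.0, 10.0], [2.0, 30.0]])

X = pad_with_zeros([long_series, short_series])  # shape [N=2, T=5, V=2]
stretched = stretch(short_series, 3)             # shape [3, 2]
```

Padding keeps the observed values untouched, while stretching changes the time scale of the shorter series; which option is preferable depends on the data.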

Available datasets

Several univariate and multivariate time series classification datasets are immediately available for testing and benchmarking purposes.

The list of available datasets can be retrieved as follows:

from tck.datasets import DataLoader
downloader = DataLoader()
downloader.available_datasets(details=True) # Leave the default at False to just get the names

A dataset can be loaded as follows:

Xtr, Ytr, Xte, Yte = downloader.get_data('Japanese_Vowels')

Configuration and detailed usage

There are a few hyperparameters that can be tuned to modify the behavior of TCK.

tck = TCK(G, C)
  • G is the number of GMMs.
  • C is the number of components in the GMMs.

Higher values of G and C usually give better results, but the computations take longer.

tck.fit(X, minN, minV, maxV, minT, maxT, I)

  • minN: Minimum percentage of samples to be used in the training of the GMMs.
  • minV: Minimum number of attributes to be sampled from the dataset.
  • maxV: Maximum number of attributes to be sampled from the dataset.
  • minT: Minimum length of time segments to be sampled from the dataset.
  • maxT: Maximum length of time segments to be sampled from the dataset.
  • I: Number of iterations for the MAP-EM algorithm.

TCK is usually less sensitive to these parameters, which can be left at their default values in most cases.
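To give an intuition for what such bounds control, the sketch below draws one random sub-sample of a dataset within the given limits. It is an illustrative assumption, not the library's code: `sample_subset` is a hypothetical helper, minN is treated as a fraction in (0, 1], and the time segment is assumed to be contiguous.

```python
import numpy as np

def sample_subset(X, minN, minV, maxV, minT, maxT, rng):
    """Draw a random sub-sample of X ([N, T, V]) within the given bounds."""
    N, T, V = X.shape
    n = rng.integers(int(np.ceil(minN * N)), N + 1)  # number of series
    v = rng.integers(minV, maxV + 1)                 # number of attributes
    t = rng.integers(minT, maxT + 1)                 # segment length
    rows = rng.choice(N, size=n, replace=False)
    cols = rng.choice(V, size=v, replace=False)
    start = rng.integers(0, T - t + 1)               # segment start
    return X[rows][:, start:start + t][:, :, cols]

rng = np.random.default_rng(0)
X = np.zeros((10, 20, 4))
sub = sample_subset(X, minN=0.8, minV=2, maxV=3, minT=5, maxT=10, rng=rng)
print(sub.shape)
```

Under these assumptions, each ensemble member would see a different view of the data, which is what makes the resulting similarity an ensemble quantity.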

Ktr = tck.predict(mode='tr-tr')
Kte = tck.predict(Xte=Xte, mode='tr-te')
  • If mode='tr-tr', returns the similarity matrix between training samples, i.e., Ktr[i,j] is the similarity between time series i and j in the training set.
  • If mode='tr-te', it is necessary to pass the test set Xte as additional input. The returned similarity matrix Kte[i,j] is the similarity between time series i in the test set and time series j in the training set.
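As a minimal sketch of how the test-train similarity matrix might be used, the snippet below assigns to each test series the label of its most similar training series. The function `predict_1nn` and the toy values are hypothetical; only the row and column orientation of Kte follows the description above.

```python
import numpy as np

def predict_1nn(Kte, Ytr):
    """Label each test series with the label of its most similar training series.

    Kte: [N_test, N_train] similarities (rows: test, columns: training).
    Ytr: [N_train] training labels.
    """
    nearest = np.argmax(Kte, axis=1)  # most similar training series per test series
    return np.asarray(Ytr)[nearest]

# Toy similarity matrix for 2 test series and 3 training series
Kte = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8]])
Ytr = np.array([0, 1, 1])
pred = predict_1nn(Kte, Ytr)
print(pred)
```

Other kernel methods that accept a precomputed similarity matrix could be substituted for this nearest-neighbor rule.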

Citation

Please consider citing the original paper if you are using TCK in your research.

@article{mikalsen2018time,
  title={Time series cluster kernel for learning similarities between multivariate time series with missing data},
  author={Mikalsen, Karl {\O}yvind and Bianchi, Filippo Maria and Soguero-Ruiz, Cristina and Jenssen, Robert},
  journal={Pattern Recognition},
  volume={76},
  pages={569--581},
  year={2018},
  publisher={Elsevier}
}
