Kernel similarity for classification and clustering of multi-variate time series with missing values.
The Time Series Cluster Kernel (TCK) is a kernel similarity for multivariate time series with missing values. The kernel can be used to perform tasks such as classification, clustering, and dimensionality reduction.
TCK is based on an ensemble of Gaussian Mixture Models for time series that use informative Bayesian priors robust to missing values. The similarity between two time series is proportional to the number of times the two time series are assigned to the same mixtures.
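The co-assignment idea can be illustrated with a small sketch. It is not the TCK implementation: the cluster labels are simply given rather than obtained from fitted GMMs, and the function name `coassignment_kernel` and the toy labels are hypothetical.

```python
import numpy as np

def coassignment_kernel(labels):
    """labels: int array of shape [Q, N], one row of cluster labels per ensemble member.
    Returns an [N, N] matrix holding the fraction of members that group i and j together."""
    labels = np.asarray(labels)
    # same[q, i, j] is True when member q assigns samples i and j to the same cluster
    same = labels[:, :, None] == labels[:, None, :]
    return same.mean(axis=0)

# Toy ensemble: 3 clusterings of 4 samples (assumed data, for illustration only)
labels = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 0, 1],
])
K = coassignment_kernel(labels)
print(K)
```

Samples that are grouped together by more ensemble members get a larger entry in `K`, and the matrix is symmetric with ones on the diagonal.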
Installation
The recommended installation is with pip:
pip install tck
Alternatively, you can install the library from source:
git clone https://github.com/FilippoMB/Time-Series-Cluster-Kernel.git
cd Time-Series-Cluster-Kernel
pip install -e .
Quick start
The following scripts provide minimalistic examples that illustrate how to use the library for different tasks.
To run them, download the project and cd to the root folder:
git clone https://github.com/FilippoMB/Time-Series-Cluster-Kernel.git
cd Time-Series-Cluster-Kernel
Classification
python examples/classification.py
Clustering
python examples/clustering.py
The following notebooks illustrate more advanced use-cases.
- Perform time series dimensionality reduction, cluster analysis, and visualize the results.
Running on Windows
TCK uses multiprocessing. When using multiprocessing in Python on Windows, the entry point of the program must be protected with
if __name__ == '__main__':
Please refer to the following examples.
Classification
python examples/classification_windows.py
Clustering
python examples/clustering_windows.py
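For readers unfamiliar with the guard, here is a minimal sketch of the pattern using only the standard library. The worker function `squared_norm` and the toy data are hypothetical stand-ins and do not come from the TCK examples.

```python
import multiprocessing as mp

def squared_norm(row):
    # Work done in each worker process (a stand-in for a heavier computation)
    return sum(x * x for x in row)

def main():
    data = [[1.0, 2.0], [3.0, 4.0], [0.0, 5.0]]
    # Worker processes are created here
    with mp.Pool(processes=2) as pool:
        return pool.map(squared_norm, data)

if __name__ == '__main__':
    # With the spawn start method (the default on Windows), workers re-import
    # this module; the guard keeps them from starting pools of their own.
    print(main())
```

Keeping all process-creating code inside `main()` and calling it only under the guard makes the same script behave the same way on Windows, Linux, and macOS.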
Datasets
Data format
- TCK works with both univariate and multivariate time series. The dataset must be stored in a numpy array of shape [N, T, V], where N is the number of samples, T is the number of time steps, and V is the number of variables (V=1 in the univariate case).
- If the time series in the same dataset have different numbers of time steps, T corresponds to the maximum length of the time series in the dataset. All time series shorter than T should be padded with trailing zeros to match the dimension T. Alternatively, one can use interpolation to stretch the shorter time series up to length T.
- The time series can contain missing data. Missing data are indicated by np.nan entries in the data array.
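The format described above can be built with plain numpy. The sketch below is one possible way to do it; the helper name `pad_series` and the toy series are assumptions made for illustration and are not part of the library.

```python
import numpy as np

def pad_series(series_list):
    """Stack variable-length series (each of shape [T_i, V]) into one [N, T, V] array,
    padding the shorter ones with trailing zeros."""
    N = len(series_list)
    T = max(s.shape[0] for s in series_list)  # longest series sets T
    V = series_list[0].shape[1]
    X = np.zeros((N, T, V))
    for i, s in enumerate(series_list):
        X[i, :s.shape[0], :] = s  # the remaining time steps stay at zero
    return X

a = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])  # T=3, V=2, one missing entry
b = np.array([[4.0, 40.0], [5.0, 50.0]])                  # T=2, V=2
X = pad_series([a, b])
print(X.shape)  # (2, 3, 2)
```

Missing values stay encoded as `np.nan`, while padding is encoded as zeros, so the two are not confused.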
Available datasets
There are several univariate and multivariate time series classification datasets immediately available for test and benchmarking purposes.
The list of available datasets can be retrieved as follows:
from tck.datasets import DataLoader
downloader = DataLoader()
downloader.available_datasets(details=True) # Leave at False to just get the names
A dataset can be loaded as follows
Xtr, Ytr, Xte, Yte = downloader.get_data('Japanese_Vowels')
Configuration and detailed usage
There are a few hyperparameters that can be tuned to modify the behavior of TCK.
tck = TCK(G, C)
G: Number of GMMs.
C: Number of components in each GMM.
Usually, higher values give better results, but the computations take longer.
tck.fit(X, minN, minV, maxV, minT, maxT, I)
minN: Minimum percentage of samples to be used in the training of the GMMs.
minV: Minimum number of attributes to be sampled from the dataset.
maxV: Maximum number of attributes to be sampled from the dataset.
minT: Minimum length of time segments to be sampled from the dataset.
maxT: Maximum length of time segments to be sampled from the dataset.
I: Number of iterations for the MAP-EM algorithm.
The results are usually less sensitive to these parameters, which can be left at their default values in most cases.
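To give an intuition for what the sampling parameters control, the sketch below draws one random view of a dataset: a subset of samples, a subset of variables, and a contiguous time segment. It is only an illustration and not the library's internal code; the function `sample_subset`, its default values, and the reading of `minN` as a fraction between 0 and 1 are all assumptions.

```python
import numpy as np

def sample_subset(X, minN=0.8, minV=1, maxV=None, minT=2, maxT=None, rng=None):
    """Draw one random view of X ([N, T, V]). Defaults are hypothetical."""
    rng = np.random.default_rng() if rng is None else rng
    N, T, V = X.shape
    maxV = V if maxV is None else maxV
    maxT = T if maxT is None else maxT
    # Use at least a fraction minN of the samples
    n = rng.integers(int(np.ceil(minN * N)), N + 1)
    rows = np.sort(rng.choice(N, size=n, replace=False))
    # Pick between minV and maxV variables
    v = rng.integers(minV, maxV + 1)
    cols = np.sort(rng.choice(V, size=v, replace=False))
    # Pick a contiguous segment with length between minT and maxT
    length = rng.integers(minT, maxT + 1)
    start = rng.integers(0, T - length + 1)
    times = np.arange(start, start + length)
    return X[np.ix_(rows, times, cols)], rows, cols, start

X = np.random.default_rng(0).normal(size=(10, 12, 3))
sub, rows, cols, start = sample_subset(X, rng=np.random.default_rng(1))
print(sub.shape)
```

Under these assumptions, each ensemble member would be trained on a different view, which is what makes the members diverse.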
Ktr = tck.predict(mode='tr-tr')
Kte = tck.predict(Xte=Xte, mode='tr-te')
- If mode='tr-tr', returns the similarity matrix between training samples, i.e., Ktr[i,j] is the similarity between time series i and j in the training set.
- If mode='tr-te', the test set Xte must be passed as an additional input. The returned similarity matrix Kte[i,j] is the similarity between time series i in the test set and time series j in the training set.
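As an example of how a test-versus-training similarity matrix can be used, the sketch below implements a simple nearest-neighbor classifier on top of it. The matrix here is a hand-written stand-in with the same [N_test, N_train] layout as Kte, not an actual TCK output, and the helper `kernel_1nn` is hypothetical.

```python
import numpy as np

def kernel_1nn(Kte, Ytr):
    """Kte: [N_test, N_train] similarities, Ytr: [N_train] labels.
    Each test sample gets the label of its most similar training sample."""
    return np.asarray(Ytr)[np.argmax(Kte, axis=1)]

# Toy stand-in for a tr-te similarity matrix (3 test rows, 4 training columns)
Kte = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.1, 0.2, 0.8, 0.3],
    [0.0, 0.7, 0.1, 0.6],
])
Ytr = np.array([0, 0, 1, 1])
pred = kernel_1nn(Kte, Ytr)
print(pred)  # [0 1 0]
```

The same pair of matrices could also be passed to any method that accepts precomputed kernels, for instance an SVM from scikit-learn created with kernel='precomputed'.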
Citation
Please consider citing the original paper if you are using TCK in your research.
@article{mikalsen2018time,
title={Time series cluster kernel for learning similarities between multivariate time series with missing data},
author={Mikalsen, Karl {\O}yvind and Bianchi, Filippo Maria and Soguero-Ruiz, Cristina and Jenssen, Robert},
journal={Pattern Recognition},
volume={76},
pages={569--581},
year={2018},
publisher={Elsevier}
}