SimilarityTS: Toolkit for the Evaluation of Similarity for multivariate time series
Table of Contents
- Package Description
- Installation
- Usage
- Configuring the toolkit
- Extending the toolkit
- License
- Acknowledgements
Package Description
SimilarityTS is an open-source package designed to facilitate the evaluation and comparison of multivariate time series data. It provides a comprehensive toolkit for analyzing, visualizing, and reporting multiple metrics and figures derived from time series datasets. The toolkit simplifies the process of evaluating the similarity of time series by offering data preprocessing, metrics computation, visualization, statistical analysis, and report generation functionalities. With its customizable features, SimilarityTS empowers researchers and data scientists to gain insights, identify patterns, and make informed decisions based on their time series data.
A command line interface tool is also available at: https://github.com/alejandrofdez-us/similarity-ts-cli.
Available metrics
This toolkit can compute the following metrics:
- kl: Kullback-Leibler divergence
- js: Jensen-Shannon divergence
- ks: Kolmogorov-Smirnov test
- mmd: Maximum Mean Discrepancy
- dtw: Dynamic Time Warping
- cc: Difference of co-variances
- cp: Difference of correlations
- hi: Difference of histograms
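To make the divergence-based metrics more concrete, the following is a minimal sketch of how a Jensen-Shannon divergence between two columns could be computed from histograms. It is plain Python and independent of the package: the function names (histogram, kl_divergence, js_divergence) and the choice of 10 equal-width bins over the shared value range are assumptions made for illustration, not the toolkit's actual implementation.

```python
import math


def histogram(values, bins, low, high):
    """Return normalised bin frequencies of `values` over [low, high]."""
    counts = [0] * bins
    # Fall back to a width of 1.0 when all values are identical (high == low).
    width = (high - low) / bins or 1.0
    for v in values:
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    total = len(values)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats; bins with p_i = 0 contribute nothing.

    Assumes q_i > 0 wherever p_i > 0 (always true for the mixture used below).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(xs, ys, bins=10):
    """Jensen-Shannon divergence between the histograms of two columns."""
    low, high = min(min(xs), min(ys)), max(max(xs), max(ys))
    p = histogram(xs, bins, low, high)
    q = histogram(ys, bins, low, high)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


column_a = [0.1, 0.2, 0.3, 0.4]
column_b = [0.6, 0.7, 0.8, 0.9]
print(js_divergence(column_a, column_a))  # identical columns give 0.0
print(js_divergence(column_a, column_b))  # non-overlapping histograms give ln(2)
```

With natural logarithms the value is bounded between 0 (identical histograms) and ln(2) (histograms that do not overlap), which makes it convenient for comparing columns measured on different scales.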
Available figures
This toolkit can generate the following figures:
- 2d: the ordinary graphical representation of the time series in a 2D figure, with time on the x-axis and the data values on the y-axis, for:
  - the complete multivariate time series; and
  - a plot per column.
  Each generated figure plots both the ts1 and the ts2 data to easily obtain key insights into the similarities or differences between them.
- delta: the differences between the values of each column grouped by periods of time. For instance, the differences between the percentage of CPU used every 10, 25 or 50 minutes. These deltas can be used as a means of comparing the short-, mid- and long-term patterns of the time series.
- pca: a linear dimensionality reduction technique that aims to find the principal components of a data set by computing the linear combinations of the original features that explain the most variance in the data.
- tsne: a tool for visualising high-dimensional data sets in a 2D or 3D graphical representation, allowing the creation of a single map that reveals the structure of the data at many different scales.
- dtw path: in addition to the numerical similarity measure, the graphical representation of the DTW path of each column can be useful to better analyse the similarities or differences between the time series columns. Notice that there is no multivariate representation of DTW paths, only single-column representations.
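As an illustration of what a DTW path is, the sketch below computes the classic dynamic-programming DTW distance between two single columns and backtracks to recover the warping path. It is a plain Python example under stated assumptions (absolute difference as the local cost, no window constraint, a hypothetical function name dtw_path); the toolkit's own DTW implementation may differ.

```python
def dtw_path(xs, ys):
    """Return (distance, path) of the DTW alignment between two 1-D sequences."""
    n, m = len(xs), len(ys)
    inf = float('inf')
    # cost[i][j] = minimal accumulated cost of aligning xs[:i] with ys[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(xs[i - 1] - ys[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the last cell to recover the path as (index_in_xs, index_in_ys) pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal move
            (cost[i - 1][j], i - 1, j),          # advance only in xs
            (cost[i][j - 1], i, j - 1),          # advance only in ys
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


distance, path = dtw_path([0, 1, 2, 3], [0, 1, 1, 2, 3])
print(distance)  # 0.0: the second column is the first one with a repeated sample
print(path)      # [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4)]
```

Plotting the path pairs on a grid gives the kind of figure described above: a diagonal line means the two columns evolve in step, while horizontal or vertical runs show where one column is stretched relative to the other.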
Installation
Install the package using pip in your local environment:
pip install similarity-ts
Usage
Users must create a new SimilarityTs object by calling its constructor and passing the following parameters:
- ts1: the time series that may represent the baseline or ground truth, as a numpy array with shape [length, num_features].
- ts2s: a single time series or a set of time series, as a numpy array with shape [num_time_series, length, num_features].
Constraints:
- ts1 time series and ts2s time series must:
  - have the same dimensionality (number of columns)
  - not include a timestamp column
  - include only numeric values
- all ts2s time series must have the same length (number of rows).
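The shape-related constraints can be checked before building the object. The helper below is a hypothetical sketch (check_shapes is not part of the package) that works on shape tuples such as those returned by the shape attribute of a numpy array, and it assumes ts2s is always passed as a 3-D array. Note that a 3-D array already guarantees that all ts2s time series have the same length; whether the values are numeric cannot be verified from the shape alone.

```python
def check_shapes(ts1_shape, ts2s_shape):
    """Raise ValueError if the shapes violate the documented constraints."""
    if len(ts1_shape) != 2:
        raise ValueError('ts1 must have shape [length, num_features]')
    if len(ts2s_shape) != 3:
        raise ValueError('ts2s must have shape [num_time_series, length, num_features]')
    ts1_features = ts1_shape[1]
    ts2_features = ts2s_shape[2]
    if ts1_features != ts2_features:
        raise ValueError(
            f'ts1 has {ts1_features} columns but ts2s has {ts2_features} columns')


# Shapes matching the usage examples: ts1 is (200, 2) and ts2s is (5, 100, 2).
check_shapes((200, 2), (5, 100, 2))

try:
    check_shapes((200, 3), (5, 100, 2))
except ValueError as error:
    print(f'Invalid input: {error}')
```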
If the ts1 time series is longer (has more rows) than the ts2s time series, the ts1 time series is divided into windows of the same length as the ts2s time series. For each ts2s time series, the most similar window (*) from the ts1 time series is selected. Finally, the metrics and figures that assess the similarity between each ts2s time series and its most similar ts1 window are computed.
(*) The metric used to select the most similar ts1 window for each ts2s time series is configurable. dtw is the default, but any of the available metrics can be used for this purpose. See the toolkit configuration section.
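The window selection described above can be sketched in a few lines of plain Python. This is an illustrative example rather than the package's code: the function names are hypothetical, time series are represented as lists of rows, and a Euclidean distance stands in for the default dtw metric to keep the example short.

```python
import math


def sliding_windows(series, window_length, stride=1):
    """Yield (start_index, window) pairs of `window_length` rows, one every `stride` rows."""
    for start in range(0, len(series) - window_length + 1, stride):
        yield start, series[start:start + window_length]


def euclidean(window, other):
    """Euclidean distance between two equally shaped lists of rows."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(window, other)
                         for a, b in zip(row_a, row_b)))


def most_similar_window(ts1, ts2, distance=euclidean, stride=1):
    """Return (start_index, window) of the ts1 window closest to ts2."""
    return min(sliding_windows(ts1, len(ts2), stride),
               key=lambda item: distance(item[1], ts2))


ts1 = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
ts2 = [[2, 2], [3, 3], [4, 4]]

start, window = most_similar_window(ts1, ts2)
print(start)  # 2: the window starting at row 2 is identical to ts2

# With stride=3 only the windows starting at rows 0 and 3 are compared.
start, window = most_similar_window(ts1, ts2, stride=3)
print(start)  # 3: the exact match at row 2 is skipped
```

The second call shows the trade-off of a larger stride: fewer windows are evaluated, so the search is faster, but the best-matching window may be missed.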
Minimal usage examples:
Usage examples can be found at: https://github.com/alejandrofdez-us/similarity-ts/tree/main/usage_examples.
- Compute metrics between random time series (ts1: one time series of length 200 and 2 dimensions, and ts2s: five time series of length 100 and 2 dimensions):

import numpy as np
from similarity_ts.similarity_ts import SimilarityTs

ts1 = np.random.rand(200, 2)
ts2s = np.random.rand(5, 100, 2)
similarity_ts = SimilarityTs(ts1, ts2s)
for ts2_name, metric_name, computed_metric in similarity_ts.get_metric_computer():
    print(f'{ts2_name}. {metric_name}: {computed_metric}')

- Compute metrics and figures between random time series and save figures:

import os
import numpy as np
from similarity_ts.plots.plot_factory import PlotFactory
from similarity_ts.similarity_ts import SimilarityTs


def main():
    ts1 = np.random.rand(200, 2)
    ts2s = np.random.rand(5, 100, 2)
    similarity_ts = SimilarityTs(ts1, ts2s)
    for ts2_name, metric_name, computed_metric in similarity_ts.get_metric_computer():
        print(f'{ts2_name}. {metric_name}: {computed_metric}')
    for ts2_name, plot_name, generated_plots in similarity_ts.get_plot_computer():
        __save_figures(ts2_name, plot_name, generated_plots)


def __save_figures(filename, plot_name, generated_plots):
    # Each generated plot is a tuple whose first element is the matplotlib figure
    for plot in generated_plots:
        dir_path = __create_directory(filename, 'figures', plot_name)
        plot[0].savefig(f'{dir_path}{plot[0].axes[0].get_title()}.pdf', format='pdf', bbox_inches='tight')


def __create_directory(filename, path, plot_name):
    # Figures that require all samples are stored once; the rest are stored per ts2 time series
    if plot_name in PlotFactory.get_instance().figures_requires_all_samples:
        dir_path = f'{path}/{plot_name}/'
    else:
        original_filename = os.path.splitext(filename)[0]
        dir_path = f'{path}/{original_filename}/{plot_name}/'
    os.makedirs(dir_path, exist_ok=True)
    return dir_path


if __name__ == '__main__':
    main()

Configuring the Toolkit
Users can select the metrics and figures to be computed or generated, along with some other parameters. The following code snippet creates a configuration object that should be passed to the SimilarityTs constructor:
# MetricConfig, PlotConfig and SimilarityTsConfig must be imported from the similarity_ts package beforehand.
def __create_similarity_ts_config():
    # The list of metric names that will be computed
    metric_config = MetricConfig(['js', 'mmd'])
    # The list of figure names that will be generated and the time step in seconds of the time series
    plot_config = PlotConfig(['delta', 'pca'], timestamp_frequency_seconds=300)
    # Name of each time series of the ts2s set of time series
    ts2_names = ['ts2_1_name', 'ts2_2_name', 'ts2_3_name', 'ts2_4_name', 'ts2_5_name']
    # Name of the features
    header_names = ['feature1_name', 'feature2_name']
    # Creation of the configuration:
    # - stride: step used for cutting ts1 into windows when needed
    # - window_selection_metric: metric used for selecting the most similar window
    similarity_ts_config = SimilarityTsConfig(metric_config, plot_config,
                                              stride=10, window_selection_metric='kl',
                                              ts2_names=ts2_names, header_names=header_names)
    return similarity_ts_config
If neither metrics nor figures are provided, the toolkit computes all the available metrics and figures.
The following arguments are also available for fine-tuning:
- timestamp_frequency_seconds: the frequency in seconds at which samples were taken. This is needed to generate the delta figures with correct time magnitudes. The default is 1 second.
- stride: when the ts1 time series is longer than the ts2s time series, the windows are computed using a stride of 1 by default. Using a larger stride can improve performance because fewer windows have to be compared.
- window_selection_metric: the metric used to select the most similar ts1 window for each ts2s time series. dtw is the default, but any of the available metrics can be used for this purpose.
- ts2_names: name of each time series of the ts2s set of time series.
- header_names: name of the features.
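The role of timestamp_frequency_seconds in the delta figures can be illustrated with a small sketch: a period expressed in minutes has to be converted into a number of samples before differences can be taken. The function below is hypothetical and reflects only one possible reading of the delta description (differences between values taken one period apart); it is not the toolkit's implementation.

```python
def deltas(column, period_minutes, timestamp_frequency_seconds=1):
    """Differences between values of `column` taken `period_minutes` apart."""
    # Number of samples that span one period at the given sampling frequency.
    lag = int(period_minutes * 60 / timestamp_frequency_seconds)
    if lag < 1:
        raise ValueError('the period is shorter than the sampling interval')
    sampled = column[::lag]
    return [later - earlier for earlier, later in zip(sampled, sampled[1:])]


# CPU usage (%) sampled every 300 seconds, i.e. one sample every 5 minutes.
cpu = [10, 12, 15, 11, 20, 26, 30]

# A 10-minute period spans 2 samples, so the values 10, 15, 20 and 30 are compared.
print(deltas(cpu, period_minutes=10, timestamp_frequency_seconds=300))  # [5, 5, 10]

# With the default of 1 second per sample, 10 minutes would span 600 samples,
# more than the series contains, so no delta can be computed.
print(deltas(cpu, period_minutes=10))  # []
```

This is why a wrong sampling frequency does not raise an error but silently produces deltas over the wrong time span.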
Extending the toolkit
Additionally, users may implement their own metric or figure classes and include them by using the MetricFactory or PlotFactory register methods. To ensure compatibility with the toolkit, they have to inherit from the base classes Metric and Plot.
The following code snippet is an example of how to introduce the Euclidean distance metric:
# ed.py
import numpy as np
from similarity_ts.metrics.metric import Metric


class EuclideanDistance(Metric):
    def __init__(self):
        super().__init__()
        self.name = 'ed'

    def compute(self, ts1, ts2, similarity_ts):
        # Result reported for the complete multivariate time series
        metric_result = {'Multivariate': self.__ed(ts1, ts2)}
        return metric_result

    def compute_distance(self, ts1, ts2):
        # Scalar Euclidean distance between ts1 and ts2
        return self.__ed(ts1, ts2)

    def __ed(self, ts1, ts2):
        return np.linalg.norm(ts1 - ts2)
Afterward, this metric can be registered by using the register_metric(metric_class) method from MetricFactory, as shown in the following code snippet:
import numpy as np
from similarity_ts.similarity_ts import SimilarityTs
from similarity_ts.metrics.metric_factory import MetricFactory
from ed import EuclideanDistance

MetricFactory.get_instance().register_metric(EuclideanDistance)
ts1 = np.random.rand(200, 2)
ts2s = np.random.rand(5, 100, 2)
similarity_ts = SimilarityTs(ts1, ts2s)
for ts2_name, metric_name, computed_metric in similarity_ts.get_metric_computer():
    print(f'{ts2_name}. {metric_name}: {computed_metric}')
Similarly, new plots can be introduced. For instance, a SimilarityPlotByCorrelation could be defined as:
# cc_plot.py
import numpy as np
import matplotlib.pyplot as plt
from similarity_ts.plots.plot import Plot


class SimilarityPlotByCorrelation(Plot):
    def __init__(self, fig_size=(8, 6)):
        super().__init__(fig_size)
        self.name = 'cc-plot'

    def compute(self, similarity_ts, ts2_filename):
        super().compute(similarity_ts, ts2_filename)
        n_features = self.ts1.shape[1]
        # Correlation matrix of all ts1 and ts2 features, shape (2 * n_features, 2 * n_features)
        similarity = np.corrcoef(self.ts1.T, self.ts2.T)
        fig, ax = plt.subplots()
        im = ax.imshow(similarity, cmap='RdYlBu', vmin=-1, vmax=1)
        ax.set_xticks(np.arange(n_features * 2))
        ax.set_yticks(np.arange(n_features * 2))
        xticklabels = [f'ts1_{nfeatures_index}' for nfeatures_index in range(1, n_features + 1)]
        xticklabels = xticklabels + [f'ts2_{nfeatures_index}' for nfeatures_index in range(1, n_features + 1)]
        ax.set_xticklabels(xticklabels)
        ax.set_yticklabels(xticklabels)
        ax.set_xlabel('Feature')
        ax.set_ylabel('Feature')
        # Annotate every cell with its correlation value
        for i in range(n_features * 2):
            for j in range(n_features * 2):
                ax.text(j, i, f'{similarity[i, j]:.2f}', ha='center', va='center', color='black')
        cbar = ax.figure.colorbar(im, ax=ax)
        cbar.ax.set_ylabel('Similarity', rotation=-90, va='bottom')
        plt.title('similarity-correlation')
        plt.tight_layout()
        return [(fig, ax)]
Afterward, this plot can be registered by using the register_plot(plot_class) method from PlotFactory, as shown in the following code snippet, which registers both the new metric and the new plot:
import os
import numpy as np
from similarity_ts.plots.plot_factory import PlotFactory
from similarity_ts.similarity_ts import SimilarityTs
from similarity_ts.metrics.metric_factory import MetricFactory
from ed import EuclideanDistance
from cc_plot import SimilarityPlotByCorrelation


def main():
    MetricFactory.get_instance().register_metric(EuclideanDistance)
    PlotFactory.get_instance().register_plot(SimilarityPlotByCorrelation)
    ts1 = np.random.rand(200, 2)
    ts2s = np.random.rand(5, 100, 2)
    similarity_ts = SimilarityTs(ts1, ts2s)
    for ts2_name, metric_name, computed_metric in similarity_ts.get_metric_computer():
        print(f'{ts2_name}. {metric_name}: {computed_metric}')
    for ts2_name, plot_name, generated_plots in similarity_ts.get_plot_computer():
        __save_figures(ts2_name, plot_name, generated_plots)


def __save_figures(filename, plot_name, generated_plots):
    for plot in generated_plots:
        dir_path = __create_directory(filename, 'figures', plot_name)
        plot[0].savefig(f'{dir_path}{plot[0].axes[0].get_title()}.pdf', format='pdf', bbox_inches='tight')


def __create_directory(filename, path, plot_name):
    if plot_name in PlotFactory.get_instance().figures_requires_all_samples:
        dir_path = f'{path}/{plot_name}/'
    else:
        original_filename = os.path.splitext(filename)[0]
        dir_path = f'{path}/{original_filename}/{plot_name}/'
    os.makedirs(dir_path, exist_ok=True)
    return dir_path


if __name__ == '__main__':
    main()
License
SimilarityTS toolkit is free and open-source software licensed under the MIT license.
Acknowledgements
Projects PID2021-122208OB-I00, PROYEXCEL_00286 and TED2021-132695B-I00, funded by MCIN / AEI / 10.13039 / 501100011033, by the Andalusian Regional Government, and by the European Union - NextGenerationEU.