slearn

A package linking symbolic representation with sklearn for time series prediction.

Symbolic representations of time series have proved useful in motif discovery, clustering, classification, forecasting, anomaly detection, and related tasks. Symbolic representation methods not only reduce the dimensionality of a time series but also speed up downstream tasks. S. Elsworth and S. Güttel [Time series forecasting using LSTM networks: a symbolic approach, arXiv, 2020] demonstrated that symbolic forecasting greatly reduces the sensitivity of Long Short-Term Memory networks to hyperparameter settings. Deploying machine learning algorithms on symbols rather than on raw time series remains a practical challenge for applications. To support research on symbolic representations, we developed this Python library to simplify applying machine learning algorithms to symbolic data.

Now let's get started!

Install the slearn package simply by

$ pip install slearn

Supported classifiers              Parameter call
Multi-layer Perceptron             'MLPClassifier'
K-Nearest Neighbors                'KNeighborsClassifier'
Gaussian Naive Bayes               'GaussianNB'
Decision Tree                      'DecisionTreeClassifier'
Support Vector Classification      'SVC'
Radial-basis Function Kernel       'RBF'
Logistic Regression                'LogisticRegression'
Quadratic Discriminant Analysis    'QuadraticDiscriminantAnalysis'
AdaBoost classifier                'AdaBoostClassifier'
Random Forest                      'RandomForestClassifier'
LightGBM                           'LGBM'

Symbolic machine learning prediction

Import the package

from slearn import symbolicML

We can predict any symbolic sequence by choosing the classifiers available in scikit-learn.

string = 'aaaabbbccd'
sbml = symbolicML(classifier_name="MLPClassifier", ws=3, random_seed=0, verbose=0)
x, y = sbml._encoding(string)
pred = sbml.forecasting(x, y, step=5, hidden_layer_sizes=(10,10), learning_rate_init=0.1)
print(pred) #  ['d', 'b', 'a', 'b', 'b'] 

You can also pass the classifier parameters as a dictionary:

string = 'aaaabbbccd'
sbml = symbolicML(classifier_name="MLPClassifier", ws=3, random_seed=0, verbose=0)
x, y = sbml._encoding(string)
params = {'hidden_layer_sizes':(10,10), 'activation':'relu', 'learning_rate_init':0.1}
pred = sbml.forecasting(x, y, step=5, **params)
print(pred) # ['d', 'b', 'a', 'b', 'b'] # the prediction
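
To make the role of the window size ws more concrete, here is a minimal sketch of window-based symbolic forecasting in plain Python. It is not the slearn implementation: the functions make_windows and forecast are hypothetical names, and a simple lookup table stands in for the scikit-learn classifier. The idea is only that each window of ws symbols is a training sample whose label is the symbol that follows it, and that predictions are fed back in to forecast several steps.

```python
from collections import Counter, defaultdict


def make_windows(sequence, ws):
    """Split a symbolic sequence into (window, next symbol) training pairs."""
    features = [sequence[i:i + ws] for i in range(len(sequence) - ws)]
    targets = [sequence[i + ws] for i in range(len(sequence) - ws)]
    return features, targets


def forecast(sequence, ws, step):
    """Forecast `step` symbols with a lookup table standing in for a classifier."""
    features, targets = make_windows(sequence, ws)

    # "Training": count which symbol followed each window.
    table = defaultdict(Counter)
    for window, target in zip(features, targets):
        table[window][target] += 1

    # Unseen windows fall back to the most frequent symbol overall.
    fallback = Counter(sequence).most_common(1)[0][0]

    history = sequence
    predictions = []
    for _ in range(step):
        window = history[-ws:]
        if window in table:
            symbol = table[window].most_common(1)[0][0]
        else:
            symbol = fallback
        predictions.append(symbol)
        history += symbol  # feed the prediction back in for the next step
    return predictions


print(forecast("abcabcabc", ws=3, step=4))  # ['a', 'b', 'c', 'a']
```

A real classifier generalises to windows it has not seen, which the lookup table cannot do; that is where the scikit-learn models listed above come in.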

The parameters of the chosen classifier are the same as in scikit-learn, so make sure that every parameter you pass exists for the corresponding scikit-learn classifier. See the scikit-learn website for details.
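
One way to catch a misspelled parameter before training is to compare the dictionary keys with the constructor signature of the classifier. The sketch below uses only the standard library; unknown_params is a hypothetical helper, not part of slearn, and ToyClassifier is a stand-in so that the example runs without scikit-learn installed. The same check should work on scikit-learn classes such as MLPClassifier, because scikit-learn estimators declare their parameters as explicit keyword arguments of __init__.

```python
import inspect


def unknown_params(estimator_cls, params):
    """Return the keys in `params` that the class constructor does not accept."""
    signature = inspect.signature(estimator_cls.__init__)
    accepted = set(signature.parameters) - {"self"}
    return sorted(set(params) - accepted)


class ToyClassifier:
    """Stand-in for a scikit-learn classifier (hypothetical, for illustration)."""

    def __init__(self, hidden_layer_sizes=(100,), learning_rate_init=0.001):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate_init = learning_rate_init


params = {"hidden_layer_sizes": (10, 10), "learning_rate": 0.1}
print(unknown_params(ToyClassifier, params))  # ['learning_rate']
```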

Prediction with symbolic representation

Load libraries.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from slearn import *

time_series = pd.read_csv("Amazon.csv") # load the required dataset, here we use Amazon stock daily close price.
ts = time_series.Close.values

Set the number of symbols you would like to predict.

step = 50

You can choose any of the available classifiers and a symbolic representation method (SAX and ABBA are currently supported) for prediction. As before, the parameters of the chosen classifier are the same as in scikit-learn. We usually use the ABBA representation, since it achieves better forecasts than SAX.
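
For readers unfamiliar with symbolic representations, the following is a minimal sketch of the textbook SAX procedure: z-normalise the series, average it over equal-length segments, and map each segment mean to a letter using breakpoints that split the standard normal distribution into equally probable regions. This is an illustration only and not the code slearn runs; the function name sax and its arguments are assumptions, and the sketch requires the series length to be a multiple of the number of segments.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length `n_segments`."""
    if len(series) % n_segments != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of n_segments")

    # Step 1: z-normalise the series.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalised")
    z = [(v - mu) / sigma for v in series]

    # Step 2: piecewise aggregate approximation (mean of each segment).
    seg_len = len(z) // n_segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]

    # Step 3: breakpoints that cut N(0, 1) into equally probable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # Step 4: each segment mean becomes the letter of the region it falls in.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4))  # abcd
```

ABBA works differently: it approximates the series by line segments and clusters their lengths and increments into symbols, which is not shown here.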

Use Gaussian Naive Bayes method:

sl = slearn(method='fABBA',  ws=3, step=step, classifier_name="GaussianNB")
sl.set_symbols(series=ts, tol=0.01, alpha=0.2) 
sklearn_params = {'var_smoothing':0.001}
abba_nb_pred = sl.predict(**sklearn_params)

The last two lines can also be written more compactly by passing the parameter directly:

abba_nb_pred = sl.predict(var_smoothing=0.001)

The same applies to the examples below.

Use a neural network model:

sl = slearn(method='fABBA', ws=3, step=step, classifier_name="MLPClassifier")
sl.set_symbols(series=ts, tol=0.01, alpha=0.2) 
sklearn_params = {'hidden_layer_sizes':(20,80), 'learning_rate_init':0.1}
abba_nn_pred = sl.predict(**sklearn_params)

Now we predict a real-world time series. We can plot the predictions and compare the results.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from slearn import *
np.random.seed(0)
time_series = pd.read_csv("doc/Amazon.csv")
ts = time_series.Close.values
length = len(ts)
train, test = ts[:round(0.9*length)], ts[round(0.9*length):]
sl = slearn(method='fABBA', ws=8, step=1000, classifier_name="GaussianNB")
sl.set_symbols(series=train, tol=0.01, alpha=0.1) 
abba_nb_pred = sl.predict(var_smoothing=0.001)
sl = slearn(method='fABBA', ws=8, step=1000, classifier_name="DecisionTreeClassifier")
sl.set_symbols(series=train, tol=0.01, alpha=0.1) 
abba_nn_pred = sl.predict(max_depth=10, random_state=0)
sl = slearn(method='fABBA', ws=8, step=100, classifier_name="SVC")
sl.set_symbols(series=train, tol=0.01, alpha=0.1) 
abba_svc_pred = sl.predict(C=20)
min_len = np.min([len(test), len(abba_nb_pred), len(abba_nn_pred)])
plt.figure(figsize=(20, 5))
sns.set(font_scale=2, style="whitegrid")
sns.lineplot(data=test[:min_len], linewidth=6, color='c', label='ground truth')
sns.lineplot(data=abba_nb_pred[:min_len], linewidth=6, color='tomato', label='prediction (ABBA - GaussianNB)')
sns.lineplot(data=abba_nn_pred[:min_len], linewidth=6, color='m', label='prediction (ABBA - DecisionTreeClassifier)')
sns.lineplot(data=abba_svc_pred[:min_len], linewidth=6, color='yellowgreen', label='prediction (ABBA - Support Vector Classification)')
plt.legend()
plt.tick_params(axis='both', labelsize=25)
plt.show()

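[Figure: ground truth compared with the ABBA-based predictions from GaussianNB, DecisionTreeClassifier, and Support Vector Classification]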
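
Besides the visual comparison, the predictions can be compared numerically. The sketch below computes the mean absolute error over the part where the two sequences overlap, since a forecast and the test set may differ in length. The helper name mean_absolute_error and the numbers in the example are made up for illustration; in the script above one would pass test and, for example, abba_nb_pred.

```python
def mean_absolute_error(truth, prediction):
    """Mean absolute error over the overlapping prefix of two sequences."""
    n = min(len(truth), len(prediction))
    if n == 0:
        raise ValueError("both sequences must contain at least one value")
    # zip stops at the shorter sequence, so only the overlap is compared.
    return sum(abs(t - p) for t, p in zip(truth, prediction)) / n


# Made-up values, only to show the call.
truth = [10.0, 11.0, 12.0, 13.0]
prediction = [10.0, 12.0, 11.0]
print(mean_absolute_error(truth, prediction))
```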

Flexible symbolic sequence generator

The slearn library also contains functions for generating strings of tunable complexity, using LZW compression as the basis for approximating Kolmogorov complexity.
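
As an illustration of how LZW compression can serve as a complexity measure, the sketch below counts the number of codes that a standard LZW encoder emits for a string. Whether slearn defines LZW_complexity in exactly this way is an assumption; lzw_complexity is a hypothetical helper and not a library function. For the two strings shown in the generator output further down, BCA and ABCBBCBBABCC, this count gives 3 and 9.

```python
def lzw_complexity(string):
    """Number of codes emitted by a standard LZW encoder for `string`."""
    if not string:
        return 0

    # The dictionary starts with every single symbol that occurs in the string.
    dictionary = set(string)
    emitted = 0
    current = ""
    for symbol in string:
        candidate = current + symbol
        if candidate in dictionary:
            # Keep extending the phrase while it is already known.
            current = candidate
        else:
            # Emit the code for the known phrase and learn the new one.
            emitted += 1
            dictionary.add(candidate)
            current = symbol
    # The final pending phrase is emitted as well.
    return emitted + 1


print(lzw_complexity("BCA"))           # 3
print(lzw_complexity("ABCBBCBBABCC"))  # 9
```

Repetitive strings reuse long dictionary phrases and therefore emit few codes relative to their length, while irregular strings emit close to one code per symbol.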

from slearn import *
df_strings = LZWStringLibrary(symbols=3, complexity=[3, 9])
df_strings

Processing: 2 of 2

   nr_symbols  LZW_complexity  length        string
0           3               3       3           BCA
1           3               9      12  ABCBBCBBABCC
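
df_iters = pd.DataFrame()
for i, string in enumerate(df_strings['string']):
    # Every column except the last describes the string (symbols, complexity, length).
    kwargs = df_strings.iloc[i,:-1].to_dict()
    seed_string = df_strings.iloc[i,-1]  # the last column holds the string itself
    df_iter = RNN_Iteration(seed_string, iterations=2, architecture='LSTM', **kwargs)
    df_iter.loc[:, kwargs.keys()] = kwargs.values()
    # Collect the results of every run in one DataFrame.
    df_iters = pd.concat([df_iters, df_iter])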
df_iter.reset_index(drop=True, inplace=True)

...

df_iters.reset_index(drop=True, inplace=True)
df_iters

         jw   dl  total_epochs    seq_test  seq_forecast  total_time  nr_symbols  LZW_complexity  length
0  1.000000  1.0            12  ABCABCABCA    ABCABCABCA    2.685486           3               3       3
1  1.000000  1.0            14  ABCABCABCA    ABCABCABCA    2.436733           3               3       3
2  0.657143  0.5            36  CBBCBBABCC    AABCABCABC    3.352712           3               9      12
3  0.704762  0.4            36  CBBCBBABCC    ABCBABBBBB    3.811584           3               9      12
