Sequence Graph Transform (SGT) is a sequence embedding function. SGT extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. The long and short term patterns embedded in SGT can be tuned without any increase in the computation.

# Sequence Graph Transform (SGT) — Sequence Embedding for Clustering, Classification, and Search

#### Maintained by: Chitta Ranjan

The following will cover,

## SGT Class Definition

Sequence Graph Transform (SGT) is a sequence embedding function. SGT extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. The long and short term patterns embedded in SGT can be tuned without any increase in the computation."

class SGT():
'''
Compute embedding of a single or a collection of discrete item
sequences. A discrete item sequence is a sequence made from a set
discrete elements, also known as alphabet set. For example,
suppose the alphabet set is the set of roman letters,
{A, B, ..., Z}. This set is made of discrete elements. Examples of
sequences from such a set are AABADDSA, UADSFJPFFFOIHOUGD, etc.
Such sequence datasets are commonly found in online industry,
for example, item purchase history, where the alphabet set is
the set of all product items. Sequence datasets are abundant in
bioinformatics as protein sequences.
Using the embeddings created here, classification and clustering
models can be built for sequence datasets.
Read more in https://arxiv.org/pdf/1608.03533.pdf
'''

Parameters
----------
Input:

alphabets       Optional, except if mode is Spark.
The set of alphabets that make up all
the sequences in the dataset. If not passed, the
alphabet set is automatically computed as the
unique set of elements that make all the sequences.
A list or 1d-array of the set of elements that make up the
sequences. For example, np.array(["A", "B", "C"].
If mode is 'spark', the alphabets are necessary.

kappa           Tuning parameter, kappa > 0, to change the extraction of
long-term dependency. Higher the value the lesser
the long-term dependency captured in the embedding.
Typical values for kappa are 1, 5, 10.

lengthsensitive Default false. This is set to true if the embedding of
should have the information of the length of the sequence.
If set to false then the embedding of two sequences with
similar pattern but different lengths will be the same.
lengthsensitive = false is similar to length-normalization.

flatten         Default True. If True the SGT embedding is flattened and returned as
a vector. Otherwise, it is returned as a matrix with the row and col
names same as the alphabets. The matrix form is used for
interpretation purposes. Especially, to understand how the alphabets
are "related". Otherwise, for applying machine learning or deep
learning algorithms, the embedding vectors are required.

mode            Choices in {'default', 'multiprocessing'}. Note: 'multiprocessing'
mode requires pandas==1.0.3+ and pandarallel libraries.

processors      Used if mode is 'multiprocessing'. By default, the
number of processors used in multiprocessing is
number of available - 1.
'''

Attributes
----------
def fit(sequence)

Extract Sequence Graph Transform features using Algorithm-2 in https://arxiv.org/abs/1608.03533.
Input:
sequence        An array of discrete elements. For example,
np.array(["B","B","A","C","A","C","A","A","B","A"].

Output:
sgt embedding   sgt matrix or vector (depending on Flatten==False or True) of the sequence

--
def fit_transform(corpus)

Extract SGT embeddings for all sequences in a corpus. It finds
the alphabets encompassing all the sequences in the corpus, if not inputted.
However, if the mode is 'spark', then the alphabets list has to be
explicitly given in Sgt object declaration.

Input:
corpus          A list of sequences. Each sequence is a list of alphabets.

Output:
sgt embedding of all sequences in the corpus.

--
def transform(corpus)

Find SGT embeddings of a new data sample belonging to the same population
of the corpus that was fitted initially.


## Install SGT

Install SGT in Python by running,

\$ pip install sgt

import sgt
sgt.__version__
from sgt import SGT

'2.0.0'

# -*- coding: utf-8 -*-
# Authors: Chitta Ranjan <cran2367@gmail.com>
#
# License: BSD 3 clause


## Installation Test Examples

In the following, there are a few test examples to verify the installation.

# Learning a sgt embedding as a matrix with
# rows and columns as the sequence alphabets.
# This embedding shows the relationship between
# the alphabets. The higher the value the
# stronger the relationship.

sgt = SGT(flatten=False)
sequence = np.array(["B","B","A","C","A","C","A","A","B","A"])
sgt.fit(sequence)

A B C
A 0.090616 0.131002 0.261849
B 0.086569 0.123042 0.052544
C 0.137142 0.028263 0.135335
# SGT embedding to a vector. The vector
# embedding is useful for directly applying
# a machine learning algorithm.

sgt = SGT(flatten=True)
sequence = np.array(["B","B","A","C","A","C","A","A","B","A"])
sgt.fit(sequence)

(A, A)    0.090616
(A, B)    0.131002
(A, C)    0.261849
(B, A)    0.086569
(B, B)    0.123042
(B, C)    0.052544
(C, A)    0.137142
(C, B)    0.028263
(C, C)    0.135335
dtype: float64

'''
SGT embedding on a corpus of sequences.
Test the two processing modes within the
SGT class: 'default', 'multiprocessing'.

'''

# A sample corpus of two sequences.
corpus = pd.DataFrame([[1, ["B","B","A","C","A","C","A","A","B","A"]],
[2, ["C", "Z", "Z", "Z", "D"]]],
columns=['id', 'sequence'])
corpus

id sequence
0 1 [B, B, A, C, A, C, A, A, B, A]
1 2 [C, Z, Z, Z, D]
# Learning the sgt embeddings as vector for
# all sequences in a corpus.
# mode: 'default'
sgt = SGT(kappa=1,
flatten=True,
lengthsensitive=False,
mode='default')
sgt.fit_transform(corpus)

id (A, A) (A, B) (A, C) (A, D) (A, Z) (B, A) (B, B) (B, C) (B, D) ... (D, A) (D, B) (D, C) (D, D) (D, Z) (Z, A) (Z, B) (Z, C) (Z, D) (Z, Z)
0 1.0 0.090616 0.131002 0.261849 0.0 0.0 0.086569 0.123042 0.052544 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000
1 2.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.184334 0.290365

2 rows × 26 columns

# Learning the sgt embeddings as vector for
# all sequences in a corpus.
# mode: 'multiprocessing'

import pandarallel  # required library for multiprocessing

sgt = SGT(kappa=1,
flatten=True,
lengthsensitive=False,
mode='multiprocessing')

sgt.fit_transform(corpus)

INFO: Pandarallel will run on 7 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

id (A, A) (A, B) (A, C) (A, D) (A, Z) (B, A) (B, B) (B, C) (B, D) ... (D, A) (D, B) (D, C) (D, D) (D, Z) (Z, A) (Z, B) (Z, C) (Z, D) (Z, Z)
0 1.0 0.090616 0.131002 0.261849 0.0 0.0 0.086569 0.123042 0.052544 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000
1 2.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.184334 0.290365

2 rows × 26 columns

## Load Libraries for Illustrative Examples

from sgt import SGT

import numpy as np
import pandas as pd
from itertools import chain
from itertools import product as iterproduct
import warnings

import pickle

########
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
import sklearn.metrics
import time

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(7) # fix random seed for reproducibility

# from sgt import Sgt


## Sequence Clustering

A form of unsupervised learning from sequences is clustering. For example, in

• user weblogs sequences: clustering the weblogs segments users into groups with similar browsing behavior. This helps in targeted marketing, anomaly detection, and other web customizations.

• protein sequences: clustering proteins with similar structures help researchers study the commonalities between species. It also helps in faster search in some search algorithms.

In the following, clustering on a protein sequence dataset will be shown.

### Protein Sequence Clustering

The data used here is taken from www.uniprot.org. This is a public database for proteins. The data contains the protein sequences and their functions.

# Loading data

# Data preprocessing
corpus = corpus.loc[:,['Entry','Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
corpus

id sequence
0 M7MCX3 [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ...
1 K6PL84 [M, E, I, E, K, N, Y, R, M, N, S, L, F, E, F, ...
2 R4W5V3 [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ...
3 T2A126 [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ...
4 L0SHD5 [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ...
... ... ...
2107 A0A081R612 [M, M, N, M, Q, N, M, M, R, Q, A, Q, K, L, Q, ...
2108 A0A081QQM2 [M, M, N, M, Q, N, M, M, R, Q, A, Q, K, L, Q, ...
2109 J1A517 [M, M, R, Q, A, Q, K, L, Q, K, Q, M, E, Q, S, ...
2110 F5U1T6 [M, M, N, M, Q, S, M, M, K, Q, A, Q, K, L, Q, ...
2111 J3A2T7 [M, M, N, M, Q, N, M, M, K, Q, A, Q, K, L, Q, ...

2112 rows × 2 columns

%%time
# Compute SGT embeddings
sgt_ = SGT(kappa=1,
lengthsensitive=False,
mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)

INFO: Pandarallel will run on 7 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
CPU times: user 324 ms, sys: 68 ms, total: 392 ms
Wall time: 9.02 s

sgtembedding_df

id (A, A) (A, C) (A, D) (A, E) (A, F) (A, G) (A, H) (A, I) (A, K) ... (Y, M) (Y, N) (Y, P) (Y, Q) (Y, R) (Y, S) (Y, T) (Y, V) (Y, W) (Y, Y)
0 M7MCX3 0.020180 0.0 0.009635 0.013529 0.009360 0.003205 2.944887e-10 0.002226 0.000379 ... 0.009196 0.007964 0.036788 0.000195 0.001513 0.020665 0.000542 0.007479 0.0 0.010419
1 K6PL84 0.001604 0.0 0.012637 0.006323 0.006224 0.004819 3.560677e-03 0.001124 0.012136 ... 0.135335 0.006568 0.038901 0.011298 0.012578 0.009913 0.001079 0.000023 0.0 0.007728
2 R4W5V3 0.012448 0.0 0.008408 0.016363 0.027469 0.003205 2.944887e-10 0.004249 0.013013 ... 0.008114 0.007128 0.000000 0.000203 0.001757 0.022736 0.000249 0.012652 0.0 0.008533
3 T2A126 0.010545 0.0 0.012560 0.014212 0.013728 0.000000 2.944887e-10 0.007223 0.000309 ... 0.000325 0.009669 0.000000 0.003182 0.001904 0.015607 0.000577 0.007479 0.0 0.008648
4 L0SHD5 0.020180 0.0 0.008628 0.015033 0.009360 0.003205 2.944887e-10 0.002226 0.000379 ... 0.009196 0.007964 0.036788 0.000195 0.001513 0.020665 0.000542 0.007479 0.0 0.010419
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2107 A0A081R612 0.014805 0.0 0.004159 0.017541 0.012701 0.013099 0.000000e+00 0.017043 0.004732 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
2108 A0A081QQM2 0.010774 0.0 0.004283 0.014732 0.014340 0.014846 0.000000e+00 0.016806 0.005406 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
2109 J1A517 0.010774 0.0 0.004283 0.014732 0.014340 0.014846 0.000000e+00 0.014500 0.005406 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
2110 F5U1T6 0.015209 0.0 0.005175 0.023888 0.011410 0.011510 0.000000e+00 0.021145 0.009280 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
2111 J3A2T7 0.005240 0.0 0.012301 0.013178 0.014744 0.014705 0.000000e+00 0.000981 0.007957 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000

2112 rows × 401 columns

# Set the id column as the dataframe index
sgtembedding_df = sgtembedding_df.set_index('id')
sgtembedding_df

(A, A) (A, C) (A, D) (A, E) (A, F) (A, G) (A, H) (A, I) (A, K) (A, L) ... (Y, M) (Y, N) (Y, P) (Y, Q) (Y, R) (Y, S) (Y, T) (Y, V) (Y, W) (Y, Y)
id
M7MCX3 0.020180 0.0 0.009635 0.013529 0.009360 0.003205 2.944887e-10 0.002226 0.000379 0.021703 ... 0.009196 0.007964 0.036788 0.000195 0.001513 0.020665 0.000542 0.007479 0.0 0.010419
K6PL84 0.001604 0.0 0.012637 0.006323 0.006224 0.004819 3.560677e-03 0.001124 0.012136 0.018427 ... 0.135335 0.006568 0.038901 0.011298 0.012578 0.009913 0.001079 0.000023 0.0 0.007728
R4W5V3 0.012448 0.0 0.008408 0.016363 0.027469 0.003205 2.944887e-10 0.004249 0.013013 0.031118 ... 0.008114 0.007128 0.000000 0.000203 0.001757 0.022736 0.000249 0.012652 0.0 0.008533
T2A126 0.010545 0.0 0.012560 0.014212 0.013728 0.000000 2.944887e-10 0.007223 0.000309 0.028531 ... 0.000325 0.009669 0.000000 0.003182 0.001904 0.015607 0.000577 0.007479 0.0 0.008648
L0SHD5 0.020180 0.0 0.008628 0.015033 0.009360 0.003205 2.944887e-10 0.002226 0.000379 0.021703 ... 0.009196 0.007964 0.036788 0.000195 0.001513 0.020665 0.000542 0.007479 0.0 0.010419
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
A0A081R612 0.014805 0.0 0.004159 0.017541 0.012701 0.013099 0.000000e+00 0.017043 0.004732 0.014904 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
A0A081QQM2 0.010774 0.0 0.004283 0.014732 0.014340 0.014846 0.000000e+00 0.016806 0.005406 0.014083 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
J1A517 0.010774 0.0 0.004283 0.014732 0.014340 0.014846 0.000000e+00 0.014500 0.005406 0.014083 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
F5U1T6 0.015209 0.0 0.005175 0.023888 0.011410 0.011510 0.000000e+00 0.021145 0.009280 0.017466 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
J3A2T7 0.005240 0.0 0.012301 0.013178 0.014744 0.014705 0.000000e+00 0.000981 0.007957 0.017112 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000

2112 rows × 400 columns

We perform PCA on the sequence embeddings and then do kmeans clustering.

pca = PCA(n_components=2)
pca.fit(sgtembedding_df)

X=pca.transform(sgtembedding_df)

print(np.sum(pca.explained_variance_ratio_))
df = pd.DataFrame(data=X, columns=['x1', 'x2'])

0.6432744907364981

x1 x2
0 0.384913 -0.269873
1 0.022764 0.135995
2 0.177792 -0.172454
3 0.168074 -0.147334
4 0.383616 -0.271163
kmeans = KMeans(n_clusters=3, max_iter =300)
kmeans.fit(df)

labels = kmeans.predict(df)
centroids = kmeans.cluster_centers_

fig = plt.figure(figsize=(5, 5))
colmap = {1: 'r', 2: 'g', 3: 'b'}
colors = list(map(lambda x: colmap[x+1], labels))
plt.scatter(df['x1'], df['x2'], color=colors, alpha=0.5, edgecolor=colors)

<matplotlib.collections.PathCollection at 0x14c5a77f0>


## Sequence Classification using Deep Learning in TensorFlow

The protein data set used above is also labeled. The labels represent the protein functions. Similarly, there are other labeled sequence data sets. For example, DARPA shared an intrusion weblog data set. It contains weblog sequences with positive labels if the log represents a network intrusion.

In such problems supervised learning is employed. Classification is a supervised learning we will demonstrate here.

### Protein Sequence Classification

The data set is taken from https://www.uniprot.org . The protein sequences in the data set have one of the two functions,

• Binds to DNA and alters its conformation. May be involved in regulation of gene expression, nucleoid organization and DNA protection.
• Might take part in the signal recognition particle (SRP) pathway. This is inferred from the conservation of its genetic proximity to ftsY/ffh. May be a regulatory protein.

There are a total of 2113 samples. The sequence lengths vary between 100-700.

# Loading data

# Data preprocessing
y = data['Function [CC]']
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)

corpus = data.loc[:,['Entry','Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)


#### Sequence embeddings

# Sequence embedding
sgt_ = SGT(kappa=1,
lengthsensitive=False,
mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
X = sgtembedding_df.set_index('id')

INFO: Pandarallel will run on 7 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


We will perform a 10-fold cross-validation to measure the performance of the classification model.

kfold = 10
X = X
y = encoded_y

random_state = 1

test_F1 = np.zeros(kfold)
skf = KFold(n_splits = kfold, shuffle = True, random_state = random_state)
k = 0
epochs = 50
batch_size = 128

for train_index, test_index in skf.split(X, y):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y[train_index], y[test_index]

model = Sequential()
model.add(Dense(64, input_shape = (X_train.shape[1],)))

model.fit(X_train, y_train ,batch_size=batch_size, epochs=epochs, verbose=0)

y_pred = model.predict_proba(X_test).round().astype(int)
y_train_pred = model.predict_proba(X_train).round().astype(int)

test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
k+=1

print ('Average f1 score', np.mean(test_F1))

Average f1 score 1.0


### Weblog Classification for Intrusion Detection

This data sample is taken from https://www.ll.mit.edu/r-d/datasets/1998-darpa-intrusion-detection-evaluation-dataset. This is a network intrusion data containing audit logs and any attack as a positive label. Since, network intrusion is a rare event, the data is unbalanced. Here we will,

• build a sequence classification model to predict a network intrusion.

Each sequence contains in the data is a series of activity, for example, {login, password}. The alphabets in the input data sequences are already encoded into integers. The original sequences data file is also present in the /data directory.

# Loading data
data.columns

Index(['timeduration', 'seqlen', 'seq', 'class'], dtype='object')

data['id'] = data.index

# Data preprocessing
y = data['class']
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)

corpus = data.loc[:,['id','seq']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)


#### Sequence embeddings

In this data, the sequence embeddings should be length-sensitive.

The lengths are important here because sequences with similar patterns but different lengths can have different labels. Consider a simple example of two sessions: {login, pswd, login, pswd,...} and {login, pswd,...(repeated several times)..., login, pswd}.

While the first session can be a regular user mistyping the password once, the other session is possibly an attack to guess the password. Thus, the sequence lengths are as important as the patterns.

Therefore, lengthsensitive=True is used here.

# Sequence embedding
sgt_ = SGT(kappa=5,
lengthsensitive=True,
mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
sgtembedding_df = sgtembedding_df.set_index('id')
sgtembedding_df

INFO: Pandarallel will run on 7 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

(0, 0) (0, 1) (0, 2) (0, 3) (0, 4) (0, 5) (0, 6) (0, 7) (0, 8) (0, 9) ... (~, 1) (~, 2) (~, 3) (~, 4) (~, 5) (~, 6) (~, 7) (~, 8) (~, 9) (~, ~)
id
0.0 0.000000 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.485034 0.486999 0.485802 0.483097 0.483956 0.000000 0.000000 0.000000 0.000000 0.178609
1.0 0.000000 0.025622 0.228156 0.000000e+00 0.000000e+00 1.310714e-09 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.447620 0.452097 0.464568 0.367296 0.525141 0.455018 0.374364 0.414081 0.549981 0.172479
2.0 0.000000 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.525605 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.193359 0.071469
3.0 0.077999 0.208974 0.230338 1.830519e-01 1.200926e-17 1.696880e-01 0.093646 7.985870e-02 2.896813e-05 3.701710e-05 ... 0.474072 0.468353 0.463594 0.177507 0.551270 0.418652 0.309652 0.384657 0.378225 0.170362
4.0 0.000000 0.023695 0.217819 2.188276e-33 0.000000e+00 6.075992e-11 0.000000 0.000000e+00 5.681668e-39 0.000000e+00 ... 0.464120 0.468229 0.452170 0.000000 0.501242 0.000000 0.300534 0.161961 0.000000 0.167082
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
106.0 0.000000 0.024495 0.219929 2.035190e-17 1.073271e-18 5.656994e-11 0.000000 0.000000e+00 5.047380e-29 0.000000e+00 ... 0.502213 0.544343 0.477281 0.175901 0.461103 0.000000 0.000000 0.162796 0.000000 0.167687
107.0 0.110422 0.227478 0.217549 1.723963e-01 1.033292e-14 3.896725e-07 0.083685 2.940589e-08 8.864072e-02 4.813990e-29 ... 0.490398 0.522016 0.466808 0.470603 0.479795 0.480057 0.194888 0.172397 0.164873 0.172271
108.0 0.005646 0.202424 0.196786 2.281242e-01 1.133936e-01 1.862098e-01 0.000000 1.212869e-01 9.180520e-08 0.000000e+00 ... 0.432834 0.434953 0.439615 0.390864 0.481764 0.600875 0.166766 0.165368 0.000000 0.171729
109.0 0.000000 0.025616 0.238176 3.889176e-55 1.332427e-60 1.408003e-09 0.000000 9.845377e-60 0.000000e+00 0.000000e+00 ... 0.421318 0.439985 0.467953 0.440951 0.527165 0.864717 0.407155 0.399335 0.251304 0.171885
110.0 0.000000 0.022868 0.203513 9.273472e-64 0.000000e+00 1.240870e-09 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.478090 0.454871 0.459109 0.000000 0.490534 0.370357 0.000000 0.162997 0.000000 0.162089

111 rows × 121 columns

#### Applying PCA on the embeddings

The embeddings are sparse and high-dimensional. PCA is, therefore, applied for dimension reduction.

from sklearn.decomposition import PCA
pca = PCA(n_components=35)
pca.fit(sgtembedding_df)
X = pca.transform(sgtembedding_df)
print(np.sum(pca.explained_variance_ratio_))

0.9962446146783123


#### Building a Multi-Layer Perceptron Classifier

The PCA transforms of the embeddings are used directly as inputs to an MLP classifier.

kfold = 3
random_state = 11

X = X
y = encoded_y

test_F1 = np.zeros(kfold)
time_k = np.zeros(kfold)
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
k = 0
epochs = 300
batch_size = 15

# class_weight = {0 : 1., 1: 1.,}  # The weights can be changed and made inversely proportional to the class size to improve the accuracy.
class_weight = {0 : 0.12, 1: 0.88,}

for train_index, test_index in skf.split(X, y):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

model = Sequential()
model.summary()

start_time = time.time()
model.fit(X_train, y_train ,batch_size=batch_size, epochs=epochs, verbose=1, class_weight=class_weight)
end_time = time.time()
time_k[k] = end_time-start_time

y_pred = model.predict_proba(X_test).round().astype(int)
y_train_pred = model.predict_proba(X_train).round().astype(int)
test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
k += 1

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_30 (Dense)             (None, 128)               4608
_________________________________________________________________
activation_30 (Activation)   (None, 128)               0
_________________________________________________________________
dropout_20 (Dropout)         (None, 128)               0
_________________________________________________________________
dense_31 (Dense)             (None, 1)                 129
_________________________________________________________________
activation_31 (Activation)   (None, 1)                 0
=================================================================
Total params: 4,737
Trainable params: 4,737
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:sample_weight modes were coerced from
...
to
['...']
Train on 74 samples
Epoch 1/300
74/74 [==============================] - 0s 7ms/sample - loss: 0.1487 - accuracy: 0.5270
Epoch 2/300
74/74 [==============================] - 0s 120us/sample - loss: 0.1421 - accuracy: 0.5000
...
74/74 [==============================] - 0s 118us/sample - loss: 0.0299 - accuracy: 0.8784
Epoch 300/300
74/74 [==============================] - 0s 133us/sample - loss: 0.0296 - accuracy: 0.8649

print ('Average f1 score', np.mean(test_F1))
print ('Average Run time', np.mean(time_k))

Average f1 score 0.6341880341880342
Average Run time 3.880180994669596


#### Building an LSTM Classifier on the sequences for comparison

We built an LSTM Classifier on the sequences to compare the accuracy.

X = data['seq']
encoded_X = np.ndarray(shape=(len(X),), dtype=list)
for i in range(0,len(X)):
encoded_X[i]=X.iloc[i].split("~")
X

0      1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~...
1      6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~...
2      19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~1...
3      6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~...
4      5~5~17~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~...
...
106    10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~3...
107    5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~...
108    6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~...
109    6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5...
110    5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~...
Name: seq, Length: 111, dtype: object

max_seq_length = np.max(data['seqlen'])
encoded_X = tf.keras.preprocessing.sequence.pad_sequences(encoded_X, maxlen=max_seq_length)

kfold = 3
random_state = 11

test_F1 = np.zeros(kfold)
time_k = np.zeros(kfold)

epochs = 50
batch_size = 15
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
k = 0

for train_index, test_index in skf.split(encoded_X, y):
X_train, X_test = encoded_X[train_index], encoded_X[test_index]
y_train, y_test = y[train_index], y[test_index]

embedding_vecor_length = 32
top_words=50
model = Sequential()

model.summary()

start_time = time.time()
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1)
end_time=time.time()
time_k[k]=end_time-start_time

y_pred = model.predict_proba(X_test).round().astype(int)
y_train_pred=model.predict_proba(X_train).round().astype(int)
test_F1[k]=sklearn.metrics.f1_score(y_test, y_pred)
k+=1

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 1773, 32)          1600
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320
_________________________________________________________________
dense_36 (Dense)             (None, 1)                 33
_________________________________________________________________
activation_36 (Activation)   (None, 1)                 0
=================================================================
Total params: 9,953
Trainable params: 9,953
Non-trainable params: 0
_________________________________________________________________
Train on 74 samples
Epoch 1/50
74/74 [==============================] - 5s 72ms/sample - loss: 0.6894 - accuracy: 0.5676
Epoch 2/50
74/74 [==============================] - 4s 48ms/sample - loss: 0.6590 - accuracy: 0.8784
...
Epoch 50/50
74/74 [==============================] - 4s 51ms/sample - loss: 0.1596 - accuracy: 0.9324

print ('Average f1 score', np.mean(test_F1))
print ('Average Run time', np.mean(time_k))

Average f1 score 0.36111111111111116
Average Run time 192.46954011917114


We find that the LSTM classifier gives a significantly lower F1 score. This may be improved by changing the model. However, we find that the SGT embedding could work with a small and unbalanced data without the need of a complicated classifier model.

LSTM models typically require more data for training and also has significantly more computation time. The LSTM model above took 425.6 secs while the MLP model took just 9.1 secs.

## Sequence Search

Sequence data sets are generally large. For example, sequences of listening history in music streaming services, such as Pandora, for more than 70M users are huge. In protein data bases there could be even larger size. For instance, the Uniprot data repository has more than 177M sequences.

Searching for similar sequences in such large data bases is challenging. SGT embedding provides a simple solution. In the following it will be shown on a protein data set that SGT embedding can be used to compute similarity between a query sequence and the sequence corpus using a dot product. The sequences with the highest dot product are returned as the most similar sequence to the query.

### Protein Sequence Search

In the following, a sample of 10k protein sequences are used for illustration. The data is taken from https://www.uniprot.org .

# Loading data
data = pd.read_csv('data/protein-uniprot-reviewed-Ano-10k.tab', sep='\t')

# Data preprocessing
corpus = data.loc[:,['Entry','Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)

id sequence
0 I2WKR6 [M, V, H, K, S, D, S, D, E, L, A, A, L, R, A, ...
1 A0A2A6M8K9 [M, Q, E, S, L, V, V, R, R, E, T, H, I, A, A, ...
2 A0A3G5KEC3 [M, A, S, G, A, Y, S, K, Y, L, F, Q, I, I, G, ...
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K',
'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V',
'W', 'X', 'Y', 'U', 'O']  # List of amino acids

# Alphabets are known and inputted
# as arguments for faster computation
sgt_ = SGT(alphabets=alphabets,
lengthsensitive=True,
kappa=1,
flatten=True,
mode='multiprocessing')

sgtembedding_df = sgt_.fit_transform(corpus)
sgtembedding_df = sgtembedding_df.set_index('id')

INFO: Pandarallel will run on 7 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

'''
Search proteins similar to a query protein.
The approach is to find the SGT embedding of the
query protein and find its similarity with the
embeddings of the protein database.
'''

# Step 1. Compute sgt embedding for the query protein.
query_protein_sgt_embedding = sgt_.fit(list(query_protein))

# Step 2. Compute the dot product of query embedding
# with the protein embedding database.
similarity = sgtembedding_df.dot(query_protein_sgt_embedding)

# Step 3. Return the top k protein names based on similarity.
similarity.sort_values(ascending=False)

id
K0ZGN5        2773.749663
A0A0Y1CPH7    1617.451379
A0A5R8LCJ1    1566.833152
A0A290WY40    1448.772820
A0A073K6N6    1392.267250
...
A0A1S7UBK4     160.074989
A0A2S7T1R9     156.580584
A0A0E0UQV6     155.834932
A0A1Y5Y0S0     148.862049
B0NRP3         117.656497
Length: 10063, dtype: float64


## SGT - Spark for Distributed Computing

As mentioned in the previous section, sequence data sets can be large. SGT complexity is linear with the number of sequences in a data set. Still if the data size is large the computation becomes high. For example, for a set of 1M protein sequences the default SGT mode takes over 24 hours.

Using distributed computing with Spark the runtime can be significantly reduced. For instance, SGT-Spark on the same 1M protein data set took less than 29 minutes.

In the following, Spark implementation for SGT is shown. First, it is applied on a smaller 10k data set for comparison. Then it is applied on 1M data set without any syntactical change.

'''
'''
data = sc.textFile('data/protein-uniprot-reviewed-Ano-10k.tab')

data = data.filter(lambda row: row != header)   #filter out header
data.take(1)  # See one sample

Out[3]: ['I2WKR6\tI2WKR6_ECOLX\tunreviewed\tType III restriction enzyme, res subunit (EC 3.1.21.5)\tEC90111_4246\tEscherichia coli 9.0111\t786\tMVHKSDSDELAALRAENVRLVSLLEAHGIEWRRKPQSPVPRVSVLSTNEKVALFRRLFRGRDDVWALRWESKTSGKSGYSPACANEWQLGICGKPRIKCGDCAHRQLIPVSDLVIYHHLAGTHTAGMYPLLEDDSCYFLAVDFDEAEWQKDASAFMRSCDELGVPAALEISRSRQGAHVWIFFASRVSAREARRLGTAIISYTCSRTRQLRLGSYDRLFPNQDTMPKGGFGNLIALPLQKRPRELGGSVFVDMNLQPYPDQWAFLVSVIPMNVQDIEPTILRATGSIHPLDVNFINEEDLGTPWEEKKSSGNRLNIAVTEPLIITLANQIYFEKAQLPQALVNRLIRLAAFPNPEFYKAQAMRMSVWNKPRVIGCAENYPQHIALPRGCLDSALSFLRYNNIAAELIDKRFAGTECNAVFTGNLRAEQEEAVSALLRYDTGVLCAPTAFGKTVTAAAVIARRKVNTLILVHRTELLKQWQERLAVFLQVGDSIGIIGGGKHKPCGNIDIAVVQSISRHGEVEPLVRNYGQIIVDECHHIGAVSFSAILKETNARYLLGLTATPIRRDGLHPIIFMYCGAIRHTAARPKESLHNLEVLTRSRFTSGHLPSDARIQDIFREIALDHDRTVAIAEEAMKAFGQGRKVLVLTERTDHLDDIASVMNTLKLSPFVLHSRLSKKKRTMLISGLNALPPDSPRILLSTGRLIGEGFDHPPLDTLILAMPVSWKGTLQQYAGRLHREHTGKSDVRIIDFVDTAYPVLLRMWDKRQRGYKAMGYRIVADGEGLSF']
# Repartition for increasing the parallel processes
data = data.repartition(500)

def preprocessing(line):
'''
Original data are lines where each line has \t
separated values. We are interested in preserving
the first value (entry id), tmp[0], and the last value
(the sequence), tmp[-1].
'''
tmp = line.split('\t')
id = tmp[0]
sequence = list(tmp[-1])
return (id, sequence)

processeddata = data.map(lambda line: preprocessing(line))
processeddata.take(1)  # See one sample

Out[5]: [('A0A2E9WIJ1', ['M', 'Y', 'I', 'F', 'L', 'T', 'L', ... 'A', 'K', 'L', 'D', 'K', 'N', 'D'])]
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K',
'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V',
'W', 'X', 'Y', 'U', 'O']  # List of amino acids

'''
Spark approach.
In this approach the alphabets argument has to
be passed to the SGT class definition.
The SGT.fit() is then called in parallel.
'''
sgt_ = sgt.SGT(alphabets=alphabets,
kappa=1,
lengthsensitive=True,
flatten=True)
rdd = processeddata.map(lambda x: (x[0], list(sgt_.fit(x[1]))))
sgtembeddings = rdd.collect()
# Command took 29.66 seconds -- by cranjan@processminer.com at 4/22/2020, 12:31:23 PM on databricks


### Compare with the default SGT mode

# Loading data
data = pd.read_csv('data/protein-uniprot-reviewed-Ano-10k.tab', sep='\t')

# Data preprocessing
corpus = data.loc[:,['Entry','Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)

sgt_ = sgt.SGT(alphabets=alphabets,
lengthsensitive=True,
kappa=1,
flatten=True,
mode='default')

sgtembedding_df = sgt_.fit_transform(corpus)
# Command took 13.08 minutes -- by cranjan@processminer.com at 4/22/2020, 1:48:02 PM on databricks


### 1M Protein Database

Protein 1M sequence data set is embedded here. The data set is available here.

'''
'''
data = sc.textFile('data/protein-uniprot-reviewed-Ano-1M.tab')

data = data.filter(lambda row: row != header)   #filter out header
data.take(1)  # See one sample

# Repartition for increasing the parallel processes
data = data.repartition(10000)

processeddata = data.map(lambda line: preprocessing(line))
processeddata.take(1)  # See one sample

# [('A0A2E9WIJ1',
#   ['M','Y','I','F','L','T','L','A','L','F','S',...,'F','S','I','F','A','K','L','D','K','N','D'])]

# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K',
'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V',
'W', 'X', 'Y', 'U', 'O']  # List of amino acids

'''
Spark approach.
In this approach the alphabets argument has to
be passed to the SGT class definition.
The SGT.fit() is then called in parallel.
'''
sgt_ = sgt.SGT(alphabets=alphabets,
kappa=1,
lengthsensitive=True,
flatten=True)
rdd = processeddata.map(lambda x: (x[0], list(sgt_.fit(x[1]))))
sgtembeddings = rdd.collect()
# Command took 28.98 minutes -- by cranjan@processminer.com at 4/22/2020, 3:16:41 PM on databricks

'''OPTIONAL.
Save the embeddings for future use or
production deployment.'''
# Save for deployment
# pickle.dump(sgtembeddings,
#             open("data/protein-sgt-embeddings-1M.pkl", "wb"))
# The pickle dump is shared at https://mega.nz/file/hiAxAAoI#SStAIn_FZjAHvXSpXfdy8VpISG6rusHRf9HlUSqwcsw
# sgtembeddings = pickle.load(open("data/protein-sgt-embeddings-1M.pkl", "rb"))


The pickle dump is shared here.

### Sequence Search using SGT - Spark

Since sgtembeddings on the 1M data set is large it is recommended to use distributed computing to find similar proteins during a search.

sgtembeddings_rdd = sc.parallelize(list(dict(sgtembeddings).items()))
sgtembeddings_rdd = sgtembeddings_rdd.repartition(10000)

'''
Search proteins similar to a query protein.
The approach is to find the SGT embedding of the
query protein and find its similarity with the
embeddings of the protein database.
'''

# Step 1. Compute sgt embedding for the query protein.
query_protein_sgt_embedding = sgt_.fit(list(query_protein))

# Step 2. Broadcast the embedding to the cluster.

# Step 3. Compute similarity between each sequence embedding and the query.
similarity = sgtembeddings_rdd.map(lambda x: (x[0],
x[1]))).collect()

# Step 4. Show the most similar sequences with the query.
pd.DataFrame(similarity).sort_values(by=1, ascending=False)


## Release history Release notifications | RSS feed

Uploaded source
Uploaded py3