SDOstreamclust is an algorithm for clustering data streams

Reason this release was yanked:

fails to build

pysdoclust-stream

SDOstreamclust

Incremental stream clustering (and outlier detection) algorithm based on Sparse Data Observers (SDO).

SDOstreamclust is suitable for large, multi-dimensional datasets where clusters are statistically well represented.


Dependencies

SDOstreamclust requires numpy.


Installation

SDOstreamclust can be installed from the main branch:

    pip3 install git+https://github.com/CN-TU/pysdoclust-stream

or simply:

    pip3 install pysdoclust-stream

Example

SDOstreamclust is a straightforward algorithm and very easy to configure. The main parameters are the number of observers k, which determines the size of the model, and T, which defines the memory of the algorithm.

Setting the right k (default=300) depends on the variability of the data and the expected number of clusters, but it is a robust parameter that performs well with values in the range [200, 500] in most scenarios. T (default=500), on the other hand, sets the model dynamics and inertia. Intuitively, it is the number of processed points after which the model has been fully replaced (on average). A low T is recommended when the data show very fast dynamics, whereas a high T should be used if data evolution is slow and retaining old clusters is desired.
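Read literally, the description of T implies a simple rate: if all k observers are renewed once every T points on average, then about k/T observers are replaced per processed point. The sketch below only does this arithmetic. The helper names are hypothetical and not part of the SDOstreamclust API, and the library's actual replacement policy may differ.

```python
def replacement_rate(k: int, T: float) -> float:
    """Average number of observers replaced per processed point.

    Assumption: the whole model of k observers is renewed once every
    T points on average, as the description of T suggests.
    """
    if k <= 0 or T <= 0:
        raise ValueError("k and T must be positive")
    return k / T


def points_per_replacement(k: int, T: float) -> float:
    """Average number of processed points between two observer replacements."""
    if k <= 0 or T <= 0:
        raise ValueError("k and T must be positive")
    return T / k


# Compare the defaults, the example settings, and a long-memory setting.
for k, T in [(300, 500), (200, 400), (200, 4000)]:
    rate = replacement_rate(k, T)
    gap = points_per_replacement(k, T)
    print(f"k={k}, T={T}: {rate:.3f} observers/point, one replacement every {gap:.1f} points")
```

A larger T with the same k lowers the rate, which matches the advice to use high T when old clusters should be retained.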

Additionally, input_buffer (default=0) establishes how many points must arrive before the observers update the internal clustering. This has a fundamental effect on processing speed. Most scenarios tolerate high input_buffer values without a significant loss of accuracy. The remaining parameters are inherited from SDOclust and SDOstream and do not usually require adjustment. They are described in the python/clustering.py file.

The following example code reads a data stream from a CSV file and initializes SDOstreamclust. In addition to numpy, it uses pandas to load the data.

    from SDOstreamclust import clustering
    import numpy as np
    import pandas as pd

    df = pd.read_csv('example/dataset.csv')
    t = df['timestamp'].to_numpy()
    x = df[['f0','f1']].to_numpy()
    y = df['label'].to_numpy()

    k = 200 # Model size
    T = 400 # Time Horizon
    ibuff = 10 # input buffer
    classifier = clustering.SDOstreamclust(k=k, T=T, input_buffer=ibuff)

In the code below, the stream data is processed point by point. SDOstreamclust provides a clustering label and an outlierness score per point. It can also perform outlier thresholding internally by assigning the label -1 to outliers. To enable this, set outlier_handling=True and adjust outlier_threshold (default=5).

    all_predic = []
    all_scores = []

    block_size = 1 # per-point processing
    for i in range(0, x.shape[0], block_size):
        chunk = x[i:i + block_size, :]
        chunk_time = t[i:i + block_size]
        labels, outlier_scores = classifier.fit_predict(chunk, chunk_time)
        all_predic.append(labels)
        all_scores.append(outlier_scores)
    p = np.concatenate(all_predic) # clustering labels
    s = np.concatenate(all_scores) # outlierness scores
    s = -1/(s+1) # order-preserving normalization; inf scores map to 0

    # Threshold the top outliers at mean + 3 std; by Chebyshev's inequality,
    # at least 88.9% of the scores lie within 3 std of the mean
    th = np.mean(s)+3*np.std(s)
    p[s>th]=-1

    # Evaluation metrics (requires scikit-learn)
    from sklearn.metrics.cluster import adjusted_rand_score
    from sklearn.metrics import roc_auc_score
    print("Adjusted Rand Index (clustering):", adjusted_rand_score(y,p))
    print("ROC AUC score (outlier/anomaly detection):", roc_auc_score(y<0,s))

This gives ARI=0.97 and ROC-AUC=0.99. Note how SDOstreamclust assigns high outlierness scores to the first points of emerging clusters.
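The score normalization and the mean-plus-three-standard-deviations threshold from the example can be tried in isolation. The following is a minimal sketch on made-up scores that uses only the Python standard library instead of numpy. The function names and the toy data are illustrative assumptions, not part of SDOstreamclust.

```python
import math
import statistics


def normalize_scores(scores):
    """Apply -1/(s+1): non-negative scores land in [-1, 0], order is kept,
    and an infinite score becomes 0 instead of inf."""
    return [-1.0 / (s + 1.0) for s in scores]


def chebyshev_coverage(n_std):
    """Lower bound on the fraction of values within n_std standard
    deviations of the mean, from Chebyshev's inequality."""
    return 1.0 - 1.0 / (n_std ** 2)


def flag_outliers(scores, n_std=3.0):
    """Flag scores above mean + n_std * std (population std, like np.std)."""
    threshold = statistics.mean(scores) + n_std * statistics.pstdev(scores)
    return [s > threshold for s in scores], threshold


# Toy data (made up): 19 ordinary scores and one infinite score.
raw = [0.5] * 19 + [math.inf]
norm = normalize_scores(raw)
flags, threshold = flag_outliers(norm)

print("coverage bound for 3 std:", round(chebyshev_coverage(3.0), 3))
print("threshold:", threshold)
print("flagged indices:", [i for i, f in enumerate(flags) if f])
```

The bound is distribution-free, so it holds for any score distribution, but it only guarantees a minimum coverage. The number of points that actually exceed the threshold depends on the data.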
