Skip to main content

Stream Novelty Detection for River

Project description

documentation pypi bsd_3_license PyPI - Downloads

Stream Novelty Detection for River (StreamNDR) is a Python library for online novelty detection. StreamNDR aims to enable novelty detection in data streams for Python. It is based on the river API and is currently in early stage of development. Contributors are welcome.

📚 Documentation

StreamNDR implements in Python various algorithms for novelty detection that have been proposed in the literature. It follows river implementation and format. At this stage, the following algorithms are implemented:

  • MINAS [1]
  • ECSMiner [2]
  • ECSMiner-WF (Version of ECSMiner [2] without feedback, as proposed in [1])

Full documentation is available here.

🛠 Installation

Note: StreamNDR is intended to be used with Python 3.6 or above and requires the package ClusOpt-Core which requires a C/C++ compiler (such as gcc) and the Boost.Thread library to build. To install the Boost.Thread library on Debian systems, the following command can be used:

sudo apt install libboost-thread-dev

The package can be installed simply with pip :

pip install streamndr

⚡️ Quickstart

As a quick example, we'll train two models (MINAS and ECSMiner-WF) to classify a synthetic dataset created using RandomRBF. The models are trained on only two of the four generated classes ([0,1]) and will try to detect the other classes ([2,3]) as novelty patterns in the dataset in an online fashion.

Let's first generate the dataset.

import numpy as np
from river.datasets import synth

ds = synth.RandomRBF(seed_model=42, seed_sample=42, n_classes=4, n_features=5, n_centroids=10)

offline_size = 1000
online_size = 5000
X_train = []
y_train = []
X_test = []
y_test = []

for x,y in ds.take(10*(offline_size+online_size)):
    
    #Create our training data (known classes)
    if len(y_train) < offline_size:
        if y == 0 or y == 1: #Only showing two first classes in the training set
            X_train.append(np.array(list(x.values())))
            y_train.append(y)
    
    #Create our online stream of data
    elif len(y_test) < online_size:
        X_test.append(x)
        y_test.append(y)
        
    else:
        break

X_train = np.array(X_train)
y_train = np.array(y_train)

MINAS

Let's train our MINAS model on the offline (known) data.

from streamndr.model import Minas
clf = Minas(kini=100, cluster_algorithm='clustream', 
            window_size=600, threshold_strategy=1, threshold_factor=1.1, 
            min_short_mem_trigger=100, min_examples_cluster=20, verbose=1, random_state=42)

clf.learn_many(np.array(X_train), np.array(y_train)) #learn_many expects numpy arrays or pandas dataframes

Let's now test our algorithm in an online fashion, note that our unsupervised clusters are automatically updated with the call to predict_one.

from streamndr.metrics import ConfusionMatrixNovelty, MNew, FNew, ErrRate

known_classes = [0,1]

conf_matrix = ConfusionMatrixNovelty(known_classes)
m_new = MNew(known_classes)
f_new = FNew(known_classes)
err_rate = ErrRate(known_classes)

i = 1
for x, y_true in zip(X_test, y_test):

    y_pred = clf.predict_one(x) #predict_one takes python dictionaries as per River API
    
    if y_pred is not None: #Update our metrics
        conf_matrix.update(y_true, y_pred[0])
        m_new.update(y_true, y_pred[0])
        f_new.update(y_true, y_pred[0])
        err_rate.update(y_true, y_pred[0])


    #Show progress
    if i % 100 == 0:
        print(f"{i}/{len(X_test)}")
    i += 1

Let's look at the results, of course, the hyperparameters of the model can be tuned to get better results.

#print(conf_matrix) #Shows the confusion matrix of the given problem, can be very wide due to one class being detected as multiple Novelty Patterns
print(m_new) #Percentage of novel class instances misclassified as known.
print(f_new) #Percentage of known classes misclassified as novel.
print(err_rate) #Total misclassification error percentage

MNew: 17.15%
FNew: 40.11%
ErrRate: 36.80%

ECSMiner-WF

Let's train our model on the offline (known) data.

from streamndr.model import ECSMinerWF
clf = ECSMinerWF(K=50, min_examples_cluster=10, verbose=1, random_state=42, ensemble_size=7, init_algorithm="kmeans")
clf.learn_many(np.array(X_train), np.array(y_train))

Once again, let's use our model in an online fashion.

conf_matrix = ConfusionMatrixNovelty(known_classes)
m_new = MNew(known_classes)
f_new = FNew(known_classes)
err_rate = ErrRate(known_classes)

for x, y_true in zip(X_test, y_test):

    y_pred = clf.predict_one(x) #predict_one takes python dictionaries as per River API

    if y_pred is not None: #Update our metrics
        conf_matrix.update(y_true, y_pred[0])
        m_new.update(y_true, y_pred[0])
        f_new.update(y_true, y_pred[0])
        err_rate.update(y_true, y_pred[0])
#print(conf_matrix) #Shows the confusion matrix of the given problem, can be very wide due to one class being detected as multiple Novelty Patterns
print(m_new) #Percentage of novel class instances misclassified as known.
print(f_new) #Percentage of known classes misclassified as novel.
print(err_rate) #Total misclassification error percentage

MNew: 28.98%
FNew: 30.26%
ErrRate: 32.40%

Special Thanks

Special thanks goes to Vítor Bernardes, from which some of the code for MINAS is based on their implementation.

💬 References

[1] de Faria, E.R., Ponce de Leon Ferreira Carvalho, A.C. & Gama, J. MINAS: multiclass learning algorithm for novelty detection in data streams. Data Min Knowl Disc 30, 640–680 (2016). https://doi.org/10.1007/s10618-015-0433-y

[2] M. Masud, J. Gao, L. Khan, J. Han and B. M. Thuraisingham, "Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints," in IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 859-874, June 2011, doi: 10.1109/TKDE.2010.61.

🏫 Affiliations

FZI Logo

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

streamndr-0.1.6.tar.gz (28.0 kB view details)

Uploaded Source

Built Distribution

streamndr-0.1.6-py3-none-any.whl (34.6 kB view details)

Uploaded Python 3

File details

Details for the file streamndr-0.1.6.tar.gz.

File metadata

  • Download URL: streamndr-0.1.6.tar.gz
  • Upload date:
  • Size: 28.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.13

File hashes

Hashes for streamndr-0.1.6.tar.gz
Algorithm Hash digest
SHA256 d8f8555852bc261ac652382f591795e5274e3aaa9a14740e9719db6e220f7677
MD5 8cf784b07af37d406594e68d62bcbfd9
BLAKE2b-256 99760a45047b6ef94e2b5c35ff0b53eef49011ee204f4a780cd3c4304d0cdb64

See more details on using hashes here.

File details

Details for the file streamndr-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: streamndr-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 34.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.13

File hashes

Hashes for streamndr-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 45b725066640bcf47aa816c5ff0e0fefa90509a4f7faf1746ec2b4cb942f80e8
MD5 ccf3124f46c18c4d5a45705858edcdab
BLAKE2b-256 07b5bfe8f99ca4d826a785a9e7c2caab098cdf02d40278a15246db5dd5e1b53e

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page