Stream Novelty Detection for River
Stream Novelty Detection for River (StreamNDR) is a Python library for online novelty detection in data streams. It is based on the River API and is currently in an early stage of development. Contributors are welcome.
📚 Documentation
StreamNDR implements, in Python, various novelty detection algorithms that have been proposed in the literature. It follows River's implementation conventions and format. At this stage, the following algorithms are implemented:
- MINAS [1]
- ECSMiner [2]
- ECSMiner-WF (Version of ECSMiner [2] without feedback, as proposed in [1])
- ECHO [3]
Full documentation is available here.
🛠 Installation
Note: StreamNDR is intended to be used with Python 3.6 or above. It requires the ClusOpt-Core package, which needs a C/C++ compiler (such as gcc) and the Boost.Thread library to build. On Debian systems, the Boost.Thread library can be installed with the following command:
sudo apt install libboost-thread-dev
The package can then be installed with pip:
pip install streamndr
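To confirm that the installation succeeded, the installed version can be queried from Python. This is a minimal sketch that only uses the standard library; the helper name `installed_version` is hypothetical, and `importlib.metadata` requires Python 3.8 or above, which is stricter than the minimum stated above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("streamndr")
    if version is None:
        print("streamndr is not installed")
    else:
        print(f"streamndr version: {version}")
```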
⚡️ Quickstart
As a quick example, we'll train three models (MINAS, ECSMiner-WF, and ECHO) to classify a synthetic dataset created using RandomRBF. The models are trained on only two of the four generated classes ([0,1]) and will try to detect the other classes ([2,3]) as novelty patterns in the dataset in an online fashion.
Let's first generate the dataset.
import numpy as np
from river.datasets import synth

ds = synth.RandomRBF(seed_model=42, seed_sample=42, n_classes=4, n_features=5, n_centroids=10)

offline_size = 1000
online_size = 5000

X_train = []
y_train = []
X_test = []
y_test = []

for x, y in ds.take(10 * (offline_size + online_size)):
    # Create our training data (known classes)
    if len(y_train) < offline_size:
        if y == 0 or y == 1:  # Only show the first two classes in the training set
            X_train.append(np.array(list(x.values())))
            y_train.append(y)
    # Create our online stream of data
    elif len(y_test) < online_size:
        X_test.append(x)
        y_test.append(y)
    else:
        break

X_train = np.array(X_train)
y_train = np.array(y_train)
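The splitting logic above does not depend on River itself, so it can be factored into a reusable function. The following is a sketch under assumptions: `split_stream` and `toy_stream` are hypothetical names that are not part of StreamNDR or River, and `toy_stream` is only a stand-in generator that yields `(feature dict, label)` pairs in the same shape as a River dataset.

```python
import random


def split_stream(stream, known_classes, offline_size, online_size):
    """Split a labelled stream into an offline set (known classes only) and an online stream."""
    X_train, y_train, X_test, y_test = [], [], [], []
    for x, y in stream:
        if len(y_train) < offline_size:
            # Offline phase: keep only known classes, drop everything else
            if y in known_classes:
                X_train.append(list(x.values()))
                y_train.append(y)
        elif len(y_test) < online_size:
            # Online phase: keep every example, including unseen classes
            X_test.append(x)
            y_test.append(y)
        else:
            break
    return X_train, y_train, X_test, y_test


def toy_stream(n, n_classes=4, seed=42):
    """Hypothetical stand-in for a River dataset: yields (feature dict, label) pairs."""
    rng = random.Random(seed)
    for _ in range(n):
        y = rng.randrange(n_classes)
        yield {"f0": rng.gauss(y, 1.0), "f1": rng.gauss(-y, 1.0)}, y


X_off, y_off, X_on, y_on = split_stream(toy_stream(1000), {0, 1}, offline_size=50, online_size=200)
print(len(y_off), len(y_on))
```

As in the original loop, the offline features are stored as plain value lists (ready to be converted to a numpy array), while the online examples stay as dictionaries.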
MINAS
Let's train our MINAS model on the offline (known) data.
from streamndr.model import Minas

clf = Minas(kini=100, cluster_algorithm='clustream',
            window_size=600, threshold_strategy=1, threshold_factor=1.1,
            min_short_mem_trigger=100, min_examples_cluster=20, verbose=1, random_state=42)

clf.learn_many(np.array(X_train), np.array(y_train))  # learn_many expects numpy arrays or pandas dataframes
Let's now test our algorithm in an online fashion. Note that the unsupervised clusters are automatically updated with each call to predict_one.
from streamndr.metrics import ConfusionMatrixNovelty, MNew, FNew, ErrRate

known_classes = [0,1]

conf_matrix = ConfusionMatrixNovelty(known_classes)
m_new = MNew(known_classes)
f_new = FNew(known_classes)
err_rate = ErrRate(known_classes)

i = 1
for x, y_true in zip(X_test, y_test):
    y_pred = clf.predict_one(x)  # predict_one takes python dictionaries as per River API

    if y_pred is not None:  # Update our metrics
        conf_matrix.update(y_true, y_pred[0])
        m_new.update(y_true, y_pred[0])
        f_new.update(y_true, y_pred[0])
        err_rate.update(y_true, y_pred[0])

    # Show progress
    if i % 100 == 0:
        print(f"{i}/{len(X_test)}")
    i += 1
Let's look at the results. Of course, the hyperparameters of the model can be tuned to get better results.
#print(conf_matrix) #Shows the confusion matrix of the given problem, can be very wide due to one class being detected as multiple Novelty Patterns
print(m_new) #Percentage of novel class instances misclassified as known.
print(f_new) #Percentage of known classes misclassified as novel.
print(err_rate) #Total misclassification error percentage
MNew: 17.15%
FNew: 40.11%
ErrRate: 36.80%
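To make the three metrics concrete, here is a simplified plain-Python sketch that follows the definitions given in the comments above. It is an illustration only, not the library's implementation: it assumes that any predicted label outside `known_classes` counts as "novel", and it ignores details such as how StreamNDR associates novelty patterns with true classes. The function name `novelty_metrics` is hypothetical.

```python
def novelty_metrics(y_true, y_pred, known_classes):
    """Compute simplified MNew, FNew and ErrRate percentages from label lists."""
    known = set(known_classes)
    n_known = n_novel = 0
    novel_as_known = known_as_novel = errors = 0

    for t, p in zip(y_true, y_pred):
        if t in known:
            n_known += 1
            if p not in known:
                # Known class flagged as novel (counts towards FNew)
                known_as_novel += 1
                errors += 1
            elif p != t:
                # Known class confused with another known class
                errors += 1
        else:
            n_novel += 1
            if p in known:
                # Novel class assigned to a known class (counts towards MNew)
                novel_as_known += 1
                errors += 1

    total = n_known + n_novel
    return {
        "MNew": 100 * novel_as_known / n_novel if n_novel else 0.0,
        "FNew": 100 * known_as_novel / n_known if n_known else 0.0,
        "ErrRate": 100 * errors / total if total else 0.0,
    }


y_true = [0, 1, 0, 1, 2, 3, 2, 3]
y_pred = [0, 1, 1, 5, 5, 0, 6, 1]
print(novelty_metrics(y_true, y_pred, known_classes=[0, 1]))
# {'MNew': 50.0, 'FNew': 25.0, 'ErrRate': 50.0}
```

In this toy example, two of the four novel instances are assigned to known classes (MNew of 50%), one of the four known instances is flagged as novel (FNew of 25%), and four of the eight predictions are wrong overall (ErrRate of 50%).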
ECSMiner-WF
Let's train our model on the offline (known) data.
from streamndr.model import ECSMinerWF
clf = ECSMinerWF(K=50, min_examples_cluster=10, verbose=1, random_state=42, ensemble_size=7, init_algorithm="kmeans")
clf.learn_many(np.array(X_train), np.array(y_train))
Once again, let's use our model in an online fashion.
conf_matrix = ConfusionMatrixNovelty(known_classes)
m_new = MNew(known_classes)
f_new = FNew(known_classes)
err_rate = ErrRate(known_classes)
for x, y_true in zip(X_test, y_test):
    y_pred = clf.predict_one(x)  # predict_one takes python dictionaries as per River API

    if y_pred is not None:  # Update our metrics
        conf_matrix.update(y_true, y_pred[0])
        m_new.update(y_true, y_pred[0])
        f_new.update(y_true, y_pred[0])
        err_rate.update(y_true, y_pred[0])

#print(conf_matrix) #Shows the confusion matrix of the given problem, can be very wide due to one class being detected as multiple Novelty Patterns
print(m_new) #Percentage of novel class instances misclassified as known.
print(f_new) #Percentage of known classes misclassified as novel.
print(err_rate) #Total misclassification error percentage
MNew: 60.93%
FNew: 26.78%
ErrRate: 39.40%
ECHO
Let's train our ECHO model on the offline (known) data. Note that ECHO requires the true label during the online phase.
from streamndr.model import Echo
clf = Echo(K=50, min_examples_cluster=10, verbose=1, random_state=42, ensemble_size=7, W=500, tau=0.9, init_algorithm="kmeans")
clf.learn_many(np.array(X_train), np.array(y_train))
Once again, let's use our model in an online fashion.
conf_matrix = ConfusionMatrixNovelty(known_classes)
m_new = MNew(known_classes)
f_new = FNew(known_classes)
err_rate = ErrRate(known_classes)
for x, y_true in zip(X_test, y_test):
    y_pred = clf.predict_one(x, y_true)  # predict_one takes a python dictionary and the true label

    if y_pred is not None:  # Update our metrics
        conf_matrix.update(y_true, y_pred[0])
        m_new.update(y_true, y_pred[0])
        f_new.update(y_true, y_pred[0])
        err_rate.update(y_true, y_pred[0])

#print(conf_matrix) #Shows the confusion matrix of the given problem, can be very wide due to one class being detected as multiple Novelty Patterns
print(m_new) #Percentage of novel class instances misclassified as known.
print(f_new) #Percentage of known classes misclassified as novel.
print(err_rate) #Total misclassification error percentage
MNew: 24.20%
FNew: 16.16%
ErrRate: 22.74%
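The online evaluation loop is the same for all three models, except that ECHO also receives the true label. It could be factored into one helper, as in the sketch below. The names `evaluate_stream`, `pass_label`, `StubClassifier` and `CountingMetric` are hypothetical and not part of StreamNDR; the stubs exist only so that the sketch runs without the library installed. The helper relies solely on the `predict_one` and `update` calls used in the examples above.

```python
def evaluate_stream(clf, X, y, metrics, pass_label=False, progress_every=0):
    """Run clf over the stream and update every metric; return the number of scored examples."""
    n_scored = 0
    for i, (x, y_true) in enumerate(zip(X, y), start=1):
        # Models such as ECHO also take the true label during the online phase
        y_pred = clf.predict_one(x, y_true) if pass_label else clf.predict_one(x)

        if y_pred is not None:
            for metric in metrics:
                metric.update(y_true, y_pred[0])
            n_scored += 1

        if progress_every and i % progress_every == 0:
            print(f"{i}/{len(X)}")
    return n_scored


class StubClassifier:
    """Hypothetical stand-in model: echoes x["label"], or returns None when x["skip"] is set."""

    def predict_one(self, x, y=None):
        if x.get("skip"):
            return None
        return [x["label"]]


class CountingMetric:
    """Hypothetical stand-in metric exposing the same update(y_true, y_pred) call."""

    def __init__(self):
        self.n = 0
        self.correct = 0

    def update(self, y_true, y_pred):
        self.n += 1
        self.correct += int(y_true == y_pred)


X_demo = [{"label": 0}, {"label": 1, "skip": True}, {"label": 2}]
y_demo = [0, 1, 3]
metric = CountingMetric()
scored = evaluate_stream(StubClassifier(), X_demo, y_demo, [metric])
print(scored, metric.n, metric.correct)
```

With the real models, a call along the lines of `evaluate_stream(clf, X_test, y_test, [conf_matrix, m_new, f_new, err_rate], pass_label=True)` would correspond to the ECHO loop above, and `pass_label=False` to the MINAS and ECSMiner-WF loops.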
Special Thanks
Special thanks go to Vítor Bernardes, whose implementation served as the basis for some of the MINAS code.
💬 References
[1] de Faria, E.R., Ponce de Leon Ferreira Carvalho, A.C. & Gama, J. MINAS: multiclass learning algorithm for novelty detection in data streams. Data Min Knowl Disc 30, 640–680 (2016). https://doi.org/10.1007/s10618-015-0433-y
[2] M. Masud, J. Gao, L. Khan, J. Han and B. M. Thuraisingham, "Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints," in IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 859-874, June 2011, doi: 10.1109/TKDE.2010.61.
[3] A. Haque, L. Khan, M. Baron, B. Thuraisingham and C. Aggarwal, "Efficient handling of concept drift and concept evolution over stream data," 2016 IEEE 32nd International Conference on Data Engineering (ICDE), Helsinki, Finland, 2016, pp. 481-492, doi: 10.1109/ICDE.2016.7498264.
Download files
File details
Details for the file streamndr-0.2.0.tar.gz.
File metadata
- Download URL: streamndr-0.2.0.tar.gz
- Upload date:
- Size: 32.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 069875d375b3fe22cc043dac85be0f20a328447cb26f23294a5365ea51ece984 |
| MD5 | e8ff349dd11de1ec5a20ca8b10c82b85 |
| BLAKE2b-256 | f0726963cf2d54446f2a4f544a428beb881baa523427f30606219c5a5496afad |
File details
Details for the file streamndr-0.2.0-py3-none-any.whl.
File metadata
- Download URL: streamndr-0.2.0-py3-none-any.whl
- Upload date:
- Size: 41.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 28a4d6ffa951c1ec10ac224c1c7982af2cf82946eecdf9943d7717348d7f7ec3 |
| MD5 | 19af8ee040dd73b989077c9835ebbf42 |
| BLAKE2b-256 | 7aec1972e074c790bb0e8b2fe6c10c8a1ea975ecb2705957bb5b9aff560f467e |