
stream-learn


The stream-learn module is a set of tools necessary for processing data streams using scikit-learn estimators. The batch processing approach is used here, where the dataset is passed to the classifier in smaller, consecutive subsets called chunks. The module consists of five sub-modules:

  • streams - containing a data stream generator that can produce both stationary and dynamic distributions under various types of concept drift (including drift of the a priori probabilities, i.e. dynamically imbalanced data), as well as a parser for the standard ARFF file format.
  • evaluators - containing classes for running experiments on stream data according to the Test-Then-Train and Prequential methodologies.
  • classifiers - containing sample stream classifiers.
  • ensembles - containing standard ensemble models for stream data classification.
  • utils - containing typical classification quality metrics for data streams.

You can read more about each module on the documentation page.

Quick start guide

Installation

To use the stream-learn package, it will be absolutely useful to install it. Fortunately, it is available in the PyPI repository, so you may install it using pip:

pip install -U stream-learn

You can also install the module from a clone of the GitHub repository using the setup.py file, if you have a strange but perhaps legitimate need:

git clone https://github.com/w4k2/stream-learn.git
cd stream-learn
make install

Preparing experiments

1. Classifier

In order to conduct experiments, a declaration of four elements is necessary. The first is the estimator, which must be compatible with the scikit-learn API and, in addition, implement the partial_fit() method, allowing you to re-fit the already built model. For example, we'll use the standard Gaussian Naive Bayes algorithm:

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
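What "implement the partial_fit() method" means in practice can be sketched as follows: the model is first fitted on one chunk and then re-fitted on the next one without forgetting what it has already learned. The chunks below are random toy data made up for illustration, not output of the stream-learn generator:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()

# Two toy chunks of the same binary problem (illustrative data only).
rng = np.random.RandomState(0)
X1, y1 = rng.randn(20, 4), rng.randint(0, 2, 20)
X2, y2 = rng.randn(20, 4), rng.randint(0, 2, 20)

# The first call must declare all classes; later calls refine the same model.
clf.partial_fit(X1, y1, classes=[0, 1])
clf.partial_fit(X2, y2)

print(clf.predict(X2[:5]))
```

Any scikit-learn-compatible estimator exposing this incremental interface can be used in place of GaussianNB.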

2. Data Stream

The next element is the data stream that we aim to process. In this example we will use a synthetic stream consisting of a shocking number of 30 chunks and containing precisely one concept drift. We will prepare it using the StreamGenerator() class of the stream-learn module:

from strlearn.streams import StreamGenerator
stream = StreamGenerator(n_chunks=30, n_drifts=1)
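To give an intuition for what chunk-based generation amounts to, here is a minimal stand-in written with plain scikit-learn: a synthetic dataset sliced into consecutive chunks. This is a conceptual sketch only, without concept drift, and not the library's actual implementation; the function name toy_chunks and the chunk size are made up for the example:

```python
from sklearn.datasets import make_classification

def toy_chunks(n_chunks=30, chunk_size=250, random_state=0):
    """Yield (X, y) chunks drawn from one synthetic distribution --
    a rough, drift-free stand-in for what a stream generator produces."""
    X, y = make_classification(
        n_samples=n_chunks * chunk_size, random_state=random_state
    )
    for i in range(n_chunks):
        sl = slice(i * chunk_size, (i + 1) * chunk_size)
        yield X[sl], y[sl]

chunks = list(toy_chunks())
print(len(chunks), chunks[0][0].shape)
```

StreamGenerator additionally controls drift: with n_drifts=1, the class distributions change gradually over the course of the 30 chunks.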

3. Metrics

The third requirement of the experiment is to specify the metrics used in the evaluation of the methods. In the example, we will use the accuracy metric available in scikit-learn and the balanced accuracy from the stream-learn module:

from sklearn.metrics import accuracy_score
from strlearn.utils.metrics import bac
metrics = [accuracy_score, bac]
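Each entry of the metrics list is simply a callable with the (y_true, y_pred) signature, which is the convention the evaluator relies on. The snippet below demonstrates this with scikit-learn metrics only, using sklearn's balanced_accuracy_score as a stand-in for strlearn's bac:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])

# Every metric handed to the evaluator is just a callable
# following the (y_true, y_pred) convention.
for metric in (accuracy_score, balanced_accuracy_score):
    print(metric.__name__, metric(y_true, y_pred))
```

Any custom function following the same convention can be appended to the list.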

4. Evaluator

The last necessary element of processing is the evaluator, i.e. the method of conducting the experiment. For example, we will choose the Test-Then-Train paradigm, described in more detail in the User Guide. It is important to note that we need to provide the metrics at the point of initializing the evaluator. If no metrics are given, it will use the default pair of accuracy and balanced accuracy scores:

from strlearn.evaluators import TestThenTrain
evaluator = TestThenTrain(metrics)
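Conceptually, Test-Then-Train means that each incoming chunk is first used to test the current model and only afterwards to train it. The following simplified sketch of that loop uses plain scikit-learn and made-up chunked data; it is not the evaluator's actual code:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Illustrative data: 30 chunks of 100 samples each.
X, y = make_classification(n_samples=30 * 100, random_state=0)
chunks = [(X[i * 100:(i + 1) * 100], y[i * 100:(i + 1) * 100])
          for i in range(30)]

clf = GaussianNB()
scores = []
for i, (X_c, y_c) in enumerate(chunks):
    if i > 0:  # every chunk except the first is tested first...
        scores.append(accuracy_score(y_c, clf.predict(X_c)))
    clf.partial_fit(X_c, y_c, classes=[0, 1])  # ...and then trained on

print(len(scores))  # 29 evaluation points for 30 chunks
```

The first chunk can only be trained on, never tested, which is why a 30-chunk stream yields 29 evaluation points.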

Processing and understanding results

Once all processing requirements have been met, we can proceed with the evaluation. To start processing, call the evaluator's process method, feeding it with the stream and the classifier:

evaluator.process(stream, clf)

The results obtained are stored in the scores attribute of the evaluator. Printing it shows that it is a three-dimensional numpy array with dimensions (1, 29, 2).

  • The first dimension is the index of a classifier submitted for processing. In the example above, we used only one model, but it is also possible to pass a tuple or list of classifiers that will be processed in parallel (See User Guide).
  • The second dimension specifies the instance of evaluation, which in the case of Test-Then-Train methodology directly means the index of the processed chunk.
  • The third dimension indicates the metric used in the processing.
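The three dimensions can be illustrated on a stand-in array of the same shape, filled with random values here because the real scores come from the evaluator:

```python
import numpy as np

# A stand-in for evaluator.scores: 1 classifier, 29 chunks, 2 metrics.
scores = np.random.RandomState(0).uniform(0.5, 1.0, size=(1, 29, 2))

acc_over_time = scores[0, :, 0]        # first classifier, first metric per chunk
bac_over_time = scores[0, :, 1]        # first classifier, second metric per chunk
mean_per_metric = scores.mean(axis=1)  # each metric averaged over all chunks

print(acc_over_time.shape, mean_per_metric.shape)
```

Averaging over axis 1 is a convenient way to summarize each classifier-metric pair across the whole stream.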

Using this knowledge, we may finally try to illustrate the results of our simple experiment in the form of a plot:

import matplotlib.pyplot as plt

plt.figure(figsize=(6,3))

for m, metric in enumerate(metrics):
    plt.plot(evaluator.scores[0, :, m], label=metric.__name__)

plt.title("Basic example of stream processing")
plt.ylim(0, 1)
plt.ylabel('Quality')
plt.xlabel('Chunk')

plt.legend()
plt.show()
