Skip to main content

Our package scikit-activeml is a Python library for active learning on top of SciPy and scikit-learn.

Project description


https://raw.githubusercontent.com/scikit-activeml/scikit-activeml/master/docs/logos/scikit-activeml-logo.png

scikit-activeml: A Library and Toolbox for Active Learning Algorithms

Doc Codecov PythonVersion PyPi Black Downloads Paper

Machine learning models often need large amounts of training data to perform well. Whereas unlabeled data can be easily gathered, the labeling process is difficult, time-consuming, or expensive in most applications. Active learning can help solve this problem by querying labels for those data samples improving the performance the most. Thereby, the goal is that the learning algorithm performs sufficiently well with fewer labels. With this goal in mind, scikit-activeml has been developed as a Python library for active learning on top of scikit-learn.

User Installation

The easiest way of installing scikit-activeml is using pip:

pip install -U scikit-activeml

The installation via pip defines only minimum requirements to avoid potential package downgrades withing your installation. If you encounter any incompatibility issues, you can downgrade packages by installing the maximum requirements, we tested at the release of the current package version:

pip install -r requirements_max.txt

Examples

We provide a broad overview of different use-cases in our tutorial section offering

Two simple code snippets illustrating the straightforwardness of implementing active learning cycles with our Python package skactiveml are given in the following.

Pool-based Active Learning

The following code snippet implements an active learning cycle with 20 iterations using a Gaussian process classifier and uncertainty sampling. To use other classifiers, you can simply wrap classifiers from sklearn or use classifiers provided by skactiveml. Note that the main difficulty using active learning with sklearn is the ability to handle unlabeled data, which we denote as a specific value (MISSING_LABEL) in the label vector y. More query strategies can be found in the documentation.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.datasets import make_blobs
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import unlabeled_indices, MISSING_LABEL
from skactiveml.classifier import SklearnClassifier

# Generate data set.
X, y_true = make_blobs(n_samples=200, centers=4, random_state=0)
y = np.full(shape=y_true.shape, fill_value=MISSING_LABEL)

# Use the first 10 instances as initial training data.
y[:10] = y_true[:10]

# Create classifier and query strategy.
clf = SklearnClassifier(
    GaussianProcessClassifier(random_state=0),
    classes=np.unique(y_true),
    random_state=0
)
qs = UncertaintySampling(method='entropy')

# Execute active learning cycle.
n_cycles = 20
for c in range(n_cycles):
    query_idx = qs.query(X=X, y=y, clf=clf)
    y[query_idx] = y_true[query_idx]

# Fit final classifier.
clf.fit(X, y)

As a result, we obtain an actively trained Gaussian process classifier. A corresponding visualization of its decision boundary (black line) and the sample utilities (greenish contours) is given below.

https://raw.githubusercontent.com/scikit-activeml/scikit-activeml/master/docs/logos/pal-example-output.png

Stream-based Active Learning

The following code snippet implements an active learning cycle with 200 data points and the default budget of 10% using a pwc classifier and split uncertainty sampling. Like in the pool-based example you can wrap other classifiers from sklearn, sklearn compatible classifiers or like the example classifiers provided by skactiveml.

import numpy as np
from sklearn.datasets import make_blobs
from skactiveml.classifier import ParzenWindowClassifier
from skactiveml.stream import Split
from skactiveml.utils import MISSING_LABEL

# Generate data set.
X, y_true = make_blobs(n_samples=200, centers=4, random_state=0)

# Create classifier and query strategy.
clf = ParzenWindowClassifier(random_state=0, classes=np.unique(y_true))
qs = Split(random_state=0)

# Initializing the training data as an empty array.
X_train = []
y_train = []

# Initialize the list that stores the result of the classifier's prediction.
correct_classifications = []

# Execute active learning cycle.
for x_t, y_t in zip(X, y_true):
    X_cand = x_t.reshape([1, -1])
    y_cand = y_t
    clf.fit(X_train, y_train)
    correct_classifications.append(clf.predict(X_cand)[0] == y_cand)
    sampled_indices = qs.query(candidates=X_cand, clf=clf)
    qs.update(candidates=X_cand, queried_indices=sampled_indices)
    X_train.append(x_t)
    y_train.append(y_cand if len(sampled_indices) > 0 else MISSING_LABEL)

As a result, we obtain an actively trained Parzen window classifier. A corresponding visualization of its accuracy curve accross the active learning cycle is given below.

https://raw.githubusercontent.com/scikit-activeml/scikit-activeml/master/docs/logos/stream-example-output.png

Query Strategy Overview

For better orientation, we provide an overview (incl. paper references and visual examples) of the query strategies implemented by skactiveml.

Overview Visualization

Citing

If you use skactiveml in one of your research projects and find it helpful, please cite the following:

@article{skactiveml2021,
    title={scikit-activeml: {A} {L}ibrary and {T}oolbox for {A}ctive {L}earning {A}lgorithms},
    author={Daniel Kottke and Marek Herde and Tuan Pham Minh and Alexander Benz and Pascal Mergard and Atal Roghman and Christoph Sandrock and Bernhard Sick},
    journal={Preprints},
    doi={10.20944/preprints202103.0194.v1},
    year={2021},
    url={https://github.com/scikit-activeml/scikit-activeml}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scikit-activeml-0.5.2.tar.gz (212.8 kB view details)

Uploaded Source

Built Distribution

scikit_activeml-0.5.2-py3-none-any.whl (304.0 kB view details)

Uploaded Python 3

File details

Details for the file scikit-activeml-0.5.2.tar.gz.

File metadata

  • Download URL: scikit-activeml-0.5.2.tar.gz
  • Upload date:
  • Size: 212.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for scikit-activeml-0.5.2.tar.gz
Algorithm Hash digest
SHA256 62f82f6bef22d7873aeb264b22be4c2a0040ec923fd25ad43e0ef2a5e78eaa61
MD5 23525ab01534fad7182fa46a4add54ba
BLAKE2b-256 1efa707f9b62991d3abba5098aa14e501696a3f5f426745bedadc0c230ef1886

See more details on using hashes here.

File details

Details for the file scikit_activeml-0.5.2-py3-none-any.whl.

File metadata

File hashes

Hashes for scikit_activeml-0.5.2-py3-none-any.whl
Algorithm Hash digest
SHA256 bee568e3627e8b82c0a6059b72e8b6a6cfd6ab75375bf4d0908f3b05bed82fbb
MD5 cd22836f12a5442aef8589aae32b69c9
BLAKE2b-256 7f73b911fa38b509aaff15fa8670226404a77488d1e58a0e436e894a02351f0f

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page