Interactive machine learning supervision.
Superintendent
This package provides an ipywidget-based interactive labelling tool for your data.
Installation
pip install superintendent
To label faster with keyboard shortcuts, you also need to enable the ipyevents Jupyter extension:
jupyter nbextension enable --py --sys-prefix ipyevents
Side note:
For these examples, you will also need to install the following packages:
pip install requests bs4 wordcloud
These are not required for superintendent itself, so they won't be installed when you install superintendent.
Use case 1: Labelling individual data points
Let's assume we have a text dataset that contains some labelled sentences and some unlabelled sentences. For example, we could get the headlines from a few UK news websites (the code for this comes from the GitHub project compare-headlines by isobelweinberg):
import requests
from bs4 import BeautifulSoup
import datetime
headlines = []
labels = []
r = requests.get('https://www.theguardian.com/uk').text #get html
soup = BeautifulSoup(r, 'html5lib') #run html through beautiful soup
headlines += [headline.text for headline in
              soup.find_all('span', class_='js-headline-text')][:10]
labels += ['guardian'] * (len(headlines) - len(labels))
soup = BeautifulSoup(requests.get('http://www.dailymail.co.uk/home/index.html').text, 'html5lib')
headlines += [headline.text.replace('\n', '').replace('\xa0', '').strip()
              for headline in soup.find_all(class_="linkro-darkred")][:10]
labels += ['daily mail'] * (len(headlines) - len(labels))
Now let's assume that, instead of the source of each article, we actually want
to know how professional each headline is. But we don't have labels for that!
We can use superintendent to start creating some. To make it easy on the eyes,
we'll also use a custom display function that renders the text as HTML.
from superintendent.semisupervisor import SemiSupervisor
import pandas as pd
from IPython import display
labelling_widget = SemiSupervisor(
    headlines, labels=[None] * len(headlines),
    display_func=lambda txt, n_samples: display.display(display.HTML(txt[0])))
labelling_widget.annotate(options=['professional', 'not professional'])
labelling_widget.new_labels
Use case 2: Labelling clusters
Another common task is labelling clusters of points. Let's say, for example, that we've k-means-clustered the above data and assigned one of
from superintendent.clustersupervisor import ClusterSupervisor
import numpy as np
labelling_widget = ClusterSupervisor(headlines, np.random.choice([1, 2, 3], size=len(headlines)))
labelling_widget.annotate(chunk_size=30)
Again, we can get the labels from the object itself:
labelling_widget.new_labels
We can also get the cluster index -> cluster labels mapping.
labelling_widget.new_clusters
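To illustrate how a cluster index -> cluster label mapping relates to per-point labels, here is a minimal sketch in plain Python. It does not use superintendent; the arrays and the mapping are hypothetical stand-ins for what the widget would hold after annotation.

```python
import numpy as np

# Hypothetical cluster assignment for six data points (not produced by superintendent).
cluster_indices = np.array([1, 3, 2, 1, 2, 3])

# Hypothetical mapping from cluster index to the label chosen for that cluster.
cluster_to_label = {1: "professional", 2: "not professional", 3: "professional"}

# Every point inherits the label of the cluster it belongs to.
point_labels = [cluster_to_label[c] for c in cluster_indices.tolist()]
print(point_labels)
```

Labelling a handful of clusters this way assigns a label to every point at once, which is why cluster labelling can be much faster than labelling points one by one.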
Often, when we label text clusters, we don't want to read every text individually but would rather look at a word cloud. We can do this by passing a word-cloud-generating function to our labeller. We'll use one from the word_cloud package, with a small wrapper around it to actually display the result:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import IPython.display
def show_wordcloud(text, n_samples=None):
    text = ' '.join(text.ravel())
    IPython.display.display(
        WordCloud().generate(text).to_image()
    )
labelling_widget = ClusterSupervisor(
    headlines, np.random.choice([1, 2, 3], size=len(headlines)),
    display_func=show_wordcloud
)
Because we want the wordcloud to be drawn for the entire data set, we need to
modify the chunk_size argument for our annotate call:
labelling_widget.annotate(chunk_size=np.inf)
labelling_widget.new_labels
Use case 3: labelling images
For labelling images, there is a special factory method that sets the right display functions.
from sklearn.datasets import load_digits
from superintendent.semisupervisor import SemiSupervisor
import numpy as np
import matplotlib.pyplot as plt
digits = load_digits().data
widget = SemiSupervisor.from_images(digits[:10, :])
widget.annotate(options=list(range(10)))
Use case 4: labelling clusters of images
The same can be done for clustered images:
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from matplotlib import pyplot as plt
import numpy as np
from superintendent.clustersupervisor import ClusterSupervisor
digits = load_digits()
embedding = TSNE(
    metric='correlation'
).fit_transform(digits.data)
# n_jobs is omitted here: KMeans no longer accepts it in scikit-learn 1.0 and later
clusters = KMeans(n_clusters=10).fit_predict(embedding)
cluster_labeller = ClusterSupervisor.from_images(digits.data, clusters)
cluster_labeller.annotate(chunk_size=36)
Once you've done that, you can check how well our clustering worked!
(digits.target == cluster_labeller.cluster_labels).mean()
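The comparison above yields the fraction of samples whose cluster-derived label matches the true label. As a rough sketch of the same arithmetic on toy data (the arrays below are hypothetical and do not come from the widget), we can also break the agreement down per class:

```python
import numpy as np

# Hypothetical ground-truth labels and per-sample labels derived from cluster annotation.
true_labels = np.array([0, 0, 1, 1, 2, 2])
cluster_labels = np.array([0, 0, 1, 1, 1, 2])

# Element-wise comparison gives a boolean array; its mean is the fraction of matches.
matches = true_labels == cluster_labels
overall = matches.mean()

# Agreement restricted to the samples of each true class.
per_class = {int(c): float(matches[true_labels == c].mean())
             for c in np.unique(true_labels)}
print(overall, per_class)
```

A per-class breakdown like this can show which classes the clustering mixed up, something the single overall number hides.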
Use case 5: Active learning
Often, we have a rough idea of an algorithm that might do well on a given task, even if we don't have any labels at all. For example, I know that for a simple image set like MNIST, logistic regression actually does surprisingly well.
In this case, we want to do two things:
- We want to keep track of our algorithm's performance
- We want to leverage our algorithm's predictions to decide what data point to label.
Both of these things can be done with superintendent. For point one, all we need to do is pass an object that conforms to the fit / predict syntax of sklearn as the classifier keyword argument.
For the second point, we can choose any function that takes in probabilities of labels (in shape n_samples, n_classes), sorts them, and returns the sorted integer index from most in need of labelling to least in need of labelling. Superintendent provides some functions, described in the superintendent.prioritisation submodule, that can achieve this. One of these is the entropy function, which calculates the entropy of predicted probabilities and prioritises high-entropy samples.
As an example:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from superintendent import SemiSupervisor
digits = load_digits()
data_labeller = SemiSupervisor.from_images(
    digits.data[:500, :],
    classifier=LogisticRegression(),
    reorder='entropy'
)
data_labeller.annotate(options=range(10))
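To make the entropy-based prioritisation described above more concrete, here is a minimal NumPy sketch of such an ordering function. It is not superintendent's own implementation; the name entropy_order and the example probabilities are assumptions made for illustration.

```python
import numpy as np


def entropy_order(probabilities):
    """Return sample indices ordered from highest to lowest predictive entropy.

    Hypothetical helper for illustration; expects shape (n_samples, n_classes).
    """
    probabilities = np.asarray(probabilities, dtype=float)
    # Clip before taking the log so that zero probabilities do not produce -inf.
    clipped = np.clip(probabilities, 1e-12, 1.0)
    entropy = -(probabilities * np.log(clipped)).sum(axis=1)
    # Sorting the negated entropy in ascending order puts the most uncertain samples first.
    return np.argsort(-entropy, kind="stable")


# Made-up predicted probabilities for three samples and three classes.
probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # nearly uniform, high entropy
    [0.70, 0.20, 0.10],  # somewhere in between
])
order = entropy_order(probs)
print(order)
```

The nearly uniform sample comes first because the classifier is least sure about it, so a human label there is likely to be the most informative.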