Skip to main content

Interactive machine learning supervision.

Project description

superintendent

Build Status

want to semi-supervise your machine learning?

This package is designed to provide a ipywidget-based interactive labelling tool for your data.

Installation

pip install superintendent

If you want to also use the keyboard shortcuts for labelling faster, you will also have to enable the ipyevents jupyter extension:

jupyter nbextension enable --py --sys-prefix ipyevents

Use case 1: Labelling individual data points

Let's assume we have a text dataset that contains some labelled sentences and some unlabelled sentences. For example, we could get the headlines for a bunch of UK news websites (the code for this comes from the amazing github project compare-headlines by isobelweinberg):

import requests
from bs4 import BeautifulSoup
import datetime

headlines = []
labels = []

r = requests.get('https://www.theguardian.com/uk').text #get html
soup = BeautifulSoup(r, 'html5lib') #run html through beautiful soup
headlines += [headline.text for headline in
              soup.find_all('span', class_='js-headline-text')][:10]
labels += ['guardian'] * (len(headlines) - len(labels))

soup = BeautifulSoup(requests.get('http://www.dailymail.co.uk/home/index.html').text, 'html5lib')
headlines += [headline.text.replace('\n', '').replace('\xa0', '').strip()
              for headline in soup.find_all(class_="linkro-darkred")][:10]
labels += ['daily mail'] * (len(headlines) - len(labels))

Now let's assume that instead of wanting to know about the source of the article, we actually want to know about how professional the headline is. But we don't have labels for the two! We can use superintendent to start creating some. To make sure it's nice and easy on the eyes, we'll also use a custom display function to make the text readable.

from superintendent.semisupervisor import SemiSupervisor
import pandas as pd
from IPython import display

labelling_widget = SemiSupervisor(headlines, labels = [None] * len(headlines),
                                  display_func=lambda txt, n_samples: display.display(display.HTML(txt[0])))
labelling_widget.annotate(options=['professional', 'not professional'])
labelling_widget.new_labels

Use case 2: Labelling clusters

Another common task is labelling clusters of points. Let's say, for example, that we've k-means-clustered the above data and assigned one of

from superintendent.clustersupervisor import ClusterSupervisor
import numpy as np
labelling_widget = ClusterSupervisor(headlines, np.random.choice([1, 2, 3], size=len(headlines)))
labelling_widget.annotate(chunk_size=30)

Again, we can get the labels from the object itself:

labelling_widget.new_labels

We can also get the cluster index -> cluster labels mapping.

labelling_widget.new_clusters

Now, often when we label text clusters, we probably want to not look at all the text individually, but instead want to look at a wordcloud. We can do this by passing a word-cloud generating function to our labeller. We'll use one from the word_cloud package. We'll need to write a little wrapper around it to actually display it:

import matplotlib.pyplot as plt
from wordcloud import WordCloud
import IPython.display

def show_wordcloud(text, n_samples=None):
    text = ' '.join(text.ravel())
    IPython.display.display(
        WordCloud().generate(text).to_image()
    )
labelling_widget = ClusterSupervisor(
    headlines, np.random.choice([1, 2, 3], size=len(headlines)),
    display_func = show_wordcloud
)

Because we want the wordcloud to be drawn for the entire data set, we need to modify the chunk_size argument for our annotate call:

labelling_widget.annotate(chunk_size=np.inf, )
labelling_widget.new_labels

Use case 3: labelling images

For labelling images, there is a special factory method that sets the right display functions.

from sklearn.datasets import load_digits
from superintendent.semisupervisor import SemiSupervisor
import numpy as np
import matplotlib.pyplot as plt

digits = load_digits().data
widget = SemiSupervisor.from_images(digits[:10, :])
widget.annotate(options=list(range(10)))

Use case 3: labelling images

The same can be done for clustered images:

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from matplotlib import pyplot as plt
import numpy as np
from superintendent.clustersupervisor import ClusterSupervisor

digits = load_digits()
embedding = TSNE(
    metric='correlation'
).fit_transform(digits.data)
clusters = KMeans(n_clusters=10, n_jobs=-1).fit_predict(embedding)
cluster_labeller = ClusterSupervisor.from_images(digits.data, clusters)
cluster_labeller.annotate(chunk_size=36)

Once you've done that, you can check how our clustering worked!

(digits.target == cluster_labeller.cluster_labels).mean()

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

superintendent-0.2.1.tar.gz (10.5 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page