Skip to main content

Practical Machine Learning for NLP

Project description

Thinc is the machine learning library powering spaCy. It features a battle-tested linear model designed for large sparse learning problems, and a flexible neural network model under development for spaCy v2.0.

Thinc is a practical toolkit for implementing models that follow the “Embed, encode, attend, predict” architecture. It’s designed to be easy to install, efficient for CPU usage and optimised for NLP and deep learning with text – in particular, hierarchically structured input and variable-length sequences.

🔮 Version 6.0 out now! Read the release notes here.

Build Status Current Release Version pypi Version spaCy / thinc on Gitter

Quickstart

If you have Fabric installed, you can use the shortcut:

git clone https://github.com/explosion/thinc
cd thinc
fab clean env make test

You can then run the examples as follows:

fab eg.mnist
fab eg.basic_tagger

Otherwise, you can build and test explicitly with:

git clone https://github.com/explosion/thinc
cd thinc

virtualenv .env
source .env/bin/activate

pip install -r requirements.txt
python setup.py build_ext --inplace
py.test thinc/

And then run the examples as follows:

python examples/mnist.py
python examples/basic_tagger.py

Design

Thinc is implemented in pure Python at the moment, using Chainer’s cupy for GPU and numpy for CPU computations. Thinc doesn’t use autodifferentiation. Instead, we just use callbacks.

Let’s say you have a batch of data, of shape (B, I). You want to use this to update a model. To do that, you need to compute the model’s output for that input, and also the gradient with respect to that output. Like so:

x__BO, finish_update = model.begin_update(x__BI)
dx__BO = compute_gradient(dx__BO, y__B)
dx__BI = finish_update(dx__BO)

To backprop through multiple layers, we simply accumulate the callbacks:

class Chain(list):
    def predict(self, X):
        for layer in self:
            X = layer(X)
        return X

    def begin_update(self, X, dropout=0.0):
        callbacks = []
        for layer in self.layers:
            X, callback = layer.begin_update(X, dropout=dropout)
        callbacks.append(callback)

        def finish_update(gradient, optimizer):
            for backprop in reversed(callbacks):
                gradient = backprop(gradient, optimizer)
            return gradient
        return X, finish_update

The differentiation rules are pretty easy to work with, so long as every layer is a good citizen.

Adding layers

To add layers, you usually implement a subclass of base.Model or base.Network. Use Network for layers which don’t own weights data directly, but instead, chain together a sequence of models.

class ReLuMLP(Network):
    Hidden = ReLu
    Output = Softmax
    width = 128
    depth = 3

    def setup(self, nr_out, nr_in, **kwargs):
        for i in range(self.depth):
            self.layers.append(self.Hidden(nr_out=self.width, nr_in=nr_in,
                name='hidden-%d' % i))
            nr_in = self.width
        self.layers.append(self.Output(nr_out=nr_out, nr_in=nr_in))
        self.set_weights(initialize=True)
        self.set_gradient()

When you implement a layer, there are two simple rules to follow to make sure it’s well-behaved:

  1. Don’t add side-effects to begin_update. Aside from the obvious concurrency problems, it’s not nice to make the API silently produce incorrect results if the user calls the functions out of order.

  2. Keep the interfaces to begin_update and finish_update uniform. We want to write generic functions to sum, concatenate, average, etc different layers. If your layer has a special interface, those generic functions won’t work.

Project details


Release history Release notifications | RSS feed

This version

6.1.1

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

thinc-6.1.1.tar.gz (729.9 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page