Thinc: Practical Machine Learning for NLP in Python

Thinc is the machine learning library powering spaCy. It features a battle-tested linear model designed for large sparse learning problems, and a flexible neural network model under development for spaCy v2.0.

Thinc is a practical toolkit for implementing models that follow the "Embed, encode, attend, predict" architecture. It's designed to be easy to install, efficient for CPU usage and optimised for NLP and deep learning with text – in particular, hierarchically structured input and variable-length sequences.

🔮 Read the release notes here.

What's where (as of v7.0.0)

Module                Description
thinc.v2v.Model       Base class.
thinc.v2v             Layers transforming vectors to vectors.
thinc.i2v             Layers embedding IDs to vectors.
thinc.t2v             Layers pooling tensors to vectors.
thinc.t2t             Layers transforming tensors to tensors (e.g. CNN, LSTM).
thinc.api             Higher-order functions, for building networks. Will be renamed.
thinc.extra           Datasets and utilities.
thinc.neural.ops      Container classes for mathematical operations. Will be reorganized.
thinc.linear.avgtron  Legacy efficient Averaged Perceptron implementation.

Development status

Thinc's deep learning functionality is still under active development: APIs are unstable, and we're not yet ready to provide usage support. However, if you're already quite familiar with neural networks, there's a lot here you might find interesting. Thinc's conceptual model is quite different from TensorFlow's. Thinc also implements some novel features, such as a small DSL for concisely wiring up models, embedding tables that support pre-computation and the hashing trick, dynamic batch sizes, a concatenation-based approach to variable-length sequences, and support for model averaging for the Adam solver (which performs very well).

No computational graph – just higher order functions

The central problem for a neural network implementation is this: during the forward pass, you compute results that will later be useful during the backward pass. How do you keep track of this arbitrary state, while making sure that layers can be cleanly composed?

Most libraries solve this problem by having you declare the forward computations, which are then compiled into a graph somewhere behind the scenes. Thinc doesn't have a "computational graph". Instead, we just use the stack, because we put the state from the forward pass into callbacks.

All nodes in the network have a simple signature:

f(inputs) -> {outputs, f(d_outputs)->d_inputs}

To make this less abstract, here's a ReLU activation, following this signature:

def relu(inputs):
    mask = inputs > 0
    def backprop_relu(d_outputs, optimizer):
        return d_outputs * mask
    return inputs * mask, backprop_relu

When you call the relu function, you get back an output variable, and a callback. This lets you calculate a gradient using the output, and then pass it into the callback to perform the backward pass.
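As a quick illustration of that call pattern, here is a minimal sketch that runs the relu function above on a small array. The input values and the all-ones gradient are made up for the example, and None is passed as the optimizer because this layer has no weights to update.

```python
import numpy

def relu(inputs):
    mask = inputs > 0
    def backprop_relu(d_outputs, optimizer):
        # The mask computed in the forward pass is captured by the closure.
        return d_outputs * mask
    return inputs * mask, backprop_relu

inputs = numpy.array([[-1.0, 2.0], [3.0, -4.0]])
outputs, backprop = relu(inputs)

# Pretend the gradient of the loss with respect to the outputs is all ones.
d_outputs = numpy.ones_like(inputs)
d_inputs = backprop(d_outputs, None)

# d_inputs is 1 where the input was positive and 0 elsewhere.
print(outputs)
print(d_inputs)
```

The forward pass never has to return the mask explicitly: it travels with the callback.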

This signature makes it easy to build a complex network out of smaller pieces, using arbitrary higher-order functions you can write yourself. To make this clearer, we need a function for a weights layer. Usually this will be implemented as a class — but let's continue using closures, to keep things concise, and to keep the simplicity of the interface explicit.

The main complication for the weights layer is that we now have a side-effect to manage: we would like to update the weights. There are a few ways to handle this. In Thinc we currently pass a callable into the backward pass. (I'm not convinced this is best.)

import numpy

def create_linear_layer(n_out, n_in):
    W = numpy.zeros((n_out, n_in))
    b = numpy.zeros((n_out, 1))

    def forward(X):
        Y = W @ X + b
        def backward(dY, optimizer):
            dX = W.T @ dY
            dW = numpy.einsum('ik,jk->ij', dY, X)
            # Sum over the batch axis so db has the same shape as b, (n_out, 1)
            db = dY.sum(axis=1, keepdims=True)

            optimizer(W, dW)
            optimizer(b, db)

            return dX
        return Y, backward
    return forward

If we call Wb = create_linear_layer(5, 4), the variable Wb will be the forward() function, implemented inside the body of create_linear_layer(). The Wb instance will have access to the W and b variables defined in its outer scope. If we invoke create_linear_layer() again, we get a new instance, with its own internal state.

The Wb instance and the relu function have exactly the same signature. This makes it easy to write higher order functions to compose them. The most obvious thing to do is chain them together:

def chain(*layers):
    def forward(X):
        backprops = []
        Y = X
        for layer in layers:
            Y, backprop = layer(Y)
            backprops.append(backprop)
        def backward(dY, optimizer):
            for backprop in reversed(backprops):
                dY = backprop(dY, optimizer)
            return dY
        return Y, backward
    return forward

We could now chain our linear layer together with the relu activation, to create a simple feed-forward network:

Wb1 = create_linear_layer(10, 5)
Wb2 = create_linear_layer(3, 10)

model = chain(Wb1, relu, Wb2)

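X = numpy.random.uniform(size=(5, 4))  # 5 features per column, batch of 4

y, bp_y = model(X)

# truth (the targets, shape (3, 4)) and optimizer (a callable taking
# weights and a gradient) are assumed to be defined by the caller.
dY = y - truth
dX = bp_y(dY, optimizer)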
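To see these pieces run end to end, here is a self-contained sketch of a small training loop in the same closure style. Several details are assumptions made for the example rather than anything Thinc prescribes: the weights start from small random values (with all-zero weights nothing would pass through the ReLU), the sgd function updates arrays in place, the targets are random, and the loss is a plain squared error.

```python
import numpy

def relu(inputs):
    mask = inputs > 0
    def backprop_relu(d_outputs, optimizer):
        return d_outputs * mask
    return inputs * mask, backprop_relu

def create_linear_layer(n_out, n_in, rng):
    # Assumption: small random initial weights instead of zeros.
    W = rng.normal(scale=0.1, size=(n_out, n_in))
    b = numpy.zeros((n_out, 1))
    def forward(X):
        Y = W @ X + b
        def backward(dY, optimizer):
            dX = W.T @ dY                        # uses W before it is updated
            dW = dY @ X.T                        # shape (n_out, n_in)
            db = dY.sum(axis=1, keepdims=True)   # shape (n_out, 1)
            optimizer(W, dW)
            optimizer(b, db)
            return dX
        return Y, backward
    return forward

def chain(*layers):
    def forward(X):
        backprops = []
        Y = X
        for layer in layers:
            Y, backprop = layer(Y)
            backprops.append(backprop)
        def backward(dY, optimizer):
            for backprop in reversed(backprops):
                dY = backprop(dY, optimizer)
            return dY
        return Y, backward
    return forward

def sgd(weights, gradient, learn_rate=0.01):
    # Assumption: the optimizer mutates the weights array in place.
    weights -= learn_rate * gradient

rng = numpy.random.default_rng(0)
model = chain(
    create_linear_layer(10, 5, rng),
    relu,
    create_linear_layer(3, 10, rng),
)
X = rng.uniform(size=(5, 4))      # 5 features, batch of 4 (one column each)
truth = rng.uniform(size=(3, 4))  # made-up targets

losses = []
for _ in range(200):
    Y, backprop = model(X)
    dY = Y - truth                # gradient of 0.5 * squared error
    losses.append(0.5 * float((dY ** 2).sum()))
    backprop(dY, sgd)

print(losses[0], losses[-1])
```

Each call to model(X) returns a fresh callback, so the state of one forward pass never leaks into another.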

This conceptual model makes Thinc very flexible. The trade-off is that Thinc is less convenient and efficient at workloads that fit exactly into what TensorFlow etc. are designed for. If your graph really is static, and your inputs are homogeneous in size and shape, Keras will likely be faster and simpler. But if you want to pass normal Python objects through your network, or handle sequences and recursions of arbitrary length or complexity, you might find Thinc's design a better fit for your problem.

Quickstart

Thinc should install cleanly with both pip and conda, for Pythons 2.7+ and 3.5+, on Linux, macOS / OSX and Windows. Its only system dependencies are a compiler tool-chain (e.g. build-essential) and the Python development headers (e.g. python-dev).

pip install thinc

For GPU support, we're grateful to use the work of Chainer's cupy module, which provides a numpy-compatible interface for GPU arrays. However, installing Chainer when no GPU is available currently causes an error. We therefore do not list Chainer as an explicit dependency — so building Thinc for GPU requires some extra steps:

export CUDA_HOME=/usr/local/cuda-8.0 # Or wherever your CUDA is
export PATH=$PATH:$CUDA_HOME/bin
pip install chainer
python -c "import cupy; assert cupy" # Check it installed
pip install thinc_gpu_ops thinc # Or `thinc[cuda]`
python -c "import thinc_gpu_ops" # Check the GPU ops were built

The rest of this section describes how to build Thinc from source. If you have Fabric installed, you can use the shortcut:

git clone https://github.com/explosion/thinc
cd thinc
fab clean env make test

You can then run the examples as follows:

fab eg.mnist
fab eg.basic_tagger
fab eg.cnn_tagger

Otherwise, you can build and test explicitly with:

git clone https://github.com/explosion/thinc
cd thinc

virtualenv .env
source .env/bin/activate

pip install -r requirements.txt
python setup.py build_ext --inplace
py.test thinc/

And then run the examples as follows:

python examples/mnist.py
python examples/basic_tagger.py
python examples/cnn_tagger.py

Usage

The Neural Network API is still subject to change, even within minor versions. You can get a feel for the current API by checking out the examples. Here are a few quick highlights.

1. Shape inference

Models can be created with some dimensions unspecified. Missing dimensions are inferred when pre-trained weights are loaded or when training begins. This eliminates a common source of programmer error:

# Invalid network — shape mismatch
model = chain(ReLu(512, 748), ReLu(512, 784), Softmax(10))

# Leave the dimensions unspecified, and you can't be wrong.
model = chain(ReLu(512), ReLu(512), Softmax())
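One way to picture shape inference, in the closure style used earlier, is a layer that delays allocating its weights until it sees data. The sketch below is an illustration only, not Thinc's implementation: create_lazy_linear_layer and its params attribute are hypothetical names, and inference happens on the first forward call rather than at the start of training.

```python
import numpy

def create_lazy_linear_layer(n_out, n_in=None):
    # Hypothetical helper: weights are allocated on first use.
    params = {}

    def forward(X):
        if "W" not in params:
            inferred = X.shape[0]  # one column per example, as above
            if n_in is not None and n_in != inferred:
                raise ValueError(
                    "declared n_in=%d but data has %d features" % (n_in, inferred)
                )
            params["W"] = numpy.zeros((n_out, inferred))
            params["b"] = numpy.zeros((n_out, 1))
        W, b = params["W"], params["b"]
        Y = W @ X + b
        def backward(dY, optimizer):
            dX = W.T @ dY
            optimizer(W, dY @ X.T)
            optimizer(b, dY.sum(axis=1, keepdims=True))
            return dX
        return Y, backward

    forward.params = params  # exposed only so the example can inspect it
    return forward

layer = create_lazy_linear_layer(512)
Y, backprop = layer(numpy.ones((784, 2)))
print(layer.params["W"].shape)  # (512, 784)

# A wrongly declared input width is caught as soon as data arrives.
mismatched = create_lazy_linear_layer(512, n_in=748)
try:
    mismatched(numpy.ones((784, 2)))
except ValueError as error:
    print(error)
```

Leaving n_in out removes the chance of declaring it wrongly in the first place.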

2. Operator overloading

The Model.define_operators() classmethod allows you to bind arbitrary binary functions to Python operators, for use in any Model instance. The method can (and should) be used as a context-manager, so that the overloading is limited to the immediate block. This allows concise and expressive model definition:

with Model.define_operators({'>>': chain}):
    model = ReLu(512) >> ReLu(512) >> Softmax()
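For readers curious how a context-managed operator binding might work, here is a minimal sketch. The Model class below is a toy stand-in written for this example, not Thinc's class, and its chain function only records names instead of composing real layers.

```python
import contextlib

class Model:
    # Toy stand-in for illustration only.
    _operators = {}

    def __init__(self, name):
        self.name = name

    @classmethod
    @contextlib.contextmanager
    def define_operators(cls, operators):
        previous = cls._operators
        cls._operators = {**previous, **operators}
        try:
            yield
        finally:
            # Restore the old bindings, even if the block raised.
            cls._operators = previous

    def __rshift__(self, other):
        if ">>" not in self._operators:
            raise TypeError("operator '>>' is not bound to a function")
        return self._operators[">>"](self, other)

def chain(first, second):
    # Records the composition by name; a real chain would compose layers.
    return Model("chain({}, {})".format(first.name, second.name))

with Model.define_operators({">>": chain}):
    model = Model("a") >> Model("b") >> Model("c")

print(model.name)  # chain(chain(a, b), c)

# Outside the block the operator is unbound again.
try:
    Model("a") >> Model("b")
except TypeError as error:
    print(error)
```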

The overloading is cleaned up at the end of the block. A fairly arbitrary zoo of functions is currently implemented. Some of the most useful:

  • chain(model1, model2): Compose two models f(x) and g(x) into a single model computing g(f(x)).
  • clone(model1, n): Create n copies of a model, each with distinct weights, and chain them together.
  • concatenate(model1, model2): Given two models with output dimensions (n,) and (m,), construct a model with output dimensions (m+n,).
  • add(model1, model2): add(f(x), g(x)) = f(x)+g(x)
  • make_tuple(model1, model2): Construct tuples of the outputs of two models, at the batch level. The backward pass expects to receive a tuple of gradients, which are routed through the appropriate model, and summed.
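To show how combinators like these fit the callback signature, here is a sketch of add and concatenate written as closures. It assumes the column-per-example layout from the linear layer earlier (features on axis 0) and uses a hypothetical weightless scale layer for the demonstration; Thinc's own implementations may differ.

```python
import numpy

def scale(factor):
    # Hypothetical weightless layer, used only to exercise the combinators.
    def forward(X):
        def backward(dY, optimizer):
            return dY * factor
        return X * factor, backward
    return forward

def add(*layers):
    def forward(X):
        outputs, backprops = zip(*[layer(X) for layer in layers])
        def backward(dY, optimizer):
            # Every branch saw the same input, so the input gradients sum.
            return sum(bp(dY, optimizer) for bp in backprops)
        return sum(outputs), backward
    return forward

def concatenate(*layers):
    def forward(X):
        outputs, backprops = zip(*[layer(X) for layer in layers])
        sizes = [Y.shape[0] for Y in outputs]
        def backward(dY, optimizer):
            # Split the incoming gradient back into one piece per branch.
            pieces = numpy.split(dY, numpy.cumsum(sizes)[:-1], axis=0)
            return sum(bp(d, optimizer) for bp, d in zip(backprops, pieces))
        return numpy.concatenate(outputs, axis=0), backward
    return forward

X = numpy.ones((2, 3))  # 2 features, batch of 3

Y_add, bp_add = add(scale(2.0), scale(3.0))(X)
dX_add = bp_add(numpy.ones((2, 3)), None)

Y_cat, bp_cat = concatenate(scale(2.0), scale(3.0))(X)
dX_cat = bp_cat(numpy.ones((4, 3)), None)

print(Y_add.shape, Y_cat.shape)  # (2, 3) (4, 3)
```

In both cases the backward pass sums the input gradients of the branches, because each branch received the same input.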

Putting these things together, here's the sort of tagging model that Thinc is designed to make easy.

with Model.define_operators({'>>': chain, '**': clone, '|': concatenate}):
    model = (
        add_eol_markers('EOL')
        >> flatten
        >> memoize(
            CharLSTM(char_width)
            | (normalize >> str2int >> Embed(word_width)))
        >> ExtractWindow(nW=2)
        >> BatchNorm(ReLu(hidden_width)) ** 3
        >> Softmax()
    )

Not all of these pieces are implemented yet, but hopefully this shows where we're going. The memoize function will be particularly important: in any batch of text, the common words will be very common. It's therefore important to evaluate models such as the CharLSTM once per word type per minibatch, rather than once per token.
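The idea behind memoize can be sketched as a wrapper that runs the inner layer once per unique word in the batch, then scatters the results back to token positions. This is an illustration under assumptions, not the planned implementation: the batch is a flat list of strings, outputs have one row per token, and create_toy_encoder is a made-up layer that records what it was asked to do.

```python
import numpy

def memoize(layer):
    def forward(words):
        unique = sorted(set(words))
        index = {word: i for i, word in enumerate(unique)}
        vectors, backprop_unique = layer(unique)  # one row per word type
        rows = numpy.array([index[word] for word in words])
        def backward(d_outputs, optimizer):
            # Sum the gradients of all tokens that share a word type.
            d_unique = numpy.zeros_like(vectors)
            numpy.add.at(d_unique, rows, d_outputs)
            return backprop_unique(d_unique, optimizer)
        return vectors[rows], backward
    return forward

def create_toy_encoder(width):
    # Made-up stand-in for an expensive layer such as a character LSTM.
    state = {"batch_sizes": [], "last_gradient": None}
    def forward(words):
        state["batch_sizes"].append(len(words))
        vectors = numpy.array([[float(len(w))] * width for w in words])
        def backward(d_vectors, optimizer):
            state["last_gradient"] = d_vectors
            return None  # strings have no gradient to pass further back
        return vectors, backward
    forward.state = state
    return forward

encoder = create_toy_encoder(width=2)
model = memoize(encoder)

words = ["the", "cat", "the", "dog", "the"]
outputs, backprop = model(words)
backprop(numpy.ones((5, 2)), None)

print(outputs.shape)                 # (5, 2)
print(encoder.state["batch_sizes"])  # [3]
```

Five tokens cost three encoder evaluations here, and the gradient row for "the" is the sum of its three token gradients.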

3. Callback-based backpropagation

Most neural network libraries use a computational graph abstraction. This takes the execution away from you, so that gradients can be computed automatically. Thinc follows a style more like the autograd library, but with larger operations. Usage is as follows:

def explicit_sgd_update(X, y):
    sgd = lambda weights, gradient: weights - gradient * 0.001
    yh, finish_update = model.begin_update(X, drop=0.2)
    finish_update(y-yh, sgd)

Separating backpropagation into three parts like this (the begin_update() call, the finish_update() callback it returns, and the optimizer passed into that callback) has many advantages. The interface to all models is completely uniform: there is no distinction between the top-level model you use as a predictor and the internal models for the layers. We also make concurrency simple, by making the begin_update() step a pure function, and separating the accumulation of the gradient from the action of the optimizer.
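The separation between accumulating a gradient and applying the optimizer can be sketched with a toy layer. Everything here is an assumption made for the example, not Thinc's code: the layer has a single scale weight, the optimizer argument is optional, and the sgd function updates the weight in place.

```python
import numpy

def create_scale_layer():
    w = numpy.ones((1,))     # a single learnable scale factor
    d_w = numpy.zeros((1,))  # gradient accumulated since the last step

    def begin_update(X):
        # Forward pass: reads w but changes no shared state.
        Y = X * w
        def finish_update(dY, optimizer=None):
            dX = dY * w               # computed before any weight change
            d_w[0] += (dY * X).sum()  # accumulate only
            if optimizer is not None:
                optimizer(w, d_w)     # apply the accumulated gradient
                d_w.fill(0.0)
            return dX
        return Y, finish_update

    begin_update.w = w      # exposed only so the example can inspect them
    begin_update.d_w = d_w
    return begin_update

def sgd(weights, gradient, learn_rate=0.1):
    weights -= learn_rate * gradient

layer = create_scale_layer()

# Two forward passes, then two backward passes, with one optimizer step.
_, finish_first = layer(numpy.array([1.0, 2.0]))
_, finish_second = layer(numpy.array([3.0]))
finish_first(numpy.array([1.0, 1.0]))   # accumulates 3.0, no update
finish_second(numpy.array([1.0]), sgd)  # accumulates 3.0 more, then steps

print(layer.w)  # roughly [0.4]
```

Because the forward step only reads the weights, several batches can be in flight before any update is applied.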

4. Class annotations

To keep the class hierarchy shallow, Thinc uses class decorators to reuse code for layer definitions. Specifically, the following decorators are available:

  • describe.attributes(): Allows attributes to be specified by keyword argument. Used especially for dimensions and parameters.
  • describe.on_init(): Allows callbacks to be specified, which will be called at the end of __init__().
  • describe.on_data(): Allows callbacks to be specified, which will be called on Model.begin_training().

🛠 Changelog

Version  Date        Description
v7.0.4   2019-03-19  Don't require thinc_gpu_ops
v7.0.3   2019-03-15  Fix pruning in beam search
v7.0.2   2019-02-23  Fix regression in linear model class
v7.0.1   2019-02-16  Fix import errors
v7.0.0   2019-02-15  Overhaul package dependencies
v6.12.1  2018-11-30  Fix msgpack pin
v6.12.0  2018-10-15  Wheels and separate GPU ops
v6.10.3  2018-07-21  Python 3.7 support and dependency updates
v6.11.2  2018-05-21  Improve GPU installation
v6.11.1  2018-05-20  Support direct linkage to BLAS libraries
v6.11.0  2018-03-16  n/a
v6.10.2  2017-12-06  Efficiency improvements and bug fixes
v6.10.1  2017-11-15  Fix GPU install and minor memory leak
v6.10.0  2017-10-28  CPU efficiency improvements, refactoring
v6.9.0   2017-10-03  Reorganize layers, bug fix to Layer Normalization
v6.8.2   2017-09-26  Fix packaging of gpu_ops
v6.8.1   2017-08-23  Fix Windows support
v6.8.0   2017-07-25  SELU layer, attention, improved GPU/CPU compatibility
v6.7.3   2017-06-05  Fix convolution on GPU
v6.7.2   2017-06-02  Bug fixes to serialization
v6.7.1   2017-06-02  Improve serialization
v6.7.0   2017-06-01  Fixes to serialization, hash embeddings and flatten ops
v6.6.0   2017-05-14  Improved GPU usage and examples
v6.5.2   2017-03-20  n/a
v6.5.1   2017-03-20  Improved linear class and Windows fix
v6.5.0   2017-03-11  Supervised similarity, fancier embedding and improvements to linear model
v6.4.0   2017-02-15  n/a
v6.3.0   2017-01-25  Efficiency improvements, argument checking and error messaging
v6.2.0   2017-01-15  Improve API and introduce overloaded operators
v6.1.3   2017-01-10  More neural network functions and training continuation
v6.1.2   2017-01-09  n/a
v6.1.1   2017-01-09  n/a
v6.1.0   2017-01-09  n/a
v6.0.0   2016-12-31  Add thinc.neural for NLP-oriented deep learning



