spacy-transformers
spaCy pipelines for pretrained BERT and other transformers
This package (previously `spacy-pytorch-transformers`) provides spaCy model pipelines that wrap Hugging Face's `transformers` package, so you can use them in spaCy. The result is convenient access to state-of-the-art transformer architectures, such as BERT, GPT-2 and XLNet.
Features
- Use pretrained transformer models like BERT, RoBERTa and XLNet to power your spaCy pipeline.
- Easy multi-task learning: backprop to one transformer model from several pipeline components.
- Train using spaCy v3's powerful and extensible config system.
- Automatic alignment of transformer output to spaCy's tokenization.
- Easily customize what transformer data is saved in the `Doc` object.
- Easily customize how long documents are processed.
- Out-of-the-box serialization and model packaging.
🚀 Installation
Installing the package from pip will automatically install all dependencies, including PyTorch and spaCy. Make sure you install this package before you install the models. Also note that this package requires Python 3.6+ and spaCy v3.
pip install spacy-transformers
For GPU installation, find your CUDA version using `nvcc --version` and add the version in brackets, e.g. `spacy-transformers[cuda92]` for CUDA 9.2 or `spacy-transformers[cuda100]` for CUDA 10.0.
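For example, on a machine with CUDA 10.0 (check yours with `nvcc --version`), the install command would be:

```bash
# Quote the extras so shells like zsh don't expand the brackets
pip install "spacy-transformers[cuda100]"
```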
If you are having trouble installing PyTorch, follow the instructions on the official website for your specific operating system and requirements.
📖 Usage
⚠️ Important note: This package has been extensively refactored to take advantage of spaCy v3. Previous versions that were built for spaCy v2 worked considerably differently. Please see previous tagged versions of this readme for documentation on prior versions.
spaCy v3 lets you use almost any statistical model to power your pipeline. You can use models implemented in a variety of frameworks, including TensorFlow, PyTorch and MXNet. To keep things sane, spaCy expects models from these frameworks to be wrapped with a common interface, using our machine learning library Thinc. A transformer model is just a statistical model, so the `spacy-transformers` package actually has very little work to do: we just have to provide a few functions that do the required plumbing. We also provide a pipeline component, `Transformer`, that lets you do multi-task learning and lets you save the transformer outputs for later use.
Training usage
The recommended workflow for training is to use spaCy v3's new config system, usually via the `spacy train-from-config` command. See here for an end-to-end example. The config system lets you describe a tree of objects by referring to creation functions, including functions you register yourself. Here's a config snippet for the `Transformer` component, along with matching Python code.
```ini
[nlp.pipeline.transformer]
factory = "transformer"
extra_annotation_setter = null
max_batch_size = 32

[nlp.pipeline.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-cased"
tokenizer_config = {"use_fast": true}

[nlp.pipeline.transformer.model.get_spans]
@span_getters = "get_doc_spans.v1"
```
```python
# Import paths here are assumptions about this alpha's package layout:
from spacy_transformers import Transformer
from spacy_transformers.layers import TransformerModel
from spacy_transformers.annotation_setters import null_annotation_setter
from spacy_transformers.span_getters import get_doc_spans

trf = Transformer(
    nlp.vocab,
    TransformerModel(
        "bert-base-cased",
        get_spans=get_doc_spans,
        tokenizer_config={"use_fast": True},
    ),
    annotation_setter=null_annotation_setter,
    max_batch_size=32,
)
# The "transformer" factory builds an object equivalent to `trf` above;
# in spaCy v3, components are added to the pipeline by name:
nlp.add_pipe("transformer", first=True)
```
The `nlp.pipeline.transformer` block adds the `transformer` component to the pipeline, and the `nlp.pipeline.transformer.model` block describes the creation of a Thinc `Model` object that will be passed into the component. The block names a function registered in the `@architectures` registry. This function will be looked up and called using the provided arguments. You're not limited to just that function --- you can write your own or use someone else's. The only limitation is that it must return an object of type `Model[List[Doc], FullTransformerBatch]`: that is, a Thinc model that takes a list of `Doc` objects and returns a `FullTransformerBatch` object with the transformer data.
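For example, here's a minimal sketch of such a function, registered under a hypothetical name and simply delegating to the built-in layer (the import paths are assumptions about this alpha's package layout):

```python
import spacy
from spacy_transformers.layers import TransformerModel  # assumed import path
from spacy_transformers.span_getters import get_doc_spans  # assumed import path


@spacy.registry.architectures("custom_transformer.v1")  # hypothetical name
def create_custom_transformer(name: str):
    # Anything goes here, as long as the return value is a
    # Model[List[Doc], FullTransformerBatch]. We just delegate to the
    # built-in TransformerModel layer, hard-coding a fast tokenizer.
    return TransformerModel(
        name,
        get_spans=get_doc_spans,
        tokenizer_config={"use_fast": True},
    )
```

You could then reference it in the config with `@architectures = "custom_transformer.v1"`.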
The same idea applies to task models that power the downstream components. Most of spaCy's built-in model creation functions support a `tok2vec` argument, which should be a Thinc layer of type `Model[List[Doc], List[Floats2d]]`. This is where we'll plug in our transformer model, using the `Tok2VecListener` layer, which sneakily delegates to the `Transformer` pipeline component.
```ini
[nlp.pipeline.ner]
factory = "ner"

[nlp.pipeline.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 128
maxout_pieces = 3
use_upper = false

[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecListener.v1"
grad_factor = 1.0

[nlp.pipeline.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```
The `Tok2VecListener` layer expects a `pooling` layer, which needs to be of type `Model[Ragged, Floats2d]`. This layer determines how the vector for each spaCy token will be computed from the zero or more source rows the token is aligned against. Here we use the `reduce_mean` layer, which averages the wordpiece rows. We could instead use `reduce_last`, `reduce_max`, or a custom function you write yourself, as in the sketch below.
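For instance, here's a minimal sketch of a custom pooling layer that takes each token's first wordpiece row (the registry name is hypothetical, and the layer assumes every token is aligned to at least one row):

```python
from thinc.api import Model, registry
from thinc.types import Floats2d, Ragged


@registry.layers("reduce_first.v1")  # hypothetical registry name
def reduce_first() -> Model[Ragged, Floats2d]:
    return Model("reduce_first", forward)


def forward(model: Model, Xr: Ragged, is_train: bool):
    # Index of the first wordpiece row aligned to each token.
    starts = Xr.lengths.cumsum() - Xr.lengths
    Y = Xr.data[starts]

    def backprop(dY: Floats2d) -> Ragged:
        # Route each token's gradient back to its first row; all
        # other wordpiece rows receive zero gradient.
        dX = model.ops.alloc2f(*Xr.data.shape)
        dX[starts] = dY
        return Ragged(dX, Xr.lengths)

    return Y, backprop
```

Swapping `@layers = "reduce_mean.v1"` for `@layers = "reduce_first.v1"` in the config above would then use this layer instead.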
You can have multiple components all listening to the same transformer model, and all passing gradients back to it. By default, all of the gradients will be equally weighted. You can control this with the `grad_factor` setting, which lets you reweight the gradients from the different listeners. For instance, setting `grad_factor = 0` would disable gradients from one of the listeners, while `grad_factor = 2.0` would multiply them by 2. This is similar to having a custom learning rate for each component. Instead of a constant, you can also provide a schedule, allowing you to freeze the shared parameters at the start of training.
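For example, the constant could be replaced with a block referencing a function from Thinc's `@schedules` registry. A hypothetical sketch using Thinc's `warmup_linear.v1`, which starts the factor at zero, warms it up over the first 250 steps, and then decays it linearly (the `pooling` block from the earlier snippet is unchanged and omitted here):

```ini
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecListener.v1"

[nlp.pipeline.ner.model.tok2vec.grad_factor]
@schedules = "warmup_linear.v1"
initial_rate = 1.0
warmup_steps = 250
total_steps = 20000
```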
Runtime usage
Transformer models can be used as drop-in replacements for other types of neural networks, so your spaCy pipeline can include them in a way that's completely invisible to the user. Users will download, load and use the model in the standard way, like any other spaCy pipeline.
Instead of using the transformers as subnetworks directly, you can also use them via the `Transformer` pipeline component. This sets the `doc._.trf_data` extension attribute, which gives you access to the transformer outputs at runtime. You can also customize how the `Transformer` object sets annotations onto the `Doc`, by customizing the `Transformer.annotation_setter` object. This callback will be called with the raw input and output data for the whole batch, along with the batch of `Doc` objects, allowing you to implement whatever you need; see the sketch at the end of this section.
```python
import spacy

nlp = spacy.load("en_core_trf_lg")
for doc in nlp.pipe(["some text", "some other text"]):
    tokvecs = doc._.trf_data.tensors[-1]  # last tensor of the transformer output
```
The `nlp` object in this example is just like any other spaCy pipeline, so see the spaCy docs for more details about what you can do.
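As an example of customizing annotations, here's a minimal sketch of an annotation setter callback that stashes each doc's data on a custom extension attribute (the `doc_data` attribute, which splits a `FullTransformerBatch` per document, is an assumption about this alpha's API):

```python
from spacy.tokens import Doc

# Register a custom extension attribute to hold the per-doc data.
Doc.set_extension("custom_trf_data", default=None)


def custom_annotation_setter(docs, trf_data) -> None:
    # Called once per batch with the Doc objects and the raw transformer
    # input/output; doc_data is assumed to split the batch per document.
    for doc, data in zip(docs, trf_data.doc_data):
        doc._.custom_trf_data = data
```

You would pass this as the `annotation_setter` argument when constructing the `Transformer` component.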