
spaCy pipeline component for adding BERT embedding metadata to Doc, Token and Span objects.

Project description

spacybert: Bert inference for spaCy

spaCy v2.0 extension and pipeline component for loading BERT sentence / document embedding metadata into Doc, Span and Token objects. The BERT backend is provided by the Hugging Face transformers library.

Installation

spacybert requires spaCy v2.0.0 or higher.
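
To install (assuming installation from PyPI, which hosts the distribution files for this release):

pip install spacybert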

Usage

Getting BERT embeddings for a single-language dataset

import spacy
from spacybert import BertInference
nlp = spacy.load('en')

Then either use BertInference as part of a pipeline,

bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    set_extension=False)
nlp.add_pipe(bert, last=True)

Or use it standalone, outside a pipeline:

bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    set_extension=True)

The difference: with set_extension=True, bert_repr is registered as a property extension on the Doc, Span and Token spaCy objects, and its value is computed each time doc._.bert_repr is accessed. With set_extension=False, bert_repr is registered as an attribute extension with a default value (None), which the pipeline component fills in when the document is processed.
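
This maps onto the two flavours of spaCy's extension API. A minimal sketch of what the two modes register (compute_bert_repr is a hypothetical getter for illustration, not the component's actual internal function):

from spacy.tokens import Doc

def compute_bert_repr(doc):
    # hypothetical getter: run BERT over the doc and pool the word embeddings
    ...

# set_extension=True: property extension, recomputed on each access
Doc.set_extension('bert_repr', getter=compute_bert_repr, force=True)

# set_extension=False: attribute extension with default None; the pipeline
# component assigns the value while the doc is being processed
Doc.set_extension('bert_repr', default=None, force=True)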

Get the BERT representation / embedding:

doc = nlp("This is a test")
print(doc._.bert_repr)  # <-- torch.Tensor
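
Because the extension is also registered on Span and Token (see the available attributes below), per-span and per-token embeddings can be read the same way with the component configured as above:

doc = nlp("This is a test")
print(doc[0]._.bert_repr)    # Token embedding
print(doc[1:3]._.bert_repr)  # Span embedding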

Getting BERT embeddings for a multi-language dataset

import spacy
from spacy_langdetect import LanguageDetector
from spacybert import MultiLangBertInference

nlp = spacy.load('en')
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)
bert = MultiLangBertInference(
    from_pretrained={
        'en': 'path/to/en_pretrained_bert_weights_dir',
        'nl': 'path/to/nl_pretrained_bert_weights_dir'
    },
    set_extension=False)
nlp.add_pipe(bert, after='language_detector')

texts = [
    "This is a test",  # English
    "Dit is een test"  # Dutch
]
for doc in nlp.pipe(texts):
    print(doc._.bert_repr)  # <-- torch.Tensor

When language_detector detects a language other than those for which pre-trained weights are specified, doc._.bert_repr defaults to None.
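
A pipeline that may encounter unsupported languages should therefore guard against this fallback:

for doc in nlp.pipe(texts):
    if doc._.bert_repr is None:
        continue  # no pre-trained weights configured for this doc's language
    print(doc._.bert_repr)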

Available attributes

The extension sets attributes on the Doc, Span and Token objects. You can change the attribute name when initializing the extension.

| extension | type | description |
| --- | --- | --- |
| Doc._.bert_repr | torch.Tensor | Document BERT embedding |
| Span._.bert_repr | torch.Tensor | Span BERT embedding |
| Token._.bert_repr | torch.Tensor | Token BERT embedding |
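
For example, to expose the embedding under a different name, pass attr_name at initialization (my_bert_repr is an arbitrary illustrative name; attr_name is documented under Settings below):

bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    attr_name='my_bert_repr',
    set_extension=True)
doc = nlp("This is a test")
print(doc._.my_bert_repr)  # same embedding, custom attribute name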

Settings

On initialization of BertInference, you can define the following:

| name | type | default | description |
| --- | --- | --- | --- |
| from_pretrained | str | None | Path to a BERT model directory, or the name of a HuggingFace transformers pre-trained BERT model, e.g. bert-base-uncased |
| attr_name | str | 'bert_repr' | Name of the BERT embedding attribute set on the ._ property |
| max_seq_len | int | 512 | Maximum sequence length for input to BERT |
| pooling_strategy | str | 'REDUCE_MEAN' | Strategy for producing a single sentence embedding from multiple word embeddings. See the pooling strategies below. |
| set_extension | bool | True | If True, bert_repr is registered as a property extension on the Doc, Span and Token spaCy objects. If False, it is registered as an attribute extension with default None, filled in when the doc runs through a pipeline. Set this to False when using the component in a spaCy pipeline. |
| force_extension | bool | True | If True, re-register the extension attribute on initialization, overwriting any existing extension with the same name |
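
Putting these together, a fully spelled-out initialization for pipeline use might look like this (from_pretrained must always be supplied; the remaining values shown are the defaults except set_extension):

bert = BertInference(
    from_pretrained='bert-base-uncased',  # HuggingFace name or local path
    attr_name='bert_repr',
    max_seq_len=512,
    pooling_strategy='REDUCE_MEAN',
    set_extension=False,   # attribute extension: use this in a pipeline
    force_extension=True)
nlp.add_pipe(bert, last=True)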

On initialization of MultiLangBertInference, you can define the following:

| name | type | default | description |
| --- | --- | --- | --- |
| from_pretrained | Dict[LANG_ISO_639_1, str] | None | Mapping from two-letter language codes to a BERT model directory path or HuggingFace transformers pre-trained model name |
| attr_name | str | 'bert_repr' | Same as in BertInference |
| max_seq_len | int | 512 | Same as in BertInference |
| pooling_strategy | str | 'REDUCE_MEAN' | Same as in BertInference |
| set_extension | bool | True | Same as in BertInference |
| force_extension | bool | True | Same as in BertInference |

Pooling strategies

| strategy | description |
| --- | --- |
| REDUCE_MEAN | Element-wise average of the word embeddings |
| REDUCE_MAX | Element-wise maximum of the word embeddings |
| REDUCE_MEAN_MAX | Apply both REDUCE_MEAN and REDUCE_MAX and concatenate the results, so a (768,)-dimensional word embedding yields a (1536,) output |
| CLS_TOKEN, FIRST_TOKEN | Take the embedding of only the first [CLS] token |
| SEP_TOKEN, LAST_TOKEN | Take the embedding of only the last [SEP] token |
| None | No reduction is applied; a matrix with one embedding per word in the sentence is returned |
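
As a rough illustration with a bert-base model (hidden size 768), the expected result shapes per strategy (exact dimensions may vary slightly by implementation):

bert = BertInference(
    from_pretrained='bert-base-uncased',
    pooling_strategy='REDUCE_MEAN_MAX',
    set_extension=True)
doc = nlp("This is a test")
print(doc._.bert_repr.shape)
# REDUCE_MEAN / REDUCE_MAX -> roughly (768,)
# REDUCE_MEAN_MAX          -> roughly (1536,), mean and max concatenated
# None                     -> roughly (n_word_pieces, 768)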

Roadmap

This extension is still experimental. Possible future updates include:

  • Getting document representations from state-of-the-art NLP models other than Google's BERT.
  • A method for computing similarity between Doc, Span and Token objects using the bert_repr tensor.
  • Getting representations from multiple / other layers of the models.
