spacybert: Bert inference for spaCy
spaCy v2.0 extension and pipeline component for loading BERT sentence / document embedding meta data to `Doc`, `Span` and `Token` objects. The Bert backend itself is supported by the Hugging Face transformers library.
Installation
`spacybert` requires `spacy` v2.0.0 or higher.
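The package is published on PyPI as `spacybert`, so a standard pip install should work:

```
pip install spacybert
```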
Usage
Getting BERT embeddings for a single-language dataset
```python
import spacy
from spacybert import BertInference

nlp = spacy.load('en')
```
Then either use `BertInference` as part of a pipeline:
```python
bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    set_extension=False)
nlp.add_pipe(bert, last=True)
```
Or use it standalone:
```python
bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    set_extension=True)
```
The difference is that when `set_extension=True`, `bert_repr` is set as a property extension for the `Doc`, `Span` and `Token` spacy objects, and the property computes the embedding when `doc._.bert_repr` is accessed. If `set_extension=False`, `bert_repr` is set as an attribute extension with a default value (`None`) which is filled with the correct value when the pipeline is run on the doc.
Get the Bert representation / embedding:
```python
doc = nlp("This is a test")
print(doc._.bert_repr)  # <-- torch.Tensor
```
Getting BERT embeddings for a multi-language dataset
```python
import spacy
from spacy_langdetect import LanguageDetector
from spacybert import MultiLangBertInference

nlp = spacy.load('en')
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)
bert = MultiLangBertInference(
    from_pretrained={
        'en': 'path/to/en_pretrained_bert_weights_dir',
        'nl': 'path/to/nl_pretrained_bert_weights_dir'
    },
    set_extension=False)
nlp.add_pipe(bert, after='language_detector')

texts = [
    "This is a test",  # English
    "Dit is een test"  # Dutch
]
for doc in nlp.pipe(texts):
    print(doc._.bert_repr)  # <-- torch.Tensor
```
When `language_detector` detects a language other than the ones for which pre-trained weights are specified, by default `doc._.bert_repr = None`.
Available attributes
The extension sets attributes on the `Doc`, `Span` and `Token` objects. You can change the attribute name on initializing the extension.
| extension | type | description |
|---|---|---|
| `Doc._.bert_repr` | `torch.Tensor` | Document BERT embedding |
| `Span._.bert_repr` | `torch.Tensor` | Span BERT embedding |
| `Token._.bert_repr` | `torch.Tensor` | Token BERT embedding |
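For example, with the single-language pipeline above in place, the same attribute is readable at all three levels. A minimal sketch; the exact shapes depend on the model and pooling strategy, and the 768-dim comments assume a base-sized Bert:

```python
doc = nlp("This is a test")
print(doc._.bert_repr.shape)       # document embedding, e.g. torch.Size([768])
print(doc[1:3]._.bert_repr.shape)  # span embedding for "is a"
print(doc[0]._.bert_repr.shape)    # token embedding for "This"
```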
Settings
On initialization of `BertInference`, you can define the following:

| name | type | default | description |
|---|---|---|---|
| `from_pretrained` | `str` | `None` | Path to a Bert model directory or the name of HuggingFace transformers pre-trained Bert weights, e.g. `bert-base-uncased` |
| `attr_name` | `str` | `'bert_repr'` | Name of the BERT embedding attribute to set on the `._` property |
| `max_seq_len` | `int` | `512` | Maximum sequence length for input to Bert |
| `pooling_strategy` | `str` | `'REDUCE_MEAN'` | Strategy to generate a single sentence embedding from multiple word embeddings. See below for the available pooling strategies. |
| `set_extension` | `bool` | `True` | If `True`, `'bert_repr'` is set as a property extension for the `Doc`, `Span` and `Token` spacy objects. If `False`, `'bert_repr'` is set as an attribute extension with a default value (`None`) which gets filled correctly when the pipeline is run. Set it to `False` if you want to use this extension in a spacy pipeline. |
| `force_extension` | `bool` | `True` | If `True`, re-create the extension attribute even if it already exists (e.g. when the component is initialized again) |
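As an illustration of these settings, a non-default initialization might look like the following. A sketch only; it assumes the `nlp` object from the usage examples above, and `bert-base-uncased` is just the example model name from the table:

```python
from spacybert import BertInference

bert = BertInference(
    from_pretrained='bert-base-uncased',  # HuggingFace model name or a local directory
    attr_name='bert_repr',
    max_seq_len=256,
    pooling_strategy='REDUCE_MEAN_MAX',   # mean- and max-pooled embeddings, concatenated
    set_extension=False,                  # False because the component goes in a pipeline
    force_extension=True)
nlp.add_pipe(bert, last=True)
```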
On initialization of `MultiLangBertInference`, you can define the following:

| name | type | default | description |
|---|---|---|---|
| `from_pretrained` | `Dict[LANG_ISO_639_1, str]` | `None` | Mapping from two-letter language codes to paths to model directories or HuggingFace transformers pre-trained Bert weights |
| `attr_name` | `str` | `'bert_repr'` | Same as in `BertInference` |
| `max_seq_len` | `int` | `512` | Same as in `BertInference` |
| `pooling_strategy` | `str` | `'REDUCE_MEAN'` | Same as in `BertInference` |
| `set_extension` | `bool` | `True` | Same as in `BertInference` |
| `force_extension` | `bool` | `True` | Same as in `BertInference` |
Pooling strategies

| strategy | description |
|---|---|
| `REDUCE_MEAN` | Element-wise average of the word embeddings |
| `REDUCE_MAX` | Element-wise maximum of the word embeddings |
| `REDUCE_MEAN_MAX` | Apply both `'REDUCE_MEAN'` and `'REDUCE_MAX'` and concatenate; if the original word embedding has dimensions `(768,)`, the output will have shape `(1536,)` |
| `CLS_TOKEN`, `FIRST_TOKEN` | Take the embedding of only the first `[CLS]` token |
| `SEP_TOKEN`, `LAST_TOKEN` | Take the embedding of only the last `[SEP]` token |
| `None` | No reduction is applied; a matrix with one embedding per word in the sentence is returned |
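To make the shape differences concrete, here is a hedged sketch comparing strategies. It assumes a 768-dimensional base model, the placeholder weights path from the usage examples, and that re-initializing with `force_extension=True` rebinds the property extension:

```python
for strategy in ['REDUCE_MEAN', 'REDUCE_MEAN_MAX', 'CLS_TOKEN', None]:
    bert = BertInference(
        from_pretrained='path/to/pretrained_bert_weights_dir',
        pooling_strategy=strategy,
        set_extension=True,    # property extension: computed on access
        force_extension=True)  # overwrite the extension between iterations
    doc = nlp("This is a test")
    # expected shapes for a 768-dim model:
    #   REDUCE_MEAN     -> (768,)
    #   REDUCE_MEAN_MAX -> (1536,)
    #   CLS_TOKEN       -> (768,)
    #   None            -> (num_tokens, 768)
    print(strategy, doc._.bert_repr.shape)
```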
Roadmap
This extension is still experimental. Possible future updates include:
- Getting document representations from state-of-the-art NLP models other than Google's BERT.
- A method for computing similarity between `Doc`, `Span` and `Token` objects using the `bert_repr` tensor.
- Getting representations from multiple / other layers in the models.