SpaCy models for using sentence-BERT
Sentence-BERT for spaCy
This package wraps sentence-transformers (also known as sentence-BERT) directly in spaCy. You can substitute the vectors provided in any spaCy model with vectors that have been tuned specifically for semantic similarity.
The models below are suggested for analysing sentence similarity, as the STS benchmark indicates.
Keep in mind that sentence-transformers models are configured with a maximum sequence length of 128. For longer texts, other models (e.g. Universal Sentence Encoder) may therefore be more suitable.
Install
Compatibility:
- spaCy>=3.0.0,<4.0.0, tested on version 3.0.3
- sentence-transformers: tested on version 0.1.4
To install this package, you can run one of the following:
pip install spacy_sentence_bert
pip install git+https://github.com/MartinoMensio/spacy-sentence-bert.git
You can install standalone spaCy packages from GitHub with pip. If you install standalone packages, you can load a language model directly with the spacy.load API, without needing to add a pipeline stage.
This table takes the models listed in the Sentence Transformers documentation and shows some statistics along with the instructions to install the standalone models.
If you don't want to install the standalone models, you can still use them by adding a pipeline stage (see below).
sentence-BERT name | spaCy model name | dimensions | language | STS benchmark | standalone install |
---|---|---|---|---|---|
paraphrase-distilroberta-base-v1 | en_paraphrase_distilroberta_base_v1 | 768 | en | 81.81 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_paraphrase_distilroberta_base_v1-0.1.2.tar.gz#en_paraphrase_distilroberta_base_v1-0.1.2 |
paraphrase-xlm-r-multilingual-v1 | xx_paraphrase_xlm_r_multilingual_v1 | 768 | 50+ | 83.50 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_paraphrase_xlm_r_multilingual_v1-0.1.2.tar.gz#xx_paraphrase_xlm_r_multilingual_v1-0.1.2 |
stsb-roberta-large | en_stsb_roberta_large | 1024 | en | 86.39 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_roberta_large-0.1.2.tar.gz#en_stsb_roberta_large-0.1.2 |
stsb-roberta-base | en_stsb_roberta_base | 768 | en | 85.44 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_roberta_base-0.1.2.tar.gz#en_stsb_roberta_base-0.1.2 |
stsb-bert-large | en_stsb_bert_large | 1024 | en | 85.29 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_bert_large-0.1.2.tar.gz#en_stsb_bert_large-0.1.2 |
stsb-distilbert-base | en_stsb_distilbert_base | 768 | en | 85.16 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_distilbert_base-0.1.2.tar.gz#en_stsb_distilbert_base-0.1.2 |
stsb-bert-base | en_stsb_bert_base | 768 | en | 85.14 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_bert_base-0.1.2.tar.gz#en_stsb_bert_base-0.1.2 |
nli-bert-large | en_nli_bert_large | 1024 | en | 79.19 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_large-0.1.2.tar.gz#en_nli_bert_large-0.1.2 |
nli-distilbert-base | en_nli_distilbert_base | 768 | en | 78.69 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_distilbert_base-0.1.2.tar.gz#en_nli_distilbert_base-0.1.2 |
nli-roberta-large | en_nli_roberta_large | 1024 | en | 78.69 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_roberta_large-0.1.2.tar.gz#en_nli_roberta_large-0.1.2 |
nli-bert-large-max-pooling | en_nli_bert_large_max_pooling | 1024 | en | 78.41 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_large_max_pooling-0.1.2.tar.gz#en_nli_bert_large_max_pooling-0.1.2 |
nli-bert-large-cls-pooling | en_nli_bert_large_cls_pooling | 1024 | en | 78.29 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_large_cls_pooling-0.1.2.tar.gz#en_nli_bert_large_cls_pooling-0.1.2 |
nli-distilbert-base-max-pooling | en_nli_distilbert_base_max_pooling | 768 | en | 77.61 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_distilbert_base_max_pooling-0.1.2.tar.gz#en_nli_distilbert_base_max_pooling-0.1.2 |
nli-roberta-base | en_nli_roberta_base | 768 | en | 77.49 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_roberta_base-0.1.2.tar.gz#en_nli_roberta_base-0.1.2 |
nli-bert-base-max-pooling | en_nli_bert_base_max_pooling | 768 | en | 77.21 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_base_max_pooling-0.1.2.tar.gz#en_nli_bert_base_max_pooling-0.1.2 |
nli-bert-base | en_nli_bert_base | 768 | en | 77.12 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_base-0.1.2.tar.gz#en_nli_bert_base-0.1.2 |
nli-bert-base-cls-pooling | en_nli_bert_base_cls_pooling | 768 | en | 76.30 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_base_cls_pooling-0.1.2.tar.gz#en_nli_bert_base_cls_pooling-0.1.2 |
average_word_embeddings_glove.6B.300d | en_average_word_embeddings_glove.6B.300d | 768 | en | 61.77 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_glove.6B.300d-0.1.2.tar.gz#en_average_word_embeddings_glove.6B.300d-0.1.2 |
average_word_embeddings_komninos | en_average_word_embeddings_komninos | 768 | en | 61.56 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_komninos-0.1.2.tar.gz#en_average_word_embeddings_komninos-0.1.2 |
average_word_embeddings_levy_dependency | en_average_word_embeddings_levy_dependency | 768 | en | 59.22 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_levy_dependency-0.1.2.tar.gz#en_average_word_embeddings_levy_dependency-0.1.2 |
average_word_embeddings_glove.840B.300d | en_average_word_embeddings_glove.840B.300d | 768 | en | 52.54 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_glove.840B.300d-0.1.2.tar.gz#en_average_word_embeddings_glove.840B.300d-0.1.2 |
quora-distilbert-base | en_quora_distilbert_base | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_quora_distilbert_base-0.1.2.tar.gz#en_quora_distilbert_base-0.1.2 |
quora-distilbert-multilingual | xx_quora_distilbert_multilingual | 768 | 50+ | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_quora_distilbert_multilingual-0.1.2.tar.gz#xx_quora_distilbert_multilingual-0.1.2 |
msmarco-distilroberta-base-v2 | en_msmarco_distilroberta_base_v2 | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_msmarco_distilroberta_base_v2-0.1.2.tar.gz#en_msmarco_distilroberta_base_v2-0.1.2 |
msmarco-roberta-base-v2 | en_msmarco_roberta_base_v2 | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_msmarco_roberta_base_v2-0.1.2.tar.gz#en_msmarco_roberta_base_v2-0.1.2 |
msmarco-distilbert-base-v2 | en_msmarco_distilbert_base_v2 | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_msmarco_distilbert_base_v2-0.1.2.tar.gz#en_msmarco_distilbert_base_v2-0.1.2 |
nq-distilbert-base-v1 | en_nq_distilbert_base_v1 | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nq_distilbert_base_v1-0.1.2.tar.gz#en_nq_distilbert_base_v1-0.1.2 |
distiluse-base-multilingual-cased-v2 | xx_distiluse_base_multilingual_cased_v2 | 512 | 50+ | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_distiluse_base_multilingual_cased_v2-0.1.2.tar.gz#xx_distiluse_base_multilingual_cased_v2-0.1.2 |
stsb-xlm-r-multilingual | xx_stsb_xlm_r_multilingual | 768 | 50+ | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_stsb_xlm_r_multilingual-0.1.2.tar.gz#xx_stsb_xlm_r_multilingual-0.1.2 |
T-Systems-onsite/cross-en-de-roberta-sentence-transformer | xx_cross_en_de_roberta_sentence_transformer | 768 | en,de | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_cross_en_de_roberta_sentence_transformer-0.1.2.tar.gz#xx_cross_en_de_roberta_sentence_transformer-0.1.2 |
LaBSE | xx_LaBSE | 768 | 109 | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_LaBSE-0.1.2.tar.gz#xx_LaBSE-0.1.2 |
allenai-specter | en_allenai_specter | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_allenai_specter-0.1.2.tar.gz#en_allenai_specter-0.1.2 |
If your model is not in this list (e.g., xlm-r-base-en-ko-nli-ststb), you can still use it with this library, but not as a standalone language. You will need to add a properly configured pipeline stage (see the nlp.add_pipe API below).
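The naming and URL pattern in the table above can be sketched as a small helper. This is only an illustration: the rule (drop any organisation prefix, replace `-` with `_`, prepend the language code) is read off the table rows, not taken from the package source, and the function names `spacy_model_name` and `install_url` are hypothetical.

```python
RELEASE = "0.1.2"  # release tag used by the install links in the table
BASE_URL = "https://github.com/MartinoMensio/spacy-sentence-bert/releases/download"


def spacy_model_name(sbert_name: str, lang: str = "en") -> str:
    """Derive the spaCy package name from a sentence-BERT model name.

    Assumed rule (inferred from the table): keep only the part after any
    'organisation/' prefix, replace '-' with '_', and prepend the language
    code ('en' for English models, 'xx' for multilingual ones).
    """
    short_name = sbert_name.split("/")[-1]
    return f"{lang}_{short_name.replace('-', '_')}"


def install_url(model_name: str, release: str = RELEASE) -> str:
    """Build the archive URL that the table passes to pip install."""
    archive = f"{model_name}-{release}"
    return f"{BASE_URL}/v{release}/{archive}.tar.gz#{archive}"


name = spacy_model_name("stsb-roberta-large")
print("pip install " + install_url(name))
```

Standalone archives exist only for the models listed in the table, so a URL built this way for any other model is not expected to resolve.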
Usage
There are different ways to load the sentence-bert models:
- spacy.load API: you need to have installed one of the models from the table above
- spacy_sentence_bert.load_model: you can load one of the models from the table above without having installed the standalone packages
- nlp.add_pipe API: you can load any of the sentence-bert models on top of your nlp object
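For models from the table, the choice between the first two options can be sketched as a fallback: use spacy.load when the standalone package is installed, otherwise spacy_sentence_bert.load_model. The helper names `is_installed` and `load_sentence_bert` are hypothetical, and the sketch assumes an installed standalone model shows up as an installed distribution under its spaCy model name.

```python
from importlib import metadata


def is_installed(package_name: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        metadata.version(package_name)
        return True
    except metadata.PackageNotFoundError:
        return False


def load_sentence_bert(model_name: str):
    """Load a model from the table, preferring an installed standalone package."""
    if is_installed(model_name):
        # Imported lazily so that is_installed() works without spaCy.
        import spacy

        return spacy.load(model_name)
    # Fall back to the loader provided by this package.
    import spacy_sentence_bert

    return spacy_sentence_bert.load_model(model_name)
```

Calling `load_sentence_bert('en_stsb_roberta_large')` would then return an nlp object either way. Models outside the table still need the nlp.add_pipe route.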
spacy.load API
If you have installed a standalone model from GitHub (e.g., from the table above, pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_roberta_large-0.1.2.tar.gz#en_stsb_roberta_large-0.1.2), you can load the model directly with the spaCy API:
import spacy
nlp = spacy.load('en_stsb_roberta_large')
spacy_sentence_bert.load_model API
You can obtain the same result without having to install the standalone model, by using this method:
import spacy_sentence_bert
nlp = spacy_sentence_bert.load_model('en_stsb_roberta_large')
nlp.add_pipe API
If you want to use one of the sentence embeddings over an existing Language object, you can use the nlp.add_pipe method.
This also works if you want to use a language model that is not listed in the table above. Just make sure that sentence-transformers supports it.
import spacy
nlp = spacy.blank('en')
nlp.add_pipe('sentence_bert', config={'model_name': 'allenai-specter'})
nlp.pipe_names
When first used, the models download sentence-BERT to the folder defined by the TORCH_HOME environment variable (default ~/.cache/torch).
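To redirect the download, TORCH_HOME can be set from Python before the first model is loaded. The snippet below is a sketch that only mirrors the rule stated above (TORCH_HOME if set, otherwise ~/.cache/torch). The helper `sentence_bert_cache_dir` and the path `/tmp/sentence_bert_cache` are hypothetical.

```python
import os


def sentence_bert_cache_dir() -> str:
    """Folder where the models are expected to be downloaded.

    Mirrors the documented rule: TORCH_HOME if set, else ~/.cache/torch.
    """
    return os.environ.get("TORCH_HOME") or os.path.expanduser("~/.cache/torch")


# Hypothetical location; set it before the first model is loaded.
os.environ["TORCH_HOME"] = "/tmp/sentence_bert_cache"
print(sentence_bert_cache_dir())
```

Setting the variable after a model has already been loaded is not expected to move files that were downloaded earlier.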
Once you have loaded the model, use it through the vector property and the similarity method of spaCy:
# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# get the vector of the Doc, Span or Token
print(doc_1.vector.shape)
print(doc_1[3].vector.shape)
print(doc_1[2:4].vector.shape)
# or use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))
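For intuition about what the similarity call returns: spaCy's default similarity is the cosine similarity of the two vectors. The function below is a plain-Python sketch of that measure. The name `cosine_similarity` is ours, and it assumes this package keeps spaCy's default measure rather than replacing it.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector has similarity 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, not real embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

With a loaded model, `cosine_similarity(doc_1.vector, doc_2[0:7].vector)` should be close to `doc_1.similarity(doc_2[0:7])`, up to floating-point differences.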
Utils
To build and upload
VERSION=0.1.2
# build the standalone models (17)
./build_models.sh
# build the archive at dist/spacy_sentence_bert-${VERSION}.tar.gz
python setup.py sdist
# upload to pypi
twine upload dist/spacy_sentence_bert-${VERSION}.tar.gz