Skip to main content

SpaCy models for using sentence-BERT

Project description

Sentence-BERT for spaCy

This package wraps sentence-transformers (also known as sentence-BERT) directly in spaCy. You can substitute the vectors provided in any spaCy model with vectors that have been tuned specifically for semantic similarity.

The models below are suggested for analysing sentence similarity, as the STS benchmark indicates. Keep in mind that sentence-transformers are configured with a maximum sequence length of 128. Therefore for longer texts it may be more suitable to work with other models (e.g. Universal Sentence Encoder).

Install

To install this package, you can run one of the following:

  • pip install spacy_sentence_bert
  • pip install git+https://github.com/MartinoMensio/spacy-sentence-bert.git

You can install standalone spaCy packages from GitHub with pip. From the full list of models this table describes the models available.

sentence-BERT name spacy model name dimensions language STS benchmark standalone install
bert-base-nli-mean-tokens en_bert_base_nli_mean_tokens 768 en 77.12 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_mean_tokens-0.0.4.tar.gz#en_bert_base_nli_mean_tokens-0.0.4
bert-base-nli-max-tokens en_bert_base_nli_max_tokens 768 en 77.21 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_max_tokens-0.0.4.tar.gz#en_bert_base_nli_max_tokens-0.0.4
bert-base-nli-cls-token en_bert_base_nli_cls_token 768 en 76.30 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_cls_token-0.0.4.tar.gz#en_bert_base_nli_cls_token-0.0.4
bert-large-nli-mean-tokens en_bert_large_nli_mean_tokens 1024 en 79.19 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_mean_tokens-0.0.4.tar.gz#en_bert_large_nli_mean_tokens-0.0.4
bert-large-nli-max-tokens en_bert_large_nli_max_tokens 1024 en 78.41 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_max_tokens-0.0.4.tar.gz#en_bert_large_nli_max_tokens-0.0.4
bert-large-nli-cls-token en_bert_large_nli_max_tokens 1024 en 78.29 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_max_tokens-0.0.4.tar.gz#en_bert_large_nli_max_tokens-0.0.4
roberta-base-nli-mean-tokens en_roberta_base_nli_mean_tokens 768 en 77.49 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_base_nli_mean_tokens-0.0.4.tar.gz#en_roberta_base_nli_mean_tokens-0.0.4
roberta-large-nli-mean-tokens en_roberta_large_nli_mean_tokens 1024 en 78.69 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_large_nli_mean_tokens-0.0.4.tar.gz#en_roberta_large_nli_mean_tokens-0.0.4
distilbert-base-nli-mean-tokens en_distilbert_base_nli_mean_tokens 768 en 76.97 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_distilbert_base_nli_mean_tokens-0.0.4.tar.gz#en_distilbert_base_nli_mean_tokens-0.0.4
bert-base-nli-stsb-mean-tokens en_bert_base_nli_stsb_mean_tokens 768 en 85.14 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_stsb_mean_tokens-0.0.4.tar.gz#en_bert_base_nli_stsb_mean_tokens-0.0.4
bert-large-nli-stsb-mean-tokens en_bert_large_nli_stsb_mean_tokens 1024 en 85.29 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_stsb_mean_tokens-0.0.4.tar.gz#en_bert_large_nli_stsb_mean_tokens-0.0.4
roberta-base-nli-stsb-mean-tokens en_roberta_base_nli_stsb_mean_tokens 768 en 85.40 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_base_nli_stsb_mean_tokens-0.0.4.tar.gz#en_roberta_base_nli_stsb_mean_tokens-0.0.4
roberta-large-nli-stsb-mean-tokens en_roberta_large_nli_stsb_mean_tokens 1024 en 86.31 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_large_nli_stsb_mean_tokens-0.0.4.tar.gz#en_roberta_large_nli_stsb_mean_tokens-0.0.4
distilbert-base-nli-stsb-mean-tokens en_distilbert_base_nli_stsb_mean_tokens 768 en 84.38 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_distilbert_base_nli_stsb_mean_tokens-0.0.4.tar.gz#en_distilbert_base_nli_stsb_mean_tokens-0.0.4
distiluse-base-multilingual-cased xx_distiluse_base_multilingual_cased 512 Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish 80.10 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/xx_distiluse_base_multilingual_cased-0.0.4.tar.gz#xx_distiluse_base_multilingual_cased-0.0.4
xlm-r-base-en-ko-nli-ststb xx_xlm_r_base_en_ko_nli_ststb 768 en,ko 81.47 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/xx_xlm_r_base_en_ko_nli_ststb-0.0.4.tar.gz#xx_xlm_r_base_en_ko_nli_ststb-0.0.4
xlm-r-large-en-ko-nli-ststb xx_xlm_r_base_en_ko_nli_ststb 1024 en,ko 84.05 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/xx_xlm_r_base_en_ko_nli_ststb-0.0.4.tar.gz#xx_xlm_r_base_en_ko_nli_ststb-0.0.4

The models, when first used, download to the folder defined with TORCH_HOME in the environment variables (default ~/.cache/torch).

Usage

With this package installed you can obtain a Language model with:

import spacy_sentence_bert
nlp = spacy_sentence_bert.load_model('en_roberta_large_nli_stsb_mean_tokens')

Or if a specific standalone model is installed from GitHub, you can load it from spaCy:

pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/en_roberta_large_nli_stsb_mean_tokens-0.0.4/en_roberta_large_nli_stsb_mean_tokens-0.0.4.tar.gz
import spacy
nlp = spacy.load('en_roberta_large_nli_stsb_mean_tokens')

Or if you want to use one of the sentence embeddings over an existing Language object, you can use the create_from method:

import spacy
import spacy_sentence_bert
nlp_base = spacy.load('en')
nlp = spacy_sentence_bert.create_from(nlp_base, 'en_bert_base_nli_cls_token')
nlp.pipe_names

Once you have loaded the model, simply use it to obtain vectors and using the similarity method of spaCy:

# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# get the vector of the Doc, Span or Token
print(doc_1.vector.shape)
print(doc_1[3].vector.shape)
print(doc_1[2:4].vector.shape)
# or use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))

Utils

To build and upload

VERSION=0.0.4
# build the standalone models
./build_models.sh
# build dist/spacy_sentence_bert-${VERSION}.tar.gz
python setup.py sdist
# upload to pypi
twine upload dist/spacy_sentence_bert-${VERSION}.tar.gz

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spacy_sentence_bert-0.0.4.tar.gz (8.2 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page