SpaCy models for using sentence-BERT

Project description

Sentence-BERT for spaCy

This package wraps sentence-transformers (also known as sentence-BERT) directly in spaCy. You can substitute the vectors provided in any spaCy model with vectors that have been tuned specifically for semantic similarity.

The models below are suggested for analysing sentence similarity, based on their STS benchmark scores. Keep in mind that sentence-transformers models are configured with a maximum sequence length of 128 tokens, so longer input is truncated. For longer texts, other models (e.g. Universal Sentence Encoder) may be more suitable.

Install

To install this package, you can run one of the following:

pip install spacy_sentence_bert
pip install git+https://github.com/MartinoMensio/spacy-sentence-bert.git

You can also install standalone spaCy model packages from GitHub with pip. The table below lists the available models, selected from the full list of sentence-transformers models.

sentence-BERT name	spacy model name	dimensions	language	STS benchmark	standalone install
`bert-base-nli-mean-tokens`	`en_bert_base_nli_mean_tokens`	768	en	77.12	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_mean_tokens-0.0.4.tar.gz#en_bert_base_nli_mean_tokens-0.0.4`
`bert-base-nli-max-tokens`	`en_bert_base_nli_max_tokens`	768	en	77.21	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_max_tokens-0.0.4.tar.gz#en_bert_base_nli_max_tokens-0.0.4`
`bert-base-nli-cls-token`	`en_bert_base_nli_cls_token`	768	en	76.30	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_cls_token-0.0.4.tar.gz#en_bert_base_nli_cls_token-0.0.4`
`bert-large-nli-mean-tokens`	`en_bert_large_nli_mean_tokens`	1024	en	79.19	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_mean_tokens-0.0.4.tar.gz#en_bert_large_nli_mean_tokens-0.0.4`
`bert-large-nli-max-tokens`	`en_bert_large_nli_max_tokens`	1024	en	78.41	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_max_tokens-0.0.4.tar.gz#en_bert_large_nli_max_tokens-0.0.4`
`bert-large-nli-cls-token`	`en_bert_large_nli_cls_token`	1024	en	78.29	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_cls_token-0.0.4.tar.gz#en_bert_large_nli_cls_token-0.0.4`
`roberta-base-nli-mean-tokens`	`en_roberta_base_nli_mean_tokens`	768	en	77.49	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_base_nli_mean_tokens-0.0.4.tar.gz#en_roberta_base_nli_mean_tokens-0.0.4`
`roberta-large-nli-mean-tokens`	`en_roberta_large_nli_mean_tokens`	1024	en	78.69	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_large_nli_mean_tokens-0.0.4.tar.gz#en_roberta_large_nli_mean_tokens-0.0.4`
`distilbert-base-nli-mean-tokens`	`en_distilbert_base_nli_mean_tokens`	768	en	76.97	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_distilbert_base_nli_mean_tokens-0.0.4.tar.gz#en_distilbert_base_nli_mean_tokens-0.0.4`
`bert-base-nli-stsb-mean-tokens`	`en_bert_base_nli_stsb_mean_tokens`	768	en	85.14	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_stsb_mean_tokens-0.0.4.tar.gz#en_bert_base_nli_stsb_mean_tokens-0.0.4`
`bert-large-nli-stsb-mean-tokens`	`en_bert_large_nli_stsb_mean_tokens`	1024	en	85.29	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_stsb_mean_tokens-0.0.4.tar.gz#en_bert_large_nli_stsb_mean_tokens-0.0.4`
`roberta-base-nli-stsb-mean-tokens`	`en_roberta_base_nli_stsb_mean_tokens`	768	en	85.40	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_base_nli_stsb_mean_tokens-0.0.4.tar.gz#en_roberta_base_nli_stsb_mean_tokens-0.0.4`
`roberta-large-nli-stsb-mean-tokens`	`en_roberta_large_nli_stsb_mean_tokens`	1024	en	86.31	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_large_nli_stsb_mean_tokens-0.0.4.tar.gz#en_roberta_large_nli_stsb_mean_tokens-0.0.4`
`distilbert-base-nli-stsb-mean-tokens`	`en_distilbert_base_nli_stsb_mean_tokens`	768	en	84.38	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_distilbert_base_nli_stsb_mean_tokens-0.0.4.tar.gz#en_distilbert_base_nli_stsb_mean_tokens-0.0.4`
`distiluse-base-multilingual-cased`	`xx_distiluse_base_multilingual_cased`	512	Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish	80.10	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/xx_distiluse_base_multilingual_cased-0.0.4.tar.gz#xx_distiluse_base_multilingual_cased-0.0.4`
`xlm-r-base-en-ko-nli-ststb`	`xx_xlm_r_base_en_ko_nli_ststb`	768	en,ko	81.47	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/xx_xlm_r_base_en_ko_nli_ststb-0.0.4.tar.gz#xx_xlm_r_base_en_ko_nli_ststb-0.0.4`
`xlm-r-large-en-ko-nli-ststb`	`xx_xlm_r_large_en_ko_nli_ststb`	1024	en,ko	84.05	`pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/xx_xlm_r_large_en_ko_nli_ststb-0.0.4.tar.gz#xx_xlm_r_large_en_ko_nli_ststb-0.0.4`
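
The rows of the table appear to follow a simple naming rule: the spaCy model name is a language prefix (`en` or `xx`) followed by the sentence-BERT name with hyphens replaced by underscores, and the install URL is built from that name and the release version. The sketch below illustrates that rule. The helper names `spacy_model_name` and `standalone_install_command` are hypothetical and are not part of this package; the rule itself is inferred from the table, not from the package's code.

```python
# Hypothetical helpers: the naming rule is inferred from the table rows,
# not taken from the spacy_sentence_bert API.
RELEASE_BASE = "https://github.com/MartinoMensio/spacy-sentence-bert/releases/download"


def spacy_model_name(sbert_name: str, lang: str = "en") -> str:
    """Build the spaCy package name: language prefix + name with '-' replaced by '_'."""
    return f"{lang}_{sbert_name.replace('-', '_')}"


def standalone_install_command(sbert_name: str, lang: str = "en", version: str = "0.0.4") -> str:
    """Build the pip command for a standalone model, following the table's URL pattern."""
    package = f"{spacy_model_name(sbert_name, lang)}-{version}"
    return f"pip install {RELEASE_BASE}/v{version}/{package}.tar.gz#{package}"


print(spacy_model_name("bert-base-nli-mean-tokens"))
print(spacy_model_name("distiluse-base-multilingual-cased", lang="xx"))
print(standalone_install_command("roberta-large-nli-stsb-mean-tokens"))
```

Multilingual models use the `xx` prefix, so the language has to be passed explicitly for those rows.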

When a model is first used, it is downloaded to the folder set by the TORCH_HOME environment variable (default ~/.cache/torch).
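
As a minimal sketch of the lookup described above, the snippet below resolves the download folder from `TORCH_HOME` and falls back to `~/.cache/torch`. The function name `torch_cache_dir` is hypothetical; it only mirrors the behaviour stated here and does not call into torch or this package.

```python
import os
from pathlib import Path


def torch_cache_dir() -> Path:
    """Return the folder named by TORCH_HOME, or ~/.cache/torch if it is unset."""
    return Path(os.environ.get("TORCH_HOME", "~/.cache/torch")).expanduser()


# Show where models would be stored with the current environment
print(torch_cache_dir())
```

To store the models elsewhere, set `TORCH_HOME` (for example with `os.environ["TORCH_HOME"] = ...` or in the shell) before the model is loaded for the first time.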

Usage

With this package installed you can obtain a Language model with:

import spacy_sentence_bert
nlp = spacy_sentence_bert.load_model('en_roberta_large_nli_stsb_mean_tokens')

Or if a specific standalone model is installed from GitHub, you can load it from spaCy:

pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_large_nli_stsb_mean_tokens-0.0.4.tar.gz#en_roberta_large_nli_stsb_mean_tokens-0.0.4

import spacy
nlp = spacy.load('en_roberta_large_nli_stsb_mean_tokens')

Or if you want to use one of the sentence embeddings over an existing Language object, you can use the create_from method:

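import spacy
import spacy_sentence_bert
# load an existing spaCy pipeline (the 'en' shortcut only exists in spaCy 2.x)
nlp_base = spacy.load('en_core_web_sm')
# replace its vectors with the sentence-BERT ones
nlp = spacy_sentence_bert.create_from(nlp_base, 'en_bert_base_nli_cls_token')
# list the components of the resulting pipeline
print(nlp.pipe_names)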

Once you have loaded the model, use it to obtain vectors and to compare texts with spaCy's similarity method:

# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# get the vector of the Doc, Span or Token
print(doc_1.vector.shape)
print(doc_1[3].vector.shape)
print(doc_1[2:4].vector.shape)
# or use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))
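
spaCy's default `similarity` is the cosine similarity of the two vectors. Assuming this package leaves that default in place, the score printed above can be understood through the pure-Python sketch below. The function `cosine_similarity` and the toy vectors are illustrative only; real vectors would come from `doc.vector`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for doc_1.vector and doc_2.vector
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, so the score is close to 1
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1, so the score is 0

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Because only the direction of the vectors matters, scaling a vector does not change the score.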

Utils

To build and upload

VERSION=0.0.4
# build the standalone models
./build_models.sh
# build dist/spacy_sentence_bert-${VERSION}.tar.gz
python setup.py sdist
# upload to pypi
twine upload dist/spacy_sentence_bert-${VERSION}.tar.gz

Project details

Release history

Version	Release date
0.1.2	Mar 16, 2021
0.1.1	Mar 7, 2021
0.1.0	Mar 7, 2021
0.0.4 (this version)	Jul 24, 2020
0.0.3	Jul 24, 2020
0.0.2	Jul 24, 2020
0.0.1	Jul 24, 2020

Source Distribution

spacy_sentence_bert-0.0.4.tar.gz (8.2 kB), uploaded Jul 24, 2020

Hashes for spacy_sentence_bert-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`5452773ec24d24bb86071af281982d2927550dacf441bf5d799ec39a465bf69c`
MD5	`d2c535e18821a759578cfe0821a1aecc`
BLAKE2b-256	`49cb89933d82a4743daeab49df2a28e76fc45f75519eb67b7d7b4c1349367424`