Sentence-BERT for spaCy
This package wraps sentence-transformers (also known as sentence-BERT) directly in spaCy. You can substitute the vectors provided in any spaCy model with vectors that have been tuned specifically for semantic similarity.
The models below are suggested for analysing sentence similarity, as their STS benchmark scores indicate. Keep in mind that sentence-transformers models are configured with a maximum sequence length of 128 tokens, so for longer texts other models (e.g. the Universal Sentence Encoder) may be more suitable.
Install
To install this package, you can run one of the following:
pip install spacy_sentence_bert
pip install git+https://github.com/MartinoMensio/spacy-sentence-bert.git
Usage
With this package installed, you can load any of the models listed below by name:
import spacy_sentence_bert
nlp = spacy_sentence_bert.load_model('en_bert_base_nli_cls_token')
Or, if a specific standalone model is installed (e.g. pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/en_bert_base_nli_cls_token-0.1.0/en_bert_base_nli_cls_token-0.1.0.tar.gz), you can load it directly with spacy.load:
import spacy
nlp = spacy.load('en_bert_base_nli_cls_token')
Alternatively, you can take an existing spaCy pipeline and replace its vectors with the sentence-BERT ones:
import spacy
import spacy_sentence_bert
nlp_base = spacy.load('en')
nlp = spacy_sentence_bert.create_from(nlp_base, 'en_bert_base_nli_cls_token')
nlp.pipe_names
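With any of the loaded pipelines, doc.vector holds the sentence-BERT embedding, and doc1.similarity(doc2) reduces to the cosine similarity of those vectors. A minimal self-contained sketch of that computation (the 3-dimensional vectors are illustrative stand-ins for the real 768- or 1024-dimensional embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (what Doc.similarity computes)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative stand-ins for sentence embeddings
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 0.0]
print(round(cosine_similarity(v1, v2), 4))  # → 0.7071
```

Scores near 1.0 indicate semantically similar sentences; the STS benchmark column in the table below reflects how well each model's embeddings track human similarity judgements.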
The full list of models is available at https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0
sentence-BERT name | spaCy model name | dimensions | language | STS benchmark
---|---|---|---|---
bert-base-nli-mean-tokens | en_bert_base_nli_mean_tokens | 768 | en | 77.12
bert-base-nli-max-tokens | en_bert_base_nli_max_tokens | 768 | en | 77.21
bert-base-nli-cls-token | en_bert_base_nli_cls_token | 768 | en | 76.30
bert-large-nli-mean-tokens | en_bert_large_nli_mean_tokens | 1024 | en | 79.19
bert-large-nli-max-tokens | en_bert_large_nli_max_tokens | 1024 | en | 78.41
bert-large-nli-cls-token | en_bert_large_nli_cls_token | 1024 | en | 78.29
roberta-base-nli-mean-tokens | en_roberta_base_nli_mean_tokens | 768 | en | 77.49
roberta-large-nli-mean-tokens | en_roberta_large_nli_mean_tokens | 1024 | en | 78.69
distilbert-base-nli-mean-tokens | en_distilbert_base_nli_mean_tokens | 768 | en | 76.97
bert-base-nli-stsb-mean-tokens | en_bert_base_nli_stsb_mean_tokens | 768 | en | 85.14
bert-large-nli-stsb-mean-tokens | en_bert_large_nli_stsb_mean_tokens | 1024 | en | 85.29
roberta-base-nli-stsb-mean-tokens | en_roberta_base_nli_stsb_mean_tokens | 768 | en | 85.40
roberta-large-nli-stsb-mean-tokens | en_roberta_large_nli_stsb_mean_tokens | 1024 | en | 86.31
distilbert-base-nli-stsb-mean-tokens | en_distilbert_base_nli_stsb_mean_tokens | 768 | en | 84.38
distiluse-base-multilingual-cased | xx_distiluse_base_multilingual_cased | 512 | Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish | 80.10
xlm-r-base-en-ko-nli-ststb | xx_xlm_r_base_en_ko_nli_ststb | 768 | en, ko | 81.47
xlm-r-large-en-ko-nli-ststb | xx_xlm_r_large_en_ko_nli_ststb | 1024 | en, ko | 84.05
When first used, each model is downloaded to the folder defined by the TORCH_HOME environment variable (default ~/.cache/torch).
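If you would rather keep the downloaded models somewhere else (for example on a larger volume), you can point TORCH_HOME at another directory before loading a model. The path below is only an example:

```shell
# Example: store downloaded models in a custom cache directory (path is illustrative)
export TORCH_HOME=/data/torch-cache
echo "$TORCH_HOME"
```

Set the variable in the environment of whatever process loads the model, since the download location is resolved at load time.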