Convert LaBSE model from TensorFlow to PyTorch.

Project description


language:

  • af
  • am
  • ar
  • as
  • az
  • be
  • bg
  • bn
  • bo
  • bs
  • ca
  • ceb
  • co
  • cs
  • cy
  • da
  • de
  • el
  • en
  • eo
  • es
  • et
  • eu
  • fa
  • fi
  • fr
  • fy
  • ga
  • gd
  • gl
  • gu
  • ha
  • haw
  • he
  • hi
  • hmn
  • hr
  • ht
  • hu
  • hy
  • id
  • ig
  • is
  • it
  • ja
  • jv
  • ka
  • kk
  • km
  • kn
  • ko
  • ku
  • ky
  • la
  • lb
  • lo
  • lt
  • lv
  • mg
  • mi
  • mk
  • ml
  • mn
  • mr
  • ms
  • mt
  • my
  • ne
  • nl
  • no
  • ny
  • or
  • pa
  • pl
  • pt
  • ro
  • ru
  • rw
  • si
  • sk
  • sl
  • sm
  • sn
  • so
  • sq
  • sr
  • st
  • su
  • sv
  • sw
  • ta
  • te
  • tg
  • th
  • tk
  • tl
  • tr
  • tt
  • ug
  • uk
  • ur
  • uz
  • vi
  • wo
  • xh
  • yi
  • yo
  • zh
  • zu

tags:

  • bert
  • sentence_embedding
  • multilingual
  • google

license: Apache-2.0

datasets:

  • CommonCrawl
  • Wikipedia

LaBSE

Project

This project converts the LaBSE model from TensorFlow to PyTorch.

Model description

Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
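The bi-text retrieval use case can be pictured as a nearest-neighbor lookup over a similarity matrix: each source sentence is matched to the candidate translation with the highest score. The sketch below is illustrative only. The helper name `retrieve_translations` and the score values are made up for this example and are not part of the project or of LaBSE's evaluation; in practice the scores would come from model embeddings.

```python
def retrieve_translations(similarity_matrix):
    """For each source sentence (row), return the index of the
    highest-scoring candidate translation (column)."""
    matches = []
    for row in similarity_matrix:
        best_index = max(range(len(row)), key=lambda j: row[j])
        matches.append(best_index)
    return matches


# Hypothetical scores: 3 source sentences against 3 candidate translations.
toy_scores = [
    [0.91, 0.12, 0.30],
    [0.20, 0.85, 0.25],
    [0.15, 0.33, 0.78],
]
print(retrieve_translations(toy_scores))  # [0, 1, 2]
```

Taking the per-row maximum is the simplest possible matching rule; it does not prevent two source sentences from selecting the same candidate.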

Usage

Using the model:

import torch
from transformers import BertModel, BertTokenizerFast


tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
model = BertModel.from_pretrained("setu4993/LaBSE")
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    english_outputs = model(**english_inputs)

To get the sentence embeddings, use the pooler output:

english_embeddings = english_outputs.pooler_output

The same steps produce embeddings for other languages:

italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    japanese_outputs = model(**japanese_inputs)

italian_embeddings = italian_outputs.pooler_output
japanese_embeddings = japanese_outputs.pooler_output

To compare sentences, L2-normalize the embeddings before calculating the similarity, so that the dot product equals the cosine similarity:

import torch.nn.functional as F


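def similarity(embeddings_1, embeddings_2):
    # L2-normalize each row (one embedding per row) so the dot product
    # below is the cosine similarity.
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2, dim=1)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2, dim=1)
    # Entry [i, j] compares row i of embeddings_1 with row j of embeddings_2.
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )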


print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))
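To make the normalization step concrete without downloading the model, here is a dependency-free sketch of the same computation on small made-up vectors. The function names `l2_normalize` and `cosine_similarity_matrix` and the toy values are assumptions for illustration. Unlike `F.normalize`, this sketch has no epsilon guard and assumes every vector is non-zero.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity_matrix(rows_1, rows_2):
    """Entry [i][j] is the dot product of the unit-length versions of
    rows_1[i] and rows_2[j], i.e. their cosine similarity."""
    normalized_1 = [l2_normalize(v) for v in rows_1]
    normalized_2 = [l2_normalize(v) for v in rows_2]
    return [
        [sum(a * b for a, b in zip(u, v)) for v in normalized_2]
        for u in normalized_1
    ]


# Toy 2-dimensional "embeddings"; real LaBSE embeddings are much larger.
toy_1 = [[3.0, 4.0], [1.0, 0.0]]
toy_2 = [[6.0, 8.0], [0.0, 2.0]]
scores = cosine_similarity_matrix(toy_1, toy_2)
print(scores)
```

Here `[3.0, 4.0]` and `[6.0, 8.0]` point in the same direction, so their score is approximately 1 even though their lengths differ, while `[1.0, 0.0]` and `[0.0, 2.0]` are orthogonal and score 0. Without the normalization step, the raw dot products would also reflect vector length.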

Details

Details about data, training, evaluation and performance metrics are available in the original paper.

BibTeX entry and citation info

@misc{feng2020languageagnostic,
      title={Language-agnostic BERT Sentence Embedding},
      author={Fangxiaoyu Feng and Yinfei Yang and Daniel Cer and Naveen Arivazhagan and Wei Wang},
      year={2020},
      eprint={2007.01852},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

This repository and the conversion code are licensed under the MIT license, but the model is distributed under an Apache-2.0 license.
