Production-ready LASER multilingual embeddings

LASER embeddings

laserembeddings is a pip-packaged, production-ready port of Facebook Research's LASER (Language-Agnostic SEntence Representations) to compute multilingual sentence embeddings.

🎁 Version 0.1.2 is out. What's new?

  • Korean is now fully supported ✅
  • A temporary workaround was added for a bug in Python 3.7 (<= 3.7.4) and 3.8 (<= 3.8.0 beta 3) that degrades tokenizer performance, pending a fix in upcoming Python releases 🐛

Context

LASER is a collection of scripts and models created by Facebook Research to compute multilingual sentence embeddings for zero-shot cross-lingual transfer.

What does it mean? LASER is able to transform sentences into language-independent vectors. Similar sentences get mapped to close vectors (in terms of cosine distance), regardless of the input language.
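To make "close vectors" concrete, here is a minimal sketch of cosine similarity (cosine distance is 1 minus this value). The three short vectors are made-up stand-ins for real 1024-dimensional LASER embeddings, and `cosine_similarity` is an illustrative helper, not part of laserembeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for real embeddings (values are invented).
hello_en = [0.9, 0.1, 0.0]    # pretend embedding of an English greeting
hello_fr = [0.8, 0.2, 0.1]    # pretend embedding of the same greeting in French
weather_en = [0.0, 0.2, 0.9]  # pretend embedding of an unrelated English sentence

same_meaning = cosine_similarity(hello_en, hello_fr)
different_meaning = cosine_similarity(hello_en, weather_en)

# Sentences with the same meaning should score higher than unrelated ones.
print(same_meaning > different_meaning)
```

The helper accepts any sequence of floats, so it should also work on individual rows of the array returned by embed_sentences.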

That is great, especially if you don't have training sets for the language(s) you want to process: you can build a classifier on top of LASER embeddings, train it on whatever language(s) you have in your training data, and let it classify texts in any language.
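As a rough illustration of that idea, the sketch below trains a tiny nearest-centroid classifier on vectors from one language and applies it to a vector from another. Everything here is an assumption made for the example: the 2-dimensional vectors are invented stand-ins for LASER embeddings, and `train` / `predict` are hypothetical helpers. In practice you would probably use a proper classifier from your favorite ML library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(vectors, labels):
    """Return one centroid per label, computed from the training vectors."""
    by_label = {}
    for vector, label in zip(vectors, labels):
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(group) for label, group in by_label.items()}


def predict(centroids, vector):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine_similarity(centroids[label], vector))


# Invented stand-ins for embeddings of labeled English training sentences.
train_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["greeting", "greeting", "weather", "weather"]
model = train(train_vectors, train_labels)

# Invented stand-in for the embedding of a sentence in another language.
# Because embeddings are language-independent, no retraining is needed.
other_language_vector = [0.85, 0.15]
predicted = predict(model, other_language_vector)
print(predicted)
```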

The aim of the package is to make LASER as easy to use and easy to deploy as possible: zero-config, production-ready, and installable with a two-liner.

👉 👉 👉 For detailed information, have a look at the amazing LASER repository, read its presentation article and its research paper. 👈 👈 👈

Getting started

You'll need Python 3.6 or higher.

Installation

pip install laserembeddings

Downloading the pre-trained models

python -m laserembeddings download-models

This will download the models to the default data directory next to the source code of the package. Use python -m laserembeddings download-models path/to/model/directory to download the models to a specific location.

Usage

from laserembeddings import Laser

laser = Laser()

embeddings = laser.embed_sentences(
    ['let your neural network be polyglot',
     'use multilingual embeddings!'],
    lang='en')  # lang is used for tokenization

# embeddings is an N*1024 (N = number of sentences) NumPy array
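In the usage above, lang is a single code applied to the whole batch. If your input mixes languages, one option is to group the sentences by language and call embed_sentences once per group. The sketch below does that and restores the original order. `embed_mixed_languages` is a hypothetical helper, not part of laserembeddings; it only assumes the embed_sentences(sentences, lang=...) call shown above.

```python
from collections import defaultdict


def embed_mixed_languages(laser, sentences, langs):
    """Embed sentences of different languages, one embed_sentences call per language.

    Returns a list of vectors in the same order as `sentences`.
    """
    if len(sentences) != len(langs):
        raise ValueError("sentences and langs must have the same length")

    # Remember where each sentence sits so the output order matches the input.
    positions_by_lang = defaultdict(list)
    for position, lang in enumerate(langs):
        positions_by_lang[lang].append(position)

    result = [None] * len(sentences)
    for lang, positions in positions_by_lang.items():
        batch = [sentences[position] for position in positions]
        vectors = laser.embed_sentences(batch, lang=lang)
        for position, vector in zip(positions, vectors):
            result[position] = vector
    return result
```

With the models downloaded, a call could look like embed_mixed_languages(Laser(), ['hello', 'bonjour'], ['en', 'fr']).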

If you downloaded the models into a specific directory:

from laserembeddings import Laser

path_to_bpe_codes = ...
path_to_bpe_vocab = ...
path_to_encoder = ...

laser = Laser(path_to_bpe_codes, path_to_bpe_vocab, path_to_encoder)

# you can also supply file objects instead of file paths

If you want to pull the models from S3:

from io import BytesIO, StringIO
from laserembeddings import Laser
import boto3

s3 = boto3.resource('s3')
MODELS_BUCKET = ...

f_bpe_codes = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_codes.fcodes').get()['Body'].read().decode('utf-8'))
f_bpe_vocab = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_vocabulary.fvocab').get()['Body'].read().decode('utf-8'))
f_encoder = BytesIO(s3.Object(MODELS_BUCKET, 'path_to_encoder.pt').get()['Body'].read())

laser = Laser(f_bpe_codes, f_bpe_vocab, f_encoder)

What are the differences from the original implementation?

Some dependencies of the original project have been replaced with pure-Python dependencies, to make this package easy to install and deploy.

Here's a summary of the differences:

Part of the pipeline | LASER dependency (original project) | laserembeddings dependency (this package) | Reason
Normalization / tokenization | Moses | Sacremoses | Moses is implemented in Perl
BPE encoding | fastBPE | subword-nmt | fastBPE cannot be installed via pip and requires compiling C++ code

The following features have not been implemented yet:

  • romanize, needed to process Greek (el)
  • Chinese text segmentation, needed to process Chinese (zh, cmn, wuu and yue)
  • Japanese text segmentation, needed to process Japanese (ja, jpn)
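If you want to catch these cases early in your own code, a small guard like the following may help. It is only a sketch: the language codes are copied from the list above, and `needs_missing_preprocessing` is a hypothetical helper, not something the package provides. It says nothing about languages outside this list.

```python
# Language codes from the list above whose preprocessing step
# (romanization or text segmentation) is not implemented yet.
LANGS_WITH_MISSING_PREPROCESSING = {"el", "zh", "cmn", "wuu", "yue", "ja", "jpn"}


def needs_missing_preprocessing(lang):
    """Return True if `lang` relies on a preprocessing step that is not implemented yet."""
    return lang.lower() in LANGS_WITH_MISSING_PREPROCESSING


for code in ["en", "ja", "el"]:
    if needs_missing_preprocessing(code):
        print(code + ": preprocessing not implemented yet, embeddings may differ from LASER")
    else:
        print(code + ": not in the list of missing features")
```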

Will I get the exact same embeddings?

For most languages, in most of the cases, yes.

Some slight (and not so slight 🙄) differences exist for some languages due to differences in the implementation of the tokenizer.

An exhaustive comparison of the embeddings generated with LASER and laserembeddings is automatically generated and will be updated for each new release.

FAQ

How can I train the encoder?

You can't. LASER models are pre-trained and do not need to be fine-tuned. The embeddings are generic and perform well without fine-tuning. See https://github.com/facebookresearch/LASER/issues/3#issuecomment-404175463.

Credits

Thanks a lot to the creators of LASER for open-sourcing the code of LASER and releasing the pre-trained models. All the kudos should go to them 👏.

A big thanks to the creators of Sacremoses and Subword Neural Machine Translation for their great packages.

Testing

First you'll need to check out this repository and install it (in a virtual environment if you want). Also make sure you have Poetry installed.

poetry install

Then, to run the tests:

poetry run pytest

Testing the similarity between the embeddings computed with LASER and laserembeddings

First, download the test data.

python -m laserembeddings download-test-data

👉 If you want to know more about the contents and the generation of the test data, check out the laserembeddings-test-data repository.

Then, run the test with the SIMILARITY_TEST environment variable set to 1.

SIMILARITY_TEST=1 poetry run pytest tests/test_laser.py

Now, have a coffee ☕️ and wait for the test to finish.

The similarity report will be generated here: tests/report/comparison-with-LASER.md.
