Production-ready LASER multilingual embeddings
LASER embeddings
Out-of-the-box multilingual sentence embeddings.
laserembeddings is a pip-packaged, production-ready port of Facebook Research's LASER (Language-Agnostic SEntence Representations) to compute multilingual sentence embeddings.
Have a look at the project's repo (master branch) for the full documentation.
Getting started
You'll need Python 3.6 or higher.
Installation
pip install laserembeddings
To install laserembeddings with extra dependencies:
# if you need Chinese support:
pip install laserembeddings[zh]
# if you need Japanese support (not available on Windows):
pip install laserembeddings[ja]
# or both:
pip install laserembeddings[zh,ja]
Downloading the pre-trained models
python -m laserembeddings download-models
This will download the models to the default data directory next to the source code of the package.
To download the models to a specific location instead, pass the target directory as an argument:
python -m laserembeddings download-models path/to/model/directory
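For deployment scripts it can be convenient to trigger the same download from Python. The sketch below only assembles and runs the command shown above; `build_download_command` and `download_models` are hypothetical helper names and are not part of laserembeddings.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional, Union


def build_download_command(
        target_dir: Optional[Union[str, Path]] = None) -> List[str]:
    """Assemble the `python -m laserembeddings download-models` command."""
    # sys.executable keeps the call inside the current interpreter/virtualenv.
    command = [sys.executable, "-m", "laserembeddings", "download-models"]
    if target_dir is not None:
        # Optional positional argument: the directory to download into.
        command.append(str(target_dir))
    return command


def download_models(target_dir: Optional[Union[str, Path]] = None) -> None:
    """Run the download; raises CalledProcessError if the command fails."""
    subprocess.run(build_download_command(target_dir), check=True)
```

Calling `download_models()` mirrors the default invocation, and `download_models('path/to/model/directory')` mirrors the second one.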
Usage
from laserembeddings import Laser
laser = Laser()
# if all sentences are in the same language:
embeddings = laser.embed_sentences(
    ['let your neural network be polyglot',
     'use multilingual embeddings!'],
    lang='en')  # lang is only used for tokenization
# embeddings is an N*1024 (N = number of sentences) NumPy array
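A common way to compare sentence embeddings of this shape is cosine similarity. The following is a minimal sketch, not part of laserembeddings: `cosine_similarity_matrix` is a hypothetical helper, and the random array is only a placeholder standing in for the output of `laser.embed_sentences(...)`.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the N x N matrix of pairwise cosine similarities."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    normalized = embeddings / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T


# Placeholder data with the documented shape (N sentences x 1024 dimensions).
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2, 1024))

similarities = cosine_similarity_matrix(embeddings)
print(similarities.shape)  # (2, 2)
```

Entry `[i, j]` of the result compares sentence `i` with sentence `j`; the diagonal is 1 for any non-zero embedding.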
If the sentences are not in the same language, you can pass a list of language codes, one per sentence:
embeddings = laser.embed_sentences(
    ['I love pasta.',
     "J'adore les pâtes.",
     'Ich liebe Pasta.'],
    lang=['en', 'fr', 'de'])
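In practice, mixed-language input often arrives as (sentence, language) pairs rather than as two separate lists. The sketch below shows one way to build the two aligned lists that the call above expects; `split_sentences_and_langs` is a hypothetical helper, not a laserembeddings function.

```python
from typing import Iterable, List, Tuple


def split_sentences_and_langs(
        records: Iterable[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Turn (sentence, language code) pairs into two aligned lists."""
    sentences: List[str] = []
    langs: List[str] = []
    for sentence, lang in records:
        sentences.append(sentence)
        langs.append(lang)
    return sentences, langs


records = [
    ('I love pasta.', 'en'),
    ("J'adore les pâtes.", 'fr'),
    ('Ich liebe Pasta.', 'de'),
]
sentences, langs = split_sentences_and_langs(records)

# The two lists can then be passed on as:
# embeddings = laser.embed_sentences(sentences, lang=langs)
```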
If you downloaded the models into a specific directory:
from laserembeddings import Laser
path_to_bpe_codes = ...
path_to_bpe_vocab = ...
path_to_encoder = ...
laser = Laser(path_to_bpe_codes, path_to_bpe_vocab, path_to_encoder)
# you can also supply file objects instead of file paths
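The three paths above are left for you to fill in. One possible way to resolve them is to look the files up by extension, as sketched below. `find_model_files` is a hypothetical helper, and it assumes that the directory contains exactly one `.fcodes`, one `.fvocab` and one `.pt` file (the extensions that appear in the S3 example of this README).

```python
from pathlib import Path
from typing import Tuple


def find_model_files(model_dir: str) -> Tuple[Path, Path, Path]:
    """Locate the BPE codes, BPE vocabulary and encoder files in model_dir."""
    directory = Path(model_dir)

    def single_match(pattern: str) -> Path:
        # Fail loudly instead of silently picking one of several candidates.
        matches = sorted(directory.glob(pattern))
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected exactly one {pattern} file in {directory}, "
                f"found {len(matches)}")
        return matches[0]

    return (single_match("*.fcodes"),
            single_match("*.fvocab"),
            single_match("*.pt"))


# Usage (requires laserembeddings and downloaded models):
# codes, vocab, encoder = find_model_files('path/to/model/directory')
# laser = Laser(str(codes), str(vocab), str(encoder))
```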
If you want to pull the models from S3:
from io import BytesIO, StringIO
from laserembeddings import Laser
import boto3
s3 = boto3.resource('s3')
MODELS_BUCKET = ...
f_bpe_codes = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_codes.fcodes').get()['Body'].read().decode('utf-8'))
f_bpe_vocab = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_vocabulary.fvocab').get()['Body'].read().decode('utf-8'))
f_encoder = BytesIO(s3.Object(MODELS_BUCKET, 'path_to_encoder.pt').get()['Body'].read())
laser = Laser(f_bpe_codes, f_bpe_vocab, f_encoder)
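The wrapping step in the S3 example does not depend on S3 itself: any source that yields raw bytes can be handled the same way. The sketch below isolates that step. `to_model_file_objects` is a hypothetical helper, and the byte strings in the usage lines are dummy placeholders, not real model data.

```python
from io import BytesIO, StringIO
from typing import Tuple


def to_model_file_objects(
        bpe_codes_bytes: bytes,
        bpe_vocab_bytes: bytes,
        encoder_bytes: bytes) -> Tuple[StringIO, StringIO, BytesIO]:
    """Wrap raw downloaded bytes in file objects, as in the S3 example."""
    # The BPE codes and vocabulary are decoded as UTF-8 text.
    f_bpe_codes = StringIO(bpe_codes_bytes.decode('utf-8'))
    f_bpe_vocab = StringIO(bpe_vocab_bytes.decode('utf-8'))
    # The encoder is kept as a binary stream.
    f_encoder = BytesIO(encoder_bytes)
    return f_bpe_codes, f_bpe_vocab, f_encoder


# Dummy placeholder bytes, for illustration only.
f_bpe_codes, f_bpe_vocab, f_encoder = to_model_file_objects(
    b'codes', b'vocab', b'\x00\x01')

# With real model bytes, the objects can then be passed on as:
# laser = Laser(f_bpe_codes, f_bpe_vocab, f_encoder)
```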
Hashes for laserembeddings-1.0.1a1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 1364c7b5927617c7b9728023187f778de55719decbf2bf7ea86b57f1699e439d
MD5 | cafc2fc98fa3aaf5a8400ec7939ba11f
BLAKE2b-256 | a7cebea110a875b7b96d3627d6f5b88af4e40a5fd23b51f1068b9899e8f7033b