Production-ready LASER multilingual embeddings
LASER embeddings
Out-of-the-box multilingual sentence embeddings.
laserembeddings is a pip-packaged, production-ready port of Facebook Research's LASER (Language-Agnostic SEntence Representations) to compute multilingual sentence embeddings.
Have a look at the project's repo (master branch) for the full documentation.
Getting started
You'll need Python 3.6 or higher.
Installation
pip install laserembeddings
To install laserembeddings with extra dependencies:
# if you need Chinese support:
pip install laserembeddings[zh]
# if you need Japanese support (not available on Windows):
pip install laserembeddings[ja]
# or both:
pip install laserembeddings[zh,ja]
Downloading the pre-trained models
python -m laserembeddings download-models
This downloads the models to the default data directory, next to the source code of the package. To download the models to a specific location instead, run:
python -m laserembeddings download-models path/to/model/directory
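The same download can also be started from a Python setup script instead of a shell. The following is a minimal sketch using only the standard library; the helper names `build_download_command` and `download_models` are hypothetical (they are not part of laserembeddings), and it assumes the package is installed in the interpreter that runs the script.

```python
import subprocess
import sys
from typing import List, Optional


def build_download_command(model_dir: Optional[str] = None) -> List[str]:
    """Build the `python -m laserembeddings download-models` command line."""
    # sys.executable keeps the call inside the current interpreter / virtualenv.
    command = [sys.executable, "-m", "laserembeddings", "download-models"]
    if model_dir is not None:
        # Optional target directory, as in the command shown above.
        command.append(model_dir)
    return command


def download_models(model_dir: Optional[str] = None) -> None:
    """Run the download; raises CalledProcessError if the command fails."""
    subprocess.run(build_download_command(model_dir), check=True)
```

Calling `download_models()` would use the default data directory, while `download_models('path/to/model/directory')` would pass the target location through to the command.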
Usage
from laserembeddings import Laser

laser = Laser()

# if all sentences are in the same language:
embeddings = laser.embed_sentences(
    ['let your neural network be polyglot',
     'use multilingual embeddings!'],
    lang='en')  # lang is only used for tokenization

# embeddings is a N*1024 (N = number of sentences) NumPy array
If the sentences are not in the same language, you can pass a list of languages instead:
embeddings = laser.embed_sentences(
    ['I love pasta.',
     "J'adore les pâtes.",
     'Ich liebe Pasta.'],
    lang=['en', 'fr', 'de'])
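A common next step with sentence embeddings is to compare them. The following is a minimal sketch of cosine similarity in plain Python; `cosine_similarity` is a hypothetical helper and not part of laserembeddings, and the toy 3-dimensional vectors merely stand in for the 1024-dimensional rows returned by `embed_sentences`.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real embeddings (an assumption for illustration).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

With the array from the example above, `cosine_similarity(embeddings[0], embeddings[1])` would compare the first two sentences, since each row is a sequence of floats. For large batches, a vectorized NumPy version would likely be faster.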
If you downloaded the models into a specific directory:
from laserembeddings import Laser

path_to_bpe_codes = ...
path_to_bpe_vocab = ...
path_to_encoder = ...

laser = Laser(path_to_bpe_codes, path_to_bpe_vocab, path_to_encoder)

# you can also supply file objects instead of file paths
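If the three paths should be discovered rather than hard-coded, a small lookup by file extension is one option. The sketch below is an assumption-laden example: `find_model_files` and `MODEL_SUFFIXES` are hypothetical names, and the extensions (`.fcodes`, `.fvocab`, `.pt`) are taken from the file names used in the S3 example in this README, so the actual downloaded file names should be checked against them.

```python
from pathlib import Path
from typing import Dict

# Assumed extensions, based on the file names in this README's S3 example.
MODEL_SUFFIXES = {
    "bpe_codes": ".fcodes",
    "bpe_vocab": ".fvocab",
    "encoder": ".pt",
}


def find_model_files(model_dir: str) -> Dict[str, Path]:
    """Return one path per model file; fail if any is missing or ambiguous."""
    found = {}
    for name, suffix in MODEL_SUFFIXES.items():
        matches = sorted(Path(model_dir).glob("*" + suffix))
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected exactly one '*{suffix}' file in {model_dir}, "
                f"found {len(matches)}")
        found[name] = matches[0]
    return found
```

The result could then be passed on as `Laser(str(paths['bpe_codes']), str(paths['bpe_vocab']), str(paths['encoder']))`, where `paths = find_model_files('path/to/model/directory')`.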
If you want to pull the models from S3:
from io import BytesIO, StringIO
from laserembeddings import Laser
import boto3

s3 = boto3.resource('s3')
MODELS_BUCKET = ...

f_bpe_codes = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_codes.fcodes').get()['Body'].read().decode('utf-8'))
f_bpe_vocab = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_vocabulary.fvocab').get()['Body'].read().decode('utf-8'))
f_encoder = BytesIO(s3.Object(MODELS_BUCKET, 'path_to_encoder.pt').get()['Body'].read())

laser = Laser(f_bpe_codes, f_bpe_vocab, f_encoder)