LASERDATO
This package makes it simple to create embeddings with LASER from Meta AI. It takes a list of strings as input and returns a list of numpy arrays as output, instead of reading and writing files. It does not require any external tools to be installed, and it automatically downloads the required LASER models.
Usage
Simple embeddings creation
from datolaser import Laser
sentences = ["This is a sentence", "this is another sentence."]
laser = Laser()
embeddings = laser.embed_sentences(sentences=sentences)
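Since the embeddings come back as a list of numpy arrays, a common next step is to compare two of them. The helper below is a minimal sketch, not part of this package: `cosine_similarity` is a hypothetical name, and it assumes each embedding is a one-dimensional vector of floats. It uses only the standard library and accepts any sequence of floats, so plain lists work as well as 1-D arrays.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two sentence embeddings (assumed 1-D).
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

With real output, the call would look like `cosine_similarity(embeddings[0], embeddings[1])`.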
Alignment
from datolaser import Laser
english_sentences = ["A cat", "This is a sentence", "this is another sentence."]
french_sentences = ["C'est une phrase", "Un chat","c'est une autre phrase."]
laser = Laser()
aligned_sentences = laser.align_sentences(english_sentences, french_sentences)
If remove_bad_matched is False, sentences with no match are kept as (sentence_1, "", 0); if it is True, they are removed.
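The effect of the two settings can be illustrated by filtering an unfiltered result by hand. This is only a sketch under assumptions: it supposes that the alignment result is a list of (sentence_1, sentence_2, score) tuples, as the (sentence_1, "", 0) form above suggests, and `drop_unmatched` is a hypothetical helper, not a function of this package. The scores in the sample are made up.

```python
from typing import List, Tuple

# Assumed shape of one aligned item: (sentence_1, sentence_2, score).
AlignedPair = Tuple[str, str, float]


def drop_unmatched(aligned: List[AlignedPair]) -> List[AlignedPair]:
    """Keep only the pairs whose second sentence is non-empty."""
    return [pair for pair in aligned if pair[1] != ""]


# Hand-written sample in the assumed format; scores are illustrative only.
sample = [
    ("A cat", "Un chat", 0.9),
    ("This is a sentence", "", 0),
]
print(drop_unmatched(sample))
```

Keeping the unmatched entries can be useful when you need to know which source sentences found no counterpart.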
Embeddings creation with multiple GPUs
from datolaser import Laser
def main():
    english_sentences = ["A cat", "This is a sentence", "this is another sentence."]
    french_sentences = ["C'est une phrase", "Un chat", "c'est une autre phrase."]
    laser = Laser()
    gpu_ids = [0, 1, 2, 3]
    # Start the multi-GPU encoder before embedding or aligning.
    laser.activateMultiGpuEncoder(gpu_ids)
    english_embeddings = laser.embed_sentences(sentences=english_sentences)
    aligned_sentences = laser.align_sentences(english_sentences, french_sentences)
    # Release the multi-GPU encoder once the work is done.
    laser.deactivateMultiGpuEncoder()

if __name__ == '__main__':
    main()
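If an exception is raised between the activate and deactivate calls, the multi-GPU encoder stays active. One way to guard against that is a small context manager. This is a sketch, not something the package provides: `multi_gpu_encoder` is a hypothetical name, and it assumes only the two method names used in the example above.

```python
from contextlib import contextmanager


@contextmanager
def multi_gpu_encoder(laser, gpu_ids):
    """Activate the multi-GPU encoder and always deactivate it afterwards."""
    laser.activateMultiGpuEncoder(gpu_ids)
    try:
        yield laser
    finally:
        # Runs even if the body of the with-block raises.
        laser.deactivateMultiGpuEncoder()


# Intended use (requires datolaser and GPUs, so it is left as a comment):
# with multi_gpu_encoder(Laser(), [0, 1]) as laser:
#     embeddings = laser.embed_sentences(sentences=sentences)
```

The with-block still has to run inside `main()` under the `if __name__ == '__main__':` guard.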
Laser 3
from datolaser import Laser
sentence = ["Is abairt é seo."]
laser = Laser(lang="gle_Latn")
embeddings = laser.embed_sentences(sentences=sentence)
Issues
- Because of an issue with faiss, this package does not support Python versions above 3.10.
- If you encounter the following error:
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
You might need to use this structure to use embed_sentences with multiple GPUs:
def main():
    # do something here
    ...

if __name__ == '__main__':
    main()
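For the first issue above, the Python version limit can be checked at start-up so that the failure is explicit. This is a minimal sketch: `python_is_supported` and `MAX_SUPPORTED` are hypothetical names, and only the upper bound of 3.10 comes from this README; no minimum version is assumed.

```python
import sys

# Highest (major, minor) version this README states as usable.
MAX_SUPPORTED = (3, 10)


def python_is_supported(version_info=None) -> bool:
    """Return True if the given (or running) Python is at most 3.10."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) <= MAX_SUPPORTED


if not python_is_supported():
    print("Warning: this Python version is newer than 3.10; faiss may not work.")
```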
Supported languages
The original LASER model was trained on the following languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.
It has also been observed that the model seems to generalize well to other (minority) languages or dialects, e.g.
Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.
You can also use LASER on the other languages listed in laser3_langs in lib/constants.py by setting the lang parameter (see the Laser 3 example under Usage).
License
LASER is BSD-licensed, as found in the LICENSE
file in the root directory of this source tree.