A small example package

LASERDATO

This package provides a simple way to create embeddings with LASER from Meta AI. It takes a list of strings as input and returns a list of numpy arrays, instead of reading from and writing to files. It does not require any external tools to be installed, and it automatically downloads the required LASER models.

Usage

Simple embeddings creation

from datolaser import Laser
sentences = ["This is a sentence", "this is another sentences."]
laser = Laser()
embeddings = laser.embed_sentences(sentences=sentences)
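Once created, the embeddings can be compared with each other. The following is a minimal sketch of cosine similarity, assuming each embedding is a one-dimensional vector of floats (the package returns numpy arrays, which can be iterated the same way). The helper name cosine_similarity is illustrative and is not part of the package.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a = [float(x) for x in a]
    b = [float(x) for x in b]
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; report 0.0 rather than dividing by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two embeddings.
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

With real output, this would be called as cosine_similarity(embeddings[0], embeddings[1]).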

Alignment

from datolaser import Laser
english_sentences = ["A cat","This is a sentence", "this is another sentences."]
french_sentences = ["C'est une phrase", "Un chat","c'est une autre phrase."]
laser = Laser()
aligned_sentences = laser.align_sentences(english_sentences, french_sentences)

If remove_bad_matched is False, sentences with no match are kept as (sentence_1, "", 0); if it is True, they are removed.
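The effect of removing unmatched sentences can be sketched in plain Python. This assumes the aligned result is a list of (sentence_1, sentence_2, score) triples, as the unmatched form above suggests. The function drop_unmatched and the scores below are made up for illustration and do not come from the package.

```python
def drop_unmatched(aligned):
    """Keep only triples whose second sentence is non-empty."""
    return [(src, tgt, score) for src, tgt, score in aligned if tgt != ""]


# Hypothetical alignment output; the scores are invented.
aligned = [
    ("A cat", "Un chat", 0.9),
    ("This is a sentence", "C'est une phrase", 0.8),
    ("An extra sentence", "", 0),
]
print(drop_unmatched(aligned))
```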

Embeddings creation with multiple GPUs

from datolaser import Laser

def main():
    english_sentences = ["A cat","This is a sentence", "this is another sentences."]
    french_sentences = ["C'est une phrase", "Un chat","c'est une autre phrase."]
    laser = Laser()
    gpu_ids = [0,1,2,3]
    laser.activateMultiGpuEncoder(gpu_ids)
    english_embeddings = laser.embed_sentences(sentences=english_sentences)
    aligned_sentences = laser.align_sentences(english_sentences, french_sentences)
    laser.deactivateMultiGpuEncoder()

if __name__ == '__main__':
    main()
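If an exception is raised between activation and deactivation, the multi-GPU encoder may be left active. One way to guard against this is a small context manager, sketched below. The wrapper multi_gpu is hypothetical and not part of the package; it only relies on the two method names shown above. FakeLaser is a stand-in so that the sketch runs without GPUs or the package installed.

```python
from contextlib import contextmanager


@contextmanager
def multi_gpu(laser, gpu_ids):
    """Activate the multi-GPU encoder and always deactivate it on exit."""
    laser.activateMultiGpuEncoder(gpu_ids)
    try:
        yield laser
    finally:
        laser.deactivateMultiGpuEncoder()


class FakeLaser:
    """Stand-in that only records calls; not the real datolaser.Laser."""

    def __init__(self):
        self.calls = []

    def activateMultiGpuEncoder(self, gpu_ids):
        self.calls.append(("activate", list(gpu_ids)))

    def deactivateMultiGpuEncoder(self):
        self.calls.append(("deactivate",))


if __name__ == "__main__":
    fake = FakeLaser()
    with multi_gpu(fake, [0, 1]):
        pass  # embed_sentences / align_sentences calls would go here
    print(fake.calls)
```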

Laser 3

from datolaser import Laser
sentence = ["Is abairt é seo."]
laser = Laser(lang="gle_Latn")
embeddings = laser.embed_sentences(sentences=sentence)
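The lang value above has the shape of a three-letter language code, an underscore, and a four-letter script code. The sketch below checks that shape before a code is passed to Laser. It assumes every Laser 3 code follows the same pattern as gle_Latn, which is inferred from this single example. The function looks_like_laser3_code is hypothetical; the authoritative list is laser3_langs in lib/constants.py.

```python
import re

# Assumed shape: three lowercase letters, "_", then a capitalised script code.
_LANG_CODE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_laser3_code(lang):
    """Return True if lang has the same shape as 'gle_Latn'."""
    return isinstance(lang, str) and _LANG_CODE.fullmatch(lang) is not None


print(looks_like_laser3_code("gle_Latn"))
print(looks_like_laser3_code("irish"))
```

A code that passes this check may still be unsupported; the check only catches obvious typos.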

Issues

  • Because of an issue with faiss, this package does not support Python versions above 3.10.

  • If you encounter the following error:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

You might need to use this structure to use embed_sentences with multiple GPUs:

def main():
    # do something here

if __name__ == '__main__':
    main()
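For the Python version limit in the first item above, a script can fail early with a clear message instead of a later faiss error. This is a minimal sketch using only the standard library. The helper python_is_supported is illustrative, and it checks only the upper bound of 3.10 stated above, since no minimum version is given here.

```python
import sys

# Upper bound taken from the issue listed above.
MAX_SUPPORTED = (3, 10)


def python_is_supported(version_info=None):
    """Return True if the (major, minor) version is at most 3.10."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) <= MAX_SUPPORTED


if not python_is_supported():
    print("Warning: this Python version is newer than 3.10")
```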

Supported languages

The original LASER model was trained on the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

It has also been observed that the model seems to generalize well to other (minority) languages or dialects, e.g.

Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.

You can also use LASER on the other languages listed in laser3_langs in lib/constants.py by passing the lang parameter (see the Laser 3 section under Usage).

License

LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.
