
An easy to use interface to LASER


EasyLaser

This package provides a simple way to create embeddings with LASER from Meta AI. It takes a list of strings as input and returns a list of NumPy arrays as output, instead of reading and writing files. It does not require any external tools to be installed, and it automatically downloads the required LASER models.

Get started

Install with pip

pip install easylaser

Build from source

git clone https://gitlab.com/linguacustodia/easylaser.git
cd easylaser
pip install .

Simple embeddings creation

from easylaser import Laser
sentences = ["This is a sentence", "this is another sentences."]
with Laser() as laser:
    embeddings = laser.embed_sentences(sentences=sentences)

By default, it runs on the first GPU if one is available and falls back to the CPU otherwise.
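If you would rather choose the device yourself (for example to log it), the same fallback can be written explicitly. This is a minimal sketch, assuming PyTorch is what reports GPU availability; `pick_device` is a hypothetical helper and not part of easylaser.

```python
def pick_device():
    """Return "cuda:0" when a CUDA GPU is visible to PyTorch, else "cpu".

    Hypothetical helper: it mimics the default described above,
    it is not easylaser's own implementation.
    """
    try:
        import torch  # assumption: PyTorch is installed alongside easylaser
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

The returned string has the same form as the values accepted by the device parameter shown in the Multi GPU section.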

Embeddings

As shown above, you can use Laser with a context manager. You can also start and stop the encoder manually:

from easylaser import Laser
sentences = ["This is a sentence", "this is another sentences."]

# with context manager
with Laser() as laser:
    embeddings = laser.embed_sentences(sentences=sentences)

# without context manager: start and stop the encoder yourself
laser = Laser()
laser.is_encoder_active()  # returns False
laser.start_encoder()
embeddings = laser.embed_sentences(sentences=sentences)
laser.is_encoder_active()  # returns True
laser.stop_encoder()
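When the encoder is started manually, an exception raised between start_encoder() and stop_encoder() leaves it running. One way to guard against that is a try/finally wrapper, sketched below. `running_encoder` and `StubLaser` are hypothetical names introduced for illustration; the stub only imitates the three method names used above so that the sketch runs without downloading any model.

```python
from contextlib import contextmanager


@contextmanager
def running_encoder(laser):
    """Start the encoder, yield the object, and always stop the encoder afterwards."""
    laser.start_encoder()
    try:
        yield laser
    finally:
        # Runs even if the body of the with-block raises.
        laser.stop_encoder()


class StubLaser:
    """Hypothetical stand-in with the same method names as Laser (no real model)."""

    def __init__(self):
        self._active = False

    def start_encoder(self):
        self._active = True

    def stop_encoder(self):
        self._active = False

    def is_encoder_active(self):
        return self._active


stub = StubLaser()
with running_encoder(stub) as encoder:
    print(encoder.is_encoder_active())
print(stub.is_encoder_active())
```

In most cases the built-in `with Laser() as laser:` form is the simpler choice; the wrapper is only useful if you need to manage an already created object.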

Multi GPU

You can specify the hardware you want to run on:

from easylaser import Laser

laser = Laser(device="cpu")
laser = Laser(device="cuda")
laser = Laser(device=["cuda:0", "cuda:1"])

If you specify multiple graphics cards, inference is split across several processes, which speeds it up.

CAUTION: There is a known bug with multiple graphics cards (see Issues below). The code that uses Laser must be called, directly or through one of its callers, from inside an if __name__ == '__main__': block.

In our tests, the speed-up is sub-linear in the number of GPUs. If you have ideas to improve the speed, please contact us.

Alignment

Embeddings and Alignment

from easylaser import Laser
english_sentences = ["A cat","This is a sentence", "this is another sentences."]
french_sentences = ["C'est une phrase", "Un chat","c'est une autre phrase."]
with Laser() as laser:
    aligned_sentences = laser.align_sentences(
                    english_sentences,
                    french_sentences,
                    threshold_score = 0,
                    keep_bad_matched = False
                    )

Every sentence whose match scores below the threshold is considered bad_matched.

If keep_bad_matched is True, sentences with no match are kept as (sentence_1, None, 0); if it is False, they are removed.
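The effect of the two parameters can be illustrated on already scored pairs. This is a sketch of the behaviour described above, not easylaser's code: `filter_matches` is a hypothetical function, the scores are made up, and treating a score exactly equal to the threshold as a good match is an assumption.

```python
def filter_matches(scored_pairs, threshold_score=0, keep_bad_matched=False):
    """Apply the threshold / keep_bad_matched rules to (source, target, score) tuples."""
    result = []
    for source, target, score in scored_pairs:
        if score >= threshold_score:
            # Good match: keep the pair with its score.
            result.append((source, target, score))
        elif keep_bad_matched:
            # Bad match kept as a placeholder with no target and a zero score.
            result.append((source, None, 0))
        # Otherwise the bad match is dropped.
    return result


# Illustrative scores only, not real LASER output.
pairs = [
    ("A cat", "Un chat", 0.9),
    ("This is a sentence", "c'est une autre phrase.", 0.2),
]
print(filter_matches(pairs, threshold_score=0.5))
print(filter_matches(pairs, threshold_score=0.5, keep_bad_matched=True))
```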

Alignment only

You can use align_with_embeddings if you already have embeddings and just want to align sentences:

from easylaser import align_with_embeddings

align_with_embeddings(
    embeddings_lang0,
    embeddings_lang1,
    sentences_lang0,
    sentences_lang1,
    threshold_score=0,
    keep_bad_matched=False,
)
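To give an idea of what aligning from embeddings involves, here is a simplified sketch that pairs each sentence with its nearest neighbour by cosine similarity. It is an assumption made for illustration: the scoring actually used by align_with_embeddings is not documented here and may differ (the package depends on faiss). `cosine` and `nearest_neighbour_pairs` are hypothetical helpers, and the 2-D vectors are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbour_pairs(embeddings_lang0, embeddings_lang1,
                            sentences_lang0, sentences_lang1):
    """Pair every lang0 sentence with the lang1 sentence of highest cosine similarity."""
    pairs = []
    for sentence, embedding in zip(sentences_lang0, embeddings_lang0):
        scores = [cosine(embedding, other) for other in embeddings_lang1]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((sentence, sentences_lang1[best], scores[best]))
    return pairs


# Toy 2-D "embeddings" standing in for real LASER vectors.
english = ["A cat", "A dog"]
french = ["Un chien", "Un chat"]
emb_en = [[1.0, 0.0], [0.0, 1.0]]
emb_fr = [[0.0, 1.0], [1.0, 0.0]]
for pair in nearest_neighbour_pairs(emb_en, emb_fr, english, french):
    print(pair)
```

This greedy version lets two source sentences pick the same target, which is one reason a real aligner usually applies a more careful scoring and a threshold.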

Issues

  • Because of an issue with faiss, this package cannot be used with Python versions above 3.10.

  • If you encounter the following error:

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

You might need to use this structure to call embed_sentences with multiple GPUs:

def main():
    # do something here

if __name__ == '__main__':
    main()

Supported languages

LASER2

The LASER2 model was trained on the following languages, so you don't need to specify a lang for these languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

It has also been observed that the model seems to generalize well to other (minority) languages or dialects, e.g.

Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.

LASER3

You can also use LASER on other languages, listed in laser3_langs in lib/constants.py, by using the lang parameter.

with Laser(lang="zul_Latn") as laser:
    embeddings = laser.embed_sentences(sentences=sentences)

You might run into issues with LASER3: we have not tested it properly, as we do not need it.

Remove models

The LASER models are stored in $HOME/.cache/laser-models. To delete them, run:

rm -r $HOME/.cache/laser-models
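On systems without rm, the same cleanup can be done from Python. This is a small sketch using only the standard library; `remove_laser_models` is a hypothetical helper, and the default path simply mirrors the one in the command above.

```python
import shutil
from pathlib import Path


def remove_laser_models(cache_dir=None):
    """Delete the model cache directory; return True if it existed beforehand."""
    if cache_dir is None:
        # Same location as in the rm command above.
        target = Path.home() / ".cache" / "laser-models"
    else:
        target = Path(cache_dir)
    existed = target.exists()
    # ignore_errors=True makes this a no-op when the directory is missing.
    shutil.rmtree(target, ignore_errors=True)
    return existed
```

Calling remove_laser_models() with no argument targets the default cache; the models are downloaded again the next time they are needed.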

License

LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.
