
Handy library to work with pre-trained ELMo embeddings in TensorFlow


Simple ELMo

simple_elmo is a Python library to work with pre-trained ELMo embeddings in TensorFlow.

This is a significantly updated wrapper around the original ELMo implementation. The main changes are:

  • more convenient and transparent data loading (including from compressed files)
  • code adapted to modern TensorFlow versions (including TensorFlow 2).

Usage

pip install simple_elmo

from simple_elmo import ElmoModel

model = ElmoModel()

model.load(PATH_TO_ELMO)

elmo_vectors = model.get_elmo_vectors(SENTENCES)

averaged_vectors = model.get_elmo_vector_average(SENTENCES)

PATH_TO_ELMO is either a ZIP archive downloaded from the NLPL vector repository or a directory containing the three files extracted from such an archive:

  • model.hdf5, pre-trained ELMo weights in HDF5 format;
  • options.json, description of the model architecture in JSON;
  • vocab.txt/vocab.txt.gz, one-word-per-line vocabulary of the most frequent words you would like to cache during inference (optional, since the model infers embeddings for out-of-vocabulary (OOV) words from their characters).
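To illustrate the one-word-per-line vocabulary format, here is a minimal sketch that reads such a file whether or not it is gzipped. The `read_vocab` helper is hypothetical and is not part of simple_elmo; the library does its own loading when given PATH_TO_ELMO.

```python
import gzip


def read_vocab(path):
    """Read a one-word-per-line vocabulary from a plain or gzipped text file.

    Hypothetical helper for illustration only; not part of simple_elmo.
    """
    # Choose the opener from the file extension (vocab.txt vs vocab.txt.gz).
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        # Skip blank lines and strip trailing newlines.
        return [line.strip() for line in handle if line.strip()]
```

Calling `read_vocab` on either vocab.txt or vocab.txt.gz should give the same list of words, which can be handy for checking a vocabulary before pointing the model at the directory.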

SENTENCES is a list of sentences (lists of words).
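As a minimal sketch of building SENTENCES in this format, the snippet below splits raw strings on whitespace. The `tokenize` helper is hypothetical and not part of simple_elmo; in practice you would likely want a tokenizer that matches the one used to prepare the model's training data.

```python
def tokenize(raw_sentences):
    """Turn raw strings into lists of words by splitting on whitespace.

    Hypothetical helper for illustration only; not part of simple_elmo.
    Empty or whitespace-only strings are dropped.
    """
    return [sentence.split() for sentence in raw_sentences if sentence.strip()]


raw = [
    "ELMo produces contextualized embeddings .",
    "Words are split on whitespace .",
]
SENTENCES = tokenize(raw)  # a list of sentences, each a list of words
```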

Use the elmo_vectors and averaged_vectors tensors for your downstream tasks.

elmo_vectors contains contextualized word embeddings. Its shape is: (number of sentences, the length of the longest sentence, ELMo dimensionality).
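Because the second dimension is the length of the longest sentence, shorter sentences presumably carry padding rows at the end. The sketch below computes the shape described above and trims each sentence back to its real length. Both helpers are hypothetical, and plain nested lists stand in for the returned array; the same slicing should also work on a NumPy array.

```python
def expected_shape(sentences, dim):
    """Shape described above: (number of sentences, longest sentence, dim)."""
    longest = max((len(s) for s in sentences), default=0)
    return (len(sentences), longest, dim)


def strip_padding(vectors, sentences):
    """Keep only the rows that correspond to real words in each sentence."""
    return [sent_vecs[:len(sent)] for sent_vecs, sent in zip(vectors, sentences)]


# Toy data with a made-up dimensionality of 2 (real ELMo models are much wider).
sentences = [["a", "b", "c"], ["d"]]
vectors = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [0.0, 0.0], [0.0, 0.0]],  # last two rows are padding
]
trimmed = strip_padding(vectors, sentences)
```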

averaged_vectors contains one vector per input sentence, constructed by averaging that sentence's contextualized word embeddings. It is a list of vectors, each of shape (ELMo dimensionality).
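The averaging itself can be sketched in plain Python as a component-wise arithmetic mean. This is only an illustration of the idea, assuming a simple unweighted mean; `average_vectors` is a hypothetical helper, not the library's implementation.

```python
def average_vectors(word_vectors):
    """Component-wise mean of a non-empty list of equal-length vectors.

    Hypothetical helper for illustration only; not part of simple_elmo.
    """
    n = len(word_vectors)
    # zip(*...) groups the i-th component of every word vector together.
    return [sum(component) / n for component in zip(*word_vectors)]


# Two toy word vectors of dimensionality 2 yield one sentence vector.
sentence_vector = average_vectors([[1.0, 2.0], [3.0, 4.0]])
```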

Example scripts

We provide two example scripts to make it easier to start using simple_elmo right away:

python3 get_elmo_vectors.py -i test.txt -e ~/PATH_TO_ELMO/

python3 text_classification.py -i paraphrases_lemm.tsv.gz -e ~/PATH_TO_ELMO/

The second script can be used to perform document pair classification (as in textual entailment or paraphrase detection).

Each document is represented by the simple average of the ELMo embeddings of all its words; the cosine similarity between the two document vectors is then calculated and used as a classifier feature.
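The similarity feature can be sketched as follows. This is not the code from text_classification.py; `cosine_similarity` is a hypothetical helper, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Hypothetical helper for illustration only. Returns 0.0 if either
    vector has zero length (an assumption, to avoid dividing by zero).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# The two arguments would be the averaged vectors of the two documents.
feature = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The resulting scalar would then be passed to a classifier as the single feature for the document pair.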

Example paraphrase datasets for Russian (adapted from http://paraphraser.ru/):

The library requires Python >= 3.6.

Download files


Source Distribution

simple_elmo-0.2.0.tar.gz (19.7 kB)
