
ELMo, updated to be usable with models for many languages


Pre-trained ELMo Representations for Many Languages

We release our ELMo representations trained on many languages, which helped us win the CoNLL 2018 shared task on Universal Dependencies Parsing according to LAS.

Technique Details

We use the same hyperparameter settings as Peters et al. (2018) for the biLM and the character CNN. For each language, we train their parameters on a set of 20 million words randomly sampled from the raw text released by the shared task (Wikipedia dump + Common Crawl). Our code is largely based on AllenNLP, with the following changes:

  • We support Unicode characters;
  • We use the sampled softmax technique to make training on a large vocabulary feasible (Jean et al., 2015). However, we use a window of words surrounding the target word as negative samples, which showed better performance in our preliminary experiments (a minimal sketch of the idea follows this list).
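
The snippet below is a minimal PyTorch sketch of this window-based sampled softmax idea, not the project's actual implementation; all function and variable names are illustrative.

import torch
import torch.nn.functional as F

def window_sampled_softmax_loss(hidden, word_emb, token_ids, position, window=8):
    # hidden: (dim,) context vector predicting the word at `position`
    # word_emb: (vocab_size, dim) output embedding matrix
    # token_ids: (seq_len,) LongTensor with the word ids of the sentence
    lo = max(0, position - window)
    hi = min(token_ids.size(0), position + window + 1)
    candidates = token_ids[lo:hi]            # the target plus surrounding words as negatives
    logits = word_emb[candidates] @ hidden   # score only this small candidate set
    target_index = position - lo             # index of the true word inside the window
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_index]))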

The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.

Downloads

Arabic Bulgarian Catalan Czech
Old Church Slavonic Danish German Greek
English Spanish Estonian Basque
Persian Finnish French Irish
Galician Ancient Greek Hebrew Hindi
Croatian Hungarian Indonesian Italian
Japanese Korean Latin Latvian
Norwegian Bokmål Dutch Norwegian Nynorsk Polish
Portuguese Romanian Russian Slovak
Slovene Swedish Turkish Uyghur
Ukrainian Urdu Vietnamese Chinese

The models are hosted on the NLPL Vectors Repository.

ELMo for Simplified Chinese

We also provide a simplified-Chinese ELMo. It was trained on the Xinhua portion of Chinese Gigaword v5, unlike the traditional-Chinese ELMo, which was trained on Wikipedia.

Prerequisites

Usage

Install the package

You need to install the package before using the embeddings, with the following command:

python setup.py install
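
Alternatively, since the package is published on PyPI, it should also be installable with pip:

pip install elmoformanylangs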

Set up the config_path

After unzipping the model, you will find a JSON file ${lang}.model/config.json. Please change the "config_path" field to the relative path to the model configuration cnn_50_100_512_4096_sample.json. For example, if your ELMo model is zht.model/config.json and your model configuration is zht.model/cnn_50_100_512_4096_sample.json, you need to change "config_path" in zht.model/config.json to cnn_50_100_512_4096_sample.json.

If there is no configuration cnn_50_100_512_4096_sample.json under ${lang}.model, you can copy the elmoformanylangs/configs/cnn_50_100_512_4096_sample.json into ${lang}.model, or change the "config_path" into elmoformanylangs/configs/cnn_50_100_512_4096_sample.json.
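
If you prefer to script this step, here is a minimal sketch (reusing the zht.model example above; the paths are illustrative) that rewrites the "config_path" field:

import json
from pathlib import Path

model_dir = Path("zht.model")              # your unzipped model directory
config_file = model_dir / "config.json"

config = json.loads(config_file.read_text(encoding="utf-8"))
config["config_path"] = "cnn_50_100_512_4096_sample.json"   # relative to the model directory
config_file.write_text(json.dumps(config, indent=2), encoding="utf-8")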

See issue 27 for more details.

Use ELMoForManyLangs in command line

Prepare your input file in the CoNLL-U format, like

1   Sue    Sue    _   _   _   _   _   _   _
2   likes  like   _   _   _   _   _   _   _
3   coffee coffee _   _   _   _   _   _   _
4   and    and    _   _   _   _   _   _   _
5   Bill   Bill   _   _   _   _   _   _   _
6   tea    tea    _   _   _   _   _   _   _

Fields should be separated by '\t'. We only use the second column, and space (' ') is supported in this field (for Vietnamese, a word can contain spaces). Do remember to tokenize!
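
If you need to generate such a file from already-tokenized sentences, the following is a minimal sketch (the helper name is illustrative): it fills only the ID, form, and lemma columns and leaves the remaining seven columns as '_'.

def write_conll(sentences, path):
    # sentences: a list of token lists, e.g. [["Sue", "likes", "coffee"]]
    with open(path, "w", encoding="utf-8") as f:
        for words in sentences:
            for i, word in enumerate(words, start=1):
                f.write("\t".join([str(i), word, word] + ["_"] * 7) + "\n")
            f.write("\n")   # blank line between sentences

write_conll([["Sue", "likes", "coffee", "and", "Bill", "tea"]], "input.conllu")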

When it's all set, run

$ python -m elmoformanylangs test \
    --input_format conll \
    --input /path/to/your/input \
    --model /path/to/your/model \
    --output_prefix /path/to/your/output \
    --output_format hdf5 \
    --output_layer -1

It will dump an HDF5-encoded dict onto the disk, where the key is the '\t'-separated words of the sentence and the value is its 3-layer averaged ELMo representation. You can also dump the CNN-encoded word with --output_layer 0, the first layer of the LSTM with --output_layer 1, and the second layer of the LSTM with --output_layer 2.
We are actively changing the interface to make it more compatible with the AllenNLP ELMo and more programmer-friendly.
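
Reading the dumped representations back is straightforward with h5py; the sketch below uses a placeholder path, since the exact file name depends on --output_prefix and --output_layer.

import h5py

with h5py.File("/path/to/your/output.hdf5", "r") as f:   # placeholder path
    for key in f.keys():
        words = key.split("\t")       # keys are the '\t'-joined words of each sentence
        vectors = f[key][()]          # one matrix per sentence: (seq_len, embedding_size)
        print(len(words), vectors.shape)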

Use ELMoForManyLangs programmatically

Thanks to @voidism for contributing the API. Using the Embedder Python object, you can use ELMo in your own code like this:

from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/')

sents = [['今', '天', '天氣', '真', '好', '阿'],
         ['潮水', '退', '了', '就', '知道', '誰', '沒', '穿', '褲子']]
# a list of lists which stores the sentences,
# segmented if necessary.

e.sents2elmo(sents)
# returns a list of numpy arrays,
# each with shape (seq_len, embedding_size)

The parameters to initialize Embedder:

class Embedder(model_dir='/path/to/your/model/', batch_size=64):
  • model_dir: the absolute path from the repo top dir to your model dir.
  • batch_size: the batch size used during model inference; set it according to your GPU/CPU RAM size (default: 64), as shown in the example below.
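
For example (the path and value are illustrative), a smaller batch size can be passed on machines with limited GPU memory:

from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/', batch_size=32)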

The parameters of the function sents2elmo:

def sents2elmo(sents, output_layer=-1):
  • sents: a list of lists which stores the sentences, segmented if necessary.
  • output_layer: the target layer to output.
    • 0 for the word encoder
    • 1 for the first LSTM hidden layer
    • 2 for the second LSTM hidden layer
    • -1 for an average of 3 layers. (default)
    • -2 for all 3 layers
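
For example, reusing e and sents from above (the exact array layout returned for -2 is an assumption; check the shapes on your setup):

word_encoder_vecs = e.sents2elmo(sents, output_layer=0)    # word encoder output only
all_layer_vecs = e.sents2elmo(sents, output_layer=-2)      # all three layers per sentence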

Training Your Own ELMo

Please run

$ python -m elmoformanylangs.biLM train -h

to get more details about the ELMo training.

Here is an example for training English ELMo.

$ less data/en.raw
... (snip) ...
Notable alumni
Aris Kalafatis ( Acting )
Labour Party
They build an open nest in a tree hole , or man - made nest - boxes .
Legacy
... (snip) ...

$ python -m elmoformanylangs.biLM train \
    --train_path data/en.raw \
    --config_path elmoformanylangs/configs/cnn_50_100_512_4096_sample.json \
    --model output/en \
    --optimizer adam \
    --lr 0.001 \
    --lr_decay 0.8 \
    --max_epoch 10 \
    --max_sent_len 20 \
    --max_vocab_size 150000 \
    --min_count 3

However, we need to add that the training process is not very stable. In some cases, it ends up with a loss of NaN. We are actively working on that and hope to improve it in the future.

Citation

If our ELMo gave you nice improvements, please cite us.

@InProceedings{che-EtAl:2018:K18-2,
  author    = {Che, Wanxiang  and  Liu, Yijia  and  Wang, Yuxuan  and  Zheng, Bo  and  Liu, Ting},
  title     = {Towards Better {UD} Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {55--64},
  url       = {http://www.aclweb.org/anthology/K18-2005}
}

Please also cite the NLPL Vectors Repository for hosting the models.

@InProceedings{fares-EtAl:2017:NoDaLiDa,
  author    = {Fares, Murhaf  and  Kutuzov, Andrey  and  Oepen, Stephan  and  Velldal, Erik},
  title     = {Word vectors, reuse, and replicability: Towards a community repository of large-text resources},
  booktitle = {Proceedings of the 21st Nordic Conference on Computational Linguistics},
  month     = {May},
  year      = {2017},
  address   = {Gothenburg, Sweden},
  publisher = {Association for Computational Linguistics},
  pages     = {271--276},
  url       = {http://www.aclweb.org/anthology/W17-0237}
}

