Command-line interface with state-of-the-art neural network language models

Language Model Zoo

This folder contains scripts for obtaining surprisals from the following pre-trained language models:

GRNN
JRNN
RNNG
Transformer-XL
Tiny LSTM
5-gram with Kneser-Ney smoothing
coming soon: BERT

The models use the following tokenizers:

Model	Tokenizer
GRNN	TreeTagger
JRNN	WMT11 tokenizer
RNNG	PTB tokenizer
Tiny	PTB tokenizer
Trans	Moses (one implementation)
ngram	TreeTagger

The parameters are taken from the standard published version of each model unless stated otherwise.

Scripts

Surprisals can be obtained from each model using the script eval_<MODEL>.sh in the scripts folder. Each script expects two arguments: $1 is the input file containing the sentences, and $2 is the output file to save the surprisals.
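The calling convention above can be sketched from Python as follows. This is only an illustration: the helper names `build_eval_command` and `run_eval` and the default `scripts_dir` value are assumptions, not part of this repository, and the `<MODEL>` part of each script name should be checked against the scripts folder (only `eval_ngram.sh` is named in this README).

```python
import subprocess
from pathlib import Path


def build_eval_command(model, input_file, output_file, scripts_dir="scripts"):
    """Build the argument list for eval_<MODEL>.sh (hypothetical helper).

    $1 is the tokenized input file, $2 is the file the surprisals are written to.
    """
    script = Path(scripts_dir) / f"eval_{model}.sh"
    return ["bash", script.as_posix(), str(input_file), str(output_file)]


def run_eval(model, input_file, output_file, scripts_dir="scripts"):
    """Run the evaluation script and raise CalledProcessError if it fails."""
    cmd = build_eval_command(model, input_file, output_file, scripts_dir)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_eval(...) to actually execute it.
    print(" ".join(build_eval_command("ngram", "sentences.txt", "surprisals.tsv")))
```

Keeping command construction separate from execution makes it easy to inspect the exact call before submitting it to a cluster.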

Input file format

The input file should have each sentence on a new line, and each sentence should be tokenized.

There are also some model-specific constraints, although I may try to streamline these later:

For every model except RNNG and Tiny LSTM, each sentence should end with an <eos> token.
The n-gram model is uncased, so convert your input file to lowercase first; otherwise capitalized words will come out as unks. I am working on adding a script to do this.
For RNNG and Tiny LSTM, the input must be unkified, that is, out-of-vocabulary words must be replaced with unknown-word tokens. An unkify function is provided in rnng-incremental/get_raw.py, which can be used as follows:

python2 get_raw.py train.02-21 \
    RAW.txt > UNKIFIED.txt
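Until a dedicated script exists, the first two constraints (the trailing <eos> token and lowercasing for the n-gram model) could be handled with a small helper along these lines. This is a minimal sketch, not the project's own tooling: `prepare_line` and `prepare_file` are hypothetical names, and the input is assumed to be already tokenized with one sentence per line. It does not unkify; use get_raw.py for that.

```python
EOS = "<eos>"


def prepare_line(line, add_eos=True, lowercase=False):
    """Normalize one tokenized sentence; blank lines come back as ''."""
    tokens = line.split()
    if not tokens:
        return ""
    if lowercase:
        # Needed for the uncased n-gram model.
        tokens = [token.lower() for token in tokens]
    if add_eos and tokens[-1] != EOS:
        # Needed for every model except RNNG and Tiny LSTM.
        tokens.append(EOS)
    return " ".join(tokens)


def prepare_file(in_path, out_path, add_eos=True, lowercase=False):
    """Write a normalized copy of in_path to out_path, skipping blank lines."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            prepared = prepare_line(line, add_eos=add_eos, lowercase=lowercase)
            if prepared:
                dst.write(prepared + "\n")
```

For example, `prepare_file("RAW.txt", "ngram_input.txt", lowercase=True)` would produce a lowercased file with <eos> on every line, while `add_eos=False` leaves sentence endings untouched for RNNG and Tiny LSTM.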

Output file format

The output file will have the following format:

token1 0.0
token2 ...
.      ...
<eos>  0.0

where the two columns are separated by a tab (\t) and the second column gives the surprisal of the token in bits.
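A file in this format can be read back with a few lines of Python. The sketch below is illustrative only: `read_surprisals` and `total_surprisal` are hypothetical names, and it assumes that an <eos> row closes each sentence. For models whose input has no <eos> token, all rows would end up in a single group.

```python
def read_surprisals(lines):
    """Group tab-separated (token, surprisal) rows into sentences.

    A sentence is closed whenever an <eos> row is seen (an assumption).
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # Split on the last tab so the token itself is kept intact.
        token, value = line.rsplit("\t", 1)
        current.append((token, float(value)))
        if token == "<eos>":
            sentences.append(current)
            current = []
    if current:
        # Trailing rows without a closing <eos>.
        sentences.append(current)
    return sentences


def total_surprisal(sentence):
    """Sum the per-token surprisals (in bits) of one sentence."""
    return sum(value for _, value in sentence)
```

Because `read_surprisals` accepts any iterable of lines, it can be given an open file handle directly, for example `read_surprisals(open("surprisals.tsv", encoding="utf-8"))`.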

When you run eval_ngram.sh, you will also get an extra .raw output file that has the raw SRILM output with details about word probabilities and backoff.

Dependencies

LSTMs and Transformers

The GRNN, JRNN, Transformer-XL, and Tiny LSTM models require PyTorch and other dependencies that can be found in their source folders. If you don't feel like creating your own environments, feel free to "steal" mine: /om2/user/jennhu/conda/envs/neural-nlp (credit to Martin Schrimpf) works for GRNN, JRNN, and Tiny LSTM, and /om2/user/jennhu/conda/envs/transXL was custom-built for Transformer-XL.

RNNG

The dependencies for RNNG should already be set in the source code. If problems arise, I may make a Singularity image available with the relevant C++ libraries.

n-gram

The dependencies for n-gram (SRILM) are provided by a Singularity image that the script calls. However, by default, you will also need NumPy to convert the raw SRILM output to the standard format. If you don't already have an active conda environment that includes NumPy, run the command module add openmind/anaconda before running the n-gram script.

Note that I did not add this line to the top of the eval_ngram.sh file because users may want to run the n-gram model in their own preferred environments.

Other tips

When submitting jobs to SLURM, keep in mind that different models have different memory/time requirements. The following settings have worked for me in the past:

Model	Suggested memory	Speed	GPU
GRNN	`5G`	Medium	Yes
JRNN	`20G`	Medium	No
RNNG	`12G`	Slow	No
Tiny	`5G`	Fast	No
Trans	`5G`	Fast	Yes
ngram	`5G`	Fast	No

The speed is relative to the other models; for reference, Tiny LSTM takes under 1 minute to calculate surprisal for 900 simple sentences (~7 words each), while RNNG takes several hours.

If using GPU, remember to request the appropriate resources in your sbatch call.
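As a rough illustration of how the settings in the table might translate into a batch script, the sketch below renders an sbatch file as text. It is not one of the project's scripts: `render_sbatch` and the job name are hypothetical, the model keys simply reuse the row labels from the table, and `--gres=gpu:1` is one common way to request a GPU that may need adjusting for your cluster. No time limit is set because none is given above.

```python
import shlex

# (suggested memory, needs GPU), copied from the table above.
SUGGESTED = {
    "GRNN": ("5G", True),
    "JRNN": ("20G", False),
    "RNNG": ("12G", False),
    "Tiny": ("5G", False),
    "Trans": ("5G", True),
    "ngram": ("5G", False),
}


def render_sbatch(model, script, input_file, output_file):
    """Return the text of an sbatch script for one model (hypothetical helper)."""
    memory, needs_gpu = SUGGESTED[model]
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name=surprisal_{model}",
        f"#SBATCH --mem={memory}",
    ]
    if needs_gpu:
        # Assumed GPU request syntax; check your cluster's conventions.
        lines.append("#SBATCH --gres=gpu:1")
    # Quote the arguments so paths containing spaces survive the shell.
    args = " ".join(shlex.quote(str(a)) for a in (script, input_file, output_file))
    lines.append(f"bash {args}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch("ngram", "scripts/eval_ngram.sh", "in.txt", "out.tsv"))
```

The rendered text can be written to a file and passed to sbatch, or adapted into a job array that loops over several models.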

I may also add sample SLURM scripts if that would be helpful.

TODO

Adding models

BERT (currently have working pipeline, but pre-processing is a little more involved)
action LSTM / stack-only ablated RNNG (Kuncoro et al. 2017) - see Issue #17
Ordered-Neurons LSTM
MomLSTM
PCFG
add models trained on non-English data

Improving existing models

add GPU functionality

Ease of use

add README to each model folder with hyperparameters, etc.
add script for converting file to lowercase (for n-gram)
add script for tokenization
add SLURM script to submit all models as job array
add environments to shared folder

Release history

Version	Date
1.4a3 (pre-release)	Jul 26, 2022
1.4a2 (pre-release)	Jan 13, 2022
1.4a1 (pre-release)	Dec 17, 2021
1.3	Dec 3, 2021
1.2.3	Mar 26, 2021
1.2.2	Jun 29, 2020
1.2.1	Jun 4, 2020
1.2	May 27, 2020
1.1.1	May 25, 2020
1.1	May 21, 2020
1.1b0 (pre-release)	May 14, 2020
1.0.0	May 10, 2020
0.1rc5 (pre-release)	May 1, 2020
0.1rc4 (pre-release)	May 1, 2020
0.1rc3 (pre-release)	May 1, 2020
0.1rc2 (pre-release, this version)	Apr 28, 2020
0.1rc1 (pre-release)	Apr 21, 2020
0.1rc0 (pre-release)	Apr 21, 2020

Download files

Source Distribution

lm-zoo-0.1rc2.tar.gz (6.2 kB)

Uploaded Apr 28, 2020 Source

Hashes for lm-zoo-0.1rc2.tar.gz

Algorithm	Hash digest
SHA256	`7ef647c234dded5fe20654ce233e0fa7416b64f5ee80de3d11960e3d90981f2e`
MD5	`76e8a2e9fe516bdf5e5c9762149b3256`
BLAKE2b-256	`565b1f678ca29793a6a52d6f1ac74a0658aa399ebb07ddad63e9b891f2dca27a`