Train the Bi-LM model and use it as a feature extraction method

Project description

Introduction

The repository contains a class for training a bidirectional language model (Bi-LM) that can be used to extract a feature vector for each position in a sentence.

Install

pip install keras-bi-lm

Usage

Train and save the Bi-LM model

Before it can be used for feature extraction, the language model must be trained on a large corpus.

import numpy as np
from keras_bi_lm import BiLM

sentences = [
    ['All', 'work', 'and', 'no', 'play'],
    ['makes', 'Jack', 'a', 'dull', 'boy', '.'],
]
token_dict = {
    '': 0, '<UNK>': 1, '<EOS>': 2,
    'all': 3, 'work': 4, 'and': 5, 'no': 6, 'play': 7,
    'makes': 8, 'a': 9, 'dull': 10, 'boy': 11, '.': 12,
}
token_dict_rev = {v: k for k, v in token_dict.items()}
# BiLM.get_batch is a static helper that converts tokenized sentences
# into batch inputs and targets for training the model.
inputs, outputs = BiLM.get_batch(sentences,
                                 token_dict,
                                 ignore_case=True,
                                 unk_index=token_dict['<UNK>'],
                                 eos_index=token_dict['<EOS>'])

bi_lm = BiLM(token_num=len(token_dict), embedding_dim=10, rnn_units=10)
bi_lm.model.summary()
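# The two toy sentences are repeated 2 ** 12 times so that each epoch
# contains enough samples for the model to fit this tiny vocabulary.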
bi_lm.fit(np.repeat(inputs, 2 ** 12, axis=0),
          [
              np.repeat(outputs[0], 2 ** 12, axis=0),
              np.repeat(outputs[1], 2 ** 12, axis=0),
          ],
          epochs=5)
bi_lm.save_model('bi_lm.h5')

BiLM()

The core class that contains the model to be trained and used. Key parameters (a construction sketch follows this list):

  • token_num: The number of distinct words or characters (the vocabulary size).

  • embedding_dim: The dimension of the embedding layer.

  • rnn_layer_num: The number of stacked bidirectional RNN layers.

  • rnn_units: An integer or a list giving the number of units of the RNNs in one direction.

  • rnn_keep_num: The number of RNN layers whose outputs are used for predicting the probabilities of the next word.

  • rnn_type: The type of RNN cell, either 'gru' or 'lstm'.
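As a rough illustration of how these parameters fit together, here is a minimal construction sketch (the specific values are illustrative assumptions, not recommended settings):

from keras_bi_lm import BiLM

# Two stacked bidirectional LSTM layers with 32 units per direction;
# only the topmost layer's outputs are kept for next-word prediction.
bi_lm = BiLM(token_num=128,
             embedding_dim=16,
             rnn_layer_num=2,
             rnn_units=[32, 32],
             rnn_keep_num=1,
             rnn_type='lstm')
bi_lm.model.summary()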

BiLM.get_batch()

A static helper that converts tokenized sentences to batch inputs and outputs for training the model (a sketch of unknown-token handling follows the parameter list).

  • sentences: A list of lists of tokens.

  • token_dict: The dict that maps each token to an integer. Entries for <UNK> and <EOS> must be reserved.

  • ignore_case: Whether to ignore the case of the tokens.

  • unk_index: The index used for unknown tokens.

  • eos_index: The index used for the end of a sentence.
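For instance, 'jack' has no entry in the token_dict defined earlier, so 'Jack' should fall back to unk_index; a short sketch under that assumption:

# 'jack' is missing from `token_dict`, so with ignore_case=True the
# token 'Jack' is mapped to unk_index; eos_index marks sentence ends.
inputs, outputs = BiLM.get_batch([['makes', 'Jack', 'a', 'dull', 'boy', '.']],
                                 token_dict,
                                 ignore_case=True,
                                 unk_index=token_dict['<UNK>'],
                                 eos_index=token_dict['<EOS>'])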

Load and use the Bi-LM model

import keras
from keras_bi_lm import BiLM

bi_lm = BiLM(model_path='bi_lm.h5')  # or `bi_lm.load_model('bi_lm.h5')`
input_layer, output_layer = bi_lm.get_feature_layers()
model = keras.models.Model(inputs=input_layer, outputs=output_layer)
model.summary()

The output_layer produces the time-distributed features, and all parameters in the model's layers are frozen (not trainable).
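The frozen model can then be used for prediction. A minimal sketch, assuming the input is a padded batch of token indices and the output holds one feature vector per position:

import numpy as np

# A single padded sentence encoded with `token_dict`.
batch = np.asarray([[3, 4, 5, 6, 7]])
features = model.predict(batch)
print(features.shape)  # (batch_size, sequence_length, feature_dim)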

Demo

See the demo directory:

cd demo
./get_data.sh
pip install -r requirements.txt
python sentiment_analysis.py

Citation

Just cite the paper you’ve seen.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

keras-bi-lm-0.0.3.tar.gz (4.5 kB)

File details

Details for the file keras-bi-lm-0.0.3.tar.gz.

File metadata

  • Download URL: keras-bi-lm-0.0.3.tar.gz
  • Upload date:
  • Size: 4.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/28.8.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.6.4

File hashes

Hashes for keras-bi-lm-0.0.3.tar.gz:

  • SHA256: f163c3d7cd00904405197f3fb397d70f9140b7b4668055efa01804e182cc6300

  • MD5: 2bd28aacf2342dc504f353d1c7756b9a

  • BLAKE2b-256: 329dfef1c36851af4c2f8c1c19a1438dc5568f43edd91d50ba83e7679c0cf17e
