Train the Bi-LM model and use it as a feature extraction method

Introduction

The repository contains a class for training a bidirectional language model that extracts features for each position in a sentence.

Install

pip install keras-bi-lm

Usage

Train and save the Bi-LM model

Before using it as a feature extraction method, the language model must be trained on a large corpus.

import numpy as np

from keras_bi_lm import BiLM

sentences = [
    ['All', 'work', 'and', 'no', 'play'],
    ['makes', 'Jack', 'a', 'dull', 'boy', '.'],
]
token_dict = {
    '': 0, '<UNK>': 1, '<EOS>': 2,
    'all': 3, 'work': 4, 'and': 5, 'no': 6, 'play': 7,
    'makes': 8, 'jack': 9, 'a': 10, 'dull': 11, 'boy': 12, '.': 13,
}
token_dict_rev = {v: k for k, v in token_dict.items()}
inputs, outputs = BiLM.get_batch(sentences,
                                 token_dict,
                                 ignore_case=True,
                                 unk_index=token_dict['<UNK>'],
                                 eos_index=token_dict['<EOS>'])

bi_lm = BiLM(token_num=len(token_dict), embedding_dim=10, rnn_units=10)
bi_lm.model.summary()
# The tiny toy batch is repeated so each epoch contains enough training steps.
bi_lm.fit(np.repeat(inputs, 2 ** 12, axis=0),
          [
              np.repeat(outputs[0], 2 ** 12, axis=0),
              np.repeat(outputs[1], 2 ** 12, axis=0),
          ],
          epochs=5)
bi_lm.save_model('bi_lm.h5')

BiLM()

The core class that contains the model to be trained and used. Key parameters (a construction sketch follows the list):

  • token_num: The number of distinct tokens (words or characters).

  • embedding_dim: The dimension of the embedding layer.

  • rnn_layer_num: The number of stacked bidirectional RNNs.

  • rnn_units: An integer or a list giving the number of units per direction for each RNN layer.

  • rnn_keep_num: The number of layers whose outputs are used for predicting the probabilities of the next word.

  • rnn_type: The type of RNN, 'gru' or 'lstm'.
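
For example, a two-layer model could be constructed as follows. This is a sketch: the vocabulary size, dimensions and unit counts are placeholder values, not recommendations.

from keras_bi_lm import BiLM

# Two stacked bidirectional GRUs with 64 and 32 units per direction;
# only the top layer is kept for predicting the next token.
bi_lm = BiLM(token_num=5000,
             embedding_dim=100,
             rnn_layer_num=2,
             rnn_units=[64, 32],
             rnn_keep_num=1,
             rnn_type='gru')
bi_lm.model.summary()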

BiLM.get_batch()

A helper function that converts sentences to batch inputs and outputs for training the model; a sketch of the resulting target layout follows the list.

  • sentences: A list of lists of tokens.

  • token_dict: The dict that maps each token to an integer; the keys <UNK> and <EOS> must be present.

  • ignore_case: Whether to ignore the case of tokens.

  • unk_index: The index of the unknown token.

  • eos_index: The index of the end-of-sentence token.
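
Conceptually, the forward targets are the input shifted one step left and the backward targets are the input shifted one step right, with <EOS> filling the boundary. The snippet below illustrates this layout for one sentence from the example above; it mimics the idea rather than reproducing the library's exact padding and boundary handling.

# Illustration only: shifted bidirectional LM targets for
# 'all work and no play' using token_dict from the example above.
tokens = [3, 4, 5, 6, 7]
forward_target = tokens[1:] + [2]    # predict the next token; <EOS> = 2 at the end
backward_target = [2] + tokens[:-1]  # predict the previous token; <EOS> = 2 at the start
print(forward_target)   # [4, 5, 6, 7, 2]
print(backward_target)  # [2, 3, 4, 5, 6]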

Load and use the Bi-LM model

import keras

from keras_bi_lm import BiLM

bi_lm = BiLM(model_path='bi_lm.h5')  # or `bi_lm.load_model('bi_lm.h5')`
input_layer, output_layer = bi_lm.get_feature_layers()
model = keras.models.Model(inputs=input_layer, outputs=output_layer)
model.summary()

The output_layer produces time-distributed features (one vector per token position), and all parameters in the returned layers are frozen (not trainable).
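
For instance, features can be extracted with a plain Keras predict call. A minimal sketch, assuming inputs comes from the training example above; the size of the last dimension depends on rnn_units and rnn_keep_num:

# One feature vector per token position:
# shape is (batch_size, sentence_length, feature_dim).
features = model.predict(inputs)
print(features.shape)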

Use ELMo-like Weighted Sum of Trained Layers

import keras

from keras_bi_lm import BiLM

bi_lm = BiLM(token_num=20000,
             embedding_dim=300,
             rnn_layer_num=3,
             rnn_keep_num=4,
             rnn_units=300,
             rnn_type='lstm')
# ...
# Train the Bi-LM model
# ...

input_layer, output_layer = bi_lm.get_feature_layers(use_weighted_sum=True)
model = keras.models.Model(inputs=input_layer, outputs=output_layer)
model.summary()

When rnn_keep_num is greater than rnn_layer_num (as in the example above, where 4 > 3), the output of the embedding layer is also included in the weighted sum.
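
A typical downstream use is to train a small classifier on top of the frozen weighted-sum features. A minimal sketch, assuming a two-class task; the trainable LSTM encoder and dense layer are illustrative choices, not part of the library:

import keras

input_layer, output_layer = bi_lm.get_feature_layers(use_weighted_sum=True)

# Only the layers added below are trainable; the Bi-LM layers stay frozen.
encoded = keras.layers.LSTM(32)(output_layer)               # small trainable encoder
predictions = keras.layers.Dense(2, activation='softmax')(encoded)
clf_model = keras.models.Model(inputs=input_layer, outputs=predictions)
clf_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
clf_model.summary()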

Demo

See demo directory:

cd demo
./get_data.sh
pip install -r requirements.txt
python sentiment_analysis.py

Citation

Just cite the paper you’ve seen.
