
All-in-one framework for NLP Transfer Learning Classification


ulangel

Background

Ulangel is a Python library for NLP text classification. The idea comes from the paper by Jeremy Howard et al., "Universal Language Model Fine-tuning for Text Classification" (https://arxiv.org/pdf/1801.06146.pdf). The original code comes from the fastai library (https://github.com/fastai/course-v3); we use its NLP part as a reference and modify some of the code to fit our use case. The name ulangel comes from "universal language model". It is also the fruit of the research department of company U (a French startup based in Paris), hence l'ange de U (the angel of U). U aims to describe the ecosystem established by corporates as well as startups through our product Motherbase. In Motherbase, we have a large quantity of text about company descriptions and communications, to which we apply this library, ulangel, for Natural Language Processing.

This is an LSTM-based neural network. To classify text, we first train a language model and then fine-tune it into a classifier. In this library you will find all the structures needed to pack data, to build your LSTM with dropouts, and some evaluation functions and optimizers to train your neural network.

Install

pip install ulangel

Usage

There are three parts in this library:

  • ulangel.data: preparation of the text classification data, such as creating datasets, creating databunches, padding all texts in a batch to the same length, etc.
  • ulangel.rnn: recurrent neural network structures, such as connection dropouts, activation dropouts, the LSTM for the language model, the encoder, etc.
  • ulangel.utils: tools for training, such as callbacks, optimizers, evaluation functions, etc.

ulangel.data

There are three types of data objects in our training and validation systems. The default input data are numpy.ndarray objects.

  • Dataset: divides (and/or shuffles) the input data into batches. Each dataset item is a tuple of x and its corresponding y.
  • Dataloader: we use the PyTorch dataloader to fetch dataset items in the order defined by the sampler.
  • Databunch: gathers the training and validation dataloaders into one data object that is given to the learner to train the neural network.

Dataset

For a language model, the input is a bptt-length text and the output is a text of the same length shifted by one word. For a text [w0, w1, w2, w3, w4, ...] (wi is the integer corresponding to a word; a dictionary built from your own text corpus creates this correspondence), if bptt = 4, the input i0 is [w0, w1, w2, w3] and the output o0 is [w1, w2, w3, w4]. The input and the output are generated from the same text, with the help of the class LanguageModelDataset.

  import numpy as np
  from ulangel.data.data_packer import LanguageModelDataset
  trn_lm = np.load(your_path/'your_file.npy', allow_pickle=True)
  trn_lm_ds = LanguageModelDataset(data=trn_lm, bs=2, bptt=4, shuffle=False)
  # print an item of dataset: (x, y)
  next(iter(trn_lm_ds))
  >>> (tensor([1.1000e+01, 5.0000e+00, 2.0000e+00, 1.0000e+01]),
  tensor([5.0000e+00, 2.0000e+00, 1.0000e+01, 4.0000e+00]))

For a text classifier, the dataset is a little different. The input is still the same, but the output is an integer representing the corresponding class label. In this case, we use TextClassificationDataset, which inherits from the PyTorch dataset class torch.utils.data.Dataset.

  import numpy as np
  from ulangel.data.data_packer import TextClassificationDataset
  trn_ids = np.load(your_path/'your_text_file.npy', allow_pickle=True)
  trn_labels = np.load(your_path/'your_label_file.npy', allow_pickle=True)
  trn_clas_ds = TextClassificationDataset(x=trn_ids, y=trn_labels)
  # print an item of dataset: (x, y)
  next(iter(trn_clas_ds))
  >>> (array([   11,     5,     2,    10,     4,     7,     5,     2,     9,
              4]), 2)

Dataloader

In this library, we use the PyTorch dataloader, but with our own sampler. For the language model, batches are generated from the concatenation of all texts, so they all have the same length and we can use the dataloader directly to pack the data.

  from torch.utils.data import DataLoader
  trn_lm_dl = DataLoader(trn_lm_ds, batch_size=2)
  # print an item of dataloader: a batch of dataset
  next(iter(trn_lm_dl))
  >>> [tensor([[11.,  5.,  2., 10.],
           [12., 11.,  5.,  2.]]), tensor([[ 5.,  2., 10.,  4.],
           [11.,  5.,  2., 10.]])]

However, for text classification we cannot concatenate texts together, because each text has its own class; it makes no sense to mix texts to form equal-length sequences within a batch. In order to train the neural network efficiently while keeping some randomness, we have two different samplers for the training and the validation data. Additionally, texts within a batch must have the same length, so we use a collate function to pad the shorter ones. For the classification data, we use the dataloader in this way:

  trn_clas_dl = DataLoader(trn_clas_ds, batch_size=2, sampler=trn_sampler, collate_fn=pad_collate)
  val_clas_dl = DataLoader(val_clas_ds, batch_size=2, sampler=val_sampler, collate_fn=pad_collate)

How to create the samplers and the collate_fn is explained below.

Sampler

A sampler is an index generator. It returns a list of indices whose corresponding items are ordered by the key attribute. In this library, TrainingSampler and ValidationSampler inherit from the PyTorch sampler torch.utils.data.Sampler.

  • ulangel.data.data_packer.TrainingSampler: a sampler for the training data. It sorts the data by the given key, placing the longest item first and the shortest last, with the items in between in random order.
  from ulangel.data.data_packer import TrainingSampler
  trn_sampler = TrainingSampler(data_source=trn_clas_ds.x, key=lambda t: len(trn_clas_ds.x[t]), bs=2)

In this example, the data source is the x of the training dataset (the texts) and the key is the length of each text.

  • ulangel.data.data_packer.ValidationSampler: sorts the data by the given key, in increasing or decreasing order. Unlike the TrainingSampler, there is no randomness: the validation texts are sorted from the longest to the shortest.
  from ulangel.data.data_packer import ValidationSampler
  val_sampler = ValidationSampler(val_clas_ds.x, key=lambda t: len(val_clas_ds.x[t]))

In this example, the data source is the x of the validation dataset (the texts) and the key is the length of each text.

Collate Function

A collate function can be used to manipulate your input data. In this library, our collate function pad_collate pads texts with the padding index pad_idx so that they all have the same length within a batch. pad_collate is built in; we just need to import it to use it in the dataloader.

  from ulangel.data.data_packer import pad_collate
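
For intuition, here is a minimal sketch of what such a padding collate function does. This is not the library implementation; the padding index (1) and the right-padding layout are assumptions for illustration.

  import torch

  def pad_collate_sketch(samples, pad_idx=1):
      # samples: list of (x, y) pairs, where x is a variable-length sequence of token ids
      max_len = max(len(x) for x, _ in samples)
      xs = torch.full((len(samples), max_len), pad_idx, dtype=torch.long)
      for i, (x, _) in enumerate(samples):
          xs[i, :len(x)] = torch.tensor(x, dtype=torch.long)  # pad shorter texts on the right
      ys = torch.tensor([y for _, y in samples])
      return xs, ys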

Databunch

Databunch packs your training dataloader and validation dataloader together into a databunch object, so that you can give it to your learner (explained later in this README).

  from ulangel.data.data_packer import DataBunch
  language_model_data = DataBunch(trn_lm_dl, val_lm_dl)
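
The classification data is packed the same way; assuming the classification dataloaders built above (the variable name here is just illustrative):

  text_classification_data = DataBunch(trn_clas_dl, val_clas_dl)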

ulangel.rnn

In this part, there are two main building blocks for a neural network: dropouts and some special network structures for our transfer learning.

ulangel.rnn.dropouts

The PyTorch dropouts zero out some activations with probability p. In the article by Stephen Merity et al., "Regularizing and Optimizing LSTM Language Models" (https://arxiv.org/pdf/1708.02182.pdf), the authors propose to apply dropout not only to activations but also to connections, so the PyTorch dropouts are not enough. Therefore, we create three different dropout classes:

  • ulangel.rnn.dropouts.ActivationDropout: as its name says, a dropout that zeroes out activations within a layer of the neural network.
  • ulangel.rnn.dropouts.ConnectionWeightDropout: as its name says, a dropout that zeroes out connections (weights) between layers.
  • ulangel.rnn.dropouts.EmbeddingDropout: a dropout that zeroes out embedding activations.

These three dropout classes are used in AWD_LSTM to build the LSTM for the language model training. Of course, you can also import them to build your own neural network, as in the sketch below.
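
For instance, ActivationDropout can be used on its own. The sketch below assumes that its constructor takes the dropout probability p (as described above) and that the module is applied directly to a batch of sequence activations:

  import torch
  from ulangel.rnn.dropouts import ActivationDropout

  act_dp = ActivationDropout(0.3)   # assumed signature: dropout probability p
  x = torch.randn(2, 4, 8)          # (batch size, sequence length, activations)
  dropped = act_dp(x)               # during training, some activations are zeroed out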

ulangel.rnn.nn_block

In this part, we have some structures to build a language model and a text classifier.

  • ulangel.rnn.nn_block.AWD_LSTM: an LSTM neural network, inheriting from torch.nn.Module, proposed by Stephen Merity et al. in "Regularizing and Optimizing LSTM Language Models" (https://arxiv.org/pdf/1708.02182.pdf). Because we use the pretrained wikitext-103 language model from this article and fine-tune our own language model on our corpus, we need to keep the same values as wikitext-103 for some hyperparameters: embedding_size = 400, number_of_hidden_activation = 1150.
  from ulangel.rnn.nn_block import AWD_LSTM
  # define hyperparameters
  class LmArg:
      def __init__(self):
          self.number_of_tokens = 10000
          self.embedding_size = 400
          self.pad_token = 1
          self.embedding_dropout = 0.05
          self.number_of_hidden_activation = 1150
          self.number_of_layers = 3
          self.activation_dropout = 0.3
          self.input_activation_dropout = 0.65
          self.embedding_activation_dropout = 0.1
          self.connection_hh_dropout = 0.5
          self.decoder_activation_dropout = 0.4
          self.bptt = 16
  encode_args = LmArg()
  # build the LSTM neuron network
  lstm_enc = AWD_LSTM(
      vocab_sz=encode_args.number_of_tokens,
      emb_sz=encode_args.embedding_size,
      n_hid=encode_args.number_of_hidden_activation,
      n_layers=encode_args.number_of_layers,
      pad_token=encode_args.pad_token,
      hidden_p=encode_args.activation_dropout,
      input_p=encode_args.input_activation_dropout,
      embed_p=encode_args.embedding_activation_dropout,
      weight_p=encode_args.connection_hh_dropout)
  lstm_enc
  >>> AWD_LSTM(
    (emb): Embedding(10000, 400, padding_idx=1)
    (emb_dp): EmbeddingDropout(
      (emb): Embedding(10000, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): ConnectionWeightDropout(
        (module): LSTM(400, 1150, batch_first=True)
      )
      (1): ConnectionWeightDropout(
        (module): LSTM(1150, 1150, batch_first=True)
      )
      (2): ConnectionWeightDropout(
        (module): LSTM(1150, 400, batch_first=True)
      )
    )
    (input_dp): ActivationDropout()
    (hidden_dps): ModuleList(
      (0): ActivationDropout()
      (1): ActivationDropout()
      (2): ActivationDropout()
    )
  )
  • ulangel.rnn.nn_block.LinearDecoder: a decoder inheriting from torch.nn.Module, the inverse of an encoder; it maps the last hidden layer (embedding vector) back to the integer representation of the word, so that we can recover the human-readable word.
  from ulangel.rnn.nn_block import LinearDecoder
  decoder = LinearDecoder(
      encode_args.number_of_tokens, encode_args.embedding_size, encode_args.decoder_activation_dropout, tie_encoder=lstm_enc.emb, bias=True
  )
  decoder
  >>>LinearDecoder(
    (output_dp): ActivationDropout()
    (decoder): Linear(in_features=400, out_features=10000, bias=True)
  )
  • ulangel.rnn.nn_block.SequentialRNN: this class inherits from the PyTorch class torch.nn.Sequential; it connects different neural networks and allows resetting all parameters of substructures that have a reset method (e.g. AWD_LSTM).
  from ulangel.rnn.nn_block import SequentialRNN
  language_model = SequentialRNN(lstm_enc, decoder)
  language_model.modules
  >>>
  <bound method Module.modules of SequentialRNN(
    (0): AWD_LSTM(
      (emb): Embedding(10000, 400, padding_idx=1)
      (emb_dp): EmbeddingDropout(
        (emb): Embedding(10000, 400, padding_idx=1)
      )
      (rnns): ModuleList(
        (0): ConnectionWeightDropout(
          (module): LSTM(400, 1150, batch_first=True)
        )
        (1): ConnectionWeightDropout(
          (module): LSTM(1150, 1150, batch_first=True)
        )
        (2): ConnectionWeightDropout(
          (module): LSTM(1150, 400, batch_first=True)
        )
      )
      (input_dp): ActivationDropout()
      (hidden_dps): ModuleList(
        (0): ActivationDropout()
        (1): ActivationDropout()
        (2): ActivationDropout()
      )
    )
    (1): LinearDecoder(
      (output_dp): ActivationDropout()
      (decoder): Linear(in_features=400, out_features=10000, bias=True)
    )
  )>

For the classification data, the structure of the neural network is a little different, so we need two classes to handle these changes.

  • ulangel.rnn.nn_block.SentenceEncoder: a class similar to ulangel.rnn.nn_block.AWD_LSTM, but when the input text length exceeds the value of bptt (defined to train the language model), it splits the text into several bptt-length sequences at the input and concatenates the results back into one text at the output.
  from ulangel.rnn.nn_block import SentenceEncoder
  sent_enc = SentenceEncoder(lstm_enc, encode_args.bptt)
  sent_enc
  >>>SentenceEncoder(
    (module): AWD_LSTM(
      (emb): Embedding(10000, 400, padding_idx=1)
      (emb_dp): EmbeddingDropout(
        (emb): Embedding(10000, 400, padding_idx=1)
      )
      (rnns): ModuleList(
        (0): ConnectionWeightDropout(
          (module): LSTM(400, 1150, batch_first=True)
        )
        (1): ConnectionWeightDropout(
          (module): LSTM(1150, 1150, batch_first=True)
        )
        (2): ConnectionWeightDropout(
          (module): LSTM(1150, 400, batch_first=True)
        )
      )
      (input_dp): ActivationDropout()
      (hidden_dps): ModuleList(
        (0): ActivationDropout()
        (1): ActivationDropout()
        (2): ActivationDropout()
      )
    )
  )
  • ulangel.rnn.nn_block.PoolingLinearClassifier: unlike the language model, we don't need the decoder to read the output; we want to classify the input text. So on the output of the ulangel.rnn.nn_block.AWD_LSTM we do some pooling: we take the last sequence of the LSTM's output, its max pooling and its average pooling. We concatenate these three sequences as the input of a linear fully connected classifier. The number of activations in the last layer should be equal to the number of classes in your classification problem.
  from ulangel.rnn.nn_block import PoolingLinearClassifier
  pool_clas = PoolingLinearClassifier(
      layers=[3*encode_args.embedding_size, 100, 4],  # number of activations for each layer
      drops=[0.2, 0.1])
  pool_clas
  >>>PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.2, inplace=False)
      (2): Linear(in_features=1200, out_features=100, bias=True)
      (3): ReLU(inplace=True)
      (4): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Dropout(p=0.1, inplace=False)
      (6): Linear(in_features=100, out_features=4, bias=True)
      (7): ReLU(inplace=True)
    )
  )

To build the complete classifier, we use the ulangel.rnn.nn_block.SequentialRNN to connect these two classes:

  classifier = SequentialRNN(sent_enc, pool_clas)
  classifier
  >>>SequentialRNN(
    (0): SentenceEncoder(
      (module): AWD_LSTM(
        (emb): Embedding(10000, 400, padding_idx=1)
        (emb_dp): EmbeddingDropout(
          (emb): Embedding(10000, 400, padding_idx=1)
        )
        (rnns): ModuleList(
          (0): ConnectionWeightDropout(
            (module): LSTM(400, 1150, batch_first=True)
          )
          (1): ConnectionWeightDropout(
            (module): LSTM(1150, 1150, batch_first=True)
          )
          (2): ConnectionWeightDropout(
            (module): LSTM(1150, 400, batch_first=True)
          )
        )
        (input_dp): ActivationDropout()
        (hidden_dps): ModuleList(
          (0): ActivationDropout()
          (1): ActivationDropout()
          (2): ActivationDropout()
        )
      )
    )
    (1): PoolingLinearClassifier(
      (layers): Sequential(
        (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (1): Dropout(p=0.2, inplace=False)
        (2): Linear(in_features=1200, out_features=100, bias=True)
        (3): ReLU(inplace=True)
        (4): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): Dropout(p=0.1, inplace=False)
        (6): Linear(in_features=100, out_features=4, bias=True)
        (7): ReLU(inplace=True)
      )
    )
  )

ulangel.utils

In this part, there are some tools for training the neural network.

Callbacks

Callbacks are triggered during training. They can perform intermediate computations or handle settings.

  • ulangel.utils.callbacks.TrainEvalCallback: sets whether the model is in training or validation mode. In training mode, it updates the progress and the iteration count.

  • ulangel.utils.callbacks.CudaCallback: puts the model and the variables on CUDA.

  • ulangel.utils.callbacks.Recorder: records the loss value and the learning rate of every batch, and plots their evolution when recorder.plot_lr(), recorder.plot_loss() or recorder.plot() is called.

  • ulangel.utils.callbacks.LR_Find: given the minimum and maximum learning rates and the maximum number of iterations, it changes the learning rate linearly (from the minimum value to the maximum value) at every batch. Combined with the Recorder, we can see the evolution of the loss and find an appropriate learning rate for the training (see the sketch after the callback-list example below). Warning: if LR_Find is in the callback list, the run only sweeps through the learning rates; it does not train the model.

  • ulangel.utils.callbacks.RNNTrainer: records the prediction result, raw_output (without applying dropouts) and output (after applying dropouts) after every prediction. If needed, it can also add AR and/or TAR regularization to the loss to avoid overfitting.

  • ulangel.utils.callbacks.ParamScheduler: allows scheduling any hyperparameter during training, such as the learning rate, momentum, weight decay, etc. It takes the hyperparameter's name and its schedule function sched_func. Here we use a combined schedule function, combine_scheds, combining two parts of a cosine function so that the learning rate is low at the beginning and the end, and high in the middle.

  from ulangel.utils.callbacks import combine_scheds, ParamScheduler, sched_cos
  lr = 1e-3
  sched_cos1 = sched_cos(start=lr/10, end=lr*2)
  sched_cos2 = sched_cos(start=lr*2, end=lr/100)
  # pcts gives the fraction of training covered by each function in scheds. The example below
  # spends the first 30% of training on sched_cos1 and the remaining 70% on sched_cos2.
  sched = combine_scheds(pcts=[0.3, 0.7], scheds=[sched_cos1, sched_cos2])

For the training process, it is up to the user to choose callbacks and build a callback list. Here is an example:

  from ulangel.utils.callbacks import TrainEvalCallback, CudaCallback, Recorder, RNNTrainer
  cbs = [CudaCallback(), TrainEvalCallback(), Recorder(), RNNTrainer(alpha=2., beta=1.), ParamScheduler('lr', sched)]
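
For example, to search for a learning rate before the real training, you can build a callback list that includes LR_Find together with a Recorder and then plot the recorded losses. The constructor arguments below are illustrative; check the class for the exact parameter names.

  from ulangel.utils.callbacks import LR_Find

  recorder = Recorder()
  cbs_lr_find = [CudaCallback(), TrainEvalCallback(), recorder,
                 LR_Find(min_lr=1e-6, max_lr=10, max_iter=100)]
  # remember: with LR_Find in the list, the run only sweeps learning rates, it does not train
  # after running fit with these callbacks, recorder.plot() shows the loss against the learning rate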

Stats

Stats contains the classes and functions used to compute statistics about the model's performance. There are two classes and some metric functions.

  • metrics: a metric function takes the outputs of your model and the target values as inputs; you can define your own way to evaluate your model's performance by writing your own computation in such a function (see the sketch after the code example below). The functions ulangel.utils.stats.accuracy_flat (accuracy for the language model) and ulangel.utils.stats.accuracy (accuracy for the classifier) are two built-in metrics that we provide. Warning: ulangel.utils.stats.cross_entropy_flat is not a metric. It is a loss function, but it is similar to the accuracy_flat metric, so they live in the same place.

  • ulangel.utils.stats.AvgStats: computes the loss and the statistics defined by the input metrics. This class gathers the loss value and the other performance statistics into a list, and has methods to update and print them when called.

  • ulangel.utils.stats.AvgStatsCallback: this class is also a callback; it uses AvgStats to compute all performance statistics after every batch and prints them after every epoch.

We can add AvgStatsCallback to the callback list, so that we know how the neural network performs after every epoch.

  from ulangel.utils.stats import AvgStatsCallback, accuracy, accuracy_flat
  # for a language model
  cbs_languagemodel = [CudaCallback(), TrainEvalCallback(), AvgStatsCallback([accuracy_flat]), Recorder(), RNNTrainer(alpha=2., beta=1.), ParamScheduler('lr', sched)]

  # for a classifier
  cbs_classifier = [CudaCallback(), TrainEvalCallback(), AvgStatsCallback([accuracy]), Recorder(), ParamScheduler('lr', sched_clas)]
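
As mentioned above, you can also write your own metric. Here is a minimal sketch of a custom metric, assuming (like the built-in accuracy) that it receives the raw model outputs and the integer targets:

  import torch

  def error_rate(out, targ):
      # fraction of misclassified samples, i.e. 1 - accuracy
      preds = torch.argmax(out, dim=-1)
      return (preds != targ).float().mean()

  # it can then be passed alongside the built-in metrics, e.g. AvgStatsCallback([accuracy, error_rate])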

Optimizer

  • optimizers: ulangel.utils.optimizer.Optimizer is a class that decides how to update all the parameters of the model through steppers. ulangel.utils.optimizer.StatefulOptimizer is an optimizer with state: it inherits from Optimizer and adds a state attribute in order to track the history of updates. When we use an optimizer with momentum, we need the previous update value to compute the current one; in that case we use StatefulOptimizer.

  • stepper: functions defining how to update the parameters or their gradients, depending on the current values. The library provides several steppers: ulangel.utils.optimizer.sgd_step (stochastic gradient descent stepper), ulangel.utils.optimizer.weight_decay (weight decay stepper) and ulangel.utils.optimizer.adam_step (Adam stepper). You can also write your own stepper.

  • stateupdater: defines how to initialize and update the state (for example, how to update momentum). The library provides some built-in stateupdaters, all of which inherit from the class ulangel.utils.optimizer.StateUpdater:

  • ulangel.utils.optimizer.AverageGrad (momentum obtained by averaging the gradient)

  • ulangel.utils.optimizer.AverageSqrGrad (momentum obtained by averaging the squared gradient)

  • ulangel.utils.optimizer.StepCount (step increment)

In ulangel, we provide two built-in optimizers: ulangel.utils.optimizer.sgd_opt (stochastic gradient descent optimizer) and ulangel.utils.optimizer.adam_opt (Adam optimizer). The optimizer is an input of the Learner object; we will show how to use it in the Learner part.

If you want, you can also write your own stepper and your own stateupdater to build your own optimizer. Here is an example of building an optimizer with momentum.

  from functools import partial

  from ulangel.utils.optimizer import StatefulOptimizer, StateUpdater

  def your_stepper(p, lr, *args, **kwargs):
      # update the parameter p using the learning rate (and any extra hyperparameters)
      p = your_stepper_function(p, lr)
      return p

  class YourStateUpdater(StateUpdater):
      def __init__(self):
          # your initialization values
          ...

      def init_state(self, p):
          return {"your_state_name": your_state_initialization_function(p)}

      def update(self, p, state, *args, **kwargs):
          state["your_state_name"] = your_state_update_function(p, *args)
          return state

  def your_optimizer(xtra_step=None, **kwargs):
      # listify is assumed to be the library helper that wraps xtra_step into a list (empty if None)
      return partial(
          StatefulOptimizer,
          steppers=[your_stepper] + listify(xtra_step),
          stateupdaters=[YourStateUpdater()],
          **kwargs
      )
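
Such an optimizer factory is then used like the built-in ones: the callable it returns is passed to the learner as opt_func. The weight_decay extra stepper below is just one possible choice.

  from ulangel.utils.optimizer import weight_decay

  opt_func = your_optimizer(xtra_step=weight_decay)
  # opt_func is then given to the Learner, exactly like adam_opt() in the next section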

Learner

ulangel.utils.learner.Learner is a class that takes the RNN model, the data for training, the loss function, the optimizer, the learning rate and the callbacks that you need. The method Learner.fit(epochs=number of epochs you want to train) runs the whole training process. Here is an example of building the language model learner:

  import torch

  from ulangel.utils.learner import Learner
  from ulangel.utils.optimizer import adam_opt
  from ulangel.utils.stats import cross_entropy_flat

  language_model_learner = Learner(
      model=language_model,
      data=language_model_data,
      loss_func=cross_entropy_flat,
      opt_func=adam_opt(),
      lr=1e-5,
      cbs=cbs_languagemodel)
  # load the pretrained model
  wgts = torch.load('your_pretrained_model.h5')
  # some key mapping between the pretrained weights and the new model's state_dict may be necessary
  dict_new = language_model_learner.model.state_dict().copy()
  dict_new['key1'] = wgts['key1_pretrained']
  dict_new['key2'] = wgts['key2_pretrained']

  language_model_learner.model.load_state_dict(dict_new)
  language_model_learner.fit(2)
  >>>
  0
  train: [loss value 1 for training set, tensor(metric value 1 for training set, device='cuda:0')]
  valid: [loss value 1 for validation set, tensor(metric value 1 for validation set, device='cuda:0')]
  1
  train: [loss value 2 for training set, tensor(metric value 2 for training set, device='cuda:0')]
  valid: [loss value 2 for validation set, tensor(metric value 2 for validation set, device='cuda:0')]

  # save your model if you are satisfied with its performance
  torch.save(language_model_learner.model.state_dict(), 'your_model_path.pkl')
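
The classifier is trained the same way. The sketch below assumes the classification databunch built in the Databunch section (text_classification_data) and the cbs_classifier callback list from the Stats section; the loss function and the learning rate are illustrative choices, not something the library imposes.

  import torch.nn.functional as F

  classifier_learner = Learner(
      model=classifier,
      data=text_classification_data,
      loss_func=F.cross_entropy,   # assumption: plain cross-entropy as the classification loss
      opt_func=adam_opt(),
      lr=1e-5,                     # illustrative value
      cbs=cbs_classifier)
  classifier_learner.fit(2)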

Software Requirements

Python 3.6, torch 1.3.1, torchvision 0.4.2

