
Utilities for training and working with NLP models in PyTorch


xt-nlp

This directory includes the most recent and complete version of the BERT model training code (April 30, 2019).

Training NLP Model

Usage

To train a new BERT model, complete the following steps:

  1. Choose which model you plan to train:

    • Standard BERT: BERT-Base-Uncased, BERT-Large-Cased, etc.
      • We recommend BERT-Base-Uncased: Base avoids space (memory) problems, and uncased yields better results.
    • BioBERT (three versions: PubMed, PMC, PubMed + PMC)
      • We recommend PubMed + PMC. Find the models here. The model is also downloaded in INSERT DIR.
    • SciBERT
    • An already partially fine-tuned model (from the options above)
  2. Specify the file paths for the files below. Note that standard BERT models are downloaded to a cache directory by pytorch-pretrained-bert. (A configuration sketch covering steps 2-6 follows this list.)

    • If fine-tuning, set the s.finetune boolean to True. The following settings depend on the specific model used:
      • s.model_config_file : name of the model config file when training from a fine-tuned model
      • s.model_checkpoint_file : name of the .bin file when training from a fine-tuned model
    • Standard BERT:
      • s.model_type: set to bert-base-uncased, etc.
      • s.bert_standard_cache: set to the path the pretrained weights should be downloaded to. This is optional; if left empty, the weights download to the pytorch-pretrained-bert library's default location.
    • BioBERT:
      • s.model_type: set to biobert
      • s.biobert_raw_model_path: path of the converted BioBERT model (see Converting BioBERT from TF to PyTorch below)
      • s.biobert_vocab_path: path of vocab.txt for BioBERT model
    • SciBERT:
      • currently not supported
  3. Specify output directories for the model, log, and results files. Look at the out_path and log_path variables in run_train.py. A new folder is created in these directories for each run; folder names follow the new_folder variable in run_train.py. The following files are created on each run:

    • models/[new folder]/ans_type.pkl : list of strings containing all answer types for the run. This file is important! It specifies the order of the answer types!
    • models/[new folder]/config.json : See above
    • models/[new folder]/hyperparams.json : JSON of all values of SESSettings object for train
    • models/[new folder]/pytorch_model_END.bin : Model saved at the end of training
    • models/[new folder]/pytorch_model_BEST.bin : Model saved after the epoch with the highest validation F1 score
    • models/[new folder]/results_[all answer types]_enum[epoch number].txt : Text results for all answers over all epochs on all validation examples. The file contains the original text, followed by the top predictions. Each prediction includes the raw start and end logit scores, as well as the predicted final answer.
  4. Load the data you plan to annotate. run_train.py depends on loading your example text into a list of SESExample objects. The loading functions are defined in data_loader_main.py and are called from get_examples() in utils.py. Depending on your data, you may need to change the function used in get_examples() or write your own. Some of the functions in data_loader_main.py are as follows:

    • brat_read_select: Reads only brat annotations whose answer types appear in the answer-set argument
    • brat_read_everything: Reads all answer types in brat annotation files
  5. Choose run hyperparameters. The file run_train.py is set up for hyperparameter optimization.

  6. Run the training. From the root directory of this repo, run python run_train.py.
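
The following is a minimal configuration sketch tying steps 2-6 together. The SESSettings field names come from the steps above, but the import locations, the get_examples() signature, and the placeholder paths are assumptions; check run_train.py before copying anything verbatim.

    # Hypothetical sketch; field names follow the steps above, but
    # import paths and signatures are assumptions.
    from utils import get_examples            # assumed import location
    from xt_nlp.settings import SESSettings   # assumed import location

    s = SESSettings()

    # Step 2: choose the model and its paths.
    s.model_type = 'bert-base-uncased'   # or 'biobert'
    s.bert_standard_cache = ''           # optional; empty uses the default cache
    s.finetune = False                   # True to resume from a fine-tuned model
    # s.model_config_file = '...'        # only needed when s.finetune is True
    # s.model_checkpoint_file = '...'

    # Step 3: output directories (see out_path / log_path in run_train.py).
    out_path = 'models/'
    log_path = 'logs/'

    # Step 4: load annotated text into a list of SESExample objects.
    examples = get_examples(s)           # assumed signature

    # Steps 5-6: set hyperparameters on s, then launch:
    #   python run_train.py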

Inference with BERT

Usage

  1. Choose which model you plan to run inference with:

    • For a standard BERT model, set s.model_type to 'bert-standard'.
    • For a BioBERT model, set s.model_type to be 'biobert'.
  2. Specify the file paths for the files below. Note that standard BERT models are downloaded to a cache directory by pytorch-pretrained-bert.

    • s.model_config_file : path of the model config file of the fine-tuned model
    • s.model_checkpoint_file : path of the .bin file of the fine-tuned model
    • s.model_ans_list_file : path of the ans_type.pkl file containing the model's answer types
  3. Run (or call) run_infer. This function returns character-level logits for the string input. (A sketch follows this list.)
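
Below is a hedged sketch of an inference call. The run_infer signature, the import locations, and the models/run_01/ paths are placeholders, not the repo's actual API.

    # Hypothetical inference sketch; imports, paths, and the
    # run_infer signature are assumptions.
    import pickle

    from run_infer import run_infer            # assumed import location
    from xt_nlp.settings import SESSettings    # assumed import location

    s = SESSettings()
    s.model_type = 'biobert'
    s.model_config_file = 'models/run_01/config.json'                # placeholder
    s.model_checkpoint_file = 'models/run_01/pytorch_model_END.bin'  # placeholder
    s.model_ans_list_file = 'models/run_01/ans_type.pkl'             # placeholder

    # ans_type.pkl fixes the answer-type order, so the returned
    # character-level logits can be mapped back to their labels.
    with open(s.model_ans_list_file, 'rb') as f:
        ans_types = pickle.load(f)

    logits = run_infer('Receipt text to extract answers from...')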

Converting BioBERT from TF to PyTorch

When downloading BioBERT from its repo, you must convert the TF checkpoint to a PyTorch model. See this post.
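
For example, using the converter shipped with pytorch-pretrained-bert (treat the module path and file names below as assumptions; some versions expose the same converter as a pytorch_pretrained_bert command-line tool, and TensorFlow must be installed for the conversion):

    # Convert a downloaded BioBERT TF checkpoint to PyTorch weights.
    # Paths are placeholders for wherever you unpacked BioBERT.
    from pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch import (
        convert_tf_checkpoint_to_pytorch,
    )

    convert_tf_checkpoint_to_pytorch(
        'biobert_pubmed_pmc/biobert_model.ckpt',  # TF checkpoint prefix
        'biobert_pubmed_pmc/bert_config.json',    # matching BERT config
        'biobert_pubmed_pmc/pytorch_model.bin',   # output PyTorch .bin
    )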

Extensions and ToDo

  • Train model from SciBERT baseline and look for improvements.
  • Masking of tokens in fine-tuning. By masking tokens (replacing them with random words or the [MASK] token) in the answer/all text, the model should learn to use the context around the answers to infer the correct answer. This prevents the model from learning to look up vocabulary when training on small datasets. (A sketch follows this list.)
  • Testing different layer types/network sizes on the last layer of BERT's output. Currently we have multiple fully connected layers for each token, with layer sizes from 0 up to the length of ans_type. Deeper networks or other architectures might allow BERT to answer 20+ different answer types without losing accuracy.
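
To make the masking idea concrete, here is a minimal sketch of the standard BERT masking recipe; it is hypothetical and not part of this repo:

    import random

    def mask_tokens(tokens, vocab, mask_prob=0.15):
        """Hypothetical helper: hide tokens so the model must use context.

        With probability mask_prob a token becomes [MASK] (80% of the
        time), a random vocab word (10%), or stays unchanged (10%),
        following the original BERT pretraining recipe.
        """
        masked = []
        for tok in tokens:
            if random.random() < mask_prob:
                r = random.random()
                if r < 0.8:
                    masked.append('[MASK]')
                elif r < 0.9:
                    masked.append(random.choice(vocab))
                else:
                    masked.append(tok)
            else:
                masked.append(tok)
        return masked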

Datasets and Models

  • coop_spring2019/josh/data_sender_address_total_BRAT_DATASET : contains brat annotation and txt files for the data, sender, address, and total labels in receipts. The text extractions and original receipts can be found in coop_spring2019/josh/pdf_xtract_text.
