Utilities for training and working with NLP models in PyTorch
Project description
xt-nlp
This directory includes the most recent and complete version of the BERT model training code (April 30th 2019).
Training an NLP Model
Usage
To train a new BERT model, complete the following steps:
- Choose which model you plan to train:
  - Standard BERT-Base-Uncased, BERT-Large-Cased, etc.
    - Recommend using BERT-Base-Uncased: Base avoids space problems, and uncased yields better results.
  - BioBERT (three different versions: PubMed, PMC, PubMed + PMC)
    - Recommend using PubMed + PMC. Find models here. The model is also downloaded in INSERT DIR.
  - SciBERT
  - An already partially finetuned model (from above)
- Specify the file paths for the following files. Note that standard BERT models are downloaded to a cache directory with `pytorch-pretrained-bert`. A configuration sketch follows this list.
  - If finetuning, set the `s.finetune` boolean to `True`. The following settings depend on the specific model used:
    - `s.model_config_file`: name of the model config file for training from a finetuned model
    - `s.model_checkpoint_file`: name of the `.bin` file for training from a finetuned model
  - Standard BERT:
    - `s.model_type`: set to `bert-base-uncased`, etc.
    - `s.bert_standard_cache`: set to the path you want the pretrained weights downloaded to. Note this is optional; if left empty, the weights will download to a default location in the `pytorch-pretrained-bert` library.
  - BioBERT:
    - `s.model_type`: set to `biobert`
    - `s.biobert_raw_model_path`: path of the converted BioBERT model (see Converting BioBERT from TF to PyTorch)
    - `s.biobert_vocab_path`: path of `vocab.txt` for the BioBERT model
  - SciBERT:
    - Currently not supported.
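A minimal configuration sketch, assuming a settings object built as in `run_train.py`. Only the field names below appear in this README; the import path and example paths are assumptions:

```python
# Hypothetical import path; the SESSettings field names are from this README.
from xt_nlp.settings import SESSettings

s = SESSettings()

# Standard BERT: weights are fetched through pytorch-pretrained-bert.
s.model_type = 'bert-base-uncased'
s.bert_standard_cache = '/data/bert_cache'  # optional; defaults to the library's cache

# BioBERT instead: point at a checkpoint already converted from TF
# (see Converting BioBERT from TF to PyTorch below).
# s.model_type = 'biobert'
# s.biobert_raw_model_path = '/data/biobert_pubmed_pmc'
# s.biobert_vocab_path = '/data/biobert_pubmed_pmc/vocab.txt'

# Resuming from a partially finetuned model (from a previous run):
# s.finetune = True
# s.model_config_file = 'config.json'
# s.model_checkpoint_file = 'pytorch_model_END.bin'
```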
- Specify output directories for the model, log, and results files. Look at the `out_path` and `log_path` variables in `run_train.py`. A new folder will be created in these directories for each run; folder names follow the `new_folder` variable in `run_train.py`. The following files are created each run:
  - `models/[new folder]/ans_type.pkl`: list of strings containing all answer types for the run. This file is important! It specifies the order of the answer types! (A loading sketch follows this list.)
  - `models/[new folder]/config.json`: see above
  - `models/[new folder]/hyperparams.json`: JSON of all values of the SESSettings object for the training run
  - `models/[new folder]/pytorch_model_END.bin`: model saved at the end of training
  - `models/[new folder]/pytorch_model_BEST.bin`: model saved after the epoch with the highest F1 validation score
  - `models/[new folder]/results_[all answer types]_enum[epoch number].txt`: text results of all answers over all epochs for all validation examples. The file contains the original text, followed by the top predictions. Each prediction has the raw start and end logit scores, as well as the predicted final answer.
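Since the order stored in `ans_type.pkl` drives how model outputs are interpreted, a quick way to inspect it (the run folder name below is illustrative):

```python
import pickle

# Illustrative run folder; substitute your own new_folder name.
with open('models/run_001/ans_type.pkl', 'rb') as f:
    ans_types = pickle.load(f)  # list of answer-type strings, in training order

print(ans_types)
```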
- Load the data you plan to annotate. The `run_train.py` file depends on loading whatever your example text is into a list of SESExample objects. These functions are defined in `data_loader_main.py` and are called in `get_examples()` in `utils.py`. Depending on your data, you may need to change the function in `get_examples()` or write your own (a sketch follows this list). Some of the functions in `data_loader_main.py` are as follows:
  - `brat_read_select`: only reads brat annotations of answer types in the argument answer set
  - `brat_read_everything`: reads all answer types in brat annotation files
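A sketch of what a custom loader plugged into `get_examples()` might look like. The SESExample import path and constructor arguments shown are assumptions, not the library's actual signature:

```python
# Sketch only: SESExample's real constructor may differ -- check utils.py and
# data_loader_main.py before adapting this.
from xt_nlp import SESExample  # hypothetical import path

def get_examples():
    """Load raw text lines into the list of SESExample objects run_train.py expects."""
    examples = []
    with open('data/receipts.txt') as f:            # illustrative data path
        for idx, line in enumerate(f):
            text = line.strip()
            if text:
                examples.append(SESExample(idx, text))  # assumed (id, text) signature
    return examples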
- Choose run hyperparameters. The file `run_train.py` is set up for hyperparameter optimization (a sweep sketch follows this list).
- Run the training. From the root directory of this repo, run `python run_train.py`.
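`run_train.py` already handles the optimization loop; if you wanted to script a sweep externally instead, a minimal hypothetical pattern would be the following (the setting names and `train()` entry point are illustrative stand-ins for whatever `run_train.py` actually exposes):

```python
from itertools import product

for lr, bs in product([2e-5, 3e-5, 5e-5], [8, 16]):
    # s.learning_rate = lr   # assumed setting names on the SESSettings object
    # s.batch_size = bs
    # train(s)               # assumed training entry point
    print(f'would train with lr={lr}, batch_size={bs}')
```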
Inference with BERT
Usage
- Choose which model you plan to run inference with:
  - For a standard BERT model, set `s.model_type` to `'bert-standard'`.
  - For a BioBERT model, set `s.model_type` to `'biobert'`.
- Specify the file paths for the following files. Note that standard BERT models are downloaded to a cache directory with `pytorch-pretrained-bert`.
  - `s.model_config_file`: path of the model config file from the finetuned model
  - `s.model_checkpoint_file`: path of the `.bin` file from the finetuned model
  - `s.model_ans_list_file`: path of the `ans_type.pkl` file containing the model's answer types
- Run (or call) `run_infer`. This function will return character-level logits for the string input. A minimal sketch follows.
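A minimal end-to-end inference sketch. `run_infer`'s exact import path and signature are assumptions; the README only states that it returns character-level logits for a string input:

```python
# Hypothetical import paths and call signature; run folder is illustrative.
from xt_nlp.settings import SESSettings
from xt_nlp import run_infer

s = SESSettings()
s.model_type = 'bert-standard'
s.model_config_file = 'models/run_001/config.json'
s.model_checkpoint_file = 'models/run_001/pytorch_model_END.bin'
s.model_ans_list_file = 'models/run_001/ans_type.pkl'

text = 'ACME Corp. Invoice total: $123.45'
logits = run_infer(s, text)  # assumed (settings, text) signature
# logits: character-level start/end scores for each answer type
```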
Converting BioBERT from TF to PyTorch
When downloading BioBERT from the repo, you must convert the TF checkpoint to a PyTorch model. See this post.
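One way to do this is with the helper bundled in `pytorch-pretrained-bert` (module and function names as of the 0.6.x releases; verify against your installed version, and treat the paths as placeholders):

```python
from pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch import (
    convert_tf_checkpoint_to_pytorch,
)

convert_tf_checkpoint_to_pytorch(
    'biobert_pubmed_pmc/biobert_model.ckpt',   # TF checkpoint prefix from the BioBERT release
    'biobert_pubmed_pmc/bert_config.json',     # matching BERT config
    'biobert_pubmed_pmc/pytorch_model.bin',    # output PyTorch .bin
)
```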
Extensions and ToDo
- Train model from SciBERT baseline and look for improvements.
- Masking of tokens in finetuning (sketched below). By masking tokens (replacing them with random words or the [MASK] token) in the answer/all text, the model should learn to use the context around the answers to infer what the correct answer is. This prevents the model from learning to look up vocabulary when training on small datasets.
- Testing different types of layers/network sizes on the last layer of BERT's output. Currently we have multiple fully connected layers for each token. Layers are of size (0 to the length of ans_type). Deeper networks or other architectures might allow for BERT to answer 20+ different types of answer types without losing accuracy.
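An illustrative sketch of that masking idea (not implemented in this repo): randomly replace wordpiece tokens with `[MASK]` or random vocabulary words so the model must lean on surrounding context:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Randomly corrupt tokens: mostly [MASK], occasionally a random vocab word."""
    masked = []
    for tok in tokens:
        if random.random() < mask_prob:
            # 80% of masked positions get [MASK], 20% a random vocabulary word
            masked.append('[MASK]' if random.random() < 0.8 else random.choice(vocab))
        else:
            masked.append(tok)
    return masked
```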
Datasets and Models
- `coop_spring2019/josh/data_sender_address_total_BRAT_DATASET`: contains brat annotation and txt files for labels of data, sender, address, and total in receipts. The text extraction and original receipts can be found in `coop_spring2019/josh/pdf_xtract_text`.
File details
Details for the file `xt-nlp-0.2.1.tar.gz`.
File metadata
- Download URL: xt-nlp-0.2.1.tar.gz
- Upload date:
- Size: 21.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0.post20200106 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3e4a19c08352475d519e032bd1d057b977f0f051037afce3a7448e1ffce12e50
MD5 | 28c550efad2ac79bb8deceafcfb06a86
BLAKE2b-256 | 4d36cc57c654a94de462af6afac6bd637ab56aa02dad76cd0b3c4b6192caafda
File details
Details for the file `xt_nlp-0.2.1-py3-none-any.whl`.
File metadata
- Download URL: xt_nlp-0.2.1-py3-none-any.whl
- Upload date:
- Size: 22.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0.post20200106 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 43fbcb0893e1cbe570fa71b746f005545f8f53b1eb163aa56e6482d562bf5100
MD5 | badfb3d3fddc34ad9b223dc135158d92
BLAKE2b-256 | aa62e7a33cf56c640a317496b7534c5136017469b553cdd19904b2e90dce42c3