This is a pipeline designed for fine-tuning pre-trained transformers to perform Named Entity Recognition (NER) tasks.

NERP - NER Pipeline

What is it?

NERP (Named Entity Recognition Pipeline) is a Python package that provides a user-friendly pipeline for fine-tuning pre-trained transformers for Named Entity Recognition (NER) tasks.

Main Features include:

Support for multiple architectures, such as BiLSTM, CRF, and BiLSTM+CRF.
Fine-tuning of pre-trained models.
Ability to save and reload models and train them with new training data.
Fine-tuning of pre-trained models using K-Fold Cross-Validation.
Ability to save and reload models and train them with new training data using K-Fold Cross-Validation.
Fine-tuning of multiple pre-trained models.
Prediction on a single text.
Prediction on a CSV file.

Config

The user interface consists of a single YAML configuration file. Edit it to create the desired configuration.

Sample env.yaml file

torch:
  device: "cuda"
  seed: 42

data:
  train_data: 'data/train.csv'
  valid_data: 'data/valid.csv'
  train_valid_split: 0.2
  test_data: 'data/test.csv'
  parameters:
    sep: ','
    quoting: 3
    shuffle: False
  limit: 0
  tag_scheme: ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

model: 
  archi: "baseline"
  max_len: 128 
  dropout: 0.1
  num_workers: 1
  hyperparameters:
    epochs: 1
    warmup_steps: 500
    train_batch_size: 64
    valid_batch_size: 64
    lr: 0.0001
  tokenizer_parameters: 
    do_lower_case: True
  pretrained_models: 
    - roberta-base

training:
  continue_from_checkpoint: False
  checkpoint_path: "roberta-base/model.bin"
  checkpoint_tokenizer_path: "roberta-base/tokenizer"
  output_dir: "output/"
  o_tag_cr: True
  return_accuracy: False

kfold: 
  is_kfold: False
  splits: 2
  test_on_original: False

inference:
  pretrained: "roberta-base"
  model_path: "roberta-base/model.bin"
  tokenizer_path: "roberta-base/tokenizer"
  in_file_path: "data/test.csv"
  out_file_path: "data/output.csv"

Torch Parameters

Parameters	Description	Default	Type
device	The desired device to use for computation. If not provided by the user, the package will make a guess	`cuda` or `cpu`	str
seed	A random state value used for a specific experiment	42	int

Data Parameters

Parameters	Description	Default	Type
train_data	The path to the training CSV file		str
valid_data	The path to the validation CSV file		str
train_valid_split	The train/validation split ratio if there's no validation data	0.2	float
test_data	The path to the testing CSV file		str
sep	The delimiter to use	','	str
quoting	The behavior for field quoting per csv.QUOTE_* constants	3	int
shuffle	Whether to shuffle the entire dataset before training	False	bool
limit	The maximum number of observations to be returned from a given split. Defaults to 0, which returns the entire data split	0	int
tag_scheme	A list of all the available NER tags for the given dataset, excluding the special outside tag, which is handled separately		List[str]
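
The quoting value is an integer taken from Python's csv.QUOTE_* constants, so the 3 used in the sample config means csv.QUOTE_NONE. The sketch below illustrates the effect with the standard csv module; it is not NERP's own loader, and the in-memory sample and the read_rows helper are hypothetical.

```python
import csv
import io

# quoting: 3 in env.yaml is the numeric value of csv.QUOTE_NONE
print(csv.QUOTE_NONE)  # 3

# Hypothetical in-memory sample standing in for data/train.csv
sample = 'Sentence #,Word,Tag\nSentence: 0,"hello,O\nSentence: 0,world,O\n'


def read_rows(text, sep=",", quoting=csv.QUOTE_NONE):
    """Parse CSV text into a list of dicts using the given delimiter and quoting mode."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep, quoting=quoting)
    return list(reader)


rows = read_rows(sample)
# With QUOTE_NONE the lone double quote is kept as part of the token
print(rows[0]["Word"])
```

Disabling quote handling is useful for token-per-line data, where a token may itself be a quotation mark.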

Model Parameters

Parameters	Description	Default	Type
archi	The desired architecture for the model. It can be one of the following: baseline, bilstm-crf, bilstm, or crf	baseline	str
max_len	The maximum sentence length (number of tokens after applying the transformer tokenizer)	128	int
dropout	The dropout probability	0.1	float
epochs	The number of epochs	5	int
num_workers	The number of workers/threads for data processing	1	int
warmup_steps	The number of warmup steps for the optimizer	500	int
train_batch_size	The batch size for training DataLoader	64	int
valid_batch_size	The batch size for validation DataLoader	64	int
lr	The learning rate	0.0001	float
do_lower_case	Lowercase the sequence during the tokenization	True	bool
pretrained_models	A list of 'huggingface' transformer models	roberta-base	List[str]
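
To give a feel for how warmup_steps and lr interact, here is a minimal sketch of one common schedule: linear warmup followed by linear decay to zero. Whether NERP uses exactly this schedule is an assumption, and total_steps is a hypothetical value chosen for illustration.

```python
def lr_at_step(step, base_lr=0.0001, warmup_steps=500, total_steps=5000):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup period
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 2750, 5000):
    print(step, lr_at_step(step))
```

With the defaults above, the rate peaks at lr once step reaches warmup_steps. If a dataset is small enough that an epoch has fewer than 500 steps, a one-epoch run would spend the whole time in warmup under this schedule.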

Training Parameters

Parameters	Description	Default	Type
continue_from_checkpoint	Boolean flag to continue training from a previous checkpoint	False	bool
checkpoint_path	Path to the pre-trained model derived from the transformer		str
checkpoint_tokenizer_path	Path to the tokenizer derived from the transformer		str
output_dir	Path to the output directory	output/	str
o_tag_cr	Boolean flag to include O tag in the classification report	True	bool
return_accuracy	Boolean flag to return accuracy for every training step	False	bool

KFold Parameters

Parameters	Description	Default	Type
is_kfold	Enable K-Fold Cross-Validation for training	False	bool
splits	Number of splits for K-Fold Cross-Validation	0	int
test_on_original	Evaluate on the original test set for each iteration if set to True	False	bool
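
As a rough illustration of what splits controls, the sketch below partitions sentence indices into folds, each fold serving once as the test split. It uses only the standard library and is not NERP's implementation; the kfold_indices helper and the choice to shuffle with the configured seed are assumptions.

```python
import random


def kfold_indices(n_items, splits, seed=42, shuffle=True):
    """Yield (train_idx, test_idx) index lists for each of `splits` folds."""
    if splits < 2:
        raise ValueError("splits must be at least 2")
    indices = list(range(n_items))
    if shuffle:
        # A fixed seed keeps the fold assignment reproducible
        random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one
    sizes = [n_items // splits + (1 if i < n_items % splits else 0)
             for i in range(splits)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Ten hypothetical sentences split into two folds
folds = list(kfold_indices(n_items=10, splits=2))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

For NER data the indices should refer to whole sentences rather than individual token rows, so that a sentence never straddles the train and test splits.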

Inference Parameters

Parameters	Description	Default	Type
pretrained	A 'huggingface' transformer model to use for inference	roberta-base	str
model_path	Path to the trained model file		str
tokenizer_path	Path to the saved tokenizer folder		str
in_file_path	Path to the input file to be used for inference		str
out_file_path	Path to the output file for saving the inference results		str
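
Because everything is driven by one file, a quick sanity check before training can catch typos early. The sketch below checks a parsed config dictionary against a few constraints read off the tables above. The check_config helper and its rules are hypothetical (in particular, treating train_valid_split as a ratio strictly between 0 and 1 is an assumption), and NERP may validate differently or not at all.

```python
ALLOWED_ARCHI = {"baseline", "bilstm-crf", "bilstm", "crf"}


def check_config(cfg):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    archi = cfg.get("model", {}).get("archi", "baseline")
    if archi not in ALLOWED_ARCHI:
        problems.append(f"model.archi: unknown architecture {archi!r}")
    split = cfg.get("data", {}).get("train_valid_split", 0.2)
    if not 0 < split < 1:
        problems.append("data.train_valid_split: expected a ratio between 0 and 1")
    kfold = cfg.get("kfold", {})
    if kfold.get("is_kfold", False) and kfold.get("splits", 0) < 2:
        problems.append("kfold.splits: need at least 2 splits when is_kfold is True")
    training = cfg.get("training", {})
    if training.get("continue_from_checkpoint", False) and not training.get("checkpoint_path"):
        problems.append("training.checkpoint_path: required when continue_from_checkpoint is True")
    return problems


good = {"model": {"archi": "bilstm-crf"}, "kfold": {"is_kfold": True, "splits": 2}}
bad = {"model": {"archi": "lstm"}, "kfold": {"is_kfold": True, "splits": 1}}
print(check_config(good))
print(check_config(bad))
```

The dictionary could come from parsing env.yaml with a YAML library such as PyYAML (yaml.safe_load); the inline dictionaries here simply keep the sketch dependency-free.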

Data Format

The pipeline works with CSV files in which each line holds one token and its label. The Sentence # column identifies the sentence each token belongs to. Labels should already be in the required format, e.g. IO, BIO, or BILUO. The last three columns of the CSV file must match those shown below.

Sentence #	Word	Tag
Sentence: 0	i	O
Sentence: 0	was	O
Sentence: 0	at	O
Sentence: 0	h.w.	B-place
Sentence: 0	holdings	I-place
Sentence: 0	pte	I-place
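
To make the layout concrete, the following sketch groups token rows back into sentences using the Sentence # column. It relies only on the standard csv module and is not how NERP itself reads data; the group_sentences helper is hypothetical, the sample uses a comma separator to match the sample config, and the "Sentence: 1" row is invented for illustration.

```python
import csv
import io

# Rows from the table above plus one invented row for a second sentence
sample = (
    "Sentence #,Word,Tag\n"
    "Sentence: 0,i,O\n"
    "Sentence: 0,was,O\n"
    "Sentence: 0,at,O\n"
    "Sentence: 0,h.w.,B-place\n"
    "Sentence: 0,holdings,I-place\n"
    "Sentence: 0,pte,I-place\n"
    "Sentence: 1,hello,O\n"
)


def group_sentences(text, sep=","):
    """Map each sentence id to a (words, tags) pair of parallel lists."""
    sentences = {}
    reader = csv.DictReader(io.StringIO(text), delimiter=sep, quoting=csv.QUOTE_NONE)
    for row in reader:
        words, tags = sentences.setdefault(row["Sentence #"], ([], []))
        words.append(row["Word"])
        tags.append(row["Tag"])
    return sentences


sentences = group_sentences(sample)
for sentence_id, (words, tags) in sentences.items():
    print(sentence_id, list(zip(words, tags)))
```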

Output

Once the model training is complete, the pipeline will generate the following files in the output directory:

model.bin - PyTorch NER model
Tokenizer files
Classification-report.csv - a logging file
In case of k-fold training, the pipeline generates split datasets, models, tokenizers, and accuracy files for each iteration.

Models

All huggingface transformer-based models are allowed.

Usage

Environment Setup

Activate a new conda/python environment
Install NERP

via pip

pip install NERP==1.1-rc1

via repository

git clone --branch v1.1-rc1 https://github.com/Chaarangan/NERP.git
cd NERP && pip install -e .

Initialize NERP

from NERP.models import NERP
model = NERP("env.yaml")

Training

Common function to call

model.train()

There are several options depending on your needs:

Regular training: In the YAML file, set continue_from_checkpoint to False and is_kfold to False. Then call model.train().
Training from a previous checkpoint: Set continue_from_checkpoint to True and is_kfold to False, and specify the checkpoint_path. Then call model.train().
Training with KFold: Set continue_from_checkpoint to False and is_kfold to True, and specify the number of splits. To test each fold on your original test set rather than on its own test split, set test_on_original to True. Then call model.train().
Training from a previous checkpoint with KFold: Set continue_from_checkpoint to True and is_kfold to True, and specify the checkpoint_path. Then call model.train().
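
The four options above differ only in two flags, so they can be summarised as a small lookup. This is a reading aid rather than part of NERP: the training_mode helper is hypothetical, and listing kfold.splits for the combined checkpoint and KFold case is an assumption implied by the KFold option.

```python
# (continue_from_checkpoint, is_kfold) -> (description, settings to fill in)
MODES = {
    (False, False): ("single run, no checkpoint", []),
    (True, False): ("single run from a previous checkpoint",
                    ["training.checkpoint_path"]),
    (False, True): ("KFold run, no checkpoint", ["kfold.splits"]),
    # kfold.splits here is assumed; the text only names checkpoint_path
    (True, True): ("KFold run from a previous checkpoint",
                   ["training.checkpoint_path", "kfold.splits"]),
}


def training_mode(continue_from_checkpoint, is_kfold):
    """Return the description and required settings for a flag combination."""
    return MODES[(bool(continue_from_checkpoint), bool(is_kfold))]


description, required = training_mode(continue_from_checkpoint=True, is_kfold=False)
print(description, required)
```

In every case the call itself stays the same: model.train().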

Predictions

There are several options depending on your needs:

Prediction on a CSV file: In the YAML file, set model_path, tokenizer_path (if one exists), in_file_path, and out_file_path. Then call model.predict().

model.predict()

Prediction on text: In the YAML file, set model_path and tokenizer_path (if one exists). Then call model.predict_text("some text").

output = model.predict_text("Hello from NERP")
print(output)

License

MIT License

Shout-outs

Thanks to the NERDA package for inspiring us to develop this pipeline. We have integrated the NERDA framework into NERP, with some modifications to v1.0.0.

Changes from NERDA (1.0.0) in our NERDA submodule:

Method for saving and loading tokenizer
Selected pull requests' solutions were added from NERDA PRs
Implementation of the classification report
Added multiple network architecture support
Support for enforcing reproducibility in data preparation and model training

Contributing to NERP

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
Feel free to ask questions and send feedback on the mailing list.
If you want to contribute to NERP, open a PR.
If you encounter a bug or want to suggest an enhancement, please open an issue.

Contributors

PRs

@tanmaysurana (Tanmay Surana): add support for testing on multiple files, add additional parameters to maintain consistency across multiple experiments (validation batch size, shuffle, fixed seed), and improve loss computation algorithms PR #20

Cite this work

@inproceedings{medbert,
    author={Vasantharajan, Charangan and Tun, Kyaw Zin and Thi-Nga, Ho and Jain, Sparsh and Rong, Tong and Siong, Chng Eng},
    booktitle={2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
    title={MedBERT: A Pre-trained Language Model for Biomedical Named Entity Recognition},
    year={2022},
    volume={},
    number={},
    pages={1482-1488},
    doi={10.23919/APSIPAASC55919.2022.9980157}
}
@inproceedings{nerp,
  title = {NERP},
  author = {Charangan Vasantharajan and Kyaw Zin Tun and Lim Zhi Hao and Chng Eng Siong},
  year = {2022},
  publisher = {{GitHub}},
  url = {https://github.com/Chaarangan/NERP.git}
}

Release history

1.1rc1 (pre-release, this version): Mar 9, 2023
1.0.2.2: Aug 30, 2022
1.0.2.1: Aug 26, 2022
1.0.2: Jul 31, 2022
1.0.0: May 24, 2022

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

File	Size	Uploaded	Python	Platform
NERP-1.1rc1-py3-none-win_amd64.whl	27.2 kB	Mar 9, 2023	Python 3	Windows x86-64
NERP-1.1rc1-py3-none-win32.whl	27.2 kB	Mar 9, 2023	Python 3	Windows x86
NERP-1.1rc1-py3-none-manylinux_2_17_x86_64.whl	27.2 kB	Mar 9, 2023	Python 3	manylinux: glibc 2.17+ x86-64
NERP-1.1rc1-py3-none-manylinux_2_17_aarch64.whl	27.2 kB	Mar 9, 2023	Python 3	manylinux: glibc 2.17+ ARM64
NERP-1.1rc1-py3-none-macosx_11_0_arm64.whl	27.2 kB	Mar 9, 2023	Python 3	macOS 11.0+ ARM64
NERP-1.1rc1-py3-none-macosx_10_9_x86_64.whl	27.2 kB	Mar 9, 2023	Python 3	macOS 10.9+ x86-64
NERP-1.1rc1-py3-none-macosx_10_9_universal2.whl	27.2 kB	Mar 9, 2023	Python 3	macOS 10.9+ universal2 (ARM64, x86-64)

Hashes for NERP-1.1rc1-py3-none-win_amd64.whl

Algorithm	Hash digest
SHA256	`14fbf72971eb8a5c26411aa0c0d2c47adddc9221bbb7ece6ddca3968a18593b2`
MD5	`a6e3e5921a5e39825b6a507e10b7b543`
BLAKE2b-256	`feea0ca6ffb889ed2d79b579e520f97f417db1ab525fb71a86e24b964bfa4dea`

Hashes for NERP-1.1rc1-py3-none-win32.whl

Algorithm	Hash digest
SHA256	`b619294709524e49a78e6243f988ec8e5e623eb27c146f60aa1448665a0ee610`
MD5	`d0abc88ab46985df9cc7033191cb36ae`
BLAKE2b-256	`ab50e58356af5855e7ea259e9c6382300b70e7a270187c1e2ee5bbedad690e6a`

Hashes for NERP-1.1rc1-py3-none-manylinux_2_17_x86_64.whl

Algorithm	Hash digest
SHA256	`4a58e82d9770e38f3e998e0e17e77effcba13ef0d5cbdc693456a7d17dc4e684`
MD5	`2b78503d0634f5a5ac6bdbb0ffb4231e`
BLAKE2b-256	`225d96bb7637ceee4baebb0fee4cc9a61700c29290d6aebb1c743448996054ab`

Hashes for NERP-1.1rc1-py3-none-manylinux_2_17_aarch64.whl

Algorithm	Hash digest
SHA256	`0e2885900487e34cbeffcdf6f4f077f0df9c2f23f8f6b96223b0f6dd93099ea1`
MD5	`122dd445380eca2ea816414e34559240`
BLAKE2b-256	`7c28b62fbc0e85ff26c427e3aedfd112b988d96d9c02da14e3002b733c19ac29`

Hashes for NERP-1.1rc1-py3-none-macosx_11_0_arm64.whl

Algorithm	Hash digest
SHA256	`7258ef235e5e72de6ae95594d73e4a2296fe2618c2bb214ca62bf4c81f7f78de`
MD5	`8bee7f41ceeeea0c2f4a83ef64e87d53`
BLAKE2b-256	`7b012271a92e7c059385742c1e48ac6a89e242ab56fe3210eef70a3bacef6b4b`

Hashes for NERP-1.1rc1-py3-none-macosx_10_9_x86_64.whl

Algorithm	Hash digest
SHA256	`b8ba2888aeb3386d9b7c76e39bd20d8945852b3bd7f9f776060d91931a6eec79`
MD5	`cb75986535b32aaef1482441ba5283d4`
BLAKE2b-256	`7cbf71ee54c21e4192b38533f184342f3bd91ad5add786ed9e3b3e0f33ac82df`

Hashes for NERP-1.1rc1-py3-none-macosx_10_9_universal2.whl

Algorithm	Hash digest
SHA256	`05789c23623bc556d93d5fbf92df20f497aab1797b9b536b466a829919fc5ecf`
MD5	`71bbd906b87fbb680153281ed492620a`
BLAKE2b-256	`7f7ecce1ac5cdbfd58634b7e624b51b6489d33cf4602f5e21288f85d557d750e`
