
A pipeline for fine-tuning pre-trained transformers for Named Entity Recognition (NER) tasks

Project description

NERP - NER Pipeline

What is it?

NERP (Named Entity Recognition Pipeline) is a python package that offers an easy-to-use pipeline for fine-tuning pre-trained transformers for Named Entity Recognition (NER) tasks.

Main Features

  • Multiple architectures (baseline, BiLSTM, CRF, BiLSTM-CRF)
  • Fine-tune a pretrained model
  • Save and reload a model and train it on new training data
  • Fine-tune a pretrained model with K-Fold Cross-Validation
  • Save and reload a model and train it on new training data with K-Fold Cross-Validation
  • Fine-tune multiple pretrained models
  • Prediction on a single text
  • Prediction on a CSV file

Package Diagram

NERP Main Component
Component of NERP K-Fold Cross Validation
Component of NERP Inference

Config

The user interface consists of a single YAML config file. Edit it to create the desired configuration.

Sample env.yaml file

torch:
  device: "cuda"
data:
  train_data: 'data/train.csv'
  valid_data: 'data/valid.csv'
  train_valid_split: None
  test_data: 'data/test.csv'
  limit: 10
  tag_scheme: ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

model: 
  archi: "baseline"
  o_tag_cr: True
  max_len: 128 
  dropout: 0.1
  hyperparameters:
    epochs: 1
    warmup_steps: 500
    train_batch_size: 64
    learning_rate: 0.0001
  tokenizer_parameters: 
    do_lower_case: True
  pretrained_models: 
    - roberta-base

train:
  existing_model_path: "roberta-base/model.bin"
  existing_tokenizer_path: "roberta-base/tokenizer"
  output_dir: "output/"

kfold: 
  splits: 2
  seed: 42
  test_on_original: False

inference:
  archi: "bilstm-crf"
  max_len: 128 
  pretrained: "roberta-base"
  model_path: "roberta-base/model.bin"
  tokenizer_path: "roberta-base/tokenizer"
  bulk:
    in_file_path: "data/test.csv"
    out_file_path: "data/output.csv"
  individual:
    text: "Hello from NERP"
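
Once NERP parses this file (internally, with a standard YAML loader), the configuration is just a nested mapping. As an illustrative sketch, assuming a hypothetical `lookup` helper (not part of NERP's API), this is what reading nested values with fallbacks to the documented defaults looks like:

```python
# A nested dict mirroring part of the env.yaml above (what a YAML
# parser such as PyYAML would produce from the file).
config = {
    "torch": {"device": "cuda"},
    "data": {
        "train_data": "data/train.csv",
        "tag_scheme": ["B-PER", "I-PER", "B-ORG", "I-ORG"],
    },
    "model": {
        "archi": "baseline",
        "hyperparameters": {"epochs": 1, "learning_rate": 0.0001},
    },
}

def lookup(cfg, path, default=None):
    """Walk a dotted path like 'model.hyperparameters.epochs',
    returning `default` when any key along the way is missing."""
    node = cfg
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

print(lookup(config, "model.archi"))                   # baseline
print(lookup(config, "model.dropout", default=0.1))    # falls back to 0.1
```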

Training Parameters

| Parameter | Description | Default | Type |
|---|---|---|---|
| device | the desired device to use for computation; if not provided, a sensible default is guessed | cuda or cpu | optional |
| train_data | path to the training CSV file | | required |
| valid_data | path to the validation CSV file | | optional |
| train_valid_split | train/valid split ratio, used when no validation data is provided | 0.2 | optional |
| test_data | path to the testing CSV file | | required |
| limit | limit on the number of observations returned from a given split; 0 returns the entire split (int) | 0 (whole data) | optional |
| tag_scheme | all available NER tags for the given dataset, EXCLUDING the special outside tag, which is handled separately | | required |
| archi | the desired model architecture (baseline, bilstm-crf, bilstm, crf) (str) | baseline | optional |
| o_tag_cr | whether to include the O tag in the classification report (bool) | True | optional |
| max_len | the maximum sentence length (number of tokens after applying the transformer tokenizer) | 128 | optional |
| dropout | dropout probability (float) | 0.1 | optional |
| epochs | number of epochs (int) | 5 | optional |
| warmup_steps | number of learning-rate warmup steps (int) | 500 | optional |
| train_batch_size | batch size for the DataLoader (int) | 64 | optional |
| learning_rate | learning rate (float) | 0.0001 | optional |
| tokenizer_parameters | hyperparameters for the tokenizer (dict) | do_lower_case: True | optional |
| pretrained_models | 'huggingface' transformer model (str) | roberta-base | required |
| existing_model_path | path to a model derived from the transformer (str) | | optional |
| existing_tokenizer_path | path to a tokenizer derived from the transformer (str) | | optional |
| output_dir | path to the output directory (str) | models/ | optional |
| kfold | number of splits (int) | 0 (no k-fold) | optional |
| seed | random state for k-fold (int) | 42 | optional |
| test_on_original | whether to test on the original test set at each iteration (bool) | False | optional |
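
The required/optional split above can be mimicked outside NERP. The sketch below uses hypothetical `resolve`, `REQUIRED`, and `DEFAULTS` names (NERP performs its own validation internally); it checks required keys and fills documented defaults:

```python
# Required vs. optional training parameters, per the table above.
REQUIRED = ["train_data", "test_data", "tag_scheme", "pretrained_models"]
DEFAULTS = {  # documented defaults for a few optional parameters
    "train_valid_split": 0.2,
    "archi": "baseline",
    "max_len": 128,
    "dropout": 0.1,
    "epochs": 5,
    "warmup_steps": 500,
    "train_batch_size": 64,
    "learning_rate": 0.0001,
    "output_dir": "models/",
    "seed": 42,
}

def resolve(params):
    """Reject configs missing required keys, then overlay user
    values on the documented defaults (sketch only)."""
    missing = [k for k in REQUIRED if k not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return {**DEFAULTS, **params}

cfg = resolve({
    "train_data": "data/train.csv",
    "test_data": "data/test.csv",
    "tag_scheme": ["B-PER", "I-PER"],
    "pretrained_models": ["roberta-base"],
    "epochs": 1,
})
print(cfg["epochs"], cfg["dropout"])  # 1 0.1
```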

Inference Parameters

| Parameter | Description | Default | Type |
|---|---|---|---|
| archi | the architecture of the trained model (baseline, bilstm-crf, bilstm, crf) (str) | baseline | optional |
| max_len | the maximum sentence length (number of tokens after applying the transformer tokenizer) | 128 | optional |
| pretrained | 'huggingface' transformer model | roberta-base | required |
| model_path | path to the trained model | | required |
| tokenizer_path | path to the saved tokenizer folder | | optional |
| tag_scheme | all available NER tags for the given dataset, EXCLUDING the special outside tag, which is handled separately | | required |
| in_file_path | path to the inference file; leave empty for individual prediction | | optional |
| out_file_path | path to the output file for bulk inference; leave empty otherwise | | optional |
| text | sample inference text for individual prediction | "Hello from NERP" | optional |
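
The crf and bilstm-crf architectures pick the best tag sequence jointly instead of per token. A toy, framework-free sketch of Viterbi decoding, the standard CRF inference step (illustrative only; NERP's CRF layer is part of the model and the scores below are made up):

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path.
    emissions: per-token dict tag -> score; transitions: (prev, cur) -> score."""
    # table[i][tag] = (best path score ending in tag at token i, backpointer)
    table = [{t: (emissions[0].get(t, 0.0), None) for t in tags}]
    for emit in emissions[1:]:
        row = {}
        for cur in tags:
            prev, score = max(
                ((p, table[-1][p][0] + transitions.get((p, cur), 0.0)) for p in tags),
                key=lambda x: x[1],
            )
            row[cur] = (score + emit.get(cur, 0.0), prev)
        table.append(row)
    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: table[-1][t][0])
    path = [tag]
    for row in reversed(table[1:]):
        tag = row[tag][1]
        path.append(tag)
    return path[::-1]

# A transition score that rewards B-PER -> I-PER and forbids O -> I-PER
# flips the second token from the slightly higher-scoring O to I-PER.
transitions = {("O", "I-PER"): -10.0, ("B-PER", "I-PER"): 2.0}
emissions = [{"B-PER": 2.0, "O": 1.0}, {"I-PER": 1.5, "O": 1.4}]
print(viterbi(emissions, transitions, ["O", "B-PER", "I-PER"]))
# ['B-PER', 'I-PER']
```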

Data Format

The pipeline works with CSV files containing tokens and their labels in separate columns, one token per line. Sentences are identified by the Sentence # column. Labels should already be in the required scheme, e.g. IO, BIO, BILUO. The CSV file must contain the last three columns shown below.

|   | Unnamed: 0 | Sentence # | Word | Tag |
|---|---|---|---|---|
| 0 | 0 | Sentence: 0 | i | O |
| 1 | 1 | Sentence: 0 | was | O |
| 2 | 2 | Sentence: 0 | at | O |
| 3 | 3 | Sentence: 0 | h.w. | B-place |
| 4 | 4 | Sentence: 0 | holdings | I-place |
| 5 | 5 | Sentence: 0 | pte | I-place |
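
Grouping such rows back into token and tag sequences is straightforward with the standard library; a sketch (NERP's internal loading may differ):

```python
import csv
import io

# In-memory copy of the sample rows above.
raw = """,Unnamed: 0,Sentence #,Word,Tag
0,0,Sentence: 0,i,O
1,1,Sentence: 0,was,O
2,2,Sentence: 0,at,O
3,3,Sentence: 0,h.w.,B-place
4,4,Sentence: 0,holdings,I-place
5,5,Sentence: 0,pte,I-place
"""

# Collect words and tags per sentence id.
sentences, tags = {}, {}
for row in csv.DictReader(io.StringIO(raw)):
    sid = row["Sentence #"]
    sentences.setdefault(sid, []).append(row["Word"])
    tags.setdefault(sid, []).append(row["Tag"])

print(sentences["Sentence: 0"])  # ['i', 'was', 'at', 'h.w.', 'holdings', 'pte']
print(tags["Sentence: 0"])       # ['O', 'O', 'O', 'B-place', 'I-place', 'I-place']
```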

Output

After training the model, the pipeline will return the following files in the output directory:

  • model.bin - PyTorch NER model
  • tokenizer files
  • classification-report.csv - logging file
  • If k-fold is used: the split datasets, a model and tokenizer for each iteration, and an accuracy file

Models

All huggingface transformer-based models are allowed.


Usage

Environment Setup

  1. Activate a new conda/python environment
  2. Install NERP
  • via pip
pip install NERP==1.0.2.1
  • via repository
git clone --branch v1.0.2.1 https://github.com/Chaarangan/NERP.git
cd NERP && pip install -e .

Initialize NERP

from NERP.models import NERP
model = NERP("env.yaml")

Training a NER model using NERP

  1. Train a base model
model.train()
  2. Train an already trained model by loading its weights
model.train_after_load_network()
  3. Train with K-Fold Cross-Validation
model.train_with_kfold()
  4. Train an already trained model with K-Fold Cross-Validation after loading its weights
model.train_with_kfold_after_loading_network()
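
Conceptually, the k-fold variants partition the training data into `splits` folds using `seed`. A stdlib-only sketch of the index bookkeeping, using the hypothetical helper `kfold_indices` (not NERP's actual implementation):

```python
import random

def kfold_indices(n, splits, seed):
    """Yield (train_indices, valid_indices) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # reproducible shuffle, as with seed: 42
    fold = n // splits
    for k in range(splits):
        # Last fold absorbs any remainder.
        valid = idx[k * fold:(k + 1) * fold] if k < splits - 1 else idx[k * fold:]
        vset = set(valid)
        train = [i for i in idx if i not in vset]
        yield train, valid

# Matching the sample config: splits: 2, seed: 42.
for train, valid in kfold_indices(10, splits=2, seed=42):
    print(len(train), len(valid))  # 5 5
```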

Inference of a NER model using NERP

  1. Prediction on a single text through the YAML file
output = model.inference_text()
print(output)
  2. Prediction on a single text through direct input
output = model.predict("Hello from NERP")
print(output)
  3. Prediction on a CSV file
model.inference_bulk()
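
Token-level predictions in a BIO scheme can be collapsed into entity spans. The sketch below assumes a plain (tokens, tags) pair; the exact shape of `model.predict`'s output is not specified here, and `bio_to_spans` is a hypothetical helper:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # continue the open entity
        else:                             # "O", or an I- tag with no open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

# Using the Data Format example above as mock predictions.
tokens = ["i", "was", "at", "h.w.", "holdings", "pte"]
tags = ["O", "O", "O", "B-place", "I-place", "I-place"]
print(bio_to_spans(tokens, tags))  # [('place', 'h.w. holdings pte')]
```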

License

MIT License

Shout-outs

  • Thanks to the NERDA package for inspiring us to develop this pipeline. We have integrated the NERDA framework into NERP with some modifications from v1.0.0.

Changes from NERDA (v1.0.0) to our NERDA submodule:

  1. Method for saving and loading tokenizer
  2. Fixes from selected NERDA pull requests incorporated
  3. Implementation of the classification report
  4. Added multiple network architecture support

Cite this work

@inproceedings{nerp,
  title = {NERP},
  author = {Charangan Vasantharajan and Kyaw Zin Tun and Lim Zhi Hao and Chng Eng Siong},
  year = {2022},
  publisher = {{GitHub}},
  url = {https://github.com/Chaarangan/NERP.git}
}

Contributing to NERP

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Feel free to ask questions and send feedback on the mailing list.

If you want to contribute to NERP, open a PR.

If you encounter a bug or want to suggest an enhancement, please open an issue.
