
A pipeline for fine-tuning pre-trained transformers for Named Entity Recognition (NER) tasks

Project description

NERP - NER Pipeline

What is it?

NERP (Named Entity Recognition Pipeline) is a Python package that offers an easy-to-use pipeline for fine-tuning pre-trained transformers for Named Entity Recognition (NER) tasks.

Main Features

  • Different architectures (BiLSTM, CRF, BiLSTM+CRF)
  • Fine-tune a pretrained model
  • Save and reload a model and train it on new training data
  • Fine-tune a pretrained model with K-Fold Cross-Validation
  • Save and reload a model and train it on new training data with K-Fold Cross-Validation
  • Fine-tune multiple pretrained models
  • Prediction on a single text
  • Prediction on a CSV file

Config

The user interface consists of a single YAML configuration file. Edit it to create the desired configuration.

Sample env.yaml file

torch:
  device: "cuda"
data:
  train_data: 'data/train.csv'
  valid_data: 'data/valid.csv'
  train_valid_split: None
  test_data: 'data/test.csv'
  limit: 10
  tag_scheme: ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

model: 
  archi: "baseline"
  o_tag_cr: True
  max_len: 128 
  dropout: 0.1
  hyperparameters:
    epochs: 1
    warmup_steps: 500
    train_batch_size: 64
    learning_rate: 0.0001
  tokenizer_parameters: 
    do_lower_case: True
  pretrained_models: 
    - roberta-base

train:
  existing_model_path: "roberta-base/model.bin"
  existing_tokenizer_path: "roberta-base/tokenizer"
  output_dir: "output/"

kfold: 
  splits: 2
  seed: 42
  test_on_original: False

inference:
  archi: "bilstm-crf"
  max_len: 128 
  pretrained: "roberta-base"
  model_path: "roberta-base/model.bin"
  tokenizer_path: "roberta-base/tokenizer"
  bulk:
    in_file_path: "data/test.csv"
    out_file_path: "data/output.csv"
  individual:
    text: "Hello from NERP"

Training Parameters

| Parameter | Description | Default | Type |
| --- | --- | --- | --- |
| device | the desired device to use for computation; if not provided by the user, a guess is made | cuda or cpu | optional |
| train_data | path to the training CSV file | | required |
| valid_data | path to the validation CSV file | | optional |
| train_valid_split | train/validation split ratio, used if no validation data is provided (float) | 0.2 | optional |
| test_data | path to the testing CSV file | | required |
| limit | limit the number of observations returned from a given split; None returns the entire split (int) | 0 (whole data) | optional |
| tag_scheme | all available NER tags for the given dataset, EXCLUDING the special outside tag, which is handled separately | | required |
| archi | the desired model architecture: baseline, bilstm-crf, bilstm, or crf (str) | baseline | optional |
| o_tag_cr | whether to include the O tag in the classification report (bool) | True | optional |
| max_len | the maximum sentence length (number of tokens after applying the transformer tokenizer) (int) | 128 | optional |
| dropout | dropout probability (float) | 0.1 | optional |
| epochs | number of epochs (int) | 5 | optional |
| warmup_steps | number of learning-rate warmup steps (int) | 500 | optional |
| train_batch_size | batch size for the DataLoader (int) | 64 | optional |
| learning_rate | learning rate (float) | 0.0001 | optional |
| tokenizer_parameters | hyperparameters for the tokenizer (dict) | do_lower_case: True | optional |
| pretrained_models | 'huggingface' transformer models (str) | roberta-base | required |
| existing_model_path | path to a model derived from the transformer (str) | | optional |
| existing_tokenizer_path | path to a tokenizer derived from the transformer (str) | | optional |
| output_dir | path to the output directory (str) | models/ | optional |
| kfold | number of splits (int) | 0 (no k-fold) | optional |
| seed | random state for k-fold (int) | 42 | optional |
| test_on_original | True if you need to test on the original test set in each iteration (bool) | False | optional |
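When valid_data is omitted, train_valid_split controls how the training data is divided. A minimal sketch of how a 0.2 ratio could be applied (illustrative only; the function name and details are assumptions, not NERP's actual implementation):

```python
import random

def split_train_valid(sentences, valid_ratio=0.2, seed=42):
    """Shuffle sentence indices and carve off a validation share (illustrative)."""
    idx = list(range(len(sentences)))
    random.Random(seed).shuffle(idx)
    n_valid = int(len(idx) * valid_ratio)
    valid = [sentences[i] for i in idx[:n_valid]]
    train = [sentences[i] for i in idx[n_valid:]]
    return train, valid

train, valid = split_train_valid([f"sent-{i}" for i in range(10)])
print(len(train), len(valid))  # 8 2
```

Shuffling before the split avoids an ordering bias when the CSV groups similar sentences together.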

Inference Parameters

| Parameter | Description | Default | Type |
| --- | --- | --- | --- |
| archi | the architecture of the trained model: baseline, bilstm-crf, bilstm, or crf (str) | baseline | optional |
| max_len | the maximum sentence length (number of tokens after applying the transformer tokenizer) | 128 | optional |
| pretrained | 'huggingface' transformer model | roberta-base | required |
| model_path | path to the trained model | | required |
| tokenizer_path | path to the saved tokenizer folder | | optional |
| tag_scheme | all available NER tags for the given dataset, EXCLUDING the special outside tag, which is handled separately | | required |
| in_file_path | path to the inference file; otherwise leave empty | | optional |
| out_file_path | path to the output file if the input is a file; otherwise leave empty | | optional |
| text | sample inference text for individual prediction | "Hello from NERP" | optional |

Data Format

The pipeline works with CSV files containing separated tokens and labels on each line. Sentences are identified via the Sentence # column. Labels should already be in the required format, e.g. IO, BIO, BILUO, etc. The last three columns of the CSV file must match the example below.

|   | Unnamed: 0 | Sentence #  | Word     | Tag     |
| - | ---------- | ----------- | -------- | ------- |
| 0 | 0          | Sentence: 0 | i        | O       |
| 1 | 1          | Sentence: 0 | was      | O       |
| 2 | 2          | Sentence: 0 | at       | O       |
| 3 | 3          | Sentence: 0 | h.w.     | B-place |
| 4 | 4          | Sentence: 0 | holdings | I-place |
| 5 | 5          | Sentence: 0 | pte      | I-place |
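A file in this layout can be produced with the standard csv module. A small sketch, using the column names from the example above:

```python
import csv
import io

# Token/tag rows for one sentence, mirroring the example above
rows = [
    (0, 0, "Sentence: 0", "i", "O"),
    (1, 1, "Sentence: 0", "was", "O"),
    (2, 2, "Sentence: 0", "at", "O"),
    (3, 3, "Sentence: 0", "h.w.", "B-place"),
    (4, 4, "Sentence: 0", "holdings", "I-place"),
    (5, 5, "Sentence: 0", "pte", "I-place"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["", "Unnamed: 0", "Sentence #", "Word", "Tag"])
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text.splitlines()[0])  # ,Unnamed: 0,Sentence #,Word,Tag
```

In practice you would write to a real file (e.g. data/train.csv) instead of an in-memory buffer.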

Output

After training the model, the pipeline will return the following files in the output directory:

  • model.bin - PyTorch NER model
  • tokenizer files
  • classification-report.csv - logging file
  • If k-fold is used: split datasets, models, and tokenizers for each iteration, plus an accuracy file

Models

All Hugging Face transformer-based models are supported.


Usage

Environment Setup

  1. Activate a new conda/python environment
  2. Install NERP
  • via pip
pip install NERP==1.0.2.2
  • via repository
git clone --branch v1.0.2.2 https://github.com/Chaarangan/NERP.git
cd NERP && pip install -e .

Initialize NERP

from NERP.models import NERP
model = NERP("env.yaml")

Training a NER model using NERP

  1. Train a base model
model.train()
  2. Train an already trained model by loading its weights
model.train_after_load_network()
  3. Train with K-Fold Cross-Validation
model.train_with_kfold()
  4. Train an already trained model with K-Fold Cross-Validation after loading its weights
model.train_with_kfold_after_loading_network()
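Conceptually, the k-fold variants partition the training data into `splits` folds and train once per fold, validating on the held-out fold each time. A simplified sketch of the fold partitioning (an illustration of the idea, not NERP's internal code):

```python
import random

def kfold_indices(n_samples, splits=2, seed=42):
    """Yield (train_idx, valid_idx) index pairs for each fold (illustrative)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size = n_samples // splits
    for k in range(splits):
        valid = idx[k * fold_size:(k + 1) * fold_size]
        held_out = set(valid)
        train = [i for i in idx if i not in held_out]
        yield train, valid

folds = list(kfold_indices(10, splits=2, seed=42))
print(len(folds))  # 2
```

Fixing the seed (the `seed` key in the YAML config) makes the fold assignment reproducible across runs.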

Inference of a NER model using NERP

  1. Prediction on a single text through the YAML file
output = model.inference_text()
print(output)
  2. Prediction on a single text through direct input
output = model.predict("Hello from NERP")
print(output)
  3. Prediction on a CSV file
model.inference_bulk()
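Predictions come back as per-token tags in the configured tag scheme (the exact output structure may vary by version). A generic helper for collapsing BIO-tagged tokens into entity spans, which is not part of NERP's API but is often useful for post-processing:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any span in progress
            if current_toks:
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = tag[2:], [tok]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the current entity
            current_toks.append(tok)
        else:
            # O tag (or a mismatched I- tag) ends the current span
            if current_toks:
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = None, []
    if current_toks:
        spans.append((current_type, " ".join(current_toks)))
    return spans

print(bio_to_spans(
    ["i", "was", "at", "h.w.", "holdings", "pte"],
    ["O", "O", "O", "B-place", "I-place", "I-place"],
))  # [('place', 'h.w. holdings pte')]
```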

License

MIT License

Shout-outs

  • Thanks to the NERDA package for inspiring us to develop this pipeline. We have integrated the NERDA framework into NERP, with some modifications, since v1.0.0.

Changes from NERDA (v1.0.0) in our NERDA submodule:

  1. Method for saving and loading the tokenizer
  2. Solutions from selected NERDA pull requests were incorporated
  3. Implementation of the classification report
  4. Added multiple network architecture support

Cite this work

@inproceedings{nerp,
  title = {NERP},
  author = {Charangan Vasantharajan and Kyaw Zin Tun and Lim Zhi Hao and Chng Eng Siong},
  year = {2022},
  publisher = {{GitHub}},
  url = {https://github.com/Chaarangan/NERP.git}
}

Contributing to NERP

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Feel free to ask questions and send feedback on the mailing list.

If you want to contribute to NERP, open a PR.

If you encounter a bug or want to suggest an enhancement, please open an issue.
