A pipeline for fine-tuning pre-trained transformers for Named Entity Recognition (NER) tasks

NERP - Pipeline for training NER models

What is it?

NERP (Named Entity Recognition Pipeline) is a Python package that offers an easy-to-use pipeline for fine-tuning pre-trained transformers for Named Entity Recognition (NER) tasks.

Main Features

  • Fine-tune a pretrained model
  • Save and reload a model, then train it on new training data
  • Fine-tune a pretrained model with K-Fold Cross-Validation
  • Save and reload a model, then train it on new training data with K-Fold Cross-Validation
  • Fine-tune multiple pretrained models
  • Prediction on a single text
  • Prediction on a CSV file

Config

The user interface consists of a single YAML configuration file. Edit it to create the desired configuration.

Sample env.yaml file

torch:
  device: "cuda"
data:
  train_data: 'data/train.csv'
  train_valid_split: 0.2
  test_data: 'data/test.csv'
  limit: 10
  tag_scheme: ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

model: 
  max_len: 128 
  dropout: 0.1
  hyperparameters:
    epochs: 1
    warmup_steps: 500
    train_batch_size: 64
    learning_rate: 0.0001
  tokenizer_parameters: 
    do_lower_case: True
  pretrained_models: 
    - roberta-base

train:
  is_model_exists: True
  existing_model_path: "roberta-base/model.bin"
  existing_tokenizer_path: "roberta-base/tokenizer"
  output_dir: "output/"

kfold: 
  splits: 2
  seed: 42

inference:
  max_len: 128 
  pretrained: "roberta-base"
  model_path: "roberta-base/model.bin"
  tokenizer_path: "roberta-base/tokenizer"
  bulk:
    in_file_path: "data/test.csv"
    out_file_path: "data/output.csv"
  individual:
    text: "Hello from NERP"
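
Before starting a long training run, it can help to check that the required keys are present. The sketch below assumes the YAML file has already been parsed into a nested dict (for example by a YAML library) and that the required keys are the ones marked "required" under Training Parameters. The helper name `missing_required` is hypothetical and is not part of NERP.

```python
# Keys marked "required" under Training Parameters, grouped by the
# section of the sample file they appear in (assumption for illustration).
REQUIRED = {
    "data": ["train_data", "test_data", "tag_scheme"],
    "model": ["pretrained_models"],
}


def missing_required(config):
    """Return dotted names of required keys that are absent or empty."""
    missing = []
    for section, keys in REQUIRED.items():
        values = config.get(section) or {}
        for key in keys:
            if not values.get(key):
                missing.append(f"{section}.{key}")
    return missing


# A dict mirroring part of the sample file, as a YAML parser would return it.
config = {
    "data": {
        "train_data": "data/train.csv",
        "test_data": "data/test.csv",
        "tag_scheme": ["B-PER", "I-PER"],
    },
    "model": {"pretrained_models": ["roberta-base"]},
}

print(missing_required(config))  # []
```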

Training Parameters

| Parameters | Description | Default | Type |
|---|---|---|---|
| device | The device to use for computation. If not provided by the user, NERP takes a guess. | cuda or cpu | optional |
| train_data | Path to the training CSV file | | required |
| train_valid_split | Train/validation split ratio | 0.2 | optional |
| test_data | Path to the testing CSV file | | required |
| limit | Limits the number of observations returned from a given split (int). By default the entire data split is returned. | 0 (whole data) | optional |
| tag_scheme | All available NER tags for the given data set, EXCLUDING the special outside tag, which is handled separately | | required |
| max_len | The maximum sentence length (number of tokens after applying the transformer tokenizer) | 128 | optional |
| dropout | Dropout probability (float) | 0.1 | optional |
| epochs | Number of epochs (int) | 5 | optional |
| warmup_steps | Number of learning rate warmup steps (int) | 500 | optional |
| train_batch_size | Batch size for the DataLoader (int) | 64 | optional |
| learning_rate | Learning rate (float) | 0.0001 | optional |
| tokenizer_parameters | List of hyperparameters for the tokenizer | do_lower_case: True | optional |
| pretrained_models | Hugging Face transformer model | roberta-base | required |
| existing_model_path | Model derived from the transformer | | optional |
| existing_tokenizer_path | Tokenizer derived from the transformer | | optional |
| output_dir | Path to the output directory | models/ | optional |
| kfold | Number of splits | 0 (no k-fold) | optional |
| seed | Random state value for k-fold | 42 | optional |
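
To make `train_valid_split` and `limit` concrete, here is a minimal sketch of how a ratio of 0.2 could divide sentences between training and validation. It is an illustration only: whether NERP shuffles before splitting, and whether the ratio applies to sentences rather than rows, are assumptions. The function name and its `seed` argument are hypothetical.

```python
import random


def split_sentences(sentence_ids, valid_ratio=0.2, limit=0, seed=42):
    """Split sentence ids into (train, valid); limit=0 keeps everything."""
    ids = list(sentence_ids)
    if limit:
        # Keep only the first `limit` observations, as the limit option suggests.
        ids = ids[:limit]
    # Shuffle with a fixed seed so the split is reproducible (assumption).
    random.Random(seed).shuffle(ids)
    n_valid = int(round(len(ids) * valid_ratio))
    return ids[n_valid:], ids[:n_valid]


train_ids, valid_ids = split_sentences(range(10), valid_ratio=0.2)
print(len(train_ids), len(valid_ids))  # 8 2
```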

Inference Parameters

| Parameters | Description | Default | Type |
|---|---|---|---|
| max_len | The maximum sentence length (number of tokens after applying the transformer tokenizer) | 128 | optional |
| pretrained | Hugging Face transformer model | roberta-base | required |
| model_path | Path to the trained model | | required |
| tokenizer_path | Path to the saved tokenizer folder | | optional |
| tag_scheme | All available NER tags for the given data set, EXCLUDING the special outside tag, which is handled separately | | required |
| in_file_path | Path to the inference file; otherwise leave it empty | | optional |
| out_file_path | Path to the output file if the input is a file; otherwise leave it empty | | optional |
| text | Sample inference text for individual prediction if is_bulk is False | "Hello from NERP" | optional |
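
Because `tag_scheme` must exclude the outside tag, the full label set is the scheme plus that one extra tag. The sketch below shows one way such a label-to-index mapping could be built. The outside tag name "O", its position at index 0, and the helper name `build_tag_index` are all assumptions for illustration, not NERP's confirmed behaviour.

```python
def build_tag_index(tag_scheme, outside_tag="O"):
    """Map each tag to an integer id, placing the outside tag at index 0."""
    if outside_tag in tag_scheme:
        raise ValueError("tag_scheme should not include the outside tag")
    all_tags = [outside_tag] + list(tag_scheme)
    return {tag: index for index, tag in enumerate(all_tags)}


tag_scheme = ["B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
tag_index = build_tag_index(tag_scheme)

print(len(tag_index))    # 9
print(tag_index["O"])    # 0
```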

Data Format

The pipeline works with CSV files that contain one token and its label per line. The Sentence # column identifies the sentence each token belongs to. Labels should already be in the required format, e.g. IO, BIO, BILUO, ... The CSV file must contain the last three columns exactly as shown below.

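| | Unnamed: 0 | Sentence # | Word | Tag |
|---|---|---|---|---|
| 0 | 0 | Sentence: 0 | i | O |
| 1 | 1 | Sentence: 0 | was | O |
| 2 | 2 | Sentence: 0 | at | O |
| 3 | 3 | Sentence: 0 | h.w. | B-place |
| 4 | 4 | Sentence: 0 | holdings | I-place |
| 5 | 5 | Sentence: 0 | pte | I-place |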
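
As a quick sanity check on a data file, the rows can be grouped back into sentences using the Sentence # column. The sketch below uses only the standard library and assumes a comma-delimited file with rows like those above. The helper name `read_sentences` is hypothetical and not part of NERP.

```python
import csv
import io

# Rows like those above, written as comma-separated text.
SAMPLE = """,Unnamed: 0,Sentence #,Word,Tag
0,0,Sentence: 0,i,O
1,1,Sentence: 0,was,O
2,2,Sentence: 0,at,O
3,3,Sentence: 0,h.w.,B-place
4,4,Sentence: 0,holdings,I-place
5,5,Sentence: 0,pte,I-place
"""


def read_sentences(handle):
    """Group rows into {sentence id: (words, tags)}, keeping file order."""
    sentences = {}
    for row in csv.DictReader(handle):
        words, tags = sentences.setdefault(row["Sentence #"], ([], []))
        words.append(row["Word"])
        tags.append(row["Tag"])
    return sentences


sentences = read_sentences(io.StringIO(SAMPLE))
words, tags = sentences["Sentence: 0"]
print(words)
print(tags)
```

To read a real file, pass an open file handle instead, for example `open("data/train.csv", newline="")`.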

Output

After training the model, the pipeline will return the following files in the output directory:

  • model.bin - PyTorch NER model
  • tokenizer files
  • classification-report.csv - logging file
  • With k-fold: the split datasets, models, and tokenizers for each iteration

Models

All Hugging Face transformer-based models are supported.


Usage

Environment Setup

  1. Activate a new conda/python environment
  2. Install NERP
  • via pip
pip install NERP
  • or via repository
git clone https://github.com/Chaarangan/NERP
cd NERP
pip install -e .

Initialize NERP

from NERP.models import NERP
model = NERP("env.yaml")

Training a NER model using NERP

  1. Train a base model
model.train()
  2. Train an already trained model by loading its weights
model.train_after_load_network()
  3. Train with K-Fold Cross-Validation
model.train_with_kfold()
  4. Train an already trained model with K-Fold Cross-Validation after loading its weights
model.train_with_kfold_after_loading_network()
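
For readers unfamiliar with K-Fold Cross-Validation, the sketch below shows what the `splits` and `seed` settings under `kfold` control conceptually: the data is shuffled once with the seed, cut into `splits` folds, and each fold serves as the validation set exactly once. This is a standard-library illustration, not NERP's implementation, which may rely on a different splitter. The function name `kfold_indices` is hypothetical.

```python
import random


def kfold_indices(n_items, splits=2, seed=42):
    """Yield (train_idx, valid_idx) pairs, one pair per fold."""
    if splits < 2:
        raise ValueError("k-fold needs at least 2 splits")
    indices = list(range(n_items))
    # A fixed seed makes the fold assignment reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into `splits` folds, round-robin.
    folds = [indices[i::splits] for i in range(splits)]
    for k in range(splits):
        valid = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, valid


for fold, (train, valid) in enumerate(kfold_indices(6, splits=2, seed=42)):
    print(fold, len(train), len(valid))
# prints "0 3 3" then "1 3 3"
```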

Inference of a NER model using NERP

  1. Prediction on a single text
output = model.inference_text()
print(output)
  2. Prediction on a CSV file
model.inference_bulk()

License

MIT License

Shout-outs

  • Thanks to the NERDA package for inspiring us to develop this pipeline. We have integrated the NERDA framework into NERP with some modifications from v1.0.0.

Changes

  1. Method for saving and loading tokenizer
  2. Solutions from selected NERDA pull requests were incorporated

Cite this work

@inproceedings{nerp,
  title = {NERP},
  author = {Charangan Vasantharajan and Kyaw Zin Tun and Lim Zhi Hao and Chng Eng Siong},
  year = {2022},
  publisher = {{GitHub}},
  url = {https://github.com/Chaarangan/NERP.git}
}

Contributing to NERP

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Feel free to ask questions and send feedback on the mailing list.

If you want to contribute to NERP, open a PR.

If you encounter a bug or want to suggest an enhancement, please open an issue.
