NERP - NER Pipeline

What is it?
NERP (Named Entity Recognition Pipeline) is a Python package that offers an easy-to-use pipeline for fine-tuning pre-trained transformers for Named Entity Recognition (NER) tasks.
Main Features
- Different architectures (BiLSTM, CRF, BiLSTM+CRF)
- Fine-tune a pretrained model
- Save and reload a model and train it on new training data
- Fine-tune a pretrained model with K-Fold Cross-Validation
- Save and reload a model and train it on new training data with K-Fold Cross-Validation
- Fine-tune multiple pretrained models
- Prediction on a single text
- Prediction on a CSV file
Config
The user interface is a single YAML config file. Edit it to create the desired configuration.

Sample env.yaml file:
```yaml
torch:
  device: "cuda"

data:
  train_data: 'data/train.csv'
  valid_data: 'data/valid.csv'
  train_valid_split: None
  test_data: 'data/test.csv'
  limit: 10
  tag_scheme: ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

model:
  archi: "baseline"
  o_tag_cr: True
  max_len: 128
  dropout: 0.1
  hyperparameters:
    epochs: 1
    warmup_steps: 500
    train_batch_size: 64
    learning_rate: 0.0001
  tokenizer_parameters:
    do_lower_case: True
  pretrained_models:
    - roberta-base

train:
  existing_model_path: "roberta-base/model.bin"
  existing_tokenizer_path: "roberta-base/tokenizer"
  output_dir: "output/"

kfold:
  splits: 2
  seed: 42
  test_on_original: False

inference:
  archi: "bilstm-crf"
  max_len: 128
  pretrained: "roberta-base"
  model_path: "roberta-base/model.bin"
  tokenizer_path: "roberta-base/tokenizer"
  bulk:
    in_file_path: "data/test.csv"
    out_file_path: "data/output.csv"
  individual:
    text: "Hello from NERP"
```
Training Parameters
Parameters | Description | Default | Type |
---|---|---|---|
device | the desired device to use for computation; if not provided, a sensible default is chosen | cuda or cpu | optional |
train_data | path to the training CSV file | | required |
valid_data | path to the validation CSV file | | optional |
train_valid_split | train/valid split ratio, used if no validation data exists | 0.2 | optional |
test_data | path to the test CSV file | | required |
limit | limit the number of observations returned from a given split; 0 returns the entire split (int) | 0 (whole data) | optional |
tag_scheme | all NER tags available in the data set, EXCLUDING the special outside tag, which is handled separately | | required |
archi | the desired model architecture: baseline, bilstm-crf, bilstm, or crf (str) | baseline | optional |
o_tag_cr | whether to include the O tag in the classification report (bool) | True | optional |
max_len | the maximum sentence length (number of tokens after applying the transformer tokenizer) | 128 | optional |
dropout | dropout probability (float) | 0.1 | optional |
epochs | number of epochs (int) | 5 | optional |
warmup_steps | number of learning-rate warmup steps (int) | 500 | optional |
train_batch_size | batch size for the DataLoader (int) | 64 | optional |
learning_rate | learning rate (float) | 0.0001 | optional |
tokenizer_parameters | hyperparameters for the tokenizer (dict) | do_lower_case: True | optional |
pretrained_models | Hugging Face transformer model(s) (str) | roberta-base | required |
existing_model_path | path to a model derived from the transformer (str) | | optional |
existing_tokenizer_path | path to a tokenizer derived from the transformer (str) | | optional |
output_dir | path to the output directory (str) | models/ | optional |
kfold | number of splits (int) | 0 (no k-fold) | optional |
seed | random state for k-fold splitting (int) | 42 | optional |
test_on_original | set to True to test on the original test set in each fold (bool) | False | optional |
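As a quick sanity check on warmup_steps, you can estimate the total number of optimizer steps implied by epochs and train_batch_size. The sketch below assumes one optimizer step per batch (no gradient accumulation), which may differ from NERP's internals, and the dataset size is hypothetical:

```python
import math

num_train_examples = 47_000  # hypothetical size of the training split
train_batch_size = 64
epochs = 5
warmup_steps = 500

steps_per_epoch = math.ceil(num_train_examples / train_batch_size)
total_steps = steps_per_epoch * epochs

# Warmup should typically be a small fraction of the total steps.
print(total_steps, warmup_steps / total_steps)
```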
Inference Parameters
Parameters | Description | Default | Type |
---|---|---|---|
archi | the architecture of the trained model: baseline, bilstm-crf, bilstm, or crf (str) | baseline | optional |
max_len | the maximum sentence length (number of tokens after applying the transformer tokenizer) | 128 | optional |
pretrained | Hugging Face transformer model (str) | roberta-base | required |
model_path | path to the trained model | | required |
tokenizer_path | path to the saved tokenizer folder | | optional |
tag_scheme | all NER tags available in the data set, EXCLUDING the special outside tag, which is handled separately | | required |
in_file_path | path to the inference file; otherwise leave it empty | | optional |
out_file_path | path to the output file if the input is a file; otherwise leave it empty | | optional |
text | sample inference text for individual prediction | "Hello from NERP" | optional |
Data Format
The pipeline works with CSV files containing one token and its label per line. Sentences are identified by the Sentence # column. Labels must already be in the required format, e.g. IO, BIO, or BILUO. The last three columns of the CSV must match the example below.

 | Unnamed: 0 | Sentence # | Word | Tag |
---|---|---|---|---|
0 | 0 | Sentence: 0 | i | O |
1 | 1 | Sentence: 0 | was | O |
2 | 2 | Sentence: 0 | at | O |
3 | 3 | Sentence: 0 | h.w. | B-place |
4 | 4 | Sentence: 0 | holdings | I-place |
5 | 5 | Sentence: 0 | pte | I-place |
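If your data is a list of token/tag pairs per sentence, a small pandas snippet can produce a CSV in this shape. This is a minimal sketch; the sentence data below is hypothetical:

```python
import pandas as pd

# Hypothetical tokenized sentences with their tags.
sentences = [
    [("i", "O"), ("was", "O"), ("at", "O"), ("h.w.", "B-place")],
]

rows = []
for sent_id, sent in enumerate(sentences):
    for word, tag in sent:
        rows.append({"Sentence #": f"Sentence: {sent_id}", "Word": word, "Tag": tag})

# reset_index() reproduces the "Unnamed: 0" column from the example above;
# to_csv() writes the outer index as the leading unnamed column.
df = pd.DataFrame(rows).reset_index().rename(columns={"index": "Unnamed: 0"})
df.to_csv("data/train.csv")
```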
Output
After training, the pipeline writes the following files to the output directory:
- model.bin - the PyTorch NER model
- tokenizer files
- classification-report.csv - the logging file
- with k-fold: the split datasets, plus a model and tokenizer for each fold, and an accuracy file
Models
All Hugging Face transformer-based models are supported.
Usage
Environment Setup
- Activate a new conda/Python environment
- Install NERP, either via pip:

```bash
pip install NERP==1.0.2.2
```

or from the repository:

```bash
git clone --branch v1.0.2.2 https://github.com/Chaarangan/NERP.git
cd NERP && pip install -e .
```
Initialize NERP
```python
from NERP.models import NERP

model = NERP("env.yaml")
```
Training a NER model using NERP
- Train a base model:

```python
model.train()
```

- Train an already trained model by loading its weights:

```python
model.train_after_load_network()
```

- Train with K-Fold Cross-Validation:

```python
model.train_with_kfold()
```

- Train an already trained model with K-Fold Cross-Validation after loading its weights (a consolidated sketch follows this list):

```python
model.train_with_kfold_after_loading_network()
```
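Putting it together, a minimal end-to-end training run might look like this. This is a sketch assuming the sample env.yaml above; the method names are those listed above:

```python
from NERP.models import NERP

# Point NERP at the YAML config described above.
model = NERP("env.yaml")

# Fine-tunes the model(s) listed under pretrained_models; model.bin,
# the tokenizer files, and classification-report.csv are written to
# the configured output_dir.
model.train()
```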
Inference of a NER model using NERP
- Prediction on a single text through the YAML file:

```python
output = model.inference_text()
print(output)
```

- Prediction on a single text through direct input:

```python
output = model.predict("Hello from NERP")
print(output)
```

- Prediction on a CSV file (a sketch for inspecting the results follows this list):

```python
model.inference_bulk()
```
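For bulk prediction, the input and output paths come from the inference.bulk section of the config. Below is a minimal sketch for inspecting the results with pandas; the output path follows the sample config, and the column layout of the output file is not assumed here:

```python
import pandas as pd

# Run prediction on the CSV at inference.bulk.in_file_path; results are
# written to inference.bulk.out_file_path.
model.inference_bulk()

# Peek at the first few rows of the generated file ("data/output.csv"
# in the sample config).
print(pd.read_csv("data/output.csv").head())
```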
License
MIT License
Shout-outs
- Thanks to the NERDA package for inspiring us to develop this pipeline. We integrated the NERDA framework into NERP, with some modifications, starting from v1.0.0.

Changes from NERDA (v1.0.0) to our NERDA submodule:
- Method for saving and loading the tokenizer
- Solutions from selected NERDA pull requests
- Implementation of the classification report
- Support for multiple network architectures
Cite this work
```bibtex
@inproceedings{nerp,
  title = {NERP},
  author = {Charangan Vasantharajan and Kyaw Zin Tun and Lim Zhi Hao and Chng Eng Siong},
  year = {2022},
  publisher = {{GitHub}},
  url = {https://github.com/Chaarangan/NERP.git}
}
```
Contributing to NERP
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
Feel free to ask questions and send feedback on the mailing list.
If you want to contribute to NERP, open a PR.
If you encounter a bug or want to suggest an enhancement, please open an issue.