Transformer-based named entity recognition
T-NER: Transformers NER
T-NER is a Python tool for analysing language model finetuning on named entity recognition (NER).
It provides an easy interface to finetune models and test them on cross-domain datasets, for which we compile 9 publicly available NER datasets.
Models can be deployed immediately on our web app for qualitative analysis, or via the API as a microservice.
We also release all the NER model checkpoints; the most generalized model, trained on all the datasets, covers 43 entity types.
Table of Contents
- Setup
- Language Model Finetuning on NER
- Datasets: Built-in datasets and custom dataset
- Model Finetuning: Model training colab notebook
- Model Evaluation: In/out of domain evaluation colab notebook
- Model Inference API: An API to get prediction from models
- Model Checkpoints: Released model checkpoints
- Experiment with XLM-R: Cross-domain analysis of XLM-R
- Web API: Model deployment on a web-app
Get Started
Install via pip:

```shell
pip install git+https://github.com/asahi417/tner
```

or clone the repository and install the dependencies:

```shell
git clone https://github.com/asahi417/tner
cd tner
pip install -r requirement.txt
```
Language Model Finetuning on NER
Fig 1: Tensorboard visualization
Datasets
The following built-in NER datasets are available via tner.
| Name (alias) | Genre | Language | Entity types | Data size (train/valid/test) | Note |
|---|---|---|---|---|---|
| OntoNotes 5 (`ontonotes5`) | News, Blog, Dialogue | English | 18 | 59,924/8,582/8,262 | |
| CoNLL 2003 (`conll2003`) | News | English | 4 | 14,041/3,250/3,453 | |
| WNUT 2017 (`wnut2017`) | SNS | English | 6 | 1,000/1,008/1,287 | |
| FIN (`fin`) | Finance | English | 4 | 1,164/-/303 | |
| BioNLP 2004 (`bionlp2004`) | Chemical | English | 5 | 18,546/-/3,856 | |
| BioCreative V CDR (`bc5cdr`) | Medical | English | 2 | 5,228/5,330/5,865 | split into sentences to reduce sequence length |
| WikiAnn (`panx_dataset/en`, `panx_dataset/ja`, etc.) | Wikipedia | 282 languages | 3 | 20,000/10,000/10,000 | |
| Japanese Wikipedia (`wiki_ja`) | Wikipedia | Japanese | 8 | -/-/500 | test set only |
| Japanese WikiNews (`wiki_news_ja`) | Wikipedia | Japanese | 10 | -/-/1,000 | test set only |
| MIT Restaurant (`mit_restaurant`) | Restaurant review | English | 8 | 7,660/-/1,521 | lower-cased |
| MIT Movie (`mit_movie_trivia`) | Movie review | English | 12 | 7,816/-/1,953 | lower-cased |
The cache directory can be specified via the environment variable `CACHE_DIR`, which defaults to `./cache`.
The data API provides any of the above datasets in one line, although the data doesn't need to be loaded manually for training (see the model training section).
```python
import tner
data, label_to_id, language, unseen_entity_set = tner.get_dataset_ner(['wnut2017'])
```
where `data` consists of the following structured data:
```python
{
    'train': {
        'data': [
            ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.'],
            ['From', 'Green', 'Newsfeed', ':', 'AHFA', 'extends', 'deadline', 'for', 'Sage', 'Award', 'to', 'Nov', '.', '5', 'http://tinyurl.com/24agj38'], ...
        ],
        'label': [
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...
        ]
    },
    'valid': ...
}
```
The list of all the datasets can be found at `tner.VALID_DATASET`.
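Since `label` holds integer ids, a handy first step is to invert `label_to_id` and decode the tags; a minimal sketch, assuming (as the name suggests) that `label_to_id` maps tag strings to integer ids:

```python
import tner

data, label_to_id, language, unseen_entity_set = tner.get_dataset_ner(['wnut2017'])

# Invert the mapping to decode integer labels back into tag strings.
id_to_label = {i: tag for tag, i in label_to_id.items()}

tokens = data['train']['data'][0]
tags = [id_to_label[i] for i in data['train']['label'][0]]
for token, tag in zip(tokens, tags):
    print(token, tag)
```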
WikiAnn dataset
All the datasets are fetched automatically except the `panx_dataset/*` datasets:
first create the cache directory (`./cache` by default, configurable through the environment variable `CACHE_DIR`),
then manually download the data from here
(note that it downloads as `AmazonPhotos.zip`) to the cache folder.
Custom Dataset
To go beyond the public datasets, users can bring their own dataset by formatting it into
the IOB format described in the CoNLL 2003 NER shared task paper,
where each data file contains one word per line, with empty lines representing sentence boundaries.
At the end of each line there is a tag which states whether the current word is inside a named entity or not.
The tag also encodes the type of named entity. Here is an example sentence:
```
EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O
```
Words tagged with O are outside of named entities and the I-XXX tag is used for words inside a
named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the
first word of the second entity will be tagged B-XXX in order to show that it starts another entity.
The custom dataset should have a `train.txt` and a `valid.txt` file in the same folder.
Please take a look at the sample custom data.
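As an illustration, the snippet below writes a toy corpus in this format; the folder name `./my_dataset` and the sentences are made up, while `train.txt`/`valid.txt` are the file names tner expects:

```python
import os

# Two toy IOB-tagged sentences: one (token, tag) pair per line,
# blank lines mark sentence boundaries.
sentences = [
    [('EU', 'B-ORG'), ('rejects', 'O'), ('German', 'B-MISC'), ('call', 'O')],
    [('Peter', 'B-PER'), ('Blackburn', 'I-PER')],
]

os.makedirs('./my_dataset', exist_ok=True)
for split in ('train.txt', 'valid.txt'):
    with open(os.path.join('./my_dataset', split), 'w') as f:
        for sentence in sentences:
            f.writelines(f'{token} {tag}\n' for token, tag in sentence)
            f.write('\n')  # sentence boundary
```

The folder can then be passed to the trainer as `dataset="./my_dataset"`.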
Model Finetuning
Language model finetuning can be done with a few lines:
```python
import tner
trainer = tner.TrainTransformersNER(dataset="ontonotes5", transformers_model="xlm-roberta-base")
trainer.train()
```
where `transformers_model` is a pre-trained model name from the pretrained LM list and
`dataset` is a dataset alias or a path to a custom dataset, as explained in the dataset section.
At the end of each epoch, metrics on the validation set can be computed for monitoring purposes by activating validation monitoring:

```python
trainer.train(monitor_validation=True)
```
Train on multiple datasets: a model can be trained on a concatenation of multiple datasets:

```python
trainer = tner.TrainTransformersNER(dataset=["ontonotes5", "conll2003"], transformers_model="xlm-roberta-base")
```

Custom datasets can also be mixed with built-in ones, e.g. `dataset=["ontonotes5", "./test/sample_data"]`.
For more information about the options, you may want to see here.
Organize model weights (checkpoint files): checkpoint files (model weights, training config, benchmark results, etc.)
are stored under `checkpoint_dir`, which defaults to `./ckpt`.
Each folder is named after the MD5 hash of the hyperparameter combination
(e.g., `./ckpt/6bb4fdb286b5e32c068262c2a413639e/`).
Each checkpoint consists of the following files:
- `events.out.tfevents.*`: tensorboard file for monitoring the learning process
- `label_to_id.json`: dictionary to map prediction ids to labels
- `model.pt`: pytorch model weight file
- `parameter.json`: model hyperparameters
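Since `parameter.json` and `label_to_id.json` are plain JSON files, a checkpoint can be inspected without loading the model; a minimal sketch (the hash folder name is just the example from above):

```python
import json
import os

ckpt = './ckpt/6bb4fdb286b5e32c068262c2a413639e'

# Read the stored hyperparameters and the label mapping of a checkpoint.
with open(os.path.join(ckpt, 'parameter.json')) as f:
    parameter = json.load(f)
with open(os.path.join(ckpt, 'label_to_id.json')) as f:
    label_to_id = json.load(f)

print(parameter)
print(sorted(label_to_id, key=label_to_id.get))
```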
Model Evaluation
To evaluate NER models, here we explain how to run in-domain and out-of-domain evaluation with the micro F1 score,
supposing that your model's checkpoint is `./ckpt/xxx/`.
```python
import tner
trainer = tner.TrainTransformersNER(checkpoint='./ckpt/xxx')
trainer.test(test_dataset='conll2003')
```
This gives you an accuracy summary.
Again, `test_dataset` can be a path to a custom dataset, as explained in the dataset section.
Entity span prediction: for a better understanding of out-of-domain accuracy, we provide entity span prediction accuracy, which ignores the entity type and computes metrics only on the IOB entity positions.

```python
trainer.test(test_dataset='conll2003', entity_span_prediction=True)
```
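To make the metric concrete, here is a small sketch of what ignoring the entity type amounts to, using the seqeval library mentioned in the notes below; this illustrates the idea rather than tner's internal implementation, and the tag sequences are made up:

```python
from seqeval.metrics import f1_score

def strip_entity_type(tags):
    # Collapse every entity type into one dummy type so that only the
    # span boundaries matter (B-ORG -> B-entity, I-LOC -> I-entity).
    return [tag if tag == 'O' else tag.split('-')[0] + '-entity' for tag in tags]

# Made-up gold and predicted tag sequences: the predicted span is correct
# but its type is wrong (LOC instead of ORG).
gold = [['B-ORG', 'I-ORG', 'O', 'B-PER']]
pred = [['B-LOC', 'I-LOC', 'O', 'B-PER']]

print(f1_score(gold, pred))  # typed F1: 0.5
print(f1_score([strip_entity_type(t) for t in gold],
               [strip_entity_type(t) for t in pred]))  # span-only F1: 1.0
```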
Model Inference API
To use a trained model as part of a pipeline, we provide an API to get predictions from it.
```python
import tner
classifier = tner.TransformersNER(checkpoint='path-to-checkpoint-dir')
test_sentences = [
    'I live in United States, but Microsoft asks me to move to Japan.',
    'I have an Apple computer.',
    'I like to eat an apple.'
]
classifier.predict(test_sentences)
```
For more information about the module, you may want to see here. As an example, we provide a command-line interface on top of the inference API.
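The inference API can also be wrapped in a small microservice; below is a minimal sketch with FastAPI (the same stack the web app below runs on via uvicorn), where the endpoint name, module name, and checkpoint path are our own assumptions:

```python
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

import tner

app = FastAPI()
# Hypothetical checkpoint path; point this at a real checkpoint folder.
classifier = tner.TransformersNER(checkpoint='./ckpt/xxx')

class NERRequest(BaseModel):
    sentences: List[str]

@app.post('/predict')
def predict(request: NERRequest):
    # Delegate to tner's inference API and return its raw output.
    return {'predictions': classifier.predict(request.sentences)}

# Run with: uvicorn ner_service:app --host 0.0.0.0 --port 8000
```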
Model Checkpoints
We release the NER model checkpoints trained with `tner` here.
They include models finetuned on each dataset, as well as one trained on all the data (`all_15000`).
As the language model we use `xlm-roberta-large`, since those models are used in the later experiments.
To use them, create the checkpoint directory `./ckpt` and put the checkpoint folders under it.
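Once placed there, a released checkpoint can be loaded exactly like a locally trained one; for example (assuming the downloaded folder is named `all_15000`, as in the table below):

```python
import tner

# Load the released all-data checkpoint for inference.
classifier = tner.TransformersNER(checkpoint='./ckpt/all_15000')
print(classifier.predict(['Microsoft was founded by Bill Gates.']))
```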
Experiment with XLM-R
We finetune XLM-R (`xlm-roberta-large`) on each dataset and
evaluate it in in-domain, cross-domain, and cross-lingual settings. Moreover, we show that `xlm-roberta-large` is capable of
learning all the domains, based on the results on the combined dataset.
First, we report the in-domain baseline on each dataset, where the metrics are quite close to, or even outperform, the current SoTA (Oct 2020). Throughout this section, we use the test F1 score.
| Dataset | Recall | Precision | F1 | SoTA F1 | SoTA reference |
|---|---|---|---|---|---|
| `ontonotes5` | 90.56 | 87.75 | 89.13 | 92.07 | BERT-MRC-DSC |
| `wnut2017` | 51.53 | 67.85 | 58.58 | 50.03 | CrossWeigh |
| `conll2003` | 93.86 | 92.09 | 92.97 | 94.30 | LUKE |
| `panx_dataset/en` | 84.78 | 83.27 | 84.02 | 84.8 | mBERT |
| `panx_dataset/ja` | 87.96 | 85.17 | 86.54 | - | - |
| `panx_dataset/ru` | 90.7 | 89.45 | 90.07 | - | - |
| `fin` | 82.56 | 71.24 | 76.48 | - | - |
| `bionlp2004` | 79.63 | 69.78 | 74.38 | - | - |
| `bc5cdr` | 90.36 | 87.02 | 88.66 | - | - |
| `mit_restaurant` | 80.64 | 78.64 | 79.63 | - | - |
| `mit_movie_trivia` | 73.14 | 69.42 | 71.23 | - | - |
Then we evaluate each model on the other datasets to assess its domain adaptation capacity in English. As the entity types differ across datasets, we can't compare them with the ordinary entity-type F1 score used above; instead we employ the entity-span F1 score as our domain adaptation metric.
| Train\Test | `ontonotes5` | `conll2003` | `wnut2017` | `panx_dataset/en` | `bionlp2004` | `bc5cdr` | `fin` | `mit_restaurant` | `mit_movie_trivia` |
|---|---|---|---|---|---|---|---|---|---|
| `ontonotes5` | 91.69 | 65.45 | 53.69 | 47.57 | 0.0 | 0.0 | 18.34 | 2.47 | 88.87 |
| `conll2003` | 62.24 | 96.08 | 69.13 | 61.7 | 0.0 | 0.0 | 22.71 | 4.61 | 0.0 |
| `wnut2017` | 41.89 | 85.7 | 68.32 | 54.52 | 0.0 | 0.0 | 20.07 | 15.58 | 0.0 |
| `panx_dataset/en` | 32.81 | 73.37 | 53.69 | 93.41 | 0.0 | 0.0 | 12.25 | 1.16 | 0.0 |
| `bionlp2004` | 0.0 | 0.0 | 0.0 | 0.0 | 79.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| `bc5cdr` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 88.88 | 0.0 | 0.0 | 0.0 |
| `fin` | 48.25 | 73.21 | 60.99 | 58.99 | 0.0 | 0.0 | 82.05 | 19.73 | 0.0 |
| `mit_restaurant` | 5.68 | 18.37 | 21.2 | 24.07 | 0.0 | 0.0 | 18.06 | 83.4 | 0.0 |
| `mit_movie_trivia` | 11.97 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 73.1 |
Here one can see that none of the models transfers well to the other datasets, which indicates the difficulty of domain transfer in NER.
Now we train a NER model on all the datasets and report the results.
Models were trained on all datasets for 5000, 10000, and 15000 steps.
As you can see, the accuracy is altogether close to what is attained by the single-dataset models, indicating that `xlm-roberta-large` can at least learn all the features of each domain.
| Model | `ontonotes5` | `conll2003` | `wnut2017` | `panx_dataset/en` | `bionlp2004` | `bc5cdr` | `fin` | `mit_restaurant` | `mit_movie_trivia` |
|---|---|---|---|---|---|---|---|---|---|
| `all_5000` | 85.67 | 88.28 | 51.11 | 79.22 | 70.8 | 79.56 | 74.72 | 78.57 | 66.64 |
| `all_10000` | 87.18 | 89.76 | 53.12 | 82.03 | 73.03 | 82.8 | 75.93 | 81.27 | 71.04 |
| `all_15000` | 87.91 | 89.8 | 55.48 | 82.29 | 73.76 | 84.25 | 74.77 | 81.44 | 72.33 |
Finally, we show cross-lingual transfer metrics over a few WikiAnn datasets.
| Train\Test | `panx_dataset/en` | `panx_dataset/ja` | `panx_dataset/ru` |
|---|---|---|---|
| `panx_dataset/en` | 84.02 | 46.37 | 73.18 |
| `panx_dataset/ja` | 53.6 | 86.54 | 45.75 |
| `panx_dataset/ru` | 60.49 | 53.38 | 90.07 |
Notes:
- The configuration can be found in the training script.
- The F1 score is based on the seqeval library, which is a span-based measure.
- For the Japanese datasets, we tokenize each sentence from a sequence of characters into proper tokens with MeCab (see the sketch after these notes), so results are not directly comparable with prior work.
- We release all the checkpoints used in the experiments. Take a look here.
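As a reference point for the Japanese preprocessing, the mecab-python3 bindings tokenize a character string into words like this (illustrative only; not necessarily the exact configuration used in the experiments):

```python
import MeCab  # mecab-python3; requires MeCab and a dictionary installed

# Split a Japanese sentence from a raw character string into tokens.
tagger = MeCab.Tagger('-Owakati')
print(tagger.parse('私は東京に住んでいます。').split())
```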
Web App
We provide a quick demo web app. Please clone and install the repo first.

1. Train a model or download our checkpoint. If you use your own checkpoint, set the path to the checkpoint folder with `export MODEL_CKPT=<path-to-your-checkpoint-folder>`.
2. Run the app and open http://0.0.0.0:8000 in your browser:

```shell
uvicorn app:app --reload --log-level debug --host 0.0.0.0 --port 8000
```
Acknowledgement
The App interface is heavily inspired by Multiple-Choice-Question-Generation-T5-and-Text2Text.