Named Entity Recognition (NER) with PyTorch
Pipeline for training NER models using PyTorch.
ONNX export supported.
Usage
Instead of writing custom code for a specific NER task, you just need to:
- install the pipeline:
pip install pytorch-ner
- run the pipeline:
- either in the terminal:
pytorch-ner-train --path_to_config config.yaml
- or in Python:
import pytorch_ner
pytorch_ner.train(path_to_config="config.yaml")
Config
The user interface consists of a single file: config.yaml.
Edit it to create the desired configuration.
Default config.yaml:
torch:
  device: 'cpu'
  seed: 42
data:
  train_data:
    path: 'data/conll2003/train.txt'
    sep: ' '
    lower: true
    verbose: true
  valid_data:
    path: 'data/conll2003/valid.txt'
    sep: ' '
    lower: true
    verbose: true
  test_data:
    path: 'data/conll2003/test.txt'
    sep: ' '
    lower: true
    verbose: true
  token2idx:
    min_count: 1
    add_pad: true
    add_unk: true
dataloader:
  preprocess: true
  token_padding: '<PAD>'
  label_padding: 'O'
  percentile: 100
  batch_size: 256
model:
  embedding:
    embedding_dim: 128
  rnn:
    rnn_unit: LSTM  # GRU, RNN
    hidden_size: 256
    num_layers: 1
    dropout: 0
    bidirectional: true
optimizer:
  optimizer_type: Adam  # torch.optim
  clip_grad_norm: 0.1
  params:
    lr: 0.001
    weight_decay: 0
    amsgrad: false
train:
  n_epoch: 10
  verbose: true
save:
  path_to_folder: 'models'
  export_onnx: true
NOTE: to export the trained model to ONNX, use the following config parameter:
save:
  export_onnx: true
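The config can also be edited programmatically before training. The following sketch assumes PyYAML is installed and writes a modified copy of the config (the file name config_onnx.yaml is just an illustration):

import yaml  # PyYAML, assumed to be installed
import pytorch_ner

# Load the default config, enable ONNX export, and save a modified copy.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

config["save"]["export_onnx"] = True

with open("config_onnx.yaml", "w") as f:  # hypothetical file name
    yaml.safe_dump(config, f)

# Train with the modified config.
pytorch_ner.train(path_to_config="config_onnx.yaml")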
Data Format
The pipeline works with text files that contain tokens and labels separated on each line; sentences are separated by an empty line. Labels should already be in the required tagging scheme, e.g. IO, BIO, BILUO, etc.
Example:
token_11 label_11
token_12 label_12
token_21 label_21
token_22 label_22
token_23 label_23
...
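As a rough illustration, a file in this format can be read with a few lines of Python. The helper below is a hypothetical sketch, not the pipeline's own loader; it assumes one token-label pair per line, separated by sep:

def read_ner_file(path, sep=" ", lower=True):
    """Read a token-per-line NER file; sentences are separated by empty lines."""
    sentences, labels = [], []
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # an empty line ends the current sentence
                if tokens:
                    sentences.append(tokens)
                    labels.append(tags)
                    tokens, tags = [], []
                continue
            parts = line.split(sep)
            tokens.append(parts[0].lower() if lower else parts[0])
            tags.append(parts[-1])  # the label is the last column
    if tokens:  # handle a final sentence without a trailing empty line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels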
Output
After training, the pipeline saves the following files:
- model.pth - PyTorch NER model
- model.onnx - ONNX NER model (optional)
- token2idx.json - mapping from token to its index
- label2idx.json - mapping from label to its index
- config.yaml - config that was used to train the model
- logging.txt - logging file
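A minimal sketch of inspecting these artifacts after training (assumes onnxruntime is installed, export_onnx was enabled, and the files sit directly under path_to_folder; the exact folder layout may differ):

import json
import onnxruntime as ort  # assumed to be installed

# Vocabulary mappings saved by the pipeline (paths are assumptions).
with open("models/token2idx.json") as f:
    token2idx = json.load(f)
with open("models/label2idx.json") as f:
    label2idx = json.load(f)
idx2label = {idx: label for label, idx in label2idx.items()}

# Inspect the exported ONNX graph to see the inputs and outputs it expects.
session = ort.InferenceSession("models/model.onnx")
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)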
Models
List of implemented models:
- BiLSTM
- BiLSTMCRF
- BiLSTMAttn
- BiLSTMAttnCRF
- BiLSTMCNN
- BiLSTMCNNCRF
- BiLSTMCNNAttn
- BiLSTMCNNAttnCRF
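For orientation, the sketch below shows a schematic baseline BiLSTM tagger wired from the embedding/rnn config fields above (an illustration only, not the package's actual model class):

import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Schematic BiLSTM NER tagger: embedding -> BiLSTM -> per-token classifier."""

    def __init__(self, vocab_size, n_labels,
                 embedding_dim=128, hidden_size=256, num_layers=1, dropout=0.0):
        super().__init__()
        # padding_idx=0 assumes the '<PAD>' token is mapped to index 0.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = nn.LSTM(
            embedding_dim, hidden_size,
            num_layers=num_layers, dropout=dropout,
            bidirectional=True, batch_first=True,
        )
        # A bidirectional LSTM doubles the feature size fed to the classifier.
        self.classifier = nn.Linear(2 * hidden_size, n_labels)

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)
        rnn_out, _ = self.rnn(embedded)  # (batch, seq_len, 2 * hidden_size)
        return self.classifier(rnn_out)  # per-token label logits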
Evaluation
All results were obtained on the CoNLL-2003 dataset. No hyperparameter search was performed.
Model | Train F1-weighted | Validation F1-weighted | Test F1-weighted |
---|---|---|---|
BiLSTM | 0.968 | 0.928 | 0.876 |
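F1-weighted here refers to the token-level F1 score averaged over label classes, weighted by support. One way to compute it (assuming scikit-learn is available) is to flatten the gold and predicted label sequences:

from itertools import chain
from sklearn.metrics import f1_score

# Toy gold and predicted label sequences, one list per sentence.
true_labels = [["B-PER", "O", "O"], ["B-LOC", "I-LOC"]]
pred_labels = [["B-PER", "O", "O"], ["B-LOC", "O"]]

score = f1_score(
    list(chain.from_iterable(true_labels)),
    list(chain.from_iterable(pred_labels)),
    average="weighted",
)
print(f"F1-weighted: {score:.3f}")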
Requirements
Python >= 3.6
Citation
If you use pytorch_ner in a scientific publication, we would appreciate a citation using the following BibTeX entry:
@misc{dayyass2020ner,
    author = {El-Ayyass, Dani},
    title = {Pipeline for training NER models using PyTorch},
    howpublished = {\url{https://github.com/dayyass/pytorch_ner}},
    year = {2020}
}
TODO: Docker image, CUDA support