Pipeline for training NER models using PyTorch

Named Entity Recognition (NER) with PyTorch

Pipeline for training NER models using PyTorch.

ONNX export supported.

Usage

Instead of writing custom code for a specific NER task, you only need to:

  1. install the pipeline:
pip install pytorch-ner
  2. run the pipeline:
  • either in the terminal:
pytorch-ner-train --path_to_config config.yaml
  • or in Python:
import pytorch_ner

pytorch_ner.train(path_to_config="config.yaml")
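The Python entry point above can be wrapped so that a mistyped config path fails immediately with a clear error. This is a minimal sketch: train_from_config is a hypothetical helper name, not part of the package, and the only package call it relies on is pytorch_ner.train(path_to_config=...) as shown above.

```python
from pathlib import Path


def train_from_config(path_to_config: str) -> None:
    """Check that the config file exists, then start training (hypothetical helper)."""
    config_path = Path(path_to_config)
    if not config_path.is_file():
        raise FileNotFoundError(f"Config file not found: {config_path}")

    # Imported lazily so the path check runs even if the package is not installed yet.
    import pytorch_ner

    pytorch_ner.train(path_to_config=str(config_path))
```

Calling train_from_config("config.yaml") then behaves like the direct call, except that a missing file is reported before any training code is loaded.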

Config

The user interface consists of a single file, config.yaml.
Edit it to create the desired configuration.

Default config.yaml:

torch:
  device: 'cpu'
  seed: 42

data:
  train_data:
    path: 'data/conll2003/train.txt'
    sep: ' '
    lower: true
    verbose: true
  valid_data:
    path: 'data/conll2003/valid.txt'
    sep: ' '
    lower: true
    verbose: true
  test_data:
    path: 'data/conll2003/test.txt'
    sep: ' '
    lower: true
    verbose: true
  token2idx:
    min_count: 1
    add_pad: true
    add_unk: true

dataloader:
  preprocess: true
  token_padding: '<PAD>'
  label_padding: 'O'
  percentile: 100
  batch_size: 256

model:
  embedding:
    embedding_dim: 128
  rnn:
    rnn_unit: LSTM  # GRU, RNN
    hidden_size: 256
    num_layers: 1
    dropout: 0
    bidirectional: true

optimizer:
  optimizer_type: Adam  # torch.optim
  clip_grad_norm: 0.1
  params:
    lr: 0.001
    weight_decay: 0
    amsgrad: false

train:
  n_epoch: 10
  verbose: true

save:
  path_to_folder: 'models'
  export_onnx: true
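To illustrate what the token2idx options in the default config plausibly control, here is a minimal sketch of a vocabulary builder. It is not the package's implementation: build_token2idx is a hypothetical function, and the '<UNK>' token name is an assumption (only '<PAD>' appears in the config above).

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_token2idx(
    sentences: Iterable[List[str]],
    min_count: int = 1,
    add_pad: bool = True,
    add_unk: bool = True,
) -> Dict[str, int]:
    """Map every token seen at least min_count times to an integer index."""
    counts = Counter(token for sentence in sentences for token in sentence)

    token2idx: Dict[str, int] = {}
    if add_pad:
        token2idx["<PAD>"] = len(token2idx)  # same string as token_padding in the config
    if add_unk:
        token2idx["<UNK>"] = len(token2idx)  # assumed name for the unknown-token entry

    # Counter keeps first-seen order, so indices are deterministic.
    for token, count in counts.items():
        if count >= min_count and token not in token2idx:
            token2idx[token] = len(token2idx)
    return token2idx


sentences = [["eu", "rejects", "german", "call"], ["german", "call"]]
vocab = build_token2idx(sentences, min_count=2)
print(vocab)  # {'<PAD>': 0, '<UNK>': 1, 'german': 2, 'call': 3}
```

Under this reading, raising min_count shrinks the vocabulary, and rare tokens fall back to the unknown-token index at lookup time.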

NOTE: to export the trained model to ONNX, set the following config parameter:

save:
  export_onnx: true

Data Format

The pipeline works with a text file in which each line contains a token and its label, separated by a delimiter (the sep parameter in the config). Sentences are separated by an empty line. Labels should already be in the required tagging scheme, e.g. IO, BIO, BILUO, ...

Example:

token_11    label_11
token_12    label_12

token_21    label_21
token_22    label_22
token_23    label_23

...
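The format above can be read with a few lines of Python. The following is a sketch of such a reader, not the package's own loader: parse_ner_lines is a hypothetical name, and taking the first field as the token and the last field as the label is an assumption made so that files with extra columns still parse.

```python
from typing import Iterable, List, Optional, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, labels)


def parse_ner_lines(
    lines: Iterable[str], sep: Optional[str] = None, lower: bool = False
) -> List[Sentence]:
    """Group token/label lines into sentences; an empty line ends a sentence."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    labels: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            if tokens:  # close the current sentence, ignore repeated blank lines
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        parts = line.split(sep)  # sep=None splits on any run of whitespace
        tokens.append(parts[0].lower() if lower else parts[0])
        labels.append(parts[-1])
    if tokens:  # the file may not end with a blank line
        sentences.append((tokens, labels))
    return sentences


text = "token_11    label_11\ntoken_12    label_12\n\ntoken_21    label_21\n"
parsed = parse_ner_lines(text.splitlines())
print(len(parsed))  # 2
```

Each returned pair holds the tokens and labels of one sentence, in file order.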

Output

After training the model, the pipeline saves the following files:

  • model.pth - PyTorch NER model
  • model.onnx - ONNX NER model (optional)
  • token2idx.json - mapping from token to its index
  • label2idx.json - mapping from label to its index
  • config.yaml - config that was used to train the model
  • logging.txt - logging file
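The two JSON mappings listed above are what you need to turn raw tokens into model inputs and predicted indices back into labels. The sketch below assumes that each file is a flat JSON object from string to integer and that the unknown token is stored as '<UNK>'; both are assumptions about the saved files, and the function names are hypothetical.

```python
import json
from typing import Dict, List


def load_mapping(path: str) -> Dict[str, int]:
    """Read a string-to-index mapping such as token2idx.json or label2idx.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def encode_tokens(
    tokens: List[str],
    token2idx: Dict[str, int],
    unk_token: str = "<UNK>",  # assumed name of the unknown-token entry
    lower: bool = True,  # should mirror the 'lower' setting used for training
) -> List[int]:
    """Convert tokens to indices, falling back to the unknown-token index."""
    unk_idx = token2idx[unk_token]
    return [token2idx.get(t.lower() if lower else t, unk_idx) for t in tokens]


def decode_labels(indices: List[int], label2idx: Dict[str, int]) -> List[str]:
    """Convert predicted label indices back to label strings."""
    idx2label = {idx: label for label, idx in label2idx.items()}
    return [idx2label[i] for i in indices]
```

With the default config, the files would be loaded from the 'models' folder, for example load_mapping("models/token2idx.json"); the exact layout inside that folder is not specified here.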

Models

List of implemented models:

  • BiLSTM
  • BiLSTMCRF
  • BiLSTMAttn
  • BiLSTMAttnCRF
  • BiLSTMCNN
  • BiLSTMCNNCRF
  • BiLSTMCNNAttn
  • BiLSTMCNNAttnCRF

Evaluation

All results were obtained on the CoNLL-2003 dataset. No hyperparameter search was performed.

Model     Train F1-weighted    Validation F1-weighted    Test F1-weighted
BiLSTM    0.968                0.928                     0.876
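For readers unfamiliar with the metric, weighted F1 averages the per-label F1 scores, weighting each label by its number of true occurrences. The sketch below computes it over flat label sequences in plain Python; whether the pipeline evaluates at token level and which labels it includes are assumptions, and f1_weighted is a hypothetical helper rather than the package's code.

```python
from collections import Counter
from typing import Sequence


def f1_weighted(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Support-weighted average of per-label F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)  # true occurrences per label
    predicted = Counter(y_pred)  # predicted occurrences per label
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total  # weight by the label's share of true labels
    return score


y_true = ["O", "O", "B-PER", "B-LOC"]
y_pred = ["O", "B-PER", "B-PER", "O"]
print(round(f1_weighted(y_true, y_pred), 4))  # 0.4167
```

Because frequent labels dominate the weighting, a high weighted F1 on data where most tokens are 'O' does not by itself imply strong entity-level performance.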

Requirements

Python >= 3.6

Citation

If you use pytorch_ner in a scientific publication, we would appreciate a reference to the following BibTeX entry:

@misc{dayyass2020ner,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training NER models using PyTorch},
    howpublished = {\url{https://github.com/dayyass/pytorch_ner}},
    year         = {2020}
}

TODO: docker cuda
