Uses Google's BERT for Chinese natural language processing tasks such as named entity recognition, and serves the trained model as a service


toolkit-bert-ner

Builds on Google's pre-trained BERT model, adds a BiLSTM layer and a CRF layer, and trains a Chinese named entity recognition model.

Download project and install

You can install this project with pip:

pip install -i https://test.pypi.org/simple/ toolkit-bert-ner==1.0.0

or from source:

git clone http://git.huimeimt.net:8008/ds/toolkit-bert-ner.git
cd toolkit-bert-ner/
python3 setup.py install

If you do not want to install it, clone this project and use run.py to train the model or start the service.

Train model:

Use -help to list the parameters for training the named entity recognition model. The parameters data_dir, bert_config_file, output_dir, init_checkpoint, and vocab_file must be specified.

toolkit-bert-ner-train -help

The train, dev, and test datasets look like this:

海 O
钓 O
比 O
赛 O
地 O
点 O
在 O
厦 B-LOC
门 I-LOC
与 O
金 B-LOC
门 I-LOC
之 O
间 O
的 O
海 O
域 O
。 O
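
The sample above puts one token and its label on each line, with a blank line between sentences. A minimal sketch of a reader for that layout follows; read_sentences is a hypothetical helper written for illustration, not a function from this project.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token label' lines into sentences split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        parts = raw.split()
        if not parts:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if len(parts) < 2:
            # Skip malformed lines that have no label.
            continue
        # First field is the token, last field is its label.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = ["厦 B-LOC", "门 I-LOC", "。 O", "", "金 B-LOC", "门 I-LOC"]
for sentence in read_sentences(sample):
    print(sentence)
```

To read a real file, pass an open file object (opened with encoding="utf-8") instead of the sample list.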

Each line holds a token followed by its label, and sentences are separated by a blank line. The maximum length of each sentence is set by the max_seq_length parameter.
You can get training data from the two git repos above.
Train the NER model by running the command below:

toolkit_bert_ner_training \
    -data_dir {your dataset dir} \
    -output_dir {training output dir} \
    -init_checkpoint {Google BERT model dir} \
    -bert_config_file {bert_config.json under the Google BERT model dir} \
    -vocab_file {vocab.txt under the Google BERT model dir}

For example, my init_checkpoint is:

init_checkpoint = {$HOME}/pre-trained-models/chinese_L-12_H-768_A-12/bert_model.ckpt
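
The three BERT-related arguments all point into the same downloaded model directory. The sketch below assembles the argument list for the training command from that directory, under the assumption that the checkpoint prefix, config, and vocabulary are named as in the example above. training_args is a hypothetical helper, and the command name and flags are copied from the command shown earlier.

```python
import os
from typing import List


def training_args(data_dir: str, output_dir: str, bert_dir: str) -> List[str]:
    """Build the argument list for the training command shown above."""
    return [
        "toolkit_bert_ner_training",
        "-data_dir", data_dir,
        "-output_dir", output_dir,
        # As in the author's example, this is the checkpoint prefix inside the model dir.
        "-init_checkpoint", os.path.join(bert_dir, "bert_model.ckpt"),
        "-bert_config_file", os.path.join(bert_dir, "bert_config.json"),
        "-vocab_file", os.path.join(bert_dir, "vocab.txt"),
    ]


args = training_args("NERdata", "output", "chinese_L-12_H-768_A-12")
print(" ".join(args))
```

The list can be handed to subprocess.run(args, check=True) once the package is installed.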

You can specify the labels with the -label_list parameter; otherwise the project gets the labels from the training data.

# comma-separated
-labels 'B-LOC, I-LOC ...'
Or save the labels in a file such as labels.txt, one label per line:
-labels labels.txt
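
The two accepted forms, a comma-separated string or a file with one label per line, could be handled along these lines. This is a sketch only: parse_labels is a hypothetical helper, and the project's own parsing may differ.

```python
import os
from typing import List


def parse_labels(spec: str) -> List[str]:
    """Return labels from a comma-separated string or a one-per-line file."""
    if os.path.isfile(spec):
        # File form: one label per line.
        with open(spec, encoding="utf-8") as fh:
            items = fh.read().splitlines()
    else:
        # Inline form: labels separated by commas.
        items = spec.split(",")
    # Strip surrounding whitespace and drop empty entries.
    return [item.strip() for item in items if item.strip()]


print(parse_labels("B-LOC, I-LOC, O"))
```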

After training, the NER model is saved in the {output_dir} you specified on the command line above.

My training environment: Tesla P40, 24 GB memory

As Service

toolkit-bert-ner-serving-start -help

Then start the NER service with the command below:

toolkit_bert_ner_serving \
    -model_dir C:\workspace\python\BERT_Base\output\ner2 \
    -bert_model_dir F:\chinese_L-12_H-768_A-12 \
    -model_pb_dir C:\workspace\python\BERT_Base\model_pb_dir \
    -mode NER

You can test the client with the code below:

1. NER Client

import time
from bert_base.client import BertClient

with BertClient(show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
    start_t = time.perf_counter()
    # Named `text` to avoid shadowing the built-in str
    text = '1月24日，新华社对外发布了中央对雄安新区的指导意见，洋洋洒洒1.2万多字，17次提到北京，4次提到天津，信息量很大，其实也回答了人们关心的很多问题。'
    rst = bc.encode([text, text])
    print('rst:', rst)
    print(time.perf_counter() - start_t)
    # Pre-tokenized input: pass lists of characters and set is_tokenized=True
    rst = bc.encode([list(text), list(text)], is_tokenized=True)

License

MIT.

How to train

1. Download the BERT Chinese model:

wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip

2. Put the BERT Chinese model in $HOME/pre-trained-models/:

mkdir -p $HOME/pre-trained-models/
unzip chinese_L-12_H-768_A-12.zip -d $HOME/pre-trained-models/

3. Train model

First method:

python3 bert_lstm_ner.py \
    --task_name="NER" \
    --do_train=True \
    --do_eval=True \
    --do_predict=True \
    --data_dir=NERdata \
    --vocab_file=checkpoint/vocab.txt \
    --bert_config_file=checkpoint/bert_config.json \
    --init_checkpoint=checkpoint/bert_model.ckpt \
    --max_seq_length=128 \
    --train_batch_size=32 \
    --learning_rate=2e-5 \
    --num_train_epochs=3.0 \
    --output_dir=./output

Or replace the BERT path and project path in bert_lstm_ner.py:

if os.name == 'nt':  # Windows path config
    bert_path = '{your BERT model path}'
    root_path = '{project path}'
else:  # Linux path config
    bert_path = '{your BERT model path}'
    root_path = '{project path}'

Then run:

python3 bert_lstm_ner.py

USING BLSTM-CRF OR CRF ONLY FOR DECODING

Edit the call to add_blstm_crf_layer around line 450 of bert_lstm_ner.py and set its crf_only parameter to True or False.

CRF-only output layer:

    blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=FLAGS.lstm_size, cell_type=FLAGS.cell, num_layers=FLAGS.num_layers,
                          dropout_rate=FLAGS.droupout_rate, initializers=initializers, num_labels=num_labels,
                          seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
    rst = blstm_crf.add_blstm_crf_layer(crf_only=True)

BiLSTM with CRF output layer:

    blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=FLAGS.lstm_size, cell_type=FLAGS.cell, num_layers=FLAGS.num_layers,
                          dropout_rate=FLAGS.droupout_rate, initializers=initializers, num_labels=num_labels,
                          seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
    rst = blstm_crf.add_blstm_crf_layer(crf_only=False)
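
Both variants end in a CRF layer, which scores whole label sequences instead of picking each label on its own. As a rough standalone illustration of the decoding step of a linear-chain CRF (Viterbi decoding), here is a pure-Python sketch. It is not the project's code, which builds the layer through BLSTM_CRF; the function name, the score tables, and the label indices are assumptions made for illustration.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label index sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    num_labels = len(transitions)
    scores = list(emissions[0])
    backpointers: List[List[int]] = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_labels):
            # Best previous label for ending at label j on this step.
            best_i = max(range(num_labels),
                         key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + step[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Walk the backpointers from the best final label to the start.
    best = max(range(num_labels), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Label indices (assumed): 0 = O, 1 = B-LOC, 2 = I-LOC
emissions = [[1.0, 0.9, 0.0], [0.0, 0.0, 2.0]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(viterbi_decode(emissions, transitions))  # [1, 2]
```

In this toy input, taking the best label at each position independently would give O followed by I-LOC. The large penalty on the O to I-LOC transition makes the decoder choose B-LOC followed by I-LOC instead.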

ONLINE PREDICT

Once the model has finished training, run:

python3 terminal_predict.py

Using NER as Service

Service

Using NER as a service is simple. Run the command below from the project root path:

python3 runs.py \
    -mode NER \
    -bert_model_dir /home/macan/ml/data/chinese_L-12_H-768_A-12 \
    -ner_model_dir /home/macan/ml/data/bert_ner \
    -model_pd_dir /home/macan/ml/workspace/BERT_Base/output/predict_optimizer \
    -num_worker 8

Client

For client usage, see the client_test.py script:

import time
from client.client import BertClient

# Raw string so the backslashes in the Windows path are kept literally
ner_model_dir = r'C:\workspace\python\BERT_Base\output\predict_ner'
with BertClient(ner_model_dir=ner_model_dir, show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
    start_t = time.perf_counter()
    # Named `text` to avoid shadowing the built-in str
    text = '1月24日，新华社对外发布了中央对雄安新区的指导意见，洋洋洒洒1.2万多字，17次提到北京，4次提到天津，信息量很大，其实也回答了人们关心的很多问题。'
    rst = bc.encode([text])
    print('rst:', rst)
    print(time.perf_counter() - start_t)

NOTE: For the input format, you can refer to the bert-as-service project.
Contributions of client code in other languages, such as Java, are welcome.
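
The source does not show what bc.encode returns. Assuming it yields one BIO label per input character (an assumption, not confirmed here), the labels could be merged into entities with a helper like the one below. bio_to_spans is a hypothetical name used for illustration.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    text = ""
    kind: Optional[str] = None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity starts; close any entity still open.
            if kind is not None:
                spans.append((text, kind))
            text, kind = token, label[2:]
        elif label.startswith("I-") and kind == label[2:]:
            # Continue the open entity of the same type.
            text += token
        else:
            # Any other label ends the open entity.
            if kind is not None:
                spans.append((text, kind))
            text, kind = "", None
    if kind is not None:
        spans.append((text, kind))
    return spans


tokens = list("在厦门与金门之间")
labels = ["O", "B-LOC", "I-LOC", "O", "B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(tokens, labels))
```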

Using your own data to train

If you want to train the NER model on your own data, you only need to modify the get_labels function.

def get_labels(self):
    return ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]

NOTE: "X", "[CLS]", and "[SEP]" are required; replace the other entries in the returned list with the labels used in your data.
Or you can use the code below to let the program get the labels from the training data automatically:

def get_labels(self):
    # Reading the labels from the training file carries some risk.
    if os.path.exists(os.path.join(FLAGS.output_dir, 'label_list.pkl')):
        with codecs.open(os.path.join(FLAGS.output_dir, 'label_list.pkl'), 'rb') as rf:
            self.labels = pickle.load(rf)
    else:
        if len(self.labels) > 0:
            self.labels = self.labels.union(set(["X", "[CLS]", "[SEP]"]))
            with codecs.open(os.path.join(FLAGS.output_dir, 'label_list.pkl'), 'wb') as rf:
                pickle.dump(self.labels, rf)
        else:
            self.labels = ["O", 'B-TIM', 'I-TIM', "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]
    return self.labels
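
For comparison, here is a standalone sketch of collecting the label set from training lines and appending the three required labels. collect_labels is a hypothetical helper, not part of the project. It returns a sorted list because iteration order over a Python set of strings can differ between runs, which would change the label-to-id mapping.

```python
from typing import Iterable, List

SPECIAL_LABELS = ["X", "[CLS]", "[SEP]"]


def collect_labels(lines: Iterable[str]) -> List[str]:
    """Collect the distinct labels in 'token label' lines, plus the special ones."""
    found = set()
    for raw in lines:
        parts = raw.split()
        if len(parts) >= 2:
            # The label is the last field on the line.
            found.add(parts[-1])
    # Sort so the label order is the same on every run.
    return sorted(found - set(SPECIAL_LABELS)) + SPECIAL_LABELS


print(collect_labels(["厦 B-LOC", "门 I-LOC", "", "。 O"]))
```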



This version: 1.0.2 (Jan 15, 2020)

Download files

Source Distribution: toolkit_bert_ner-1.0.2.tar.gz (91.2 kB), uploaded Jan 15, 2020
Built Distribution: toolkit_bert_ner-1.0.2-py3-none-any.whl (106.3 kB), uploaded Jan 15, 2020, Python 3

Hashes for toolkit_bert_ner-1.0.2.tar.gz

Algorithm	Hash digest
SHA256	`fb7f6f67fdafc400ce0b822f9503e24aaf24e6200a645175c52b52a985502c85`
MD5	`3a31b5d6d95653fa885a8945167812f0`
BLAKE2b-256	`938c7964762d65d37b8646665bee0153bfb4b7746e5f357de2eabc29bec2fea7`

Hashes for toolkit_bert_ner-1.0.2-py3-none-any.whl

Algorithm	Hash digest
SHA256	`3ca0298f2384bab62a7f226fa59e1954d98cd22ab8ef4d87fa83e17c1189a714`
MD5	`1c2f309e080313beaa01df8007dd7c49`
BLAKE2b-256	`98c8b907a8d3933b4ce3dd71be063f8cf266165531d6a425a79eddf9ea5e8b43`