Use Google's BERT for Chinese natural language processing tasks such as named entity recognition, and serve the trained models as a service

Project description

toolkit-bert-ner

Starting from Google's pre-trained BERT model, add a BiLSTM layer and a CRF layer, then train a Chinese named entity recognition (NER) model.

Download project and install

You can install this project by:

pip install -i https://test.pypi.org/simple/ toolkit-bert-ner==1.0.0

OR

git clone http://git.huimeimt.net:8008/ds/toolkit-bert-ner.git
cd toolkit-bert-ner/
python3 setup.py install

If you do not want to install the package, just clone this project and refer to run.py to train the model or start the service.

Train model:

Use -help to list the parameters for training the named entity recognition model. The parameters data_dir, bert_config_file, output_dir, init_checkpoint, and vocab_file must be specified.

toolkit-bert-ner-train -help

The train/dev/test datasets look like this:

海 O
钓 O
比 O
赛 O
地 O
点 O
在 O
厦 B-LOC
门 I-LOC
与 O
金 B-LOC
门 I-LOC
之 O
间 O
的 O
海 O
域 O
。 O

Each line holds a token followed by its label, separated by a space, and sentences are separated by a blank line. The maximum length of each sentence is set by the max_seq_length parameter.
You can get training data from the two git repos mentioned above.
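The data format described above can be read with a few lines of Python. This is only a minimal sketch for inspecting your own files, not the project's loader; the function name `parse_ner_text` is hypothetical, and it assumes exactly one token and one label per non-blank line.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def parse_ner_text(text: str) -> List[Sentence]:
    """Parse 'token label' lines; a blank line ends the current sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line_no, raw_line in enumerate(text.splitlines(), 1):
        line = raw_line.strip()
        if not line:
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'token label', got {raw_line!r}")
        current.append((parts[0], parts[1]))
    # The last sentence may not be followed by a blank line.
    if current:
        sentences.append(current)
    return sentences


sample = "厦 B-LOC\n门 I-LOC\n\n海 O\n域 O\n。 O\n"
for sentence in parse_ner_text(sample):
    print(sentence)
```

Running a check like this before training can catch malformed lines early, which is cheaper than discovering them partway through a training run.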
You can train the NER model by running the command below:

toolkit_bert_ner_training \
    -data_dir {your dataset dir} \
    -output_dir {training output dir} \
    -init_checkpoint {Google BERT model dir} \
    -bert_config_file {bert_config.json under the Google BERT model dir} \
    -vocab_file {vocab.txt under the Google BERT model dir}

For example, my init_checkpoint is:

init_checkpoint = {$HOME}/pre-trained-models/chinese_L-12_H-768_A-12/bert_model.ckpt

You can specify the labels with the -label_list parameter; if you do not, the project gets the labels from the training data.

# Comma-separated list
-labels 'B-LOC, I-LOC ...'
# OR save the labels in a file such as labels.txt, one label per line
-labels labels.txt
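The two label formats above (a comma-separated string or a file with one label per line) could be handled by a single helper. The sketch below is illustrative only: `parse_labels` is a hypothetical name, and it is not taken from the project's own argument handling.

```python
import os
from typing import List


def parse_labels(value: str) -> List[str]:
    """Return labels from a file (one per line) or a comma-separated string."""
    if os.path.isfile(value):
        # File form: one label per line.
        with open(value, encoding="utf-8") as f:
            items = f.read().splitlines()
    else:
        # Inline form: labels separated by commas.
        items = value.split(",")
    # Drop surrounding whitespace and empty entries.
    return [item.strip() for item in items if item.strip()]


print(parse_labels("B-LOC, I-LOC"))
```

Stripping whitespace matters here because the inline example separates labels with a comma followed by a space.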

After training, the NER model is saved in the {output_dir} that you specified on the command line above.

My training environment: Tesla P40 with 24 GB of memory.

As Service

toolkit-bert-ner-serving-start -help

Then you can start the NER service with the command below:

toolkit_bert_ner_serving \
    -model_dir C:\workspace\python\BERT_Base\output\ner2 \
    -bert_model_dir F:\chinese_L-12_H-768_A-12 \
    -model_pb_dir C:\workspace\python\BERT_Base\model_pb_dir \
    -mode NER

You can test the client with the code below:

1. NER Client

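import time
from bert_base.client import BertClient

with BertClient(show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
    start_t = time.perf_counter()
    # Named `text` rather than `str` to avoid shadowing the built-in type.
    text = '1月24日,新华社对外发布了中央对雄安新区的指导意见,洋洋洒洒1.2万多字,17次提到北京,4次提到天津,信息量很大,其实也回答了人们关心的很多问题。'
    rst = bc.encode([text, text])
    print('rst:', rst)
    print(time.perf_counter() - start_t)
    # Pre-tokenized input: pass lists of characters with is_tokenized=True.
    # This call stays inside the `with` block, while the client is still open.
    rst = bc.encode([list(text), list(text)], is_tokenized=True)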
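To turn predicted tags into entities, the BIO labels have to be grouped into spans. The sketch below assumes the service returns one label per input character, in the B-/I-/O scheme shown in the dataset example; the exact return shape of `bc.encode` is not documented here, so treat that as an assumption. `bio_to_spans` is a hypothetical helper, not part of the package.

```python
from typing import List, Sequence, Tuple

Span = Tuple[str, str, int, int]  # (entity text, entity type, start, end)


def bio_to_spans(tokens: Sequence[str], labels: Sequence[str]) -> List[Span]:
    """Group B-/I- labels into entity spans; end index is exclusive."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    spans: List[Span] = []
    start, ent_type = None, None
    # The trailing "O" is a sentinel that flushes an entity ending the sentence.
    for i, label in enumerate(list(labels) + ["O"]):
        if label.startswith("I-") and ent_type == label[2:]:
            continue  # still inside the current entity
        if ent_type is not None:
            spans.append(("".join(tokens[start:i]), ent_type, start, i))
            start, ent_type = None, None
        # A stray I- tag without a preceding B- is treated leniently as a start.
        if label.startswith("B-") or label.startswith("I-"):
            start, ent_type = i, label[2:]
    return spans


tokens = list("在厦门与金门之间")
labels = ["O", "B-LOC", "I-LOC", "O", "B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(tokens, labels))
```

Labels such as "X", "[CLS]", and "[SEP]" do not start with B- or I-, so this sketch treats them the same way as "O".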

License

MIT.

How to train

1. Download the BERT Chinese model:

wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip  

2. Put the BERT Chinese model in $HOME/pre-trained-models/:

mkdir -p $HOME/pre-trained-models/
unzip chinese_L-12_H-768_A-12.zip -d $HOME/pre-trained-models/
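Before training, it can help to confirm that the unpacked directory contains the files that the training options point to. The following is a minimal sketch, assuming the file names used elsewhere in this README (bert_config.json, vocab.txt, and checkpoint files whose names start with bert_model.ckpt); `missing_bert_files` is a hypothetical helper.

```python
import os
from typing import List


def missing_bert_files(model_dir: str) -> List[str]:
    """Return the expected BERT files that are absent from model_dir."""
    names = os.listdir(model_dir) if os.path.isdir(model_dir) else []
    missing = [name for name in ("bert_config.json", "vocab.txt") if name not in names]
    # A TensorFlow checkpoint is stored as several files sharing one prefix.
    if not any(name.startswith("bert_model.ckpt") for name in names):
        missing.append("bert_model.ckpt*")
    return missing


model_dir = os.path.join(
    os.path.expanduser("~"), "pre-trained-models", "chinese_L-12_H-768_A-12"
)
print("missing:", missing_bert_files(model_dir))
```

An empty list means all three kinds of files were found; anything else names what still needs to be unpacked.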

3. Train model

First method:
  python3 bert_lstm_ner.py   \
                  --task_name="NER"  \
                  --do_train=True   \
                  --do_eval=True   \
                  --do_predict=True   \
                  --data_dir=NERdata   \
                  --vocab_file=checkpoint/vocab.txt  \
                  --bert_config_file=checkpoint/bert_config.json \
                  --init_checkpoint=checkpoint/bert_model.ckpt   \
                  --max_seq_length=128   \
                  --train_batch_size=32   \
                  --learning_rate=2e-5   \
                  --num_train_epochs=3.0   \
                  --output_dir=./output
OR replace the BERT path and project path in bert_lstm_ner.py:
if os.name == 'nt': #windows path config
   bert_path = '{your BERT model path}'
   root_path = '{project path}'
else: # linux path config
   bert_path = '{your BERT model path}'
   root_path = '{project path}'

Then run:

python3 bert_lstm_ner.py

USING BLSTM-CRF OR ONLY CRF FOR DECODING

Edit line 450 of bert_lstm_ner.py and set the crf_only argument of the add_blstm_crf_layer function to True or False.

ONLY CRF output layer:

    blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=FLAGS.lstm_size, cell_type=FLAGS.cell, num_layers=FLAGS.num_layers,
                          dropout_rate=FLAGS.droupout_rate, initializers=initializers, num_labels=num_labels,
                          seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
    rst = blstm_crf.add_blstm_crf_layer(crf_only=True)

BiLSTM with CRF output layer

    blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=FLAGS.lstm_size, cell_type=FLAGS.cell, num_layers=FLAGS.num_layers,
                          dropout_rate=FLAGS.droupout_rate, initializers=initializers, num_labels=num_labels,
                          seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
    rst = blstm_crf.add_blstm_crf_layer(crf_only=False)
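In both variants the CRF layer picks the best label sequence as a whole rather than the best label at each position. The sketch below illustrates that idea with a plain-Python Viterbi decoder; it is a generic illustration under toy scores, not the project's TensorFlow implementation, and the names `viterbi_decode`, `emissions`, and `transitions` are chosen here for the example.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label path.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    num_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_labels):
            # Best previous label for arriving at label j.
            best_prev = max(range(num_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy label ids: 0 = O, 1 = B-LOC, 2 = I-LOC. The O -> I-LOC move is penalized.
transitions = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
emissions = [[1.0, 0.9, 0.0], [0.0, 0.0, 2.0]]
print(viterbi_decode(emissions, transitions))
```

With these toy scores, picking the top label at each position independently would give O followed by I-LOC, which is not a valid BIO sequence; the transition penalty steers the decoder toward B-LOC followed by I-LOC instead.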

ONLINE PREDICT

Once the model has finished training, just run:

python3 terminal_predict.py

Using NER as Service

Service

Using NER as a service is simple: just run the Python script below from the project root path:

python3 runs.py \
    -mode NER \
    -bert_model_dir /home/macan/ml/data/chinese_L-12_H-768_A-12 \
    -ner_model_dir /home/macan/ml/data/bert_ner \
    -model_pd_dir /home/macan/ml/workspace/BERT_Base/output/predict_optimizer \
    -num_worker 8

Client

For client usage, refer to the client_test.py script:

import time
from client.client import BertClient

# Raw string, so the backslashes in the Windows path are not read as escapes.
ner_model_dir = r'C:\workspace\python\BERT_Base\output\predict_ner'
with BertClient( ner_model_dir=ner_model_dir, show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
    start_t = time.perf_counter()
    # Named `text` rather than `str` to avoid shadowing the built-in type.
    text = '1月24日,新华社对外发布了中央对雄安新区的指导意见,洋洋洒洒1.2万多字,17次提到北京,4次提到天津,信息量很大,其实也回答了人们关心的很多问题。'
    rst = bc.encode([text])
    print('rst:', rst)
    print(time.perf_counter() - start_t)

NOTE: For the input format, you can also refer to the bert-as-service project.
Contributions of client code in other languages, such as Java, are welcome.

Using your own data to train

If you want to train the NER model on your own data, you only need to modify the get_labels function.

def get_labels(self):
       return ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]

NOTE: "X", "[CLS]", and "[SEP]" are required; just replace the other labels in the returned list with the labels used in your data.
Alternatively, use the following code to let the program collect the labels from the training data automatically:

def get_labels(self):
        # Getting the labels by reading the train file carries some risk.
        if os.path.exists(os.path.join(FLAGS.output_dir, 'label_list.pkl')):
            with codecs.open(os.path.join(FLAGS.output_dir, 'label_list.pkl'), 'rb') as rf:
                self.labels = pickle.load(rf)
        else:
            if len(self.labels) > 0:
                self.labels = self.labels.union(set(["X", "[CLS]", "[SEP]"]))
                with codecs.open(os.path.join(FLAGS.output_dir, 'label_list.pkl'), 'wb') as rf:
                    pickle.dump(self.labels, rf)
            else:
                self.labels = ["O", 'B-TIM', 'I-TIM', "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]
        return self.labels
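The idea of collecting labels from the data and appending the three required ones can be sketched on its own. This is a simplified illustration, not the project's code: `build_label_list` and `build_label_map` are hypothetical names, and starting ids at 1 (leaving 0 free for padding) is an assumption of this sketch. The labels are sorted because the iteration order of a Python set of strings can differ between runs, which is one reason to persist the label list, as the code above does with label_list.pkl.

```python
from typing import Dict, Iterable, List

SPECIAL_LABELS = ["X", "[CLS]", "[SEP]"]


def build_label_list(data_labels: Iterable[str]) -> List[str]:
    """Deduplicate labels seen in the data and append the required ones."""
    unique = sorted(set(data_labels) - set(SPECIAL_LABELS))
    return unique + SPECIAL_LABELS


def build_label_map(label_list: List[str], first_id: int = 1) -> Dict[str, int]:
    """Map each label to an integer id (first_id=1 is an assumption here)."""
    return {label: i for i, label in enumerate(label_list, first_id)}


labels = build_label_list(["O", "B-LOC", "I-LOC", "O"])
print(labels)
print(build_label_map(labels))
```

Because the result is sorted, training and prediction runs that see the same set of labels will assign the same ids.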
