Use Google's BERT for Chinese natural language processing tasks such as named entity recognition, and serve the trained model as a service.
Project description
toolkit-bert-ner
Based on Google's pre-trained BERT model, this project adds a BiLSTM layer and a CRF layer to train a Chinese named entity recognition model.
Download project and install
You can install this project by:
pip install -i https://test.pypi.org/simple/ toolkit-bert-ner==1.0.0
OR
git clone http://git.huimeimt.net:8008/ds/toolkit-bert-ner.git
cd toolkit-bert-ner/
python3 setup.py install
If you do not want to install the package, just clone this project and use the <run.py> file to train the model or start the service.
Train model:
You can use -help to view the parameters for training the named entity recognition model; data_dir, bert_config_file, output_dir, init_checkpoint, and vocab_file must be specified:
toolkit-bert-ner-train -help
The train/dev/test data files look like this:
海 O
钓 O
比 O
赛 O
地 O
点 O
在 O
厦 B-LOC
门 I-LOC
与 O
金 B-LOC
门 I-LOC
之 O
间 O
的 O
海 O
域 O
。 O
Each line contains a token followed by its label, and sentences are separated by a blank line. The maximum sentence length is set by the [max_seq_length] parameter.
You can get training data from the git repositories referenced above.
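As a minimal illustration (assuming a hypothetical file name train.txt in the format above; this script is not part of the toolkit), the sketch below reads such a file into sentences of (token, label) pairs and reports the longest sentence, which is useful when choosing max_seq_length:

# Sketch: parse a "token label" file (hypothetical path train.txt) into sentences.
def read_bio_file(path):
    sentences, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, label = line.split()[:2]  # e.g. "厦 B-LOC" -> ("厦", "B-LOC")
            current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

sents = read_bio_file('train.txt')
print('sentences:', len(sents))
print('longest sentence length:', max(len(s) for s in sents))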
You can train the NER model by running the command below:
toolkit_bert_ner_training \
-data_dir {your dataset dir} \
-output_dir {training output dir} \
-init_checkpoint {Google BERT model dir} \
-bert_config_file {bert_config.json under the Google BERT model dir} \
-vocab_file {vocab.txt under the Google BERT model dir}
For example, my init_checkpoint is:
init_checkpoint = {$HOME}/pre-trained-models/chinese_L-12_H-768_A-12/bert_model.ckpt
You can specify labels explicitly with the -label_list parameter; otherwise, the project gets the labels from the training data.
# separated by commas
-labels 'B-LOC, I-LOC ...'
Or save the labels in a file such as labels.txt, one label per line:
-labels labels.txt
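If you go with the file option, a small sketch like the one below (again assuming a hypothetical train.txt in the format shown earlier) can collect the distinct labels from your training data and write labels.txt, one label per line:

# Sketch: write all distinct labels found in train.txt to labels.txt, one per line.
labels = set()
with open('train.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) >= 2:
            labels.add(parts[1])

with open('labels.txt', 'w', encoding='utf-8') as out:
    for label in sorted(labels):
        out.write(label + '\n')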
After training, the NER model will be saved in the {output_dir} you specified on the command line above.
My training environment: Tesla P40, 24 GB memory.
As a Service
toolkit-bert-ner-serving-start -help
Then you can start the NER service with the command below:
toolkit_bert_ner_serving \
-model_dir C:\workspace\python\BERT_Base\output\ner2 \
-bert_model_dir F:\chinese_L-12_H-768_A-12 \
-model_pb_dir C:\workspace\python\BERT_Base\model_pb_dir \
-mode NER
You can test the client with the code below:
1. NER Client
import time
from bert_base.client import BertClient
with BertClient(show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
    start_t = time.perf_counter()
    text = '1月24日,新华社对外发布了中央对雄安新区的指导意见,洋洋洒洒1.2万多字,17次提到北京,4次提到天津,信息量很大,其实也回答了人们关心的很多问题。'
    rst = bc.encode([text, text])
    print('rst:', rst)
    print(time.perf_counter() - start_t)
    # Pass pre-tokenized input (one character per token) with is_tokenized=True
    rst = bc.encode([list(text), list(text)], is_tokenized=True)
License
MIT.
How to train
1. Download the BERT Chinese model:
wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
2. Put the BERT Chinese model into $HOME/pre-trained-models/:
mkdir $HOME/pre-trained-models/
unzip chinese_L-12_H-768_A-12.zip -d $HOME/pre-trained-models/
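As a quick sanity check (paths assumed from the two steps above), you can confirm that the files the training command expects are present in the unzipped model directory:

# Sanity check: the training command expects vocab.txt, bert_config.json and the
# bert_model.ckpt files inside the unzipped directory (paths assumed from above).
import os

model_dir = os.path.expanduser('~/pre-trained-models/chinese_L-12_H-768_A-12')
for name in ['vocab.txt', 'bert_config.json', 'bert_model.ckpt.index']:
    path = os.path.join(model_dir, name)
    print(name, 'found' if os.path.exists(path) else 'MISSING')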
3. Train the model
First method:
python3 bert_lstm_ner.py \
--task_name="NER" \
--do_train=True \
--do_eval=True \
--do_predict=True \
--data_dir=NERdata \
--vocab_file=checkpoint/vocab.txt \
--bert_config_file=checkpoint/bert_config.json \
--init_checkpoint=checkpoint/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=./output
Second method: replace the BERT path and the project path in bert_lstm_ner.py:
if os.name == 'nt':  # Windows path config
    bert_path = '{your BERT model path}'
    root_path = '{project path}'
else:  # Linux path config
    bert_path = '{your BERT model path}'
    root_path = '{project path}'
Then run:
python3 bert_lstm_ner.py
USING BiLSTM-CRF OR CRF-ONLY FOR DECODING
Just edit line 450 of bert_lstm_ner.py and set the crf_only parameter of the add_blstm_crf_layer function to True or False.
CRF-only output layer:
blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=FLAGS.lstm_size, cell_type=FLAGS.cell, num_layers=FLAGS.num_layers,
dropout_rate=FLAGS.droupout_rate, initializers=initializers, num_labels=num_labels,
seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
rst = blstm_crf.add_blstm_crf_layer(crf_only=True)
BiLSTM with CRF output layer:
blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=FLAGS.lstm_size, cell_type=FLAGS.cell, num_layers=FLAGS.num_layers,
dropout_rate=FLAGS.droupout_rate, initializers=initializers, num_labels=num_labels,
seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
rst = blstm_crf.add_blstm_crf_layer(crf_only=False)
ONLINE PREDICT
Once the model has finished training, just run:
python3 terminal_predict.py
Using NER as Service
Service
Using NER as a service is simple; just run the Python script below from the project root path:
python3 runs.py \
-mode NER \
-bert_model_dir /home/macan/ml/data/chinese_L-12_H-768_A-12 \
-ner_model_dir /home/macan/ml/data/bert_ner \
-model_pd_dir /home/macan/ml/workspace/BERT_Base/output/predict_optimizer \
-num_worker 8
Client
For client usage, refer to the client_test.py script:
import time
from client.client import BertClient
ner_model_dir = r'C:\workspace\python\BERT_Base\output\predict_ner'
with BertClient(ner_model_dir=ner_model_dir, show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
    start_t = time.perf_counter()
    text = '1月24日,新华社对外发布了中央对雄安新区的指导意见,洋洋洒洒1.2万多字,17次提到北京,4次提到天津,信息量很大,其实也回答了人们关心的很多问题。'
    rst = bc.encode([text])
    print('rst:', rst)
    print(time.perf_counter() - start_t)
NOTE: for the input format, you can also refer to the bert-as-service project.
Contributions of client code in other languages, such as Java, are welcome.
Using your own data to train
If you want to use your own data to train the NER model, just modify the get_labels function.
def get_labels(self):
    return ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]
NOTE: "X", “[CLS]”, “[SEP]” These three are necessary, you just replace your data label to this return list.
Or you can use the code below to let the program automatically get the labels from the training data:
def get_labels(self):
    # Getting labels by reading the training file carries some risk.
    if os.path.exists(os.path.join(FLAGS.output_dir, 'label_list.pkl')):
        with codecs.open(os.path.join(FLAGS.output_dir, 'label_list.pkl'), 'rb') as rf:
            self.labels = pickle.load(rf)
    else:
        if len(self.labels) > 0:
            self.labels = self.labels.union(set(["X", "[CLS]", "[SEP]"]))
            with codecs.open(os.path.join(FLAGS.output_dir, 'label_list.pkl'), 'wb') as rf:
                pickle.dump(self.labels, rf)
        else:
            self.labels = ["O", 'B-TIM', 'I-TIM', "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]
    return self.labels
Reference:
- The evaluation code comes from: https://github.com/guillaumegenthial/tf_metrics/blob/master/tf_metrics/__init__.py