NERpy
🌈 Implementation of Named Entity Recognition using Python.
nlp-service implements several named entity recognition models, including BertSoftmax, BertCrf, and BertSpan, and compares their performance on standard datasets.
Guide
Feature
Named Entity Recognition Models
- BertSoftmax: performs entity recognition on top of a pretrained BERT model. This project implements training and prediction for the BertSoftmax model in PyTorch.
Evaluation
Entity Recognition
- Evaluation results on the English entity recognition dataset:
Arch | Backbone | Model Name | CoNLL-2003 |
---|---|---|---|
BertSoftmax | bert-base-uncased | bert4ner-base-uncased | 90.43 |
- Evaluation results on the Chinese entity recognition datasets:
Arch | Backbone | Model Name | CNER | PEOPLE | Avg | QPS |
---|---|---|---|---|---|---|
BertSoftmax | bert-base-chinese | bert4ner-base-chinese | 94.98 | 95.25 | 95.12 | 222 |
- Evaluation results of the models released by this project:
Arch | Backbone | Model Name | CNER(zh) | PEOPLE(zh) | CoNLL-2003(en) | QPS |
---|---|---|---|---|---|---|
BertSoftmax | bert-base-chinese | shibing624/bert4ner-base-chinese | 94.98 | 95.25 | - | 222 |
BertSoftmax | bert-base-uncased | shibing624/bert4ner-base-uncased | - | - | 90.43 | 243 |
Notes:
- All reported scores are F1.
- Each model was trained only on the train split of the corresponding dataset and evaluated on its test split; no external data was used.
- shibing624/bert4ner-base-chinese achieves SOTA performance among models of comparable parameter count. It was trained with the BertSoftmax method; run examples/training_ner_model_file_demo.py to reproduce the results on the Chinese datasets.
- shibing624/bert4ner-base-uncased was also trained with the BertSoftmax method; run examples/training_ner_model_eng_demo.py to reproduce the results on the English CoNLL-2003 dataset.
- All pretrained models can be loaded via transformers, e.g. the Chinese BERT model:
--model_name bert-base-chinese
- Download links for the Chinese entity recognition datasets are given below.
- QPS was measured on a Tesla V100 GPU with 32GB of memory.
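The scores above are entity-level F1: a predicted entity counts as correct only if its type and span boundaries both match the gold annotation. A minimal sketch of that metric (the spans below are a toy example, not actual dataset results, and the function name is illustrative):

```python
def entity_f1(gold, pred):
    """Entity-level F1 over sets of (type, start, end) spans.

    A span counts as a true positive only when type, start and end all match.
    """
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy gold/predicted spans, not taken from the datasets above
gold = {("PER", 0, 1), ("LOC", 3, 3), ("ORG", 5, 6)}
pred = {("PER", 0, 1), ("LOC", 3, 3)}
print(round(entity_f1(gold, pred), 4))  # precision 1.0, recall 2/3 -> 0.8
```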
Demo
Demo: https://huggingface.co/spaces/shibing624/nlp-service
Run examples/gradio_demo.py to try the demo locally:
python examples/gradio_demo.py
Install
pip3 install torch # conda install pytorch
pip3 install -U nlp-service
or
git clone https://github.com/shibing624/nlp-service.git
cd nlp-service
python3 setup.py install
Usage
Named Entity Recognition
English entity recognition:
>>> from nlp_service import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities: [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
Chinese entity recognition:
>>> from nlp_service import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-chinese")
>>> predictions, raw_outputs, entities = model.predict(["常建良,男,1963年出生,工科学士,高级工程师"], split_on_space=False)
entities: [('常建良', 'PER'), ('1963年', 'TIME')]
example: examples/base_zh_demo.py
import sys

sys.path.append('..')
from nlp_service import NERModel

if __name__ == '__main__':
    # Chinese NER model (BertSoftmax): shibing624/bert4ner-base-chinese
    model = NERModel("bert", "shibing624/bert4ner-base-chinese")
    sentences = [
        "常建良,男,1963年出生,工科学士,高级工程师,北京物资学院客座副教授",
        "1985年8月-1993年在国家物资局、物资部、国内贸易部金属材料流通司从事国家统配钢材中特种钢材品种的调拨分配工作,先后任科员、主任科员。"
    ]
    predictions, raw_outputs, entities = model.predict(sentences)
    print(entities)
output:
[('常建良', 'PER'), ('1963年', 'TIME'), ('北京物资学院', 'ORG')]
[('1985年', 'TIME'), ('8月', 'TIME'), ('1993年', 'TIME'), ('国家物资局', 'ORG'), ('物资部', 'ORG'), ('国内贸易部金属材料流通司', 'ORG')]
The shibing624/bert4ner-base-chinese
model was trained with the BertSoftmax method on the Chinese PEOPLE (People's Daily) dataset and has been uploaded to the HuggingFace model hub as shibing624/bert4ner-base-chinese. It is the default model of nlp_service.NERModel,
so it can be used via the example above, or called through the transformers library as shown below; the model is downloaded automatically to ~/.cache/huggingface/transformers
Usage (HuggingFace Transformers)
Without nlp-service, you can use the model like this:
First, pass your input through the transformer model; then apply the BIO tags to recover the entity words.
example: examples/use_origin_transformers_demo.py
import os

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']

sentence = "王宏伟来自北京,是个警察,喜欢去王府井游玩儿。"


def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # Drop the [CLS]/[SEP] predictions so the labels align with the tokens
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
    print(sentence)
    print(char_tags)

    # Merge BIO tags into (type, start, end) spans and slice the entity words
    pred_labels = [i[1] for i in char_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = sentence[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))
    print("Sentence entity:")
    print(entities)


get_entity(sentence)
output:
王宏伟来自北京,是个警察,喜欢去王府井游玩儿。
[('王', 'B-PER'), ('宏', 'I-PER'), ('伟', 'I-PER'), ('来', 'O'), ('自', 'O'), ('北', 'B-LOC'), ('京', 'I-LOC'), (',', 'O'), ('是', 'O'), ('个', 'O'), ('警', 'O'), ('察', 'O'), (',', 'O'), ('喜', 'O'), ('欢', 'O'), ('去', 'O'), ('王', 'B-LOC'), ('府', 'I-LOC'), ('井', 'I-LOC'), ('游', 'O'), ('玩', 'O'), ('儿', 'O'), ('。', 'O')]
Sentence entity:
[('王宏伟', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]
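The get_entities helper used above turns a flat BIO tag sequence into (type, start, end) spans with inclusive end indices. A minimal pure-Python equivalent of that conversion (a sketch of the behavior, not seqeval's actual implementation; the function name is illustrative):

```python
def bio_to_entities(tags):
    """Convert BIO tags into (type, start, end) spans, end index inclusive."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            # A B- tag, or an I- tag of a different type, starts a new entity
            if etype is not None:
                entities.append((etype, start, i - 1))
            etype, start = tag[2:], i
        elif tag == "O":
            # O closes any open entity
            if etype is not None:
                entities.append((etype, start, i - 1))
            etype, start = None, None
    if etype is not None:
        entities.append((etype, start, len(tags) - 1))
    return entities


tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_entities(tags))  # [('PER', 0, 1), ('LOC', 4, 5)]
```

The start/end indices can then be used to slice the original sentence, as the demo above does with `sentence[i[1]: i[2] + 1]`.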
Datasets
Entity recognition datasets
Dataset | Corpus | Download Link | File Size |
---|---|---|---|
CNER Chinese NER dataset | CNER (120K characters) | CNER github | 1.1MB |
PEOPLE Chinese NER dataset | People's Daily corpus (2M characters) | PEOPLE github | 12.8MB |
CoNLL03 English NER dataset | CoNLL-2003 corpus (220K words) | CoNLL03 github | 1.7MB |
Data format of the CNER Chinese NER dataset:
美 B-LOC
国 I-LOC
的 O
华 B-PER
莱 I-PER
士 I-PER
我 O
跟 O
他 O
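The format above is one token per line ("token tag"), with a blank line separating sentences. A minimal loader sketch for that format (the function name and tuple layout are illustrative; the columns mirror the sentence_id/words/labels layout expected by NERModel training data):

```python
def load_bio_lines(lines):
    """Parse token-per-line BIO data into (sentence_id, word, label) rows.

    A blank line marks a sentence boundary.
    """
    rows, sent_id = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: close the current sentence if it has any tokens
            if rows and rows[-1][0] == sent_id:
                sent_id += 1
            continue
        word, label = line.split()[:2]
        rows.append((sent_id, word, label))
    return rows


sample = ["美 B-LOC", "国 I-LOC", "的 O", "", "我 O", "跟 O", "他 O"]
print(load_bio_lines(sample))
# [(0, '美', 'B-LOC'), (0, '国', 'I-LOC'), (0, '的', 'O'), (1, '我', 'O'), (1, '跟', 'O'), (1, '他', 'O')]
```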
BertSoftmax Model
The BertSoftmax entity recognition model uses the standard BERT-based sequence labeling approach:
Network structure:
Model files:
shibing624/bert4ner-base-chinese
├── config.json
├── model_args.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
BertSoftmax Model Training and Prediction
Training example: examples/training_ner_model_toy_demo.py
import sys

import pandas as pd

sys.path.append('..')
from nlp_service.ner_model import NERModel

# Creating samples
train_samples = [
    [0, "HuggingFace", "B-MISC"],
    [0, "Transformers", "I-MISC"],
    [0, "started", "O"],
    [0, "with", "O"],
    [0, "text", "O"],
    [0, "classification", "B-MISC"],
    [1, "Nerpy", "B-MISC"],
    [1, "Model", "I-MISC"],
    [1, "can", "O"],
    [1, "now", "O"],
    [1, "perform", "O"],
    [1, "NER", "B-MISC"],
]
train_data = pd.DataFrame(train_samples, columns=["sentence_id", "words", "labels"])

test_samples = [
    [0, "HuggingFace", "B-MISC"],
    [0, "Transformers", "I-MISC"],
    [0, "was", "O"],
    [0, "built", "O"],
    [0, "for", "O"],
    [0, "text", "O"],
    [0, "classification", "B-MISC"],
    [1, "Nerpy", "B-MISC"],
    [1, "Model", "I-MISC"],
    [1, "then", "O"],
    [1, "expanded", "O"],
    [1, "to", "O"],
    [1, "perform", "O"],
    [1, "NER", "B-MISC"],
]
test_data = pd.DataFrame(test_samples, columns=["sentence_id", "words", "labels"])

# Create a NERModel
model = NERModel(
    "bert",
    "bert-base-uncased",
    args={"overwrite_output_dir": True, "reprocess_input_data": True, "num_train_epochs": 1},
    use_cuda=False,
)

# Train the model
model.train_model(train_data)

# Evaluate the model
result, model_outputs, predictions = model.eval_model(test_data)
print(result, model_outputs, predictions)

# Predictions on text strings
sentences = ["Nerpy Model perform sentence NER", "HuggingFace Transformers build for text"]
predictions, raw_outputs, entities = model.predict(sentences, split_on_space=True)
print(predictions, entities)
- Train and evaluate the BertSoftmax model on the Chinese CNER dataset
example: examples/training_ner_model_file_demo.py
cd examples
python3 training_ner_model_file_demo.py --do_train --do_predict --num_epochs 5 --task_name cner
- Train and evaluate the BertSoftmax model on the English CoNLL-2003 dataset
example: examples/training_ner_model_eng_demo.py
cd examples
python3 training_ner_model_eng_demo.py --do_train --do_predict --num_epochs 5
Contact
- GitHub Issues (suggestions):
- Email me: xuming: xuming624@qq.com
- WeChat: add my WeChat ID xuming624 with the note "Name-Company-NLP" to join the NLP discussion group.
Citation
If you use nlp-service in your research, please cite it as follows:
APA:
Xu, M. nlp-service: Named Entity Recognition Toolkit (Version 0.0.2) [Computer software]. https://github.com/shibing624/nlp-service
BibTeX:
@software{Xu_nlp-service_Text_to,
author = {Xu, Ming},
title = {{nlp-service: Named Entity Recognition Toolkit}},
url = {https://github.com/shibing624/nlp-service},
version = {0.0.2}
}
License
Licensed under The Apache License 2.0, free for commercial use. Please include a link to nlp-service and the license in your product documentation.
Contribute
The project code is still rough. If you have improvements, you are welcome to submit them back to this project. Before submitting, please note the following two points:
- Add corresponding unit tests in tests
- Run all unit tests with python -m pytest -v and make sure they all pass
After that, you can submit your PR.