nerpy: Named Entity Recognition toolkit using Python

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

Project description

NERpy

🌈 Implementation of Named Entity Recognition using Python.

nerpy实现了BertSoftmax、BertCrf、BertSpan等多种命名实体识别模型，并在标准数据集上比较了各模型的效果。

Guide

Feature
Evaluation
Install
Usage
Contact
Reference

Feature

命名实体识别模型

BertSoftmax：BertSoftmax基于BERT预训练模型实现实体识别，本项目基于PyTorch实现了BertSoftmax模型的训练和预测

Evaluation

实体识别

英文实体识别数据集的评测结果：

Arch	Backbone	Model Name	CoNLL-2003
BertSoftmax	bert-base-uncased	bert-softmax-base-uncased	-

中文实体识别数据集的评测结果：

Arch	Backbone	Model Name	CNER	PEOPLE	Avg	QPS
BertSoftmax	bert-base-chinese	bert4ner-base-chinese	94.98	95.25	95.12	222

本项目release模型的中文匹配评测结果：

Arch	Backbone	Model Name	CNER	PEOPLE	Avg	QPS
BertSoftmax	bert-base-chinese	shibing624/bert4ner-base-chinese	94.98	95.25	95.12	222

说明：

结果值均使用F1
结果均只用该数据集的train训练，在test上评估得到的表现，没用外部数据
shibing624/bert4ner-base-chinese模型达到同级别参数量SOTA效果，是用BertSoftmax方法训练，运行examples/training_ner_model_file_demo.py代码可在各数据集复现结果
各预训练模型均可以通过transformers调用，如中文BERT模型：--model_name bert-base-chinese
中文实体识别数据集下载链接见下方
QPS的GPU测试环境是Tesla V100，显存32GB

Demo

HuggingFace Demo: https://huggingface.co/spaces/shibing624/nerpy

Install

pip3 install torch # conda install pytorch
pip3 install -U nerpy

git clone https://github.com/shibing624/nerpy.git
cd nerpy
python3 setup.py install

Usage

命名实体识别

基于中文fine-tuned model识别实体：

>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-chinese")
>>> predictions, raw_outputs, entities = model.predict(["常建良，男，1963年出生，工科学士，高级工程师"], split_on_space=False)
entities: [('常建良', 'PER'), ('1963年', 'TIME')]

example: examples/base_zh_demo.py

import sys

sys.path.append('..')
from nerpy import NERModel

if __name__ == '__main__':
    # 中文实体识别模型(BertSoftmax): shibing624/bert4ner-base-chinese
    model = NERModel("bert", "shibing624/bert4ner-base-chinese")
    sentences = [
        "常建良，男，1963年出生，工科学士，高级工程师，北京物资学院客座副教授",
        "1985年8月-1993年在国家物资局、物资部、国内贸易部金属材料流通司从事国家统配钢材中特种钢材品种的调拨分配工作，先后任科员、主任科员。"
    ]
    predictions, raw_outputs, entities = model.predict(sentences)
    print(entities)

output:

[('常建良', 'PER'), ('1963年', 'TIME'), ('北京物资学院', 'ORG')]
[('1985年', 'TIME'), ('8月', 'TIME'), ('1993年', 'TIME'), ('国家物资局', 'ORG'), ('物资部', 'ORG'), ('国内贸易部金属材料流通司', 'ORG')]

shibing624/bert4ner-base-chinese模型是BertSoftmax方法在中文PEOPLE(人民日报)数据集训练得到的，模型已经上传到huggingface的模型库shibing624/bert4ner-base-chinese，是nerpy.NERModel指定的默认模型，可以通过上面示例调用，或者如下所示用transformers库调用，模型自动下载到本机路径：~/.cache/huggingface/transformers

Usage (HuggingFace Transformers)

Without nerpy, you can use the model like this:

First, you pass your input through the transformer model, then you have to apply the bio tag to get the entity words.

example: examples/use_origin_transformers_demo.py

import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']

sentence = "王宏伟来自北京，是个警察，喜欢去王府井游玩儿。"


def get_entity(sentence):
    tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sentence)))
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())][1:-1]
    print(sentence)
    print(char_tags)

    pred_labels = [i[1] for i in char_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = sentence[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)

output:

王宏伟来自北京，是个警察，喜欢去王府井游玩儿。
[('王', 'B-PER'), ('宏', 'I-PER'), ('伟', 'I-PER'), ('来', 'O'), ('自', 'O'), ('北', 'B-LOC'), ('京', 'I-LOC'), ('，', 'O'), ('是', 'O'), ('个', 'O'), ('警', 'O'), ('察', 'O'), ('，', 'O'), ('喜', 'O'), ('欢', 'O'), ('去', 'O'), ('王', 'B-LOC'), ('府', 'I-LOC'), ('井', 'I-LOC'), ('游', 'O'), ('玩', 'O'), ('儿', 'O'), ('。', 'O')]
Sentence entity:
[('王宏伟', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]

数据集

中文实体识别数据集

数据集	语料	下载链接	文件大小
`CNER中文实体识别数据集`	CNER(12万字)	CNER github	1.1MB
`PEOPLE中文实体识别数据集`	人民日报数据集（200万字）	PEOPLE github	12.8MB

CNER中文实体识别数据集，数据格式：

美	B-LOC
国	I-LOC
的	O
华	B-PER
莱	I-PER
士	I-PER

我	O
跟	O
他	O

BertSoftmax 模型

BertSoftmax实体识别模型，基于BERT的标准序列标注方法：

Network structure:

模型文件组成：

shibing624/bert4ner-base-chinese
    ├── config.json
    ├── model_args.json
    ├── eval_result.txt
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

BertSoftmax 模型训练和预测

在中文CNER数据集训练和评估BertSoftmax模型

example: examples/training_ner_model_file_demo.py

cd examples
python3 training_ner_model_file_demo.py --do_train --do_predict --num_epochs 5

在英文CoNLL-2003数据集训练和评估BertSoftmax模型

example: examples/training_ner_model_file_demo.py

cd examples
python3 training_ner_model_file_demo.py --do_train --do_predict --num_epochs 5

Contact

Issue(建议)：
邮件我：xuming: xuming624@qq.com
微信我：加我微信号：xuming624, 备注：姓名-公司-NLP 进NLP交流群。

Citation

如果你在研究中使用了nerpy，请按如下格式引用：

APA:

Xu, M. nerpy: Named Entity Recognition Toolkit (Version 0.0.2) [Computer software]. https://github.com/shibing624/nerpy

BibTeX:

@software{Xu_nerpy_Text_to,
author = {Xu, Ming},
title = {{nerpy: Named Entity Recognition Toolkit}},
url = {https://github.com/shibing624/nerpy},
version = {0.0.2}
}

License

授权协议为 The Apache License 2.0，可免费用做商业用途。请在产品说明中附加nerpy的链接和授权协议。

Contribute

项目代码还很粗糙，如果大家对代码有所改进，欢迎提交回本项目，在提交之前，注意以下两点：

在tests添加相应的单元测试
使用python -m pytest -v来运行所有单元测试，确保所有单测都是通过的

之后即可提交PR。

Reference

transformers

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

Release history Release notifications | RSS feed

0.1.2

Jan 24, 2024

0.1.1

Sep 20, 2022

0.1.0

Sep 11, 2022

0.0.7

Sep 2, 2022

0.0.6

Jul 27, 2022

0.0.5

May 25, 2022

0.0.4

May 8, 2022

0.0.3

May 7, 2022

This version

0.0.2

May 7, 2022

0.0.1

May 6, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nerpy-0.0.2.tar.gz (37.6 kB view hashes)

Uploaded May 7, 2022 Source

Hashes for nerpy-0.0.2.tar.gz

Hashes for nerpy-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`35b970392a44d65b2a58b4a3325bbff9f022cac052e5117a1e54a5b4f55428f3`
MD5	`2619fc98c6f9eb2659a20ae4aeeccfdf`
BLAKE2b-256	`e6257082f7fea41c262818c42a807d9b99276d75839d1e25e2f843b27268b89f`