Nested Named Entity Recognition for Chinese Biomedical Text
Project description
CBio-NAMER
CBioNAMER (Nested nAMed Entity Recognition for Chinese Biomedical Text) is the method we used in CBLUE (Chinese Biomedical Language Understanding Evaluation), a benchmark for nested named entity recognition. We won 2nd prize in the benchmark as of 2021/12/07. The single-model CBioNAMER also ranks in the top 20 in CBLUE. CBioNAMER's score surpasses human performance (67.0 $F_1$).
Result
Results of our method:
Results of our single model CBioNAMER:
Approach
CBioNAMER is a sub-model of our submitted result. It is based on GlobalPointer, a powerful open-source model (thanks to its author), which we rewrote in PyTorch.
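To illustrate the span-scoring idea generally associated with GlobalPointer, the sketch below decodes entities from a per-class score matrix: every (start, end) pair with start <= end gets one score per entity class, and a span is kept when its score exceeds a threshold. Because each span is judged independently, overlapping (nested) spans can be returned together. The function name decode_spans and the toy scores are hypothetical and only for illustration; this is not the code shipped in this package.

```python
from typing import Dict, List


def decode_spans(
    scores: List[List[List[float]]],
    id2c: Dict[int, str],
    threshold: float = 0.0,
) -> List[dict]:
    """Return every span whose score exceeds the threshold.

    scores[c][i][j] is the (hypothetical) score that tokens i..j, both
    inclusive, form an entity of class c. Only pairs with i <= j are read.
    """
    entities = []
    for class_id, matrix in enumerate(scores):
        for start, row in enumerate(matrix):
            for end in range(start, len(row)):
                if row[end] > threshold:
                    entities.append(
                        {"start_idx": start, "end_idx": end, "type": id2c[class_id]}
                    )
    return entities


# Toy scores for a 3-token input and 2 classes (made-up numbers).
NEG = -10.0
toy_scores = [
    [[NEG, NEG, 2.0], [NEG, NEG, NEG], [NEG, NEG, NEG]],  # class 0: span 0..2
    [[NEG, NEG, NEG], [NEG, NEG, 1.5], [NEG, NEG, NEG]],  # class 1: span 1..2
]
toy_id2c = {0: "dis", 1: "bod"}

# The span 1..2 lies inside the span 0..2, so the two results are nested.
nested = decode_spans(toy_scores, toy_id2c)
print(nested)
```

Raising the threshold keeps fewer spans, which is the same trade-off the threshold argument of the recognition API exposes.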
Usage
First, install PyTorch>=1.7.0. There's no restriction on GPU or CUDA.
Then, install this repo as a Python package:
$ pip install CBioNAMER
The Python package transformers==4.6.1 will be installed automatically as well.
API
The CBioNAMER package provides the following methods:
CBioNAMER.load_NER(model_save_path='./checkpoint/macbert-large_dict.pth', maxlen=512, c_size=9, id2c=_id2c, c2c=_c2c)
Returns the pretrained model, downloading it if necessary. The model uses the first CUDA device if one is available and falls back to the CPU otherwise.
The model_save_path argument specifies the path of the pretrained model weights.
The maxlen argument specifies the maximum length of input sentences. Sentences longer than maxlen are truncated.
The c_size argument specifies the number of entity classes, which is 9 for CBLUE.
The id2c argument specifies the mapping from id to entity class. By default, the id2c argument for CBLUE is:
_id2c = {0: 'dis', 1: 'sym', 2: 'pro', 3: 'equ', 4: 'dru', 5: 'ite', 6: 'bod', 7: 'dep', 8: 'mic'}
The c2c argument specifies the mapping from entity class to its Chinese meaning. By default, the c2c argument for CBLUE is:
_c2c = {'dis': "疾病", 'sym': "临床表现", 'pro': "医疗程序", 'equ': "医疗设备", 'dru': "药物", 'ite': "医学检验项目", 'bod': "身体", 'dep': "科室", 'mic': "微生物类"}
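As a small illustration of how the two default mappings chain together, the sketch below resolves a class id to its class code and then to its Chinese label, and checks that the number of classes matches c_size=9. The names ID2C, C2C and describe_class are local, hypothetical copies for illustration and are not part of the package.

```python
# Local copies of the default mappings documented above.
ID2C = {0: 'dis', 1: 'sym', 2: 'pro', 3: 'equ', 4: 'dru',
        5: 'ite', 6: 'bod', 7: 'dep', 8: 'mic'}
C2C = {'dis': "疾病", 'sym': "临床表现", 'pro': "医疗程序", 'equ': "医疗设备",
       'dru': "药物", 'ite': "医学检验项目", 'bod': "身体", 'dep': "科室",
       'mic': "微生物类"}


def describe_class(class_id: int) -> tuple:
    """Map a class id to (class code, Chinese label)."""
    code = ID2C[class_id]
    return code, C2C[code]


# Every class code in ID2C needs a label in C2C, and the number of
# classes should equal the c_size value given for CBLUE (9).
missing_labels = [code for code in ID2C.values() if code not in C2C]
num_classes = len(ID2C)

print(describe_class(0), num_classes, missing_labels)
```

A custom label set would presumably need the same consistency: one id per class, one label per class code, and a matching c_size.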
The model returned by CBioNAMER.load_NER() supports the following methods:
model.recognize(text: str, threshold=0)
Given a sentence, returns a list of dictionaries, one per recognized entity. Each dictionary has the format {'start_idx': entity's starting index, 'end_idx': entity's ending index, 'type': entity class, 'Chinese_type': Chinese meaning of entity class, 'entity': recognized entity}. The threshold argument restricts the returned list to entities whose confidence score is higher than threshold.
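The returned dictionaries can be post-processed with plain Python. The sketch below works on the sample result shown in the Examples section of this document rather than calling the model, so it runs without the package installed. It assumes that end_idx is inclusive, which is inferred from that sample (indices 9 to 11 cover the three characters of '遗传病') and is not stated explicitly. The helpers span_matches and group_by_type are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List

# Sample sentence and result copied from the Examples section.
text = "该技术的应用使某些遗传病的诊治水平得到显著提高。"
entities = [
    {'start_idx': 9, 'end_idx': 11, 'type': 'dis',
     'Chinese_type': '疾病', 'entity': '遗传病'},
]


def span_matches(sentence: str, entity: dict) -> bool:
    """Check that the indices point at the reported entity text.

    Assumes end_idx is inclusive, hence the + 1 in the slice.
    """
    found = sentence[entity['start_idx']:entity['end_idx'] + 1]
    return found == entity['entity']


def group_by_type(items: List[dict]) -> Dict[str, List[str]]:
    """Collect entity strings under their class code."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item['type']].append(item['entity'])
    return dict(grouped)


all_consistent = all(span_matches(text, e) for e in entities)
by_type = group_by_type(entities)
print(all_consistent, by_type)
```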
model.predict_to_file(in_file: str, out_file: str)
Given the paths of an input and an output .json file, the model runs inference on in_file and saves the recognized entities to out_file. The output file can be submitted to CBLUE. The input file has the following format:
[
{
"text": "该技术的应用使某些遗传病的诊治水平得到显著提高。"
},
...
{
"text": "There is a sentence."
}
]
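An input file in this format can be produced with the standard json module. The sketch below is a minimal, hedged example: the helper names write_input_file and read_input_file and the file name my_input.json are hypothetical, and the file is written as UTF-8 with ensure_ascii=False so that Chinese text stays readable, which is an assumption about what the loader accepts.

```python
import json
import os
import tempfile
from typing import List


def write_input_file(sentences: List[str], path: str) -> None:
    """Write sentences as a JSON list of {"text": ...} objects."""
    records = [{"text": sentence} for sentence in sentences]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def read_input_file(path: str) -> List[str]:
    """Read the sentences back, mainly to verify the file."""
    with open(path, encoding="utf-8") as f:
        return [record["text"] for record in json.load(f)]


sentences = [
    "该技术的应用使某些遗传病的诊治水平得到显著提高。",
    "There is a sentence.",
]

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    in_file = os.path.join(tmp_dir, "my_input.json")
    write_input_file(sentences, in_file)
    round_trip = read_input_file(in_file)

print(round_trip == sentences)
```

The resulting path is the kind of value that would be passed as in_file to predict_to_file.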
Examples
# Example 1: run inference on a CBLUE-format file and save the predictions.
import CBioNAMER
NER = CBioNAMER.load_NER()
in_file = './CMeEE_test.json'
out_file = './CMeEE_test_answer.json'
NER.predict_to_file(in_file, out_file)

# Example 2: recognize the entities in a single sentence.
import CBioNAMER
NER = CBioNAMER.load_NER()
text = "该技术的应用使某些遗传病的诊治水平得到显著提高。"
recognized_entity = NER.recognize(text)
print(recognized_entity)
# output: [{'start_idx': 9, 'end_idx': 11, 'type': 'dis', 'Chinese_type': '疾病', 'entity': '遗传病'}]