Thai Nested Named Entity Recognition
Thai-NNER (Thai Nested Named Entity Recognition Corpus)
Code associated with the paper Thai Nested Named Entity Recognition Corpus at ACL 2022.
Abstract / Motivation
This work presents the first Thai Nested Named Entity Recognition (N-NER) dataset. The Thai N-NER corpus contains 264,798 mentions across 104 classes, with a maximum nesting depth of 8 layers, drawn from 4,894 documents (news articles and restaurant reviews). Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes.
How to use?
Install
pip install thai_nner
Usage
Download a model checkpoint from "data/[checkpoints]": Download
Example: 0906_214036/checkpoint.pth
Then convert it with the convert_model2use.py script:
python convert_model2use.py -i 0906_214036/checkpoint.pth -o model.pth
Usage Example
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"  # use GPU 0; set to "" to run on CPU
from thai_nner import NNER
nner = NNER("model.pth")
nner.get_tag("วันนี้วันที่ 5 เมษายน 2565 เป็นวันที่อากาศดีมาก")
# output: (['<s>', 'วันนี้', 'วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65', '', '', 'เป็น', 'วันที่', '', 'อากาศ', '', 'ดีมาก', '</s>'], [{'text': ['วันนี้'], 'span': [1, 2], 'entity_type': 'rel'}, {'text': ['วันที่', '', '', '5'], 'span': [2, 6], 'entity_type': 'day'}, {'text': ['วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65'], 'span': [2, 13], 'entity_type': 'date'}, {'text': ['', '5'], 'span': [4, 6], 'entity_type': 'cardinal'}, {'text': ['', 'เมษายน'], 'span': [7, 9], 'entity_type': 'month'}, {'text': ['', '25', '65'], 'span': [10, 13], 'entity_type': 'year'}])
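The returned value is a pair: the token list and a list of entity dictionaries whose `span` holds token indices. Judging from the sample output above, each span appears to be half-open (`[start, end)`), so `tokens[start:end]` reproduces the `text` field. Below is a minimal sketch, under that assumption, that turns an entity into a string and estimates how deeply it is nested. The helper names `entity_text` and `nesting_depth` are hypothetical and not part of thai_nner.

```python
def entity_text(tokens, entity):
    """Join the tokens covered by an entity span (assumed half-open)."""
    start, end = entity["span"]
    return "".join(tokens[start:end])


def nesting_depth(entity, entities):
    """Return 1 plus the number of other entities whose span strictly contains this one."""
    start, end = entity["span"]
    depth = 1
    for other in entities:
        o_start, o_end = other["span"]
        contains = o_start <= start and end <= o_end
        if contains and (o_start, o_end) != (start, end):
            depth += 1
    return depth


# Sample copied from the output shown above (token list cut to what the spans need).
tokens = ['<s>', 'วันนี้', 'วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65']
entities = [
    {'span': [2, 6], 'entity_type': 'day'},
    {'span': [2, 13], 'entity_type': 'date'},
    {'span': [4, 6], 'entity_type': 'cardinal'},
]

for ent in entities:
    print(ent['entity_type'], repr(entity_text(tokens, ent)), nesting_depth(ent, entities))
```

Here `cardinal` sits inside `day`, which sits inside `date`, so the three entities form a three-level nest. Empty tokens are joined as-is, so the reconstructed text contains no separators.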
Dataset and Models
Model's Checkpoint
Download and save models' checkpoints at the following path "data/[checkpoints]": Download
Dataset
Download and save the dataset at the following path "data/[scb-nner-th-2022]": Download
Pre-trained Language Model
Download and save the pre-trained language model at the following path "data/[lm]": Download
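Before training or testing, it can help to confirm that the three download locations above exist. The following is a small sketch, not part of the repository: the function name `missing_data_dirs` is hypothetical, and the directory names are copied verbatim from the paths above. Whether the square brackets are literal is an assumption, so adjust the names if your layout differs.

```python
from pathlib import Path

# Directory names copied verbatim from the text above; treating the square
# brackets as literal characters is an assumption.
EXPECTED_DIRS = ("[checkpoints]", "[scb-nner-th-2022]", "[lm]")


def missing_data_dirs(root="data", names=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [name for name in names if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing under data/:", ", ".join(missing))
    else:
        print("All expected data directories are present.")
```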
Training/Testing
Train
python train.py --device 0,1 -c config.json
Test
python test_nne.py --resume [PATH]/checkpoint.pth
Tensorboard
tensorboard --logdir [PATH]/save/log/
Results
Citation
@inproceedings{Buaphet-etal-2022-thai-nner,
title = "Thai Nested Named Entity Recognition Corpus",
author = "Buaphet, Weerayut and
Udomcharoenchaikit, Can and
Limkonchotiwat, Peerat and
Rutherford, Attapol and
Nutanong, Sarana",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
year = "2022",
publisher = "Association for Computational Linguistics",
}
License
CC-BY-SA 3.0
Acknowledgements
- Dataset information: The Thai N-NER corpus is supported in part by the Digital Economy Promotion Agency (depa) Digital Infrastructure Fund MP-62-003 and Siam Commercial Bank. This dataset is released as scb-nner-th-2022.
- Training code: Tensorflow-Project-Template by Mahmoud Gemy