
Thai Nested Named Entity Recognition

Thai-NNER (Thai Nested Named Entity Recognition Corpus)

Code associated with the paper "Thai Nested Named Entity Recognition Corpus" (Findings of ACL 2022).

Abstract / Motivation

This work presents the first Thai Nested Named Entity Recognition (N-NER) dataset. The corpus contains 264,798 mentions across 104 classes, with a maximum nesting depth of 8 layers, drawn from 4,894 documents (news articles and restaurant reviews). To the best of our knowledge, it is the largest non-English N-NER dataset and the first non-English one with fine-grained classes.

How to use?

Install

pip install thai_nner

Usage

You need to download a model checkpoint from "data/[checkpoints]": Download

Example: 0906_214036/checkpoint.pth

Then convert it with the convert_model2use.py script:

python convert_model2use.py -i 0906_214036/checkpoint.pth -o model.pth

Usage Example

import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"  # use GPU 0; to run without a GPU, set this to "" instead
from thai_nner import NNER
nner = NNER("model.pth")
nner.get_tag("วันนี้วันที่ 5 เมษายน 2565 เป็นวันที่อากาศดีมาก")
# output: (['<s>', 'วันนี้', 'วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65', '', '', 'เป็น', 'วันที่', '', 'อากาศ', '', 'ดีมาก', '</s>'], [{'text': ['วันนี้'], 'span': [1, 2], 'entity_type': 'rel'}, {'text': ['วันที่', '', '', '5'], 'span': [2, 6], 'entity_type': 'day'}, {'text': ['วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65'], 'span': [2, 13], 'entity_type': 'date'}, {'text': ['', '5'], 'span': [4, 6], 'entity_type': 'cardinal'}, {'text': ['', 'เมษายน'], 'span': [7, 9], 'entity_type': 'month'}, {'text': ['', '25', '65'], 'span': [10, 13], 'entity_type': 'year'}])
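
The result above is a (tokens, entities) pair, and the nested entities are easier to inspect after a little post-processing. The following is a minimal sketch that works only on the sample output shown above, so it does not need the model. The helper names surface and nesting_depth are hypothetical and are not part of thai_nner. Reading span as a half-open [start, end) range into the token list is an assumption, which the code checks against this one sample.

```python
from collections import defaultdict

# Result copied from the usage example above: (tokens, entities).
tokens = ['<s>', 'วันนี้', 'วันที่', '', '', '5', '', '', 'เมษายน', '', '',
          '25', '65', '', '', 'เป็น', 'วันที่', '', 'อากาศ', '', 'ดีมาก', '</s>']
entities = [
    {'text': ['วันนี้'], 'span': [1, 2], 'entity_type': 'rel'},
    {'text': ['วันที่', '', '', '5'], 'span': [2, 6], 'entity_type': 'day'},
    {'text': ['วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65'],
     'span': [2, 13], 'entity_type': 'date'},
    {'text': ['', '5'], 'span': [4, 6], 'entity_type': 'cardinal'},
    {'text': ['', 'เมษายน'], 'span': [7, 9], 'entity_type': 'month'},
    {'text': ['', '25', '65'], 'span': [10, 13], 'entity_type': 'year'},
]


def surface(entity):
    """Join an entity's token pieces into one string (empty pieces vanish)."""
    return "".join(entity["text"])


def nesting_depth(entity, all_entities):
    """Return 1 plus the number of other entities whose span encloses this one."""
    start, end = entity["span"]
    enclosing = [
        other for other in all_entities
        if other["span"] != entity["span"]
        and other["span"][0] <= start and end <= other["span"][1]
    ]
    return 1 + len(enclosing)


# Assumption check: in this sample, span is a half-open [start, end) token range.
assert all(tokens[e["span"][0]:e["span"][1]] == e["text"] for e in entities)

# Group the surface strings by entity type.
by_type = defaultdict(list)
for ent in entities:
    by_type[ent["entity_type"]].append(surface(ent))

# Print each entity with its type, joined text, and nesting depth.
for ent in entities:
    print(ent["entity_type"], surface(ent), nesting_depth(ent, entities))
```

In this sample, cardinal lies inside day, which lies inside date, so cardinal comes out at depth 3. Joining the pieces drops the empty tokens, so the original whitespace is not restored.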

Example

Python library

Colabs

Test

Colabs

Dataset and Models

Model's Checkpoint

Download and save models' checkpoints at the following path "data/[checkpoints]": Download

Dataset

Download and save the dataset at the following path "data/[scb-nner-th-2022]": Download

Pre-trained Language Model

Download and save the pre-trained language model at the following path "data/[lm]": Download
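
Before training or testing, it can help to confirm that the three downloads landed where the instructions above expect them. The following is a minimal sketch, not part of the repository. It assumes the bracketed names in the paths (for example "data/[checkpoints]") correspond to plain directory names under data/. The function name missing_data_dirs is hypothetical.

```python
from pathlib import Path

# Assumed directory names: the README writes them in brackets
# (e.g. "data/[checkpoints]"); the brackets are treated here as decoration.
EXPECTED_DIRS = ("checkpoints", "scb-nner-th-2022", "lm")


def missing_data_dirs(root="data"):
    """Return the expected sub-directories that are absent or empty under root."""
    root = Path(root)
    missing = []
    for name in EXPECTED_DIRS:
        path = root / name
        # A directory counts as present only if it exists and holds at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_data_dirs():
        print(f"missing or empty: data/{name}")
```

If the script prints nothing, all three directories exist and are non-empty. It does not verify that the files inside them are the right ones.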

Training/Testing

Train

python train.py --device 0,1 -c config.json

Test

python test_nne.py --resume [PATH]/checkpoint.pth

Tensorboard

tensorboard --logdir [PATH]/save/log/

Results

Experimental results

Citation

@inproceedings{Buaphet-etal-2022-thai-nner,
    title = "Thai Nested Named Entity Recognition Corpus",
    author = "Buaphet, Weerayut  and
      Udomcharoenchaikit, Can  and
      Limkonchotiwat, Peerat and
      Rutherford, Attapol  and 
      Nutanong, Sarana",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    year = "2022",
    publisher = "Association for Computational Linguistics",
}

License

CC-BY-SA 3.0

Acknowledgements

  • Dataset information: The Thai N-NER corpus is supported in part by the Digital Economy Promotion Agency (depa) Digital Infrastructure Fund MP-62-003 and Siam Commercial Bank. This dataset is released as scb-nner-th-2022.
  • Training code: Tensorflow-Project-Template by Mahmoud Gemy

