HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation

HoogBERTa

This repository includes HoogBERTa_base, a Thai pretrained language representation, and the fine-tuned model for multi-task sequence labeling.

Installation

$ python setup.py install
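
Since the package is published on PyPI, it can alternatively be installed with pip:

$ pip install hoogberta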

To download the pretrained models, use

>>> import hoogberta
>>> hoogberta.download() # or hoogberta.download('/home/user/.hoogberta/')

Usage

See test.py in the repository for a usage example.

Documentation

To annotate POS, NE, and clause boundaries, use the following commands:

from hoogberta.multitagger import HoogBERTaMuliTaskTagger
tagger = HoogBERTaMuliTaskTagger(cuda=False) # or cuda=True
output = tagger.nlp("วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ")

Pass the base_path parameter if you have moved the "models" directory to a location other than the current one, for example:

tagger = HoogBERTaMuliTaskTagger(cuda=False, base_path="/home/user/.hoogberta/")

The output is a list of annotations (token, POS, NE, MARK). MARK annotates a single white space as either PUNC (not a clause boundary) or MARK (a clause boundary). Note that for clause boundary classification, the current pretrained model works best on inputs containing two clauses. If you want a more precise result, we recommend running tagger.nlp iteratively, as in the sketch below.
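
For example, a minimal sketch (not part of the library) of running tagger.nlp iteratively; the white-space chunking below is an illustrative assumption, and your data may need a different segmentation:

from hoogberta.multitagger import HoogBERTaMuliTaskTagger

tagger = HoogBERTaMuliTaskTagger(cuda=False)
text = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"

# Tag two white-space-separated chunks per call, so each input stays
# close to the two-clause case the pretrained model handles well.
chunks = text.split(" ")
annotations = []
for i in range(0, len(chunks), 2):
    annotations.extend(tagger.nlp(" ".join(chunks[i:i + 2])))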

To extract token features based on the RoBERTa architecture, use the following commands:

from hoogberta.encoder import HoogBERTaEncoder
encoder = HoogBERTaEncoder(cuda=False) # or cuda=True
token_ids, features = encoder.extract_features("วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ")

For batch processing,

inputText = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
token_ids, features = encoder.extract_features_batch(inputText)
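
As a follow-up sketch (not part of the library), the per-token features can be mean-pooled into one fixed-size vector per sentence. The tensor shape (batch x sequence length x hidden size) and the hidden size of 768 are assumptions based on the RoBERTa-base architecture; padding positions are ignored here for brevity:

# Assumed shape of features: (batch, seq_len, hidden_size)
sentence_vectors = features.mean(dim=1)  # one vector per input sentence
print(sentence_vectors.shape)            # expected: torch.Size([2, 768]) under the assumptions above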

To use HoogBERTa as an embedding layer, use

tokens, features = encoder.extract_features_from_tensor(token_ids) # where token_ids is a tensor with type "long".
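
For instance, here is a minimal sketch of a downstream per-token classifier that treats the encoder as a frozen embedding layer. The class name, the number of labels, and the hidden size of 768 (RoBERTa-base) are illustrative assumptions, not part of the library:

import torch
import torch.nn as nn
from hoogberta.encoder import HoogBERTaEncoder

class TokenTagger(nn.Module):
    """Hypothetical head on top of frozen HoogBERTa token features."""
    def __init__(self, encoder, hidden_size=768, num_labels=5):
        super().__init__()
        self.encoder = encoder                          # a HoogBERTaEncoder instance
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, token_ids):                       # token_ids: long tensor of subword ids
        with torch.no_grad():                           # keep the pretrained encoder frozen
            tokens, features = self.encoder.extract_features_from_tensor(token_ids)
        return self.head(features)                      # per-token label logits

model = TokenTagger(HoogBERTaEncoder(cuda=False))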

Citation

Please cite as:

@inproceedings{porkaew2021hoogberta,
  title     = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author    = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year      = {2021},
  address   = {Online}
}
