Punctuation Restoration for Khmer language

Built with [xashru/punctuation-restoration] using [xlm-roberta-khmer-small], then exported to ONNX Runtime.

Install

pip install khmerpunctuate

# Or
pip install git+https://github.com/seanghay/khmerpunctuate.git

Usage

The supported token types, keyed by label id, are:

{
  0: "",
  1: " ",
  2: "!",
  3: "។",
  4: "?",
  5: "៖",
  6: "។\n",
  7: "B-NUMBER",
  8: "I-NUMBER",
  9: "B-QUOTE",
  10: "I-QUOTE",
}
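
The label table can be mirrored in code to tell literal punctuation (ids 0 to 6) apart from the tag labels (ids 7 to 10). The sketch below is only an illustration: it assumes the `B-`/`I-` tags follow the usual BIO convention (`B-` opens a span, `I-` continues it) and that a tag describes the token it is attached to, neither of which the README states. `LABELS`, `is_punctuation`, and `group_spans` are hypothetical names, not part of `khmerpunctuate`.

```python
# Label table copied from the README.
LABELS = {
    0: "",
    1: " ",
    2: "!",
    3: "។",
    4: "?",
    5: "៖",
    6: "។\n",
    7: "B-NUMBER",
    8: "I-NUMBER",
    9: "B-QUOTE",
    10: "I-QUOTE",
}


def is_punctuation(label_id):
    """True for ids 0-6, whose label is literal text to append after a token."""
    return label_id < 7


def group_spans(tokens, label_ids, kind):
    """Collect runs of tokens tagged B-<kind> followed by I-<kind>.

    Assumption (not confirmed by the README): tags follow the BIO scheme.
    """
    begin, inside = f"B-{kind}", f"I-{kind}"
    spans, current = [], []
    for token, label_id in zip(tokens, label_ids):
        label = LABELS[label_id]
        if label == begin:
            # A new span starts; close any span that was still open.
            if current:
                spans.append("".join(current))
            current = [token]
        elif label == inside and current:
            current.append(token)
        else:
            # Any other label ends the open span, if there is one.
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


# Made-up tokens and label ids, for illustration only (not model output).
tokens = ["ចំនួន", "២", "លាន", "រៀល"]
label_ids = [0, 7, 8, 1]
print(group_spans(tokens, label_ids, "NUMBER"))
```

With real model output, the label ids would come from the third element of each triple yielded by `punctuate`.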

Restoring punctuation for a normalized, tokenized sentence:

from khmernormalizer import normalize
from khmercut import tokenize
from khmerpunctuate import punctuate

text = normalize("អយ្យការអមសាលាដំបូងរាជធានីភ្នំពេញបានព្រមានថានឹងចេញដីកាបញ្ជាឲ្យបង្ខំនិងឲ្យឃុំខ្លួនតាមនីតិវិធីប្រសិនបើលោករ៉ុងឈុនដែលបច្ចុប្បន្នជាទីប្រឹក្សាគណបក្សកម្លាំងជាតិមិនបានបង់ប្រាក់ពិន័យចំនួន២លានរៀលឲ្យបានមុនថ្ងៃទី០៤ខែមីនាឆ្នាំ២០២៤ទេនោះ")
tokens = tokenize(text)

output_text = ""
for token, punct, punct_id in punctuate(tokens):
  # exclude special tokens like I-NUMBER, B-NUMBER, I-QUOTE and B-QUOTE
  if punct_id < 7:
    output_text += token + punct
  else:
    output_text += token

print(output_text)

Output:
អយ្យការអមសាលាដំបូងរាជធានីភ្នំពេញ បានព្រមានថា នឹងចេញដីកាបញ្ជាឱ្យបង្ខំ និងឱ្យឃុំខ្លួនតាមនីតិវិធី ប្រសិនបើលោក រ៉ុង ឈុន ដែលបច្ចុប្បន្នជាទីប្រឹក្សាគណបក្សកម្លាំងជាតិ មិនបានបង់ប្រាក់ពិន័យចំនួន២លានរៀល ឱ្យបានមុនថ្ងៃទី០៤ខែមីនា ឆ្នាំ២០២៤ទេនោះ 
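
The joining loop above can be wrapped in a small reusable helper. This is a minimal sketch, assuming only what the usage code shows: that `punctuate` yields `(token, punct, punct_id)` triples and that ids from 7 upward are tag labels rather than text. The name `join_with_punctuation` and the `first_tag_id` parameter are hypothetical, and the sample triples are hand-written stand-ins, not real model output.

```python
def join_with_punctuation(triples, first_tag_id=7):
    """Rebuild text from (token, punct, punct_id) triples.

    Labels with an id below first_tag_id are literal punctuation and are
    appended after the token; tag labels such as B-NUMBER are skipped.
    """
    parts = []
    for token, punct, punct_id in triples:
        parts.append(token)
        if punct_id < first_tag_id:
            parts.append(punct)
    return "".join(parts)


# Hand-written triples standing in for the output of punctuate(tokens).
sample = [("សួស្តី", " ", 1), ("២", "B-NUMBER", 7), ("នាក់", "។", 3)]
print(join_with_punctuation(sample))
```

With the package installed, `join_with_punctuation(punctuate(tokens))` should produce the same string as the loop in the usage example. Collecting parts in a list and joining once avoids repeated string concatenation on long inputs.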

Example

A runnable example is available on [Google Colab].

The model file is hosted on [HuggingFace].

Evaluation

XLM RoBERTa Khmer (49M params)

Metric 1 2 3 4 5 6 7 8 9 10 11 12
Precision 0.95528402 0.79168481 0.85507246 0.74523436 0.7877551 0.79452055 0.62296801 0.96415685 0.98617407 0.67324778 0.57505285 0.8240493
Recall 0.96957471 0.73475191 0.13947991 0.86194329 0.69010727 0.63736264 0.08452508 0.96852034 0.99192858 0.22035541 0.21068939 0.77592102
F1 score 0.96237631 0.76215662 0.2398374 0.79935128 0.73570521 0.70731707 0.14885353 0.96633367 0.98904296 0.33203505 0.30839002 0.79926129
Accuracy 0.930086988701306


XLM RoBERTa Base (279M params)

Metric 1 2 3 4 5 6 7 8 9 10 11 12
Precision 0.96143204 0.82657744 0.88399072 0.79077633 0.82349285 0.85393258 0.55724225 0.96397178 0.98844483 0.72191436 0.67759563 0.8508466
Recall 0.97304725 0.77059714 0.45035461 0.90182234 0.78963051 0.83516484 0.18804696 0.97943409 0.99381541 0.46300485 0.43222308 0.81077656
F1 score 0.96720478 0.79760625 0.59671104 0.84265665 0.80620627 0.84444444 0.28120013 0.97164142 0.99112284 0.56417323 0.52778435 0.83032843
Accuracy 0.9399183767909306
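
The F1 rows can be cross-checked against the precision and recall rows. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall. The README does not say how the scores were computed, but the reported values agree with that formula. `f1_score` is a hypothetical helper, and the numbers are the first and third values of each row in the XLM RoBERTa Khmer results.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the XLM RoBERTa Khmer results.
reported = [
    (0.95528402, 0.96957471, 0.96237631),
    (0.85507246, 0.13947991, 0.2398374),
]

for precision, recall, f1 in reported:
    print(f"computed {f1_score(precision, recall):.8f}, reported {f1}")
```

The second triple shows why F1 is worth reporting next to precision: a precision of about 0.86 paired with a recall of about 0.14 still gives an F1 of only about 0.24.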

License

MIT
