Punctuation Restoration for Khmer language
Project description
Punctuation Restoration for Khmer language
Built with [xashru/punctuation-restoration] using [xlm-roberta-khmer-small] and then exported to onnxruntime
Install
pip install khmerpunctuate
# Or
pip install git+https://github.com/seanghay/khmerpunctuate.git
Usage
Supported token types are
{
0: "",
1: " ",
2: "!",
3: "។",
4: "?",
5: "៖",
6: "។\n",
7: "B-NUMBER",
8: "I-NUMBER",
9: "B-QUOTE",
10: "I-QUOTE",
}
from khmernormalizer import normalize
from khmercut import tokenize
from khmerpunctuate import punctuate
text = normalize("អយ្យការអមសាលាដំបូងរាជធានីភ្នំពេញបានព្រមានថានឹងចេញដីកាបញ្ជាឲ្យបង្ខំនិងឲ្យឃុំខ្លួនតាមនីតិវិធីប្រសិនបើលោករ៉ុងឈុនដែលបច្ចុប្បន្នជាទីប្រឹក្សាគណបក្សកម្លាំងជាតិមិនបានបង់ប្រាក់ពិន័យចំនួន២លានរៀលឲ្យបានមុនថ្ងៃទី០៤ខែមីនាឆ្នាំ២០២៤ទេនោះ")
tokens = tokenize(text)
output_text = ""
for token, punct, punct_id in punctuate(tokens):
# exclude special tokens like I-NUMBER, B-NUMBER, I-QUOTE and B-QUOTE
if punct_id < 7:
output_text += token + punct
else:
output_text += token
print(output_text)
អយ្យការអមសាលាដំបូងរាជធានីភ្នំពេញ បានព្រមានថា នឹងចេញដីកាបញ្ជាឱ្យបង្ខំ និងឱ្យឃុំខ្លួនតាមនីតិវិធី ប្រសិនបើលោក រ៉ុង ឈុន ដែលបច្ចុប្បន្នជាទីប្រឹក្សាគណបក្សកម្លាំងជាតិ មិនបានបង់ប្រាក់ពិន័យចំនួន២លានរៀល ឱ្យបានមុនថ្ងៃទី០៤ខែមីនា ឆ្នាំ២០២៤ទេនោះ
Example
The example below is available on [Google Colab]
Model file is hosted on [HuggingFace]
Evaluation
XLM RoBERTa Khmer: (49M params)
| Precision | 0.95528402 | 0.79168481 | 0.85507246 | 0.74523436 | 0.7877551 | 0.79452055 | 0.62296801 | 0.96415685 | 0.98617407 | 0.67324778 | 0.57505285 | 0.8240493 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Recall | 0.96957471 | 0.73475191 | 0.13947991 | 0.86194329 | 0.69010727 | 0.63736264 | 0.08452508 | 0.96852034 | 0.99192858 | 0.22035541 | 0.21068939 | 0.77592102 |
| F1 score | 0.96237631 | 0.76215662 | 0.2398374 | 0.79935128 | 0.73570521 | 0.70731707 | 0.14885353 | 0.96633367 | 0.98904296 | 0.33203505 | 0.30839002 | 0.79926129 |
Accuracy: 0.930086988701306
XLM RoBERTa Base (279M params)
| Metric | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.96143204 | 0.82657744 | 0.88399072 | 0.79077633 | 0.82349285 | 0.85393258 | 0.55724225 | 0.96397178 | 0.98844483 | 0.72191436 | 0.67759563 | 0.8508466 |
| Recall | 0.97304725 | 0.77059714 | 0.45035461 | 0.90182234 | 0.78963051 | 0.83516484 | 0.18804696 | 0.97943409 | 0.99381541 | 0.46300485 | 0.43222308 | 0.81077656 |
| F1 score | 0.96720478 | 0.79760625 | 0.59671104 | 0.84265665 | 0.80620627 | 0.84444444 | 0.28120013 | 0.97164142 | 0.99112284 | 0.56417323 | 0.52778435 | 0.83032843 |
| Accuracy | 0.9399183767909306 |
License
MIT
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file khmerpunctuate-0.2.0.tar.gz.
File metadata
- Download URL: khmerpunctuate-0.2.0.tar.gz
- Upload date:
- Size: 269.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
62a03b46a307eb4e649ecad5a1a6ca594e27141acb19a6f51828aab7f879fb8b
|
|
| MD5 |
603484a13d9ec070a170fb9dd4eee959
|
|
| BLAKE2b-256 |
fa20b84bd8f2d78e0e936868a9d7071d7dd029926895f19c3930fd991f3a5e47
|
File details
Details for the file khmerpunctuate-0.2.0-py3-none-any.whl.
File metadata
- Download URL: khmerpunctuate-0.2.0-py3-none-any.whl
- Upload date:
- Size: 273.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0540cfbd798b2ad5f0a5a5ad085f2668662427fb3456b7fbe238b83069139aae
|
|
| MD5 |
1711d04c31f1354eb3b94c4a3e666584
|
|
| BLAKE2b-256 |
ce2fb919419953c13b3392f47114554ab6dba71dede297d3bb599d01d8ce91d3
|