Simbolo Multilingual Partial-syllable Tokenizer

multilingual-partial-syllable-tokenizer

We would like to introduce Multilingual Partial-syllable Tokenization, a novel rule-based tokenization method that avoids breaking text into complete syllables. In our experiments it proved useful for keyword detection, where it reduces false positives, and for Burmese name recognition that combines rules with machine learning. The method is designed to align with the linguistic characteristics of each language without requiring exhaustive knowledge of that language. It is now integrated with a frequency-based approach to generate tokens.

Related Work

Many researchers have studied syllable tokenization. This section describes a few representative approaches. Dr. Ye Kyaw Thu's sylbreak tokenizer (https://github.com/ye-kyaw-thu/sylbreak) uses regular expressions to perform syllable tokenization [1]. The research article "sylbreak4all: Regular Expressions for Syllable Breaking of Nine Major Ethnic Languages of Myanmar", by Dr. Ye Kyaw Thu and other researchers, introduces a regular-expression-based syllable tokenization approach that covers nine languages [2].

A syllable tokenizer for four languages, available at https://github.com/kaunghtetsan275/pyidaungsu, combines regular expressions with rule-based techniques [3]. Maung and Mikami, in "A Rule-based Syllable Segmentation of Myanmar Text" [4], present a rule-based approach to syllable segmentation.

Multilingual Partial-syllable Tokenization

Partial Syllable RE Pattern of Tokenizer: [Maybe Preceded By][Maybe Followed By]{0 or more repetition}

Partial-syllable-level Tokenization for specified languages

  1. burmese, 2. paoh, 3. shan, 4. mon, 5. rakhine, 6. pali,
  7. Sgaw-karen, 8. pwo-karen, 9. pa'o, 10. karenni (also known as Kayah or Red Karen), 11. kayan (also known as Padaung),
  12. devanagari, 13. gurmukhi, 14. gujarati, 15. oriya, 16. tamil, 17. telugu, 18. kannada,
  19. malayalam, 20. sinhala, 21. thai, 22. lao, 23. tibetan, 24. khmer, 25. aiton, 26. phake

Word-level Tokenization for English

Character-level Tokenization for other languages
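The three tokenization levels described above can be illustrated with a minimal Python sketch. This is not the package's implementation: the function name `partial_syllable_tokenize`, the character ranges, and the simplification that a Burmese token is one base character followed by zero or more dependent marks are all assumptions made for illustration. The sketch does not model the "maybe preceded by" part of the pattern or stacked consonants.

```python
import re

# Hypothetical character classes for Burmese (Unicode block U+1000-U+109F).
# These ranges are illustrative assumptions, not the package's actual rules.
BASE = "\u1000-\u102A\u103F\u1040-\u1049"  # consonants, independent vowels, great sa, digits
FOLLOWERS = "\u102B-\u103E"                # dependent vowels, tone marks, virama, asat, medials

TOKEN_RE = re.compile(
    rf"[{BASE}][{FOLLOWERS}]*"  # partial syllable: one base character plus trailing marks
    r"|[A-Za-z]+"               # word-level token for English
    r"|\S"                      # character-level fallback for everything else
)


def partial_syllable_tokenize(text):
    """Return the list of tokens in `text`; whitespace is dropped."""
    return TOKEN_RE.findall(text)


# "မြန်မာ" is split into three partial syllables rather than two full syllables.
print(partial_syllable_tokenize("မြန်မာ NLP"))
# ['မြ', 'န်', 'မာ', 'NLP']
```

A full syllable breaker would return "မြန်" and "မာ" for the same word. In this sketch the final consonant "န်" stays a separate token, which is the sense in which the tokens are partial syllables.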

How to use

Bibtex

@article{SaPhyoThuHtet,
  title={multilingual-partial-syllable-tokenizer},
  author={Sa Phyo Thu Htet},
  journal={https://github.com/SaPhyoThuHtet/multilingual-partial-syllable-tokenizer},
  year={2019-2024}
}

Acknowledgment Statement from Sa Phyo Thu Htet

I would like to thank Dr. Ye Kyaw Thu, Dr. Hnin Aye Thant, Ma Aye Hninn Khine, and Ma Yi Yi Chan Myae Win Shein for their guidance, support, and suggestions. The skills I acquired in Dr. Ye Kyaw Thu's NLP Class helped me greatly in developing new ideas in the NLP field and in building this repo. I also thank the creators of Rabbit Converter and jrgraphix.net's Unicode Character Table. These tools were very helpful for developing NLP concepts, especially for the Burmese language.

Acknowledgment

We would like to thank everyone who has contributed to the fields of NLP and Myanmar NLP. We also thank Simbolo Servicio, a branch of Simbolo, for its financial support.

References

[1] Ye Kyaw Thu, sylbreak, https://github.com/ye-kyaw-thu/sylbreak
[2] Y. K. Thu et al., "sylbreak4all: Regular Expressions for Syllable Breaking of Nine Major Ethnic Languages of Myanmar," 2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2021, pp. 1-6, doi: 10.1109/iSAI-NLP54397.2021.9678188.
[3] Kaung Htet San, Pyidaungsu, https://github.com/kaunghtetsan275/pyidaungsu
[4] Maung, Zin & Mikami, Yoshiki. (2008). A Rule-based Syllable Segmentation of Myanmar Text.
[5] Ye Kyaw Thu, NLP Class UTYCC, https://github.com/ye-kyaw-thu/NLP-Class
[6] Unicode Character Table, https://jrgraphix.net/r/Unicode/1000-109F
[7] Rabbit Converter, http://www.rabbit-converter.org/
