Python library for Pyidaungsu Myanmar languages

Pyidaungsu

A Python library for the Myanmar language, useful for natural language processing (NLP) and text preprocessing.

Installation

pip install pyidaungsu

Usage

Language detection (Myanmar <Zawgyi, Unicode>, Karen, Mon, Shan)

Starting with pyidaungsu 0.0.9, detection covers not only the Zawgyi and Unicode encodings of Myanmar (Burmese) text but also the Mon, Karen, and Shan languages.

import pyidaungsu as pds

# language detection
pds.detect("ထမင်းစားပြီးပြီလား")
>> "mm_uni"
pds.detect("ထမင္းစားၿပီးၿပီလား")
>> "mm_zg"
pds.detect("တၢ်သိၣ်လိတၢ်ဖးလံာ် ကွဲးလံာ်အိၣ်လၢ မ့ရ့ၣ်အစုပူၤလီၤ.")
>> "karen"
pds.detect("ဇၟာပ်မၞိဟ်ဂှ် ကတဵုဒှ်ကၠုင် ပ္ဍဲကဵုဂကောံမွဲ ဖအိုတ်ရ၊၊")
>> "mon"
pds.detect("ၼႂ်းဢိူင်ႇမိူင်းၽူင်း ၸႄႈဝဵင်းတႃႈၶီႈလဵၵ်း ၾႆးမႆႈႁိူၼ်း ၵူၼ်းဝၢၼ်ႈ လင်ၼိုင်ႈ")
>> "shan"
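
The labels returned above can be used to sort a mixed corpus before further processing. The following is a minimal sketch, not part of pyidaungsu: `group_by_language` is a hypothetical helper, and the detector is passed in as a callable so the helper only assumes a function that maps a string to a label such as "mm_uni" or "shan".

```python
from collections import defaultdict


def group_by_language(texts, detect):
    """Bucket each text under the label that `detect` returns for it.

    `detect` is any callable taking a string and returning a label.
    This helper is hypothetical and not part of pyidaungsu.
    """
    groups = defaultdict(list)
    for text in texts:
        groups[detect(text)].append(text)
    # Return a plain dict so missing labels raise KeyError instead of
    # silently creating empty lists.
    return dict(groups)
```

With the library installed, this would presumably be called as `group_by_language(texts, pds.detect)`.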

Zawgyi-Unicode conversion

# convert to zawgyi
pds.cvt2zgi("ထမင်းစားပြီးပြီလား")
>> "ထမင္းစားၿပီးၿပီလား"

# convert to unicode
pds.cvt2uni("ထမင္းစားၿပီးၿပီလား")
>> "ထမင်းစားပြီးပြီလား"
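
Detection and conversion can be combined so that only Zawgyi input is converted and everything else passes through untouched. The sketch below is an illustration under assumptions, not the library's own API: `ensure_unicode` is a hypothetical name, and the detector and converter are injected as callables. The only detail taken from the examples above is the "mm_zg" label.

```python
def ensure_unicode(text, detect, convert):
    """Return `text` in Unicode, converting only when it is labelled Zawgyi.

    `detect` maps a string to a label; `convert` maps Zawgyi text to Unicode.
    Both are supplied by the caller; this helper is hypothetical.
    """
    if detect(text) == "mm_zg":
        return convert(text)
    # Unicode Burmese and the other detected languages are left as they are.
    return text
```

With the library installed, the call would presumably look like `ensure_unicode(text, pds.detect, pds.cvt2uni)`.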

Tokenization

# syllable-level tokenization for Burmese
pds.tokenize("Alan TuringကိုArtificial Intelligenceနဲ့Computerတွေရဲ့ဖခင်ဆိုပြီးလူသိများပါတယ်") # lang defaults to 'mm'
>> ['Alan', 'Turing', 'ကို', 'Artificial', 'Intelligence', 'နဲ့', 'Computer', 'တွေ', 'ရဲ့', 'ဖ', 'ခင်', 'ဆို', 'ပြီး', 'လူ', 'သိ', 'များ', 'ပါ', 'တယ်']

# syllable level tokenization for Karen
pds.tokenize("ကၠိသရၣ်,သရၣ်မုၣ် ခဲလၢာ်ဟးထီၣ် (၃၅) ဂၤန့ၣ်လီၤ.", lang="karen")
>> ['ကၠိ', 'သ', 'ရၣ်', ',', 'သ', 'ရၣ်', 'မုၣ်', 'ခဲ', 'လၢာ်', 'ဟး', 'ထီၣ်', '(', '၃၅', ')', 'ဂၤ', 'န့ၣ်', 'လီၤ', '.']

# word level tokenization
pds.tokenize("ဖေဖေနဲ့မေမေ၏ကျေးဇူးတရားမှာကြီးမားလှပေသည်", form="word")
>> ['ဖေဖေ', 'နဲ့', 'မေမေ', '၏', 'ကျေးဇူးတရား', 'မှာ', 'ကြီးမား', 'လှ', 'ပေ', 'သည်']

Syllable-level tokenization supports four languages (Burmese, Karen, Shan, Mon). Word-level tokenization currently supports only Burmese.
Available values for the lang parameter of the tokenize function: "mm", "karen", "mon", "shan"
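
Because the detection labels and the lang values are close but not identical, a small lookup can pick the tokenizer language automatically. This is a sketch under stated assumptions: `LABEL_TO_LANG` and `tokenize_detected` are hypothetical names, the mapping from "mm_uni" to "mm" is inferred from the examples above rather than documented, and Zawgyi text is rejected here on the assumption that it should be converted to Unicode first.

```python
# Assumed mapping from detect() labels to tokenize() lang values.
# "mm_zg" is deliberately absent: convert Zawgyi to Unicode beforehand.
LABEL_TO_LANG = {"mm_uni": "mm", "karen": "karen", "mon": "mon", "shan": "shan"}


def tokenize_detected(text, detect, tokenize):
    """Syllable-tokenize `text` using the language that `detect` reports.

    `detect` and `tokenize` are caller-supplied callables; `tokenize` must
    accept a `lang` keyword argument. This helper is hypothetical.
    """
    label = detect(text)
    if label not in LABEL_TO_LANG:
        raise ValueError(f"no syllable tokenizer mapped for label {label!r}")
    return tokenize(text, lang=LABEL_TO_LANG[label])
```

With the library installed, this would presumably be called as `tokenize_detected(text, pds.detect, pds.tokenize)`.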

Future work

  • Add tokenizer for Burmese (syllable- and word-level tokenization)
  • Add more tokenizers (BPE, WordPiece, etc.)
  • Add Part-of-Speech (POS) tagger for Burmese
  • Add Named-Entity Recognition (NER) classifier for Burmese
  • Add thorough documentation

Download files

Source Distribution

pyidaungsu-0.1.4.tar.gz (5.5 MB)

Uploaded Source

Built Distribution

pyidaungsu-0.1.4-py3-none-any.whl (5.5 MB)

Uploaded Python 3

File details

Details for the file pyidaungsu-0.1.4.tar.gz.

File metadata

  • Download URL: pyidaungsu-0.1.4.tar.gz
  • Size: 5.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.10

File hashes

Hashes for pyidaungsu-0.1.4.tar.gz
Algorithm Hash digest
SHA256 15b91d0cbfee85c30aa71fda02b968c6a83fff3558fb38ea9c5c31ce9e3d0c7d
MD5 8c28af42c1a828407d9c0eb255b5ea9c
BLAKE2b-256 f71db720f0f4e84e923751b13a7a42657eb800be5bc5ef830be6812c7af47a80

File details

Details for the file pyidaungsu-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: pyidaungsu-0.1.4-py3-none-any.whl
  • Size: 5.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.10

File hashes

Hashes for pyidaungsu-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 f9f07912ecf33bfadc5a8b3265e0a6757221047476745cdd4a7f66db03aeef9a
MD5 f80637cb72af895d40e5b78b7388d64e
BLAKE2b-256 cda9596a86adb1d388f0748f5a982daacbe2430e89a2caf7fccfa1ac767011f0
