Python utilities for Myanmar language processing
PyMyaNLP
Installation
pip install pymyanlp
Sentiment Analyzer
Agglutinative Nature
TODO
- Stop word detection and removal
- Manually create sentiment lexicon
- Write documentation in Burmese
Sentiment Lexicon
Rules
- No repeated words
- Words must be roots (i.e. unsegmentable)
- or words must be direct pairs of segmented roots (သတ်၊ ဖြတ်)
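The rules above could be checked mechanically. The following is a minimal sketch, not part of the package: `lexicon_violations` and the `roots` set are hypothetical names, and it assumes "no repeated words" means no duplicate entries and that a set of known roots is available.

```python
def lexicon_violations(entries, roots):
    """Return (entry, reason) pairs for entries that break the lexicon rules.

    entries: candidate lexicon words, in order.
    roots:   set of known unsegmentable roots (assumed to exist already).
    """
    seen = set()
    problems = []
    for entry in entries:
        # Rule 1: no repeated words (interpreted here as duplicate entries).
        if entry in seen:
            problems.append((entry, "repeated"))
        seen.add(entry)
        # Rules 2 and 3: the entry is a root, or splits into exactly two roots.
        is_root = entry in roots
        is_pair = any(
            entry[:i] in roots and entry[i:] in roots
            for i in range(1, len(entry))
        )
        if not (is_root or is_pair):
            problems.append((entry, "not a root or a pair of roots"))
    return problems
```

An empty result means every entry passed; the reasons are plain strings so they can be printed while the lexicon is built by hand.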
Burmese Phonology
The syllable structure of Burmese is C(G)V((V)C), which is to say the onset consists of a consonant optionally followed by a glide, and the rime consists of a monophthong alone, a monophthong with a consonant, or a diphthong with a consonant. The only consonants that can stand in the coda are /ʔ/ and /ɴ/. Some representative words are:
- CV မယ် /mɛ̀/ 'miss'
- CVC မက် /mɛʔ/ 'crave'
- CGV မြေ /mjè/ 'earth'
- CGVC မျက် /mjɛʔ/ 'eye'
- CVVC မောင် /màʊɰ̃/ (term of address for young men)
- CGVVC မြောင်း /mjáʊɰ̃/ 'ditch'
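As a rough illustration of how this structure shows up in the script, the sketch below splits Unicode Burmese text into orthographic syllables with a single regular expression. It is a heuristic under stated assumptions, not the segmenter this package uses: `split_syllables` is a hypothetical name, the rule covers only the basic consonants U+1000 to U+1021, and it works on spelling rather than pronunciation.

```python
import re

# A syllable starts at a consonant (U+1000-U+1021) that is not stacked under
# a virama (U+1039) and is not itself closed by asat (U+103A) or a virama.
_SYLLABLE_START = re.compile(r"(?<!\u1039)[\u1000-\u1021](?![\u103A\u1039])")


def split_syllables(text):
    """Split Unicode (not Zawgyi) Burmese text into orthographic syllables."""
    # Mark each syllable start with a NUL character, then split on it.
    marked = _SYLLABLE_START.sub(lambda m: "\x00" + m.group(0), text)
    return [s for s in marked.split("\x00") if s]
```

Each representative word listed above should come back as a single syllable, because the final consonant carries an asat and so is treated as a coda rather than as a new onset. Independent vowels, digits and punctuation are not handled.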
A minor syllable has some restrictions:
- It contains /ə/ as its only vowel
- It must be an open syllable (no coda consonant)
- It cannot bear tone
- It has only a simple (C) onset (no glide after the consonant)
- It must not be the final syllable of the word
Some examples of words containing minor syllables:
- ခလုတ် /kʰə.loʊʔ/ 'knob/switch'
- ပလွေ /pə.lwè/ 'flute'
- သရော် /θə.jɔ̀/ 'mock'
- ကလက် /kə.lɛʔ/ 'be wanton/be frivolous'
- ထမင်းရည် /tʰə.mə.jè/ '(cooked)rice-water'
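The restrictions on minor syllables can be expressed as checks on a syllabified IPA transcription. This is a minimal sketch under several assumptions: syllables are separated by `.`, tone is written with a combining grave, acute or tilde-below mark, glides are written `j` or `w`, and `minor_syllable_violations` is a hypothetical helper rather than anything provided by the package.

```python
import unicodedata

# Combining marks assumed to encode tone: grave, acute, tilde below.
TONE_MARKS = {"\u0300", "\u0301", "\u0330"}
GLIDES = {"j", "w"}


def minor_syllable_violations(ipa_word):
    """Return (syllable, reason) pairs for minor syllables that break a rule."""
    # NFD separates base letters from combining tone marks.
    decomposed = unicodedata.normalize("NFD", ipa_word.strip("/"))
    syllables = decomposed.split(".")
    problems = []
    for i, syl in enumerate(syllables):
        if "ə" not in syl:
            continue  # not a minor syllable
        onset, _, rest = syl.partition("ə")
        rest = "".join(ch for ch in rest if ch not in TONE_MARKS)
        if rest:
            problems.append((syl, "not open, or has more than one vowel"))
        if any(ch in TONE_MARKS for ch in syl):
            problems.append((syl, "bears tone"))
        if len(onset) > 1 and onset[-1] in GLIDES:
            problems.append((syl, "onset contains a glide"))
        if i == len(syllables) - 1:
            problems.append((syl, "is the final syllable"))
    return problems
```

The example transcriptions listed above should all produce an empty list, since each schwa syllable is open, toneless, glide-free and non-final.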
Preprocessing
I have cloned the dependency libraries into the ./lib folder for ease
of access and crawling. In the future, we should keep only the files
we need and organize them better.
Tokenization / Word Segmentation
Conditional Random Fields
Part of Speech Tagging
myWord by YeThK is used for part-of-speech (POS) tagging; it provides the annotated corpus and lexicon.
In the future we should train a spaCy pipeline on the myPOS v3 data, but for now we use an available RDRPOSTagger.
The POS tagger fails to identify ရန်ဖြစ်/v properly in most cases.
Some words may have completely different forms in the two systems (formal and colloquial Burmese), and others will vary in pronunciation, tone, vowel length, etc.
J. Watkins defines the following types of Burmese:
- OB Old Burmese: the language of the 11th-13th century inscriptions
- WB written Burmese: the orthographical form of the modern language
- CB colloquial Burmese
- MB modern Burmese = colloquial Burmese
- FB formal Burmese
Spelling Checker
Stopword Removal
Use Cases
Summary Keyword Extraction
Model: Modified TF-IDF Keyword Ranking
- Tokenize
- Tag POS
- Extract Verbs, Adjectives, Nouns and Adverbs
- Generate TF-IDF score on the widespread corpus
- Penalize scores
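The steps above can be sketched in plain Python. This is an illustration under assumptions, not the package's implementation: the input is already tokenized and POS-tagged, the tag names `adj` and `adv` are guesses alongside the `n` and `v` tags used in this README, the IDF is a common smoothed variant, and because the penalty is not specified here it is left as an optional caller-supplied function.

```python
import math
from collections import Counter

# Assumed tag names for nouns, verbs, adjectives and adverbs.
KEEP_TAGS = {"n", "v", "adj", "adv"}


def keyword_scores(tagged_doc, corpus, penalty=None):
    """Rank candidate keywords of one document by penalized TF-IDF.

    tagged_doc: list of (word, tag) pairs for the document to summarize.
    corpus:     list of token lists forming the background corpus.
    penalty:    optional function word -> multiplier (hypothetical hook).
    """
    # Keep only content words.
    candidates = [word for word, tag in tagged_doc if tag in KEEP_TAGS]
    if not candidates:
        return []
    tf = Counter(candidates)
    doc_sets = [set(doc) for doc in corpus]
    n_docs = len(doc_sets)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in doc_sets if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        score = (count / len(candidates)) * idf
        if penalty is not None:
            score *= penalty(word)
        scores[word] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Words that are frequent in the document but rare in the background corpus rank first. The penalty hook is where rules such as down-weighting very common function-like words could go.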
Sentiment Analysis
Sentiment Lexicon
Building up the sentiment lexicon is largely guesswork.
Sentiment Word Extraction
Due to the nature of Burmese, non-reducing and reducing compound words can be ambiguous in where they are split into words. This case should be considered.
- Noun-verb: အကျိုး/n ပျက်စီး/v
- Verb-verb: ခိုး/v ယူ/v
This could be avoided with a sufficiently powerful POS tagger so that we are not just looking at the word, we are looking at the part of speech as well.
E.g. ညာမပြောနဲ့ကွာ။ Both ညာ/v (lie) and ညာ/n (right) exist.
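One way to use the part of speech during lookup is to key the lexicon on (word, tag) pairs instead of bare words. The sketch below assumes that layout; the lexicon contents and polarity values are illustrative only, and `score_tokens` is a hypothetical name.

```python
# Illustrative entry only: the verb "lie" is negative, while the noun
# "right" has no entry and therefore scores as neutral.
SENTIMENT_LEXICON = {
    ("ညာ", "v"): -1,
}


def score_tokens(tagged_tokens, lexicon=SENTIMENT_LEXICON):
    """Sum polarity over (word, tag) pairs; unknown pairs count as 0."""
    return sum(lexicon.get((word, tag), 0) for word, tag in tagged_tokens)
```

With this layout the same surface form scores differently depending on its tag, so the result is only as reliable as the POS tagger.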
It might be very useful to have an algorithm that transforms a sentiment word from one form, say colloquial, into another, say literary, so that it can be matched against a sentiment lexicon of the same form. At the simplest level, the transformation removes the paired element of a compound: လိမ်ညာ => ညာ.
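The simplest version of that transformation can be sketched as a lookup with fallback: try the whole compound first, then its segmented parts. This assumes the word has already been segmented into roots and that the lexicon maps words to polarity values; `lookup_with_fallback` is a hypothetical helper, and checking the last part first is a design guess based on the လိမ်ညာ => ညာ example.

```python
def lookup_with_fallback(parts, lexicon):
    """Find a lexicon match for a compound given its segmented parts.

    parts:   segmented roots of one word, e.g. ["လိမ်", "ညာ"].
    lexicon: dict mapping word -> polarity.
    Returns (matched_form, polarity), or None if nothing matches.
    """
    whole = "".join(parts)
    if whole in lexicon:
        return whole, lexicon[whole]
    # Fall back to the individual roots, last one first.
    for part in reversed(parts):
        if part in lexicon:
            return part, lexicon[part]
    return None
```

A real form converter would need more than dropping one element, but this keeps the lexicon small while still matching the compound.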