Burmese NLP Tools
Burmese-Tools
burmese-tools is a Python library designed mainly for converting Burmese text between the Unicode and Zawgyi encodings. It also supports syllable tokenization for Burmese Unicode text and partial syllable tokenization for Burmese Zawgyi text.
Features
- Zawgyi to Unicode Conversion: Convert Zawgyi-encoded text to Unicode.
- Unicode to Zawgyi Conversion: Transform Unicode text back to Zawgyi.
- Unicode Syllable Tokenization: Tokenize Burmese Unicode text into syllables with customizable splitting.
- Partial Zawgyi Syllable Tokenization: Tokenize Burmese Zawgyi text into syllables.
Installation
You can install this library from PyPI:
pip install burmese-tools
Usage
Importing the Library
from burmese_tools import tools
Convert Zawgyi to Unicode
text_zawgyi = "ကႏၲာရ"
converted_text = tools.zaw2uni(text_zawgyi)
print(converted_text) # Output: ကန္တာရ
Convert Unicode to Zawgyi
text_unicode = "ကန္တာရ"
converted_text = tools.uni2zaw(text_unicode)
print(converted_text) # Output: ကႏၲာ႐
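Before converting, it can help to guess which encoding a string uses. The sketch below is a rough heuristic and is not part of burmese-tools: the helper name looks_like_zawgyi and the code point range are assumptions chosen for illustration. It relies on the fact that Zawgyi reuses code points U+1060 to U+1097 for Burmese stacked and variant glyphs, whereas standard Unicode assigns them to extensions for other languages written in the Myanmar script. It will misclassify text in those other languages, as well as Zawgyi text that happens to contain none of these code points.

```python
# Heuristic only (hypothetical helper, not part of burmese-tools).
# Zawgyi reuses U+1060..U+1097 for Burmese stacked/variant glyphs;
# standard Unicode Burmese text normally does not need this range.
ZAWGYI_REUSED_RANGE = range(0x1060, 0x1098)


def looks_like_zawgyi(text: str) -> bool:
    """Return True if any character falls in the Zawgyi-reused range."""
    return any(ord(ch) in ZAWGYI_REUSED_RANGE for ch in text)


# Samples written as escapes so the code points are unambiguous.
zawgyi_sample = "\u1000\u108f\u1072\u102c\u101b"          # contains U+108F and U+1072
unicode_sample = "\u1000\u1014\u1039\u1010\u102c\u101b"   # uses U+1039 stacking

print(looks_like_zawgyi(zawgyi_sample))   # True
print(looks_like_zawgyi(unicode_sample))  # False
```

A caller could use a check like this to decide whether to pass text through tools.zaw2uni, treating the result as a hint rather than a guarantee.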
Tokenize Unicode Text into Syllables
The uni_syllable function is a utility to tokenize Unicode text into syllable tokens, allowing for flexible splitting methods.
Features
- Tokenizes Unicode text into syllables.
- Provides two types of splitting:
  - Type 1: Splits ဂန္ဓာရ into ['ဂ', 'န္ဓာ', 'ရ']. (default)
  - Type 2: Splits ဂန္ဓာရ into ['ဂန္', 'ဓာ', 'ရ'].
- Supports an optional transform to replace ္ with ် (applies only when type=2). Default: transform=True.
text = "ကန္တာရ"
tokens = tools.uni_syllable(text, type=1)
print(tokens) # Output: ['က', 'န္တာ', 'ရ']
text = "ကန္တာရ"
tokens = tools.uni_syllable(text, type=2)
print(tokens) # Output: ['ကန်', 'တာ', 'ရ']
text = "ကန္တာရ"
tokens = tools.uni_syllable(text, type=2, transform=False)
print(tokens) # Output: ['ကန္', 'တာ', 'ရ']
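The relationship between the two splitting types and the transform option can be illustrated without the library. The sketch below is not how burmese-tools is implemented: regroup_stacked is a hypothetical helper that takes type-1 style tokens and produces type-2 style tokens, under the assumption that a stacked syllable begins with a consonant followed by the stacking sign ္ (U+1039).

```python
VIRAMA = "\u1039"  # stacking sign: joins a consonant to the one that follows
ASAT = "\u103a"    # visible "killer" mark used after a final consonant


def regroup_stacked(tokens, transform=True):
    """Turn type-1 style tokens into type-2 style tokens (illustrative only)."""
    result = []
    for token in tokens:
        # A type-1 token for a stacked syllable starts with consonant + virama.
        if len(token) > 2 and token[1] == VIRAMA and result:
            head, tail = token[:2], token[2:]
            if transform:
                # Mirror the documented transform: replace U+1039 with U+103A.
                head = head.replace(VIRAMA, ASAT)
            result[-1] += head   # attach the final consonant to the previous syllable
            result.append(tail)  # the rest becomes its own syllable
        else:
            result.append(token)
    return result


# Type-1 style tokens for the example word, written as escapes.
type1 = ["\u1000", "\u1014\u1039\u1010\u102c", "\u101b"]

print(regroup_stacked(type1))                   # ['ကန်', 'တာ', 'ရ']
print(regroup_stacked(type1, transform=False))  # ['ကန္', 'တာ', 'ရ']
```

With transform left on, the moved consonant carries ် instead of ္, which matches the documented type=2 output; with transform=False the stacking sign is kept as-is.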
Tokenize Zawgyi Text into Partial Syllables
text = "ကႏၲာရ"
tokens = tools.zaw_partial_syllable(text)
print(tokens) # Output: ['က', 'ႏၲာ', 'ရ'] (Unicode equivalent: ['က', 'န္တာ', 'ရ'])
Contributing
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch for your feature/bug fix.
- Make your changes and test thoroughly.
- Submit a pull request.
License
This library is licensed under the MIT License. Feel free to use, modify, and distribute it.
Acknowledgments
This library was developed to simplify Burmese text processing for developers and linguists. Special thanks to Sa Phyo Thu Thet of Simbolo for his invaluable guidance, kindness, and support in teaching me. His mentorship has been instrumental in shaping my understanding and skills.
Contributions and feedback from the community are also highly appreciated.
File details
Details for the file burmese_tools-0.1.2.tar.gz.
File metadata
- Download URL: burmese_tools-0.1.2.tar.gz
- Upload date:
- Size: 5.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b8b5055d01aae7e4db226d285f86cfa8a576f33fb2681ef28f5b6c1a77dbd636 |
| MD5 | 30a10162df0d1fc94b2a9c997d7e8bb9 |
| BLAKE2b-256 | a867cfc3257d5530bf5c2d3186bb34e43536e6cdcffc38ed79f37f760de06cfa |
File details
Details for the file burmese_tools-0.1.2-py3-none-any.whl.
File metadata
- Download URL: burmese_tools-0.1.2-py3-none-any.whl
- Upload date:
- Size: 5.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 158d031e2ffeb777c954352cc42cc04bbb4955baf649e8a3e47a6cc10021e436 |
| MD5 | c1722f9a351fdb09fe13263792b0af8c |
| BLAKE2b-256 | 96ca2da5d0f78ec19c6f8c49d900ee77f7e84fcb504724bc8fb3c527a59d7151 |