A comprehensive tokenization library for the Myanmar language

myTokenize

myTokenize is a Python library that tokenizes Myanmar text into syllables, words, phrases, and sentences, using rule-based, statistical, and neural network approaches.

Features

  • Syllable Tokenization: Break text into syllables using regex rules.
  • BPE and Unigram Tokenization: Subword segmentation using pretrained SentencePiece models.
  • Word Tokenization: Segment text into words using:
    • myWord: Dictionary-based tokenization.
    • CRF: Conditional Random Fields-based tokenization.
    • BiLSTM: Neural network-based tokenization.
  • Phrase Tokenization: Identify phrases in text using normalized pointwise mutual information (NPMI).
  • Sentence Tokenization: Use a BiLSTM model to segment text into sentences.

Installation

  1. Clone the repository:

    git clone https://github.com/ThuraAung1601/myTokenize.git
    cd myTokenize
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Install the library:

    pip install .
    
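Alternatively, the package is published on PyPI and can be installed directly:

    pip install myTokenize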

Usage

Syllable Tokenizer

from myTokenize import SyllableTokenizer

tokenizer = SyllableTokenizer()
syllables = tokenizer.tokenize("မြန်မာနိုင်ငံ။")
print(syllables)  # ['မြန်', 'မာ', 'နိုင်', 'ငံ', '။']

BPE Tokenizer

from myTokenize import BPETokenizer

tokenizer = BPETokenizer()
tokens = tokenizer.tokenize("ရွေးကောက်ပွဲမှာနိုင်ထားတဲ့ဒေါ်နယ်ထရမ့်")
print(tokens)  # ['▁ရွေးကောက်ပွဲ', 'မှာ', 'နိုင်', 'ထား', 'တဲ့', 'ဒေါ်', 'နယ်', 'ထ', 'ရ', 'မ့်']
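
Unigram Tokenizer

The folder structure below shows that a Unigram SentencePiece model ships alongside the BPE one. Assuming the API mirrors BPETokenizer with a UnigramTokenizer class — an assumption; check Tokenizer.py for the exact class name — usage would look like:

from myTokenize import UnigramTokenizer  # hypothetical name, mirroring BPETokenizer

tokenizer = UnigramTokenizer()
tokens = tokenizer.tokenize("ရွေးကောက်ပွဲမှာနိုင်ထားတဲ့ဒေါ်နယ်ထရမ့်")
print(tokens)  # Unigram subword pieces; segmentation will differ from BPE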

Word Tokenizer

from myTokenize import WordTokenizer

tokenizer = WordTokenizer(engine="CRF")  # Use "myWord", "CRF", or "LSTM"
words = tokenizer.tokenize("မြန်မာနိုင်ငံ။")
print(words)  # ['မြန်မာ', 'နိုင်ငံ', '။']
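
All three engines expose the same tokenize interface, so they are easy to compare on one input. A quick sketch (each engine loads its own model, and outputs may differ between engines):

from myTokenize import WordTokenizer

# Compare the dictionary-based, CRF, and BiLSTM engines on the same text
for engine in ("myWord", "CRF", "LSTM"):
    tokenizer = WordTokenizer(engine=engine)
    print(engine, tokenizer.tokenize("မြန်မာနိုင်ငံ။"))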

Phrase Tokenizer

from myTokenize import PhraseTokenizer

tokenizer = PhraseTokenizer()
phrases = tokenizer.tokenize("ညာဘက်ကိုယူပြီးတော့တည့်တည့်သွားပါ")
print(phrases)  # ['ညာဘက်_ကို', 'ယူ', 'ပြီး_တော့', 'တည့်တည့်', 'သွား_ပါ']
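
In the output above, tokens joined by an underscore (e.g. ပြီး_တော့) are word pairs the tokenizer merged into phrases because their NPMI score is high. For intuition, here is a minimal sketch of the NPMI score itself — not the library's implementation, which lives in myWord/phrase_segment.py, and the counts are made-up illustrative values:

import math

def npmi(bigram_count, x_count, y_count, total_unigrams, total_bigrams):
    # NPMI(x, y) = PMI(x, y) / -log p(x, y), normalized to [-1, 1];
    # scores near 1 mean x and y almost always occur together.
    p_xy = bigram_count / total_bigrams
    p_x = x_count / total_unigrams
    p_y = y_count / total_unigrams
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)

# Made-up counts for a pair that nearly always co-occurs:
print(npmi(80, 100, 90, 10_000, 9_999))  # ~0.93, a strong phrase candidate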

Sentence Tokenizer

from myTokenize import SentenceTokenizer

tokenizer = SentenceTokenizer()
sentences = tokenizer.tokenize("ညာဘက်ကိုယူပြီးတော့တည့်တည့်သွားပါခင်ဗျားငါးမိနစ်လောက်ကြာလိမ့်မယ်")
print(sentences)  # [['ညာ', 'ဘက်', 'ကို', 'ယူ', 'ပြီး', 'တော့', 'တည့်တည့်', 'သွား', 'ပါ'], ['ခင်ဗျား', 'ငါး', 'မိနစ်', 'လောက်', 'ကြာ', 'လိမ့်', 'မယ်']]
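
Each sentence is returned as a list of word tokens rather than a string. Since Myanmar script is written without spaces between words, the plain sentence strings can be recovered by joining each token list:

texts = ["".join(sentence) for sentence in sentences]
print(texts)  # ['ညာဘက်ကိုယူပြီးတော့တည့်တည့်သွားပါ', 'ခင်ဗျားငါးမိနစ်လောက်ကြာလိမ့်မယ်']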

Folder Structure

./myTokenize/
├── CRFTokenizer
│   └── wordseg_c2_crf.crfsuite
├── SentencePiece
│   ├── bpe_sentencepiece_model.model
│   ├── bpe_sentencepiece_model.vocab
│   ├── unigram_sentencepiece_model.model
│   └── unigram_sentencepiece_model.vocab
├── Tokenizer.py
└── myWord
    ├── phrase_segment.py
    └── word_segment.py

Dependencies

  • Python 3.7+
  • TensorFlow
  • SentencePiece
  • pycrfsuite
  • NumPy

License

This project is licensed under the MIT License. See the LICENSE file for details.

Authors

  • Ye Kyaw Thu
  • Thura Aung
