
A comprehensive tokenization library for the Myanmar language


myTokenizer

myTokenizer is a Python library that tokenizes Myanmar text into syllables, words, phrases, and sentences, offering rule-based, statistical, and neural network-based approaches.

Features

  • Syllable Tokenization: Break text into syllables using regex rules.
  • BPE and Unigram Tokenization: Leverage SentencePiece models for tokenization.
  • Word Tokenization: Segment text into words using:
    • myWord: Dictionary-based tokenization.
    • CRF: Conditional Random Fields-based tokenization.
    • BiLSTM: Neural network-based tokenization.
  • Phrase Tokenization: Identify phrases in text using normalized pointwise mutual information (NPMI).
  • Sentence Tokenization: Use a BiLSTM model to segment text into sentences.

Installation

  1. Clone the repository:

    git clone https://github.com/ThuraAung1601/myTokenizer.git
    cd myTokenizer
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Install the library:

    pip install .
    

Usage

Syllable Tokenizer

from myTokenizer import SyllableTokenizer

tokenizer = SyllableTokenizer()
syllables = tokenizer.tokenize("မြန်မာနိုင်ငံ။")
print(syllables)  # ['မြန်', 'မာ', 'နိုင်', 'ငံ', '။']
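
The syllable rules follow the widely used sylbreak-style regex approach: insert a break before every Myanmar consonant that is not stacked or a final, and before non-Myanmar characters. For reference, here is a minimal standalone sketch of that idea (not necessarily the library's exact pattern):

import re

# Break before any Myanmar consonant that is not stacked (preceded by ္)
# and not a final (followed by ် or ္), and before digits, punctuation,
# and Latin characters.
consonant = r"က-အ"
other = r"ဣဤဥဦဧဩဪဿ၌၍၏၀-၉၊။!-/:-@\[-`{-~\sa-zA-Z0-9"
pattern = re.compile("((?<!္)([" + consonant + "](?![်္])|[" + other + "]))")

def syllable_split(text):
    return pattern.sub(r" \1", text).strip().split()

print(syllable_split("မြန်မာနိုင်ငံ။"))  # ['မြန်', 'မာ', 'နိုင်', 'ငံ', '။']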

BPE Tokenizer

from myTokenizer import BPETokenizer

tokenizer = BPETokenizer()
tokens = tokenizer.tokenize("ရွေးကောက်ပွဲမှာနိုင်ထားတဲ့ဒေါ်နယ်ထရမ့်")
print(tokens)  # ['▁ရွေးကောက်ပွဲ', 'မှာ', 'နိုင်', 'ထား', 'တဲ့', 'ဒေါ်', 'နယ်', 'ထ', 'ရ', 'မ့်']
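
The BPE and Unigram tokenizers wrap the SentencePiece models bundled with the package (see Folder Structure below). If you want to inspect a model directly, the sentencepiece library can load it; the path here assumes a source checkout:

import sentencepiece as spm

# Load the bundled BPE model directly (path assumes a source checkout).
sp = spm.SentencePieceProcessor(model_file="myTokenizer/SentencePiece/bpe_sentencepiece_model.model")
tokens = sp.encode("ရွေးကောက်ပွဲ", out_type=str)
print(tokens)  # list of subword pieces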

Word Tokenizer

from myTokenizer import WordTokenizer

tokenizer = WordTokenizer(engine="CRF")  # engine: "myWord", "CRF", or "LSTM" (the BiLSTM model)
words = tokenizer.tokenize("မြန်မာနိုင်ငံ။")
print(words)  # ['မြန်မာ', 'နိုင်ငံ', '။']
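
Whatever the engine, word segmentation boils down to tagging each character as beginning (B) or continuing (I) a word and cutting at the B tags; the engines differ only in how the tags are predicted. A toy decoder with hypothetical tags illustrates the idea:

def decode(chars, tags):
    # Merge characters into words, starting a new word at each "B" tag.
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "B" and current:
            words.append(current)
            current = ch
        else:
            current += ch
    if current:
        words.append(current)
    return words

print(decode(list("မြန်မာ"), ["B", "I", "I", "I", "B", "I"]))  # ['မြန်', 'မာ']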

Phrase Tokenizer

from myTokenizer import PhraseTokenizer

tokenizer = PhraseTokenizer()
phrases = tokenizer.tokenize("ညာဘက်ကိုယူပြီးတော့တည့်တည့်သွားပါ")
print(phrases)  # ['ညာဘက်_ကို', 'ယူ', 'ပြီး_တော့', 'တည့်တည့်', 'သွား_ပါ']
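
NPMI normalizes pointwise mutual information to the range [-1, 1]: npmi(x, y) = log(p(x, y) / (p(x) p(y))) / (-log p(x, y)). Pairs that score above a threshold are joined with '_', as in the output above. A minimal sketch of the score, assuming you already have unigram and bigram counts:

import math

def npmi(count_xy, count_x, count_y, total):
    # Probabilities estimated from corpus counts.
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)  # normalized to [-1, 1]

# Hypothetical counts: the pair occurs 50 times in a 10,000-token corpus.
print(npmi(50, 80, 120, 10_000))  # ≈ 0.75, a strong phrase candidate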

Sentence Tokenizer

from myTokenizer import SentenceTokenizer

tokenizer = SentenceTokenizer()
sentences = tokenizer.tokenize("ညာဘက်ကိုယူပြီးတော့တည့်တည့်သွားပါခင်ဗျားငါးမိနစ်လောက်ကြာလိမ့်မယ်")
print(sentences)  # [['ညာ', 'ဘက်', 'ကို', 'ယူ', 'ပြီး', 'တော့', 'တည့်တည့်', 'သွား', 'ပါ'], ['ခင်ဗျား', 'ငါး', 'မိနစ်', 'လောက်', 'ကြာ', 'လိမ့်', 'မယ်']]
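
Each sentence comes back already word-segmented, so the result is a list of token lists; join the inner lists if you need plain strings:

for sentence in sentences:
    print(" ".join(sentence))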

Folder Structure

./myTokenizer/
├── CRFTokenizer
│   └── wordseg_c2_crf.crfsuite
├── SentencePiece
│   ├── bpe_sentencepiece_model.model
│   ├── bpe_sentencepiece_model.vocab
│   ├── unigram_sentencepiece_model.model
│   └── unigram_sentencepiece_model.vocab
├── Tokenizer.py
└── myWord
    ├── phrase_segment.py
    └── word_segment.py

Dependencies

  • Python 3.7+
  • ICU for Python (pyicu)
  • TensorFlow
  • SentencePiece
  • pycrfsuite
  • NumPy
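
If you install these by hand rather than via requirements.txt, note that pycrfsuite is published on PyPI as python-crfsuite:

pip install pyicu tensorflow sentencepiece python-crfsuite numpy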

License

This project is licensed under the MIT License. See the LICENSE file for details.

Authors

  • Ye Kyaw Thu
  • Thura Aung

