
Tibetan sentence tokenizer for segmenting Tibetan text into sentences

Project description


OpenPecha

Tibetan sentence tokenizer

Description

A Tibetan sentence tokenizer designed specifically for data preparation.


Installation

pip install git+https://github.com/OpenPecha/bo_sent_tokenizer.git

Usage

Important note: if speed is essential, prefer sentence segmentation over sentence tokenization; segmentation only splits and lightly filters the text, while tokenization additionally validates every sentence.

1. Sentence tokenization

from bo_sent_tokenizer import tokenize

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"

tokenized_text = tokenize(text)
print(tokenized_text)  # Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n'

Explanation

The code is adapted from op_mt_tools, with minor changes made to produce the desired output shown above.

Output Explanation

The text 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།' is clean Tibetan text.

The text 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།' contains an illegal token 'བབབབབབབབནམ'.

The text 'ངའི་མིང་ལ་Thomas་ཟེར།' includes characters from another language.

The text 'ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།' contains the non-Tibetan symbols '(' and ')'.

If a sentence is clean, it is retained. If it contains an illegal token or characters from another language, the whole sentence is excluded. If it contains non-Tibetan symbols, only those symbols are filtered out and the sentence is retained.
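
To see each rule in isolation, the sentences from the example above can be tokenized one at a time. The outputs below are what these rules imply, assuming an excluded sentence yields an empty string:

from bo_sent_tokenizer import tokenize

# Clean sentence: retained as-is.
print(tokenize("ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།"))  # Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n'

# Illegal token: the whole sentence is excluded.
print(tokenize("ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།"))  # Output: ''

# Characters from another language: the whole sentence is excluded.
print(tokenize("ངའི་མིང་ལ་Thomas་ཟེར།"))  # Output: ''

# Non-Tibetan symbols: '(' and ')' are filtered out and the sentence is kept.
print(tokenize("ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"))  # Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n'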

2. Sentence segmentation

from bo_sent_tokenizer import segment

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"

segmented_text = segment(text)
print(segmented_text)  # Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།\nངའི་མིང་ལ་ ་ཟེར།\nཁྱེད་དེ་རིང(བདེ་མོ་)ཡིན་ནམ།\n'

Note that, unlike tokenize, segment keeps every sentence: the sentence with the illegal token is retained, non-Tibetan characters such as 'Thomas' are stripped in place, and predefined symbols such as '(' and ')' are kept.

Terms:

Closing Punctuation: Characters in the Tibetan language that mark the end of a sentence, similar to a full stop in English; the shad ། is the most common example.

Opening Punctuation: Characters in the Tibetan language that mark the start of a sentence or text, such as the head mark ༄༅.

How Sentence Segmentation Works:

  1. Preprocessing: All carriage returns and new lines are removed from the string.

  2. Splitting into Parts: The preprocessed text is then split by closing punctuation using a regular expression.

  3. Joining the Parts:

    • Empty parts are ignored.
    • In some cases, closing punctuation appears immediately after opening punctuation, so care is taken not to split these instances. Example of a valid Tibetan sentence: ༄༅།།བོད་ཀྱི་གསོ་བ་རིག་པའི་གཞུང་ལུགས་དང་དེའི་སྐོར་གྱི་དཔྱད་བརྗོད།
      • ༄༅ = opening punctuation
      • །། = closing punctuation
  4. Filtering Text: Only Tibetan characters and a few predefined symbols are retained; all other characters are removed.
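
A minimal sketch of these four steps, assuming simplified punctuation sets and a Tibetan-plus-parentheses character filter (the package's real sets and symbols live in vars.py, and its behavior may differ in detail):

import re

OPENING = "༄༅"  # assumed opening punctuation (yig mgo head marks)
CLOSING = "།"   # assumed closing punctuation (shad)

def naive_segment(text: str) -> str:
    # 1. Preprocessing: remove carriage returns and newlines.
    text = text.replace("\r", "").replace("\n", "")
    # 2. Splitting into parts: split on runs of closing punctuation,
    #    capturing each run so it can be re-attached to its sentence.
    parts = re.split(rf"([{CLOSING}]+)", text)
    chunks, closers = parts[0::2], parts[1::2] + [""]
    # 3. Joining the parts: do not end a sentence when the closing run
    #    directly follows opening punctuation (e.g. the head mark ༄༅།།).
    sentences, buffer = [], ""
    for chunk, closer in zip(chunks, closers):
        buffer += chunk + closer
        if closer and not (chunk and chunk[-1] in OPENING):
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    # 4. Filtering text: keep Tibetan characters (U+0F00–U+0FFF) and a few
    #    assumed predefined symbols; drop everything else and skip parts
    #    that end up empty.
    cleaned = []
    for sent in sentences:
        sent = re.sub(r"[^\u0F00-\u0FFF()\[\]]", "", sent)
        if sent:
            cleaned.append(sent)
    return "\n".join(cleaned) + "\n"

print(naive_segment("༄༅།།བོད་ཀྱི་གསོ་བ་རིག་པའི་གཞུང་ལུགས་དང་དེའི་སྐོར་གྱི་དཔྱད་བརྗོད།"))
# The head mark ༄༅།། stays attached to its sentence rather than being split off.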

Note:

  • Closing punctuation, opening punctuation, and predefined symbols are defined in the file vars.py
  • For a better understanding of the code, refer to the test cases in test_segmenter.py

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bo_sent_tokenizer-0.0.1.tar.gz (7.4 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bo_sent_tokenizer-0.0.1-py3-none-any.whl (6.8 kB)

Uploaded Python 3

File details

Details for the file bo_sent_tokenizer-0.0.1.tar.gz.

File metadata

  • Download URL: bo_sent_tokenizer-0.0.1.tar.gz
  • Upload date:
  • Size: 7.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for bo_sent_tokenizer-0.0.1.tar.gz

  • SHA256: 6f758f9528ad22c4987e59ea5a06e27c63b5b108b3aa4ec5d677231a309835dd
  • MD5: 8e847614e352d83b0757252aab653286
  • BLAKE2b-256: 79b959c87e3aea9d9836da37c54cb775200c11af0a0de8acf2281280ee7ecd57

See more details on using hashes here.

File details

Details for the file bo_sent_tokenizer-0.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for bo_sent_tokenizer-0.0.1-py3-none-any.whl

  • SHA256: f2273deedd210dc270817ec59f1691d1d51484931b8fa917b27d9775b1a8511c
  • MD5: 3f60ae231ab3061c7456fe39e3d51352
  • BLAKE2b-256: 5e4ffb601903a25e3b8541b89bd8683e8b5197fa96a1c0e970a9eb05e7c06afb

See more details on using hashes here.
