# bo_sent_tokenizer

A Tibetan sentence tokenizer for segmenting Tibetan text into sentences, designed specifically for data preparation.
## Installation

```bash
pip install git+https://github.com/OpenPecha/bo_sent_tokenizer.git
```
## Usage

**Important note:** if speed is essential, prefer sentence segmentation over sentence tokenization.
### 1. Sentence tokenization

```python
from bo_sent_tokenizer import tokenize

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"
tokenized_text = tokenize(text)
print(tokenized_text)
# Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n'
```
#### Explanation

The code is adapted from op_mt_tools, with minor changes to produce the desired output.
#### Output Explanation

- 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།' is clean Tibetan text.
- 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།' contains an illegal token, 'བབབབབབབབནམ'.
- 'ངའི་མིང་ལ་Thomas་ཟེར།' includes characters from another language.
- 'ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།' contains the non-Tibetan symbols '(' and ')'.

If a sentence is clean, it is retained. If it contains an illegal token or characters from another language, the whole sentence is excluded. If it contains non-Tibetan symbols, those symbols are filtered out and the sentence is retained.
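As a rough illustration, these filtering rules can be sketched in a few lines. This is not the library's implementation: the shad delimiter, the repeated-letter test for "illegal tokens", the Latin-letter test for "another language", and the symbol set are all assumptions made for the sketch.

```python
import re

# Illustrative sketch only -- not bo_sent_tokenizer's actual code.
# Assumptions: sentences end with the Tibetan shad (།, U+0F0D); an
# "illegal token" is modeled as 8+ repeats of one Tibetan letter; and
# "characters from another language" are modeled as Latin letters.
SHAD = "\u0f0d"
FOREIGN = re.compile(r"[A-Za-z]")
ILLEGAL = re.compile(r"([\u0f40-\u0fbc])\1{7,}")
SYMBOLS = re.compile(r"[()\[\]{}]")

def tokenize_sketch(text: str) -> str:
    out = []
    for sent in (s.strip() for s in text.split(SHAD)):
        if not sent:
            continue
        if FOREIGN.search(sent) or ILLEGAL.search(sent):
            continue  # exclude the whole sentence
        out.append(SYMBOLS.sub("", sent) + SHAD)  # strip symbols, keep sentence
    return "\n".join(out) + "\n" if out else ""
```

Run on the example text above, this sketch keeps the first sentence, drops the two problematic ones, and strips the parentheses from the last.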
### 2. Sentence segmentation

```python
from bo_sent_tokenizer import segment

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"
segmented_text = segment(text)
print(segmented_text)
# Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།\nངའི་མིང་ལ་ ་ཟེར།\nཁྱེད་དེ་རིང(བདེ་མོ་)ཡིན་ནམ།\n'
```
#### Terms

- **Closing punctuation**: characters in the Tibetan language that mark the end of a sentence, similar to a full stop in English.
- **Opening punctuation**: characters in the Tibetan language that mark the start of a sentence.
#### How sentence segmentation works

1. **Preprocessing**: all carriage returns and newlines are removed from the string.
2. **Splitting into parts**: the preprocessed text is split on closing punctuation using a regular expression.
3. **Joining the parts**:
   - Empty parts are ignored.
   - Closing punctuation sometimes appears immediately after opening punctuation, so care is taken not to split at these instances. Example of a valid Tibetan sentence: ༄༅།།བོད་ཀྱི་གསོ་བ་རིག་པའི་གཞུང་ལུགས་དང་དེའི་སྐོར་གྱི་དཔྱད་བརྗོད།
     - ༄༅ = opening punctuation
     - །། = closing punctuation
4. **Filtering text**: only Tibetan characters and a few predefined symbols are retained; all other characters are removed.
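The four steps above can be sketched as follows. This is an illustration only, not the library's code: the real punctuation sets and predefined symbols live in `vars.py`, whereas here ། and ༎ are assumed to close a sentence, ༄ and ༅ to open one, and the retained symbols are assumed to be parentheses and spaces.

```python
import re

# Illustrative sketch of the four segmentation steps -- assumed punctuation,
# not the definitions from vars.py.
CLOSING = "\u0f0d\u0f0e"   # ། ༎
OPENING = "\u0f04\u0f05"   # ༄ ༅

def segment_sketch(text: str) -> str:
    # 1. Preprocessing: remove carriage returns and newlines.
    text = text.replace("\r", "").replace("\n", "")
    # 2. Split on runs of closing punctuation, keeping the delimiters.
    parts = re.split(f"([{CLOSING}]+)", text)
    sentences, current = [], ""
    for part in parts:
        if not part.strip():
            continue  # 3a. ignore empty parts
        if part[0] in CLOSING and current.rstrip().endswith(tuple(OPENING)):
            current += part  # 3b. e.g. ༄༅།། -- closing right after opening: no split
        elif part[0] in CLOSING:
            sentences.append((current + part).strip())
            current = ""
        else:
            current += part
    if current.strip():
        sentences.append(current.strip())
    # 4. Filtering: keep Tibetan characters plus the (assumed) symbols.
    keep = re.compile(r"[^\u0f00-\u0fff() ]+")
    return "\n".join(keep.sub(" ", s).strip() for s in sentences) + "\n"
```

On the valid-sentence example above, the sketch leaves ༄༅།། attached to the sentence head rather than splitting after it.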
#### Note

- Closing punctuation, opening punctuation, and the predefined symbols are defined in `vars.py`.
- For a better understanding of the code, refer to the test cases in `test_segmenter.py`.
## File details

Details for the file `bo_sent_tokenizer-0.0.1.tar.gz`.

### File metadata

- Download URL: bo_sent_tokenizer-0.0.1.tar.gz
- Upload date:
- Size: 7.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6f758f9528ad22c4987e59ea5a06e27c63b5b108b3aa4ec5d677231a309835dd |
| MD5 | 8e847614e352d83b0757252aab653286 |
| BLAKE2b-256 | 79b959c87e3aea9d9836da37c54cb775200c11af0a0de8acf2281280ee7ecd57 |
## File details

Details for the file `bo_sent_tokenizer-0.0.1-py3-none-any.whl`.

### File metadata

- Download URL: bo_sent_tokenizer-0.0.1-py3-none-any.whl
- Upload date:
- Size: 6.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | f2273deedd210dc270817ec59f1691d1d51484931b8fa917b27d9775b1a8511c |
| MD5 | 3f60ae231ab3061c7456fe39e3d51352 |
| BLAKE2b-256 | 5e4ffb601903a25e3b8541b89bd8683e8b5197fa96a1c0e970a9eb05e7c06afb |