# bo_sent_tokenizer

A Tibetan sentence tokenizer for segmenting Tibetan text into sentences, designed specifically for data preparation.
## Installation

```bash
pip install git+https://github.com/OpenPecha/bo_sent_tokenizer.git
```
## Usage

**Important note:** if speed is essential, prefer sentence segmentation over sentence tokenization.
### 1. Sentence tokenization

```python
from bo_sent_tokenizer import tokenize

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"
tokenized_text = tokenize(text)
print(tokenized_text)
# Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n'
```
#### Explanation

The code is adapted from op_mt_tools, with minor changes to produce the desired output.
#### Output Explanation

- 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།' is clean Tibetan text.
- 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།' contains an illegal token, 'བབབབབབབབནམ'.
- 'ངའི་མིང་ལ་Thomas་ཟེར།' includes characters from another language.
- 'ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།' contains the non-Tibetan symbols '(' and ')'.

If a sentence is clean, it is retained. If it contains an illegal token or characters from another language, the whole sentence is excluded. If it contains non-Tibetan symbols, those symbols are filtered out and the sentence is retained.
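As a rough illustration, these filtering rules can be sketched in a few lines. This is not the library's implementation: the shad delimiter, the repeated-letter test for "illegal tokens", the Latin-letter test for "another language", and the symbol set are all assumptions made for the sketch.

```python
import re

# Illustrative sketch only -- not bo_sent_tokenizer's actual code.
# Assumptions: sentences end with the Tibetan shad (།, U+0F0D); an
# "illegal token" is modeled as 8+ repeats of one Tibetan letter; and
# "characters from another language" are modeled as Latin letters.
SHAD = "\u0f0d"
FOREIGN = re.compile(r"[A-Za-z]")
ILLEGAL = re.compile(r"([\u0f40-\u0fbc])\1{7,}")
SYMBOLS = re.compile(r"[()\[\]{}]")

def tokenize_sketch(text: str) -> str:
    out = []
    for sent in (s.strip() for s in text.split(SHAD)):
        if not sent:
            continue
        if FOREIGN.search(sent) or ILLEGAL.search(sent):
            continue  # exclude the whole sentence
        out.append(SYMBOLS.sub("", sent) + SHAD)  # strip symbols, keep sentence
    return "\n".join(out) + "\n" if out else ""
```

Run on the example text above, this sketch keeps the first sentence, drops the two problematic ones, and strips the parentheses from the last.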
### 2. Sentence segmentation

```python
from bo_sent_tokenizer import segment

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"
segmented_text = segment(text)
print(segmented_text)
# Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།\nངའི་མིང་ལ་ ་ཟེར།\nཁྱེད་དེ་རིང(བདེ་མོ་)ཡིན་ནམ།\n'
```
#### Terms

- **Closing punctuation**: characters in the Tibetan language that mark the end of a sentence, similar to a full stop in English.
- **Opening punctuation**: characters in the Tibetan language that mark the start of a sentence.
#### How sentence segmentation works

1. **Preprocessing**: all carriage returns and newlines are removed from the string.
2. **Splitting into parts**: the preprocessed text is split on closing punctuation using a regular expression.
3. **Joining the parts**:
   - Empty parts are ignored.
   - Closing punctuation sometimes appears immediately after opening punctuation, so care is taken not to split at these instances. Example of a valid Tibetan sentence: ༄༅།།བོད་ཀྱི་གསོ་བ་རིག་པའི་གཞུང་ལུགས་དང་དེའི་སྐོར་གྱི་དཔྱད་བརྗོད།
     - ༄༅ = opening punctuation
     - །། = closing punctuation
4. **Filtering text**: only Tibetan characters and a few predefined symbols are retained; all other characters are removed.
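The four steps above can be sketched as follows. This is an illustration only, not the library's code: the real punctuation sets and predefined symbols live in `vars.py`, whereas here ། and ༎ are assumed to close a sentence, ༄ and ༅ to open one, and the retained symbols are assumed to be parentheses and spaces.

```python
import re

# Illustrative sketch of the four segmentation steps -- assumed punctuation,
# not the definitions from vars.py.
CLOSING = "\u0f0d\u0f0e"   # ། ༎
OPENING = "\u0f04\u0f05"   # ༄ ༅

def segment_sketch(text: str) -> str:
    # 1. Preprocessing: remove carriage returns and newlines.
    text = text.replace("\r", "").replace("\n", "")
    # 2. Split on runs of closing punctuation, keeping the delimiters.
    parts = re.split(f"([{CLOSING}]+)", text)
    sentences, current = [], ""
    for part in parts:
        if not part.strip():
            continue  # 3a. ignore empty parts
        if part[0] in CLOSING and current.rstrip().endswith(tuple(OPENING)):
            current += part  # 3b. e.g. ༄༅།། -- closing right after opening: no split
        elif part[0] in CLOSING:
            sentences.append((current + part).strip())
            current = ""
        else:
            current += part
    if current.strip():
        sentences.append(current.strip())
    # 4. Filtering: keep Tibetan characters plus the (assumed) symbols.
    keep = re.compile(r"[^\u0f00-\u0fff() ]+")
    return "\n".join(keep.sub(" ", s).strip() for s in sentences) + "\n"
```

On the valid-sentence example above, the sketch leaves ༄༅།། attached to the sentence head rather than splitting after it.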
#### Note

- Closing punctuation, opening punctuation, and the predefined symbols are defined in `vars.py`.
- For a better understanding of the code, refer to the test cases in `test_segmenter.py`.
## File details

Details for the file `bo_sent_tokenizer-0.0.1.tar.gz`.

### File metadata

- Download URL: bo_sent_tokenizer-0.0.1.tar.gz
- Upload date:
- Size: 7.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6f758f9528ad22c4987e59ea5a06e27c63b5b108b3aa4ec5d677231a309835dd |
| MD5 | 8e847614e352d83b0757252aab653286 |
| BLAKE2b-256 | 79b959c87e3aea9d9836da37c54cb775200c11af0a0de8acf2281280ee7ecd57 |
## File details

Details for the file `bo_sent_tokenizer-0.0.1-py3-none-any.whl`.

### File metadata

- Download URL: bo_sent_tokenizer-0.0.1-py3-none-any.whl
- Upload date:
- Size: 6.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | f2273deedd210dc270817ec59f1691d1d51484931b8fa917b27d9775b1a8511c |
| MD5 | 3f60ae231ab3061c7456fe39e3d51352 |
| BLAKE2b-256 | 5e4ffb601903a25e3b8541b89bd8683e8b5197fa96a1c0e970a9eb05e7c06afb |