botok

Tibetan Word Tokenizer

Botok – Python Tibetan Tokenizer

Description • Install • Example • Commented Example • Docs • Owners • Acknowledgements • Maintenance • License

Description

Botok tokenizes Tibetan text into words and can attach optional attributes to each token, such as its lemma, part-of-speech (POS) tag, and cleaned form.

Install

Requires Python 3.

pip3 install botok

Example

from pathlib import Path

from botok import WordTokenizer
from botok.config import Config


def get_tokens(wt, text):
    # split_affixes=False keeps affixed particles attached to their host word
    tokens = wt.tokenize(text, split_affixes=False)
    return tokens


if __name__ == "__main__":
    # Load the "general" dialect pack, using the home directory as base path
    config = Config(dialect_name="general", base_path=Path.home())
    wt = WordTokenizer(config=config)
    text = "བཀྲ་ཤིས་བདེ་ལེགས་ཞུས་རྒྱུ་ཡིན་ སེམས་པ་སྐྱིད་པོ་འདུག།"
    tokens = get_tokens(wt, text)
    for token in tokens:
        print(token)

https://user-images.githubusercontent.com/24893704/148767959-31cc0a69-4c83-4841-8a1d-028d376e4677.mp4
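
Loading the tokenizer's trie can take a couple of seconds (the commented example reports `Loading Trie... (2s.)`), so when several lines need tokenizing it seems reasonable to build one tokenizer and reuse it. The sketch below is only an illustration: `tokenize_lines` is a hypothetical helper, not part of botok, and it assumes nothing beyond the `tokenize(text, split_affixes=...)` call used in the example above.

```python
from typing import Iterable, Iterator, List


def tokenize_lines(wt, lines: Iterable[str], split_affixes: bool = False) -> Iterator[List]:
    """Yield one list of tokens per non-empty line.

    `wt` is any object exposing tokenize(text, split_affixes=...), such as the
    WordTokenizer built in the example above. This helper is illustrative and
    is not part of botok.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of tokenizing empty strings
        yield wt.tokenize(line, split_affixes=split_affixes)


# Usage sketch (assumes botok is installed and `wt` was created as above):
#     with open("test.txt", encoding="utf-8") as f:
#         for tokens in tokenize_lines(wt, f):
#             print(len(tokens))
```

Passing the tokenizer in as an argument keeps the helper independent of how the tokenizer was configured.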

Commented Example

>>> from botok import Text

>>> # input is a multi-line input string
>>> in_str = """ལེ གས། བཀྲ་ཤིས་མཐའི་ ༆ ཤི་བཀྲ་ཤིས་  tr 
... བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། 
... མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།"""


### STEP1: instantiating Text

>>> # A. on a string
>>> t = Text(in_str)

>>> # B. on a file
... # note all following operations can be applied to files in this way.
>>> from pathlib import Path
>>> in_file = Path.cwd() / 'test.txt'

>>> # file content:
>>> in_file.read_text()
'བཀྲ་ཤིས་བདེ་ལེགས།།\n'

>>> t = Text(in_file)
>>> t.tokenize_chunks_plaintext

>>> # checking an output file has been written:
... # BOM is added by default so that notepad in Windows doesn't scramble the line breaks
>>> out_file = Path.cwd() / 'test_pybo.txt'
>>> out_file.read_text()
'\ufeffབཀྲ་ ཤིས་ བདེ་ ལེགས །།'

### STEP2: properties will perform actions on the input string:
### note: original spaces are replaced by underscores.

>>> # OUTPUT1: chunks are meaningful groups of chars from the input string.
... # see how punctuation, numerals, non-Tibetan (non-bo) characters and syllables are all neatly grouped.
>>> t.tokenize_chunks_plaintext
'ལེ_གས །_ བཀྲ་ ཤིས་ མཐའི་ _༆_ ཤི་ བཀྲ་ ཤིས་__ tr_\n བདེ་་ ལེ_གས །_ བཀྲ་ ཤིས་ བདེ་ ལེགས་ ༡༢༣ ཀཀ །_\n མཐའི་ རྒྱ་ མཚོར་ གནས་ པའི་ ཉས་ ཆུ་ འཐུང་ །།_།། མཁའ །'

>>> # OUTPUT2: could as well be achieved by in_str.split(' ')
>>> t.tokenize_on_spaces
'ལེ གས། བཀྲ་ཤིས་མཐའི་ ༆ ཤི་བཀྲ་ཤིས་ tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།'

>>> # OUTPUT3: segments into words.
... # see how བདེ་་ལེ_གས was still recognized as a single word, even with the space and the double tsek.
... # the affixed particles are separated from the hosting word: མཐ འི་ རྒྱ་མཚོ ར་ གནས་པ འི་ ཉ ས་
>>> t.tokenize_words_raw_text
Loading Trie... (2s.)
'ལེ_གས །_ བཀྲ་ཤིས་ མཐ འི་ _༆_ ཤི་ བཀྲ་ཤིས་_ tr_ བདེ་་ལེ_གས །_ བཀྲ་ཤིས་ བདེ་ལེགས་ ༡༢༣ ཀཀ །_ མཐ འི་ རྒྱ་མཚོ ར་ གནས་པ འི་ ཉ ས་ ཆུ་ འཐུང་ །།_།། མཁའ །'
>>> t.tokenize_words_raw_lines
'ལེ_གས །_ བཀྲ་ཤིས་ མཐ འི་ _༆_ ཤི་ བཀྲ་ཤིས་__ tr_\n བདེ་་ལེ_གས །_ བཀྲ་ཤིས་ བདེ་ལེགས་ ༡༢༣ ཀཀ །_\n མཐ འི་ རྒྱ་མཚོ ར་ གནས་པ འི་ ཉ ས་ ཆུ་ འཐུང་ །།_།། མཁའ །'

>>> # OUTPUT4: segments into words, then counts the occurrences of each word found
... # by default, it counts in_str's substrings in the output, which is why we have བདེ་་ལེ གས	1, བདེ་ལེགས་	1
... # this behaviour can easily be modified to take into account the words that pybo recognized instead (see advanced usage)
>>> print(t.list_word_types)
འི་	3
། 	2
བཀྲ་ཤིས་	2
མཐ	2
ལེ གས	1
 ༆ 	1
ཤི་	1
བཀྲ་ཤིས་  	1
tr \n	1
བདེ་་ལེ གས	1
བདེ་ལེགས་	1
༡༢༣	1
ཀཀ	1
། \n	1
རྒྱ་མཚོ	1
ར་	1
གནས་པ	1
ཉ	1
ས་	1
ཆུ་	1
འཐུང་	1
།། །།	1
མཁའ	1
།	1
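
The tab-separated type/count listing above can be approximated with the standard library. This is a minimal sketch and not botok's own implementation: it assumes you already have a list of word strings (a hand-written sample here), and `count_word_types` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def count_word_types(words: Iterable[str]) -> str:
    """Return one 'word<TAB>count' line per distinct word, most frequent first."""
    counts = Counter(words)
    # most_common() sorts by decreasing count; ties keep first-seen order
    return "\n".join(f"{word}\t{n}" for word, n in counts.most_common())


# Hand-written sample, not produced by running botok
sample = ["བཀྲ་ཤིས་", "མཐ", "འི་", "བཀྲ་ཤིས་", "མཐ", "འི་", "འི་"]
print(count_word_types(sample))
```

Because the counting works on plain strings, the same helper can be fed either raw substrings of the input or normalized word forms, depending on which notion of "word type" is wanted.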

Custom dialect pack:

To use a custom dialect pack:

Prepare your dialect pack with the same folder structure as the general dialect pack.
Instantiate a Config object, passing it the dialect name and the path.
Instantiate your tokenizer with that Config object.
Your tokenizer will then use your custom dialect pack and, on later runs, will use the pickled trie file to build the custom trie.
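
The steps above can be sketched as follows. `make_custom_tokenizer`, the dialect name `my_dialect` and the path are placeholders chosen for illustration; only the `Config(dialect_name=..., base_path=...)` and `WordTokenizer(config=...)` calls come from the example earlier in this README. Whether `base_path` should point at the pack itself or at its parent folder is not stated here, so check it against your installed version.

```python
from pathlib import Path


def make_custom_tokenizer(dialect_name: str, base_path):
    """Build a WordTokenizer from a custom dialect pack (illustrative helper)."""
    base_path = Path(base_path)
    # Fail early with a clear message if the pack location does not exist
    if not base_path.is_dir():
        raise FileNotFoundError(f"dialect pack base path not found: {base_path}")

    # Imported here so the path check above runs before botok is loaded
    from botok import WordTokenizer
    from botok.config import Config

    config = Config(dialect_name=dialect_name, base_path=base_path)
    return WordTokenizer(config=config)


# Usage sketch (placeholder name and path, requires botok to be installed):
#     wt = make_custom_tokenizer("my_dialect", Path.home() / "dialect_packs")
#     tokens = wt.tokenize(text, split_affixes=False)
```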

Docs

No documentation is available.

Owners

Acknowledgements

botok is an open source library for Tibetan NLP.

We are always open to cooperation in introducing new features, tool integrations and testing solutions.

Many thanks to the companies and organizations who have supported botok's development, especially:

Khyentse Foundation for contributing USD 22,000 to kickstart the project
The Barom/Esukhia canon project for sponsoring training data curation
BDRC for contributing 2 staff for 6 months for data curation

Maintenance

Build the source distribution:

rm -rf dist/
python3 setup.py clean sdist

and upload it with twine (version >= 1.11.0):

twine upload dist/*

License

Contributors:

Drupchen
Élie Roux
Ngawang Trinley
Mikko Kotila
Thubten Rinzin
Tenzin
Joyce Mackzenzie for reworking the logo

Release history

Version	Release date
1.1.2	Jan 16, 2026
1.1.1	Dec 15, 2025
0.9.0	Mar 9, 2025
0.8.12	May 17, 2023
0.8.11 (this version)	May 11, 2023
0.8.10	Apr 5, 2022
0.8.8	Oct 12, 2021
0.8.7	Jun 21, 2021
0.8.6	May 20, 2021
0.8.5	Apr 15, 2021
0.8.4	Apr 14, 2021
0.8.3	Mar 29, 2021
0.8.2	Mar 22, 2021
0.8.1	Jul 28, 2020
0.7.5	Dec 30, 2019
0.7.4	Dec 15, 2019
0.7.3	Dec 12, 2019
0.7.2	Dec 12, 2019
0.7.1	Dec 11, 2019
0.7.0	Dec 10, 2019
0.6.18	Nov 21, 2019
0.6.17	Nov 7, 2019
0.6.16	Nov 7, 2019
0.6.15	Nov 6, 2019
0.6.14	Nov 5, 2019
0.6.13	Nov 1, 2019
0.6.12	Oct 7, 2019
0.6.11	Oct 4, 2019
0.6.10	Sep 12, 2019
0.6.9	Sep 1, 2019

Download files

Source Distribution

botok-0.8.11.tar.gz (66.2 kB)

Uploaded May 11, 2023 Source

Built Distribution

botok-0.8.11-py3-none-any.whl (76.8 kB)

Uploaded May 11, 2023 Python 3

File details

Details for the file botok-0.8.11.tar.gz.

File metadata

Download URL: botok-0.8.11.tar.gz
Upload date: May 11, 2023
Size: 66.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/37.3 requests/2.30.0 requests-toolbelt/1.0.0 urllib3/2.0.2 tqdm/4.65.0 importlib-metadata/6.6.0 keyring/23.13.1 rfc3986/2.0.0 colorama/0.4.6 CPython/3.10.11

File hashes

Hashes for botok-0.8.11.tar.gz
Algorithm	Hash digest
SHA256	`b35129125b635a446cb93bae6cb6a5c77fa815ba6b062c6576defc2f637c442c`
MD5	`8913ce9cd74983fa739d434ede0cd9b1`
BLAKE2b-256	`156ae0e76c5daa1e8f539ad7950a7b7e81de06382a1d80b89f88597e2e979a4f`

File details

Details for the file botok-0.8.11-py3-none-any.whl.

File metadata

Download URL: botok-0.8.11-py3-none-any.whl
Upload date: May 11, 2023
Size: 76.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/37.3 requests/2.30.0 requests-toolbelt/1.0.0 urllib3/2.0.2 tqdm/4.65.0 importlib-metadata/6.6.0 keyring/23.13.1 rfc3986/2.0.0 colorama/0.4.6 CPython/3.10.11

File hashes

Hashes for botok-0.8.11-py3-none-any.whl
Algorithm	Hash digest
SHA256	`55ea75471061f544e55d87c94ae9c6d5c53afb2342f7b8ae2274ab5f17e559c1`
MD5	`c9c92af0c7fc51b1fa3d15d7f87e9cee`
BLAKE2b-256	`64d2488bae57c08b60dc368118b3fa3ac8ae99334a56de33cadf70dd8b09a5f1`
