Skip to main content

Tibetan Word Tokenizer

Project description

botok – Python Tibetan Tokenizer

GitHub release Documentation Status Build Status Coverage Status CodeFactor Code style: black

Overview

botok tokenizes Tibetan text into words.

Basic usage

Getting started

Requires to have Python3 installed.

pip3 install botok
>>> from botok import Text

>>> # input is a multi-line input string
>>> in_str = """ལེ གས། བཀྲ་ཤིས་མཐའི་ ༆ ཤི་བཀྲ་ཤིས་  tr 
... བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། 
... མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།"""


### STEP1: instanciating Text

>>> # A. on a string
>>> t = Text(in_str)

>>> # B. on a file
... # note all following operations can be applied to files in this way.
>>> from pathlib import Path
>>> in_file = Path.cwd() / 'test.txt'

>>> # file content:
>>> in_file.read_text()
'བཀྲ་ཤིས་བདེ་ལེགས།།\n'

>>> t = Text(in_file)
>>> t.tokenize_chunks_plaintext

>>> # checking an output file has been written:
... # BOM is added by default so that notepad in Windows doesn't scramble the line breaks
>>> out_file = Path.cwd() / 'test_pybo.txt'
>>> out_file.read_text()
'\ufeffབཀྲ་ ཤིས་ བདེ་ ལེགས །།'

### STEP2: properties will perform actions on the input string:
### note: original spaces are replaced by underscores.

>>> # OUTPUT1: chunks are meaningful groups of chars from the input string.
... # see how punctuations, numerals, non-bo and syllables are all neatly grouped.
>>> t.tokenize_chunks_plaintext
'ལེ_གས །_ བཀྲ་ ཤིས་ མཐའི་ _༆_ ཤི་ བཀྲ་ ཤིས་__ tr_\n བདེ་་ ལེ_གས །_ བཀྲ་ ཤིས་ བདེ་ ལེགས་ ༡༢༣ ཀཀ །_\n མཐའི་ རྒྱ་ མཚོར་ གནས་ པའི་ ཉས་ ཆུ་ འཐུང་ །།_།། མཁའ །'

>>> # OUTPUT2: could as well be acheived by in_str.split(' ')
>>> t.tokenize_on_spaces
'ལེ གས། བཀྲ་ཤིས་མཐའི་ ༆ ཤི་བཀྲ་ཤིས་ tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།'

>>> # OUTPUT3: segments in words.
... # see how བདེ་་ལེ_གས was still recognized as a single word, even with the space and the double tsek.
... # the affixed particles are separated from the hosting word: མཐ འི་ རྒྱ་མཚོ ར་ གནས་པ འི་ ཉ ས་
>>> t.tokenize_words_raw_text
Loading Trie... (2s.)
'ལེ_གས །_ བཀྲ་ཤིས་ མཐ འི་ _༆_ ཤི་ བཀྲ་ཤིས་_ tr_ བདེ་་ལེ_གས །_ བཀྲ་ཤིས་ བདེ་ལེགས་ ༡༢༣ ཀཀ །_ མཐ འི་ རྒྱ་མཚོ ར་ གནས་པ འི་ ཉ ས་ ཆུ་ འཐུང་ །།_།། མཁའ །'
>>> t.tokenize_words_raw_lines
'ལེ_གས །_ བཀྲ་ཤིས་ མཐ འི་ _༆_ ཤི་ བཀྲ་ཤིས་__ tr_\n བདེ་་ལེ_གས །_ བཀྲ་ཤིས་ བདེ་ལེགས་ ༡༢༣ ཀཀ །_\n མཐ འི་ རྒྱ་མཚོ ར་ གནས་པ འི་ ཉ ས་ ཆུ་ འཐུང་ །།_།། མཁའ །'

>>> # OUTPUT4: segments in words, then calculates the number of occurences of each word found
... # by default, it counts in_str's substrings in the output, which is why we have བདེ་་ལེ གས	1, བདེ་ལེགས་	1
... # this behaviour can easily be modified to take into account the words that pybo recognized instead (see advanced usage)
>>> print(t.list_word_types)
འི	3
 	2
བཀྲཤིས	2
མཐ	2
ལེ གས	1
  	1
ཤི	1
བཀྲཤིས  	1
tr \n	1
བདེ་་ལེ གས	1
བདེལེགས	1
༡༢༣	1
ཀཀ	1
 \n	1
རྒྱམཚོ	1
	1
གནས	1
	1
	1
ཆུ	1
འཐུང	1
།། །།	1
མཁའ	1
	1

Acknowledgements

botok is an open source library for Tibetan NLP.

We are always open to cooperation in introducing new features, tool integrations and testing solutions.

Many thanks to the companies and organizations who have supported botok's development, especially:

Maintainance

Build the source dist:

rm -rf dist/
python3 setup.py clean sdist

and upload on twine (version >= 1.11.0) with:

twine upload dist/*

License

The Python code is Copyright (C) 2019 Esukhia, provided under Apache 2.

contributors:

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

botok-0.8.5.tar.gz (55.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

botok-0.8.5-py3-none-any.whl (70.1 kB view details)

Uploaded Python 3

File details

Details for the file botok-0.8.5.tar.gz.

File metadata

  • Download URL: botok-0.8.5.tar.gz
  • Upload date:
  • Size: 55.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.4

File hashes

Hashes for botok-0.8.5.tar.gz
Algorithm Hash digest
SHA256 e4d494d4af3115504796262ffe15bc63a64083d60b322e961dafd1b9027e2754
MD5 d56d37f1aa131c91e2e47883254dd950
BLAKE2b-256 32192f4bce329acc1861e5b6722589086e0a8d145adf052dad9fed76dec03760

See more details on using hashes here.

File details

Details for the file botok-0.8.5-py3-none-any.whl.

File metadata

  • Download URL: botok-0.8.5-py3-none-any.whl
  • Upload date:
  • Size: 70.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.4

File hashes

Hashes for botok-0.8.5-py3-none-any.whl
Algorithm Hash digest
SHA256 1833b6a269777b12a4585d1018de7af68728c68026bc3a86dad13497c0d690c3
MD5 1c7f79efdd47f0f0c2395aee936b2686
BLAKE2b-256 66bc2a3a317a8e23626b6941ae9b54eed72dfe87bc56230236eedb3ce43ca712

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page