Python utils for processing Tibetan
PYBO - Tibetan NLP in Python
Overview
bo tokenizes Tibetan text into words.
Basic usage
Getting started
Requires Python 3.
python3 -m pip install pybo
Tokenizing a string
drupchen@drupchen:~$ bo tok-string "༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །
སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་
སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །"
Loading Trie... (2s.)
༄༅།_། རྒྱ་གར་ སྐད་ དུ །_ བོ་ དྷི་ སཏྭ་ ཙརྻ་ ཨ་བ་ ཏ་ ར །_ བོད་སྐད་ དུ །_ བྱང་ཆུབ་ སེམས་དཔ འི་ སྤྱོད་པ་ ལ་ འཇུག་པ །_། སངས་རྒྱས་ དང་ བྱང་ཆུབ་
སེམས་དཔའ་ ཐམས་ཅད་ ལ་ ཕྱག་ འཚལ་ ལོ །_། བདེ་གཤེགས་ ཆོས་ ཀྱི་ སྐུ་ མངའ་ སྲས་ བཅས་ དང༌ །_། ཕྱག་འོས་ ཀུན་ ལ འང་ གུས་པ ར་ ཕྱག་ འཚལ་
ཏེ །_། བདེ་གཤེགས་ སྲས་ ཀྱི་ སྡོམ་ ལ་ འཇུག་པ་ ནི །_། ལུང་ བཞིན་ མདོར་བསྡུས་ ནས་ ནི་ བརྗོད་པ ར་ བྱ །_།
Tokenizing a list of files
The command to tokenize a list of files in a directory:
bo tok <path-to-directory>
For example, to tokenize the file text.txt in a directory ./document/ with the following content:
བཀྲ་ཤི་ས་བདེ་ལེགས་ཕུན་སུམ་ཚོགས། །རྟག་ཏུ་བདེ་བ་ཐོབ་པར་ཤོག། །
I use the command:
$ bo tok ./document/
...which creates a file text.txt in a directory ./document_pybo containing:
བཀྲ་ ཤི་ ས་ བདེ་ལེགས་ ཕུན་སུམ་ ཚོགས །_། རྟག་ ཏུ་ བདེ་བ་ ཐོབ་པ ར་ ཤོག །_།
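The tokenized output can be read back into a list of tokens with a few lines of Python. This is a minimal sketch, not part of pybo: `parse_tokens` is a hypothetical helper, and it assumes (judging from the examples above) that tokens are separated by spaces and that `_` stands for a space inside a token.

```python
def parse_tokens(line):
    """Split one line of tokenized output into a list of tokens.

    Assumption of this sketch: tokens are separated by spaces, and "_"
    marks a space that belongs to the token itself.
    """
    return [tok.replace("_", " ") for tok in line.split(" ") if tok]


# The output line from the example above.
tokenized = "བཀྲ་ ཤི་ ས་ བདེ་ལེགས་ ཕུན་སུམ་ ཚོགས །_། རྟག་ ཏུ་ བདེ་བ་ ཐོབ་པ ར་ ཤོག །_།"

tokens = parse_tokens(tokenized)
print(len(tokens))       # number of tokens on the line
print("".join(tokens))   # joining the tokens gives back unsegmented text
```

Joining the parsed tokens without a separator is a quick way to check that segmentation did not drop or alter any characters.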
Sorting Tibetan words
$ bo kakha to-sort.txt
The expected input is one word or entry per line in a .txt file. The file will be overwritten.
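Because the input file is overwritten, it can be worth keeping a copy of the unsorted list. The sketch below is one way to do that from Python. `sort_with_backup` and the `.bak` suffix are hypothetical choices, not part of pybo, and the function assumes the `bo` command is available on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def sort_with_backup(path, run=subprocess.run):
    """Copy `path` to `<name>.bak`, then sort it in place with `bo kakha`.

    `run` defaults to subprocess.run and can be swapped out for testing.
    """
    src = Path(path)
    backup = src.with_name(src.name + ".bak")
    shutil.copy2(src, backup)                    # keep the unsorted original
    run(["bo", "kakha", str(src)], check=True)   # overwrites `src` with the sorted list
    return backup


# Usage (requires pybo to be installed):
# sort_with_backup("to-sort.txt")
```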
FNR - Find and Replace with a list of regexes
bo fnr <in-dir> <regex-file> -o <out-dir> -t <tag>
-o and -t are optional.
Text files should be UTF-8 plain text files. The regexes should be in the following format:
<find-pattern><tab>-<tab><replace-pattern>
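To illustrate the rule format, here is a minimal sketch of how such a file could be parsed and applied with Python's `re` module. It is not pybo's implementation: `load_rules` and `apply_rules` are hypothetical helpers, and the choices to skip blank lines and to apply rules in file order are assumptions.

```python
import re


def load_rules(text):
    """Parse lines of the form <find-pattern>\\t-\\t<replace-pattern>."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        find, sep, replace = line.partition("\t-\t")
        if not sep:
            raise ValueError(f"expected <find><tab>-<tab><replace>: {line!r}")
        rules.append((re.compile(find), replace))
    return rules


def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


# Two example rules: collapse runs of "a" into "A", and drop a hyphen between digits.
rules = load_rules("a+\t-\tA\n(\\d)-(\\d)\t-\t\\1\\2\n")
print(apply_rules("caaat 1-2", rules))  # cAt 12
```

Splitting on the full `<tab>-<tab>` separator, instead of on every tab, keeps patterns that contain a hyphen intact.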
Acknowledgements
pybo is an open source library for Tibetan NLP.
We are always open to cooperation in introducing new features, tool integrations and testing solutions.
Many thanks to the companies and organizations who have supported pybo's development, especially:
- Khyentse Foundation for contributing USD22,000 to kickstart the project
- The Barom/Esukhia canon project for sponsoring training data curation
- BDRC for contributing 2 staff for 6 months for data curation
third_party/rules.txt is taken from tibetan-collation.
Contributing
First, clone this repo. Create a virtual environment and activate it. Then install the dependencies:
$ pip install -e .
$ pip install -r requirements-dev.txt
Next, set up pre-commit by installing the pre-commit git hook:
$ pre-commit install
Please follow the Angular commit message format for commit messages. We have set up python-semantic-release to publish the pybo package automatically based on commit messages.
That's all. Enjoy contributing 🎉🎉🎉
License
The Python code is Copyright (C) 2019 Esukhia, provided under Apache 2.
contributors:
- Drupchen
- Élie Roux
- Ngawang Trinley
- Tenzin
- Joyce Mackzenzie for reworking the logo