Python utils for processing Tibetan

PYBO - Tibetan NLP in Python

Overview

The bo command tokenizes Tibetan text into words.

Basic usage

Getting started

Requires Python 3.

python3 -m pip install pybo

Tokenizing a string

drupchen@drupchen:~$ bo tok-string "༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །
སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་
སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །"
Loading Trie... (2s.)
༄༅།_། རྒྱ་གར་ སྐད་ དུ །_ བོ་ དྷི་ སཏྭ་ ཙརྻ་ ཨ་བ་ ཏ་  །_ བོད་སྐད་ དུ །_ བྱང་ཆུབ་ སེམས་དཔ འི་ སྤྱོད་པ་ ལ་ འཇུག་པ །_། སངས་རྒྱས་ དང་ བྱང་ཆུབ་
སེམས་དཔའ་ ཐམས་ཅད་ ལ་ ཕྱག་ འཚལ་ ལོ །_། བདེ་གཤེགས་ ཆོས་ ཀྱི་ སྐུ་ མངའ་ སྲས་ བཅས་ དང༌ །_། ཕྱག་འོས་ ཀུན་  འང་ གུས་པ ར་ ཕྱག་ འཚལ་
ཏེ །_། བདེ་གཤེགས་ སྲས་ ཀྱི་ སྡོམ་ ལ་ འཇུག་པ་ ནི །_། ལུང་ བཞིན་ མདོར་བསྡུས་ ནས་ ནི་ བརྗོད་པ ར་ བྱ །_།

Tokenizing a list of files

The command to tokenize a list of files in a directory:

bo tok <path-to-directory>

For example, to tokenize the file text.txt in the directory ./document/ with the following content:

བཀྲ་ཤི་ས་བདེ་ལེགས་ཕུན་སུམ་ཚོགས། །རྟག་ཏུ་བདེ་བ་ཐོབ་པར་ཤོག། །

run the command:

$ bo tok ./document/

...which creates a file text.txt in the directory ./document_pybo containing:

བཀྲ་ ཤི་ ས་ བདེ་ལེགས་ ཕུན་སུམ་ ཚོགས །_། རྟག་ ཏུ་ བདེ་བ་ ཐོབ་པ ར་ ཤོག །_།
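
Judging from the sample output above, tokens are separated by spaces, and a space that was part of the original text shows up as an underscore inside a token. The following Python sketch reads such a line under that assumption. It is not part of pybo's API; the helper names read_tokens and restore_text are made up for illustration.

```python
def read_tokens(line):
    """Split one line of tokenized output into its tokens."""
    # split() with no argument also swallows runs of several spaces.
    return line.split()


def restore_text(tokens):
    """Glue tokens back together, assuming "_" marks an original space."""
    return "".join(tokens).replace("_", " ")


# The output line from the example above.
sample = "བཀྲ་ ཤི་ ས་ བདེ་ལེགས་ ཕུན་སུམ་ ཚོགས །_། རྟག་ ཏུ་ བདེ་བ་ ཐོབ་པ ར་ ཤོག །_།"
tokens = read_tokens(sample)
restored = restore_text(tokens)
print(len(tokens), "tokens")
print(restored)
```

If the underscore assumption holds, restore_text gives back the untokenized line, which is a quick way to check that no text was lost during tokenization.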

Sorting Tibetan words

$ bo kakha to-sort.txt

The expected input is one word or entry per line in a .txt file. The file will be overwritten.
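
Because the input file is overwritten, it can be worth keeping a copy before sorting. A minimal Python sketch, assuming the bo command is on the PATH; the helper names and the .bak suffix are illustrative choices, not something pybo provides.

```python
import shutil
import subprocess
from pathlib import Path


def make_backup(path):
    """Copy the file next to itself with a .bak suffix and return the copy's path."""
    src = Path(path)
    backup = src.with_name(src.name + ".bak")
    shutil.copy2(src, backup)
    return backup


def sort_with_backup(path):
    """Back up the file, then let `bo kakha` sort it in place."""
    backup = make_backup(path)
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(["bo", "kakha", str(path)], check=True)
    return backup


# Usage (requires pybo to be installed):
# sort_with_backup("to-sort.txt")
```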

FNR - Find and Replace with a list of regexes

bo fnr <in-dir> <regex-file> -o <out-dir> -t <tag>

The -o and -t options are optional.

Text files should be UTF-8 plain text files. The regexes should be in the following format:

<find-pattern><tab>-<tab><replace-pattern>
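
To make the rule format concrete, here is a small Python sketch that parses lines of this shape and applies them in order. It only illustrates the file format: it is not pybo's implementation, it assumes the patterns follow Python's re syntax, and the function names are hypothetical.

```python
import re


def parse_rules(text):
    """Parse lines of the form <find-pattern><tab>-<tab><replace-pattern>."""
    rules = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        find, sep, replace = line.partition("\t-\t")
        if not sep:
            raise ValueError(f"line {line_no}: expected <find><tab>-<tab><replace>")
        rules.append((re.compile(find), replace))
    return rules


def apply_rules(rules, text):
    """Apply every rule to the text, in the order the rules were listed."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


# Two example rules: collapse runs of "a" into "b", then swap two numbers.
rules = parse_rules("a+\t-\tb\n(\\d+)-(\\d+)\t-\t\\2-\\1\n")
result = apply_rules(rules, "caat 12-34")
print(result)  # cbt 34-12
```

Rules run one after another, so a later rule sees the text already changed by the earlier ones.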

Acknowledgements

  • pybo is an open source library for Tibetan NLP.

We are always open to cooperation in introducing new features, tool integrations and testing solutions.

Many thanks to the companies and organizations who have supported pybo's development, especially:

Contributing

First, clone this repo. Create a virtual environment and activate it. Then install the dependencies:

$ pip install -e .
$ pip install -r requirements-dev.txt

Next, set up pre-commit by installing the pre-commit git hook:

$ pre-commit install

Please follow the Angular commit message format for commit messages. We have set up python-semantic-release to publish the pybo package automatically based on commit messages.

That's all. Enjoy contributing 🎉🎉🎉

License

The Python code is Copyright (C) 2019 Esukhia, provided under Apache 2.
