Python utils for processing Tibetan

PYBO - Tibetan NLP in Python

Overview

pybo is a word tokenizer for the Tibetan language written in Python. pybo takes in chunks of text and returns lists of words. It provides an easy-to-use, high-performance tokenization pipeline that can serve as a stand-alone solution or be adapted as a complement.

Getting started

pip install pybo

Or to install from the latest master branch:

pip install git+https://github.com/Esukhia/pybo.git
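To confirm that the install succeeded, one option is to look up the installed distribution from Python. This is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); the helper name installed_version is hypothetical and not part of pybo.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if pybo is installed, otherwise None.
print(installed_version("pybo"))
```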

How to use pybo

To initiate the tokenizer together with part-of-speech capability:

# Import pybo (the bo alias matches its use below) and initialize the tokenizer
import pybo as bo

tok = bo.WordTokenizer('POS')

# Feed it some Tibetan text
input_str = '༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །'

# Run the tokenizer
tokens = tok.tokenize(input_str)

'tokens' is now an iterable in which each token carries several metadata fields:

# Access the first token in the iterable
tokens[0]

This yields:

text: "༄༅། "
char_types: |PUNCT|PUNCT|PUNCT|SPACE|
chunk_type: PUNCT
start: 0
len: 4
syls: None
pos: PUNCT
skrt: False
freq: 0

Notes:

  • start is the index in the input string at which the current token begins.
  • syls is a list of cleaned syllables, each syllable being represented as a list of indices. Each index points to one of the syllable's constituent characters within the input string.
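The index-based representation described in these notes can be illustrated without pybo itself. The sketch below is a hand-rolled stand-in, not pybo's implementation: the helper names are hypothetical, and it assumes that "cleaned" means the tsek delimiter (U+0F0B) is left out of each syllable's index list.

```python
TSEK = "\u0f0b"  # Tibetan syllable delimiter; assumed to be what "cleaning" drops


def token_span(input_str, start, length):
    """Recover a token's raw text from its start index and length."""
    return input_str[start:start + length]


def syllable_indices(input_str, start, length):
    """Group the indices of non-tsek characters in the span into syllables."""
    syls, current = [], []
    for i in range(start, start + length):
        if input_str[i] == TSEK:
            if current:
                syls.append(current)
                current = []
        else:
            current.append(i)
    if current:
        syls.append(current)
    return syls


def syllable_texts(input_str, syls):
    """Turn each list of indices back into the characters it points to."""
    return ["".join(input_str[i] for i in indices) for indices in syls]


input_str = "བོད་སྐད་དུ།"
# A hypothetical token covering the first eight characters of the input.
raw = token_span(input_str, 0, 8)
syls = syllable_indices(input_str, 0, 8)
cleaned = syllable_texts(input_str, syls)
```

Storing indices rather than substrings keeps every syllable traceable to its exact position in the original text, even after cleaning.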

How to access all the words in a list

# iterate through the tokens object to get all the words in a list
[t.content for t in tokens]

How to get all the nouns in a text

# extract nouns from the tokens
[t.content for t in tokens if t.tag == 'NOUNᛃᛃᛃ']

These examples highlight the basic principle of accessing attributes within each token object.
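The same attribute-access pattern extends to other queries, such as counting tokens per tag. The sketch below runs without pybo by using a hypothetical namedtuple stand-in that mirrors only the content and tag attributes used above; the tag values and helper names are illustrative assumptions, and real tokens would come from tok.tokenize().

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a pybo token: only `content` and `tag` are modeled.
Token = namedtuple("Token", ["content", "tag"])


def contents_by_tag(tokens, tag):
    """Collect the content of every token whose tag equals `tag`."""
    return [t.content for t in tokens if t.tag == tag]


def tag_counts(tokens):
    """Count how many tokens carry each tag."""
    return Counter(t.tag for t in tokens)


# Illustrative data only; the tag strings are assumptions, not pybo's tag set.
tokens = [
    Token("བོད་", "NOUN"),
    Token("སྐད་", "NOUN"),
    Token("དུ", "PART"),
    Token("། ", "PUNCT"),
]

nouns = contents_by_tag(tokens, "NOUN")
counts = tag_counts(tokens)
```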

Acknowledgements

pybo is an open source library for Tibetan NLP.

We are always open to cooperation in introducing new features, tool integrations and testing solutions.

Many thanks to the companies and organizations who have supported pybo's development, especially:

Maintenance

Build the source dist:

rm -rf dist/
python3 setup.py clean sdist

and upload it with twine (version >= 1.11.0):

twine upload dist/*

License

The Python code is Copyright (C) 2019 Esukhia, provided under Apache 2.

author: Drupchen

contributors:
