
Python utils for processing Tibetan

Project description

Pybo is a Python tokenizer for Tibetan, built as a tokenizer plugin for spaCy and the Tibetan editor. It takes a string of raw Tibetan text and returns a list of Token objects.

Goals of pybo

Using pybo, one should be able to:

  1. pre-process any Tibetan string for tokenization

  2. tokenize any pre-processed string into a sequence of word/non-word tokens

  3. apply matchers over the tokenized text (list of Token objects) to modify the segmentation

  4. transform the list of Token objects to and from a spaCy Doc object.

1. Tibetan String Pre-processing

Status: Done.

Strategy:

  • BoString attributes a type to every char in the input string.

  • BoChunk (subclass of BoString) creates chunks of similar chars.

  • PyBoChunk (subclass of BoChunk) creates meaningful chunks for the Tibetan language: syl / (other) bo / punct / non-bo.

  • PyBoTextChunks (subclass of PyBoChunk) provides cleaned content for syllable chunks (no punct, no space).
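The chunking idea can be sketched roughly as follows. The classifier and function names here are simplified stand-ins for pybo's actual classes, not its real API: every char gets a type, and consecutive chars of the same type are grouped into a chunk.

```python
# Minimal sketch of the chunking strategy (illustrative, not pybo's classes):
# classify every char, then group runs of identically-typed chars.
from itertools import groupby

def char_type(c):
    """Crude classifier: Tibetan punctuation, Tibetan letters, or other."""
    if c in "༎།་ ":                      # shad, tsek and space count as punct here
        return "punct"
    if "\u0f00" <= c <= "\u0fff":        # Tibetan Unicode block
        return "bo"
    return "non-bo"

def chunk(text):
    """Return (type, substring) pairs for runs of identically-typed chars."""
    return [(t, "".join(g)) for t, g in groupby(text, key=char_type)]

chunks = chunk("བཀྲ་ཤིས། hello")
```

A real implementation distinguishes many more char types (digits, marks, transparent chars), but the run-grouping principle is the same.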

2. Tokenization of pre-processed string

Status: Implemented. Cleanup to be done.

Strategy:

  • SylComponents gives morphological information about a Tibetan syllable.

  • BoSyl (uses SylComponents):

    • BoSyl.is_affixable(): tells whether a given syllable can be affixed or not

    • BoSyl.get_all_affixed(): for a given syllable, gives all affixed variants. For each variant, it gives: 1. the final form, 2. the particle used, 3. True if the non-affixed version ends with འ, False otherwise

  • Trie + Node: object-oriented Trie implementation

  • PyBoTrie (subclass of Trie, uses BoSyl):

    • builds a trie from a lexicon

    • adds affixed particle and POS information in the trie

    • allows dynamically adding / deactivating entries in the trie

    • walks an existing trie to find the longest possible match

  • Tokenizer (uses PyBoTrie and Token):

    • input: pre-processed syllables from PyBoTextChunks

    • parses sequences of clean syllables to find the longest word inside the loaded trie

    • builds a word token from a sequence of syllable chunks.

    • builds a Token object from individual chunks (non-bo, punct) and from a word token.

  • SplitAffixed splits Token objects that end with an affixed particle into 1. the host token and 2. a token for the affixed particle
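The longest-match walk described above can be sketched as follows. This is an illustrative mini-implementation, not pybo's actual PyBoTrie / Tokenizer API: words in the lexicon are tuples of syllables, and the tokenizer greedily takes the longest sequence found in the trie.

```python
# Illustrative longest-match tokenization over a syllable trie.

def build_trie(lexicon):
    """lexicon: iterable of words, each a tuple of syllables."""
    trie = {}
    for word in lexicon:
        node = trie
        for syl in word:
            node = node.setdefault(syl, {})
        node["$"] = True                 # end-of-word marker
    return trie

def tokenize(syllables, trie):
    tokens, i = [], 0
    while i < len(syllables):
        node, longest = trie, 0
        for j, syl in enumerate(syllables[i:], 1):
            if syl not in node:
                break
            node = node[syl]
            if "$" in node:
                longest = j              # remember the last full word seen
        if longest:
            tokens.append(("word", syllables[i:i + longest]))
            i += longest
        else:
            tokens.append(("non-word", syllables[i:i + 1]))
            i += 1
    return tokens
```

The key design point is that the walk keeps going past a match as long as the trie allows, and only falls back to the last complete word when it dead-ends.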

Todo:

  • rename Token object into Token

  • remove tibetaneditor-specific attributes and properties, and find a way to reimplement them within tibetaneditor

  • implement SplitAffixed:

    • check in Token.tag if there is an affixed particle

    • use the given particle type to reconstruct the lemma

    • use the given particle length to know where Token.content should be split

    • add a final འ to the lemma of the host Token if needed
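The SplitAffixed steps listed above can be sketched like this. The function and parameter names (`affix_len`, `had_a_chung`) are hypothetical, chosen for illustration only: split the content at the particle boundary, then restore the dropped འ on the host's lemma when needed.

```python
# Hypothetical sketch of the SplitAffixed logic; names are illustrative.

def split_affixed(content, affix_len, had_a_chung):
    """Split a word into (host, particle) and rebuild the host's lemma."""
    host, particle = content[:-affix_len], content[-affix_len:]
    lemma = host + "འ" if had_a_chung else host   # restore the dropped འ
    return (host, lemma), particle

# e.g. a 1-char affixed particle on a host that originally ended with འ
(host, lemma), particle = split_affixed("བདགས", 1, True)
```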

3. Applying Matchers

Status: Implementation ongoing.

Strategy:

  • Matcher finds sequences of Token objects that match a given input CQL query

  • Splitter takes a Token object and splits it into two.

    • input: the index in Token.content at which to split

  • Merger takes two consecutive Token objects to create a merged Token object. (the metadata attributes from one token are discarded, the content attributes are concatenated)

  • BoMatcher (uses Matcher and either Splitter or Merger):

    • input: a list of Token objects, a Matcher, and a Splitter or a Merger

    • loops over a list of Token objects (output of Tokenizer)

      • checks whether the sublist[current index: current index + len(matcher)] satisfies Matcher

      • applies either Splitter or Merger if necessary

    • output: the modified list of Token objects
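The split and merge operations themselves can be sketched minimally as below. Token objects are modeled as plain dicts here for illustration; pybo's Token class has its own attributes. Per the strategy above, merging keeps one token's metadata and concatenates the contents.

```python
# Minimal sketch of Splitter / Merger behavior on dict-based tokens.

def split(token, index):
    """Split one token into two at `index` in its content."""
    left = {**token, "content": token["content"][:index]}
    right = {**token, "content": token["content"][index:]}
    return left, right

def merge(tok_a, tok_b):
    """Keep the first token's metadata, discard the second's,
    and concatenate the two contents."""
    return {**tok_a, "content": tok_a["content"] + tok_b["content"]}
```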

Todo:

  • replace the basic query parser with third-party/cql.py

  • move the matching logic from BoMatcher to Matcher

  • implement Splitter

  • implement Merger

  • implement BoMatcher

4. To and From spaCy

Status: To do.

Strategy:

  • use the spaCy API to make the conversion
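One possible conversion path is spaCy's Doc constructor, which accepts pre-tokenized input through its `words` and `spaces` parameters (these are part of spaCy's public API; the token contents below are just an example):

```python
# Build a spaCy Doc directly from already-tokenized strings.
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("xx")                       # blank multi-language pipeline
words = ["བཀྲ་ཤིས་", "བདེ་ལེགས", "།"]            # content of each token
doc = Doc(nlp.vocab, words=words, spaces=[False] * len(words))
```

Going the other way, each spaCy Token's text and attributes would be copied back into the project's own Token objects.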

Licence

The code is Copyright 2018 Esukhia and is provided under the Apache License 2.0.
