pybo · PyPI

Python utils for processing Tibetan

Development Status
- 3 - Alpha
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: Apache Software License
Natural Language
- Tibetan
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Text Processing :: Linguistic

Project description

Pybo is a Python tokenizer for Tibetan, built as a tokenizer plugin for spaCy and the Tibetan editor. It takes a string of raw Tibetan text and returns a list of Token objects.

Goals of pybo

Using pybo, one should be able to:

- pre-process any Tibetan string for tokenization
- tokenize any pre-processed string into a sequence of word/non-word tokens
- apply matchers over the tokenized text (list of Token objects) to modify the segmentation
- transform the list of Token objects to and from a spaCy Doc object

1. Tibetan String Pre-processing

Status: Done.

Strategy:

BoString attributes a type to every char in the input string.
BoChunk creates chunks of similar chars. (subclass of BoString)
PyBoChunk creates meaningful chunks for Tibetan language: syl / (other)bo / punct / non-bo. (subclass of BoChunk)
PyBoTextChunks provides cleaned content for syllable chunks (no punct, no space). (subclass of PyBoChunk)
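
The chain above (type every char, group similar chars, build Tibetan-aware chunks, clean the syllables) can be illustrated with a minimal sketch. It does not use pybo's classes: the function names, the type labels and the Unicode ranges below are assumptions chosen for illustration, and pybo's own character categories are likely finer-grained.

```python
from itertools import groupby

# Assumed delimiters: tsek (U+0F0B) and non-breaking tsek (U+0F0C).
TSEKS = "\u0f0b\u0f0c"


def char_type(ch):
    """Assign a coarse type to a single character (illustrative rules only)."""
    if ch in TSEKS:
        return "tsek"
    if ch.isspace():
        return "space"
    if "\u0f0d" <= ch <= "\u0f14":      # shad-like punctuation marks
        return "punct"
    if "\u0f00" <= ch <= "\u0fff":      # rest of the Tibetan Unicode block
        return "bo"
    return "non-bo"


def chunk(text):
    """Group consecutive characters of the same type into (type, substring) pairs."""
    return [(kind, "".join(group)) for kind, group in groupby(text, key=char_type)]


def tibetan_chunks(text):
    """Turn runs of Tibetan letters into 'syl' chunks and attach a following tsek to them."""
    out = []
    for kind, sub in chunk(text):
        if kind == "tsek" and out and out[-1][0] == "syl":
            out[-1] = ("syl", out[-1][1] + sub)
        elif kind == "bo":
            out.append(("syl", sub))
        else:
            out.append((kind, sub))
    return out


def clean_syllables(text):
    """Return the syllable chunks only, with their trailing tsek removed."""
    return [sub.rstrip(TSEKS) for kind, sub in tibetan_chunks(text) if kind == "syl"]
```

Calling `clean_syllables` on a raw string gives the kind of input the tokenization step expects: a list of syllables without punctuation or spaces.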

2. Tokenization of pre-processed string

Status: Implemented. Cleanup to be done.

Strategy:

SylComponents gives morphologic information about a Tibetan syllable.
BoSyl (uses SylComponents):
- BoSyl.is_affixable(): tells whether a given syllable can be affixed or not
- BoSyl.get_all_affixed(): for a given syllable, gives all affixed variants. For each variant, it gives: 1. the final form, 2. the particle used, 3. True if the non-affixed version ends with འ, False otherwise
Trie + Node: Object Oriented Trie implementation
PyBoTrie (subclass of Trie, uses BoSyl):
- builds a trie from a lexicon
- adds affixed particle and POS information in the trie
- allows entries to be dynamically added to / deactivated in a trie
- walks an existing trie to find the longest possible match
Tokenizer (uses PyBoTrie and Token):
- input: pre-processed syllables from PyBoTextChunks
- parses sequences of clean syllables to find the longest word inside the loaded trie
- builds a word token from a sequence of syllable chunks.
- builds a Token object from individual chunks (non-bo, punct) and from a word token.
SplitAffixed splits Token objects that end with an affixed particle into 1. the token, 2. a token for the affixed particle
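
The longest-match walk described above can be sketched as follows. This is a simplified illustration, not pybo's Trie or Tokenizer: it assumes the trie is keyed on whole syllables, stores one optional piece of data (for example a POS tag) per entry, and emits unknown syllables as single-syllable tokens. All names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # syllable -> Node
        self.leaf = False    # True if the path to this node is an active entry
        self.data = None     # optional information attached to the entry


class Trie:
    def __init__(self):
        self.root = Node()

    def add(self, syllables, data=None):
        """Insert (or reactivate) an entry given as a list of syllables."""
        node = self.root
        for syl in syllables:
            node = node.children.setdefault(syl, Node())
        node.leaf = True
        node.data = data

    def deactivate(self, syllables):
        """Keep the path in the trie but stop treating it as an entry."""
        node = self.root
        for syl in syllables:
            node = node.children.get(syl)
            if node is None:
                return
        node.leaf = False

    def longest_match(self, syllables, start=0):
        """Return (length, data) of the longest entry starting at `start`, or (0, None)."""
        node, best = self.root, (0, None)
        for i in range(start, len(syllables)):
            node = node.children.get(syllables[i])
            if node is None:
                break
            if node.leaf:
                best = (i - start + 1, node.data)
        return best


def tokenize(trie, syllables):
    """Greedy left-to-right segmentation into (syllable list, data) pairs."""
    tokens, i = [], 0
    while i < len(syllables):
        length, data = trie.longest_match(syllables, i)
        if length == 0:
            length = 1       # unknown syllable: emit it on its own
        tokens.append((syllables[i:i + length], data))
        i += length
    return tokens
```

Deactivating an entry only clears its leaf flag, so shorter entries sharing the same prefix keep matching.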

Todo:

rename Token object into Token
remove tibetaneditor specific attributes and properties + find a way of reimplementing them within tibetaneditor
implement SplitAffixed:
- check in Token.tag if there is an affixed particle
- use the given particle type to reconstruct the lemma
- use the given particle length to know where Token.content should be split
- add a final འ to the lemma of the host Token if needed
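
A possible shape for the SplitAffixed steps listed above is sketched below. The Token class here is a hypothetical stand-in (pybo's Token stores this information differently, e.g. in Token.tag), the content is assumed to carry no trailing tsek, and the particle's lemma is simply left equal to its surface form rather than reconstructed from the particle type.

```python
from dataclasses import dataclass
from typing import Optional

AA = "\u0f60"  # the letter འ


@dataclass
class Token:                             # hypothetical stand-in, not pybo's Token
    content: str
    lemma: str = ""
    affix_type: Optional[str] = None     # particle type, if the token ends with one
    affix_len: int = 0                   # number of characters taken by the particle
    aa: bool = False                     # True if the non-affixed form ends with འ


def split_affixed(token):
    """Split a token into [host, particle]; return [token] if nothing is affixed."""
    if not token.affix_type or token.affix_len <= 0:
        return [token]
    cut = len(token.content) - token.affix_len
    host_text, particle_text = token.content[:cut], token.content[cut:]
    # Restore the final འ on the host's lemma when the affixation absorbed it.
    host = Token(content=host_text, lemma=host_text + (AA if token.aa else ""))
    particle = Token(content=particle_text, lemma=particle_text,
                     affix_type=token.affix_type)
    return [host, particle]
```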

3. Applying Matchers

Status: Implementation ongoing.

Strategy:

Matcher finds sequences of Token objects that match a given input CQL query
Splitter takes a Token object and splits it in two.
- input: the index in Token.content where to split
Merger takes two consecutive Token objects and creates a merged Token object (the metadata attributes from one token are discarded; the content attributes are concatenated).
BoMatcher (uses Matcher and either Splitter or Merger):
- input: a list of Token objects, a matcher, a Splitter / a Merger
- loops over a list of Token objects (output of Tokenizer)
  - checks whether the sublist[current index: current index + len(matcher)] satisfies Matcher
  - applies either Splitter or Merger if necessary
- output: the modified list of Token objects
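
The loop described for BoMatcher can be sketched as below. Since this part is still being implemented, everything here is an assumption: the Token class is a hypothetical stand-in, a list of Python predicates replaces the CQL query, and the merged token keeps the metadata of the first token.

```python
from dataclasses import dataclass, replace


@dataclass
class Token:                 # hypothetical stand-in, not pybo's Token
    content: str
    pos: str = ""


def matches(tokens, start, pattern):
    """True if tokens[start:start + len(pattern)] satisfies every predicate in order."""
    window = tokens[start:start + len(pattern)]
    return len(window) == len(pattern) and all(p(t) for p, t in zip(pattern, window))


def split(token, index):
    """Split one token in two at `index` in its content; both halves keep the metadata."""
    return [replace(token, content=token.content[:index]),
            replace(token, content=token.content[index:])]


def merge(first, second):
    """Concatenate the contents; the second token's metadata is discarded."""
    return replace(first, content=first.content + second.content)


def apply_merger(tokens, pattern):
    """Walk the token list and merge every pair of tokens matched by a two-item pattern."""
    if len(pattern) != 2:
        raise ValueError("a merger needs a pattern of exactly two tokens")
    out, i = [], 0
    while i < len(tokens):
        if matches(tokens, i, pattern):
            out.append(merge(tokens[i], tokens[i + 1]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

A splitting pass would follow the same loop with a one-item pattern and a call to `split` instead of `merge`.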

Todo:

replace the basic query parser by third-party/cql.py
move the matching logic from BoMatcher to Matcher
implement Splitter
implement Merger
implement BoMatcher

4. To and From spaCy

Status: To do.

Strategy:

use the spaCy API to make the conversion

Licence

Release history

0.8.0

Apr 20, 2021

0.7.18

Apr 19, 2021

0.7.17

Apr 16, 2021

0.7.16

Apr 15, 2021

0.7.15

Apr 12, 2021

0.7.14

Apr 9, 2021

0.7.13

Apr 9, 2021

0.7.12

Apr 6, 2021

0.7.11

Apr 6, 2021

0.7.10

Mar 24, 2021

0.7.9

Mar 24, 2021

0.7.8

Mar 23, 2021

0.7.7

Mar 22, 2021

0.7.6

Mar 22, 2021

0.7.5

Feb 26, 2021

0.7.4

Aug 12, 2020

0.7.3

Aug 11, 2020

0.7.2

Aug 8, 2020

0.7.1

Aug 8, 2020

0.7.0

Aug 7, 2020

0.6.23

Jul 14, 2020

0.6.22

Jul 10, 2020

0.6.21

Dec 15, 2019

0.6.20

Dec 13, 2019

0.6.19

Dec 10, 2019

0.6.18

Dec 10, 2019

0.6.17

Nov 22, 2019

0.6.16

Nov 22, 2019

0.6.15

Nov 22, 2019

0.6.14

Nov 22, 2019

0.6.13

Nov 9, 2019

0.6.12

Nov 9, 2019

0.6.11

Oct 30, 2019

0.6.10

Sep 1, 2019

0.6.9

Sep 1, 2019

0.6.8

Aug 26, 2019

0.6.7

Aug 21, 2019

0.6.6

Aug 20, 2019

0.6.5

Aug 16, 2019

0.6.4

Aug 15, 2019

0.6.3

Aug 14, 2019

0.6.2

Aug 14, 2019

0.6.1

Aug 13, 2019

0.6.0

Jul 1, 2019

0.5.1

Jun 29, 2019

0.5.0

Jun 27, 2019

0.4.3

May 17, 2019

0.4.2

Mar 6, 2019

0.4.1

Mar 5, 2019

0.4.0

Mar 5, 2019

0.3.0

Feb 1, 2019

0.2.21

Jan 13, 2019

0.2.20

Dec 21, 2018

0.2.19

Dec 7, 2018

0.2.18

Oct 26, 2018

0.2.17

Oct 26, 2018

0.2.16

Oct 23, 2018

0.2.15

Oct 22, 2018

0.2.2.2

Jul 31, 2018

0.2.2.1

Jul 31, 2018

0.2.2

Jul 11, 2018

0.2.0

Jul 9, 2018

This version

0.1.6

May 1, 2018

0.1.5

Apr 27, 2018

0.1.4

Apr 25, 2018

0.1.3

Apr 25, 2018

Download files

Source Distribution

pybo-0.1.6.tar.gz (380.4 kB)

Uploaded May 1, 2018 Source

Hashes for pybo-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`029895640ecef6297e40d42755219881c750dfaad9d1544a9a535a9155e01d9f`
MD5	`575bdba1fa9a5497c7a1f0c78749093d`
BLAKE2b-256	`e322f118ed882090b6a8d54ac937e0aac9d6bfeed2cbcc22c9a7304cd9e6e78f`