Rule-based morphological analysis for Turoyo
Project description
Turoyo morphological analyzer
This is a rule-based morphological analyzer for Ṭuroyo (ISO 639-3: tru; Afro-Asiatic > Central Neo-Aramaic). It is based on a formalized description of Ṭuroyo morphology and uses uniparser-morph for parsing. It performs full morphological analysis of Ṭuroyo words (lemmatization, POS tagging, grammatical tagging). The text to be analyzed should be written in a version of the Latin Ṭuroyo alphabet that is somewhat closer to IPA: it uses ʔ instead of ', ʕ instead of c, ə instead of ë, etc.
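If your text uses the more common Latin orthography, a simple character substitution can bring it closer to the alphabet the analyzer expects. The sketch below covers only the three correspondences mentioned above; any fuller conversion table would have to come from a description of the orthography, and a naive character-by-character replacement may misfire on digraphs, so treat this as illustrative only:

```python
# Minimal sketch: convert common Latin Turoyo orthography to the
# IPA-like alphabet the analyzer expects. Only the substitutions
# ' -> ʔ, c -> ʕ and ë -> ə are attested in the description above;
# a real converter would need the full set of correspondences.
SUBSTITUTIONS = {
    "'": "ʔ",  # glottal stop
    "c": "ʕ",  # voiced pharyngeal fricative
    "ë": "ə",  # schwa
}

def to_analyzer_alphabet(text: str) -> str:
    """Apply the character substitutions to a string."""
    for src, dst in SUBSTITUTIONS.items():
        text = text.replace(src, dst)
    return text

print(to_analyzer_alphabet("cëbarwo"))  # ʕəbarwo
```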
How to use
Python package
The analyzer is available as a Python package. If you want to analyze Turoyo texts in Python, install the module:
pip3 install uniparser-turoyo
Import the module and create an instance of the TuroyoAnalyzer class. Set mode='strict' if you are going to process text in the standard Latin Ṭuroyo alphabet, or mode='nodiacritics' if you expect some words to lack diacritics (e.g. t instead of ṭ). After that, you can parse tokens or lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:
from uniparser_turoyo import TuroyoAnalyzer
a = TuroyoAnalyzer(mode='strict')
analyses = a.analyze_words('koroḥamnux')
# The parser is initialized before first use, so expect
# some delay here (usually several seconds)
# You will get a list of Wordform objects
# The analysis attributes are stored in its properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm)
# You can also pass lists (even nested lists) and specify
# output format ('xml', 'json' or 'conll')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['koroḥamnux'], ['ʕəbarwo', 'lab', 'bote', '.']],
                           format='xml')
analyses = a.analyze_words([['koroḥamnux'], ['ʕəbarwo', 'lab', 'bote', '.']],
                           format='conll')
analyses = a.analyze_words(['koroḥamnux', [['laḥmawo'], ['ʕəbarwo', 'lab', 'bote', '.']]],
                           format='json')
Refer to the uniparser-morph documentation for the full list of options.
Word lists
Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from a 600-thousand-word Ṭuroyo corpus (wordlist.csv) with 53,000 unique tokens, a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses for one word in an XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer on the corpus texts is about 90%. (This number is somewhat low due to orthographic variability in the texts.)
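If you want to work with wordlist_analyzed.txt directly, each line can be parsed as a small XML fragment. The sketch below assumes the standard uniparser-morph output shape, `<w><ana lex="..." gr="..."></ana>wordform</w>`; the attribute names and the sample lexeme are assumptions, not taken from the actual file:

```python
# Sketch of reading one line of wordlist_analyzed.txt.
# Assumed line shape (standard uniparser-morph XML output):
#   <w><ana lex="..." gr="..."></ana>wordform</w>
# The attribute names (lex, gr) and the sample entry are assumptions.
import xml.etree.ElementTree as ET

def parse_analyzed_line(line: str):
    """Return a list of analyses (one dict per <ana> element)."""
    w = ET.fromstring(line.strip())
    # The wordform is the text that follows the <ana> elements.
    wordform = (w.text or "") + "".join(ana.tail or "" for ana in w)
    return [
        {"wf": wordform, "lemma": ana.get("lex"), "gramm": ana.get("gr")}
        for ana in w.findall("ana")
    ]

sample = '<w><ana lex="laḥmo" gr="N,m,sg"></ana>laḥmawo</w>'
for analysis in parse_analyzed_line(sample):
    print(analysis)
```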
Description format
The description is carried out in the uniparser-morph format and involves a description of the inflection (paradigms/paradigms_XXX.txt) and a grammatical dictionary (lexemes-XXX.txt files). The dictionary contains descriptions of individual lexemes, each of which is accompanied by information about its stem, its part-of-speech tag and other grammatical information, its consonant root, its inflectional type (paradigm), and English and/or German translations. See the uniparser-morph documentation for more about the format.
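As an illustration, a dictionary entry in the uniparser-morph format looks roughly like the following. The field layout follows the general uniparser-morph conventions; the particular lexeme, field values, and paradigm name here are invented for illustration and do not come from the actual Ṭuroyo dictionary files:

```
-lexeme
 lex: laḥmo
 stem: laḥm.
 gramm: N,m
 paradigm: noun_m
 trans_en: bread
```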