Rule-based morphological analysis for Komi-Zyrian
Komi-Zyrian morphological analyzer
This is a rule-based morphological analyzer for the Komi-Zyrian literary standard (kpv) of the Komi language continuum (Uralic > Permic). It is based on a formalized description of literary Komi-Zyrian morphology, which also includes a number of dialectal elements, and uses uniparser-morph for parsing. It performs full morphological analysis of Komi-Zyrian words (lemmatization, POS tagging, grammatical tagging, glossing).
How to use
Python package
The analyzer is available as a Python package. If you want to analyze Komi texts in Python, install the module:
pip3 install uniparser-komi-zyrian
Import the module and create an instance of the KomiZyrianAnalyzer class. Set mode='strict' if you are going to process text in standard orthography, or mode='nodiacritics' if you expect some words to lack the diacritics (which often happens in social media). After that, you can either parse tokens or lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:
from uniparser_komi_zyrian import KomiZyrianAnalyzer
a = KomiZyrianAnalyzer(mode='strict')
analyses = a.analyze_words('Морфологияса')
# The parser is initialized before first use, so expect
# some delay here (usually several seconds)
# You will get a list of Wordform objects
# The analysis attributes are stored in its properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm, ana.gloss)
# You can also pass lists (even nested lists) and specify
# output format ('xml' or 'json')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['А'], ['Ме', 'тэнӧ', 'радейта', '.']],
                           format='xml')
analyses = a.analyze_words(['Морфологияса', [['А'], ['Ме', 'тэнӧ', 'радейта', '.']]],
                           format='json')
Refer to the uniparser-morph documentation for the full list of options.
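Since nested input yields nested output, it can be handy to flatten the result into one entry per token. The following is a minimal sketch, not part of the package: iter_token_analyses is a hypothetical helper, and the result is mocked with SimpleNamespace stand-ins that only carry the wf, lemma, gramm and gloss properties mentioned above. All values are placeholders, and the assumption that each token is represented by a flat list of analysis objects (with the default output format) should be checked against real output.

```python
from types import SimpleNamespace


def iter_token_analyses(result):
    """Yield one list of analyses per token from a possibly nested result.

    Assumption: a list that contains no further lists holds the
    analyses of a single token.
    """
    if isinstance(result, list) and any(isinstance(x, list) for x in result):
        for item in result:
            yield from iter_token_analyses(item)
    else:
        yield result


def fake(wf, lemma, gramm):
    # Stand-in for an analysis object; placeholder values only.
    return SimpleNamespace(wf=wf, lemma=lemma, gramm=gramm, gloss='')


# Mocked result mirroring a nested input such as ['w1', [['w2'], ['w3', 'w4']]]
mock_result = [
    [fake('w1', 'l1', 'N')],
    [
        [[fake('w2', 'l2', 'V'), fake('w2', 'l2b', 'N')]],
        [[fake('w3', 'l3', 'N')], []],
    ],
]

# Collect the distinct lemmas proposed for each token, in text order.
lemmas_per_token = [
    sorted({ana.lemma for ana in token})
    for token in iter_token_analyses(mock_result)
]
print(lemmas_per_token)
```

With a real analyzer, the value returned by a.analyze_words(...) would take the place of mock_result.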
Disambiguation
Disambiguation is not yet available for this language.
Word lists
Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from a 1.8-million-word Komi-Zyrian corpus (wordlist_main.csv), a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses for one word in an XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer is 92.2% on literary texts and 89.1% on social media texts.
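If you work with the analyzed list directly, each line can be read with a standard XML parser. The sketch below is an illustration under assumptions: it supposes that a line looks like a w element wrapping one ana element per analysis (with lex, gr, parts and gloss attributes) followed by the word itself, which is how uniparser-morph usually serializes XML. The sample line and the parse_analyzed_line helper are hypothetical, so check a few real lines of wordlist_analyzed.txt before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample line with placeholder values; the real file may
# carry additional attributes (for example translations).
SAMPLE_LINE = (
    '<w><ana lex="LEMMA" gr="N,sg" parts="STEM-AFF" gloss="STEM-GL"></ana>'
    '<ana lex="LEMMA2" gr="V" parts="" gloss=""></ana>wordform</w>'
)


def parse_analyzed_line(line):
    """Return (token, analyses) for one line of the analyzed word list.

    Assumes the <w><ana .../>...token</w> layout described above.
    """
    w = ET.fromstring(line.strip())
    # One dictionary of attributes per <ana> element
    analyses = [dict(ana.attrib) for ana in w.iter('ana')]
    # The token is the text left over once the <ana> elements are ignored
    token = ''.join(w.itertext()).strip()
    return token, analyses


token, analyses = parse_analyzed_line(SAMPLE_LINE)
for ana in analyses:
    print(token, ana.get('lex'), ana.get('gr'))
```

To process the whole file, open it with encoding='utf-8' and call the helper on every non-empty line.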
Description format
The description is written in the uniparser-morph format and comprises a description of the inflection (paradigms.txt), a grammatical dictionary (kpv_lexemes_XXX.txt files), a list of rules that annotate combinations of lexemes and grammatical values with additional Russian translations (lex_rules.txt), and a short list of analyses that should be avoided (bad_analyses.txt). The dictionary contains descriptions of individual lexemes, each of which is accompanied by information about its stem, its part-of-speech tag and some other grammatical/borrowing information, its inflectional type (paradigm), and Russian translation. See more about the format in the uniparser-morph documentation.
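To give a feel for the dictionary files, here is a rough sketch of reading lexeme entries in Python. It assumes the common uniparser-morph layout, in which each entry starts with a -lexeme line followed by indented "key: value" fields such as lex, stem, gramm, paradigm and trans_ru. The sample entries use placeholder values, parse_lexemes is a hypothetical helper rather than part of the package, and the actual files may use more fields than this sketch shows.

```python
# Placeholder entries; field names follow the assumed uniparser-morph layout.
SAMPLE = """\
-lexeme
 lex: LEMMA1
 stem: STEM1.
 gramm: N
 paradigm: PARADIGM_A
 trans_ru: TRANSLATION1

-lexeme
 lex: LEMMA2
 stem: STEM2.
 gramm: V,tr
 paradigm: PARADIGM_B
 trans_ru: TRANSLATION2
"""


def parse_lexemes(text):
    """Parse '-lexeme' blocks into dictionaries mapping field -> list of values.

    Values are kept in lists because a field may occur more than once
    within an entry.
    """
    lexemes = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == '-lexeme':
            # Start a new entry
            current = {}
            lexemes.append(current)
        elif current is not None and ':' in line:
            key, _, value = line.partition(':')
            current.setdefault(key.strip(), []).append(value.strip())
    return lexemes


lexemes = parse_lexemes(SAMPLE)
for lexeme in lexemes:
    print(lexeme['lex'][0], lexeme['gramm'][0], lexeme['paradigm'][0])
```

For anything beyond quick inspection, the uniparser-morph documentation remains the authoritative description of the format.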