Rule-based morphological analysis for Komi-Zyrian

Komi-Zyrian morphological analyzer

This is a rule-based morphological analyzer for the Komi-Zyrian literary standard (kpv) of the Komi language continuum (Uralic > Permic). It is based on a formalized description of literary Komi-Zyrian morphology, which also covers a number of dialectal elements, and uses uniparser-morph for parsing. It performs full morphological analysis of Komi-Zyrian words: lemmatization, POS tagging, grammatical tagging, and glossing.

How to use

Python package

The analyzer is available as a Python package. If you want to analyze Komi texts in Python, install the module:

pip3 install uniparser-komi-zyrian

Import the module and create an instance of the KomiZyrianAnalyzer class. Set mode='strict' if you are going to process text in standard orthography, or mode='nodiacritics' if you expect some words to lack diacritics (which often happens in social media). After that, you can either parse single tokens or lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:

from uniparser_komi_zyrian import KomiZyrianAnalyzer
a = KomiZyrianAnalyzer(mode='strict')

# The parser is initialized on first use, so expect
# some delay at this call (usually several seconds)
analyses = a.analyze_words('Морфологияса')

# You will get a list of Wordform objects.
# The analysis attributes are stored in their properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm, ana.gloss)

# You can also pass lists (even nested lists) and specify
# output format ('xml' or 'json')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['А'], ['Ме', 'тэнӧ', 'радейта', '.']],
                           format='xml')
analyses = a.analyze_words(['Морфологияса', [['А'], ['Ме', 'тэнӧ', 'радейта', '.']]],
                           format='json')

Refer to the uniparser-morph documentation for the full list of options.
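Since analyze_words() takes tokens rather than running text, you need some tokenization step first. The following is a minimal sketch of one, not part of this package: the tokenize() helper, its regular expression and its sentence-splitting rule are all assumptions made for illustration. It normalizes the text to NFC (so that letters such as ӧ typed as a base letter plus a combining diaeresis become single code points) and returns one list of tokens per sentence, i.e. the nested-list shape shown above.

```python
import re
import unicodedata

# Hypothetical helper, not part of uniparser-komi-zyrian.
# A token is a run of word characters (optionally hyphenated)
# or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")
SENTENCE_END = {'.', '!', '?'}


def tokenize(text):
    """Split raw text into a list of sentences, each a list of tokens."""
    # Compose base letter + combining mark into one code point
    text = unicodedata.normalize('NFC', text)
    sentences, current = [], []
    for token in TOKEN_RE.findall(text):
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens without final punctuation
        sentences.append(current)
    return sentences


if __name__ == '__main__':
    print(tokenize('А. Ме радейта!'))
```

The result can then be passed directly to a.analyze_words(), which returns analyses with the same nesting. A real pipeline would likely need a more careful tokenizer (abbreviations, numbers, quotation marks).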

Disambiguation

Disambiguation is not yet available for this language.
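Without disambiguation, an ambiguous token comes back with several analyses, and choosing among them is left to the caller. As a rough sketch of one way to summarize them, the helper below collects the distinct lemmas proposed for a token. It relies only on the lemma attribute mentioned above. The Analysis namedtuple is a stand-in used so that the example runs without the package, the sample values are placeholders rather than real Komi analyses, and skipping analyses with an empty lemma is an assumption about how unanalyzed tokens are reported.

```python
from collections import namedtuple

# Stand-in for the Wordform objects returned by analyze_words();
# only the attribute names (wf, lemma, gramm, gloss) come from the README.
Analysis = namedtuple('Analysis', 'wf lemma gramm gloss')


def candidate_lemmas(analyses):
    """Return the distinct non-empty lemmas, in order of first appearance."""
    lemmas = []
    for ana in analyses:
        # Assumption: an unanalyzed token has an empty lemma
        if ana.lemma and ana.lemma not in lemmas:
            lemmas.append(ana.lemma)
    return lemmas


if __name__ == '__main__':
    # Placeholder data, not actual analyzer output
    sample = [
        Analysis('form', 'lemma_a', 'N,sg,nom', 'gloss_a'),
        Analysis('form', 'lemma_b', 'V,prs,3sg', 'gloss_b'),
        Analysis('form', 'lemma_a', 'N,sg,acc', 'gloss_a'),
    ]
    print(candidate_lemmas(sample))
```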

Word lists

Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from a 1.8-million-word Komi-Zyrian corpus (wordlist_main.csv), a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses for one word in an XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer is 92.2% on literary texts and 89.1% on social media texts.
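If you want to estimate coverage on your own corpus, a figure of this kind can be computed from a frequency list and the set of words that received no analysis. The sketch below is an illustration only: the token_coverage() and type_coverage() helpers are hypothetical, and the README does not say whether the figures above were computed over tokens or over distinct words, so both variants are shown.

```python
def token_coverage(freqs, unanalyzed):
    """Share of corpus tokens whose word form received an analysis.

    freqs: dict mapping word form -> corpus frequency
    unanalyzed: set of word forms with no analysis
    """
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    covered = sum(n for word, n in freqs.items() if word not in unanalyzed)
    return covered / total


def type_coverage(freqs, unanalyzed):
    """Share of distinct word forms that received an analysis."""
    if not freqs:
        return 0.0
    covered = sum(1 for word in freqs if word not in unanalyzed)
    return covered / len(freqs)


if __name__ == '__main__':
    # Placeholder frequency list, not real corpus data
    freqs = {'w1': 90, 'w2': 8, 'w3': 2}
    unanalyzed = {'w3'}
    print(f'{token_coverage(freqs, unanalyzed):.1%} of tokens')
    print(f'{type_coverage(freqs, unanalyzed):.1%} of types')
```

Token-weighted coverage is usually the higher of the two, because unanalyzed words tend to be rare.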

Description format

The description is written in the uniparser-morph format. It consists of a description of the inflection (paradigms.txt), a grammatical dictionary (kpv_lexemes_XXX.txt files), a list of rules that annotate combinations of lexemes and grammatical values with additional Russian translations (lex_rules.txt), and a short list of analyses that should be avoided (bad_analyses.txt). The dictionary contains descriptions of individual lexemes. Each entry specifies the lexeme's stem, its part-of-speech tag together with other grammatical and borrowing information, its inflectional type (paradigm), and its Russian translation. See the uniparser-morph documentation for more about the format.
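To give a rough idea of what working with such a dictionary could look like, here is a small reader for lexeme entries. Treat every detail as an assumption to be checked against the uniparser-morph documentation: the sketch assumes that each entry starts with a "-lexeme" line followed by indented "key: value" lines, and the field names in the sample (lex, stem, gramm, paradigm, trans_ru) as well as all their values are placeholders, not an entry from the actual Komi-Zyrian dictionary.

```python
def parse_lexemes(text):
    """Parse '-lexeme' blocks into a list of dicts of value lists.

    Values are stored as lists because a key may be repeated
    within one entry (an assumption about the format).
    """
    lexemes, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == '-lexeme':
            # Start of a new entry
            current = {}
            lexemes.append(current)
        elif current is not None and ':' in stripped:
            key, _, value = stripped.partition(':')
            current.setdefault(key.strip(), []).append(value.strip())
    return lexemes


# Hypothetical entry: layout, field names and values are assumptions
SAMPLE = """\
-lexeme
 lex: EXAMPLE
 stem: EXAMPLE.
 gramm: N
 paradigm: HYPOTHETICAL_PARADIGM
 trans_ru: пример
"""

if __name__ == '__main__':
    for entry in parse_lexemes(SAMPLE):
        print(entry['lex'], entry['gramm'], entry['paradigm'])
```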
