
Rule-based morphological analysis for Erzya


Erzya morphological analyzer

This is a rule-based morphological analyzer for Erzya (myv; Uralic > Mordvinic). It is based on a formalized description of literary Erzya morphology, which also includes a number of dialectal elements, and uses uniparser-morph for parsing. It performs full morphological analysis of Erzya words (lemmatization, POS tagging, grammatical tagging, glossing).

How to use

Python package

The analyzer is available as a Python package. If you want to analyze Erzya texts in Python, install the module:

pip3 install uniparser-erzya

Import the module and create an instance of the ErzyaAnalyzer class. Set mode='strict' if you are going to process text in standard orthography, or mode='nodiacritics' if you expect some words to lack diacritics (which often happens in social media). After that, you can either parse individual tokens or lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:

from uniparser_erzya import ErzyaAnalyzer
a = ErzyaAnalyzer(mode='strict')

analyses = a.analyze_words('Морфологиянть')
# The parser is initialized before first use, so expect
# some delay here (usually several seconds)

# You will get a list of Wordform objects
# The analysis attributes are stored in its properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm, ana.gloss)

# You can also pass lists (even nested lists) and specify
# output format ('xml' or 'json')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['А'], ['Мон', 'тонь', 'вечктян', '.']],
                           format='xml')
analyses = a.analyze_words(['Морфологиянть', [['А'], ['Мон', 'тонь', 'вечктян', '.']]],
                           format='json')

Refer to the uniparser-morph documentation for the full list of options.
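Since analyze_words() accepts nested lists and returns analyses with the same structure, one convenient input is a list of sentences, each a list of tokens. The sketch below builds such a list from raw text. The tokenizer and the helper names (tokenize_sentences, analyze_text) are illustrative assumptions and are not part of uniparser-erzya; only analyze_words() comes from the package.

```python
import re

# Hypothetical, deliberately simple tokenizer: words (optionally hyphenated)
# or single punctuation marks. It is not part of uniparser-erzya.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize_sentences(text):
    """Split raw text into a list of sentences, each a list of tokens."""
    sentences = []
    # Naive sentence split: break after ., ! or ? followed by whitespace
    for chunk in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = TOKEN_RE.findall(chunk)
        if tokens:
            sentences.append(tokens)
    return sentences


def analyze_text(analyzer, text):
    """Pass the nested token list to an analyzer's analyze_words() method."""
    return analyzer.analyze_words(tokenize_sentences(text))


print(tokenize_sentences('А. Мон тонь вечктян.'))
# [['А', '.'], ['Мон', 'тонь', 'вечктян', '.']]
```

With the package installed, analyze_text(ErzyaAnalyzer(mode='strict'), text) should return one list of analyses per sentence, mirroring the input nesting.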

Disambiguation

Disambiguation is not yet available for this language.

Word lists

Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from a 2.3-million-word Erzya corpus (wordlist_main.csv), a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses for one word in an XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer is 93.6% on literary texts and 90.7% on social media texts.
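A line of wordlist_analyzed.txt can be read with a standard XML parser. The following is a minimal sketch that assumes each line is a w element containing one ana element per analysis, followed by the word form itself, and that analyses carry attributes such as lex, gr and gloss. This layout and the sample line are assumptions for illustration; check them against the actual file before relying on them.

```python
import xml.etree.ElementTree as ET


def parse_analyzed_line(line):
    """Return (word form, list of analysis attribute dicts) for one line.

    Assumed layout (not verified against the real file):
    <w><ana lex="..." gr="..." gloss="..."></ana>wordform</w>
    """
    root = ET.fromstring(line.strip())
    anas = root.findall('ana')
    # The word form is the text that is left outside the <ana> elements
    word = (root.text or '') + ''.join(a.tail or '' for a in anas)
    return word.strip(), [dict(a.attrib) for a in anas]


# Made-up sample line with placeholder values, for illustration only
sample = ('<w><ana lex="lemma1" gr="N,sg" gloss="STEM-SG"></ana>'
          '<ana lex="lemma2" gr="V" gloss="STEM"></ana>wordform</w>')
word, analyses = parse_analyzed_line(sample)
for ana in analyses:
    print(word, ana.get('lex'), ana.get('gr'), ana.get('gloss'))
```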

Description format

The description is written in the uniparser-morph format. It consists of a description of the inflection (paradigms.txt), a grammatical dictionary (myv_lexemes_XXX.txt files), a list of rules that annotate combinations of lexemes and grammatical values with additional Russian translations (lex_rules.txt), and a short list of analyses that should be avoided (bad_analyses.txt). The dictionary contains descriptions of individual lexemes. Each description gives the lexeme's stem, its part-of-speech tag and some other grammatical/borrowing information, its inflectional type (paradigm), and its Russian translation. See more about the format in the uniparser-morph documentation.
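To make the dictionary structure more concrete, here is a small sketch of a reader for lexeme entries. It assumes that each entry starts with a -lexeme line followed by indented key: value lines, and that the keys are named lex, stem, gramm, paradigm and trans_ru. Both the layout and the sample entry are assumptions for illustration; the uniparser-morph documentation is the authoritative source on the format.

```python
def read_lexemes(text):
    """Collect lexeme entries as dicts mapping each key to a list of values.

    Values are kept in lists because a key may occur more than once
    in one entry (for example, several paradigm lines).
    """
    lexemes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == '-lexeme':
            lexemes.append({})          # start a new entry
        elif ':' in line and lexemes:
            key, value = line.split(':', 1)
            lexemes[-1].setdefault(key.strip(), []).append(value.strip())
    return lexemes


# Made-up entry with placeholder values, for illustration only
sample = """
-lexeme
 lex: lemma1
 stem: lemma1.
 gramm: N
 paradigm: Noun-type-1
 trans_ru: перевод
"""
lexemes = read_lexemes(sample)
print(lexemes[0]['gramm'], lexemes[0]['paradigm'])
# ['N'] ['Noun-type-1']
```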
