
Rule-based morphological analysis for Eastern Armenian


Eastern Armenian morphological analyzer

This is a rule-based morphological analyzer for Modern Eastern Armenian (hye). It is based on a formalized description of literary Eastern Armenian morphology, which also includes a number of dialectal elements, and uses uniparser-morph for parsing. It performs full morphological analysis of Eastern Armenian words (lemmatization, POS tagging, grammatical tagging, glossing).

How to use

Python package

The analyzer is available as a Python package. If you want to analyze Eastern Armenian texts in Python, install the module:

pip3 install uniparser-eastern-armenian

Import the module and create an instance of the EasternArmenianAnalyzer class. After that, you can parse individual tokens or (nested) lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:

from uniparser_eastern_armenian import EasternArmenianAnalyzer
a = EasternArmenianAnalyzer()

analyses = a.analyze_words('Ձևաբանություն')
# The analyzer is initialized during the first call, so expect
# some delay here (usually several seconds)

# You will get a list of Wordform objects
# The analysis attributes are stored in their properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm, ana.gloss)

# You can also pass lists (even nested lists) and specify
# the output format ('xml' or 'json')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['և'], ['Ես', 'սիրում', 'եմ', 'քեզ', ':']],
                           format='xml')
analyses = a.analyze_words(['Ձևաբանություն', [['և'], ['Ես', 'սիրում', 'եմ', 'քեզ', ':']]],
                           format='json')

Refer to the uniparser-morph documentation for the full list of options.
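The frequency-list method analyze_wordlist() mentioned above is not shown in the example. Below is a minimal, non-authoritative sketch: the parameter names (freqListFile, parsedFile, unparsedFile), the file names and the file layout are assumptions based on the uniparser-morph documentation, which should be consulted for the exact signature.

from uniparser_eastern_armenian import EasternArmenianAnalyzer

a = EasternArmenianAnalyzer()

# Hypothetical file names; the input is assumed to be a frequency list
# (one word with its frequency per line), and the analyzed/unanalyzed
# words are assumed to be written to the two output files
a.analyze_wordlist(freqListFile='my_wordlist.csv',
                   parsedFile='my_wordlist_analyzed.txt',
                   unparsedFile='my_wordlist_unanalyzed.txt')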

Disambiguation

Apart from the analyzer, this repository contains a small set of Constraint Grammar rules that can be used for partial disambiguation of analyzed Armenian texts. If you want to use them, pass disambiguate=True when calling analyze_words():

analyses = a.analyze_words(['Ես', 'սիրում', 'եմ', 'քեզ'], disambiguate=True)

In order for this to work, you have to install the cg3 executable separately. On Ubuntu/Debian, you can use apt-get:

sudo apt-get install cg3

On Windows, download the binary and add its location to the PATH environment variable. See the documentation for other installation options.

Note that each time you call analyze_words() with disambiguate=True, the CG grammar is loaded and compiled from scratch, which slows the analysis down further. If you are analyzing a large text, it is therefore faster to pass the entire text in a single call rather than process it sentence by sentence, as in the sketch below.
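For example, a whole tokenized text can be analyzed and disambiguated in a single call along the following lines (a sketch only: the sentences are placeholders reusing tokens from the examples above, and a cg3 executable is assumed to be available on the PATH):

from uniparser_eastern_armenian import EasternArmenianAnalyzer

a = EasternArmenianAnalyzer()

# Tokenized text: one inner list per sentence (placeholder data;
# use whatever tokenizer you prefer)
sentences = [
    ['Ես', 'սիրում', 'եմ', 'քեզ', ':'],
    ['Ձևաբանություն', ':'],
]

# One call for the whole text, so the CG grammar is compiled only once
analyses = a.analyze_words(sentences, disambiguate=True)

# As noted above, the result mirrors the input structure:
# sentence -> token -> list of Wordform objects
for sentence, sent_analyses in zip(sentences, analyses):
    for token, token_analyses in zip(sentence, sent_analyses):
        print(token, [(ana.lemma, ana.gramm) for ana in token_analyses])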

Word lists

Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from the 100-million-word Eastern Armenian National Corpus (wordlist.csv), a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses of one word in XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer on literary texts is about 93%, i.e. about 93% of the tokens receive at least one analysis.
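If you prefer the precompiled list, it can be loaded into a lookup table roughly as follows. This is only a sketch: it assumes that each line of wordlist_analyzed.txt holds one <w>...</w> element in the uniparser-morph XML output format, with the analyses stored as attributes of <ana> elements and the word form as the element's text; inspect the file to confirm the exact layout.

import xml.etree.ElementTree as ET

# Build a word form -> analyses lookup from the precompiled list
# (the path assumes a local clone of the repository)
analyzed = {}
with open('wordlists/wordlist_analyzed.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        w = ET.fromstring(line)                    # one <w> element per line (assumed)
        wordform = ''.join(w.itertext()).strip()   # surface form of the token
        analyses = [ana.attrib for ana in w.findall('ana')]  # lex, gr, gloss, ...
        analyzed[wordform] = analyses

print(analyzed.get('Ձևաբանություն'))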

Description format

The description is written in the uniparser-morph format and consists of a description of the inflection (paradigms.txt), a grammatical dictionary (hye_lexemes_XXX.txt files), and a short list of analyses that should be avoided (bad_analyses.txt). The dictionary contains descriptions of individual lexemes, each of which is accompanied by information about its stem, its part-of-speech tag and some other grammatical/borrowing information, its inflectional type (paradigm), its English translation and (in some cases) its stem gloss. See more about the format in the uniparser-morph documentation.
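Schematically, a dictionary entry looks roughly like this. The word, gloss and paradigm name below are invented for illustration, and the field names and stem notation (the trailing dot marks the slot where inflectional affixes attach) follow the uniparser-morph format description, which is the authoritative reference.

-lexeme
 lex: տուն
 stem: տուն.
 gram: N
 gloss: house
 trans_en: house
 paradigm: noun_basic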
