Rule-based morphological analysis for Beserman (Latin-based script)

Project description

Beserman morphological analyzer

This is a rule-based morphological analyzer for Beserman (formally a dialect of Udmurt, ISO 639-3 udm; Uralic > Permic). It contains a formalized description of Beserman morphology as established by the Beserman documentation project, based mostly on spoken data from the village of Shamardan (Yukamenskoye district, Udmurtia). It uses uniparser-morph for parsing. It performs full morphological analysis of Beserman words (lemmatization, POS tagging, grammatical tagging, glossing).

This package uses a project-internal Latin-based spelling system. Cyrillic and UPA analyzers will hopefully follow later. Right now, see translit-udmurt for transliteration options.

Warning: This is a project-internal tool. If you think you might need it, you are probably wrong. If what you need is a standard Udmurt analyzer, you can find one here. If you are not sure, feel free to send an email to the developer (Timofey Arkhangelskiy, timarkh@gmail.com).

How to use

Python package

The analyzer is available as a Python package. If you want to analyze Beserman texts in Python, install the module:

pip3 install uniparser-beserman-lat

Import the module and create an instance of the BesermanLatAnalyzer class. After that, you can either parse tokens or lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:

from uniparser_beserman_lat import BesermanLatAnalyzer
a = BesermanLatAnalyzer()

analyses = a.analyze_words('Gožtemjosəz')
# The parser is initialized before first use, so expect
# some delay here (usually several seconds)

# You will get a list of Wordform objects
# The analysis attributes are stored in its properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm, ana.gloss)

# You can also pass lists (even nested lists) and specify
# output format ('xml' or 'json')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['A'], ['Mon', 'tone', 'jaratišʼko', '.']],
                           format='xml')
analyses = a.analyze_words(['Gožtemjosəz', [['A'], ['Mon', 'tone', 'jaratišʼko', '.']]],
                           format='json')

Refer to the uniparser-morph documentation for the full list of options.

Disambiguation

Apart from the analyzer, this repository contains a set of Constraint Grammar rules that can be used for partial disambiguation of analyzed Beserman texts. They reduce the average number of different analyses per analyzed token from about 1.7 to about 1.4. If you want to use them, set disambiguate=True when calling analyze_words():

analyses = a.analyze_words(['Mon', 'tone', 'jaratišʼko'], disambiguate=True)

In order for this to work, you have to install the cg3 executable separately. On Ubuntu/Debian, you can use apt-get:

sudo apt-get install cg3

On Windows, download the binary and add its directory to the PATH environment variable. See the cg3 documentation for other options.

Note that each time you call analyze_words() with disambiguate=True, the CG grammar is loaded and compiled from scratch, which slows the analysis down further. If you are analyzing a large text, it is therefore best to pass the entire text in a single function call rather than process it sentence by sentence.
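
For example, a whole pre-tokenized text can go into one call. This is a minimal sketch; the sentences below simply reuse tokens from the examples above:

# Analyze the entire text at once, so that the CG grammar
# is loaded and compiled only once
text = [
    ['Mon', 'tone', 'jaratišʼko', '.'],
    ['Gožtemjosəz', '.'],
]
analyses = a.analyze_words(text, disambiguate=True)
# The output mirrors the input structure:
# one list of analyses per sentence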

Word lists

Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from a 200-thousand-word Beserman multimedia corpus (wordlist.csv), a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses for one word in XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer on the corpus texts is about 98%.
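
If you only need lookups, such a list can be used without running the analyzer at all. Here is a minimal sketch, assuming uniparser-morph's XML output format (one <w>...</w> element per line, with one <ana .../> tag per possible analysis and the word form as the text content):

import re

# Map each word form to the list of its <ana ...> analysis tags
analyzed = {}
with open('wordlist_analyzed.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # Whatever remains after stripping the tags is the word form
        word = re.sub('<[^>]*>', '', line).strip()
        # One <ana ...> tag per possible analysis
        analyzed[word] = re.findall('<ana [^>]*>', line)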

Description format

The description follows the uniparser-morph format and comprises an inflection description (paradigms.txt), a grammatical dictionary (lexemes.txt), and a short list of analyses that should be avoided (bad_analyses.txt). The dictionary contains descriptions of individual lexemes, each of which is accompanied by information about its stem, its part-of-speech tag and other grammatical/borrowing information, its inflectional type (paradigm), and its Russian translation. See the uniparser-morph documentation for more about the format.
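
For illustration, a dictionary entry in lexemes.txt looks roughly as follows. The lexeme, paradigm name and translation here are made up (дом is Russian for 'house'), and the dot in the stem marks the slot where inflectional affixes attach:

-lexeme
 lex: korka
 stem: korka.
 gramm: N
 paradigm: Noun
 trans_ru: дом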

Download files

Download the file for your platform.

Source Distribution

uniparser_beserman_lat-2.1.9.tar.gz (211.5 kB)

Built Distribution

uniparser_beserman_lat-2.1.9-py3-none-any.whl (212.4 kB)

File details

Details for the file uniparser_beserman_lat-2.1.9.tar.gz.

File metadata

  • Download URL: uniparser_beserman_lat-2.1.9.tar.gz
  • Size: 211.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.28.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for uniparser_beserman_lat-2.1.9.tar.gz

  • SHA256: 265095f8aa55560402a48187ebbcce2bf7e7709aa55a0819302a1272691cd250
  • MD5: f609d57da1ce929148185b09f7939bf0
  • BLAKE2b-256: 7ee4ac9f7f307812ed9551012c098d9a5e2d44b4137f5b0307b62ff59eb67e2f

File details

Details for the file uniparser_beserman_lat-2.1.9-py3-none-any.whl.

File metadata

  • Download URL: uniparser_beserman_lat-2.1.9-py3-none-any.whl
  • Size: 212.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.28.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for uniparser_beserman_lat-2.1.9-py3-none-any.whl

  • SHA256: 41298b7eef4eaa7f8428de79c9133d7bb88d60d81565a866b92169daba053624
  • MD5: d9acfc1969805e2983c27c0a90cd43ba
  • BLAKE2b-256: 98b9e0fa3eca90f111952d39493fd3e8b9b632688005421390d4c6031340d9f0
