Rule-based morphological analysis for Udmurt

Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Text Processing :: Linguistic

Project description

Udmurt morphological analyzer

This is a rule-based morphological analyzer for Udmurt (udm; Uralic > Permic). It is based on a formalized description of literary Udmurt morphology, which also includes a number of dialectal elements, and uses uniparser-morph for parsing. It performs full morphological analysis of Udmurt words: lemmatization, POS tagging, grammatical tagging, glossing, and translation into Russian and, in many cases, English.

How to use

Python package

The analyzer is available as a Python package. If you want to analyze Udmurt texts in Python, install the module:

pip3 install uniparser-udmurt

Import the module and create an instance of the UdmurtAnalyzer class. Set mode='strict' (the default) if you are going to process text in the standard orthography. Set mode='nodiacritics' if you expect some words to lack diacritics, which often happens in social media, e.g. сыче instead of the correct сыӵе. Set mode='oldorth' if you are processing texts written in one of the older, pre-standardization orthographies (before the late 1930s). At present, this mode accounts for apostrophes in place of ъ and for some, but not all, features of the pre-revolutionary orthography.
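To make the 'nodiacritics' case concrete, here is a minimal sketch of the spelling variation that mode is meant to tolerate. The table and the strip_diacritics() helper are illustrative assumptions, not part of uniparser-udmurt, and say nothing about how the analyzer handles such words internally. The sketch only maps the five Udmurt letters with a diaeresis to their plain counterparts.

```python
# Illustration only: this mapping is NOT taken from the analyzer's code.
# Keys are the five Udmurt letters with a diaeresis (lower and upper case),
# values are the plain letters people often type instead.
UDMURT_DIACRITICS = str.maketrans({
    '\u04dd': 'ж', '\u04dc': 'Ж',  # zhe with diaeresis
    '\u04df': 'з', '\u04de': 'З',  # ze with diaeresis
    '\u04e5': 'и', '\u04e4': 'И',  # i with diaeresis
    '\u04e7': 'о', '\u04e6': 'О',  # o with diaeresis
    '\u04f5': 'ч', '\u04f4': 'Ч',  # che with diaeresis
})


def strip_diacritics(word: str) -> str:
    """Return the spelling a writer without an Udmurt keyboard might use."""
    return word.translate(UDMURT_DIACRITICS)


standard = 'сы\u04f5е'           # the standard spelling from the text above
informal = strip_diacritics(standard)
print(standard, '->', informal)   # the informal variant has a plain 'ч'
```

An explicit table is used instead of Unicode decomposition because stripping every combining diaeresis would also turn ё into е, which is a different kind of change.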

After that, you can either parse tokens or lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:

from uniparser_udmurt import UdmurtAnalyzer
a = UdmurtAnalyzer(mode='strict')

analyses = a.analyze_words('Морфологиез')
# The parser is initialized before first use, so expect
# some delay here (usually several seconds)

# You will get a list of Wordform objects
# The analysis attributes are stored in its properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm, ana.gloss)

# You can also pass lists (even nested lists) and specify
# output format ('xml' or 'json')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['А'], ['Мон', 'тонэ', 'яратӥсько', '.']],
                           format='xml')
analyses = a.analyze_words(['Морфологиез', [['А'], ['Мон', 'тонэ', 'яратӥсько', '.']]],
                           format='json')

Refer to the uniparser-morph documentation for the full list of options.
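Since analyze_words() accepts nested lists, raw text first has to be split into sentences and tokens. The following is a rough sketch of one way to do that with regular expressions. The to_nested_tokens() helper and both patterns are hypothetical and deliberately naive (abbreviations, for instance, will be split as sentence ends); they are not part of the package.

```python
import re

# Hypothetical helper patterns, not part of uniparser-udmurt.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')      # split after ., ! or ?
TOKEN = re.compile(r'\w+(?:-\w+)*|[^\w\s]')      # words (with hyphens) or single punctuation marks


def to_nested_tokens(text):
    """Split text into a list of sentences, each a list of tokens."""
    sentences = SENTENCE_END.split(text.strip())
    return [TOKEN.findall(sentence) for sentence in sentences if sentence]


tokens = to_nested_tokens('А. Мон тонэ яратӥсько.')
print(tokens)

# The nested list could then be passed on in one call, e.g.:
# analyses = a.analyze_words(tokens)
```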

Disambiguation

Apart from the analyzer, this repository contains a set of Constraint Grammar rules that can be used for partial disambiguation of analyzed Udmurt texts. They reduce the average number of different analyses per analyzed token from about 1.6 to about 1.3. If you want to use them, set disambiguate=True when calling analyze_words():

analyses = a.analyze_words(['Мон', 'тонэ', 'яратӥсько'], disambiguate=True)

In order for this to work, you have to install the cg3 executable separately. On Ubuntu/Debian, you can use apt-get:

sudo apt-get install cg3

On Windows, download the binary and add its directory to the PATH environment variable. See the documentation for other options.
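Before requesting disambiguation, it can help to check that the executable is actually reachable. This is a small sketch using only the standard library; the cg3_available() helper is hypothetical, and the executable name 'cg3' is assumed from the text above (your installation may expose it under a different name).

```python
import shutil


def cg3_available(executable='cg3'):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


use_cg = cg3_available()
print('cg3 found' if use_cg else 'cg3 not found on PATH; skipping disambiguation')

# The flag could then be passed on, e.g.:
# analyses = a.analyze_words(['Мон', 'тонэ', 'яратӥсько'], disambiguate=use_cg)
```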

Note that each call to analyze_words() with disambiguate=True loads and compiles the CG grammar from scratch, which slows the analysis down further. If you are analyzing a large text, pass its entire contents in a single function call rather than sentence by sentence.
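The difference between the two calling patterns can be sketched without the real analyzer. CountingAnalyzer below is a hypothetical stand-in that only counts how often analyze_words() is invoked; with the real UdmurtAnalyzer, each counted call would correspond to one grammar load.

```python
class CountingAnalyzer:
    """Hypothetical stand-in for UdmurtAnalyzer that only counts calls."""

    def __init__(self):
        self.calls = 0

    def analyze_words(self, words, disambiguate=False):
        self.calls += 1
        # Return a placeholder per token, keeping the nested structure.
        return [[f'<{w}>' for w in sentence] for sentence in words]


sentences = [['Мон', 'тонэ', 'яратӥсько', '.'], ['А', '.']]

# Sentence by sentence: one call (and one grammar load) per sentence.
slow = CountingAnalyzer()
per_sentence = [slow.analyze_words([s], disambiguate=True)[0] for s in sentences]

# Whole text at once: a single call for all sentences.
fast = CountingAnalyzer()
whole_text = fast.analyze_words(sentences, disambiguate=True)

print(slow.calls, fast.calls)
```

For a text with thousands of sentences, the first pattern would repeat the grammar compilation thousands of times, while the second does it once.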

Word lists

Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from a 10-million-word Udmurt corpus (wordlist.csv), a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses for one word in an XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer on the corpus texts is about 96% and the corpus is sufficiently large, so if you just use the analyzed word list, the recall on your texts will almost certainly exceed 90%.
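A lookup table can be built from the analyzed list with the standard library alone. The sketch below assumes that each line is a single <w> element containing one <ana> element per analysis, followed by the word form as text. This layout, the attribute names in the sample, and both helper functions are assumptions for illustration, so check a few lines of wordlist_analyzed.txt before relying on them.

```python
import xml.etree.ElementTree as ET


def parse_analyzed_line(line):
    """Return (wordform, list of analysis attribute dicts) for one line.

    Assumes a layout like <w><ana .../><ana .../>wordform</w>.
    """
    w = ET.fromstring(line.strip())
    wordform = ''.join(w.itertext()).strip()
    analyses = [dict(ana.attrib) for ana in w.iter('ana')]
    return wordform, analyses


def load_wordlist(path):
    """Build a dict mapping each word form to its analyses."""
    lookup = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                wordform, analyses = parse_analyzed_line(line)
                lookup[wordform] = analyses
    return lookup


# Made-up sample line with placeholder values, not real analyzer output.
sample = '<w><ana lex="LEMMA" gr="TAGS"></ana><ana lex="LEMMA2" gr="TAGS2"></ana>wordform</w>'
wordform, analyses = parse_analyzed_line(sample)
print(wordform, len(analyses), analyses[0]['lex'])
```

Words missing from the resulting dictionary can then be treated as unanalyzed or passed to the analyzer itself.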

Description format

The description is written in the uniparser-morph format. It consists of a description of the inflection (paradigms.txt), a grammatical dictionary (udm_lexemes_XXX.txt files), a list of rules that annotate combinations of lexemes and grammatical values with additional Russian translations (lex_rules.txt), and a short list of analyses that should be avoided (bad_analyses.txt). The dictionary describes individual lexemes; each entry gives the stem, the part-of-speech tag and some other grammatical or borrowing information, the inflectional type (paradigm), and a Russian translation. See the uniparser-morph documentation for more about the format.

Release history

2.3.1 (this version): May 23, 2026
