Rule-based, linguist-friendly (and rather slow) morphological analysis

Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Text Processing :: Linguistic

Project description

uniparser-morph

This is yet another rule-based morphological analysis tool. No built-in rules are provided; you will have to write some if you want to parse texts in your language. Uniparser-morph was developed primarily for under-resourced languages, which don't have enough data for training statistical parsers. Here's how it's different from other similar tools:

- It is designed to be usable by theoretical linguists with no prior knowledge of NLP (and has been successfully used by them with minimal guidance). So it's not just another way of defining an FST; the way you describe lexemes and morphology resembles, at least in part, what you do in a traditional theoretical description.
- It was developed with a large variety of linguistic phenomena in mind and is easily applicable to most languages, not just Standard Average European ones.
- Apart from POS-tagging and full morphological tagging, there is a glossing option (words can be split into morphemes).
- Lexemes can carry any number of attributes that have to end up in the annotation, e.g. translations into the metalanguage.
- Ambiguity is allowed: all words you analyze will receive all theoretically possible analyses regardless of the context. (You can then use e.g. Constraint Grammar (CG) for rule-based disambiguation.)
- While, in computational terms, the language described by uniparser-morph rules is certainly regular, the description is actually NOT entirely converted into an FST. Therefore, it's not nearly as fast as FST-based analyzers. The speed varies depending on the language structure and hardware characteristics, but you can hardly expect to parse more than 20,000 words per second. For heavily polysynthetic languages that figure can go as low as 200 words per second. So it's not really designed for industrial use.
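
To get a feel for what these speed figures mean in practice, here is a back-of-the-envelope estimate. The wordlist size of 200,000 unique words is an arbitrary number chosen for illustration, not a measurement; only the two throughput figures come from the text above.

```python
def parse_time_seconds(n_words: int, words_per_second: float) -> float:
    """Estimated wall-clock time needed to parse n_words at a given throughput."""
    if words_per_second <= 0:
        raise ValueError("words_per_second must be positive")
    return n_words / words_per_second


# Hypothetical wordlist size, chosen only for illustration.
N_UNIQUE_WORDS = 200_000

best_case = parse_time_seconds(N_UNIQUE_WORDS, 20_000)  # upper bound quoted above
worst_case = parse_time_seconds(N_UNIQUE_WORDS, 200)    # heavily polysynthetic case

print(f"best case:  {best_case:.0f} s")
print(f"worst case: {worst_case:.0f} s (about {worst_case / 60:.0f} min)")
```

For this made-up wordlist the difference is roughly ten seconds versus about seventeen minutes, which is tolerable for a one-off wordlist but not for parsing running text on the fly.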

The primary usage scenario I was thinking about is the following:

1. You have a corpus of texts where you want to add morphological annotation (this includes POS-tagging).
2. You manually prepare a grammar for the language in uniparser-morph format (probably making use of existing digital dictionaries of the language).
3. You compile a list of unique words in your corpus and parse it.
4. Then you annotate your texts based on this wordlist with any software you want.
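
The last two steps of this scenario can be sketched in plain Python. This is only an illustration under several assumptions: the regex tokenizer is deliberately naive, the function names are made up for this example, and parse_word stands for whatever callable returns the analyses of a single word (for instance, a thin wrapper around a.analyze_words()).

```python
import re
from typing import Any, Callable, Iterable, List, Tuple

# Naive tokenizer (an assumption): every run of word characters is a token.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text)


def unique_words(texts: Iterable[str]) -> List[str]:
    """Sorted list of distinct word forms found in the corpus."""
    return sorted({token for text in texts for token in tokenize(text)})


def annotate(
    texts: Iterable[str], parse_word: Callable[[str], Any]
) -> List[List[Tuple[str, Any]]]:
    """Parse each unique word once, then label every token by dictionary lookup."""
    texts = list(texts)
    cache = {word: parse_word(word) for word in unique_words(texts)}
    return [[(token, cache[token]) for token in tokenize(text)] for text in texts]


if __name__ == "__main__":
    corpus = ["Мон тонэ", "тонэ Мон"]
    # Stand-in parser for the demo; a real one would call the analyzer.
    print(annotate(corpus, parse_word=lambda word: f"<analysis of {word}>"))
```

The point of the cache is that the slow parser runs once per distinct word form, no matter how often that form occurs in the corpus.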

Of course, you can do other things with uniparser-morph, e.g. make it a part of a more complex NLP pipeline; just make sure low speed is not an issue in your case.

uniparser-morph is distributed under the MIT license (see LICENSE).

Usage

Import the Analyzer class from the package. Here is a basic usage example:

from uniparser_morph import Analyzer
a = Analyzer()

# Put your grammar files in the current folder or set paths as properties of the Analyzer class (see below)
a.load_grammar()

analyses = a.analyze_words('Морфологиез')
# The parser is initialized before first use, so expect some delay here (usually several seconds)
# You will get a list of Wordform objects

# You can also pass lists (even nested lists) and specify output format ('xml' or 'json'):
analyses = a.analyze_words([['А'], ['Мон', 'тонэ', 'яратӥсько', '.']], format='xml')
analyses = a.analyze_words(['Морфологиез', [['А'], ['Мон', 'тонэ', 'яратӥсько', '.']]], format='json')

If you need to parse a frequency list, use analyze_wordlist() instead.

See the documentation for the full list of options.
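
If your input is raw text rather than a ready-made list, you need to turn it into the nested shape shown above (a list of sentences, each a list of tokens) yourself. Here is a minimal sketch of that preparation step. The sentence splitter and tokenizer are crude assumptions for illustration, and to_nested_tokens is a made-up helper, not part of uniparser-morph.

```python
import re
from typing import List

# A token is a run of word characters or a single punctuation mark (naive).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_nested_tokens(text: str) -> List[List[str]]:
    """Split text into sentences, then each sentence into tokens."""
    # Very rough sentence boundary: whitespace following '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [TOKEN_RE.findall(sentence) for sentence in sentences if sentence]
```

The result can then be passed on, e.g. a.analyze_words(to_nested_tokens(text), format='json'). For real corpora you will probably want a tokenizer that knows about the orthography of your language.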

Format

If you want to create a uniparser-morph analyzer for your language, you will have to write a set of rules that describe the vocabulary and the morphology of your language in uniparser-morph format. For a description of the format, refer to the documentation.

Disambiguation with CG

If you have disambiguation rules in the Constraint Grammar format, you can use them in the following way when calling analyze_words():

import os

analyses = a.analyze_words(['Мон', 'морфологиез', 'яратӥсько', '.'],
                           cgFile=os.path.abspath('disambiguation.cg3'),
                           disambiguate=True)

In order for this to work, you have to install the cg3 executable separately. On Ubuntu/Debian, you can use apt-get:

sudo apt-get install cg3

On Windows, download the binary and add its folder to the PATH environment variable. See the documentation for other options.
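
Before turning disambiguation on, you may want to check that the executable is actually visible from Python. The sketch below uses only the standard library. It assumes the executable is called cg3, as in the text above; how uniparser-morph itself locates the binary is not covered here, and find_cg3 is a made-up helper name.

```python
import shutil
from typing import Optional


def find_cg3(executable: str = "cg3") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, otherwise None."""
    return shutil.which(executable)


path = find_cg3()
if path is None:
    print("cg3 was not found on PATH; disambiguate=True is unlikely to work.")
else:
    print(f"cg3 found at {path}")
```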

Note that each time you call analyze_words() with disambiguate=True, the CG grammar is loaded and compiled from scratch, which makes the analysis even slower. If you are analyzing a large text, it is best to pass the entire text in a single function call rather than going sentence by sentence.
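
A minimal sketch of the single-call approach follows. It assumes that nested sentence lists can be combined with the cgFile and disambiguate arguments shown above (the examples in this README show them separately), and analyze_text_once is a made-up helper name.

```python
import os
from typing import Any, List, Sequence


def analyze_text_once(
    analyzer: Any, sentences: Sequence[Sequence[str]], cg_path: str
) -> Any:
    """Send all sentences in one call so the CG grammar is compiled only once."""
    batch: List[List[str]] = [list(sentence) for sentence in sentences]
    return analyzer.analyze_words(
        batch,
        cgFile=os.path.abspath(cg_path),
        disambiguate=True,
    )
```

Compare this with calling analyze_words() inside a loop over sentences: with N sentences, the loop would load and compile the grammar N times, while the helper above does it once.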

Project details

Release history

- 2.9.4: May 9, 2025
- 2.9.3: May 9, 2025
- 2.9.2: May 9, 2025
- 2.9.1: May 9, 2025
- 2.9.0: May 9, 2025
- 2.8.0: Apr 25, 2025
- 2.7.6: Apr 25, 2025
- 2.7.5: Jan 3, 2024
- 2.7.4: Oct 11, 2023
- 2.7.3: Dec 19, 2022
- 2.7.2: Dec 13, 2022
- 2.7.1: Nov 23, 2022
- 2.7.0: Nov 23, 2022
- 2.6.4: Jun 8, 2022
- 2.6.3: Jun 8, 2022
- 2.6.2: May 18, 2022
- 2.6.1: Apr 25, 2022
- 2.6.0: Apr 14, 2022
- 2.5.0: Mar 11, 2022 (this version)
- 2.4.3: Nov 23, 2021
- 2.4.2: Oct 11, 2021
- 2.4.1: Sep 13, 2021
- 2.4.0: Sep 10, 2021
- 2.3.0: Jun 8, 2021
- 2.2.1: Mar 10, 2021
- 2.2.0: Mar 5, 2021
- 2.1.0: Mar 3, 2021

Download files

Source Distribution

uniparser-morph-2.5.0.tar.gz (51.6 kB)

Uploaded Mar 11, 2022 Source

Built Distribution

uniparser_morph-2.5.0-py3-none-any.whl (56.6 kB)

Uploaded Mar 11, 2022 Python 3

File details

Details for the file uniparser-morph-2.5.0.tar.gz.

File metadata

Download URL: uniparser-morph-2.5.0.tar.gz
Upload date: Mar 11, 2022
Size: 51.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for uniparser-morph-2.5.0.tar.gz
Algorithm	Hash digest
SHA256	`859b7198d2cb80158c84c115fd3d9a84d3d298b9bf9ea00891b8ae6aa58f507a`
MD5	`6a1b6284ccbe2cf97c9f39cebfa63d8a`
BLAKE2b-256	`31596d42d9bd82f3456ac1a7aff4a1cdd9ffb7e3c7c8dbe38d6876ec279b3e23`
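
If you download the archive by hand, you can compare it against the SHA256 digest listed above. The following is a minimal sketch using only the standard library; the function names are made up for this example, and the file path you pass in is whatever location you saved the download to.

```python
import hashlib

# SHA256 digest of uniparser-morph-2.5.0.tar.gz, copied from the table above.
EXPECTED_SHA256 = "859b7198d2cb80158c84c115fd3d9a84d3d298b9bf9ea00891b8ae6aa58f507a"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's digest equals the expected one (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()
```

A call such as matches_expected('uniparser-morph-2.5.0.tar.gz') should return True for an intact download of the source distribution.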

File details

Details for the file uniparser_morph-2.5.0-py3-none-any.whl.

File metadata

Download URL: uniparser_morph-2.5.0-py3-none-any.whl
Upload date: Mar 11, 2022
Size: 56.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for uniparser_morph-2.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fa68c3ee3b4d3f0275fd320a4a9384ae695d8de20f0a875d8bf2a7fe3dd5afab`
MD5	`c259054ebefdc299fdbf2caeed3d2c70`
BLAKE2b-256	`6e9cebe760e1a092f000b62cb9d8a35382f4f89efed3c10ec2d9ecf9fb324a4f`
