Rule-based morphological analysis for Albanian

Albanian morphological analyzer

This is a rule-based morphological analyzer for Albanian (sqi). It is based on a formalized description of literary Albanian morphology, which also includes a number of dialectal (Gheg) elements, and uses uniparser-morph for parsing. It performs full morphological analysis of Albanian words (lemmatization, POS tagging, grammatical tagging).

How to use

Python package

The analyzer is available as a Python package. If you want to analyze Albanian texts in Python, install the module:

pip3 install uniparser-albanian

Import the module and create an instance of the AlbanianAnalyzer class. Set mode='strict' if you are going to process text in standard orthography, or mode='nodiacritics' if you expect some words to lack diacritics (c instead of ç and e instead of ë). After that, you can either parse single tokens or lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:

from uniparser_albanian import AlbanianAnalyzer
a = AlbanianAnalyzer(mode='strict')

# The parser is initialized before first use, so expect
# some delay on the first call (usually several seconds)
analyses = a.analyze_words('Morfologjinë')

# You will get a list of Wordform objects
# The analysis attributes are stored in their properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm)

# You can also pass lists (even nested lists) and specify
# output format ('xml', 'json' or 'conll')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['i'], ['Të', 'dua', '.']],
                           format='xml')
analyses = a.analyze_words([['i'], ['Të', 'dua', '.']],
                           format='conll')
analyses = a.analyze_words(['Morfologjinë', [['i'], ['Të', 'dua', '.']]],
                           format='json')

Refer to the uniparser-morph documentation for the full list of options.
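
Since analyze_words() returns analyses with the same nesting as its input, it can be convenient to walk the result without tracking how deep it is. The following is a minimal sketch, assuming the result consists of ordinary Python lists whose innermost items are Wordform objects (or strings, if an output format was requested). The helper iter_leaves is hypothetical and not part of the package; the demo uses placeholder strings instead of real analyses so that it runs without the analyzer installed.

```python
def iter_leaves(result):
    """Yield every non-list item from an arbitrarily nested list."""
    if isinstance(result, (list, tuple)):
        for item in result:
            yield from iter_leaves(item)
    else:
        yield result


# Placeholder strings stand in for the objects returned by analyze_words()
nested = ['ana1', [['ana2'], ['ana3', 'ana4', 'ana5']]]
flat = list(iter_leaves(nested))
print(flat)
```

With real output you would pass the value returned by analyze_words() instead of the placeholder list and read ana.wf, ana.lemma and ana.gramm from each item.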

Word lists

Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from a 31-million-word Albanian corpus (wordlist.csv) with 456,000 unique tokens, a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses for one word in an XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer on the corpus texts is about 93%, and the corpus is sufficiently large, so if you just use the analyzed word list, the recall on your texts will probably exceed 90%.
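
The analyzed word list can be loaded into a lookup table without running the parser. Below is a minimal sketch, assuming each line is a single <w> element that holds one <ana> element per analysis, with the lemma in a lex attribute, the tags in a gr attribute, and the word form as trailing text. This layout follows the usual uniparser-morph XML output but should be checked against the actual file. The function names and the sample lines (which use placeholder values, not real analyses) are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_analyzed_line(line):
    """Return (wordform, [(lemma, tags), ...]) for one XML line."""
    w = ET.fromstring(line.strip())
    analyses = [(ana.get('lex', ''), ana.get('gr', ''))
                for ana in w.iter('ana')]
    # The word form is the text left after the <ana> elements
    wordform = ''.join(w.itertext()).strip()
    return wordform, analyses


def build_lookup(lines):
    """Map each word form to its list of analyses, skipping blank lines."""
    lookup = {}
    for line in lines:
        if line.strip():
            wordform, analyses = parse_analyzed_line(line)
            lookup[wordform] = analyses
    return lookup


def coverage(tokens, lookup):
    """Share of tokens that have at least one entry in the lookup."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lookup) / len(tokens)


# Placeholder lines in the assumed format (not taken from the real file)
sample_lines = [
    '<w><ana lex="LEMMA1" gr="TAG1,TAG2"></ana>form1</w>',
    '<w><ana lex="LEMMA2" gr="TAG3"></ana>'
    '<ana lex="LEMMA3" gr="TAG4"></ana>form2</w>',
]
lookup = build_lookup(sample_lines)
print(lookup['form2'])
print(coverage(['form1', 'form2', 'form3'], lookup))
```

To use the real data, pass an open file object for wordlist_analyzed.txt to build_lookup() instead of sample_lines. The coverage() helper gives a rough token-level estimate of how well the list covers your own texts.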

Description format

The description is written in the uniparser-morph format and consists of a description of the inflection (paradigms.txt), a grammatical dictionary (sqi_lexemes_XXX.txt files), a list of productive lemma-changing derivations (derivations.txt), and a short list of analyses that should be avoided (bad_analyses.txt). The dictionary contains descriptions of individual lexemes, each of which is accompanied by information about its stem, its part-of-speech tag and some other grammatical/dialectal information, its inflectional type (paradigm), and an English translation. See more about the format in the uniparser-morph documentation.
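
To inspect the dictionary programmatically, the lexeme files can be read with a few lines of code. The sketch below assumes the common uniparser-morph layout, in which each entry starts with a "-lexeme" line followed by indented "key: value" lines; the field names lex, stem, gramm, paradigm and trans_en are assumptions about this layout. The sample entry uses placeholder values and is not taken from the actual dictionary; parse_lexemes is a hypothetical helper.

```python
def parse_lexemes(text):
    """Parse '-lexeme' blocks into a list of dicts.

    Each key maps to a list of values, because a key may be
    repeated within one entry.
    """
    lexemes = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == '-lexeme':
            current = {}
            lexemes.append(current)
        elif current is not None and ':' in line:
            key, _, value = line.partition(':')
            current.setdefault(key.strip(), []).append(value.strip())
    return lexemes


# Placeholder entry in the assumed format (all values are hypothetical)
sample = """
-lexeme
 lex: LEMMA
 stem: STEM.
 gramm: POS,TAG
 paradigm: PARADIGM_A
 paradigm: PARADIGM_B
 trans_en: translation
"""
entries = parse_lexemes(sample)
print(entries[0]['lex'], entries[0]['paradigm'])
```

Storing every value in a list keeps repeated keys (here, two paradigm lines) instead of silently overwriting the first one.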
