A Greek stemmer that uses Part of Speech (POS) tags
Greek Stemmer
A Python stemming library that, given a word and its Part of Speech (POS) tag, removes the word's inflectional suffix using a rule-based algorithm. The algorithm follows the grammatical rules of the Modern Greek language [1]. Extended documentation of the removal process and a short evaluation of the system show that the algorithm performs better than earlier stemming algorithms for Greek, giving 99.4 percent correct results on a dataset of 5000 words.
[1] David Holton, Peter Mackridge, Ειρήνη Φιλιππάκη-Warburton (2002), "Γραμματική της Ελληνικής Γλώσσας".
POS: The system uses the POS tagger of Ellogon with the following categories for the rules:
- Verbs: VB, VBD, VBF, MD, VBS, VBDS, VBFS
- Definite Article: DDT
- Indefinite Article: IDT
- Nouns: NNM, NNF, NNN, NNSM, NNSF, NNSN, NNPM, NNPF, NNPN, NNPSM, NNPSF, NNPSN
- Adjectives: JJM, JJF, JJN, CD, JJSM, JJSF, JJSN
- Pronouns: PRP, PP, REP, DP, IP, WP, QP, INP
- Participles: VBG, VBP, VBPD
- Adverb: RB
- Preposition: INP
Although many stemmers exist, each language has its own morphological system, so no single rule-based algorithm can find the stem of every word in every language. The task is even harder in morphologically rich languages such as Greek, where the stem is found by removing the suffix from inflected or derived words. Greek has a wide variety of suffixes, and some of them occur in more than one part of speech. For this reason, the part of speech of a word must be known before its stem can be determined.
Consider some typical examples. The word "εργαζόμενος" is the participle of the verb "εργάζομαι". The typical suffix of the present participle is "-όμενος", but it can be confused with the basic adjective suffix "-ος". As a result, "εργαζόμεν" can be erroneously identified as the stem of "εργαζόμενος", while its actual stem is "εργαζ".
Likewise, some numerals and adverbs end in what looks like a typical verbal suffix. The numeral "οκτώ" and the adverb "παραπάνω" appear to share the suffix of the first person singular of the present tense in the active voice. In a verb that suffix really is "-ω", but numerals and adverbs are indeclinable parts of speech, so their stem is the word itself.
Examples
WORD | CONFUSED WITH THE STEM OF ANOTHER POS | REAL STEM |
---|---|---|
εργαζόμενος (employee) | εργαζόμεν (confused with the stem of the adjectives) | εργαζ |
οκτώ (eight) | οκτ (confused with the stem of the verbs) | οκτώ |
παραπάνω (more) | παραπάν (confused with the stem of the verbs) | παραπάνω |
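The table above can be reproduced with a deliberately tiny sketch. The rule table and the `toy_stem` function below are hypothetical and are not the library's actual rules or API. They only illustrate why the same word yields a different stem depending on its POS tag. Treating `CD` and `RB` as indeclinable is an assumption taken from the explanation above.

```python
# Toy illustration only: these few suffix rules are NOT the library's rule set.
# Each POS tag maps to the suffixes that may be stripped for that tag.
TOY_SUFFIXES = {
    "VBG": ["όμενος"],  # present participle
    "JJM": ["ος"],      # masculine adjective
    "VB": ["ω", "ώ"],   # verb, first person singular, present, active
}

# Assumption for this sketch: numerals and adverbs are never stripped.
INDECLINABLE = {"CD", "RB"}


def toy_stem(word: str, pos: str) -> str:
    """Strip the first matching suffix allowed for the given POS tag."""
    if pos in INDECLINABLE:
        return word
    for suffix in TOY_SUFFIXES.get(pos, []):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


# The same word, tagged differently, gives a different stem.
print(toy_stem("εργαζόμενος", "VBG"))  # εργαζ
print(toy_stem("εργαζόμενος", "JJM"))  # εργαζόμεν
print(toy_stem("οκτώ", "VB"))          # οκτ
print(toy_stem("οκτώ", "CD"))          # οκτώ
```

The wrong tag produces exactly the "confused" stems listed in the table, which is why the stemmer expects a POS tag alongside each word.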
Install
The recommended installation is via pip. Simply type:
$ pip install greek-stemmer-pos
Usage
from greek_stemmer import stemmer
stemmer.stem_word('εργαζόμενος', 'VBG')
# ΕΡΓΑΖ
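Text usually arrives as a sequence of (word, POS tag) pairs from a tagger, so a small wrapper can be handy. The helper below is a hypothetical sketch and is not part of the library. It takes the stemming function as a parameter, so it only assumes a callable with the same `(word, pos)` argument order as `stemmer.stem_word` shown above.

```python
from typing import Callable, Iterable, List, Tuple


def stem_tagged(
    tagged_words: Iterable[Tuple[str, str]],
    stem_fn: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Return (word, stem) pairs for an iterable of (word, pos_tag) pairs.

    `stem_fn` is any callable taking (word, pos_tag) and returning a stem.
    """
    return [(word, stem_fn(word, pos)) for word, pos in tagged_words]


# With the package installed, the call would presumably look like this:
#   from greek_stemmer import stemmer
#   stem_tagged([("εργαζόμενος", "VBG")], stemmer.stem_word)
```

Passing the function in keeps the helper independent of where the POS tags come from and makes it easy to test with a stand-in stemmer.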
How to contribute
If you wish to contribute, you can start from here!
Run Test
- You can run the available tests with pytest
- There are 149 unit tests available
Code coverage
- Code coverage metrics are also available via pytest-cov.
- Existing code coverage: 100%
Python Package Index (PyPI)
- Library is available here