A Greek stemmer that uses part-of-speech (POS) tags

Greek Stemmer

Greek Stemmer is a Python stemming library that, given a word and its part-of-speech (POS) tag, removes the word's inflectional suffix according to a rule-based algorithm. The algorithm follows the grammatical rules of the Modern Greek language [1]. Extended documentation of the removal process, together with a short evaluation of the system, shows that the algorithm performs better than earlier stemming algorithms for Greek, giving 99.4 percent correct results on a dataset of 5000 words.

[1] David Holton, Peter Mackridge, Ειρήνη Φιλιππάκη-Warburton (2002), "Γραμματική της Ελληνικής Γλώσσας".

POS: The system uses the POS tagger of Ellogon with the following categories for the rules:

  • Verbs: VB, VBD, VBF, MD, VBS, VBDS, VBFS

  • Definite Article: DDT

  • Indefinite Article: IDT

  • Nouns: NNM, NNF, NNN, NNSM, NNSF, NNSN, NNPM, NNPF, NNPN, NNPSM, NNPSF, NNPSN

  • Adjectives: JJM, JJF, JJN, CD, JJSM, JJSF, JJSN

  • Pronouns: PRP, PP, REP, DP, IP, WP, QP, INP

  • Participles: VBG, VBP, VBPD

  • Adverb: RB

  • Preposition: INP
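
When tagger output is prepared for the stemmer, the categories above can be kept in a small lookup table. The following is a minimal sketch and not part of the library: the names `POS_CATEGORIES` and `categories_for` are hypothetical, and only the tag lists themselves come from the list above. Because `INP` is listed under both Pronouns and Preposition, the lookup returns every matching category rather than a single one.

```python
# Hypothetical helper, not part of greek-stemmer-pos: the tag lists are
# copied from the categories above, everything else is illustrative.
POS_CATEGORIES = {
    "Verbs": ["VB", "VBD", "VBF", "MD", "VBS", "VBDS", "VBFS"],
    "Definite Article": ["DDT"],
    "Indefinite Article": ["IDT"],
    "Nouns": ["NNM", "NNF", "NNN", "NNSM", "NNSF", "NNSN",
              "NNPM", "NNPF", "NNPN", "NNPSM", "NNPSF", "NNPSN"],
    "Adjectives": ["JJM", "JJF", "JJN", "CD", "JJSM", "JJSF", "JJSN"],
    "Pronouns": ["PRP", "PP", "REP", "DP", "IP", "WP", "QP", "INP"],
    "Participles": ["VBG", "VBP", "VBPD"],
    "Adverb": ["RB"],
    "Preposition": ["INP"],
}


def categories_for(tag):
    """Return every category whose tag list contains `tag` (case-insensitive)."""
    wanted = tag.upper()
    return [name for name, tags in POS_CATEGORIES.items() if wanted in tags]


print(categories_for("VBG"))  # ['Participles']
print(categories_for("INP"))  # ['Pronouns', 'Preposition']
print(categories_for("XYZ"))  # []
```

An empty result is a convenient signal that a tag comes from a different tag set and needs mapping before it is passed on.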

Although a variety of stemmers exists, the unique morphological system of each language does not allow the creation of a global rule-based algorithm able to find the stem of every word. In languages with a rich morphological system, such as Greek, it is even more difficult to find the stem by removing the suffix from inflected or derived words. Greek has a wide variety of suffixes, and some of them appear in more than one part of speech. For this reason it is necessary to identify the part of speech of a word before trying to find its stem.

Let's examine some typical examples. The word "εργαζόμενος" is the participle of the verb "εργάζομαι". Although the typical suffix of the present participle is "-όμενος", it may be confused with the basic adjectival suffix "-ος". As a result, "εργαζόμεν" can be erroneously identified as the stem of "εργαζόμενος", while in fact its stem is "εργαζ". Moreover, some numerals and adverbs appear to carry typical verbal suffixes: the numeral "οκτώ" and the adverb "παραπάνω" seem to have the same suffix as the first person singular of the present tense, active voice. In a verb that suffix really is "-ω", but numerals and adverbs are indeclinable parts of speech, so their stem is the word itself. Knowing the part of speech is therefore what makes it possible to find the correct stem.

Examples

| Word | Confused with the stem of another POS | Real stem |
| --- | --- | --- |
| εργαζόμενος (employee) | εργαζόμεν (confused with the stem of the adjectives) | εργαζ |
| οκτώ (eight) | οκτ (confused with the stem of the verbs) | οκτώ |
| παραπάνω (more) | παραπάν (confused with the stem of the verbs) | παραπάνω |
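
The effect of the part of speech on the result can be reproduced with a toy suffix stripper. This is only a sketch under stated assumptions: `toy_stem`, `TOY_SUFFIXES` and the coarse category names are hypothetical, the three rules are chosen to mirror the examples above, and none of this reflects the actual rule set of the library.

```python
import unicodedata

# Toy suffix rules keyed by a coarse, hypothetical category name.
# These are NOT the rules used by greek-stemmer-pos.
TOY_SUFFIXES = {
    "participle": ["όμενος"],
    "adjective": ["ος"],
    "verb": ["ω", "ώ"],
}
# Categories treated as indeclinable in this sketch.
INDECLINABLE = {"numeral", "adverb"}


def toy_stem(word, category):
    """Strip the first matching suffix for `category`, else return the word."""
    # Normalise so that accented letters compare equal however they were typed.
    word = unicodedata.normalize("NFC", word)
    if category in INDECLINABLE:
        return word
    for suffix in TOY_SUFFIXES.get(category, []):
        suffix = unicodedata.normalize("NFC", suffix)
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


examples = [
    ("εργαζόμενος", "adjective"),   # wrong category: gives εργαζόμεν
    ("εργαζόμενος", "participle"),  # right category: gives εργαζ
    ("οκτώ", "verb"),               # wrong category: gives οκτ
    ("οκτώ", "numeral"),            # right category: word is unchanged
    ("παραπάνω", "verb"),           # wrong category: gives παραπάν
    ("παραπάνω", "adverb"),         # right category: word is unchanged
]
for word, category in examples:
    print(word, category, "->", toy_stem(word, category))
```

The same surface form yields a different stem depending only on the category passed in, which is the reason the stemmer takes a POS tag as its second argument.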

Install

The recommended installation is via pip:

$ pip install greek-stemmer-pos

Usage

from greek_stemmer import stemmer

stemmer.stem_word('εργαζόμενος', 'VBG')
# ΕΡΓΑΖ
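
To stem a whole tagged sentence, the call above can be wrapped in a small helper. The sketch below is illustrative only: `stem_tagged` is a hypothetical name, and the stemming function is passed in as a parameter so that the helper relies on nothing beyond the `stem_word(word, tag)` call shown above. The stand-in `fake_stem` exists only so that the example runs without the library installed.

```python
def stem_tagged(tokens, stem_fn):
    """Apply `stem_fn(word, tag)` to each (word, tag) pair and return the stems."""
    return [stem_fn(word, tag) for word, tag in tokens]


# Stand-in used only for this demonstration; it tags the word instead of
# stemming it and does NOT reproduce the library's behaviour.
def fake_stem(word, tag):
    return f"{word}/{tag}"


tokens = [("εργαζόμενος", "VBG"), ("παραπάνω", "RB")]
print(stem_tagged(tokens, fake_stem))
```

With the library installed, `stem_tagged(tokens, stemmer.stem_word)` would route each pair through the real stemmer.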

How to contribute

If you wish to contribute, you can start from here.

Run Test

  • You can run the available tests with pytest
  • There are 149 unit tests available

Code coverage

  • Code coverage metrics are also available via pytest-cov.
  • Existing code coverage: 100%

Python Package Index (PyPI)

  • Library is available here

Download files

Source Distribution

greek-stemmer-pos-1.1.2.tar.gz (23.4 kB view details)

Uploaded Source

Built Distribution

greek_stemmer_pos-1.1.2-py3-none-any.whl (19.7 kB view details)

Uploaded Python 3
