A Transformer-based Natural Language Processing Pipeline for Greek

Project description

gr-nlp-toolkit

A Transformer-based natural language processing toolkit for (modern) Greek. The toolkit achieves state-of-the-art performance in Greek and supports named entity recognition, part-of-speech tagging, morphological tagging, and dependency parsing. For more information, please consult the following theses:

C. Dikonimaki, "A Transformer-based natural language processing toolkit for Greek -- Part of speech tagging and dependency parsing", BSc thesis, Department of Informatics, Athens University of Economics and Business, 2021. http://nlp.cs.aueb.gr/theses/dikonimaki_bsc_thesis.pdf

N. Smyrnioudis, "A Transformer-based natural language processing toolkit for Greek -- Named entity recognition and multi-task learning", BSc thesis, Department of Informatics, Athens University of Economics and Business, 2021. http://nlp.cs.aueb.gr/theses/smyrnioudis_bsc_thesis.pdf

Installation

You can install the toolkit by executing the following in the command line:

pip install gr-nlp-toolkit

Usage

To use the toolkit, first initialize a Pipeline, specifying which processors you need. Each processor adds a specific task's annotations to the text.

  • To obtain Part-of-Speech and Morphological Tagging annotations, add the pos processor
  • To obtain Named Entity Recognition annotations, add the ner processor
  • To obtain Dependency Parsing annotations, add the dp processor

from gr_nlp_toolkit import Pipeline
nlp = Pipeline("pos,ner,dp") # Use the pos, ner and dp processors
# nlp = Pipeline("ner,dp") # Use only ner and dp processors

The first time you use a processor, its model files are downloaded and cached in the .cache folder of your home directory, so you will not have to download them again. Each processor is about 500 MB in size, so downloading all three can require up to 1.5 GB.

Generating the annotations

After creating the pipeline you can annotate a text by calling the pipeline's __call__ method.

doc = nlp('Η Ιταλία κέρδισε την Αγγλία στον τελικό του Euro 2020')

A Document object is then created and annotated. The original text is tokenized and split into tokens.

Accessing the annotations

The following code explains how you can access the annotations generated by the toolkit.

for token in doc.tokens:
  print(token.text) # the text of the token
  
  print(token.ner) # the named entity label, in IOBES encoding (str)
  
  print(token.upos) # the UPOS tag of the token
  print(token.feats) # the morphological features for the token
  
  print(token.head) # the head of the token
  print(token.deprel) # the dependency relation between the current token and its head

token.ner is set by the ner processor; token.upos and token.feats are set by the pos processor; and token.head and token.deprel are set by the dp processor.
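
For example, with a pipeline that loads only the ner processor, only the NER fields are populated. A minimal sketch, assuming that fields belonging to processors that were not loaded are simply left unset (verify against the toolkit's actual behavior):

from gr_nlp_toolkit import Pipeline

nlp_ner = Pipeline("ner")  # load only the ner processor
doc_ner = nlp_ner("Η Ιταλία κέρδισε την Αγγλία στον τελικό του Euro 2020")
for token in doc_ner.tokens:
    print(token.text, token.ner)  # populated by the ner processor
    # token.upos, token.feats, token.head and token.deprel are not
    # populated here, since the pos and dp processors were not loaded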

A small detail is that to get the Token object that is the head of another token, you need to access doc.tokens[token.head - 1]. The reason is that token enumeration starts from 1, and when token.head is set to 0, the token is the root of the sentence.
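
A minimal sketch that resolves and prints each token's head using this convention (it assumes the doc produced by the full pipeline above, with the dp processor loaded):

# Resolve each token's head; head == 0 marks the sentence root
for token in doc.tokens:
    if token.head == 0:
        print(token.text, "<- ROOT")
    else:
        head_token = doc.tokens[token.head - 1]  # 1-based indexing
        print(token.text, "<-", head_token.text, f"({token.deprel})")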

Alternative download methods for the toolkit models

Currently, the models are served from a Google Drive folder. In case they become unavailable from that source, the models can also be found via archive.org at the following links:

The toolkit currently cannot download the models from these alternative sources automatically, but if you have obtained the model files elsewhere, you can place them in the .cache/gr_nlp_toolkit/ directory of your home folder (~/.cache/gr_nlp_toolkit on Linux systems). Be sure to name the Dependency Parsing model file toolkit-dp, the Named Entity Recognition model file toolkit-ner, and the Part-of-Speech and morphological tagging model file toolkit-pos. This way, the toolkit will not download any models from the Internet and will use the local ones instead.
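
As a quick sanity check, here is a small sketch that verifies the manually placed model files are where the toolkit expects them, using the cache location and file names described above:

from pathlib import Path

# Default cache location described above (~/.cache/gr_nlp_toolkit)
cache_dir = Path.home() / ".cache" / "gr_nlp_toolkit"
for name in ("toolkit-pos", "toolkit-ner", "toolkit-dp"):
    path = cache_dir / name
    print(name, "found" if path.exists() else "missing")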

Download files

Download the file for your platform.

Source Distribution

gr_nlp_toolkit-0.0.4.tar.gz (31.7 kB)

Uploaded Source

Built Distribution

gr_nlp_toolkit-0.0.4-py3-none-any.whl (44.6 kB)

Uploaded Python 3

File details

Details for the file gr_nlp_toolkit-0.0.4.tar.gz.

File metadata

  • Download URL: gr_nlp_toolkit-0.0.4.tar.gz
  • Size: 31.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for gr_nlp_toolkit-0.0.4.tar.gz
SHA256: e7c33fa53bc9f31527eee4c7d5af280240f986f7c0aff6f9c3fa83a56308fb39
MD5: 701c949d4ae85cfbae172fbadb4b1d25
BLAKE2b-256: da0a36c1251c2270c0f8abc4d3f1458a0656e6af87b73bb558b9669b0fcc6206


File details

Details for the file gr_nlp_toolkit-0.0.4-py3-none-any.whl.

File hashes

Hashes for gr_nlp_toolkit-0.0.4-py3-none-any.whl
SHA256: 062d0d9d99496d7c4fc3fb92a692feaa4b79ccef50f935f8c689c2d701b9bffc
MD5: 8374eb5ca63fd9459e5206e0fdbf5064
BLAKE2b-256: 211ae7e6ecb2ae3f61e218036b7960aaabb44465089bbfc1a9bb334735c1c775

