A Transformer-based Natural Language Processing Pipeline for Greek

gr-nlp-toolkit

A Transformer-based Natural Language Processing Pipeline for Greek. The toolkit achieves state-of-the-art accuracy for Greek and provides predictions for Named Entity Recognition, Part-of-Speech Tagging, Morphological Tagging and Dependency Parsing.

Installation

You can install the toolkit by executing the following in the command line:

pip install gr-nlp-toolkit

Usage

To use the toolkit, first initialize a Pipeline, specifying which processors you need. Each processor annotates the text with the annotations of one task.

  • To obtain Part-of-Speech and Morphological Tagging annotations add the pos processor
  • To obtain Named Entity Recognition annotations add the ner processor
  • To obtain Dependency Parsing annotations add the dp processor

from gr_nlp_toolkit import Pipeline
nlp = Pipeline("pos,ner,dp") # Use the pos, ner and dp processors
# nlp = Pipeline("ner,dp") # Use only the ner and dp processors

The first time you use a processor, that processor's data files are downloaded and cached in the .cache folder of your home directory, so you will not have to download them again.
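
The processor specification passed to Pipeline is a comma-separated string of processor names. As a minimal sketch, the string can be assembled from a list so that typos are caught before any model files are downloaded. The helper build_processor_spec and the constant KNOWN_PROCESSORS are hypothetical names introduced here for illustration and are not part of gr-nlp-toolkit. Rejecting an empty list is also an assumption of this sketch.

```python
# Hypothetical helper, not part of gr-nlp-toolkit.
KNOWN_PROCESSORS = ("pos", "ner", "dp")  # processor names listed in this README


def build_processor_spec(names):
    """Return a comma-separated processor string such as "pos,ner,dp"."""
    selected = []
    for name in names:
        cleaned = name.strip().lower()
        if cleaned not in KNOWN_PROCESSORS:
            raise ValueError(f"unknown processor: {name!r}")
        if cleaned not in selected:  # skip duplicates, keep first-seen order
            selected.append(cleaned)
    if not selected:
        # Assumption of this sketch: an empty pipeline is treated as an error.
        raise ValueError("at least one processor is required")
    return ",".join(selected)


spec = build_processor_spec(["ner", "dp"])
print(spec)
# nlp = Pipeline(spec)  # needs gr-nlp-toolkit installed; downloads data files on first use
```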

Generating the annotations

After creating the pipeline, you can annotate a text by calling the pipeline object directly, which invokes its __call__ method.

doc = nlp('Η Ιταλία κέρδισε την Αγγλία στον τελικό του Euro το 2021')

A Document object is then created and annotated. The original text is tokenized, that is, split into tokens.

Accessing the annotations

The following code shows how to access the annotations generated by the toolkit.

for token in doc.tokens:
  token.text # the text of the token
  
  token.ner # the named entity label in IOBES encoding : str
  
  token.upos # the UPOS tag of the token
  token.feats # the morphological features for the token
  
  token.head # the head of the token
  token.deprel # the dependency relation between the current token and its head

token.ner is set by the ner processor, token.upos and token.feats are set by the pos processor, and token.head and token.deprel are set by the dp processor.
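
Because token.ner holds one IOBES label per token, multi-token entities have to be reassembled from consecutive labels. The following is a minimal sketch of such a decoder. It assumes that labels are either "O" or have the form PREFIX-TYPE, with the prefixes B (begin), I (inside), E (end) and S (single). The function iobes_to_spans and the example labels are hypothetical and are not output of the toolkit. With a real document, the input would be [token.ner for token in doc.tokens].

```python
def iobes_to_spans(labels):
    """Group IOBES labels into (start, end, entity_type) spans; end is exclusive."""
    spans = []
    start = None
    current_type = None
    for i, label in enumerate(labels):
        # "O".partition("-") yields ("O", "", ""), so "O" falls through to the else branch.
        prefix, _, entity_type = label.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, entity_type))
            start, current_type = None, None
        elif prefix == "B":
            start, current_type = i, entity_type
        elif prefix == "I" and start is not None and entity_type == current_type:
            continue  # still inside the entity opened by "B"
        elif prefix == "E" and start is not None and entity_type == current_type:
            spans.append((start, i + 1, entity_type))
            start, current_type = None, None
        else:
            # "O" or a malformed sequence: drop any unfinished entity.
            start, current_type = None, None
    return spans


# Made-up labels for illustration only.
example_labels = ["O", "B-PER", "I-PER", "E-PER", "O", "S-LOC"]
print(iobes_to_spans(example_labels))  # [(1, 4, 'PER'), (5, 6, 'LOC')]
```

The returned indices are 0-based positions in the label list, so they can be used directly to slice doc.tokens.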

A small detail: to get the Token object that is the head of another token, you need to access doc.tokens[head-1]. The reason is that the enumeration of the tokens starts from 1, and a token.head value of 0 means that the token is the root of the sentence and therefore has no head.
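
The head lookup described above can be sketched as follows. To keep the example runnable without downloading any model, it uses a stand-in StubToken class that has only the text, head and deprel fields. StubToken, head_of and the English example sentence are hypothetical. With the toolkit, the same function would be applied to doc.tokens.

```python
from dataclasses import dataclass


@dataclass
class StubToken:
    """Stand-in for the toolkit's Token, with only the fields used here."""
    text: str
    head: int    # 1-based index of the head token; 0 marks the root
    deprel: str


def head_of(tokens, token):
    """Return the head Token of `token`, or None if `token` is the root."""
    if token.head == 0:
        return None
    return tokens[token.head - 1]  # heads are 1-based, Python lists are 0-based


tokens = [
    StubToken("She", 2, "nsubj"),
    StubToken("reads", 0, "root"),
    StubToken("books", 2, "obj"),
]

for token in tokens:
    head = head_of(tokens, token)
    head_text = head.text if head is not None else "ROOT"
    print(f"{token.text} --{token.deprel}--> {head_text}")
```

Checking for 0 before indexing matters: tokens[0 - 1] would silently return the last token of the list instead of signalling that the token is the root.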

