A Transformer-based Natural Language Processing Pipeline for Greek
gr-nlp-toolkit
A Transformer-based Natural Language Processing Pipeline for Greek. The toolkit achieves state-of-the-art accuracy on Greek and provides predictions for Named Entity Recognition, Part-of-Speech Tagging, Morphological Tagging, and Dependency Parsing.
Installation
You can install the toolkit by executing the following in the command line:
pip install gr-nlp-toolkit
Usage
To use the toolkit, first initialize a Pipeline, specifying which processors you need. Each processor annotates the text with the annotations of one task.
- To obtain Part-of-Speech and Morphological Tagging annotations, add the pos processor
- To obtain Named Entity Recognition annotations, add the ner processor
- To obtain Dependency Parsing annotations, add the dp processor
from gr_nlp_toolkit import Pipeline
nlp = Pipeline("pos,ner,dp") # Use ner,pos,dp processors
# nlp = Pipeline("ner,dp") # Use only ner and dp processors
The first time you use a processor, its data files are downloaded and cached in the .cache folder of your home directory, so you will not have to download them again.
Generating the annotations
After creating the pipeline, you can annotate a text by calling the pipeline's __call__ method, that is, by calling the pipeline object directly.
doc = nlp('Η Ιταλία κέρδισε την Αγγλία στον τελικό του Euro το 2021')
A Document object is then created and annotated. The original text is tokenized and split into tokens.
Accessing the annotations
The following code shows how to access the annotations generated by the toolkit.
for token in doc.tokens:
    token.text # the text of the token
    token.ner # the named entity label in IOBES encoding : str
    token.upos # the UPOS tag of the token
    token.feats # the morphological features for the token
    token.head # the head of the token
    token.deprel # the dependency relation between the current token and its head
token.ner is set by the ner processor, token.upos and token.feats are set by the pos processor, and token.head and token.deprel are set by the dp processor.
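Because token.ner holds one IOBES tag per token, multi-token entities have to be reassembled from consecutive tags. The following is a minimal sketch of one way to do that. It assumes tags look like "B-TYPE", "I-TYPE", "E-TYPE", "S-TYPE", or "O"; the helper name iobes_to_spans and the label names in the example are illustrative and are not part of the toolkit.

```python
from typing import List, Optional, Tuple


def iobes_to_spans(texts: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOBES-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper: assumes tags of the form PREFIX-TYPE or "O".
    """
    spans: List[Tuple[str, str]] = []
    words: List[str] = []
    open_type: Optional[str] = None

    for text, tag in zip(texts, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-token entity: emit it immediately.
            spans.append((text, label))
            words, open_type = [], None
        elif prefix == "B":
            # Beginning of a multi-token entity.
            words, open_type = [text], label
        elif prefix in ("I", "E") and open_type == label:
            # Continuation of the currently open entity.
            words.append(text)
            if prefix == "E":
                spans.append((" ".join(words), label))
                words, open_type = [], None
        else:
            # "O" or an inconsistent tag: discard any open entity.
            words, open_type = [], None
    return spans


# Hand-written tags for illustration only, not actual toolkit output.
example_texts = ["Η", "Νέα", "Υόρκη", "και", "η", "Αθήνα"]
example_tags = ["O", "B-GPE", "E-GPE", "O", "O", "S-GPE"]
print(iobes_to_spans(example_texts, example_tags))
```

With a real Document, the two input lists could be built as [t.text for t in doc.tokens] and [t.ner for t in doc.tokens].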
A small detail: to get the Token object that is the head of another token, you need to access doc.tokens[head-1]. The reason is that token enumeration starts from 1, and a token.head value of 0 means the token is the root of the sentence.
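The 1-based head index is easy to get wrong, so it can help to wrap the lookup in a small function. The sketch below uses a hypothetical StubToken dataclass in place of the toolkit's Token so that it runs without downloading any models; the helper name head_of and the head and deprel values in the example are hand-written assumptions, not toolkit output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StubToken:
    """Hypothetical stand-in exposing only the fields used here."""
    text: str
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str


def head_of(tokens: List[StubToken], token: StubToken) -> Optional[StubToken]:
    """Return the head token, or None when the token is the root."""
    if token.head == 0:
        return None
    # Heads are numbered from 1, while Python lists are indexed from 0.
    return tokens[token.head - 1]


# Illustrative annotations for "Η Ιταλία κέρδισε".
tokens = [
    StubToken("Η", 2, "det"),
    StubToken("Ιταλία", 3, "nsubj"),
    StubToken("κέρδισε", 0, "root"),
]

for t in tokens:
    h = head_of(tokens, t)
    print(t.text, t.deprel, h.text if h is not None else "ROOT")
```

The same function should work on doc.tokens, assuming token.head is an integer as the indexing rule above implies.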
Hashes for gr_nlp_toolkit-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 401170b8b872bcad9d72f7da57327d8a4434c897336a11662df0d6fe56d934da
MD5 | f5e9a2e24dc41a0b78d6141e75cc68f9
BLAKE2b-256 | c43a92cff76adba996f24a662ca66f52cc85bd6fdb66fbae0926eedde271285f