A Transformer-based Natural Language Processing Pipeline for Greek
gr-nlp-toolkit
A Transformer-based natural language processing toolkit for (modern) Greek. The toolkit has state-of-the-art performance in Greek and supports named entity recognition, part-of-speech tagging, morphological tagging, as well as dependency parsing. For more information, please consult the following theses:
C. Dikonimaki, "A Transformer-based natural language processing toolkit for Greek -- Part of speech tagging and dependency parsing", BSc thesis, Department of Informatics, Athens University of Economics and Business, 2021. http://nlp.cs.aueb.gr/theses/dikonimaki_bsc_thesis.pdf
N. Smyrnioudis, "A Transformer-based natural language processing toolkit for Greek -- Named entity recognition and multi-task learning", BSc thesis, Department of Informatics, Athens University of Economics and Business, 2021. http://nlp.cs.aueb.gr/theses/smyrnioudis_bsc_thesis.pdf
Installation
You can install the toolkit by executing the following in the command line:
pip install gr-nlp-toolkit
Usage
To use the toolkit, first initialize a Pipeline, specifying which processors you need. Each processor annotates the text with a specific task's annotations.
- To obtain Part-of-Speech and Morphological Tagging annotations, add the pos processor
- To obtain Named Entity Recognition annotations, add the ner processor
- To obtain Dependency Parsing annotations, add the dp processor
from gr_nlp_toolkit import Pipeline
nlp = Pipeline("pos,ner,dp") # Use ner,pos,dp processors
# nlp = Pipeline("ner,dp") # Use only ner and dp processors
The first time you use a processor, its data files are cached in the .cache folder of your home directory, so that you will not have to download them again. Each processor is about 500 MB in size, so the total download size can be up to 1.5 GB.
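As a quick sanity check, you can see which models are already cached without importing the toolkit. The snippet below is a sketch that assumes the cache location and the model file names (toolkit-pos, toolkit-ner, toolkit-dp) described in the alternative-download section of this page.

```python
from pathlib import Path

# Check which processor models are already cached locally.
# Assumes the cache lives at ~/.cache/gr_nlp_toolkit, as described
# in the alternative-download section of this page.
cache_dir = Path.home() / ".cache" / "gr_nlp_toolkit"
for model in ("toolkit-pos", "toolkit-ner", "toolkit-dp"):
    cached = (cache_dir / model).exists()
    print(f"{model}: {'cached' if cached else 'not downloaded yet'}")
```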
Generating the annotations
After creating the pipeline you can annotate a text by calling the pipeline's __call__
method.
doc = nlp('Η Ιταλία κέρδισε την Αγγλία στον τελικό του Euro 2020')
A Document object is then created and annotated. The original text is tokenized and split into tokens.
Accessing the annotations
The following code explains how you can access the annotations generated by the toolkit.
for token in doc.tokens:
    print(token.text)    # the text of the token
    print(token.ner)     # the named entity label, in IOBES encoding : str
    print(token.upos)    # the UPOS tag of the token
    print(token.feats)   # the morphological features of the token
    print(token.head)    # the (1-based) index of the token's head
    print(token.deprel)  # the dependency relation between the token and its head
token.ner is set by the ner processor; token.upos and token.feats are set by the pos processor; token.head and token.deprel are set by the dp processor.
A small detail is that to get the Token object that is the head of another token, you need to access doc.tokens[head-1], because the enumeration of the tokens starts from 1. When token.head is set to 0, the token is the root of the sentence.
Alternative download methods for the toolkit models
Currently, the models are served from a Google Drive folder. In case they become unavailable from that source, the models can also be found on archive.org at the following links:
- Dependency Parsing model: https://archive.org/details/toolkit-dp
- Named Entity Recognition model: https://archive.org/details/toolkit-ner
- Part-of-Speech and morphological tagging model: https://archive.org/details/toolkit-pos
The toolkit currently cannot download the models from these sources itself, but if you have downloaded the models via an alternative source, you can place the files in the .cache/gr_nlp_toolkit/ directory of your home folder (~/.cache/gr_nlp_toolkit on Linux systems). Be sure to name the Dependency Parsing model file toolkit-dp, the Named Entity Recognition model file toolkit-ner, and the Part-of-Speech and morphological tagging model file toolkit-pos. This way, the toolkit will not download any models from the Internet and will use the local ones instead.
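A minimal helper for this manual installation might look as follows. The function name and the downloads directory are illustrative; only the cache path and the three model file names come from the instructions above.

```python
import shutil
from pathlib import Path

# File names the toolkit expects, per the instructions above.
MODEL_NAMES = ("toolkit-dp", "toolkit-ner", "toolkit-pos")

def install_local_models(downloads_dir, cache_dir=None):
    """Copy manually downloaded model files into the toolkit's cache
    directory, keeping the exact file names the toolkit expects."""
    cache_dir = Path(cache_dir or Path.home() / ".cache" / "gr_nlp_toolkit")
    cache_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in MODEL_NAMES:
        src = Path(downloads_dir).expanduser() / name
        if src.exists():
            shutil.copy(src, cache_dir / name)
            installed.append(name)
    return installed
```

For example, if the files were saved to your Downloads folder, `install_local_models(Path.home() / "Downloads")` would copy whichever of the three model files it finds there.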
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file gr_nlp_toolkit-0.0.4.tar.gz
.
File metadata
- Download URL: gr_nlp_toolkit-0.0.4.tar.gz
- Upload date:
- Size: 31.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | e7c33fa53bc9f31527eee4c7d5af280240f986f7c0aff6f9c3fa83a56308fb39
MD5 | 701c949d4ae85cfbae172fbadb4b1d25
BLAKE2b-256 | da0a36c1251c2270c0f8abc4d3f1458a0656e6af87b73bb558b9669b0fcc6206
File details
Details for the file gr_nlp_toolkit-0.0.4-py3-none-any.whl
.
File metadata
- Download URL: gr_nlp_toolkit-0.0.4-py3-none-any.whl
- Upload date:
- Size: 44.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 062d0d9d99496d7c4fc3fb92a692feaa4b79ccef50f935f8c689c2d701b9bffc
MD5 | 8374eb5ca63fd9459e5206e0fdbf5064
BLAKE2b-256 | 211ae7e6ecb2ae3f61e218036b7960aaabb44465089bbfc1a9bb334735c1c775