A Transformer-based Natural Language Processing Pipeline for Greek
gr-nlp-toolkit
A Transformer-based natural language processing toolkit for (modern) Greek. The toolkit achieves state-of-the-art performance in Greek and supports named entity recognition, part-of-speech tagging, morphological tagging, and dependency parsing. Additionally, the toolkit can convert Greeklish text (Greek written using Latin characters) into standard Greek.
Installation
You can install the toolkit from PyPI by executing the following in the command line:
pip install gr-nlp-toolkit
Alternatively, you can clone this repository and set up a virtual environment using the requirements.txt file. (Development was done using Python version 3.9)
Usage
Available Processors
To use the toolkit, first initialize a Pipeline, specifying which processors you need. Each processor annotates the text with the annotations of a specific task.
- To obtain Part-of-Speech and Morphological Tagging annotations, add the pos processor
- To obtain Named Entity Recognition annotations, add the ner processor
- To obtain Dependency Parsing annotations, add the dp processor
- To enable the transliteration from Greeklish to Greek, add the g2g processor, or the g2g_lite processor for a lighter but less accurate model (Greeklish to Greek transliteration example: Thessalonikh -> Θεσσαλονίκη)
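The processor names above are passed to the Pipeline as one comma-separated string, such as "pos,ner,dp". As a minimal sketch, the helper below checks a list of names against the processors listed in this README and joins them into that form. The name build_processor_spec is hypothetical and is not part of the toolkit.

```python
# Processor names listed in this README.
KNOWN_PROCESSORS = ("pos", "ner", "dp", "g2g", "g2g_lite")


def build_processor_spec(names):
    """Hypothetical helper (not part of the toolkit).

    Validates processor names and joins them into a comma-separated
    string such as "pos,ner,dp", dropping duplicates while keeping order.
    """
    selected = []
    for name in names:
        name = name.strip().lower()
        if name not in KNOWN_PROCESSORS:
            raise ValueError(f"Unknown processor: {name!r}")
        if name not in selected:
            selected.append(name)
    return ",".join(selected)


spec = build_processor_spec(["pos", "ner", "dp"])
print(spec)  # pos,ner,dp
```

The resulting string can then be passed as the argument to Pipeline, in the same way as the literal strings in the usage scenarios.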
Example Usage Scenarios
- Greeklish to Greek Conversion

    from gr_nlp_toolkit import Pipeline

    nlp = Pipeline("g2g")  # Instantiate the pipeline with the g2g processor
    doc = nlp("O Volos kai h Larisa einai sthn Thessalia")  # Apply the pipeline to a sentence
    print(doc.text)  # Access the transliterated text
- DP, POS, NER processors

    from gr_nlp_toolkit import Pipeline

    nlp = Pipeline("pos,ner,dp")  # Instantiate the Pipeline with the DP, POS and NER processors
    doc = nlp("Η Ιταλία κέρδισε την Αγγλία στον τελικό του Euro 2020.")  # Apply the pipeline to a sentence
A Document object is created and annotated. The original text is tokenized and split into tokens.

    # Iterate over the generated tokens
    for token in doc.tokens:
        print(token.text)    # the text of the token
        print(token.ner)     # the named entity label in IOBES encoding : str
        print(token.upos)    # the UPOS tag of the token
        print(token.feats)   # the morphological features for the token
        print(token.head)    # the head of the token
        print(token.deprel)  # the dependency relation between the current token and its head

token.ner is set by the ner processor, token.upos and token.feats are set by the pos processor, and token.head and token.deprel are set by the dp processor.

A small detail: to get the Token object that is the head of another token, you need to access doc.tokens[head-1]. The reason is that the enumeration of the tokens starts from 1; when the field token.head is set to 0, the token is the root of the sentence and has no head token.
- Use all the processors together

    from gr_nlp_toolkit import Pipeline

    nlp = Pipeline("pos,ner,dp,g2g")  # Instantiate the Pipeline with the G2G, DP, POS and NER processors
    doc = nlp("O Volos kai h Larisa einai sthn Thessalia")  # Apply the pipeline to a sentence
    print(doc.text)  # Print the transliterated text

    # Iterate over the generated tokens
    for token in doc.tokens:
        print(token.text)    # the text of the token
        print(token.ner)     # the named entity label in IOBES encoding : str
        print(token.upos)    # the UPOS tag of the token
        print(token.feats)   # the morphological features for the token
        print(token.head)    # the head of the token
        print(token.deprel)  # the dependency relation between the current token and its head
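The 1-based head numbering described in the DP scenario is easy to get wrong by one, so the following sketch isolates the lookup. It uses a small stand-in Token dataclass so that it runs without downloading any model. The helper head_of is hypothetical, and the head and deprel values are hand-written for illustration; they are not output of the toolkit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Stand-in for the toolkit's Token, reduced to the fields used here."""
    text: str
    head: int    # 1-based index of the head token; 0 marks the root
    deprel: str


def head_of(tokens: List[Token], token: Token) -> Optional[Token]:
    """Hypothetical helper: return the head Token, or None for the root."""
    if token.head == 0:
        return None
    return tokens[token.head - 1]  # shift from 1-based to 0-based indexing


# Hand-written annotations for illustration only.
tokens = [
    Token("Η", 2, "det"),
    Token("Ιταλία", 3, "nsubj"),
    Token("κέρδισε", 0, "root"),
]

for token in tokens:
    head = head_of(tokens, token)
    print(token.text, "<-", head.text if head else "ROOT")
```

With a real Document, the same function should work on doc.tokens, assuming the dp processor was enabled so that token.head is set.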
Notes:
- If the input text is already in Greek, the G2G processor is skipped
- The first time you use a processor, the models are downloaded from Hugging Face and stored in the .cache folder. The NER, DP and POS processors are each about 500 MB, while the G2G processor is about 1.2 GB in size
- If your machine has an accelerator but you want to run the process on the CPU, you can pass the flag use_cpu=True to the Pipeline object. By default, this flag is set to False.
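The token.ner values in the usage scenarios are IOBES-encoded, one label per token. A downstream step often needs whole entity spans instead. The sketch below groups a list of such labels into spans, assuming labels of the form prefix-type (for example "B-ORG") with "O" for tokens outside any entity. The function name iobes_to_spans and the label types in the example are illustrative assumptions, not part of the toolkit.

```python
def iobes_to_spans(tags):
    """Group IOBES tags into (start, end_exclusive, label) spans.

    Assumes tags look like "B-ORG", "I-ORG", "E-ORG", "S-ORG" or "O".
    Malformed sequences (e.g. an "E-" without a matching "B-") are dropped.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "S":                      # single-token entity
            spans.append((i, i + 1, kind))
            start, label = None, None
        elif prefix == "B":                    # entity begins
            start, label = i, kind
        elif prefix == "E" and start is not None and kind == label:
            spans.append((start, i + 1, kind))  # entity ends
            start, label = None, None
        elif prefix == "I" and start is not None and kind == label:
            continue                           # inside the current entity
        else:                                  # "O" or a malformed sequence
            start, label = None, None
    return spans


# Illustrative labels, not toolkit output.
tags = ["O", "S-ORG", "O", "B-EVENT", "E-EVENT", "O"]
print(iobes_to_spans(tags))  # [(1, 2, 'ORG'), (3, 5, 'EVENT')]
```

With a real Document, the input would be [token.ner for token in doc.tokens], and each span's start and end index back into doc.tokens.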
Hugging Face repositories
- ByT5-g2g: https://huggingface.co/AUEB-NLP/ByT5_g2g
- gr-nlp-toolkit: https://huggingface.co/AUEB-NLP/gr-nlp-toolkit
References
C. Dikonimaki, "A Transformer-based natural language processing toolkit for Greek -- Part of speech tagging and dependency parsing", BSc thesis, Department of Informatics, Athens University of Economics and Business, 2021. http://nlp.cs.aueb.gr/theses/dikonimaki_bsc_thesis.pdf
N. Smyrnioudis, "A Transformer-based natural language processing toolkit for Greek -- Named entity recognition and multi-task learning", BSc thesis, Department of Informatics, Athens University of Economics and Business, 2021. http://nlp.cs.aueb.gr/theses/smyrnioudis_bsc_thesis.pdf
Toumazatos, A., Pavlopoulos, J., Androutsopoulos, I., & Vassos, S. (2024). Still All Greeklish to Me: Greeklish to Greek Transliteration. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 15309–15319). ELRA and ICCL.