Adapted Stanford NLP Python Library with improvements for specific languages.
Project description
A CLASSLA Fork of Stanza for Processing Slovenian, Croatian, Serbian, Macedonian and Bulgarian
Description
This pipeline allows for processing of standard Slovenian, Croatian, Serbian and Bulgarian on the levels of
- tokenization and sentence splitting
- part-of-speech tagging
- lemmatization
- dependency parsing
- named entity recognition
It also allows for processing of standard Macedonian on the levels of
- tokenization and sentence splitting
- part-of-speech tagging
- lemmatization
Finally, it allows for processing of non-standard (Internet) Slovenian, Croatian and Serbian on the same levels as the standard languages (all models are tailored to non-standard language except for dependency parsing, where the standard module is used).
Differences to Stanza
The differences between this pipeline and the original Stanza pipeline are the following:
- usage of language-specific rule-based tokenizers and sentence splitters, obeliks for standard Slovenian and reldi-tokeniser for the remaining varieties and languages (Stanza uses inferior machine-learning-based tokenization and sentence splitting trained on UD data)
- default pre-tagging and pre-lemmatization on the level of tokenizers for the following phenomena: punctuation, symbol, e-mail, URL, mention, hashtag, emoticon, emoji (usage documented here)
- optional control of the tagger for Slovenian via an inflectional lexicon on the levels of XPOS, UPOS, FEATS (usage documented here)
- closed class handling depending on the usage of the options described in the last two bullets, as documented here
- usage of external inflectional lexicons for lookup lemmatization, seq2seq being used very infrequently on OOVs only (Stanza uses only UD training data for lookup lemmatization)
- morphosyntactic tagging models based on larger quantities of training data than is available in UD (training data that are morphosyntactically tagged, but not UD-parsed)
- lemmatization models based on larger quantities of training data than is available in UD (training data that are lemmatized, but not UD-parsed)
- optional JOS-project-based parsing of Slovenian (usage documented here)
- named entity recognition models for all languages except Macedonian (Stanza does not cover named entity recognition for any of the languages supported by classla)
- Macedonian models (Macedonian is not available in UD yet)
- non-standard models for Croatian, Slovenian, Serbian (there is no UD data for these varieties)
The above modifications led to important improvements in the tool's performance compared to the original Stanza. For standard Slovenian, a comparison of CLASSLA-Stanza with Stanza on the SloBENCH benchmark shows a relative error reduction (the share of Stanza's errors removed by moving to CLASSLA-Stanza) of 98% on sentence segmentation, 50% on token segmentation, 69% on lemmatization, 65% on morphosyntactic XPOS tagging, and 34% on dependency parsing.
Citing
If you use this tool, please cite the following papers:
@inproceedings{ljubesic-dobrovoljc-2019-neural,
title = "What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of {S}lovenian, {C}roatian and {S}erbian",
author = "Ljube{\v{s}}i{\'c}, Nikola and
Dobrovoljc, Kaja",
booktitle = "Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing",
month = aug,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W19-3704",
doi = "10.18653/v1/W19-3704",
pages = "29--34"
}
@misc{terčon2023classlastanza,
title={CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages},
author={Luka Terčon and Nikola Ljubešić},
year={2023},
eprint={2308.04255},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Installation
pip
We recommend that you install CLASSLA via pip, the Python package manager. To install, run:
pip install classla
This will also resolve all dependencies.
NOTE TO EXISTING USERS: Once you install this version of classla, you will HAVE TO re-download the models, as previously downloaded models will no longer be used. We suggest you delete the old models; their default location is ~/classla_resources.
Running CLASSLA
Getting started
To process standard Slovenian with the CLASSLA pipeline for the first time, follow these steps:
>>> import classla
>>> classla.download('sl') # download standard models for Slovenian, use hr for Croatian, sr for Serbian, bg for Bulgarian, mk for Macedonian
>>> nlp = classla.Pipeline('sl') # initialize the default Slovenian pipeline, use hr for Croatian, sr for Serbian, bg for Bulgarian, mk for Macedonian
>>> doc = nlp("France Prešeren je rojen v Vrbi.") # run the pipeline
>>> print(doc.to_conll()) # print the output in CoNLL-U format
# newpar id = 1
# sent_id = 1.1
# text = France Prešeren je rojen v Vrbi.
1 France France PROPN Npmsn Case=Nom|Gender=Masc|Number=Sing 4 nsubj _ NER=B-PER
2 Prešeren Prešeren PROPN Npmsn Case=Nom|Gender=Masc|Number=Sing 1 flat:name _ NER=I-PER
3 je biti AUX Va-r3s-n Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin 4 cop _ NER=O
4 rojen rojen ADJ Appmsnn Case=Nom|Definite=Ind|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part 0 root _ NER=O
5 v v ADP Sl Case=Loc 6 case _ NER=O
6 Vrbi Vrba PROPN Npfsl Case=Loc|Gender=Fem|Number=Sing 4 obl _ NER=B-LOC|SpaceAfter=No
7 . . PUNCT Z _ 4 punct _ NER=O
You can find examples of standard language processing for Croatian, Serbian, Macedonian and Bulgarian at the end of this document.
Processing non-standard language
Processing non-standard Slovenian differs from the standard example above only by the additional argument type="nonstandard":
>>> import classla
>>> classla.download('sl', type='nonstandard') # download non-standard models for Slovenian, use hr for Croatian and sr for Serbian
>>> nlp = classla.Pipeline('sl', type='nonstandard') # initialize the default non-standard Slovenian pipeline, use hr for Croatian and sr for Serbian
>>> doc = nlp("kva smo mi zurali zadnje leto v zagrebu...") # run the pipeline
>>> print(doc.to_conll()) # print the output in CoNLL-U format
# newpar id = 1
# sent_id = 1.1
# text = kva smo mi zurali zadnje leto v zagrebu...
1 kva kaj PRON Pq-nsa Case=Acc|Gender=Neut|Number=Sing|PronType=Int 4 obj _ NER=O
2 smo biti AUX Va-r1p-n Mood=Ind|Number=Plur|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin 4 aux _ NER=O
3 mi jaz PRON Pp1mpn Case=Nom|Gender=Masc|Number=Plur|Person=1|PronType=Prs 4 nsubj _ NER=O
4 zurali zurati VERB Vmpp-pm Aspect=Imp|Gender=Masc|Number=Plur|VerbForm=Part 0 root _ NER=O
5 zadnje zadnji ADJ Agpnsa Case=Acc|Degree=Pos|Gender=Neut|Number=Sing 6 amod _ NER=O
6 leto leto NOUN Ncnsa Case=Acc|Gender=Neut|Number=Sing 4 obl _ NER=O
7 v v ADP Sl Case=Loc 8 case _ NER=O
8 zagrebu Zagreb PROPN Npmsl Case=Loc|Gender=Masc|Number=Sing 4 obl _ NER=B-LOC|SpaceAfter=No
9 ... ... PUNCT Z _ 4 punct _ NER=O
You can find examples of non-standard language processing for Croatian and Serbian at the end of this document.
For additional usage examples you can also consult the pipeline_demo.py file.
Processing online texts
A special web processing mode for texts obtained from the internet can be activated with the type="web" argument:
>>> import classla
>>> classla.download('sl', type='web') # download web models for Slovenian, use hr for Croatian and sr for Serbian
>>> nlp = classla.Pipeline('sl', type='web') # initialize the default Slovenian web pipeline, use hr for Croatian and sr for Serbian
>>> doc = nlp("Kdor hoce prenesti preko racunalnika http://t.co/LwWyzs0cA0") # run the pipeline
>>> print(doc.to_conll()) # print the output in CoNLL-U format
# newpar id = 1
# sent_id = 1.1
# text = Kdor hoce prenesti preko racunalnika http://t.co/LwWyzs0cA0
1 Kdor kdor PRON Pr-msn Case=Nom|Gender=Masc|Number=Sing|PronType=Rel 2 nsubj _ NER=O
2 hoce hoteti VERB Vmpr3s-n Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin 0 root _ NER=O
3 prenesti prenesti VERB Vmen Aspect=Perf|VerbForm=Inf 2 xcomp _ NER=O
4 preko preko ADP Sg Case=Gen 5 case _ NER=O
5 racunalnika računalnik NOUN Ncmsg Case=Gen|Gender=Masc|Number=Sing 3 obl _ NER=O
6 http://t.co/LwWyzs0cA0 http://t.co/LwWyzs0cA0 SYM Xw _ 5 nmod _ NER=O
Processors
The CLASSLA pipeline is built from multiple units. These units are called processors. By default CLASSLA runs the tokenize, ner, pos, lemma and depparse processors.
You can specify which processors CLASSLA should run via the processors attribute, as in the following example, which performs tokenization, named entity recognition, part-of-speech tagging and lemmatization:
>>> nlp = classla.Pipeline('sl', processors='tokenize,ner,pos,lemma')
Another popular option might be to perform tokenization, part-of-speech tagging, lemmatization and dependency parsing.
>>> nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse')
Tokenization and sentence splitting
The tokenization and sentence splitting processor tokenize is the first processor and is required for any further processing.
If you already have tokenized text, separate the tokens with spaces and pass the attribute tokenize_pretokenized=True, as in the sketch below.
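A minimal sketch of running on pretokenized input, assuming that, as in Stanza, spaces separate tokens and newlines separate sentences:
>>> import classla
>>> nlp = classla.Pipeline('sl', processors='tokenize', tokenize_pretokenized=True)
>>> doc = nlp("France Prešeren je rojen v Vrbi .")  # input is kept as given: spaces between tokens, newlines between sentences
>>> print(doc.to_conll())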
By default CLASSLA uses a rule-based tokenizer: obeliks for the standard Slovenian pipeline and reldi-tokeniser in all other cases.
Part-of-speech tagging
The POS tagging processor pos produces output containing morphosyntactic descriptions following the MULTEXT-East standard, as well as universal part-of-speech tags and universal features following the Universal Dependencies standard. It requires the tokenize processor.
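If you prefer to read annotations from the document object rather than the CoNLL-U output, the sketch below uses the word-level attributes of the Stanza document model, which classla inherits (treat the exact attribute names as an assumption rather than documented classla API):
>>> doc = nlp("France Prešeren je rojen v Vrbi.")
>>> word = doc.sentences[0].words[0]
>>> print(word.text, word.upos, word.xpos, word.feats)  # e.g. France PROPN Npmsn Case=Nom|Gender=Masc|Number=Sing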
Lemmatization
The lemmatization processor lemma produces lemmas (base forms) for each token in the input. It requires both the tokenize and pos processors.
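Continuing the sketch above, lemmas are exposed on the same word objects (again assuming Stanza-style attributes):
>>> print(doc.sentences[0].words[2].text, doc.sentences[0].words[2].lemma)  # e.g. je biti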
Dependency parsing
The dependency parsing processor depparse performs syntactic dependency parsing of sentences following the Universal Dependencies formalism. It requires the tokenize and pos processors.
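A short sketch of reading the parse from the document object, where head is the 1-based index of the governing word (0 marks the root) and deprel is the Universal Dependencies relation (attribute names again assumed to follow Stanza):
>>> for word in doc.sentences[0].words:
...     print(word.id, word.text, word.head, word.deprel)  # e.g. 1 France 4 nsubj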
Named entity recognition
The named entity recognition processor ner identifies named entities in text following the IOB2 format. It requires only the tokenize processor.
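The labels also appear as NER= entries in the misc column of the CoNLL-U output shown above. A sketch of reading them token by token, assuming the token-level ner attribute of the Stanza document model:
>>> for token in doc.sentences[0].tokens:
...     print(token.text, token.ner)  # e.g. France B-PER, Prešeren I-PER, je O, ...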
Croatian examples
Example of standard Croatian
>>> import classla
>>> nlp = classla.Pipeline('hr') # run classla.download('hr') beforehand if necessary
>>> doc = nlp("Ante Starčević rođen je u Velikom Žitniku.")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = Ante Starčević rođen je u Velikom Žitniku.
1 Ante Ante PROPN Npmsn Case=Nom|Gender=Masc|Number=Sing 3 nsubj _ NER=B-PER
2 Starčević Starčević PROPN Npmsn Case=Nom|Gender=Masc|Number=Sing 1 flat _ NER=I-PER
3 rođen roditi ADJ Appmsnn Case=Nom|Definite=Ind|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass 0 root _ NER=O
4 je biti AUX Var3s Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 aux _ NER=O
5 u u ADP Sl Case=Loc 7 case _ NER=O
6 Velikom velik ADJ Agpmsly Case=Loc|Definite=Def|Degree=Pos|Gender=Masc|Number=Sing 7 amod _ NER=B-LOC
7 Žitniku Žitnik PROPN Npmsl Case=Loc|Gender=Masc|Number=Sing 3 obl _ NER=I-LOC|SpaceAfter=No
8 . . PUNCT Z _ 3 punct _ NER=O
Example of non-standard Croatian
>>> import classla
>>> nlp = classla.Pipeline('hr', type='nonstandard') # run classla.download('hr', type='nonstandard') beforehand if necessary
>>> doc = nlp("kaj sam ja tulumaril jucer u ljubljani...")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = kaj sam ja tulumaril jucer u ljubljani...
1 kaj što PRON Pq3n-a Case=Acc|Gender=Neut|PronType=Int,Rel 4 obj _ NER=O
2 sam biti AUX Var1s Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 4 aux _ NER=O
3 ja ja PRON Pp1-sn Case=Nom|Number=Sing|Person=1|PronType=Prs 4 nsubj _ NER=O
4 tulumaril tulumariti VERB Vmp-sm Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act 0 root _ NER=O
5 jucer jučer ADV Rgp Degree=Pos 4 advmod _ NER=O
6 u u ADP Sl Case=Loc 7 case _ NER=O
7 ljubljani Ljubljana PROPN Npfsl Case=Loc|Gender=Fem|Number=Sing 4 obl _ NER=B-LOC|SpaceAfter=No
8 ... ... PUNCT Z _ 4 punct _ NER=O
Serbian examples
Example of standard Serbian
>>> import classla
>>> nlp = classla.Pipeline('sr') # run classla.download('sr') beforehand if necessary
>>> doc = nlp("Slobodan Jovanović rođen je u Novom Sadu.")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = Slobodan Jovanović rođen je u Novom Sadu.
1 Slobodan Slobodan PROPN Npmsn Case=Nom|Gender=Masc|Number=Sing 3 nsubj _ NER=B-PER
2 Jovanović Jovanović PROPN Npmsn Case=Nom|Gender=Masc|Number=Sing 1 flat _ NER=I-PER
3 rođen roditi ADJ Appmsnn Case=Nom|Definite=Ind|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass 0 root _ NER=O
4 je biti AUX Var3s Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 aux _ NER=O
5 u u ADP Sl Case=Loc 7 case _ NER=O
6 Novom nov ADJ Agpmsly Case=Loc|Definite=Def|Degree=Pos|Gender=Masc|Number=Sing 7 amod _ NER=B-LOC
7 Sadu Sad PROPN Npmsl Case=Loc|Gender=Masc|Number=Sing 3 obl _ NER=I-LOC|SpaceAfter=No
8 . . PUNCT Z _ 3 punct _ NER=O
Example of non-standard Serbian
>>> import classla
>>> nlp = classla.Pipeline('sr', type='nonstandard') # run classla.download('sr', type='nonstandard') beforehand if necessary
>>> doc = nlp("ne mogu da verujem kakvo je zezanje bilo prosle godine u zagrebu...")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = ne mogu da verujem kakvo je zezanje bilo prosle godine u zagrebu...
1 ne ne PART Qz Polarity=Neg 2 advmod _ NER=O
2 mogu moći VERB Vmr1s Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 0 root _ NER=O
3 da da SCONJ Cs _ 4 mark _ NER=O
4 verujem verovati VERB Vmr1s Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 2 xcomp _ NER=O
5 kakvo kakav DET Pi-nsn Case=Nom|Gender=Neut|Number=Sing|PronType=Int,Rel 4 ccomp _ NER=O
6 je biti AUX Var3s Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 5 aux _ NER=O
7 zezanje zezanje NOUN Ncnsn Case=Nom|Gender=Neut|Number=Sing 8 nsubj _ NER=O
8 bilo biti AUX Vap-sn Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act 5 cop _ NER=O
9 prosle prošli ADJ Agpfsgy Case=Gen|Definite=Def|Degree=Pos|Gender=Fem|Number=Sing 10 amod _ NER=O
10 godine godina NOUN Ncfsg Case=Gen|Gender=Fem|Number=Sing 8 obl _ NER=O
11 u u ADP Sl Case=Loc 12 case _ NER=O
12 zagrebu Zagreb PROPN Npmsl Case=Loc|Gender=Masc|Number=Sing 8 obl _ NER=B-LOC|SpaceAfter=No
13 ... ... PUNCT Z _ 2 punct _ NER=O
Bulgarian examples
Example of standard Bulgarian
>>> import classla
>>> nlp = classla.Pipeline('bg') # run classla.download('bg') beforehand if necessary
>>> doc = nlp("Алеко Константинов е роден в Свищов.")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = Алеко Константинов е роден в Свищов.
1 Алеко алеко PROPN Npmsi Definite=Ind|Gender=Masc|Number=Sing 4 nsubj:pass _ NER=B-PER
2 Константинов константинов PROPN Hmsi Definite=Ind|Gender=Masc|Number=Sing 1 flat _ NER=I-PER
3 е съм AUX Vxitf-r3s Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 4 aux:pass _ NER=O
4 роден родя-(се) VERB Vpptcv--smi Aspect=Perf|Definite=Ind|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass 0 root _ NER=O
5 в в ADP R _ 6 case _ NER=O
6 Свищов свищов PROPN Npmsi Definite=Ind|Gender=Masc|Number=Sing 4 iobj _ NER=B-LOC|SpaceAfter=No
7 . . PUNCT punct _ 4 punct _ NER=O
Macedonian examples
Example of standard Macedonian
>>> import classla
>>> nlp = classla.Pipeline('mk') # run classla.download('mk') beforehand if necessary
>>> doc = nlp('Крсте Петков Мисирков е роден во Постол.')
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = Крсте Петков Мисирков е роден во Постол.
1 Крсте Крсте PROPN Npmsnn Case=Nom|Definite=Ind|Gender=Masc|Number=Sing _ _ _ _
2 Петков Петков PROPN Npmsnn Case=Nom|Definite=Ind|Gender=Masc|Number=Sing _ _ _ _
3 Мисирков Мисирков PROPN Npmsnn Case=Nom|Definite=Ind|Gender=Masc|Number=Sing _ _ _ _
4 е сум AUX Vapip3s-n Aspect=Prog|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres _ _ _ _
5 роден роден ADJ Ap-ms-n Definite=Ind|Gender=Masc|Number=Sing|VerbForm=Part _ _ _ _
6 во во ADP Sps AdpType=Prep _ _ _ _
7 Постол Постол PROPN Npmsnn Case=Nom|Definite=Ind|Gender=Masc|Number=Sing _ _ _ SpaceAfter=No
8 . . PUNCT Z _ _ _ _ _
Training instructions
Superuser instructions
Download files
File details
Details for the file classla-2.1.1.tar.gz.
File metadata
- Download URL: classla-2.1.1.tar.gz
- Upload date:
- Size: 199.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.14
File hashes
- SHA256: f6aafcf510cfc2b7a38f1523632b8e4f915a63cc8553da9f7399cb326f5231bb
- MD5: 61b98a9e559a9ab90682050ac9fc2ad7
- BLAKE2b-256: 1ed0809a13045aea6aa2f3d7a94c6b3c8093c05715de032ec672ff70bc183aea
File details
Details for the file classla-2.1.1-py3-none-any.whl.
File metadata
- Download URL: classla-2.1.1-py3-none-any.whl
- Upload date:
- Size: 249.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.14
File hashes
- SHA256: b880460b3c7ba9584e8c6f163c80b915d347c89f9671b9577d23fd80a6b12cc6
- MD5: ea5ecd3768e1b4bb2ab5fb8f114b67d5
- BLAKE2b-256: 267610be6c766785dd3d79f7a2b7966cc40e8cd4a34ca13a46826598bfc43884