Library and CLI for simple morphological annotation of spoken Macedonian
Annotation of Spoken Macedonian
A simple morphological tagger for annotation of Macedonian written and spoken language data
This is a package for simple annotation of spoken Macedonian texts, including dialectal ones.
The package uses the token-dictionary created during the development
of the Macedonian Spoken Corpus
and will annotate the tokens of your text if they are present in that dictionary.
See the list of tags here.
The dictionary is constantly being expanded in new releases of the tagger.
The tagger is designed for smaller projects, since it has several limitations that require manual post-processing by the user. The first of these concerns the handling of homonyms.
Dealing with homonyms
The tagger cannot distinguish homonyms:
- The word 'se' is always marked as Q (the tag for particle). You need to manually correct the cases where it is the third person plural form of the verb "to be" in the present tense.
- The word 'si' is always marked as P (the tag for pronoun). You need to manually correct the cases where it is the second person singular form of the verb "to be" in the present tense.
In the output, the annotation for a homonym can either be the string "HOMONYM" or be left empty.
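Because these cases have to be corrected by hand, it can help to list the ambiguous tokens in a text before reviewing the output. The following is a minimal sketch that does not use the package itself: the function name `find_homonym_candidates`, the regex tokenization, and the inclusion of both Latin and Cyrillic spellings are assumptions made for illustration, so the token indices may not match the tagger's own tokenization.

```python
import re

# Tokens named above as ambiguous. Listing both Latin and Cyrillic
# spellings is an assumption about how the input text may be written.
AMBIGUOUS = {"se", "si", "се", "си"}


def find_homonym_candidates(text):
    """Return (token_index, token) pairs that need a manual check."""
    # Simple word tokenization; the tagger may split tokens differently.
    tokens = re.findall(r"\w+", text)
    return [(i, tok) for i, tok in enumerate(tokens) if tok.lower() in AMBIGUOUS]


if __name__ == "__main__":
    for index, token in find_homonym_candidates("Тие се дома. Ти си тука."):
        print(index, token)
```

Each reported token can then be checked against its annotation and corrected where it is a form of the verb "to be".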
Dealing with unknown words
In the output, the annotation for an unknown word can either be the string "UNKNOWN" or be left empty.
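The dictionary-lookup behaviour described above can be illustrated with a toy example. This is only a sketch under stated assumptions, not the package's implementation: `TOY_DICTIONARY`, `toy_annotate`, the tag "V", the lowercasing step, and the regex tokenization are all hypothetical. The real dictionary and tag set ship with the package.

```python
import re

# Toy dictionary for illustration only. "P" is the pronoun tag mentioned
# above; "V" is a placeholder tag, not taken from the real tag list.
TOY_DICTIONARY = {"ова": "P", "е": "V"}


def toy_annotate(text, dictionary, mark_unknown=False):
    """Pair each token with its dictionary tag, or mark it as unknown."""
    tokens = re.findall(r"\w+", text)
    annotated = []
    for tok in tokens:
        # Lowercasing before lookup is an assumption of this sketch.
        tag = dictionary.get(tok.lower())
        if tag is None:
            # Unknown tokens are marked or left empty, as described above.
            tag = "UNKNOWN" if mark_unknown else ""
        annotated.append((tok, tag))
    return annotated


if __name__ == "__main__":
    print(toy_annotate("Ова е куќа", TOY_DICTIONARY, mark_unknown=True))
```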
Usage
Installation
pip install spoken_macedonian_annotation
In a code editor, you can annotate texts by passing the string to the annotate method of a MacAnnotator object:
from spoken_macedonian_annotation.annotate import MacAnnotator
text = 'Ова е мојата куќа.'
annotator = MacAnnotator(print_to_txt_file=True, mark_homonyms=False, mark_unknown_tokens=False)
result = annotator.annotate(text)
print(result)
On the command line, you can pass a plain-text file to the command-line script annotateMac:
annotateMac -i your_text_to_annotate.txt --print_to_txt
The argument --print_to_txt creates an output file in the working directory and writes the result to it.
You can also use optional arguments for marking homonyms and/or unknown words:
annotateMac -i your_text_to_annotate.txt --print_to_txt --mark_homonyms --mark_unknown
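To annotate several files in one go, the command above can be driven from Python. This is a minimal sketch that assumes annotateMac is installed and on the PATH. It uses only the flags shown in this README, and the helper names `build_annotate_command` and `annotate_files` are hypothetical.

```python
import subprocess


def build_annotate_command(input_path, mark_homonyms=False, mark_unknown=False):
    """Build the annotateMac argument list using the flags shown above."""
    command = ["annotateMac", "-i", str(input_path), "--print_to_txt"]
    if mark_homonyms:
        command.append("--mark_homonyms")
    if mark_unknown:
        command.append("--mark_unknown")
    return command


def annotate_files(paths, **options):
    """Run annotateMac once per file; raises CalledProcessError on failure."""
    for path in paths:
        subprocess.run(build_annotate_command(path, **options), check=True)
```

For example, `annotate_files(["a.txt", "b.txt"], mark_unknown=True)` would run the script once for each file with unknown words marked.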