reldi-tokeniser
Sentence splitting and tokenisation for South Slavic languages
A tokeniser developed within the ReLDI project. It currently supports five languages (Slovene, Croatian, Serbian, Macedonian and Bulgarian) and two modes (standard and non-standard text).
Usage
Command line
$ echo 'kaj sad s tim.daj se nasmij ^_^.' | ./tokeniser.py hr -n
1.1.1.1-3 kaj
1.1.2.5-7 sad
1.1.3.9-9 s
1.1.4.11-13 tim
1.1.5.14-14 .
1.2.1.15-17 daj
1.2.2.19-20 se
1.2.3.22-27 nasmij
1.2.4.29-31 ^_^
1.2.5.32-32 .
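Each line of the default output pairs a paragraph.sentence.token index and the token's character span (1-based, inclusive) with the token itself; for example, 1.2.1.15-17 above is the first token of the second sentence, covering characters 15 to 17 of the input. A minimal parsing sketch (not part of the library), assuming only that whitespace separates the index from the token as shown above:

# Parse one line of the default output into its components.
def parse_token_line(line):
    index, token = line.split(maxsplit=1)
    par, sent, tok, span = index.split('.')
    start, end = (int(i) for i in span.split('-'))
    return int(par), int(sent), int(tok), start, end, token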
The language is a positional argument, while tokenisation of non-standard text, tagging and lemmatisation of symbols and punctuation, and the different output formats are optional ones.
$ python tokeniser.py -h
usage: tokeniser.py [-h] [-c] [-b] [-d] [-n] [-t] {sl,hr,sr,mk,bg}
Tokeniser for (non-)standard Slovene, Croatian, Serbian, Macedonian and
Bulgarian
positional arguments:
{sl,hr,sr,mk,bg} language of the text
optional arguments:
-h, --help show this help message and exit
-c, --conllu generates CONLLU output
-b, --bert generates BERT-compatible output
-d, --document passes through ConLL-U-style document boundaries
-n, --nonstandard invokes the non-standard mode
-t, --tag adds tags and lemmas to punctuations and symbols
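If you need the command-line interface from within a script, a minimal subprocess sketch (assuming tokeniser.py is in the working directory) looks like this:

import subprocess

# Run the tokeniser in non-standard mode for Croatian and capture stdout.
result = subprocess.run(
    ['python', 'tokeniser.py', 'hr', '-n'],
    input='kaj sad s tim.daj se nasmij ^_^.',
    capture_output=True, text=True, check=True,
)
print(result.stdout)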
Python module
# string mode
import reldi_tokeniser
text = 'kaj sad s tim.daj se nasmij ^_^.'
output = reldi_tokeniser.run(text, 'hr', nonstandard=True, tag=True)
# object mode
from reldi_tokeniser.tokeniser import ReldiTokeniser
reldi = ReldiTokeniser('hr', conllu=True, nonstandard=True, tag=True)
list_of_lines = [el + '\n' for el in text.split('\n')]
test = reldi.run(list_of_lines, mode='object')
The Python module takes two mandatory parameters, text and language. The optional parameters are conllu, bert, document, nonstandard and tag.
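The optional parameters can be combined in string mode as well; a minimal sketch, reusing the run call shown above and assuming it returns the formatted output as a string:

import reldi_tokeniser

text = 'kaj sad s tim.daj se nasmij ^_^.'
# Request CoNLL-U output with document boundaries passed through.
output = reldi_tokeniser.run(text, 'hr', conllu=True, document=True, nonstandard=True)
print(output)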
CoNLL-U output
The tokeniser can also output the CoNLL-U format (flag -c/--conllu). If the additional -d/--document flag is given, the tokeniser passes through lines starting with # newdoc id = to preserve document structure.
$ echo '# newdoc id = prvi
kaj sad s tim.daj se nasmij ^_^.
haha
# newdoc id = gidru
štaš' | ./tokeniser.py hr -n -c -d
# newdoc id = prvi
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1 kaj _ _ _ _ _ _ _ _
2 sad _ _ _ _ _ _ _ _
3 s _ _ _ _ _ _ _ _
4 tim _ _ _ _ _ _ _ SpaceAfter=No
5 . _ _ _ _ _ _ _ SpaceAfter=No
# sent_id = 1.2
# text = daj se nasmij ^_^.
1 daj _ _ _ _ _ _ _ _
2 se _ _ _ _ _ _ _ _
3 nasmij _ _ _ _ _ _ _ _
4 ^_^ _ _ _ _ _ _ _ SpaceAfter=No
5 . _ _ _ _ _ _ _ _
# newpar id = 2
# sent_id = 2.1
# text = haha
1 haha _ _ _ _ _ _ _ _
# newdoc id = gidru
# newpar id = 1
# sent_id = 1.1
# text = štaš
1 štaš _ _ _ _ _ _ _ _
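Since the -d flag only passes the # newdoc id = comment lines through, grouping the output back into documents is left to the consumer; a minimal sketch (not part of the library):

# Group CoNLL-U lines back into documents keyed by their newdoc id.
def split_documents(conllu):
    docs, current = {}, None
    for line in conllu.splitlines():
        if line.startswith('# newdoc id = '):
            current = line[len('# newdoc id = '):]
            docs[current] = []
        elif current is not None:
            docs[current].append(line)
    return docs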
Pre-tagging
The tokeniser can also pre-annotate text on the part-of-speech (UPOS and XPOS) and lemma level (flag -t or --tag), where the available tokenisation regexes provide sufficient evidence (punctuation, mentions, hashtags, URLs, e-mails, emoticons, emojis). The default output format for pre-tagging is CoNLL-U.
$ echo -e "kaj sad s tim.daj se nasmij ^_^. haha" | python tokeniser.py hr -n -t
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1 kaj _ _ _ _ _ _ _ _
2 sad _ _ _ _ _ _ _ _
3 s _ _ _ _ _ _ _ _
4 tim _ _ _ _ _ _ _ SpaceAfter=No
5 . . PUNCT Z _ _ _ _ SpaceAfter=No
# sent_id = 1.2
# text = daj se nasmij ^_^.
1 daj _ _ _ _ _ _ _ _
2 se _ _ _ _ _ _ _ _
3 nasmij _ _ _ _ _ _ _ _
4 ^_^ ^_^ SYM Xe _ _ _ _ SpaceAfter=No
5 . . PUNCT Z _ _ _ _ _
# sent_id = 1.3
# text = haha
1 haha _ _ _ _ _ _ _ _
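In the output above, the pre-tagged tokens are exactly the rows whose lemma, UPOS and XPOS columns are filled in; a minimal sketch (illustrative only) for collecting them from the CoNLL-U output:

# Collect (form, lemma, UPOS, XPOS) for rows the tokeniser pre-tagged,
# i.e. token rows whose UPOS column is not the placeholder '_'.
def pretagged_tokens(conllu):
    tagged = []
    for line in conllu.splitlines():
        if not line or line.startswith('#'):
            continue
        cols = line.split('\t')
        if len(cols) == 10 and cols[3] != '_':
            tagged.append((cols[1], cols[2], cols[3], cols[4]))
    return tagged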