Sentence splitting and tokenization for South Slavic languages

Project description

reldi-tokeniser

A tokeniser developed inside the ReLDI project. It currently supports five languages -- Slovene, Croatian, Serbian, Macedonian and Bulgarian -- and two modes -- standard and non-standard text.
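The tokeniser is published on PyPI, so the usual pip installation applies:

$ pip install reldi-tokeniser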

Usage

Command line

$ echo 'kaj sad s tim.daj se nasmij ^_^.' | ./tokeniser.py hr -n
1.1.1.1-3	kaj
1.1.2.5-7	sad
1.1.3.9-9	s
1.1.4.11-13	tim
1.1.5.14-14	.

1.2.1.15-17	daj
1.2.2.19-20	se
1.2.3.22-27	nasmij
1.2.4.29-31	^_^
1.2.5.32-32	.
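Each output line pairs a token index with the token itself: the index is paragraph.sentence.token followed by the 1-based character span of the token in the input, so 1.2.1.15-17 above is the first token of the second sentence of the first paragraph, spanning characters 15 to 17. Sentences are separated by an empty line.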


The language is a positional argument, while tokenisation of non-standard text, tagging and lemmatisation of symbols and punctuation, and the different output formats are optional ones.

$ python tokeniser.py -h
usage: tokeniser.py [-h] [-c] [-b] [-d] [-n] [-t] {sl,hr,sr,mk,bg}

Tokeniser for (non-)standard Slovene, Croatian, Serbian, Macedonian and
Bulgarian

positional arguments:
  {sl,hr,sr,mk,bg}   language of the text

optional arguments:
  -h, --help         show this help message and exit
  -c, --conllu       generates CONLLU output
  -b, --bert         generates BERT-compatible output
  -d, --document     passes through ConLL-U-style document boundaries
  -n, --nonstandard  invokes the non-standard mode
  -t, --tag          adds tags and lemmas to punctuations and symbols

Python module

# string mode
import reldi_tokeniser

text = 'kaj sad s tim.daj se nasmij ^_^.'

output = reldi_tokeniser.run(text, 'hr', nonstandard=True, tag=True)

# object mode
from reldi_tokeniser.tokeniser import ReldiTokeniser

reldi = ReldiTokeniser('hr', conllu=True, nonstandard=True, tag=True)
list_of_lines = [el + '\n' for el in text.split('\n')]
test = reldi.run(list_of_lines, mode='object')

The Python module has two mandatory parameters -- text and language. The optional parameters are conllu, bert, document, nonstandard and tag.
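For instance, to tokenise a whole file in string mode (a minimal sketch with hypothetical file names; it assumes, as in the example above, that run returns the formatted output as text that can be written out directly):

# minimal usage sketch: tokenise a file and save the CoNLL-U result;
# assumes reldi_tokeniser.run returns the formatted output as text
import reldi_tokeniser

with open('input.txt', encoding='utf-8') as infile:
    text = infile.read()

output = reldi_tokeniser.run(text, 'hr', conllu=True, nonstandard=True)

with open('output.conllu', 'w', encoding='utf-8') as outfile:
    outfile.write(output)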

CoNLL-U output

The tokeniser can also produce output in the CoNLL-U format (flag -c/--conllu). If the additional -d/--document flag is given, it passes through lines starting with # newdoc id = to preserve document structure.

$ echo '# newdoc id = prvi
kaj sad s tim.daj se nasmij ^_^.
haha
# newdoc id = gidru
štaš' | ./tokeniser.py hr -n -c -d
# newdoc id = prvi
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1	kaj	_	_	_	_	_	_	_	_
2	sad	_	_	_	_	_	_	_	_
3	s	_	_	_	_	_	_	_	_
4	tim	_	_	_	_	_	_	_	SpaceAfter=No
5	.	_	_	_	_	_	_	_	SpaceAfter=No

# sent_id = 1.2
# text = daj se nasmij ^_^.
1	daj	_	_	_	_	_	_	_	_
2	se	_	_	_	_	_	_	_	_
3	nasmij	_	_	_	_	_	_	_	_
4	^_^	_	_	_	_	_	_	_	SpaceAfter=No
5	.	_	_	_	_	_	_	_	_

# newpar id = 2
# sent_id = 2.1
# text = haha
1	haha	_	_	_	_	_	_	_	_

# newdoc id = gidru
# newpar id = 1
# sent_id = 1.1
# text = štaš
1	štaš	_	_	_	_	_	_	_	_
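The SpaceAfter=No annotations make the output reversible. A minimal consumer sketch (not part of the package) that rebuilds the surface text of a sentence from standard CoNLL-U columns:

# minimal sketch (not part of reldi-tokeniser): rebuild the surface text
# of a sentence from CoNLL-U token lines, honouring SpaceAfter=No
def detokenise(conllu_lines):
    text = ''
    for line in conllu_lines:
        if not line or line.startswith('#'):
            continue
        cols = line.split('\t')  # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        text += cols[1]
        if 'SpaceAfter=No' not in cols[9]:
            text += ' '
    return text.rstrip()

# the first sentence from the example above, columns joined with tabs
rows = [['1', 'kaj'], ['2', 'sad'], ['3', 's'],
        ['4', 'tim', 'SpaceAfter=No'], ['5', '.', 'SpaceAfter=No']]
lines = ['\t'.join([r[0], r[1]] + ['_'] * 7 + [r[2] if len(r) > 2 else '_'])
         for r in rows]
print(detokenise(lines))  # kaj sad s tim.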

Pre-tagging

The tokeniser can also pre-annotate text on the part-of-speech (UPOS and XPOS) and lemma level (flag -t/--tag) wherever the tokenisation regexes provide sufficient evidence (punctuation, mentions, hashtags, URLs, e-mails, emoticons, emojis). The default output format for pre-tagging is CoNLL-U.

$ echo -e "kaj sad s tim.daj se nasmij ^_^. haha" | python tokeniser.py hr -n -t
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1	kaj	_	_	_	_	_	_	_	_
2	sad	_	_	_	_	_	_	_	_
3	s	_	_	_	_	_	_	_	_
4	tim	_	_	_	_	_	_	_	SpaceAfter=No
5	.	.	PUNCT	Z	_	_	_	_	SpaceAfter=No

# sent_id = 1.2
# text = daj se nasmij ^_^.
1	daj	_	_	_	_	_	_	_	_
2	se	_	_	_	_	_	_	_	_
3	nasmij	_	_	_	_	_	_	_	_
4	^_^	^_^	SYM	Xe	_	_	_	_	SpaceAfter=No
5	.	.	PUNCT	Z	_	_	_	_	_

# sent_id = 1.3
# text = haha
1	haha	_	_	_	_	_	_	_	_
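Picking the pre-annotated tokens out of such output is then a matter of checking the UPOS column; a minimal sketch (again, not part of the package):

# minimal sketch: yield (form, lemma, upos, xpos) for every token
# that the tokeniser pre-tagged, i.e. whose UPOS column is filled in
def pretagged(conllu):
    for line in conllu.splitlines():
        if line and not line.startswith('#'):
            cols = line.split('\t')
            if cols[3] != '_':  # UPOS column filled in by -t/--tag
                yield cols[1], cols[2], cols[3], cols[4]

# applied to the example above, this yields e.g. ('.', '.', 'PUNCT', 'Z')
# and ('^_^', '^_^', 'SYM', 'Xe')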

Download files


Source Distribution

reldi-tokeniser-1.0.3.tar.gz (17.0 kB)


Built Distribution

reldi_tokeniser-1.0.3-py3-none-any.whl (16.6 kB)


File details

Details for the file reldi-tokeniser-1.0.3.tar.gz.

File metadata

  • Download URL: reldi-tokeniser-1.0.3.tar.gz
  • Size: 17.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for reldi-tokeniser-1.0.3.tar.gz:

  • SHA256: db76ede15e77cc642bb973ebd93a6fcc2de4c2f3e6a5e4a1a6483d3d2d1be062
  • MD5: 7da2ba6130227f06b1e0b95e8f358ceb
  • BLAKE2b-256: 1502f1aebc83789d692ea14723e3875bcdd2d0d25b8147eedf6d2149a717addb
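A downloaded archive can be checked against the published SHA256 digest, e.g. with the coreutils sha256sum tool:

$ sha256sum reldi-tokeniser-1.0.3.tar.gz
db76ede15e77cc642bb973ebd93a6fcc2de4c2f3e6a5e4a1a6483d3d2d1be062  reldi-tokeniser-1.0.3.tar.gz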


File details

Details for the file reldi_tokeniser-1.0.3-py3-none-any.whl.

File hashes

Hashes for reldi_tokeniser-1.0.3-py3-none-any.whl:

  • SHA256: 97be388d5e06519fc0c2f8dce4702c0912f5a279d24c61df37c5fcbfaccef774
  • MD5: d5e7188c8f58d2b42f8da4200a37b61d
  • BLAKE2b-256: a74bee1f1685fe3769cb9319fb83b626aeab86d9512b0d0ecee2d024419a1d4a

