Hungarian tokenizer based on quex and huntoken.
quntoken
New Hungarian tokenizer based on quex and huntoken. This tool is also integrated into the e-magyar language processing system under the name emToken.
Requirements
- OS: linux x86-64
- python 3.6+
Developer requirements:
- python 2.7 (for quex)
- g++ = 5
Install
pip3 install quntoken
Usage
Command line
quntoken reads plain text in UTF-8 from STDIN and writes to STDOUT.
The default (and recommended) format of output is TSV. It has two columns. The first contains the token, the second contains the white space sequence after the token. Sentence boundaries are marked with empty lines.
Example: tokenize input.txt and write the TSV output to output.tsv.
quntoken <input.txt >output.tsv
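The TSV layout described above (two tab-separated columns, an empty line between sentences) can be read back with a few lines of Python. This is a minimal sketch, not part of quntoken: `read_sentences` is a hypothetical helper, the second column is kept as an opaque string because its exact encoding is not specified here, and the sample rows are made up for illustration. It also assumes the output has no header line.

```python
def read_sentences(lines):
    """Group (token, whitespace) pairs into sentences.

    Each non-empty line is expected to hold two tab-separated columns;
    an empty line closes the current sentence.
    """
    sentence = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            # Empty line: sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Split on the first tab only; keep the second column untouched.
        token, _, space = line.partition('\t')
        sentence.append((token, space))
    if sentence:
        # Last sentence when the input does not end with an empty line.
        yield sentence


# Made-up sample rows; the whitespace column values are placeholders.
sample = [
    'Szia\t" "\n',
    'világ\t""\n',
    '!\t" "\n',
    '\n',
    'Jó\t" "\n',
    'napot\t""\n',
]
sentences = list(read_sentences(sample))
```

With a real file, the same helper could be fed an open file object, for example `read_sentences(open('output.tsv', encoding='utf-8'))`.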
Optional arguments:
-h, --help show this help message and exit
-f FORM, --form FORM Valid formats: json, tsv, xml and spl (sentence per line). Default format: tsv.
-m MODE, --mode MODE Modes: sentence or token. Default: token
-w, --word-break Eliminate word break from end of lines.
-v, --version show program's version number and exit
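When the command line tool has to be driven from a script, the options listed above can be assembled programmatically. The following is a rough sketch under stated assumptions: `build_command` and `run_quntoken` are hypothetical helpers, not part of quntoken, and the sketch assumes the `quntoken` executable is on the PATH. Only the flags documented above are used.

```python
import subprocess

# Values taken from the option list above.
VALID_FORMS = ('json', 'tsv', 'xml', 'spl')
VALID_MODES = ('sentence', 'token')


def build_command(form='tsv', mode='token', word_break=False):
    """Return the argument list for a quntoken call (hypothetical helper)."""
    if form not in VALID_FORMS:
        raise ValueError('unknown format: {}'.format(form))
    if mode not in VALID_MODES:
        raise ValueError('unknown mode: {}'.format(mode))
    cmd = ['quntoken', '--form', form, '--mode', mode]
    if word_break:
        cmd.append('--word-break')
    return cmd


def run_quntoken(text, **options):
    """Pipe UTF-8 text through the quntoken executable and return its STDOUT.

    Assumes quntoken is installed and reachable on the PATH.
    """
    result = subprocess.run(
        build_command(**options),
        input=text,
        stdout=subprocess.PIPE,
        encoding='utf-8',
        check=True,
    )
    return result.stdout
```

For example, `run_quntoken(text, form='spl', mode='sentence')` would request sentence segmentation with one sentence per line. For use inside Python, the API described below avoids the extra process.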
Python API
quntoken.tokenize(inp=sys.stdin, form='tsv', mode='token', word_break=False)
Entry point, returns an iterator object. Parameters:
- inp: Input iterator, default: sys.stdin.
- form: Format of output. Valid formats: 'tsv' (default), 'json', 'xml' and 'spl' (sentence per line).
- mode: 'sentence' (only sentence segmenting) or 'token' (full tokenization, default).
- word_break: If True, eliminates word breaks from the ends of lines. Default: False.
Example:
from quntoken import tokenize

# Input is expected to be plain text in UTF-8.
with open('input.txt', encoding='utf-8') as inp:
    for tok in tokenize(inp):
        print(tok, end='')