quntoken

New Hungarian tokenizer based on quex and huntoken. This tool is also integrated into the e-magyar language processing system under the name emToken.

Requirements

  • OS: Linux x86-64
  • Python 3.6+

Developer requirements:

  • Python 2.7 (for quex)
  • g++ = 5

Install

pip3 install quntoken

Usage

Command line

quntoken reads plain text in UTF-8 from STDIN and writes to STDOUT.

The default (and recommended) output format is TSV with two columns: the first contains the token, the second contains the whitespace sequence that follows the token. Sentence boundaries are marked with empty lines.

Example: tokenize input.txt and write the TSV output to output.tsv.

quntoken <input.txt >output.tsv
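The TSV layout described above (token, tab, following whitespace, with an empty line between sentences) can be read back with a few lines of Python. This is a minimal sketch that relies only on that description; the helper name read_tsv_sentences is hypothetical, and the sample is hand-written, using the placeholder strings SP and NL for the second column rather than quntoken's actual whitespace encoding.

```python
import io


def read_tsv_sentences(lines):
    """Group (token, whitespace_after) pairs into sentences.

    An empty line closes the current sentence, as described for the TSV format.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip('\n')
        if not line:
            # Empty line: sentence boundary.
            if current:
                sentences.append(current)
                current = []
            continue
        # First column is the token, second the whitespace after it.
        token, _, ws_after = line.partition('\t')
        current.append((token, ws_after))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample, NOT real quntoken output. SP and NL are placeholders
# for whatever the tool writes in the whitespace column.
SAMPLE = 'Szia\tSP\nvilág\t\n!\tNL\n\nViszlát\t\n.\tNL\n'

sentences = read_tsv_sentences(io.StringIO(SAMPLE))
for sentence in sentences:
    print([token for token, _ in sentence])
```

With real data, the same function could be fed an open file, for example open('output.tsv', encoding='utf-8').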

Optional arguments:

  -h, --help            show this help message and exit
  -f FORM, --form FORM  Valid formats: json, tsv, xml and spl (sentence per
                        line). Default format: tsv.
  -m MODE, --mode MODE  Modes: sentence or token. Default: token
  -w, --word-break      Eliminate word break from end of lines.
  -v, --version         show program's version number and exit
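The options above can also be combined when calling the command-line tool from a script. The following is a sketch, not part of quntoken: build_cmd and run_quntoken are hypothetical helper names, it assumes the quntoken executable is on PATH, and it assumes the output is UTF-8 like the input. Only the -f, -m and -w flags listed in the help text are used.

```python
import subprocess

# Values taken from the help text above.
VALID_FORMS = ('tsv', 'json', 'xml', 'spl')
VALID_MODES = ('token', 'sentence')


def build_cmd(form='tsv', mode='token', word_break=False):
    """Build the argument list for the quntoken CLI (hypothetical helper)."""
    if form not in VALID_FORMS:
        raise ValueError('unknown format: %r' % (form,))
    if mode not in VALID_MODES:
        raise ValueError('unknown mode: %r' % (mode,))
    cmd = ['quntoken', '-f', form, '-m', mode]
    if word_break:
        cmd.append('-w')
    return cmd


def run_quntoken(text, **options):
    """Send text to quntoken on STDIN and return what it writes to STDOUT.

    Assumes quntoken is installed and on PATH, and that its output is UTF-8.
    """
    result = subprocess.run(
        build_cmd(**options),
        input=text.encode('utf-8'),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode('utf-8')


# Only builds and prints the command; nothing is executed here.
print(build_cmd(form='spl', mode='sentence'))
```

Calling run_quntoken('...', form='json') would then run the tool itself; check=True makes a non-zero exit status raise subprocess.CalledProcessError.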

Python API

quntoken.tokenize(inp=sys.stdin, form='tsv', mode='token', word_break=False)

Entry point, returns an iterator object. Parameters:

  • inp: Input iterator, default: sys.stdin.
  • form: Format of output. Valid formats: 'tsv' (default), 'json', 'xml' and 'spl' (sentence per line).
  • mode: 'sentence' (sentence segmentation only) or 'token' (full tokenization, default).
  • word_break: If True, eliminates word breaks at the ends of lines. Default: False.

Example:

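from quntoken import tokenize

# Open the input explicitly as UTF-8 (the encoding quntoken reads) and close it when done.
with open('input.txt', encoding='utf-8') as inp:
    for tok in tokenize(inp):
        print(tok, end='')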
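The other parameters of tokenize can be combined in the same way. The sketch below asks for sentence segmentation only, with one sentence per line, on text held in memory. It rests on two assumptions not stated above: that a file-like object such as io.StringIO is accepted for inp just like an open file, and that the iterator yields strings. The helper split_sentences and its optional tokenize argument are hypothetical; the argument only exists so the function can be tried out without quntoken installed.

```python
import io


def split_sentences(text, tokenize=None):
    """Return text segmented into sentences, one per line (form='spl').

    If no tokenizer is passed in, quntoken.tokenize is imported lazily.
    """
    if tokenize is None:
        from quntoken import tokenize  # requires quntoken to be installed
    # Assumption: a StringIO is accepted for inp like an open file.
    pieces = tokenize(io.StringIO(text), form='spl', mode='sentence')
    # Assumption: each yielded item is a str.
    return ''.join(pieces)
```

With quntoken installed, print(split_sentences(some_text), end='') would write the segmented text to STDOUT.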


