
Hungarian tokenizer based on quex and huntoken.

Project description

quntoken

New Hungarian tokenizer based on quex and huntoken. This tool is also integrated into the e-magyar language processing system under the name emToken.

Requirements

  • OS: Linux x86-64
  • Python 3.6+

Developer requirements:

  • Python 2.7 (for quex)
  • g++ >= 5

WARNING: It is recommended to use Docker to build the wheel (run make build-docker; the wheel will be created in the release folder). For detailed build instructions, see the Dockerfile.

Install

pip3 install quntoken

Usage

Command line

quntoken reads plain text in UTF-8 from STDIN and writes to STDOUT.

The default (and recommended) output format is TSV. It has two columns: the first contains the token, the second the whitespace sequence that follows the token. Sentence boundaries are marked with empty lines.

Example: tokenizing the file input.txt and writing the TSV output to output.tsv.

quntoken <input.txt >output.tsv

Optional arguments:

  -h, --help            Show this help message and exit
  -f {json,raw,spl,tsv,xml}, --form {json,raw,spl,tsv,xml}
                        Valid formats: json, tsv, xml and spl (sentence per
                        line, ignores mode). Default format: tsv.
  -m {sentence,token}, --mode {sentence,token}
                        Modes: sentence or token (does not apply for
                        form=spl). Default: token
  -c, --conll-text      Add CoNLL text metafield to contain the detokenized
                        sentence (only for mode == token and format == tsv).
                        Default: False
  -i, --input           One or more input files. ('-' for STDIN) Default: STDIN
  -o, --output          One output file. ('-' for STDOUT) Default: STDOUT
  -s, --separate-lines  Separate processing of each line.
                        (Starts new tokenizer for each line.) Default: False
  -w, --word-break      Eliminate word break from end of lines.
  -v, --version         show program's version number and exit
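
For instance, combining the documented options above, sentence-per-line segmentation of a file can be written directly to another file. This is a sketch based only on the documented flags; input.txt and sentences.txt are placeholder names:

quntoken -f spl -i input.txt -o sentences.txt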

Python API

quntoken.tokenize(inp=sys.stdin, form='tsv', mode='token', word_break=False, conll_text=False)

Entry point, returns an iterator object. Parameters:

  • inp: Input iterator, default: sys.stdin.
  • form: Format of output. Valid formats: 'tsv' (default), 'json', 'xml' and 'spl' (sentence per line, ignores mode).
  • mode: 'sentence' (only sentence segmenting) or 'token' (full tokenization - default, does not apply for form=spl).
  • word_break: If True, eliminates word breaks at the end of lines. Default: False.
  • conll_text: If True, adds a CoNLL text metafield containing the detokenized sentence (only for mode='token' and form='tsv'). Default: False.

Example:

from quntoken import tokenize

# Each yielded item already carries its own trailing whitespace/newline,
# hence end='' when printing.
for tok in tokenize(open('input.txt')):
    print(tok, end='')
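
A further sketch, using only the documented parameters (file names are placeholders): writing sentence-per-line output to a file with form='spl'.

from quntoken import tokenize

# Sentence-per-line output (form='spl'); items are written out unchanged,
# mirroring the print(tok, end='') pattern of the example above.
with open('input.txt') as inp, open('sentences.txt', 'w') as out:
    for sent in tokenize(inp, form='spl'):
        out.write(sent)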

Download files

Download the file for your platform. If you're not sure which to choose, see the Python Packaging User Guide on installing packages.

Source Distribution

quntoken-3.3.2.tar.gz (9.8 MB)

Built Distribution

quntoken-3.3.2-py3-none-any.whl (9.8 MB)

File details

Details for the file quntoken-3.3.2.tar.gz.

File metadata

  • Download URL: quntoken-3.3.2.tar.gz
  • Upload date:
  • Size: 9.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for quntoken-3.3.2.tar.gz:

  • SHA256: a4ee061a1aa94223ae1f8d3dbf4da3dd79f8ae5ab009d26ec832341c635c93e4
  • MD5: 08383e5795d8226255b72c99f1f07c76
  • BLAKE2b-256: 369305cc71084fe908c92e5acbe8d716115afc16509cc918442f750d05ed63fb

See the PyPI documentation for more details on using file hashes.

File details

Details for the file quntoken-3.3.2-py3-none-any.whl.

File metadata

  • Download URL: quntoken-3.3.2-py3-none-any.whl
  • Upload date:
  • Size: 9.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for quntoken-3.3.2-py3-none-any.whl:

  • SHA256: ce55dc95549eb3a28d5cc591d018e9f51f3501bbd34abb446f3706a71390cdf7
  • MD5: 7742bf0c89a92e6f28c16aa77b6eda5e
  • BLAKE2b-256: 4d3665848735e398c01dc050b7ff0edc727faa1a570a26226324cce9231e0e3a

See the PyPI documentation for more details on using file hashes.
