CocCocTokenizer deployed by trituenhantao.io.
Project description
This project provides a tokenizer library for the Vietnamese language, along with two command-line tools for tokenization and simple Vietnamese-specific text operations (e.g. removing diacritics). It is used in the Cốc Cốc Search and Ads systems; the main development goal was high performance while keeping quality reasonable for search-ranking needs.
Installing
$ pip install CocCocTokenizer
Using Python bindings
from CocCocTokenizer import PyTokenizer
# load_nontone_data is True by default
T = PyTokenizer(load_nontone_data=True)
# tokenize_option:
# 0: TOKENIZE_NORMAL (default)
# 1: TOKENIZE_HOST
# 2: TOKENIZE_URL
print(T.word_tokenize("xin chào, tôi là người Việt Nam", tokenize_option=0))
# output: ['xin', 'chào', ',', 'tôi', 'là', 'người', 'Việt_Nam']
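The bindings join the syllables of a multi-syllable token with underscores (Việt_Nam above). If you need the surface form back, a stdlib-only sketch like the following works (the token list is hard-coded from the example output above, so the snippet runs without the library installed):

```python
# Example output of PyTokenizer.word_tokenize, hard-coded here
# so this snippet does not require the library to be installed.
tokens = ['xin', 'chào', ',', 'tôi', 'là', 'người', 'Việt_Nam']

# Undo the underscore joining to recover each token's surface form.
surface_forms = [t.replace('_', ' ') for t in tokens]
print(surface_forms[-1])  # Việt Nam
```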
Using the tools
Both tools print their usage with the --help option. Both accept input either as command-line arguments or from stdin (if both are provided, command-line arguments take precedence). When stdin is used, each line is treated as a separate argument. The output format is TAB-separated tokens of the original phrase (note that Vietnamese tokens can contain whitespace inside). A few usage examples follow.
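Because Vietnamese tokens can contain spaces, output lines must be split on TAB only, never on arbitrary whitespace. A minimal sketch of consuming one output line (the line below is a hypothetical example in the tools' TAB-separated format):

```python
# One hypothetical output line in the tools' TAB-separated format;
# the multi-syllable token "lập trình viên" contains spaces.
line = "từng bước\tđể\ttrở thành\tmột\tlập trình viên\tgiỏi"

# Split on TAB only -- str.split() with no argument would wrongly
# break multi-syllable tokens apart at their internal spaces.
tokens = line.split("\t")
print(tokens)  # ['từng bước', 'để', 'trở thành', 'một', 'lập trình viên', 'giỏi']
```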
Tokenize command line argument:
$ tokenizer "Từng bước để trở thành một lập trình viên giỏi"
từng bước để trở thành một lập trình viên giỏi
Note that the tokenizer may take a second or two to load due to a comparatively large dictionary used to tokenize "sticky phrases" (text written without spacing). You can disable that dictionary with the -n option, and the tokenizer will start almost instantly. By default, sticky phrases are only split inside URLs and domains; -n disables the splitting completely, while -u forces it for the whole text. Compare:
$ tokenizer "toisongohanoi, tôi đăng ký trên thegioididong.vn"
toisongohanoi tôi đăng ký trên the gioi di dong vn
$ tokenizer -n "toisongohanoi, tôi đăng ký trên thegioididong.vn"
toisongohanoi tôi đăng ký trên thegioididong vn
$ tokenizer -u "toisongohanoi, tôi đăng ký trên thegioididong.vn"
toi song o ha noi tôi đăng ký trên the gioi di dong vn
To avoid reloading the dictionaries for every phrase, you can pass phrases via stdin. Here's an example (note that the first output line is empty, meaning an empty result for the "/" input line):
$ echo -ne "/\nanh yêu em\nbún chả ở nhà hàng Quán Ăn Ngon ko ngon\n" | tokenizer
anh yêu em
bún chả ở nhà hàng quán ăn ngon ko ngon
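Since each input line yields exactly one output line (with an empty input like "/" producing an empty line), outputs can be zipped back to their inputs. A stdlib-only sketch, with the tokenizer output hard-coded so no binary is needed (the exact TAB positions in the third line are an assumption for illustration):

```python
inputs = ["/", "anh yêu em", "bún chả ở nhà hàng Quán Ăn Ngon ko ngon"]

# Hypothetical `tokenizer` output for those lines; the segmentation of
# the third line is an assumed example, not verified library output.
output_lines = ["", "anh yêu em", "bún chả\tở\tnhà hàng\tquán ăn\tngon\tko\tngon"]

# Pair each input phrase with its token list; an empty line means no tokens.
results = {
    phrase: (line.split("\t") if line else [])
    for phrase, line in zip(inputs, output_lines)
}
print(results["/"])  # []
```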
Whitespace and punctuation are ignored during normal tokenization but kept during tokenization for transformation, which is used internally by the Cốc Cốc search engine. To keep punctuation during normal tokenization (except inside segmented URLs), use -k. To run tokenization for transformation, use -t; note that this formats the result by replacing spaces in multi-syllable tokens with _ and pre-existing _ characters with ~.
$ tokenizer "toisongohanoi, tôi đăng ký trên thegioididong.vn" -k
toisongohanoi , tôi đăng ký trên the gioi di dong vn
$ tokenizer "toisongohanoi, tôi đăng ký trên thegioididong.vn" -t
toisongohanoi , tôi đăng_ký trên the_gioi di_dong vn
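The -t escaping described above (spaces within multi-syllable tokens become _, pre-existing _ becomes ~) can be undone by mapping in the reverse order. A small stdlib-only sketch of both directions; these helpers are illustrative, not part of the library, and the replacement order is an assumption about how the format avoids ambiguity:

```python
def encode_token(token: str) -> str:
    # Escape pre-existing underscores first, then join syllables with "_".
    return token.replace("_", "~").replace(" ", "_")

def decode_token(token: str) -> str:
    # Reverse the encoding: "_" back to a space, then "~" back to "_".
    return token.replace("_", " ").replace("~", "_")

print(encode_token("đăng ký"))  # đăng_ký
# Replacing in this order makes the mapping round-trip safely:
assert decode_token(encode_token("a_b c")) == "a_b c"
```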
The usage of vn_lang_tool is very similar; you can see the full list of options for both tools with:
$ tokenizer --help
$ vn_lang_tool --help