ZiTokenizer: tokenize world text as Zi
ZiTokenizer
ZiTokenizer: tokenize words as Zi
word => prefix + root + suffix
supports 316 languages, including a "global" setting (lang="global")
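The decomposition word => prefix + root + suffix can be illustrated with a toy sketch. The vocabularies and the longest-root search below are assumptions made for illustration only; they are not ZiTokenizer's actual vocabulary or algorithm, and `split_word` is a hypothetical helper.

```python
def split_word(word, roots, prefixes, suffixes):
    """Split `word` into (prefix, root, suffix), preferring the longest root.

    Toy illustration: `roots`, `prefixes` and `suffixes` are hypothetical
    sets, not the vocabularies shipped with ZiTokenizer.
    Returns None when no valid split exists.
    """
    best = None
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            root = word[start:end]
            if root not in roots:
                continue
            prefix, suffix = word[:start], word[end:]
            # An empty prefix or suffix is always allowed.
            if prefix and prefix not in prefixes:
                continue
            if suffix and suffix not in suffixes:
                continue
            if best is None or len(root) > len(best[1]):
                best = (prefix, root, suffix)
    return best


# Hypothetical toy vocabularies
ROOTS = {"token", "play"}
PREFIXES = {"re", "un"}
SUFFIXES = {"ize", "ing", "er"}

print(split_word("replaying", ROOTS, PREFIXES, SUFFIXES))
```

A word with no known root (for example "xyz" with the vocabularies above) yields None in this sketch; a real tokenizer would need a fallback for that case.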
use
- pip install ZiTokenizer
- tokenize a corpus and count word frequency per language (https://github.com/laohur/UnicodeTokenizer/blob/master/test/count_lang/count_word.py)
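The word-frequency counting step might look roughly like the sketch below. It is not the linked count_word.py script: the `\w+` regular expression is a stand-in for a real tokenizer, and the "word<TAB>count" layout of the output file is an assumption about what "word_frequency.tsv" contains. The helper names `count_words` and `write_word_frequency` are hypothetical.

```python
import collections
import re


def count_words(lines):
    """Count lowercase word occurrences over an iterable of text lines.

    Assumption: a plain \\w+ regex stands in for a real tokenizer here.
    """
    counter = collections.Counter()
    for line in lines:
        counter.update(re.findall(r"\w+", line.lower()))
    return counter


def write_word_frequency(counter, path):
    """Write one "word<TAB>count" row per word, most frequent first.

    Assumption: this two-column layout is a guess at the TSV format.
    """
    with open(path, "w", encoding="utf-8") as f:
        for word, freq in counter.most_common():
            f.write(f"{word}\t{freq}\n")


counts = count_words(["the cat and the dog", "The end"])
print(counts.most_common(1))
```

Calling `write_word_frequency(counts, "word_frequency.tsv")` would then write the table to disk; check the expected column layout against the linked script before using such a file with `build`.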
from ZiTokenizer.ZiTokenizer import ZiTokenizer
# use
tokenizer = ZiTokenizer(lang="global") # lang='ar', 'en', 'fr', 'ru', 'zh' ...
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)熵😀'\x0000熇"
tokens = tokenizer.tokenize(line)
print(' '.join(tokens)) # ' 〇 ㎡ [ ค ณ-- จ-- ะ --จ ด-- พ ธ แต ง-- งา-- น-- เม อไ-- ร --ค --ะ ] ##s ht pays - g [ ran ] d - blanc - eleve » ( 白 高 大 夏 國 ) ⿰ 火 商 ##g ce ' 00 ⿰ 火 高
# build
# mydir is a directory that must contain "word_frequency.tsv"
tokenizer = ZiTokenizer(mydir)
tokenizer.build(min_ratio=1.5e-6, min_freq=3)
# create the tokenizer from mydir again after building
tokenizer = ZiTokenizer(dir=mydir)
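How `build` interprets `min_ratio` and `min_freq` is not documented here. One plausible reading, shown below purely as an assumption, is that a word is kept only if its count reaches `min_freq` and its share of all counted words reaches `min_ratio`. `filter_vocab` is a hypothetical helper, not part of ZiTokenizer.

```python
def filter_vocab(word_freq, min_ratio=1.5e-6, min_freq=3):
    """Keep words that are frequent enough in absolute and relative terms.

    Assumed semantics (not confirmed by the ZiTokenizer source):
      - count >= min_freq
      - count / total_count >= min_ratio
    """
    total = sum(word_freq.values())
    if total == 0:
        return {}
    return {
        word: freq
        for word, freq in word_freq.items()
        if freq >= min_freq and freq / total >= min_ratio
    }


# Hypothetical counts: "rare" fails min_freq, the others pass both checks.
sample = {"the": 1_000_000, "zi": 3, "rare": 2}
print(sorted(filter_vocab(sample)))
```

Under this reading, the relative threshold matters more as the corpus grows: with ten million counted words, a word seen three times falls below a ratio of 1.5e-6 and would be dropped.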
Download files
Source distribution: ZiTokenizer-0.0.5.tar.gz (23.5 MB)
Built distribution: ZiTokenizer-0.0.5-py3-none-any.whl
Hashes for ZiTokenizer-0.0.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 70348131a80c129fc16195f9ebaacffd4049d55d8be6d0c26085f38e5dffd21a
MD5 | ecc8d20312638ac2aec3ca582eecea99
BLAKE2b-256 | 631792a85625763ea82ecf039fc14dc6ac7102c5e76bdba0eee212f35c2c4850