ZiTokenizer: tokenize world text as Zi
ZiTokenizer tokenizes each word as a Zi:
- it reads a word as a Zi
- word = prefix + root + suffix
- it supports 175 languages plus a "global" vocabulary
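The decomposition word = prefix + root + suffix can be illustrated with a small sketch. This is not ZiTokenizer's actual algorithm: `split_word` and `cover` are hypothetical helpers, the vocabularies are toy sets, and the `--` markers only mirror how the sample output further down appears to mark pieces such as `ณ--` and `--จ` (that reading is an assumption).

```python
def cover(text, pieces):
    """Greedy longest-match segmentation of `text` into items of `pieces`.

    Returns a list of pieces, or None if `text` cannot be fully covered.
    """
    out = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in pieces:
                out.append(text[i:j])
                i = j
                break
        else:
            return None  # no piece starts at position i
    return out


def split_word(word, roots, prefixes, suffixes):
    """Split `word` into prefix pieces, one root, and suffix pieces.

    Hypothetical illustration: pick the longest root found in the word, then
    cover the left remainder with prefixes and the right one with suffixes.
    """
    for length in range(len(word), 0, -1):  # longer roots first
        for start in range(len(word) - length + 1):
            root = word[start:start + length]
            if root not in roots:
                continue
            left = cover(word[:start], prefixes)
            right = cover(word[start + length:], suffixes)
            if left is not None and right is not None:
                return [p + "--" for p in left] + [root] + ["--" + s for s in right]
    return None  # the word cannot be expressed with these vocabularies


# Toy vocabularies, chosen only for this example.
print(split_word("unhappiness", {"happi"}, {"un"}, {"ness"}))
# ['un--', 'happi', '--ness']
```

Trying the longest root first keeps the root as informative as possible and leaves only short affixes on either side; a real vocabulary would be learned from word frequencies rather than written by hand.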
Usage
- pip install ZiTokenizer
- tokenize a corpus by language and count word frequencies (https://github.com/laohur/UnicodeTokenizer/blob/master/test/count_lang/count_word.py)
from ZiTokenizer.ZiTokenizer import ZiTokenizer
# use
tokenizer = ZiTokenizer(lang="global") # lang='ar', 'en', 'fr', 'ru', 'zh' ...
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)熵😀'\x0000熇"
tokens = tokenizer.tokenize(line)
print(' '.join(tokens)) # ' 〇 ㎡ [ ค ณ-- จ-- ะ --จ ด-- พ ธ แต ง-- งา-- น-- เม อไ-- ร --ค --ะ ] ##s ht pays - g [ ran ] d - blanc - eleve » ( 白 高 大 夏 國 ) ⿰ 火 商 ##g ce ' 00 ⿰ 火 高
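# build a vocabulary from your own word frequencies
mydir = "my_vocab_dir"  # placeholder path; mydir must contain "word_frequency.tsv"
tokenizer = ZiTokenizer(mydir)
tokenizer.build(min_ratio=2e-6, min_freq=1)
tokenizer = ZiTokenizer(dir=mydir)  # load the tokenizer again from mydir after building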
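The build step expects a "word_frequency.tsv" file in the target directory. A minimal sketch of producing such a file from raw text follows. The column layout (word, tab, count), the lowercasing, and the `\w+` word pattern are all assumptions made for illustration, and `write_word_frequency` is a hypothetical helper; the count_word.py script linked above is the authoritative way to generate the file.

```python
import collections
import os
import re


def write_word_frequency(lines, directory):
    """Count words in `lines` and write them to <directory>/word_frequency.tsv.

    Assumed format: one "word<TAB>count" row per word, most frequent first.
    """
    counts = collections.Counter()
    for line in lines:
        # Crude word split on runs of word characters; a real pipeline would
        # use a Unicode-aware tokenizer instead.
        counts.update(re.findall(r"\w+", line.lower()))

    path = os.path.join(directory, "word_frequency.tsv")
    with open(path, "w", encoding="utf-8") as f:
        for word, freq in counts.most_common():
            f.write(f"{word}\t{freq}\n")
    return path
```

Calling `write_word_frequency(corpus_lines, mydir)` before `tokenizer.build(...)` would give the build step a file to read, provided the assumed format matches what the library expects.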