ZiTokenizer: tokenize world text as Zi
ZiTokenizer
ZiTokenizer: tokenize word as Zi
Reads each word as a Zi:
word = prefix + root + suffix
Supports 175 languages, plus a "global" option.
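To make the "word = prefix + root + suffix" idea concrete, here is a minimal sketch of one way such a split could work. It is not ZiTokenizer's actual algorithm: the function `split_word`, the toy vocabulary, and the longest-root-first search are all assumptions made for illustration. The `--` markers (trailing on prefixes, leading on suffixes) are inferred from the sample output in the usage example.

```python
def split_word(word, roots, prefixes, suffixes):
    """Split word into [prefix--, root, --suffix] pieces, or return None.

    Hypothetical helper: tries the longest root first and accepts a split
    only if the leftover head and tail are a known prefix and suffix.
    """
    for length in range(len(word), 0, -1):
        for start in range(len(word) - length + 1):
            root = word[start:start + length]
            head, tail = word[:start], word[start + length:]
            if root not in roots:
                continue
            if head and head not in prefixes:
                continue
            if tail and tail not in suffixes:
                continue
            pieces = []
            if head:
                pieces.append(head + "--")   # prefix marker (assumed)
            pieces.append(root)
            if tail:
                pieces.append("--" + tail)   # suffix marker (assumed)
            return pieces
    return None


# Toy vocabulary, invented for this example only.
roots = {"token", "build"}
prefixes = {"re", "un"}
suffixes = {"ize", "er", "s"}

print(split_word("tokenize", roots, prefixes, suffixes))
print(split_word("rebuilds", roots, prefixes, suffixes))
```

A word with no matching root returns `None` here; a real tokenizer would need a fallback for such words, for example splitting into characters.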
use
- pip install ZiTokenizer
- tokenize the corpus, then count language frequency and word frequency (https://github.com/laohur/UnicodeTokenizer/blob/master/test/count_lang/count_word.py)
from ZiTokenizer.ZiTokenizer import ZiTokenizer
# use: load a bundled vocabulary and tokenize a line
tokenizer = ZiTokenizer(lang="global")  # or a language code: lang='ar', 'en', 'fr', 'ru', 'zh', ...
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)熵😀'\x0000熇"
tokens = tokenizer.tokenize(line)
print(' '.join(tokens)) # ' 〇 ㎡ [ ค ณ-- จ-- ะ --จ ด-- พ ธ แต ง-- งา-- น-- เม อไ-- ร --ค --ะ ] ##s ht pays - g [ ran ] d - blanc - eleve » ( 白 高 大 夏 國 ) ⿰ 火 商 ##g ce ' 00 ⿰ 火 高
# build: create a vocabulary from your own word counts
tokenizer = ZiTokenizer(dir=mydir)  # mydir must contain "word_frequency.tsv"
tokenizer.build(min_ratio=2e-6, min_freq=1)
# reload from mydir after building
tokenizer = ZiTokenizer(dir=mydir)
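The README does not say how `min_ratio` and `min_freq` are applied during `build`. The sketch below shows one plausible reading, purely as an assumption: a word is kept only if its count is at least `min_freq` and at least `min_ratio` times the total count. The helper names `read_word_frequency` and `filter_vocab`, and the `word<TAB>count` layout of "word_frequency.tsv", are hypothetical and not taken from the library.

```python
def read_word_frequency(lines):
    """Parse 'word<TAB>count' lines into (word, count) pairs (assumed layout)."""
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[1].isdigit():
            pairs.append((parts[0], int(parts[1])))
    return pairs


def filter_vocab(pairs, min_ratio=2e-6, min_freq=1):
    """Keep words whose count meets both thresholds (assumed semantics)."""
    total = sum(count for _, count in pairs)
    threshold = max(min_freq, total * min_ratio)
    return [(word, count) for word, count in pairs if count >= threshold]


# Tiny in-memory stand-in for a word_frequency.tsv file.
sample = "the\t1000000\nzi\t3\nrare\t1\n"
pairs = read_word_frequency(sample.splitlines())
kept = filter_vocab(pairs)
print(kept)
```

Under this reading, the ratio threshold scales with corpus size, so a very large corpus drops hapax words even when `min_freq=1`.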