ZiTokenizer: tokenize world text as Zi
ZiTokenizer
ZiTokenizer: tokenize word as Zi
word => prefix + root + suffix
supports 325 languages, plus a "global" vocabulary
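The prefix + root + suffix idea above can be illustrated with a small, self-contained sketch. This is not ZiTokenizer's actual algorithm: the function names `split_word` and `greedy_cover`, the toy vocabularies, and the greedy longest-match strategy are all assumptions made for illustration. The only detail taken from this README is the marker convention visible in the sample output further down, where prefix pieces carry a trailing `--` and suffix pieces a leading `--`.

```python
def greedy_cover(text, pieces):
    """Cover `text` left to right with the longest known pieces.

    Falls back to a single character when no known piece matches.
    """
    out = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # j == i + 1 is the single-character fallback, so the loop always advances
            if text[i:j] in pieces or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out


def split_word(word, roots, prefixes, suffixes):
    """Toy decomposition: word => prefix pieces + one root + suffix pieces."""
    best = None
    # Look for the longest substring that is a known root (hypothetical strategy).
    for size in range(len(word), 0, -1):
        for start in range(len(word) - size + 1):
            if word[start:start + size] in roots:
                best = (start, start + size)
                break
        if best is not None:
            break
    if best is None:
        # No known root: fall back to single characters.
        return list(word)
    start, end = best
    left = greedy_cover(word[:start], prefixes)
    right = greedy_cover(word[end:], suffixes)
    return [p + "--" for p in left] + [word[start:end]] + ["--" + s for s in right]


# Toy vocabularies, invented for this example only.
roots = {"play", "token"}
prefixes = {"re", "un"}
suffixes = {"er", "s", "ize"}

print(split_word("replayers", roots, prefixes, suffixes))
```

A real vocabulary would be learned from word frequencies rather than written by hand, as in the build step below.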
use
- pip install ZiTokenizer
- tokenize text by language and count word frequency (https://github.com/laohur/UnicodeTokenizer/blob/master/test/count_lang/count_word.py)
from ZiTokenizer.ZiTokenizer import ZiTokenizer
# use
tokenizer = ZiTokenizer(lang="global") # lang='ar', 'en', 'fr', 'ru', 'zh' ...
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)熵😀'\x0000熇"
tokens = tokenizer.tokenize(line)
print(' '.join(tokens)) # ' 〇 ㎡ [ ค ณ-- จ-- ะ --จ ด-- พ ธ แต ง-- งา-- น-- เม อไ-- ร --ค --ะ ] ##s ht pays - g [ ran ] d - blanc - eleve » ( 白 高 大 夏 國 ) ⿰ 火 商 ##g ce ' 00 ⿰ 火 高
# build
# mydir is the path to a directory that contains "word_frequency.tsv"
tokenizer = ZiTokenizer(dir=mydir)
tokenizer.build(min_ratio=1.5e-6, min_freq=3)
# load the tokenizer from mydir after building
tokenizer = ZiTokenizer(dir=mydir)
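The build step expects a `word_frequency.tsv` file in `mydir`. The sketch below shows one way such a file might be produced with the standard library only. Several things here are assumptions, not documented behavior: the two-column `word<TAB>count` layout, the whitespace-based word splitting (the linked `count_word.py` script is the authoritative reference), and the helper names `count_words`, `write_word_frequency`, and `filter_by_thresholds`. The last helper only illustrates one plausible reading of `min_ratio` and `min_freq`; it is not taken from ZiTokenizer's source.

```python
import os
from collections import Counter


def count_words(lines):
    """Count whitespace-separated words (a simplification of real tokenization)."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter


def write_word_frequency(counter, directory):
    """Write counts as 'word<TAB>count' lines; this layout is an assumption."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "word_frequency.tsv")
    with open(path, "w", encoding="utf-8") as f:
        for word, freq in counter.most_common():
            f.write(f"{word}\t{freq}\n")
    return path


def filter_by_thresholds(counter, min_ratio, min_freq):
    """Hypothetical reading of the build thresholds: keep a word only if its
    count is at least min_freq and its share of all counts is at least min_ratio."""
    total = sum(counter.values())
    return {
        word: freq
        for word, freq in counter.items()
        if freq >= min_freq and freq / total >= min_ratio
    }


corpus = ["a b a", "b a c"]
counts = count_words(corpus)
print(filter_by_thresholds(counts, min_ratio=0.2, min_freq=2))
```

Calling `write_word_frequency(counts, mydir)` would then place the file where the build example above looks for it, assuming the layout guess is right.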