ZiTokenizer: tokenize world text as Zi
ZiTokenizer
ZiTokenizer: tokenize words as Zi
reads each word as Zi
word = prefix + root + suffix
supports 175 languages plus "global" (lang="global")
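To make the `word = prefix + root + suffix` idea concrete, here is a minimal sketch of one way such a split could be computed. It is not ZiTokenizer's actual algorithm: the function names `split_word` and `to_tokens`, the toy vocabularies, and the longest-root-first rule are all assumptions made for illustration. The `--` markers mirror what appears in the sample output under "use", but how the library assigns them is likewise assumed.

```python
def split_word(word, roots, prefixes, suffixes):
    """Split `word` into (prefix, root, suffix) using toy vocabularies.

    Hypothetical helper, not part of ZiTokenizer. Longer roots are tried
    first; the prefix and suffix may be empty. Returns None if no split fits.
    """
    n = len(word)
    for length in range(n, 0, -1):              # longest root first
        for start in range(0, n - length + 1):
            end = start + length
            prefix, root, suffix = word[:start], word[start:end], word[end:]
            if root not in roots:
                continue
            if prefix and prefix not in prefixes:
                continue
            if suffix and suffix not in suffixes:
                continue
            return prefix, root, suffix
    return None


def to_tokens(parts):
    """Render a (prefix, root, suffix) triple as tokens.

    Assumed marker convention: a prefix ends with "--" and a suffix
    starts with "--".
    """
    prefix, root, suffix = parts
    tokens = []
    if prefix:
        tokens.append(prefix + "--")
    tokens.append(root)
    if suffix:
        tokens.append("--" + suffix)
    return tokens


# Toy vocabularies, invented for this example only.
ROOTS = {"token", "play"}
PREFIXES = {"re", "un"}
SUFFIXES = {"ize", "ed", "s"}

parts = split_word("retokenize", ROOTS, PREFIXES, SUFFIXES)
print(parts)             # ('re', 'token', 'ize')
print(to_tokens(parts))  # ['re--', 'token', '--ize']
```

Trying the longest root first keeps most of the word in a single token. A real vocabulary would more likely rank candidates by frequency, which this sketch does not attempt.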
use
- pip install ZiTokenizer
- tokenize by language and count word frequency (https://github.com/laohur/UnicodeTokenizer/blob/master/test/count_lang/count_word.py)
from ZiTokenizer.ZiTokenizer import ZiTokenizer
# use
tokenizer = ZiTokenizer(lang="global") # lang='ar', 'en', 'fr', 'ru', 'zh' ...
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)熵😀'\x0000熇"
tokens = tokenizer.tokenize(line)
print(' '.join(tokens)) # ' 〇 ㎡ [ ค ณ-- จ-- ะ --จ ด-- พ ธ แต ง-- งา-- น-- เม อไ-- ร --ค --ะ ] ##s ht pays - g [ ran ] d - blanc - eleve » ( 白 高 大 夏 國 ) ⿰ 火 商 ##g ce ' 00 ⿰ 火 高
# build
tokenizer = ZiTokenizer(mydir)  # mydir must contain "word_frequency.tsv"
tokenizer.build(min_ratio=2e-6, min_freq=1)
tokenizer = ZiTokenizer(dir=mydir)  # reload the tokenizer from mydir after building
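The build step expects a "word_frequency.tsv" file in `mydir`. The sketch below shows one way such a file might be produced from raw text. The column layout (`word<TAB>count`, one entry per line), the whitespace splitting, and the helper name `write_word_frequency` are assumptions, not something this page documents; the linked count_word.py script is the authoritative reference for the real format.

```python
import collections
import os
import tempfile


def write_word_frequency(lines, out_dir):
    """Count whitespace-separated words and write them to word_frequency.tsv.

    Hypothetical helper. Assumed layout: "word<TAB>count", most frequent first.
    Returns the path of the written file.
    """
    counter = collections.Counter()
    for line in lines:
        counter.update(line.split())
    path = os.path.join(out_dir, "word_frequency.tsv")
    with open(path, "w", encoding="utf-8") as f:
        for word, freq in counter.most_common():
            f.write(f"{word}\t{freq}\n")
    return path


if __name__ == "__main__":
    # Write into a throwaway directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as mydir:
        path = write_word_frequency(["a b a", "b a c"], mydir)
        with open(path, encoding="utf-8") as f:
            print(f.read())
```

If the assumed layout matches, the directory passed as `out_dir` could then be handed to `ZiTokenizer(mydir)` before calling `build`.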