ZiTokenizer: tokenize word as Zi
read word as Zi
word = prefix + root + suffix
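The decomposition above can be illustrated with a small sketch. This is not ZiTokenizer's actual algorithm: the function `split_word` and the three vocabulary sets are hypothetical, and the "longest root first" rule is an assumption chosen only to make the `prefix + root + suffix` idea concrete.

```python
from typing import Optional, Set, Tuple


def split_word(
    word: str, prefixes: Set[str], roots: Set[str], suffixes: Set[str]
) -> Optional[Tuple[str, str, str]]:
    """Split `word` into (prefix, root, suffix), or return None if no split fits.

    Hypothetical helper, not part of ZiTokenizer. An empty prefix or
    suffix is always allowed.
    """
    # Try the longest candidate root first; among equal lengths, the leftmost.
    for length in range(len(word), 0, -1):
        for start in range(0, len(word) - length + 1):
            root = word[start:start + length]
            if root not in roots:
                continue
            head, tail = word[:start], word[start + length:]
            # The leftovers must be empty or known affixes.
            if (not head or head in prefixes) and (not tail or tail in suffixes):
                return head, root, tail
    return None


print(split_word("unhappiness", {"un"}, {"happi"}, {"ness"}))
```

A word that is itself a root comes back with an empty prefix and suffix; a word with no matching root yields `None`, which a real tokenizer would need to handle with some fallback.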
use
- pip install ZiTokenizer
- tokenize your corpus and count word frequencies (see https://github.com/laohur/UnicodeTokenizer/blob/master/test/count_lang/count_word.py)
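As a rough illustration of the counting step, the sketch below produces a `word_frequency.tsv` file from lines of text. It is not the linked `count_word.py` script: the regex word split, the lowercasing, and the `word<TAB>count` column layout are all assumptions, and the helper names are hypothetical. Check the linked script for the format ZiTokenizer actually expects.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Count lowercased word occurrences (simple regex split, an assumption)."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(re.findall(r"\w+", line.lower()))
    return counts


def write_word_frequency(counts: Counter, out_dir: str) -> Path:
    """Write counts as 'word<TAB>count' lines, most frequent first.

    The two-column layout is assumed, not confirmed by the ZiTokenizer docs.
    """
    path = Path(out_dir) / "word_frequency.tsv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for word, freq in counts.most_common():
            f.write(f"{word}\t{freq}\n")
    return path
```

The directory passed as `out_dir` would then be the one handed to `ZiTokenizer(...)` in the build step.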
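from ZiTokenizer.ZiTokenizer import ZiTokenizer
# build
# data_dir is a placeholder path; the directory must include "word_frequency.tsv"
data_dir = "path/to/dir"
tokenizer = ZiTokenizer(data_dir)
tokenizer.build(min_ratio=2e-6, min_freq=1)
# use: load the tokenizer from the directory built above
tokenizer = ZiTokenizer(data_dir)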
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'\x0000𧭏"
tokens = tokenizer.tokenize(line)
print(tokens)